Practical Issues
Machine Learning
• Model/Feature Selection
• Cross-Validation
Address Overfitting
Options:
1) Reduce number of features
Manually select which features to keep
Model selection algorithm
2) Regularization
Keep all features, but reduce the magnitudes of the parameters 𝜃_j
Works well when each of the features contributes somewhat to predicting y
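A minimal sketch of option 2 above, using closed-form ridge regression (one standard form of regularization; the penalty weight `lam` and the synthetic data are illustrative assumptions, not from the slides). A larger penalty shrinks the parameter values toward zero while keeping every feature in the model:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: theta = (X^T X + lam*I)^{-1} X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

theta_small = ridge_fit(X, y, lam=0.01)
theta_large = ridge_fit(X, y, lam=100.0)
# The stronger penalty shrinks the parameter vector toward zero.
print(np.linalg.norm(theta_large) < np.linalg.norm(theta_small))  # True
```

All five features still receive a (small) weight under the strong penalty; regularization reduces variance without discarding inputs.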
Model Selection
Suppose we are given a training set 𝑆. Then the algorithm could be:
1. Train each model 𝑀_i on 𝑆, to get some hypothesis ℎ_i
2. Pick the hypothesis with the smallest training error
This does not work. For example, in polynomial regression, the higher the order of
the polynomial, the better it fits the training set 𝑆, and thus the lower the training
error. This method therefore always chooses a high-order, high-variance model.
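The failure mode above can be demonstrated directly (a sketch on synthetic data; the noisy sine target is an illustrative assumption). Because the space of degree-d polynomials contains every lower degree, the training error can only go down as the degree goes up:

```python
import numpy as np

# Higher-degree polynomials always achieve lower-or-equal training error,
# so "pick the smallest training error" favors the most complex model.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

def train_mse(degree):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

errors = [train_mse(d) for d in (1, 2, 3, 5)]
# Training error is non-increasing in the polynomial degree.
print(all(e2 <= e1 + 1e-9 for e1, e2 in zip(errors, errors[1:])))  # True
```

The degree-5 fit wins on training error regardless of whether it generalizes, which is exactly why a held-out evaluation is needed.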
Cross-Validation: Hold-out cross validation
Better algorithm:
1. Randomly split 𝑆 into 𝑆_train (say, 70% of the data) and 𝑆_cv (the remaining 30%). Here,
𝑆_cv is called the hold-out cross validation set.
2. Train each model 𝑀_i on 𝑆_train only, to get some hypothesis ℎ_i
3. Select and output the hypothesis ℎ_i that has the smallest error 𝜀̂_cv(ℎ_i) on the
hold-out cross validation set.
By testing on 𝑆_cv, which the model did not see in the training phase, we obtain
a better estimate of each hypothesis ℎ_i's true generalization error.
Usually, 1/4 to 1/3 of the data is used for the hold-out cross validation set;
30% is a typical choice.
Disadvantage: "wastes" about 30% of the data.
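The 70/30 split in step 1 can be sketched as follows (the helper name `holdout_split` and the fixed seed are illustrative choices):

```python
import random

def holdout_split(S, train_frac=0.7, seed=0):
    """Randomly split S into (S_train, S_cv); S_cv is the hold-out set."""
    data = list(S)
    random.Random(seed).shuffle(data)   # random split, reproducible via seed
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

S = list(range(100))
S_train, S_cv = holdout_split(S)
print(len(S_train), len(S_cv))  # 70 30
```

Each candidate model is then trained on `S_train` only and compared by its error on `S_cv`.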
Cross-Validation: k-fold cross validation
1. Randomly split 𝑆 into k disjoint subsets 𝑆_1, …, 𝑆_k of roughly equal size.
2. For each model 𝑀_i and each j = 1, …, k: train 𝑀_i on all the data except 𝑆_j to get a
hypothesis ℎ_ij, and compute its error 𝜀̂_Sj(ℎ_ij) on the held-out fold 𝑆_j.
The estimated generalization error of model 𝑀_i is then calculated as the average of the
𝜀̂_Sj(ℎ_ij)'s (averaged over 𝑗).
3. Pick the model 𝑀_i with the lowest estimated generalization error, and retrain that
model on the entire training set 𝑆. The resulting hypothesis is then output as our
final answer.
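The k-fold procedure above can be sketched generically as follows. Here `train(data)` and `error(h, data)` are hypothetical callables standing in for a concrete learning algorithm and loss; the toy "predict the training mean" model in the usage example is likewise an illustrative assumption:

```python
import random

def kfold_error(S, train, error, k=10, seed=0):
    """Estimated generalization error: average held-out error over k folds."""
    data = list(S)
    random.Random(seed).shuffle(data)
    folds = [data[j::k] for j in range(k)]          # k disjoint subsets
    errs = []
    for j in range(k):
        held_out = folds[j]
        rest = [x for i, fold in enumerate(folds) if i != j for x in fold]
        h = train(rest)                              # train on all but fold j
        errs.append(error(h, held_out))              # evaluate on fold j
    return sum(errs) / k                             # average over j

# Toy usage: a "model" that predicts the training mean, scored by MSE.
train = lambda data: sum(data) / len(data)
error = lambda h, data: sum((x - h) ** 2 for x in data) / len(data)
estimate = kfold_error(range(10), train, error, k=5)
print(estimate > 0)  # True
```

Setting k to the number of examples gives leave-one-out cross validation, which is used in the questions later in this section.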
There is no single machine learning technique that works well under all
circumstances: each technique has its own advantages and shortcomings, and no
approach will outperform all others in every setting.
Feature Selection
1. Filter
Score each feature by some criterion, sort the scores, and choose a subset of
the top-scoring features.
Disadvantages:
Ignores the relations between the attributes
Cannot identify redundant attributes – attributes that do not bring
additional information beyond that provided by other attributes.
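A minimal sketch of the filter approach, using absolute correlation with y as the (assumed) scoring criterion; the synthetic data and the helper name `filter_select` are illustrative. Note how it embodies the caveat above: each feature is scored in isolation, so relations between attributes are ignored:

```python
import numpy as np

def filter_select(X, y, k):
    """Score each feature by |corr(X_j, y)|, sort, keep the top-k indices."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    ranked = np.argsort(scores)[::-1]        # feature indices, best first
    return sorted(ranked[:k].tolist())

# Synthetic data: features 0 and 2 drive y; features 1 and 3 are noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)
print(filter_select(X, y, k=2))  # [0, 2]
```

A redundant copy of feature 0 would score just as highly as the original, which is exactly the redundancy failure the slide describes.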
Feature Selection:
2. Wrapper
More powerful but more computationally expensive
Forward-search: start with the empty feature set; on each iteration, add the
single feature whose inclusion most improves performance (e.g., cross-validation
error), and stop when no addition helps.
Question#1
Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:
Assume we will learn a decision tree on this data. Assume that when the decision tree splits on the real-valued
attribute X, it puts the split threshold halfway between the attributes that surround the split. For example, using the
Gini measurement as the splitting criterion, the decision tree would initially choose to split at x=5, which is halfway
between x=4 and x=6.
Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e. only one split)
Let Algorithm DT* be the method of learning a decision tree fully with no pruning
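The DT2 learner described above (a single split, threshold placed halfway between neighboring x values, chosen by Gini impurity) can be sketched as follows. The slide's own data table is not reproduced in the text, so the data here is hypothetical:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(points):
    """Try each midpoint between consecutive x values; return the threshold
    with the lowest weighted Gini impurity (the DT2 single split)."""
    xs = sorted(x for x, _ in points)
    best = None
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2                      # threshold halfway between values
        left = [y for x, y in points if x < t]
        right = [y for x, y in points if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or score < best[0]:
            best = (score, t)
    return best[1]

# Hypothetical (x, y) data, not the slide's table: labels flip between 4 and 6.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
print(best_split(data))  # 5.0
```

On this toy data the split lands at x=5, halfway between x=4 and x=6, mirroring the example in the question.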
Question#4a
Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:
What will be the training set error of DT2 on our data? You can express it as the number of misclassifications out of 10.
1/10, because the DT will split at x=5 and will make one mistake on the right branch
Question#4b
Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:
What will be the leave-one-out cross-validation error of DT2 on our data? You can express it as the number of misclassifications out of 10.
1/10, because DT2 will split at approximately x=5 on each fold, and the left-out point will be consistent with the
predictions in all folds except for the "leave out x=8.5" fold
Question#4c
Suppose we are learning a classifier with binary output values Y=0 and Y=1.
There is one real-valued input X. The data is given below:
What will be the leave-one-out cross-validation error of DT* on our data? You can express it as the number of misclassifications out of 10.
3/10; the leave-one-out points that will be wrongly predicted are x=8, x=8.5, and x=9