Module 2
Techniques
Dr. Jyotismita Chaki
Binary Classifier
• The algorithm which implements the classification on a dataset is
known as a classifier.
• If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
• Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or
DOG, etc.
Training a Binary Classifier
• Consider the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students
and employees of the US Census Bureau.
• Each image is labeled with the digit it represents.
• We only try to identify one digit—for example, the number 5.
• This “5-detector” will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and
not-5.
• Let’s create the target vectors for this classification task:
y_train_5 = (y_train == 5) # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)
• Now let’s pick a classifier and train it.
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
• Now you can use it to detect images of the number 5:
sgd_clf.predict([some_digit])
array([ True])
• The classifier guesses that this image represents a 5 (True).
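• The snippets above assume the MNIST data has already been loaded and split into X_train, y_train, X_test, y_test, and that some_digit holds one image; a minimal sketch of that setup (the fetch_openml call and variable names are assumptions, not part of the slide):
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
import numpy as np
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(np.uint8)   # labels come back as strings
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
some_digit = X_train[0]                                  # one 784-pixel image to test with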
Performance measures
• Evaluating the performance of a Machine learning model is one of the
important steps while building an effective ML model.
• To evaluate the performance or quality of the model, different metrics are
used, and these metrics are known as performance metrics or evaluation
metrics.
• These performance metrics help us understand how well our model has
performed for the given data.
• In this way, we can improve the model's performance by tuning the hyper-
parameters.
• Each ML model aims to generalize well on unseen/new data, and
performance metrics help determine how well the model generalizes on
the new dataset.
Performance measures: Cross validation
• To tackle the problem of overfitting we can use Cross Validation.
• A key challenge with overfitting, and with machine learning in general, is
that we can’t know how well our model will perform on new data until we
actually test it.
• To address this, we can split our initial dataset into
separate training and test subsets.
• There are different types of cross-validation techniques, but the overall concept remains the same:
• To partition the data into a number of subsets
• Hold out a set at a time and train the model on remaining set
• Test model on hold out set
• Repeat the process for each subset of the dataset
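• As an illustration of this procedure, a minimal sketch using Scikit-Learn's KFold (the classifier and variable names reuse the earlier 5-detector setup and are assumptions):
import numpy as np
from sklearn.model_selection import KFold
from sklearn.base import clone
kf = KFold(n_splits=3, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X_train):
    model = clone(sgd_clf)                                               # fresh copy of the classifier
    model.fit(X_train[train_idx], y_train_5[train_idx])                  # train on the remaining folds
    scores.append(model.score(X_train[test_idx], y_train_5[test_idx]))   # accuracy on the held-out fold
print(np.mean(scores))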
Performance measures: Cross validation
• Let’s use the cross_val_score() function to evaluate your SGDClassifier
model using K-fold cross-validation, with three folds.
• Remember that K-fold cross-validation means splitting the training set into K folds (in this case, three), then making predictions and evaluating them on each fold using a model trained on the remaining folds.
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.96355, 0.93795, 0.95615])
Performance measures: Confusion matrix
• A confusion matrix is a matrix that summarizes the performance of a
machine learning model on a set of test data.
• It is often used to measure the performance of classification models,
which aim to predict a categorical label for each input instance.
• The matrix displays the number of true positives (TP), true negatives
(TN), false positives (FP), and false negatives (FN) produced by the
model on the test data.
• For binary classification, the matrix is a 2×2 table.
• For multi-class classification, the matrix shape equals the number of classes, i.e., for n classes it will be n×n.
Performance measures: Confusion matrix
• A 2×2 confusion matrix is shown below for an image-recognition task with two classes, Dog and Not Dog.
• True Positive (TP): the number of instances where both the predicted and the actual value are Dog.
• True Negative (TN): the number of instances where both the predicted and the actual value are Not Dog.
• False Positive (FP): the number of instances where the prediction is Dog while the actual value is Not Dog.
• False Negative (FN): the number of instances where the prediction is Not Dog while the actual value is Dog.
Performance measures: Confusion matrix
From the confusion matrix, we can find the following metrics:
1. Accuracy: Accuracy measures the overall performance of the model. It is the ratio of correctly classified instances to the total number of instances: Accuracy = (TP + TN) / (TP + TN + FP + FN).
For example, a perfect 5-detector would have nonzero counts only on the main diagonal of its confusion matrix:
array([[54579, 0],
       [    0, 5421]])
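• In Scikit-Learn, this matrix can be computed with the confusion_matrix() function; a hedged sketch for the 5-detector (obtaining out-of-fold predictions with cross_val_predict is an assumption about the setup, not shown on the slide):
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
# Predictions made on data the model did not see during training
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
# Rows = actual class (not-5, 5); columns = predicted class
confusion_matrix(y_train_5, y_train_pred)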
Performance measures: Precision and Recall
• Scikit-Learn provides several functions to compute classifier metrics,
including precision and recall:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1522)
0.7290850836596654
recall_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1325)
0.7555801512636044
• When the classifier claims an image represents a 5, it is correct only 72.9% of the time.
• Moreover, it only detects 75.6% of the 5s.
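• These values match the definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN); a quick check (assuming Scikit-Learn's [[TN, FP], [FN, TP]] layout for the binary confusion matrix):
tn, fp, fn, tp = confusion_matrix(y_train_5, y_train_pred).ravel()
precision = tp / (tp + fp)   # fraction of predicted 5s that really are 5s
recall = tp / (tp + fn)      # fraction of actual 5s that were detected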
Performance measures: Precision and Recall
• It is often convenient to combine precision and recall into a single metric called the F1 score, in
particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean
of precision and recall.
• Whereas the regular mean treats all values equally, the harmonic mean gives much more weight
to low values. As a result, the classifier will only get a high F1 score if both recall and precision are
high.
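• In Scikit-Learn the F1 score can be computed directly; a brief sketch (reusing y_train_pred from the earlier slides):
from sklearn.metrics import f1_score
# F1 = 2 * precision * recall / (precision + recall), the harmonic mean
f1_score(y_train_5, y_train_pred)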
Performance measures: TPR, FPR and ROC AUC
• As an example, suppose a test set has P = 100 positive and N = 900 negative instances, and a classifier produces TP = 94, FN = 6, TN = 850 and FP = 50. Then:
• TPR (true positive rate) = TP/P = 94/100 = 94%
• TNR (true negative rate) = TN/N = 850/900 = 94.4%
• FPR (false positive rate) = FP/N = 50/900 = 5.6%
• FNR (false negative rate) = FN/P = 6/100 = 6%
• The ROC curve plots the TPR against the FPR, and the ROC AUC score summarizes it in a single number: a perfect classifier has an AUC of 1, while a purely random one has an AUC of 0.5. For example:
roc_auc_score(y_train_5, y_scores_forest)
0.9983436731328145
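• A hedged sketch of one way the y_scores_forest values above could be obtained (the random-forest setup here is an assumption, not shown on the slide):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score
forest_clf = RandomForestClassifier(random_state=42)
# Out-of-fold class probabilities rather than hard predictions
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]   # probability of the positive class (5)
roc_auc_score(y_train_5, y_scores_forest)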
How to train binary classifiers: Summary
• Choose the appropriate metric for your task,
• Evaluate your classifiers using cross-validation,
• Select the precision/ recall tradeoff that fits your needs, and
• Compare various models using ROC curves and ROC AUC scores.
Multiclass Classification
• Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called
multinomial classifiers) can distinguish between more than two classes.
• Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of
handling multiple classes directly.
• Others (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary
classifiers.
• However, there are various strategies that you can use to perform multiclass classification using
multiple binary classifiers.
• For example, one way to create a system that can classify the digit images into 10 classes (from 0
to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector,
and so on).
• Then when you want to classify an image, you get the decision score from each classifier for that
image and you select the class whose classifier outputs the highest score.
• This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest).
Multiclass Classification
• Another strategy is to train a binary classifier for every pair of digits: one to
distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and
so on.
• This is called the one-versus-one (OvO) strategy. If there are N classes, you need
to train N × (N – 1) / 2 classifiers.
• For the MNIST problem, this means training 45 binary classifiers! When you want
to classify an image, you have to run the image through all 45 classifiers and see
which class wins the most duels.
• The main advantage of OvO is that each classifier only needs to be trained on the
part of the training set for the two classes that it must distinguish.
Multiclass Classification
• Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size
of the training set, so for these algorithms OvO is preferred since it is faster to train many
classifiers on small training sets than training few classifiers on large training sets.
• For most binary classification algorithms, however, OvA is preferred.
• If you train the SGDClassifier on the original multiclass targets (y_train rather than y_train_5), Scikit-Learn automatically runs OvA under the hood. Calling decision_function() then returns 10 scores per instance instead of just one, one per class:
some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
array([[-15955.22627845, -38080.96296175, -13326.66694897,    573.52692379,
        -17680.6846644 ,   2412.53175101, -25526.86498156, -12290.15704709,
         -7946.05205023, -10631.35888549]])
• The highest score is indeed the one corresponding to class 5:
np.argmax(some_digit_scores)
5
Multiclass Classification
• If you want to force Scikit-Learn to use one-versus-one or one-versus-all, you can use the OneVsOneClassifier or OneVsRestClassifier classes.
• Simply create an instance and pass a binary classifier to its constructor.
• For example, this code creates a multiclass classifier using the OvO strategy, based on an SGDClassifier:
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])
array([5], dtype=uint8)
len(ovo_clf.estimators_)
45
Multiclass Classification
• Training a RandomForestClassifier is just as easy:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
array([5], dtype=uint8)
• This time Scikit-Learn did not have to run OvA or OvO because Random Forest classifiers
can directly classify instances into multiple classes. You can call predict_proba() to get the
list of probabilities that the classifier assigned to each instance for each class:
forest_clf.predict_proba([some_digit])
array([[0. , 0. , 0.01, 0.08, 0. , 0.9 , 0. , 0. , 0. , 0.01]])
• You can see that the classifier is fairly confident about its prediction: the 0.9 at the 5th
index in the array means that the model estimates a 90% probability that the image
represents a 5.
Multiclass Classification
• Now of course you want to evaluate these classifiers. As usual, you want to use cross-validation. Let's evaluate the SGDClassifier's accuracy using the cross_val_score() function:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([0.8489802 , 0.87129356, 0.86988048])
• Simply scaling the inputs increases accuracy above 89%:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([0.89707059, 0.8960948 , 0.90693604])
Error Analysis
• Here, we will assume that you have found a promising model and you want to
find ways to improve it.
• One way to do this is to analyze the types of errors it makes.
• First, you can look at the confusion matrix.
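• A hedged sketch of how that confusion matrix can be computed (reusing the scaled training set from the previous slide; variable names are assumptions):
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
# Out-of-fold predictions for the full 10-class problem
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)   # 10x10: rows = actual digit, columns = predicted digit
conf_mx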
Error Analysis
• It’s often more convenient to look at
an image representation of the
confusion matrix.
• This confusion matrix looks fairly
good, since most images are on the
main diagonal, which means that they
were classified correctly.
• The 5s look slightly darker than the
other digits, which could mean that
there are fewer images of 5s in the
dataset or that the classifier does not
perform as well on 5s as on other
digits.
• In fact, you can verify that both are
the case.
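• A minimal sketch of such an image representation, assuming Matplotlib and the conf_mx matrix from the previous sketch:
import matplotlib.pyplot as plt
plt.matshow(conf_mx, cmap=plt.cm.gray)   # brighter cells = larger counts
plt.show()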
Error Analysis
• To plot the errors:
• First, you need to divide each value in the confusion matrix by the number of
images in the corresponding class, so you can compare error rates instead of
absolute number of errors.
• Now let’s fill the diagonal with zeros to keep only the errors, and let’s plot the
result:
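• A hedged sketch of those two steps in code (again assuming the conf_mx matrix from the earlier sketch):
import numpy as np
import matplotlib.pyplot as plt
row_sums = conf_mx.sum(axis=1, keepdims=True)   # number of images in each actual class
norm_conf_mx = conf_mx / row_sums               # error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)               # keep only the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()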
Error Analysis
• Now you can clearly see the kinds of errors the classifier makes.
• Remember that rows represent actual classes, while columns represent predicted
classes.
• The columns for classes 8 and 9 are quite bright, which tells you that many
images get misclassified as 8s or 9s.
• Similarly, the rows for classes 8 and 9 are also quite bright, telling you that 8s and
9s are often confused with other digits.
• Conversely, some rows are pretty dark, such as row 1: this means that most 1s are
classified correctly (a few are confused with 8s, but that’s about it).
• Notice that the errors are not perfectly symmetrical; for example, there are more
5s misclassified as 8s than the reverse.
Error Analysis
• Analyzing the confusion matrix can often give you insights on ways to
improve your classifier.
• Looking at this plot, it seems that your efforts should be spent on
improving classification of 8s and 9s, as well as fixing the specific 3/5
confusion.
• For example, you could try to gather more training data for these digits.
• Or you could engineer new features that would help the classifier—for
example, writing an algorithm to count the number of closed loops (e.g., 8
has two, 6 has one, 5 has none).
• Or you could preprocess the images (e.g., using Scikit-Image, Pillow, or
OpenCV) to make some patterns stand out more, such as closed loops.
Error Analysis
• Analyzing individual errors can also be a good way to gain insights on what your
classifier is doing and why it is failing, but it is more difficult and time-consuming.
• For example, let’s plot examples of 3s and 5s.
Error Analysis
• The two 5×5 blocks on the left show digits classified
as 3s, and the two 5×5 blocks on the right show
images classified as 5s.
• Some of the digits that the classifier gets wrong (i.e.,
in the bottom-left and top-right blocks) are so badly
written that even a human would have trouble
classifying them (e.g., the 5 on the 8th row and 1st
column truly looks like a 3).
• However, most misclassified images seem like
obvious errors to us, and it’s hard to understand why
the classifier made the mistakes it did.
• This classifier is quite sensitive to image shifting and
rotation. So one way to reduce the 3/5 confusion
would be to preprocess the images to ensure that
they are well centered and not too rotated.
• This will probably help reduce other errors as well.
Error Analysis: Mean Square Error
• When the purpose of the model is prediction, a reasonable
parameter to validate the model’s quality is the mean squared error
(MSE) of prediction.
• The lower the MSE, the smaller the error and the better the estimator.
• The Mean Squared Error is calculated as:
• MSE or E = (1/n) * Σ(actual – forecast)²
• where:
• Σ – a symbol that means “sum”
• n – sample size
• actual – the actual data value
• forecast – the predicted data value
Error Analysis: Mean Square Error
The squared error is calculated as (actual – forecast)²
Error Analysis: Mean Square Error
• Calculate the Mean Squared Error.
• MSE or E = (1/12) * (98) = 8.166.
• The squared error is zero when our
model makes a perfectly correct
prediction on every training example.
• Moreover, the closer E is to 0, the
better our model is.
• As a result, our goal will be to select
our parameter vector (the values for
all the weights in our model) such
that E is as close to 0 as possible.
Error Analysis: Mean Absolute Error
• The Mean Absolute Error is calculated as:
• MAE = (1/n) * Σ|actual – forecast|
• where:
• n = the number of observations in the dataset
• Σ = summation symbol (which means “add them all up”)
• |actual – forecast| = the absolute errors
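• A small sketch of both metrics in code (the numbers below are toy values, not the table from the slides):
import numpy as np
actual = np.array([5.0, 8.0, 12.0, 15.0])     # toy actual values
forecast = np.array([6.0, 7.5, 11.0, 16.0])   # toy predicted values
mse = np.mean((actual - forecast) ** 2)       # (1/n) * sum of squared errors
mae = np.mean(np.abs(actual - forecast))      # (1/n) * sum of absolute errors
print(mse, mae)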
Multiclass Classification: Performance measure
• [Figures: per-class confusion matrices for Class 1 and Class 2]
• Using this concept, we can calculate the class-wise accuracy,
precision, recall, and f1-scores and tabulate the results:
Multiclass Classification: Performance measure
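• A hedged sketch of how such a table can be produced with Scikit-Learn (reusing the multiclass predictions from the error-analysis sketch):
from sklearn.metrics import classification_report
# Per-class precision, recall and f1-score, plus overall accuracy, in one table
print(classification_report(y_train, y_train_pred))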
Multilabel Classification
• Until now each instance has always been assigned to just one class. In
some cases you may want your classifier to output multiple classes for each
instance.
• For example, consider a face-recognition classifier: what should it do if it
recognizes several people on the same picture?
• Of course it should attach one label per person it recognizes.
• Say the classifier has been trained to recognize three faces, Alice, Bob, and
Charlie; then when it is shown a picture of Alice and Charlie, it should
output [1, 0, 1] (meaning “Alice yes, Bob no, Charlie yes”).
• Such a classification system that outputs multiple binary labels is called a
multilabel classification system.
Multilabel Classification
• This code creates a y_multilabel array containing two target labels for each digit
image: the first indicates whether or not the digit is large (7, 8, or 9) and the
second indicates whether or not it is odd.
• The next lines create a KNeighborsClassifier instance (which supports multilabel
classification, but not all classifiers do) and we train it using the multiple targets
array.
• Now you can make a prediction, and notice that it outputs two labels.
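• The code referred to above is not reproduced on the slide; a hedged sketch along these lines would match the description (variable names are assumptions):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)                   # first label: is the digit large (7, 8 or 9)?
y_train_odd = (y_train % 2 == 1)                 # second label: is the digit odd?
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()                 # KNeighborsClassifier supports multilabel targets
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])                    # e.g. array([[False,  True]]) for a 5 (not large, odd)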
Multilabel Classification
• There are many ways to evaluate a multilabel classifier, and selecting
the right metric really depends on your project. For example, one
approach is to measure the F1 score for each individual label (or any
other binary classifier metric discussed earlier), then simply compute
the average score.
• This code computes the average F1 score across all labels:
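• Again, the code is not shown on the slide; a hedged sketch of that computation:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")   # unweighted mean of the per-label F1 scores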
Multilabel Classification
• This assumes that all labels are equally important, which may not be
the case.
• In particular, if you have many more pictures of Alice than of Bob or
Charlie, you may want to give more weight to the classifier’s score on
pictures of Alice.
• One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label).
• To do this, simply set average="weighted" in the preceding code.
Linear Regression
• Linear regression is a type of supervised machine learning algorithm
that computes the linear relationship between a dependent variable
and one or more independent features.
• When the number of independent features is 1, it is known as univariate (simple) linear regression; when there is more than one feature, it is known as multivariate (multiple) linear regression.
Linear Regression
• Simple linear regression is the simplest form of linear regression, and it
involves only one independent variable and one dependent variable.
• The equation for simple linear regression is:
• Y = β0 + β1X
• where:
• Y is the dependent variable
• X is the independent variable
• β0 is the intercept
• β1 is the slope
Linear Regression
• Multiple linear regression involves more than one independent
variable and one dependent variable.
• The equation for multiple linear regression is:
• Y = β0 + β1X1 + β2X2 + … + βnXn
• where:
• Y is the dependent variable
• X1, X2, …, Xn are the independent variables
• β0 is the intercept
• β1, β2, …, βn are the slopes
Linear Regression
• Here Y is called a dependent or target variable and X is called an independent
variable also known as the predictor of Y.
• There are many types of functions or models that can be used for regression.
• A linear function is the simplest type: for a single feature x, the prediction is y = θ1 + θ2·x.
• Here, X may be a single feature or multiple features representing the problem.
• The model gets the best regression fit line by finding the best θ1 and θ2 values.
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are
finally using our model for prediction, it will predict the value of y for the input
value of x.
Linear Regression
• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted
value and the true value Y is minimum.
• So, it is very important to update the θ1 and θ2 values, to reach the
best value that minimizes the error between the predicted y value
(pred) and the true y value (y).
• In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values and the actual values.
• The MSE cost function can be written as:
• MSE = (1/n) * Σ(pred – y)²
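• As an illustration, a minimal sketch of fitting a linear model with Scikit-Learn, which finds the intercept and coefficient minimizing this MSE (the data below is made up):
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # toy feature values
y = np.array([2.1, 3.9, 6.2, 8.1])           # toy target values (roughly y = 2x)
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)     # learned intercept (θ1) and slope (θ2)
print(lin_reg.predict([[5.0]]))              # prediction for a new x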
Linear Regression: Simple: Example
• [Worked example data table (mostly lost in extraction); two surviving rows: 7, 5.2 and 12, 8.5]
Bias
Linear Regression: Multiple: Example
• [Worked multiple-regression example (content lost in extraction)]
Regularization
• Regularization is a technique which makes slight modifications to the learning
algorithm such that the model generalizes better.
• This in turn improves the model’s performance on the unseen data as well.
• Regularization refers to techniques that are used to calibrate machine learning
models in order to minimize the adjusted loss function and prevent overfitting or
underfitting.
Regularization
• There are two main types of regularization techniques: Ridge
Regularization and Lasso Regularization.
Regularization: Ridge
• Also known as L2 regression, it modifies overfitted or underfitted models by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.
• This means that the mathematical function representing our model is minimized and the coefficients are calculated.
• The magnitude of coefficients is squared and added.
• Ridge Regression performs regularization by shrinking the coefficients
present.
• The function depicted below (next slide) shows the cost function of
ridge regression.
Regularization: Ridge
• Residual sum of squares (RSS) measures how well a linear regression model matches the training data.
• It is represented by the formulation: RSS = Σ(yi – ŷi)²
• The ridge cost function adds the squared-coefficient penalty to this term: Cost = RSS + λ * Σβj², where λ controls the strength of the penalty.
Polynomial Regression
• A straight line cannot fit nonlinear data well, but you can add powers of each feature as new features and then train a linear model on this extended set of features; this technique is called Polynomial Regression.
• For example, Scikit-Learn's PolynomialFeatures class can be used to transform the training data so that X_poly contains the original feature of X plus the square of this feature.
• Now you can fit a LinearRegression model to this extended training data.
Polynomial Regression
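• A hedged sketch of those steps (the quadratic toy-data generation mirrors the usual textbook setup and is an assumption):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
m = 100
X = 6 * np.random.rand(m, 1) - 3                  # toy data: y = 0.5*x^2 + x + 2 + noise
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)           # original feature plus its square
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_                 # roughly 2.0 and [1.0, 0.5]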
• Not bad: the model estimates ŷ = 0.56·x1² + 0.93·x1 + 1.78 when in fact the original function was y = 0.5·x1² + 1.0·x1 + 2.0 + Gaussian noise.
Polynomial Regression
• This high-degree (300) Polynomial Regression model is severely overfitting the
training data, while the linear model is underfitting it.
• The model that will generalize best in this case is the quadratic model.
• It makes sense since the data was generated using a quadratic model, but in
general you won’t know what function generated the data.
• You can use cross-validation to get an estimate of a model’s generalization
performance.
• If a model performs well on the training data but generalizes poorly according to
the cross-validation metrics, then your model is overfitting.
• If it performs poorly on both, then it is underfitting.
• This is one way to tell when a model is too simple or too complex.
Polynomial Regression
• Another way is to look at the
learning curves: these are plots
of the model’s performance on
the training set and the
validation set.
• Let’s look at the learning curves
of the plain Linear Regression
model.
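• A hedged sketch of one common way to produce such learning curves, by training on larger and larger subsets of the training data (the helper name and plotting details are assumptions):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])                      # train on the first m instances
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")   # RMSE on the training subset
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")        # RMSE on the validation set
    plt.legend()
    plt.show()

plot_learning_curves(LinearRegression(), X, y)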
Polynomial Regression
• Let’s look at the performance on the training data: when there are just one or two instances in
the training set, the model can fit them perfectly, which is why the curve starts at zero.
• But as new instances are added to the training set, it becomes impossible for the model to fit the
training data perfectly, both because the data is noisy and because it is not linear at all.
• So the error on the training data goes up until it reaches a plateau, at which point adding new
instances to the training set doesn’t make the average error much better or worse.
• Now let’s look at the performance of the model on the validation data.
• When the model is trained on very few training instances, it is incapable of generalizing properly,
which is why the validation error is initially quite big.
• Then as the model is shown more training examples, it learns and thus the validation error slowly
goes down.
• However, once again a straight line cannot do a good job modeling the data, so the error ends up
at a plateau, very close to the other curve.
• These learning curves are typical of an underfitting model.
• Both curves have reached a plateau; they are close and fairly high.
SoftMax Regression
• The SoftMax Regression classifier predicts only one class at a time (i.e., it is
multiclass, not multioutput) so it should be used only with mutually exclusive
classes such as different types of plants.
• You cannot use it to recognize multiple people in one picture.
• Now that you know how the model estimates probabilities and makes
predictions, let’s take a look at training.
• The objective is to have a model that estimates a high probability for the target
class (and consequently a low probability for the other classes).
SoftMax Regression
• Let’s use SoftMax Regression to classify the iris flowers into all three classes.
Scikit-Learn’s LogisticRegression uses one-versus-all by default when you train it
on more than two classes, but you can set the multi_class hyperparameter to
"multinomial“ to switch it to SoftMax Regression instead.
• You must also specify a solver that supports SoftMax Regression, such as the
"lbfgs" solver (see Scikit-Learn’s documentation for more details).
• It also applies ℓ2 regularization by default, which you can control using the
hyperparameter C.
SoftMax Regression
• So the next time you find an iris with 5 cm long and 2 cm wide petals,
you can ask your model to tell you what type of iris it is, and it will
answer Iris-Virginica (class 2) with 94.2% probability (or Iris-Versicolor
with 5.8% probability):
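• A hedged sketch of this setup on the iris data (the hyperparameter C=10 and the feature selection are assumptions; in recent Scikit-Learn versions multinomial behaviour is the default and the multi_class argument is deprecated):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X = iris["data"][:, (2, 3)]          # petal length, petal width
y = iris["target"]
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)
softmax_reg.predict([[5, 2]])        # e.g. array([2]) -> Iris-Virginica
softmax_reg.predict_proba([[5, 2]])  # e.g. about 94% for class 2, 6% for class 1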
SoftMax Regression
• In training the objective is to have a model that estimates high
probability for the target class and low probability for the other
classes.
• Keeping this objective in mind, we are going to minimize the cost function, also called cross entropy.
• The cross-entropy cost function penalizes the model when it estimates a low probability for a target class.
• Cross entropy is a measure of how well a set of estimated class
probabilities match the target classes.
• So, we can say that it measures the performance of the model.
SoftMax Regression
• The cross-entropy loss for one instance is: H(y, p) = – Σ yi · log(pi)
• Where:
• H(y, p) is the cross-entropy loss.
• y is a one-hot encoded vector representing the true class (e.g., [0, 1, 0] for the second class in a 3-class problem).
• p is a vector of predicted class probabilities for each class.
• log(pi) computes the natural logarithm of the predicted probability for class i.
• The sum is taken over all classes.
SoftMax Regression
• The purpose of the Cross-Entropy is to take the output probabilities (P) and
measure the distance from the truth values (as shown in Figure below).
• For the example above the desired output is [1,0,0,0] for the class dog but the
model outputs [0.775, 0.116, 0.039, 0.070] .
• The target probability yi is equal to 1 or 0 depending on which class the instance belongs to.
• When there are just two classes (K = 2), this cost function is the same as the logistic regression cost function.
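• For the dog example above, the loss reduces to the negative log of the probability assigned to the true class; a quick check in code:
import numpy as np
y_true = np.array([1, 0, 0, 0])                    # one-hot target for the class dog
p_pred = np.array([0.775, 0.116, 0.039, 0.070])    # the model's predicted probabilities
cross_entropy = -np.sum(y_true * np.log(p_pred))   # = -log(0.775) ≈ 0.255
print(cross_entropy)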
SoftMax Regression
• Figure 4-25 shows the resulting decision boundaries, represented by the background colors.
• Notice that the decision boundaries between any two classes are linear.
• The figure also shows the probabilities for the Iris-Versicolor class, represented by the curved lines (e.g., the
line labeled with 0.450 represents the 45% probability boundary).
• Notice that the model can predict a class that has an estimated probability below 50%.
• For example, at the point where all decision boundaries meet, all classes have an equal estimated probability
of 33%.