Module 2


Supervised Learning

Techniques
Dr. Jyotismita Chaki
Binary Classifier
• The algorithm which implements the classification on a dataset is
known as a classifier.
• If the classification problem has only two possible outcomes, it is called a
Binary Classifier.
• Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or
DOG, etc.
Training a Binary Classifier
• Consider the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students
and employees of the US Census Bureau.
• Each image is labeled with the digit it represents.
• We only try to identify one digit—for example, the number 5.
• This “5-detector” will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and
not-5.
• Let’s create the target vectors for this classification task:
y_train_5 = (y_train == 5) # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)
• Now let’s pick a classifier and train it.
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
• Now you can use it to detect images of the number 5:
sgd_clf.predict([some_digit])
array([ True])
• The classifier guesses that this image represents a 5 (True).
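• For reference, a minimal end-to-end sketch of this 5-detector. It assumes the data is fetched with fetch_openml("mnist_784") and split into the usual 60,000 training and 10,000 test images; some_digit is simply the first training image:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(np.uint8)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

y_train_5 = (y_train == 5)      # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

some_digit = X_train[0]         # a single image to test on
print(sgd_clf.predict([some_digit]))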
Performance measures
• Evaluating the performance of a Machine learning model is one of the
important steps while building an effective ML model.
• To evaluate the performance or quality of the model, different metrics are
used, and these metrics are known as performance metrics or evaluation
metrics.
• These performance metrics help us understand how well our model has
performed for the given data.
• In this way, we can improve the model's performance by tuning the hyperparameters.
• Each ML model aims to generalize well on unseen/new data, and
performance metrics help determine how well the model generalizes on
the new dataset.
Performance measures: Cross validation
• To tackle the problem of overfitting we can use Cross Validation.
• A key challenge with overfitting, and with machine learning in general, is
that we can’t know how well our model will perform on new data until we
actually test it.
• To address this, we can split our initial dataset into
separate training and test subsets.
• There are different types of Cross Validation techniques, but the overall
concept remains the same:
• To partition the data into a number of subsets
• Hold out a set at a time and train the model on remaining set
• Test model on hold out set
• Repeat the process for each subset of the dataset
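• A sketch of this procedure implemented by hand (not from the slides), assuming X_train and y_train_5 are NumPy arrays as in the MNIST setup above:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)                       # fresh copy of the classifier
    X_train_folds = X_train[train_index]             # train on the remaining folds
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]                # hold out one fold
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    print(sum(y_pred == y_test_fold) / len(y_pred))  # accuracy on the held-out fold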
Performance measures: Cross validation
• Let’s use the cross_val_score() function to evaluate your SGDClassifier
model using K-fold cross-validation, with three folds.
• Remember that K-fold cross-validation means splitting the training
set into K-folds (in this case, three), then making predictions and
evaluating them on each fold using a model trained on the remaining
folds
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([0.96355, 0.93795, 0.95615])
Performance measures: Confusion matrix
• A confusion matrix is a matrix that summarizes the performance of a
machine learning model on a set of test data.
• It is often used to measure the performance of classification models,
which aim to predict a categorical label for each input instance.
• The matrix displays the number of true positives (TP), true negatives
(TN), false positives (FP), and false negatives (FN) produced by the
model on the test data.
• For binary classification, the matrix will be a 2X2 table.
• For multi-class classification, the matrix shape will be equal to the
number of classes, i.e., for n classes it will be nXn.
Performance measures: Confusion matrix
• A 2X2 confusion matrix is shown below for an image-recognition task
with two classes: Dog and Not Dog.
• True Positive (TP): the number of instances where both the predicted
and actual values are Dog.
• True Negative (TN): the number of instances where both the predicted
and actual values are Not Dog.
• False Positive (FP): the number of instances where the prediction is
Dog while the actual value is Not Dog.
• False Negative (FN): the number of instances where the prediction is
Not Dog while the actual value is Dog.
Performance measures: Confusion matrix
From the confusion matrix, we can find the following
metrics
1. Accuracy: Accuracy is used to measure the overall performance of the
model. It is the ratio of correctly classified instances to the total number
of instances:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For the above case:
Accuracy = (5+3)/(5+3+1+1) = 8/10 = 0.8

2. Precision: Precision is a measure of how accurate a model's positive
predictions are. It is defined as the ratio of true positive predictions to
the total number of positive predictions made by the model:

Precision = TP / (TP + FP)
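• As a quick check, the same formulas in code, using the Dog/Not-Dog counts above (TP = 5, TN = 3, FP = 1, FN = 1); the precision value is computed here only for illustration:
TP, TN, FP, FN = 5, 3, 1, 1
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 0.8
precision = TP / (TP + FP)                   # 0.833...
print(accuracy, precision)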
Performance measures: Confusion matrix
• However, many real-world applications have a high imbalance of
classes.
• These are the cases when one category has significantly more
frequent occurrences than the other.
• Imagine that the actual balance of spam and non-spam emails looks
like this. Out of 60 emails, only 3 (or 5%) are truly spam.
Performance measures: Confusion matrix
• Suppose your model has predicted every email as non-spam.
• The model predicting the majority (non-spam) class all the time will
mostly be right, leading to very high accuracy.
• In this specific example, the accuracy is 95%: yes, the model missed
every spam email, but it was still right in 57 cases out of 60.
• However, this accuracy is now meaningless.
• The model does not serve the primary goal and does not help identify
the target event.
Performance measures: Confusion matrix
• The precision is 50%. The model labeled 6 emails as spam and was right half the
time. 3 out of 6 emails labeled as spam were, in fact, spam (true positives).
• The other 3 were misses (false positives). The precision is 3/(3+3)=50%.
• Precision answers the question: how often are the positive predictions correct?
Performance measures: Confusion matrix
3. Recall or sensitivity or true positive rate: Recall measures the
effectiveness of a classification model in identifying all relevant
instances from a dataset. It is the ratio of the number of true
positive (TP) instances to the sum of true positive and false negative
(FN) instances. Recall answers the question: can an ML model find
all instances of the positive class?

Recall = TP / (TP + FN)

4. F1-Score: F1-score is used to evaluate the overall performance of a
classification model. It is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

A higher F1 score denotes a better quality classifier.
Performance measures: Confusion matrix
• The recall is 100%. There were 3 spam emails in the dataset, and the
model found all of them! We calculate it as 3/(3+0). There were no
false negatives since the model did not miss spam.
• This way, recall shows yet another dimension of the model quality. All
in all, this fictional model has 95% accuracy, 50% precision, and 100%
recall.
Performance measures: Confusion matrix
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
array([[53057, 1522],
[ 1325, 4096]])
• Each row in a confusion matrix represents an actual class, while each column represents a
predicted class.
• The first row of this matrix considers non-5 images (the negative class): 53,057 of them were
correctly classified as non-5s (they are called true negatives), while the remaining 1,522 were
wrongly classified as 5s (false positives).
• The second row considers the images of 5s (the positive class): 1,325 were wrongly classified as
non-5s (false negatives), while the remaining 4,096 were correctly classified as 5s (true positives).
Performance measures: Confusion matrix
• A perfect classifier would have only true positives and true negatives,
so its confusion matrix would have nonzero values only on its main
diagonal (top left to bottom right):

y_train_perfect_predictions = y_train_5  # pretend we reached perfection
confusion_matrix(y_train_5, y_train_perfect_predictions)
array([[54579, 0],
       [ 0, 5421]])
Performance measures: Precision and Recall
• Scikit-Learn provides several functions to compute classifier metrics,
including precision and recall:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1522)
0.7290850836596654
recall_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1325)
0.7555801512636044
• When it claims an image represents a 5, it is correct only 72.9% of the time.
• Moreover, it only detects 75.6% of the 5s.
Performance measures: Precision and Recall
• It is often convenient to combine precision and recall into a single metric called the F1 score, in
particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean
of precision and recall.
• Whereas the regular mean treats all values equally, the harmonic mean gives much more weight
to low values. As a result, the classifier will only get a high F1 score if both recall and precision are
high.

• To compute the F1 score, simply call the f1_score() function:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
0.7420962043663375
• The F1 score favors classifiers that have similar precision and recall.
Performance measures: Precision/Recall
Tradeoff
• The idea behind the precision/recall trade-off is that changing the threshold
used to decide whether a class is positive or negative tilts the scales.
• It will cause precision to increase and recall to decrease, or vice
versa.
Performance measures: Precision/Recall
Tradeoff
• Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it
uses to make predictions.
• Instead of calling the classifier’s predict() method, you can call its decision_function() method, which returns a
score for each instance, and then make predictions based on those scores using any threshold you want:
y_scores = sgd_clf.decision_function([some_digit])
y_scores
array([2412.53175101])
threshold = 0
y_some_digit_pred = (y_scores > threshold)
array([ True])
• The SGDClassifier uses a threshold equal to 0, so the previous code returns the same result as the predict() method
(i.e., True). Let’s raise the threshold:
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
array([False])
• This confirms that raising the threshold decreases recall.
• The image actually represents a 5, and the classifier detects it when the threshold is 0, but it misses it when the
threshold is increased to 8,000.
Performance measures: Precision/Recall
Tradeoff
• Now how do you decide which threshold to use?
• For this you will first need to get the scores of all instances in the training set
using the cross_val_predict() function again, but this time specifying that you
want it to return decision scores instead of predictions:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
method="decision_function")
• Now with these scores you can compute precision and recall for all possible
thresholds using the precision_recall_curve() function:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
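• With these values you can plot precision and recall as functions of the threshold. A minimal sketch (Matplotlib assumed; precisions and recalls have one more element than thresholds, hence the [:-1]):
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")
    plt.grid(True)

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()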
Performance measures: Precision/Recall
Tradeoff
Performance measures: Precision, Recall: ROC
Curve
• The receiver operating characteristic (ROC) curve is another common tool used
with binary classifiers.
• It is very similar to the precision/recall curve, but instead of plotting precision
versus recall, the ROC curve plots the true positive rate (another name for recall)
against the false positive rate.
• The FPR is the ratio of negative instances that are incorrectly classified as
positive.
• It is equal to one minus the true negative rate, which is the ratio of negative
instances that are correctly classified as negative.
• The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall)
versus 1 – specificity.
Performance measures: Precision, Recall: ROC
Curve

Now, assume 100 actual positives (P) and 900 actual negatives (N), with
TP = 94, FN = 6, FP = 50 and TN = 850. Then:
TPR = TP/P = 94/100 = 94%
TNR = TN/N = 850/900 = 94.4%
FPR = FP/N = 50/900 = 5.5%
FNR = FN/P = 6/100 = 6%

Here TPR and TNR are high while FPR and FNR are low, so the model is
neither underfitting nor overfitting.
Performance measures: Precision, Recall: ROC
Curve
• To plot the ROC curve, you first need to compute the TPR and FPR for
various threshold values, using the roc_curve() function:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
• Then you can plot the FPR against the TPR using Matplotlib. This code
produces the plot in Figure 3-6:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal
    [...]  # Add axis labels and grid

plot_roc_curve(fpr, tpr)
plt.show()
Performance measures: Precision, Recall: ROC
Curve
• Once again there is a tradeoff: the higher
the recall (TPR), the more false positives
(FPR) the classifier produces.
• The dotted line represents the ROC curve
of a purely random classifier; a good
classifier stays as far away from that line as
possible (toward the top-left corner).
• One way to compare classifiers is to
measure the area under the curve (AUC). A
perfect classifier will have a ROC AUC equal
to 1, whereas a purely random classifier
will have a ROC AUC equal to 0.5.
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
0.9611778893101814
Performance measures: Precision, Recall: ROC
Curve
• AUROC = 1
• A model whose
predictions are 100%
correct has an AUC
of 1.0.
Performance measures: Precision, Recall: ROC
Curve
• AUROC = 0.5
Performance measures: Precision, Recall: ROC
Curve
• AUROC = 0
• A model whose predictions are 100% wrong has an AUC of 0.0.
Performance measures: Precision, Recall: ROC
Curve
• Let’s train a RandomForestClassifier and compare its ROC curve and ROC AUC score to
the SGDClassifier. First, you need to get scores for each instance in the training set.
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
• But to plot a ROC curve, you need scores, not probabilities. A simple solution is to use
the positive class’s probability as the score:
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)
• Now you are ready to plot the ROC curve. It is useful to plot the first ROC curve as well to
see how they compare (Figure 3-7):
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
Performance measures: Precision, Recall: ROC
Curve
As you can see in Figure 3-7, the
RandomForestClassifier’s ROC
curve looks much better than
the SGDClassifier’s: it comes
much closer to the top-left
corner. As a result, its ROC AUC
score is also significantly better:

roc_auc_score(y_train_5, y_scores_forest)
0.9983436731328145
How to train binary classifiers: Summary
• Choose the appropriate metric for your task,
• Evaluate your classifiers using cross-validation,
• Select the precision/recall tradeoff that fits your needs, and
• Compare various models using ROC curves and ROC AUC scores.
Multiclass Classification
• Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called
multinomial classifiers) can distinguish between more than two classes.
• Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of
handling multiple classes directly.
• Others (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary
classifiers.
• However, there are various strategies that you can use to perform multiclass classification using
multiple binary classifiers.
• For example, one way to create a system that can classify the digit images into 10 classes (from 0
to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector,
and so on).
• Then when you want to classify an image, you get the decision score from each classifier for that
image and you select the class whose classifier outputs the highest score.
• This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest).
Multiclass Classification
• Another strategy is to train a binary classifier for every pair of digits: one to
distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and
so on.
• This is called the one-versus-one (OvO) strategy. If there are N classes, you need
to train N × (N – 1) / 2 classifiers.
• For the MNIST problem, this means training 45 binary classifiers! When you want
to classify an image, you have to run the image through all 45 classifiers and see
which class wins the most duels.
• The main advantage of OvO is that each classifier only needs to be trained on the
part of the training set for the two classes that it must distinguish.
Multiclass Classification
• Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size
of the training set, so for these algorithms OvO is preferred since it is faster to train many
classifiers on small training sets than training few classifiers on large training sets.
• For most binary classification algorithms, however, OvA is preferred.
• Instead of returning just one score per instance, it now returns 10 scores, one per class:
some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
array([[-15955.22627845, -38080.96296175, -13326.66694897, 573.52692379, -17680.6846644
, 2412.53175101, -25526.86498156, -12290.15704709, -7946.05205023, -10631.35888549]])
• The highest score is indeed the one corresponding to class 5:
np.argmax(some_digit_scores)
5
Multiclass Classification
• If you want to force Scikit-Learn to use one-versus-one or one-versus-all, you can
use the OneVsOneClassifier or OneVsRestClassifier classes.
• Simply create an instance and pass a binary classifier to its constructor.
• For example, this code creates a multiclass classifier using the OvO strategy,
based on an SGDClassifier:
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])
array([5], dtype=uint8)
len(ovo_clf.estimators_)
45
Multiclass Classification
• Training a RandomForestClassifier is just as easy:
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
array([5], dtype=uint8)
• This time Scikit-Learn did not have to run OvA or OvO because Random Forest classifiers
can directly classify instances into multiple classes. You can call predict_proba() to get the
list of probabilities that the classifier assigned to each instance for each class:
forest_clf.predict_proba([some_digit])
array([[0. , 0. , 0.01, 0.08, 0. , 0.9 , 0. , 0. , 0. , 0.01]])
• You can see that the classifier is fairly confident about its prediction: the 0.9 at the 5th
index in the array means that the model estimates a 90% probability that the image
represents a 5.
Multiclass Classification
• Now of course you want to evaluate these classifiers. As usual, you want to use
cross-validation. Let's evaluate the SGDClassifier's accuracy using the
cross_val_score() function:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([0.8489802 , 0.87129356, 0.86988048])
• Simply scaling the inputs increases accuracy above 89%:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([0.89707059, 0.8960948 , 0.90693604])
Error Analysis
• Here, we will assume that you have found a promising model and you want to
find ways to improve it.
• One way to do this is to analyze the types of errors it makes.
• First, you can look at the confusion matrix.
Error Analysis
• It’s often more convenient to look at
an image representation of the
confusion matrix.
• This confusion matrix looks fairly
good, since most images are on the
main diagonal, which means that they
were classified correctly.
• The 5s look slightly darker than the
other digits, which could mean that
there are fewer images of 5s in the
dataset or that the classifier does not
perform as well on 5s as on other
digits.
• In fact, you can verify that both are
the case.
Error Analysis
• To plot the errors:
• First, you need to divide each value in the confusion matrix by the number of
images in the corresponding class, so you can compare error rates instead of
absolute number of errors.

• Now let’s fill the diagonal with zeros to keep only the errors, and let’s plot the
result:
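• A sketch of these two steps in code; y_train_pred here is assumed to hold cross-validated predictions for the full multiclass task (e.g. from cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

conf_mx = confusion_matrix(y_train, y_train_pred)
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums          # error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)          # keep only the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()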
Error Analysis
• Now you can clearly see the kinds of errors the classifier makes.
• Remember that rows represent actual classes, while columns represent predicted
classes.
• The columns for classes 8 and 9 are quite bright, which tells you that many
images get misclassified as 8s or 9s.
• Similarly, the rows for classes 8 and 9 are also quite bright, telling you that 8s and
9s are often confused with other digits.
• Conversely, some rows are pretty dark, such as row 1: this means that most 1s are
classified correctly (a few are confused with 8s, but that’s about it).
• Notice that the errors are not perfectly symmetrical; for example, there are more
5s misclassified as 8s than the reverse.
Error Analysis
• Analyzing the confusion matrix can often give you insights on ways to
improve your classifier.
• Looking at this plot, it seems that your efforts should be spent on
improving classification of 8s and 9s, as well as fixing the specific 3/5
confusion.
• For example, you could try to gather more training data for these digits.
• Or you could engineer new features that would help the classifier—for
example, writing an algorithm to count the number of closed loops (e.g., 8
has two, 6 has one, 5 has none).
• Or you could preprocess the images (e.g., using Scikit-Image, Pillow, or
OpenCV) to make some patterns stand out more, such as closed loops.
Error Analysis
• Analyzing individual errors can also be a good way to gain insights on what your
classifier is doing and why it is failing, but it is more difficult and time-consuming.
• For example, let’s plot examples of 3s and 5s.
Error Analysis
• The two 5×5 blocks on the left show digits classified
as 3s, and the two 5×5 blocks on the right show
images classified as 5s.
• Some of the digits that the classifier gets wrong (i.e.,
in the bottom-left and top-right blocks) are so badly
written that even a human would have trouble
classifying them (e.g., the 5 on the 8th row and 1st
column truly looks like a 3).
• However, most misclassified images seem like
obvious errors to us, and it’s hard to understand why
the classifier made the mistakes it did.
• This classifier is quite sensitive to image shifting and
rotation. So one way to reduce the 3/5 confusion
would be to preprocess the images to ensure that
they are well centered and not too rotated.
• This will probably help reduce other errors as well.
Error Analysis: Mean Square Error
• When the purpose of the model is prediction, a reasonable
parameter to validate the model’s quality is the mean squared error
(MSE) of prediction.
• The lower the MSE, the smaller the error and the better the estimator.
• The Mean Squared Error is calculated as:
• MSE or E = (1/n) * Σ(actual – forecast)^2
• where:
• Σ – a symbol that means “sum”
• n – sample size
• actual – the actual data value
• forecast – the predicted data value
Error Analysis: Mean Square Error
The squared error is calculated by (actual – forecast)^2
Error Analysis: Mean Square Error
• Calculate the Mean Squared Error.
• MSE or E = (1/12) * (98) = 8.166.
• The squared error is zero when our
model makes a perfectly correct
prediction on every training example.
• Moreover, the closer E is to 0, the
better our model is.
• As a result, our goal will be to select
our parameter vector (the values for
all the weights in our model) such
that E is as close to 0 as possible.
Error Analysis: Mean Absolute Error
• MAE = (1/n) * Σ|actual – predicted|
• n = the number of observations in the dataset
• Σ = summation symbol (which means "add them all up")
• |actual – predicted| = the absolute errors

House   Actual (in K)   Predicted (in K)   Absolute Error (in K)
2 BHK   200             230                 30
3 BHK   300             290                 10
4 BHK   400             740                340
5 BHK   500             450                 50

MAE = (30K + 10K + 340K + 50K)/4 = 107.5K
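• The same calculation in code, with the values from the table above (in K):
actual    = [200, 300, 400, 500]
predicted = [230, 290, 740, 450]
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)   # 107.5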


Multiclass Classification: Performance measure
(Figure: multiclass confusion matrix, Predicted vs. Actual classes)
Multiclass Classification: Performance measure
(Figure: confusion-matrix counts computed for Class 1)
Multiclass Classification: Performance measure
(Figure: confusion-matrix counts computed for Class 2)
• Using this concept, we can calculate the class-wise accuracy,
precision, recall, and f1-scores and tabulate the results:
Multiclass Classification: Performance
measure
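• One convenient way to produce such a class-wise table is Scikit-Learn's classification_report; the labels below are hypothetical:
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # actual classes (hypothetical)
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # predicted classes (hypothetical)
print(classification_report(y_true, y_pred))   # per-class precision, recall, f1-score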
Multilabel Classification
• Until now each instance has always been assigned to just one class. In
some cases you may want your classifier to output multiple classes for each
instance.
• For example, consider a face-recognition classifier: what should it do if it
recognizes several people on the same picture?
• Of course it should attach one label per person it recognizes.
• Say the classifier has been trained to recognize three faces, Alice, Bob, and
Charlie; then when it is shown a picture of Alice and Charlie, it should
output [1, 0, 1] (meaning “Alice yes, Bob no, Charlie yes”).
• Such a classification system that outputs multiple binary labels is called a
multilabel classification system.
Multilabel Classification
• This code creates a y_multilabel array containing two target labels for each digit
image: the first indicates whether or not the digit is large (7, 8, or 9) and the
second indicates whether or not it is odd.
• The next lines create a KNeighborsClassifier instance (which supports multilabel
classification, but not all classifiers do) and we train it using the multiple targets
array.
• Now you can make a prediction, and notice that it outputs two labels.
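• A sketch of the code described above; X_train, y_train and some_digit come from the earlier MNIST setup:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)          # first label: digit is 7, 8 or 9
y_train_odd = (y_train % 2 == 1)        # second label: digit is odd
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])           # e.g. array([[False, True]]) for a 5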
Multilabel Classification
• There are many ways to evaluate a multilabel classifier, and selecting
the right metric really depends on your project. For example, one
approach is to measure the F1 score for each individual label (or any
other binary classifier metric discussed earlier), then simply compute
the average score.
• This code computes the average F1 score across all labels:
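• A sketch of that computation; knn_clf and y_multilabel come from the previous slide (the cross-validated predictions can take a while on MNIST):
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")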
Multilabel Classification
• This assumes that all labels are equally important, which may not be
the case.
• In particular, if you have many more pictures of Alice than of Bob or
Charlie, you may want to give more weight to the classifier’s score on
pictures of Alice.
• One simple option is to give each label a weight equal to its support
(i.e., the number of instances with that target label).
• To do this, simply set average="weighted" in the preceding code.
Linear Regression
• Linear regression is a type of supervised machine learning algorithm
that computes the linear relationship between a dependent variable
and one or more independent features.
• When there is only one independent feature, it is known as Univariate (simple)
Linear Regression; when there are more than one, it is known as Multivariate
(multiple) Linear Regression.
Linear Regression
• Simple linear regression is the simplest form of linear regression, and it
involves only one independent variable and one dependent variable.
• The equation for simple linear regression is:
Y = β0 + β1X
• where:
• Y is the dependent variable
• X is the independent variable
• β0 is the intercept
• β1 is the slope
Linear Regression
• Multiple linear regression involves more than one independent
variable and one dependent variable.
• The equation for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βpXp
• where:
• Y is the dependent variable
• X1, X2, …, Xp are the independent variables
• β0 is the intercept
• β1, β2, …, βp are the slopes
Linear Regression
• Here Y is called a dependent or target variable and X is called an independent
variable also known as the predictor of Y.
• There are many types of functions or modules that can be used for regression.
• A linear function is the simplest type of function.
• Here, X may be a single feature or multiple features representing the problem.
• The model gets the best regression fit line by finding the best θ1 and θ2 values.
• θ1: intercept
• θ2: coefficient of x
• Once we find the best θ1 and θ2 values, we get the best-fit line. So when we are
finally using our model for prediction, it will predict the value of y for the input
value of x.
Linear Regression
• To achieve the best-fit regression line, the model aims to predict the
target value such that the error difference between the predicted
value and the true value Y is minimum.
• So, it is very important to update the θ1 and θ2 values, to reach the
best value that minimizes the error between the predicted y value
(pred) and the true y value (y).
• In Linear Regression, the Mean Squared Error (MSE) cost function is
employed, which calculates the average of the squared errors
between the predicted values and the actual values.
• The MSE cost function can be calculated as:
MSE = (1/n) * Σ(pred_i – y_i)^2
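• A minimal fitting sketch with Scikit-Learn's LinearRegression (the data points are illustrative only):
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

lin_reg = LinearRegression().fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)   # θ1 (intercept) and θ2 (slope)
print(lin_reg.predict([[6]]))              # predicted y for a new x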
Linear Regression: Simple: Example
(Worked example on the slides: fitting a simple linear regression to sample data
points such as (7, 5.2) and (12, 8.5), then calculating the prediction error.)
Linear Regression: Multiple: Example
(Worked example on the slides: fitting a multiple linear regression with a bias
term, then calculating the prediction error for a sample instance such as
x1 = 5, x2 = 7, y = 10.)
Regularization
• Regularization is a technique which makes slight modifications to the learning
algorithm such that the model generalizes better.
• This in turn improves the model’s performance on the unseen data as well.
• Regularization refers to techniques that are used to calibrate machine learning
models in order to minimize the adjusted loss function and prevent overfitting or
underfitting.
Regularization
• There are two main types of regularization techniques: Ridge
Regularization and Lasso Regularization.
Regularization: Ridge
• Also known as L2 Regression, it modifies over-fitted or under-fitted models
by adding a penalty equivalent to the sum of the squares of the magnitude
of the coefficients.
• This means that the mathematical function representing our deep
learning model is minimized and coefficients are calculated.
• The magnitude of coefficients is squared and added.
• Ridge Regression performs regularization by shrinking the coefficients
present.
• The function depicted below (next slide) shows the cost function of
ridge regression.
Regularization: Ridge
• Residual sum of squares (RSS) measures how well a linear regression model
matches training data.
• It is represented by the formulation:

RSS = Σ(actual_i – predicted_i)^2

• This formula measures model prediction accuracy for ground-truth values
in the training data.
• If RSS = 0, the model perfectly predicts dependent variables.
• A score of zero is not always desirable, however, as it can
indicate overfitting on the training data, particularly if the training dataset
is small.
Regularization: Ridge
• Specifically, ridge regression corrects the overfitting by introducing a
regularization term (often called the penalty term) into the RSS
function.
• This penalty term is the sum of the squares of the model’s
coefficients.
• It is represented in the formulation:
L2 penalty = Σ β_j^2
Regularization: Ridge
• The L2 penalty term is inserted at the end of the RSS function, resulting in a new
formulation, the ridge regression estimator.
• Therein, its effect on the model is controlled by the hyperparameter lambda (λ):

Ridge cost = RSS + λ * Σ β_j^2

• Lambda is the hyperparameter that is tuned to prevent overfitting.
• The choice of the regularization parameter λ is crucial in regularization.
• A larger λ value increases the amount of regularization, leading to more
coefficients being pushed towards zero.
• Conversely, a smaller λ value reduces the regularization effect, allowing more
variables to have non-zero coefficients.
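• A minimal sketch (not from the slides) using Scikit-Learn's Ridge, where alpha plays the role of λ; the data points are made up for illustration:
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

ridge_reg = Ridge(alpha=1.0)    # larger alpha => stronger shrinkage of the coefficients
ridge_reg.fit(X, y)
print(ridge_reg.coef_, ridge_reg.intercept_)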
Regularization: Lasso
• Also referred to as L1 regression.
• In this type of regularization, the absolute value of the magnitude of
coefficients multiplied with a regularizer term is added to the loss or
cost function.
• It can be represented with the following equation:
Lasso cost = RSS + λ * Σ |β_j|
Regularization: Lasso
• A fraction of the sum of the absolute values of the coefficients is added
to the loss function in L1 regularization.
• In this way, you will be able to eliminate some coefficients with lesser
values by pushing those values towards 0.
• Since the L1 regularization adds an absolute value as a penalty to the cost
function, the feature selection will be done by retaining only some
important features and eliminating the lower or unimportant features.
• This technique is also robust to outliers, i.e., the model will be able to
easily learn about outliers in the dataset.
• This technique will not be able to learn complex patterns from the input
data.
Regularization
S.No  L1 Regularization                              L2 Regularization
1     Penalizes the sum of absolute values of       Penalizes the sum of squared
      coefficients.                                  coefficients.
4     Has built-in feature selection.                No feature selection.
5     Robust to outliers.                            Not robust to outliers.
6     Generates simple and interpretable models.     Gives more accurate predictions when the
                                                     output variable is a function of all
                                                     input variables.
7     Unable to learn complex data patterns.         Able to learn complex data patterns.
Regularization: Elastic Net
• We had a dataset of figure prices,
where each entry in the dataset
contained the age of the figure as
well as its price for that age in €
(or any other currency).
• We then wanted to predict the
price of a figure given its age using
linear regression, to see how much
the figures depreciate over time.
• The dataset looked like the figure:
Regularization: Elastic Net
• We then split our dataset into a train set and a test set, and trained
our linear regression (OLS regression) model on the training data.
Here’s how that looked like:
Regularization: Elastic Net
• We then noticed that this
model had a very low training
error but a rather high testing
error and thus we concluded
that our linear regression
model is overfit.
• We then tried to come up
with an imaginary, better
model that was less overfit
and looked more like this:
Regularization: Elastic Net
• Since our model parameters can be negative, adding them
might decrease our loss instead of increasing it.
• In order to circumvent this, we can either square our model
parameters or take their absolute values:
Regularization: Elastic Net
• What we can do now is combine the two penalties, and we get the
loss function of elastic net:

Elastic net cost = RSS + α1 * Σ |β_j| + α2 * Σ β_j^2

• Instead of one regularization parameter α we now use two parameters,
one for each penalty. α1 controls the L1 penalty and α2 controls the L2 penalty.
• We can now use elastic net in the same way that we can use ridge or
lasso.
• If α1 = 0, then we have ridge regression.
• If α2 = 0, we have lasso.
Regularization: Elastic Net
• Alternatively, instead of using two α -parameters, we can also use just
one α and one L1-ratio-parameter, which determines the percentage
of our L1 penalty with regard to α.
• So if α = 1 and L1-ratio = 0.4, our L1 penalty will be multiplied by 0.4
and our L2 penalty will be multiplied by (1 – L1-ratio) = 0.6.
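• A minimal sketch (not from the slides) using Scikit-Learn's ElasticNet, where alpha is the overall regularization strength and l1_ratio is the L1-ratio parameter described above; the data is illustrative only:
import numpy as np
from sklearn.linear_model import ElasticNet

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.4)   # 40% L1 penalty, 60% L2 penalty
elastic_net.fit(X, y)
print(elastic_net.predict([[6.0]]))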
Regularization: Elastic Net
Gradient Descent
• Gradient Descent is one of the most commonly
used optimization algorithms: it trains machine
learning and deep learning models by minimizing
the error between actual and expected results.
• It is also used to train Neural Networks.
• It is an iterative optimization algorithm.
• It helps in finding the local minimum of a
function.
Gradient Descent
• Learning rate (also referred to as step size or the alpha) is the size of
the steps that are taken to reach the minimum.
• This is typically a small value, and it is evaluated and updated based
on the behavior of the cost function.
• High learning rates result in larger steps but risks overshooting the
minimum. Conversely, a low learning rate has small step sizes.
• While it has the advantage of more precision, the number of
iterations compromises overall efficiency as this takes more time and
computations to reach the minimum.
Gradient Descent
• The cost (or loss) function measures the difference, or error, between
actual y and predicted y at its current position.
• This improves the deep learning model's efficacy by providing
feedback to the model so that it can adjust the parameters (weights)
to minimize the error and find the local or global minimum.
• It continuously iterates, moving along the direction of steepest
descent (or the negative gradient) until the cost function is close to or
at zero.
• At this point, the model will stop learning.
Gradient Descent: Batch Gradient
• In batch gradient descent, to update the model parameter values like
weight and bias, the entire training dataset is used to compute the gradient
and update the parameters at each iteration.
• This can be slow for large datasets but may lead to a more accurate model.
• It is effective for convex or relatively smooth error manifolds because it
moves directly toward an optimal solution by taking a large step in the
direction of the negative gradient of the cost function.
• However, it can be slow for large datasets because it computes the gradient
and updates the parameters using the entire training dataset at each
iteration.
• This can result in longer training times and higher computational costs.
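• A minimal NumPy sketch (not from the slides) of batch gradient descent for linear regression; the data-generating line y = 4 + 3x and all hyperparameter values are made up for illustration:
import numpy as np

m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 (bias term) to each instance

eta = 0.1                                  # learning rate
n_iterations = 1000
theta = np.random.randn(2, 1)              # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)   # gradient over the full dataset
    theta = theta - eta * gradients

print(theta)                               # should end up close to [[4], [3]]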
Gradient Descent: Batch Gradient
• On the left, the learning rate is too low: the algorithm will eventually
reach the solution, but it will take a long time.
• In the middle, the learning rate looks pretty good: in just a few
iterations, it has already converged to the solution.
• On the right, the learning rate is too high: the algorithm diverges,
jumping all over the place and actually getting further and further
away from the solution at every step.
Gradient Descent: Stochastic
• Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm
that is used for optimizing deep learning models.
• It addresses the computational inefficiency of traditional Gradient Descent
methods when dealing with large datasets in deep learning projects.
• In SGD, instead of using the entire dataset for each iteration, only a single random
training example is selected to calculate the gradient and update the model
parameters.
• This random selection introduces randomness into the optimization process,
hence the term “stochastic” in stochastic Gradient Descent.
• The advantage of using SGD is its computational efficiency, especially when
dealing with large datasets.
• By using a single example or a small batch, the computational cost per iteration is
significantly reduced compared to traditional Gradient Descent methods that
require processing the entire dataset.
Gradient Descent: Stochastic
• Initialization: Randomly initialize the parameters of the model.
• Set Parameters: Determine the number of iterations and the learning rate (alpha) for
updating the parameters.
• Stochastic Gradient Descent Loop: Repeat the following steps until the model converges
or reaches the maximum number of iterations:
1. Shuffle the training dataset to introduce randomness.
2. Iterate over each training example in the shuffled order.
3. Compute the gradient of the cost function with respect to the model parameters using the
current training example.
4. Update the model parameters by taking a step in the direction of the negative gradient, scaled
by the learning rate.
5. Evaluate the convergence criteria, such as the difference in the cost function between
iterations of the gradient.
• Return Optimized Parameters: Once the convergence criteria are met or the maximum
number of iterations is reached, return the optimized model parameters.
Gradient Descent: Stochastic
• Note that since instances are picked randomly, some instances may be picked
several times per epoch while others may not be picked at all.
• If you want to be sure that the algorithm goes through every instance at each
epoch, another approach is to shuffle the training set, then go through it instance
by instance, then shuffle it again, and so on.
• However, this generally converges more slowly.
• To perform Linear Regression using SGD with Scikit-Learn, you can use the
SGDRegressor class, which defaults to optimizing the squared error cost function.
• The following code runs 50 epochs, starting with a learning rate of 0.1 (eta0=0.1).
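• A sketch of that code; X and y are assumed to be the linear-regression training data used earlier, and exact defaults may differ across Scikit-Learn versions:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
sgd_reg.intercept_, sgd_reg.coef_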
Gradient Descent: Mini Batch
• Mini Batch gradient descent is the combination of both batch
gradient descent and stochastic gradient descent.
• It divides the training datasets into small batch sizes then performs
the updates on those batches separately.
• Splitting the training dataset into smaller batches strikes a balance between
the computational efficiency of batch gradient descent and the speed of
stochastic gradient descent.
• Hence, we achieve a variant of gradient descent with higher
computational efficiency and less noisy gradient updates.
Gradient Descent: Mini Batch
• Figure (next slide) shows the paths taken by the three Gradient
Descent algorithms in parameter space during training.
• They all end up near the minimum, but Batch GD’s path actually stops
at the minimum, while both Stochastic GD and Mini-batch GD
continue to walk around.
• However, don’t forget that Batch GD takes a lot of time to take each
step, and Stochastic GD and Mini-batch GD would also reach the
minimum if you used a good learning schedule.
Gradient Descent: Mini Batch
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship
between a dependent(y) and independent variable(x) as nth degree polynomial.
• The Polynomial Regression equation is given below:
y = b0 + b1x + b2x^2 + b3x^3 + …… + bnx^n
• It is also called the special case of Multiple Linear Regression in ML.
• Because we add some polynomial terms to the Multiple Linear regression
equation to convert it into Polynomial Regression.
• It is a linear model with some modification in order to increase the accuracy.
• The dataset used in Polynomial regression for training is of non-linear nature.
• It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
Polynomial Regression
• If we apply a linear model to a linear dataset, it provides a good result,
as we have seen in Simple/Multiple Linear Regression. But if we apply the
same model, without any modification, to a non-linear dataset, the results
will be poor: the loss will increase, the error rate will be high, and the
accuracy will decrease.
• So for such cases, where data points are arranged in a non-linear
fashion, we need the Polynomial Regression model.
Polynomial Regression
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
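• For reference, a plausible version of the full data-generating snippet (the number of samples m = 100 and the feature range are assumptions):
import numpy as np

m = 100
X = 6 * np.random.rand(m, 1) - 3                    # feature values between -3 and 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)      # quadratic function plus Gaussian noise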
Polynomial Regression
• Clearly, a straight line will never fit this data properly. So let's use Scikit-Learn's
PolynomialFeatures class to transform our training data, adding the square
(2nd-degree polynomial) of each feature in the training set as new features (in this
case there is just one feature):
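• A sketch of the transformation and fit described here (X and y are the quadratic data from the previous slide):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)     # original feature plus its square

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_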

• X_poly now contains the original feature of X plus the square of this feature.
• Now you can fit a LinearRegression model to this extended training data
Polynomial Regression

• Not bad: the model estimates y' = 0.56x1^2 + 0.93x1 + 1.78 when in fact the original
function was y = 0.5x1^2 + 1.0x1 + 2.0 + Gaussian noise.
Polynomial Regression
• This high-degree (300) Polynomial Regression model is severely overfitting the
training data, while the linear model is underfitting it.
• The model that will generalize best in this case is the quadratic model.
• It makes sense since the data was generated using a quadratic model, but in
general you won’t know what function generated the data.
• You can use cross-validation to get an estimate of a model’s generalization
performance.
• If a model performs well on the training data but generalizes poorly according to
the cross-validation metrics, then your model is overfitting.
• If it performs poorly on both, then it is underfitting.
• This is one way to tell when a model is too simple or too complex.
Polynomial Regression
• Another way is to look at the
learning curves: these are plots
of the model’s performance on
the training set and the
validation set.
• Let’s look at the learning curves
of the plain Linear Regression
model.
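• A sketch of how such learning curves can be produced (assuming X and y are the quadratic data generated earlier):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])           # train on the first m instances
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()

plot_learning_curves(LinearRegression(), X, y)
plt.show()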
Polynomial Regression
• Let’s look at the performance on the training data: when there are just one or two instances in
the training set, the model can fit them perfectly, which is why the curve starts at zero.
• But as new instances are added to the training set, it becomes impossible for the model to fit the
training data perfectly, both because the data is noisy and because it is not linear at all.
• So the error on the training data goes up until it reaches a plateau, at which point adding new
instances to the training set doesn’t make the average error much better or worse.
• Now let’s look at the performance of the model on the validation data.
• When the model is trained on very few training instances, it is incapable of generalizing properly,
which is why the validation error is initially quite big.
• Then as the model is shown more training examples, it learns and thus the validation error slowly
goes down.
• However, once again a straight line cannot do a good job modeling the data, so the error ends up
at a plateau, very close to the other curve.
• These learning curves are typical of an underfitting model.
• Both curves have reached a plateau; they are close and fairly high.
Polynomial Regression

Now let’s look at


the learning
curves of a 10th -
degree
polynomial model
on the same
data
Polynomial Regression
• The error on the training data is much lower than with the Linear Regression
model.
• There is a gap between the curves.
• This means that the model performs significantly better on the training data than
on the validation data, which is the hallmark of an overfitting model.
• However, if you used a much larger training set, the two curves would continue to
get closer.
• One way to improve an overfitting model is to feed it more training data until the
validation error reaches the training error.
Polynomial Regression: Example
(Worked example on the slides.)
Logistic Regression
• Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique.
• It is used for predicting the categorical dependent variable using a given set of
independent variables.
• Logistic regression predicts the output of a categorical dependent variable.
• Therefore the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, true or False, etc. But instead of giving the exact value
as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression, except in how they are
used.
• Linear Regression is used for solving Regression problems, whereas Logistic regression is
used for solving the classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
Logistic Regression
• The curve from the logistic function indicates
the likelihood of something such as whether
the cells are cancerous or not, a mouse is
obese or not based on its weight, etc.
• Logistic Regression is a significant machine
learning algorithm because it has the ability
to provide probabilities and classify new data
using continuous and discrete datasets.
• Logistic Regression can be used to classify the
observations using different types of data and
can easily determine the most effective
variables used for the classification.
• The below image is showing the logistic
function:
Logistic Regression
import numpy
from sklearn import linear_model

# Reshaped for the Logistic function.
X = numpy.array([2.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

# Predict whether the tumor is cancerous when the size is 2.46mm:
predicted = logr.predict(numpy.array([2.46]).reshape(-1, 1))
print(predicted)
• O/P: [0]
Logistic Regression
Logistic Regression: Estimating Probabilities
Logistic Regression: Example
• Suppose Z = 2* hours + (-64)
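• A small worked illustration (the hours values are made up): with Z = 2 * hours - 64, the estimated probability is the sigmoid of Z:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for hours in (31, 32, 33):
    z = 2 * hours - 64
    print(hours, round(sigmoid(z), 3))   # ~0.119, 0.5, ~0.881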
Logistic Regression: Example
Logistic Regression: Decision Boundaries
• While training a classifier on a dataset, using a specific classification
algorithm, it is required to define a set of hyper-planes, called
Decision Boundary, that separates the data points into specific
classes, where the algorithm switches from one class to another.
• On one side of a decision boundary, a data point is more likely to be
labeled as class A; on the other side of the boundary, it is more likely
to be labeled as class B.
• The goal of logistic regression is to figure out some way to split the
data points so as to have an accurate prediction of a given observation's
class using the information present in the features.
Logistic Regression: Decision Boundaries
• Let’s suppose we define a line that describes the decision boundary.
• Then all of the data points on one side of the boundary belong to class A,
and all of the data points on the other side of the boundary belong to
class B.
• S(z)=1/(1+e^-z)
• S(z) = Output between 0 and 1 (probability estimate)
• z = Input to the function (z= mx + b)
• e = Base of natural log
Logistic Regression: Decision Boundaries
• Our current prediction function returns a probability score between 0 and
1.
• In order to map this to a discrete class (A/B), we select a threshold value or
tipping point above which we will classify values into class A and below
which we classify values into class B.
• p >= 0.5, class = A
• p < 0.5, class = B
• If our threshold was 0.5 and our prediction function returned 0.7, we
would classify this observation belongs to class A.
• If our prediction was 0.2 we would classify the observation belongs to class
B.
• So, line with 0.5 is called the decision boundary.
Logistic Regression: Decision Boundaries
• In order to map predicted values to probabilities, we use the Sigmoid
function.
Logistic Regression: Decision Boundaries
• In Logistic Regression,
Decision Boundary is a linear
line, which separates class A
and class B.
• Some of the points from class A fall
into the region of class B, because
with a linear model it is difficult to
get the exact boundary line
separating the two classes.
SoftMax Regression
• The Logistic Regression model can be generalized to support multiple classes directly, without
having to train and combine multiple binary classifiers.
• This is called SoftMax Regression, or Multinomial Logistic Regression.
• The idea is quite simple: when given an instance x, the SoftMax Regression model first computes
a score sk(x) for each class k, then estimates the probability of each class by applying the SoftMax
function (also called the normalized exponential) to the scores.
• Once you have computed the score of every class for the instance x, you can estimate the
probability pk that the instance belongs to class k by running the scores through the SoftMax
function.
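• A small sketch of these two steps: compute a score per class, then turn the scores into probabilities with the SoftMax function (the score values are made up):
import numpy as np

scores = np.array([2.0, 1.0, 0.1])                 # s_k(x) for three classes
probs = np.exp(scores) / np.sum(np.exp(scores))    # SoftMax (normalized exponential)
print(probs)                                       # approx. [0.659, 0.242, 0.099], sums to 1
print(np.argmax(probs))                            # index of the predicted class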
SoftMax Regression
• The class with the highest probability is the output class.

• The SoftMax Regression classifier predicts only one class at a time (i.e., it is
multiclass, not multioutput) so it should be used only with mutually exclusive
classes such as different types of plants.
• You cannot use it to recognize multiple people in one picture.
• Now that you know how the model estimates probabilities and makes
predictions, let’s take a look at training.
• The objective is to have a model that estimates a high probability for the target
class (and consequently a low probability for the other classes).
SoftMax Regression
• Let’s use SoftMax Regression to classify the iris flowers into all three classes.
Scikit-Learn’s LogisticRegression uses one-versus-all by default when you train it
on more than two classes, but you can set the multi_class hyperparameter to
"multinomial“ to switch it to SoftMax Regression instead.
• You must also specify a solver that supports SoftMax Regression, such as the
"lbfgs" solver (see Scikit-Learn’s documentation for more details).
• It also applies ℓ2 regularization by default, which you can control using the
hyperparameter C.
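• A sketch of the training code described here, using petal length and width as features (note that recent Scikit-Learn versions use the multinomial setting by default for multiclass problems):
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]        # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

softmax_reg.predict([[5, 2]])          # e.g. array([2]) -> Iris-Virginica
softmax_reg.predict_proba([[5, 2]])    # estimated probability per class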
SoftMax Regression
• So the next time you find an iris with 5 cm long and 2 cm wide petals,
you can ask your model to tell you what type of iris it is, and it will
answer Iris-Virginica (class 2) with 94.2% probability (or Iris-Versicolor
with 5.8% probability):
SoftMax Regression
• In training the objective is to have a model that estimates high
probability for the target class and low probability for the other
classes.
• Keeping this objective in mind we are going to minimize the cost
function, also called as cross entropy.
• Cross entropy cost function penalizes the model when it estimates a
low probability for a target class.
• Cross entropy is a measure of how well a set of estimated class
probabilities match the target classes.
• So, we can say that it measures the performance of the model.
SoftMax Regression
• The cross-entropy loss is:
H(y, p) = – Σ yi * log(pi)
• Where:
• H(y,p) is the cross-entropy loss.
• y is a one-hot encoded vector representing the true class (e.g., [0,1,0]
for the second class in a 3-class problem).
• p is a vector of predicted class probabilities for each class.
• log(pi) computes the natural logarithm of the predicted probability for
class i.
• The sum is taken over all classes.
SoftMax Regression
• The purpose of the Cross-Entropy is to take the output probabilities (P) and
measure the distance from the truth values (as shown in Figure below).

• For the example above the desired output is [1,0,0,0] for the class dog but the
model outputs [0.775, 0.116, 0.039, 0.070] .
• The target probability will be equal to 1 or 0, depending on whether the
instance belongs to the class or not.
• When there are just two classes (K = 2), this cost function is same as logistic
regression cost function.
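• Checking the example in code: only the dog term survives in the sum because the other target entries are 0:
import numpy as np

y = np.array([1, 0, 0, 0])                    # one-hot target (dog)
p = np.array([0.775, 0.116, 0.039, 0.070])    # predicted probabilities
cross_entropy = -np.sum(y * np.log(p))
print(cross_entropy)                          # approx. 0.255 (= -log(0.775))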
SoftMax Regression
• Figure 4-25 shows the resulting decision boundaries, represented by the background colors.
• Notice that the decision boundaries between any two classes are linear.
• The figure also shows the probabilities for the Iris-Versicolor class, represented by the curved lines (e.g., the
line labeled with 0.450 represents the 45% probability boundary).
• Notice that the model can predict a class that has an estimated probability below 50%.
• For example, at the point where all decision boundaries meet, all classes have an equal estimated probability
of 33%.
