ML Unit 3


UNIT - III

Classification Algorithms: KNN, Linear classification, logistic regression, grid search,


classification metrics, ROC curve.

What is the Classification Algorithm?


The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In classification, a program learns
from the given dataset or observations and then classifies new observations into one of a
number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
Classes can also be called targets/labels or categories.

Unlike regression, the output variable of classification is a category, not a numeric value,
such as "Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a
supervised learning technique, it takes labelled input data, which means each input comes
with a corresponding output.

→ The best example of an ML classification algorithm is an Email Spam Detector.

The main goal of a Classification algorithm is to identify the category of the given data, and
these algorithms are mainly used to predict outputs for categorical data.

The algorithm which implements the classification on a dataset is known as a classifier. There
are two types of classification:
● Binary Classifier: If the classification problem has only two possible outcomes, then it
is called a Binary Classifier. Examples: YES or NO, MALE or FEMALE, SPAM or
NOT SPAM, CAT or DOG, etc.
● Multi-class Classifier: If a classification problem has more than two outcomes, then it
is called a Multi-class Classifier. Examples: classification of types of crops,
classification of types of music.
Learners in Classification Problems:
In classification problems, there are two types of learners:
● Lazy Learners: A lazy learner first stores the training dataset and waits until it
receives the test dataset. In the lazy learner case, classification is done on the basis of
the most related data stored in the training dataset. It takes less time in training but
more time for predictions. Examples: K-NN algorithm, case-based reasoning.
● Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes
more time in learning and less time in prediction. Examples: Decision Trees, Naïve
Bayes, ANN.

Types of ML classification algorithms :


Linear Models
● Logistic Regression
● Support Vector Machines
Non-linear Models
● K-Nearest Neighbours
● Kernel SVM
● Naïve Bayes
● Decision Tree Classification
● Random Forest Classification

Use cases of Classification Algorithms


Classification algorithms can be used in many different applications. Below are some popular
use cases of classification algorithms:
● Email Spam Detection
● Speech Recognition
● Identification of cancer tumour cells.
● Drugs Classification
● Biometric Identification, etc.

KNN :
● The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method
employed to tackle classification and regression problems.
● KNN is one of the most basic yet essential classification algorithms in machine
learning. It belongs to the supervised learning domain and finds intense application in
pattern recognition, data mining, and intrusion detection.
● K-NN is a non-parametric algorithm, which means it does not make any assumptions
about the underlying data.
● It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs an action on it at the time of
classification.
● At the training phase, the KNN algorithm just stores the dataset, and when it gets new
data, it classifies that data into the category most similar to the new data.

Why do we need a K-NN Algorithm?


Suppose there are two categories, Category A and Category B, and we have a new data
point x1. In which of these categories will this data point lie? To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or
class of a particular data point.

How does K-NN work?


The working of K-NN can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to each training point.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is
maximum.
Step-6: Our model is ready.

Distance Metrics Used in KNN Algorithm


Euclidean Distance
This is the Cartesian distance between two points in the plane/hyperplane. It can be
visualized as the length of the straight line joining the two points under consideration, and it
measures the net displacement between two states of an object:

d(x, y) = sqrt( Σᵢ (xᵢ − yᵢ)² )

Manhattan Distance
The Manhattan distance metric is generally used when we are interested in the total distance
travelled by the object rather than its displacement. It is calculated by summing the absolute
differences between the coordinates of the points in n dimensions:

d(x, y) = Σᵢ |xᵢ − yᵢ|

Minkowski Distance
Both the Euclidean and the Manhattan distance are special cases of the Minkowski distance:

d(x, y) = ( Σᵢ |xᵢ − yᵢ|^p )^(1/p)

From this formula we can see that when p = 2 it is the same as the formula for the Euclidean
distance, and when p = 1 we obtain the formula for the Manhattan distance.
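As a quick illustration, here is a minimal NumPy sketch that computes all three metrics for two made-up points a and b:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, p):
    # ( sum of |x_i - y_i|^p ) ^ (1/p)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

print(minkowski(a, b, 1))  # Manhattan distance: 5.0
print(minkowski(a, b, 2))  # Euclidean distance: ~3.606
print(minkowski(a, b, 3))  # Minkowski distance with p = 3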
● There is no particular way to determine the best value for "K", so we need to try
several values to find the best among them. The most preferred value for K is 5.
● A very low value of K, such as K=1 or K=2, can be noisy and expose the model to
the effects of outliers.
Steps (a runnable sketch follows this list):
● Step 1: Selecting the optimal value of K
K represents the number of nearest neighbours that need to be considered while
making a prediction.
● Step 2: Calculating distance
To measure the similarity between the target and the training data points, Euclidean
distance is used. Distance is calculated between the target point and each of the data
points in the dataset.
● Step 3: Finding nearest neighbours
The K data points with the smallest distances to the target point are the nearest
neighbours.
● Step 4: Voting for classification or taking the average for regression
In a classification problem, the class label is determined by majority voting: the class
with the most occurrences among the neighbours becomes the predicted class for the
target data point. In a regression problem, the prediction is calculated by taking the
average of the target values of the K nearest neighbours, and that average becomes
the predicted output for the target data point.
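Putting these steps together, a minimal runnable sketch using scikit-learn's KNeighborsClassifier on the Iris data (K = 5 and Euclidean distance are illustrative choices, matching the defaults discussed above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Steps 1-3: choose K and find the K nearest neighbours by Euclidean distance (p=2);
# Steps 4-5: majority voting among those neighbours assigns the class.
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(X_train, y_train)         # lazy learner: fit just stores the training data
print(knn.score(X_test, y_test))  # accuracy on the held-out test set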
Advantages of the KNN Algorithm
● Easy to implement, as the complexity of the algorithm is not high.
● Adapts easily: since KNN stores all the training data in memory, whenever a new
example or data point is added, the algorithm adjusts to it, and the new example
contributes to future predictions as well.
● Few hyperparameters: the only parameters required in training a KNN model are the
value of K and the choice of the distance metric.
Disadvantages of the KNN Algorithm
● Does not scale: KNN is a lazy algorithm, which means it takes a lot of computing
power as well as data storage. This makes the algorithm both time-consuming and
resource-exhausting.
● Curse of dimensionality: because of the peaking phenomenon, the KNN algorithm is
affected by the curse of dimensionality, which means it has a hard time classifying
data points properly when the dimensionality is too high.
● Prone to overfitting: since the algorithm is affected by the curse of dimensionality, it
is prone to overfitting as well. Hence, feature selection and dimensionality reduction
techniques are generally applied to deal with this problem.
LOGISTIC REGRESSION :
● Logistic regression is one of the most popular Machine Learning algorithms, and it
comes under the Supervised Learning technique. It is used for predicting a
categorical dependent variable using a given set of independent variables.
● Logistic regression predicts the output of a categorical dependent variable; therefore,
the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True
or False, etc., but instead of giving an exact value of 0 or 1, it gives probabilistic
values which lie between 0 and 1.
● Logistic regression is much like linear regression except in how it is used. Linear
regression is used for solving regression problems, whereas logistic regression is
used for solving classification problems.
● In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose outputs are bounded by the two extreme values 0 and 1.

Logistic Function (Sigmoid Function):


● The sigmoid function is a mathematical function used to map predicted values to
probabilities. It maps any real value into another value within the range 0 to 1:

σ(z) = 1 / (1 + e^(−z))

● The output of logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the "S" form. The S-shaped curve is called the sigmoid
function or the logistic function.
● In logistic regression, we use the concept of a threshold value, which decides between
the classes 0 and 1: values above the threshold tend to 1, and values below the
threshold tend to 0. (A small sketch follows this list.)
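A minimal NumPy sketch of the sigmoid and a 0.5 threshold (the inputs z are made-up linear scores for illustration):

import numpy as np

def sigmoid(z):
    # maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # hypothetical linear scores
p = sigmoid(z)                             # probabilities in (0, 1)
labels = (p >= 0.5).astype(int)            # threshold at 0.5 -> class 0 or 1
print(p)       # approx. [0.047 0.378 0.5 0.622 0.953]
print(labels)  # [0 0 1 1 1]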
Assumptions for Logistic Regression:
● The dependent variable must be categorical in nature.
● The independent variables should not have multicollinearity.
Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
● Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
● Multinomial: In multinomial logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
● Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types
of the dependent variable, such as "low", "medium", or "high".
Steps in Logistic Regression: To implement logistic regression using Python, we will use the
same steps as in the previous regression topics. Below are the steps (a runnable sketch
follows the list):
● Data pre-processing step
● Fitting logistic regression to the training set
● Predicting the test results
● Testing the accuracy of the results (creation of a confusion matrix)
● Visualizing the test set results.
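A minimal runnable sketch of these steps with scikit-learn (the Iris data and an 80/20 split are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data pre-processing: load, split, and scale the features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit logistic regression to the training set
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict the test results and evaluate with a confusion matrix
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(clf.score(X_test, y_test))  # test accuracy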

Hyperparameters :
Parameters are the variables that are used by the Machine Learning algorithm for predicting
results based on the input historic data. They are estimated by the Machine Learning
algorithm itself using an optimization algorithm; thus, these variables are not set or
hardcoded by the user or professional, and they are learned as part of model training.
Examples of parameters: the coefficients of the independent variables in Linear Regression
and Logistic Regression.

Hyperparameters are the variables that the user specifies, usually while building the Machine
Learning model. Thus, hyperparameters are specified before the parameters, or we can say
that hyperparameters are used to find the optimal parameters of the model. The best part
about hyperparameters is that their values are decided by the user who is building the model.
Examples: max_depth in the Random Forest algorithm, k in the KNN classifier.

Grid Search tries every combination of the specified hyperparameters and their values,
calculates the performance for each combination, and selects the best values for the
hyperparameters. This makes the process time-consuming and expensive, depending on the
number of hyperparameters involved.

Cross-Validation and GridSearchCV


In GridSearchCV, along with Grid Search, cross-validation is also performed.
Cross-validation is used while training the model. As we know, before training the model
with data, we divide the data into two parts: train data and test data. In cross-validation, the
process divides the train data further into two parts: the train data and the validation data.
The most popular type of cross-validation is K-fold cross-validation. It is an iterative
process that divides the train data into k partitions. Each iteration keeps one partition for
validation and the remaining k−1 partitions for training the model. The next iteration sets the
next partition as validation data and the remaining k−1 as train data, and so on. In each
iteration, it records the performance of the model, and at the end it reports the average of all
the performances. Thus, it is also a time-consuming process.
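A minimal sketch of 5-fold cross-validation with scikit-learn (the SVC model and Iris data mirror the grid-search example below):

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
model = svm.SVC(kernel='rbf', C=1, gamma='auto')

# 5-fold CV: each fold is held out once while the other 4 train the model
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # average performance across the 5 folds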

import numpy as np
import pandas as pd
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the Iris dataset and inspect it as a DataFrame
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df[47:150]

# Baseline: a single SVC with hand-picked hyperparameters
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
model = svm.SVC(kernel='rbf', C=1, gamma='auto')
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Grid search: try every combination of C and kernel with 5-fold CV
clf = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1, 10, 20, 30],
    'kernel': ['rbf', 'linear']
}, cv=5, return_train_score=False)
clf.fit(iris.data, iris.target)
clf.cv_results_
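The cv_results_ attribute is easiest to read as a DataFrame; the best combination and its score are also available directly:

# Summarize the grid-search results and pull out the winner
results = pd.DataFrame(clf.cv_results_)[['param_C', 'param_kernel', 'mean_test_score']]
print(results)
print(clf.best_params_)  # best hyperparameter combination found
print(clf.best_score_)   # its mean cross-validated score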
CLASSIFICATION METRICS :
1. Accuracy : Accuracy simply measures how often the classifier predicts correctly. We
can define accuracy as the ratio of the number of correct predictions to the total
number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

It works well if there are an equal number of samples for each class. For example,
suppose we have 90% samples of class A and 10% samples of class B in our training
set. Then our model can reach an accuracy of 90% simply by predicting that every
training sample belongs to class A. If we test the same model on a test set with 60%
samples from class A and 40% from class B, the accuracy falls to 60%.

Classification accuracy is good, but it can give a false sense of achieving high
performance: the chance of misclassifying minority-class samples is very high.
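A small sketch of this accuracy trap (the 90/10 class split and the always-predict-A model are the made-up numbers from the paragraph above):

import numpy as np
from sklearn.metrics import accuracy_score

# Training set: 90% class A (label 0), 10% class B (label 1)
y_train = np.array([0] * 90 + [1] * 10)
y_pred_train = np.zeros(100, dtype=int)       # model always predicts class A
print(accuracy_score(y_train, y_pred_train))  # 0.9 -- looks good, but is useless

# Test set: 60% class A, 40% class B
y_test = np.array([0] * 60 + [1] * 40)
y_pred_test = np.zeros(100, dtype=int)
print(accuracy_score(y_test, y_pred_test))    # 0.6 -- the accuracy collapses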

2. Confusion matrix : A confusion matrix is a performance evaluation tool in machine
learning, summarizing the predictions of a classification model. It displays the number of
true positives, true negatives, false positives, and false negatives. This matrix aids in
analyzing model performance, identifying misclassifications, and improving predictive
accuracy.
● True Positive (TP) - The predicted value matches the actual value: the actual value
was positive, and the model predicted a positive value.
● True Negative (TN) - The predicted value matches the actual value: the actual value
was negative, and the model predicted a negative value.
● False Positive (FP) - The predicted value is wrong: the actual value was negative, but
the model predicted a positive value. Also known as a Type I error.
● False Negative (FN) - The predicted value is wrong: the actual value was positive, but
the model predicted a negative value. Also known as a Type II error.

For example, consider a model with the following confusion-matrix counts (a worked
computation follows this list):
● True Positive (TP) = 560, meaning the model correctly classified 560 positive
class data points.
● True Negative (TN) = 330, meaning the model correctly classified 330
negative class data points.
● False Positive (FP) = 60, meaning the model incorrectly classified 60 negative
class data points as belonging to the positive class.
● False Negative (FN) = 50, meaning the model incorrectly classified 50
positive class data points as belonging to the negative class.
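From these counts, the metrics defined below follow directly; a quick arithmetic sketch:

TP, TN, FP, FN = 560, 330, 60, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 890 / 1000 = 0.89
precision = TP / (TP + FP)                          # 560 / 620 ≈ 0.903
recall = TP / (TP + FN)                             # 560 / 610 ≈ 0.918
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.911

print(accuracy, precision, recall, f1)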

3. Precision : Precision tells us how many of the cases predicted as positive actually
turned out to be positive. This determines whether our model is reliable or not:

Precision = TP / (TP + FP)

4. Recall (sensitivity) : Recall tells us how many of the actual positive cases we were
able to predict correctly with our model:

Recall = TP / (TP + FN)

5. F1 score : It gives a combined idea about the Precision and Recall metrics. It is
maximum when Precision is equal to Recall. The F1 score is the harmonic mean of
precision and recall:

F1 = 2 × Precision × Recall / (Precision + Recall)

6. ROC - AUC : The Receiver Operating Characteristic (ROC) is a probability curve that
plots the TPR (True Positive Rate) against the FPR (False Positive Rate) at various
threshold values and separates the 'signal' from the 'noise'.

The Area Under the Curve (AUC) measures the ability of a classifier to distinguish
between classes. On the graph, it is simply the area enclosed between the ROC curve
and the x-axis.

● True Positive Rate: Also termed sensitivity. The True Positive Rate is the
proportion of positive data points that are correctly classified as positive, with
respect to all data points that are actually positive.
● True Negative Rate: Also termed specificity. The True Negative Rate is the
proportion of negative data points that are correctly classified as negative, with
respect to all data points that are actually negative.
● False Positive Rate: The False Positive Rate is the proportion of negative data
points that are mistakenly classified as positive, with respect to all data points
that are actually negative.
● False Negative Rate: The False Negative Rate is the proportion of positive data
points that are mistakenly classified as negative, with respect to all data points
that are actually positive.
● A true positive is an outcome where the model correctly predicts the positive
class. Similarly, a true negative is an outcome where the model correctly
predicts the negative class.
● A false positive is an outcome where the model incorrectly predicts the
positive class. And a false negative is an outcome where the model incorrectly
predicts the negative class.

7. Log loss : Log loss (logistic loss) or cross-entropy loss is one of the major metrics
to assess the performance of a classification problem. For a single sample with true
label y ∈ {0, 1} and a probability estimate p = Pr(y = 1), the log loss is:

Log Loss = −( y·log(p) + (1 − y)·log(1 − p) )
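A minimal sketch with scikit-learn's log_loss (the labels and probabilities are made-up values):

from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]            # hypothetical true labels
y_prob = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities of class 1

# Averages -(y*log(p) + (1-y)*log(1-p)) over the samples
print(log_loss(y_true, y_prob))  # ~0.472 for these values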

ROC
The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary
classification problems. It is a probability curve that plots the TPR against the FPR at various
threshold values and essentially separates the 'signal' from the 'noise'. In other words, it
shows the performance of a classification model at all classification thresholds. The Area
Under the Curve (AUC) measures the ability of a binary classifier to distinguish between
classes and is used as a summary of the ROC curve.

Defining the terms used in AUC and ROC Curve (a plotting sketch follows these definitions):


● AUC (Area Under the Curve): A single metric representing the overall performance
of a binary classification model based on the area under its ROC curve.
● ROC Curve (Receiver Operating Characteristic Curve): A graphical plot illustrating
the trade-off between True Positive Rate and False Positive Rate at various
classification thresholds.
● True Positive Rate (Sensitivity): Proportion of actual positives correctly identified by
the model.
● False Positive Rate: Proportion of actual negatives that the model incorrectly
classifies as positives.
● Specificity (True Negative Rate): Proportion of actual negatives correctly identified
by the model.
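A minimal sketch that plots an ROC curve and computes the AUC with scikit-learn (the synthetic binary dataset and logistic-regression scorer are illustrative choices):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_score)  # TPR vs FPR at each threshold
print(roc_auc_score(y_test, y_score))              # area under the ROC curve

plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], linestyle='--', label='random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()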
