Unit 4 ML


Machine Learning for classification

Classification is a process of categorizing data or objects into predefined classes or categories
based on their features or attributes.
Machine Learning classification is a type of supervised learning technique where an algorithm is
trained on a labeled dataset to predict the class or category of new, unseen data.
The main objective of classification machine learning is to build a model that can accurately
assign a label or category to a new observation based on its features.
For example, a classification model might be trained on a dataset of images labeled as either dogs
or cats and then used to predict the class of new, unseen images of dogs or cats based on their
features such as color, texture, and shape.

Classification Types
There are two main classification types in machine learning:
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or categories.
Example – On the basis of the given health conditions of a person, we have to determine whether
the person has a certain disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several classes or
categories. For Example – On the basis of data about different species of flowers, we have to
determine which species our observation belongs to.

Binary vs Multi class classification

Other categories of classification involve:

Multi-Label Classification
In multi-label classification, the goal is to predict which of several labels a new data point
belongs to. This is different from multiclass classification, where each data point can only belong
to one class. For example, a multi-label classification algorithm could be used to classify images
of animals as belonging to one or more of the categories cat, dog, bird, or fish.

Imbalanced Classification
In imbalanced classification, the goal is to predict whether a new data point belongs to a
minority class, even though there are many more examples of the majority class. For example, a
medical diagnosis algorithm could be used to predict whether a patient has a rare disease, even
though there are many more patients with common diseases.
Classification Algorithms
There are various types of classification algorithms. Some of them are:
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple and
computationally efficient. Some of the linear classification models are as follows:
 Logistic Regression
 Support Vector Machines having kernel = ‘linear’
 Single-layer Perceptron
 Stochastic Gradient Descent (SGD) Classifier

Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They can capture more
complex relationships between the input features and the target variable. Some of the non-
linear classification models are as follows:
 K-Nearest Neighbours
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Ensemble learning classifiers:
 Random Forests,
 AdaBoost,
 Bagging Classifier,
 Voting Classifier,
 ExtraTrees Classifier
 Multi-layer Artificial Neural Networks

Learners in Classifications Algorithm


In machine learning, classification learners can also be classified as either “lazy” or “eager”
learners.
 Lazy Learners: Lazy learners are also known as instance-based learners; they do not
learn a model during the training phase. Instead, they simply store the training
data and use it to classify new instances at prediction time. Training is very fast
because no computation happens then, but prediction is comparatively slow, since the
similarity computations are deferred until a query arrives. Lazy learning is also less
effective in high-dimensional spaces or when the number of training instances is large.
Examples of lazy learners include k-nearest neighbors and case-based reasoning.
 Eager Learners: Eager learners are also known as model-based learners; they learn a
model from the training data during the training phase and use this model to classify
new instances at prediction time. They are more effective in high-dimensional spaces
and with large training datasets. Examples of eager learners include decision trees,
random forests, and support vector machines.
Evaluating Classification Models in Machine Learning
Evaluating a classification model is an important step in machine learning, as it helps to assess the
performance and generalization ability of the model on new, unseen data. There are several
metrics and techniques that can be used to evaluate a classification model, depending on the
specific problem and requirements. Here are some commonly used evaluation metrics:
 Classification Accuracy: The proportion of correctly classified instances over the total
number of instances in the test set. It is a simple and intuitive metric but can be
misleading in imbalanced datasets where the majority class dominates the accuracy
score.
 Confusion matrix: A table that shows the number of true positives, true negatives,
false positives, and false negatives for each class, which can be used to calculate various
evaluation metrics.
 Precision and Recall: Precision measures the proportion of true positives over the total
number of predicted positives, while recall measures the proportion of true positives
over the total number of actual positives. These metrics are useful in scenarios where
one class is more important than the other, or when there is a trade-off between false
positives and false negatives.
 F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x
recall) / (precision + recall). It is a useful metric for imbalanced datasets where both
precision and recall are important.
 ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of
the true positive rate (recall) against the false positive rate (1-specificity) for different
threshold values of the classifier’s decision function. The Area Under the Curve (AUC)
measures the overall performance of the classifier, with values ranging from 0.5
(random guessing) to 1 (perfect classification).
 Cross-validation: A technique that divides the data into multiple folds and trains the
model on each fold while testing on the others, to obtain a more robust estimate of the
model’s performance.
It is important to choose the appropriate evaluation metric(s) based on the specific problem and
requirements, and to avoid overfitting by evaluating the model on independent test data.
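As a quick illustration of these metrics, here is a minimal sketch using scikit-learn's metrics module; the y_true, y_pred, and y_prob arrays are made-up placeholders standing in for a real test set and a real model's outputs.

```python
# Hedged sketch: computing the evaluation metrics above with scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # actual classes (placeholder)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                   # model predictions (placeholder)
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted P(class = 1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Confusion:\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))        # harmonic mean of P and R
print("ROC AUC  :", roc_auc_score(y_true, y_prob))   # needs scores, not labels
```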

Characteristics of Classification
Here are the characteristics of the classification:
 Categorical Target Variable: Classification deals with predicting categorical target
variables that represent discrete classes or labels. Examples include classifying emails
as spam or not spam, predicting whether a patient has a high risk of heart disease, or
identifying image objects.
 Accuracy and Error Rates: Classification models are evaluated based on their ability
to correctly classify data points. Common metrics include accuracy, precision, recall,
and F1-score.
 Model Complexity: Classification models range from simple linear classifiers to more
complex nonlinear models. The choice of model complexity depends on the complexity
of the relationship between the input features and the target variable.
 Overfitting and Underfitting: Classification models are susceptible to overfitting and
underfitting. Overfitting occurs when the model learns the training data too well and
fails to generalize to new data.
Examples of Machine Learning Classification in Real Life
Classification algorithms are widely used in many real-world applications across various domains,
including:
 Email spam filtering
 Credit risk assessment
 Medical diagnosis
 Image classification
 Sentiment analysis.
 Fraud detection
 Quality control
 Recommendation systems

What is Logistic Regression?


Logistic regression is used for binary classification. It applies the sigmoid function, which takes
the independent variables as input and produces a probability value between 0 and 1.
 For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic
function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1;
otherwise, it belongs to Class 0. It is referred to as regression because it is an extension of
linear regression, but it is mainly used for classification problems.
Logistic Function – Sigmoid Function
 The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
 It maps any real value into another value within the range of 0 to 1. The output of
logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it
forms a curve like the “S” shape.
 The S-shaped curve is called the sigmoid function or the logistic function.
 In logistic regression, we use the concept of a threshold value, which decides between
the labels 0 and 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
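As a quick illustration, here is a minimal Python sketch of the sigmoid and the 0.5 threshold; the z values are made-up placeholders standing in for the linear part (w·x + b) of a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# z stands in for the linear combination w.x + b of a logistic model;
# the sigmoid turns it into a probability, and the 0.5 threshold
# turns the probability into a class label.
z = np.array([-3.0, 0.0, 2.5])       # placeholder values
probs = sigmoid(z)                   # approx. [0.047, 0.5, 0.924]
labels = (probs > 0.5).astype(int)   # [0, 0, 1]
print(probs, labels)
```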
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is
a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
o The below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


 Root Node: The root node is where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after
reaching a leaf node.
 Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to
the given conditions.
 Branch/Sub-Tree: A subtree formed by splitting the tree.
 Pruning: Pruning is the process of removing unwanted branches from the tree.
 Parent/Child node: A node that is divided into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are its children.

How does the Decision Tree algorithm Work?


In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues this process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
any further; those final nodes are the leaf nodes.
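A hedged sketch of this procedure, assuming scikit-learn's CART-based DecisionTreeClassifier; the feature values and labels below are made-up placeholders, loosely modeled on the job-offer example that follows.

```python
# Minimal sketch: the recursive splitting above, delegated to scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [salary_k, distance_km, cab_facility]
X = [[50, 10, 0], [80, 25, 1], [65, 5, 1], [45, 30, 0]]
y = [0, 1, 1, 0]   # 0 = decline offer, 1 = accept offer (made-up labels)

# criterion="entropy" tells the library to use information gain as its ASM
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)                       # Steps 1-5 happen inside fit()
print(tree.predict([[70, 8, 1]]))    # classify a new candidate's offer
```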

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node. Finally, the decision node
splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:

Attribute Selection Measures


While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. With this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of the change in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data. Entropy can be calculated as:

Entropy(S) = −P(yes)·log₂(P(yes)) − P(no)·log₂(P(no))

Where,

o S = the set of samples


o P(yes)= probability of yes
o P(no)= probability of no

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σⱼ Pⱼ²
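Both measures can be computed directly from the formulas above. The class counts below (a parent node with 9 "yes" and 5 "no", split into children of sizes 8 and 6) are illustrative placeholders, not data from this unit.

```python
import numpy as np

def entropy(p_yes):
    """Entropy(S) = -p*log2(p) - (1-p)*log2(1-p); defined as 0 when p is 0 or 1."""
    p_no = 1 - p_yes
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

def gini(p_yes):
    """Gini = 1 - sum(p_j^2) for a two-class node."""
    return 1 - (p_yes**2 + (1 - p_yes)**2)

# Hypothetical split: parent node has 9 yes / 5 no; the split creates
# children of sizes 8 and 6 with 6/8 and 3/6 "yes" respectively.
parent = entropy(9 / 14)                                     # approx. 0.940
children = (8 / 14) * entropy(6 / 8) + (6 / 14) * entropy(3 / 6)
info_gain = parent - children                                # weighted-avg formula
print(f"entropy={parent:.3f}  gain={info_gain:.3f}  gini={gini(9 / 14):.3f}")
```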

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. Pruning is therefore a technique that decreases the size of the
learned tree without reducing accuracy. There are mainly two types of tree pruning techniques
used:

o Cost Complexity Pruning


o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Decision Tree Induction


Decision tree induction is a supervised learning method used in data mining for classification and
regression. It builds a tree that helps us with decision-making. The decision tree creates
classification or regression models as a tree structure. It separates a data set into smaller subsets
while, at the same time, the decision tree is incrementally developed. The final tree is a tree with
decision nodes and leaf nodes. A decision node has at least two branches, and the leaf nodes show
a classification or decision; no further splits can be made on leaf nodes. The uppermost decision
node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can
deal with both categorical and numerical data.

Key factors:

Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.

Information Gain:

Information gain refers to the decline in entropy after the dataset is split. It is also called
entropy reduction. Building a decision tree is all about discovering attributes that return the
highest information gain.

In short, a decision tree is just like a flow chart diagram with the terminal nodes showing decisions.
Starting with the dataset, we can measure the entropy to find a way to segment the set until the
data belongs to the same class.

Why are decision trees useful?


It enables us to analyze the possible consequences of a decision thoroughly.

It provides us a framework to measure the values of outcomes and the probability of
accomplishing them.

It helps us to make the best decisions based on existing data and best speculations.

In other words, we can say that a decision tree is a hierarchical tree structure that can be used to
split an extensive collection of records into smaller sets of the class by implementing a sequence of
simple decision rules. A decision tree model comprises a set of rules for partitioning a huge
heterogeneous population into smaller, more homogeneous, or mutually exclusive classes. The
attributes of the classes can be any variables from nominal, ordinal, binary, and quantitative values,
in contrast, the classes must be of a qualitative type, such as categorical, ordinal, or binary. In
brief, given the data attributes together with their classes, a decision tree creates a set of rules
that can be used to identify the class. One rule is implemented after another, resulting in a hierarchy of
segments within a segment. The hierarchy is known as the tree, and each segment is called
a node. With each progressive division, the members from the subsequent sets become more and
more similar to each other. Hence, the algorithm used to build a decision tree is referred to as
recursive partitioning. The algorithm is known as CART (Classification and Regression Trees).

Consider the given example of a factory where:

Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%), which
leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which leads to
$6 million profit.

Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which leads to
$4 million profit, and the probability of a bad economy is 0.4, which leads to $2 million profit.
The management team needs to make a data-driven decision on whether to expand, based on the
given data.

Net Expand = (0.6 × 8 + 0.4 × 6) − 3 = $4.2M

Net Not Expand = (0.6 × 4 + 0.4 × 2) − 0 = $3.2M
Since $4.2M > $3.2M, the factory should be expanded.
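The same expected-value arithmetic can be checked with a few lines of Python, using the figures straight from the example above.

```python
# Expected-value comparison for the expand / don't-expand decision
# (all figures in $ millions, taken from the example above).
p_good, p_bad = 0.6, 0.4

net_expand = (p_good * 8 + p_bad * 6) - 3   # 4.8 + 2.4 - 3 = 4.2
net_hold   = (p_good * 4 + p_bad * 2) - 0   # 2.4 + 0.8     = 3.2

print("Expand" if net_expand > net_hold else "Do not expand")
```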

Decision tree Algorithm:


The decision tree algorithm may appear complicated, but the basic technique is quite simple and is
as follows:

The algorithm is based on three parameters: D, attribute_list, and Attribute_selection_method.

Generally, we refer to D as a data partition.

Initially, D is the entire set of training tuples and their related class labels (the input training data).

The parameter attribute_list is a set of attributes defining the tuples.

Attribute_selection_method specifies a heuristic process for choosing the attribute that "best"
discriminates the given tuples according to class.

Attribute_selection_method process applies an attribute selection measure.

Advantages of using decision trees:


A decision tree does not need scaling of the data.

Missing values in the data do not influence the process of building a decision tree to any
considerable extent.

A decision tree model is automatic and simple to explain to the technical team as well as
stakeholders.

Compared to other algorithms, decision trees need less effort for data preparation during
pre-processing.

A decision tree does not require a standardization of data.

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of those predictions, produces the final
output.

A greater number of trees in the forest leads to higher accuracy and helps prevent the problem
of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Note: To better understand the Random Forest Algorithm, you should have knowledge of the Decision
Tree Algorithm.

Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all the
trees predict the correct output. Therefore, below are two assumptions for a better Random forest
classifier:

o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?


Below are some points that explain why we should use the Random Forest algorithm:
<="" li="">
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions using each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select K random data points from the training set.

Step-2: Build the decision trees associated with the selected data points (subsets).

Step-3: Choose the number N of decision trees that you want to build.

Step-4: Repeat Steps 1 & 2.

Step-5: For a new data point, find the predictions of each decision tree, and assign the new data
point to the category that wins the majority of votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to
the Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data
point occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision. Consider the below image:
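A minimal sketch of this two-phase procedure, assuming scikit-learn; the synthetic dataset below is a made-up stand-in for the fruit-image features in the example.

```python
# Hedged sketch: random forest as bagged decision trees with majority voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data standing in for extracted image features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100)   # N = 100 trees,
forest.fit(X_train, y_train)                        # each on a bootstrap subset
print("accuracy:", forest.score(X_test, y_test))    # majority vote per test point
```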
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


o Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms;
it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
o Some popular examples of the Naïve Bayes algorithm are spam filtration, sentiment
analysis, and classifying articles.

Why is it called Naïve Bayes?


The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:

o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Hence each feature individually contributes to identifying it as an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = ( P(B|A) · P(A) ) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:


Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
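Here is a hand-rolled sketch of those three steps on a made-up frequency table: 14 days in total, 9 of them with "Play = Yes"; the sunny-day counts (2 "yes", 3 "no") are illustrative placeholders.

```python
# Hedged sketch: Bayes' theorem on a hypothetical weather frequency table.
n_yes, n_no = 9, 5                       # step 1: frequency table (14 days total)
p_sunny_yes = 2 / 9                      # step 2: likelihood P(Sunny | Yes)
p_sunny_no = 3 / 5                       #         likelihood P(Sunny | No)
p_yes, p_no = n_yes / 14, n_no / 14      # priors P(Yes), P(No)
p_sunny = (2 + 3) / 14                   # marginal / evidence P(Sunny)

# step 3: posteriors via Bayes' theorem, P(class | Sunny)
posterior_yes = p_sunny_yes * p_yes / p_sunny    # = 0.4
posterior_no = p_sunny_no * p_no / p_sunny       # = 0.6
print("Play" if posterior_yes > posterior_no else "Don't play")
```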

K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases and
puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used
for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset, and at the time of classification, it performs an
action on the dataset.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new data set that are similar to the cat and dog images, and based on the most similar
features it will put the image in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1; which of these categories will this data point lie in? To solve this type of problem, we need a
K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular
data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

o Firstly, we will choose the number of neighbors; we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as:
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
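A minimal numpy sketch of the steps above with k = 5; the training points, labels, and query point are made-up placeholders chosen so that three of the nearest neighbors fall in category A and two in category B, matching the example.

```python
# Hedged sketch: KNN by hand with Euclidean distance and majority vote.
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8]])   # placeholder points
y = np.array(["A", "A", "A", "B", "B"])                   # their categories
new_point = np.array([3, 4])                              # the point to classify

dists = np.linalg.norm(X - new_point, axis=1)   # Step 2: Euclidean distances
nearest = np.argsort(dists)[:5]                 # Step 3: the k = 5 nearest
labels, counts = np.unique(y[nearest], return_counts=True)  # Step 4: count
print(labels[counts.argmax()])                  # Step 5: majority vote -> "A"
```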

How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K can smooth out noise, but they make the computation heavier and may
blur the boundaries between classes.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o It always needs determination of the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider
the below diagram, in which there are two different categories that are classified using a decision
boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn
about the different features of cats and dogs, and then we test it with this strange creature. Since
the support vector machine creates a decision boundary between these two classes (cat and dog)
and chooses the extreme cases (support vectors), it will see the extreme cases of cats and dogs. On
the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can
be classified into two classes by using a single straight line, then such data is termed linearly
separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a
dataset cannot be classified by using a straight line, then such data is termed non-linear
data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the
data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset, which
means if there are 2 features (as shown in the image), then the hyperplane will be a straight line.
And if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum distance
between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position of the
hyperplane are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
How does SVM work?
Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below
image:

Since it is a 2-d space, we can easily separate these two classes just by using a straight line. But
there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
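As a sketch of this idea, assuming scikit-learn, a linear SVM can be fit on two separable blobs; the generated data below is a made-up stand-in for the green/blue example.

```python
# Hedged sketch: maximal-margin linear SVM on synthetic separable data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs stand in for the green/blue (x1, x2) points.
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)   # finds the maximal-margin hyperplane
clf.fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])
```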
Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it into 2-d
space with z = 1, then it will become as follows:

Hence we get a circle of radius 1 in the case of non-linear data.
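The z = x² + y² trick can be sketched in a few lines of Python. The ring-shaped data below is a made-up placeholder; the RBF kernel is shown alongside as the usual implicit alternative to adding the extra dimension by hand.

```python
# Hedged sketch: lifting non-linear data into a higher dimension for SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # points around the origin
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)        # inside vs outside unit circle

# Option 1: add the z = x^2 + y^2 dimension by hand, then use a linear SVM.
z = (X**2).sum(axis=1, keepdims=True)
linear_in_3d = SVC(kernel="linear").fit(np.hstack([X, z]), y)

# Option 2: let a kernel do the lifting implicitly (here, RBF).
rbf = SVC(kernel="rbf").fit(X, y)
print(linear_in_3d.score(np.hstack([X, z]), y), rbf.score(X, y))
```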

Clustering in Machine Learning


Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset.
It can be defined as "A way of grouping the data points into different clusters, consisting of
similar data points. The objects with the possible similarities remain in a group that has less
or no similarities with another group."

It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, and
behavior, and divides the data according to the presence or absence of those patterns.

It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it
deals with the unlabeled dataset.

After applying this clustering technique, each cluster or group is given a cluster ID. An ML
system can use this ID to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhere similar to the classification algorithm, but the difference is the type of
dataset that we are using. In classification, we work with the labeled data set, whereas in clustering,
we work with the unlabelled dataset.

Example: Let's understand the clustering technique with the real-world example of a mall: when we
visit a shopping mall, we can observe that things with similar usage are grouped together, such as
t-shirts in one section and trousers in another; similarly, in the vegetable section, apples, bananas,
mangoes, etc., are grouped separately so that we can easily find things. The clustering technique
works in the same way. Another example of clustering is grouping documents according to topic.

The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation system to provide
recommendations based on a user's past product searches. Netflix also uses this technique to
recommend movies and web series to its users based on their watch history.

The below diagram explains the working of the clustering algorithm. We can see the different fruits
are divided into several groups with similar properties.

Types of Clustering Methods


The clustering methods are broadly divided into hard clustering (each data point belongs to only
one group) and soft clustering (a data point can belong to more than one group). But various
other approaches to clustering also exist. Below are the main clustering methods used in
machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k defines the number of
pre-defined groups. The cluster centers are placed in such a way that the distance between the
data points of one cluster and their own centroid is minimal compared to the distance to other
cluster centroids.

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily
shaped distributions are formed as long as the dense regions can be connected. The algorithm does
this by identifying different clusters in the dataset and connecting the areas of high density into
clusters. The dense areas in data space are separated from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying densities
and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability
that a data point belongs to a particular distribution. The grouping is done by assuming some
distribution, most commonly the Gaussian distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).

Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters to be created. In this technique, the dataset is
divided into clusters to create a tree-like structure, which is also called a dendrogram. Any number
of clusters can then be selected by cutting the tree at the correct level. The most common example
of this method is the Agglomerative Hierarchical Clustering algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one
group or cluster. Each data point has a set of membership coefficients that depend on its degree
of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of
clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many different
clustering algorithms have been published, but only a few are commonly used. The choice of
algorithm depends on the kind of data we are using: some algorithms need a guess of the number
of clusters in the given dataset, whereas others require finding the minimum distance between
observations of the dataset.

Here we discuss the popular clustering algorithms that are widely used in machine
learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It partitions the dataset by dividing the samples into different clusters of equal
variances. The number of clusters must be specified in this algorithm. It is fast, requiring
fewer computations, with a linear complexity of O(n). A short sketch of it appears after
this list.
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth
density of data points. It is an example of a centroid-based model, that works on updating
the candidates for centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with
Noise. It is an example of a density-based model similar to the mean-shift, but with some
remarkable advantages. In this algorithm, the areas of high density are separated by the
areas of low density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative to the k-means algorithm, or for those cases where K-means can fail. In
GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a single
cluster at the outset and then successively merged. The cluster hierarchy can be represented
as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require
the number of clusters to be specified. In this, each pair of data points exchanges messages
until convergence. Its O(N²T) time complexity is the main drawback
of this algorithm.
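As promised above, here is a minimal sketch of K-Means, assuming scikit-learn; the blob data is a made-up placeholder, and the number of clusters k = 3 must be chosen up front.

```python
# Hedged sketch: K-Means partitioning clustering on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Placeholder unlabeled data: 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10).fit(X)   # k = 3 pre-defined groups
print(kmeans.labels_[:10])                        # cluster ID assigned per point
print(kmeans.cluster_centers_)                    # the learned centroids
```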

Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:

o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets into
different groups.
o In Search Engines: Search engines also work on the clustering technique. The search results
appear based on the objects closest to the search query. Clustering does this by grouping
similar data objects into one group that is far from other, dissimilar objects. The accuracy of
a query's results depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on
their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a
GIS database. This can be very useful in determining what purpose a particular piece of
land should be used for, that is, which purpose it is most suitable for.
