Types of Classification Algorithms


In machine learning and statistics, classification is a supervised learning
approach in which a computer program learns from labelled input data and then
uses that learning to classify new observations. The data may be bi-class (for
example, identifying whether a person is male or female, or whether an email is
spam or not spam) or it may be multi-class. Some examples of classification
problems are speech recognition, handwriting recognition, biometric
identification and document classification.

Here we have the types of classification algorithms in Machine Learning:

1. Linear Classifiers: Logistic Regression, Naive Bayes Classifier
2. Nearest Neighbor
3. Support Vector Machines
4. Decision Trees
5. Boosted Trees
6. Random Forest
7. Neural Networks

Naive Bayes Classifier (Generative Learning Model):

It is a classification technique based on Bayes' theorem with an assumption
of independence among predictors. In simple terms, a Naive Bayes classifier
assumes that the presence of a particular feature in a class is unrelated to
the presence of any other feature. Even if these features depend on each
other or on the existence of the other features, all of these properties are
treated as contributing independently to the probability. A Naive Bayes model
is easy to build and particularly useful for very large data sets. Along with
its simplicity, Naive Bayes can outperform even highly sophisticated
classification methods.

Nearest Neighbor:
The k-nearest-neighbors algorithm is a supervised classification algorithm:
it takes a set of labelled points and uses them to learn how to label other
points. To label a new point, it looks at the labelled points closest to that
new point (its nearest neighbors) and has those neighbors vote, so whichever
label most of the neighbors have becomes the label for the new point (the "k"
is the number of neighbors it checks).

Logistic Regression (Predictive Learning Model):

It is a statistical method for analysing a data set in which one or more
independent variables determine an outcome. The outcome is measured with a
dichotomous variable (one with only two possible values). The goal of logistic
regression is to find the best-fitting model to describe the relationship
between the dichotomous characteristic of interest (the dependent, response or
outcome variable) and a set of independent (predictor or explanatory)
variables. It has an advantage over other binary classifiers such as nearest
neighbor in that it also quantitatively explains the factors that lead to a
classification.

Decision Trees:
A decision tree builds classification or regression models in the form of a
tree structure. It breaks a data set down into smaller and smaller subsets
while an associated decision tree is incrementally developed. The final result
is a tree with decision nodes and leaf nodes. A decision node has two or more
branches, and a leaf node represents a classification or decision. The topmost
decision node in a tree, which corresponds to the best predictor, is called
the root node. Decision trees can handle both categorical and numerical data.

Random Forest:
Random forests, or random decision forests, are an ensemble learning method
for classification, regression and other tasks that operates by constructing a
multitude of decision trees at training time and outputting the class that is
the mode of the classes (classification) or the mean prediction (regression)
of the individual trees. Random decision forests correct for decision trees'
habit of overfitting to their training set.

Neural Network:
A neural network consists of units (neurons) arranged in layers, which convert
an input vector into some output. Each unit takes an input, applies an (often
nonlinear) function to it and then passes the output on to the next layer.
Generally the networks are defined to be feed-forward: a unit feeds its output
to all of the units in the next layer, but there is no feedback to the previous
layer. Weightings are applied to the signals passing from one unit to another,
and it is these weightings that are tuned in the training phase to adapt the
network to the problem at hand.
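
To make the idea of layers, weights and activations concrete, here is a minimal sketch in R of a single forward pass through a tiny one-hidden-layer network. The weights here are random placeholders rather than trained values, purely for illustration.

# A single forward pass through a small feed-forward network (illustrative only)
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(42)
x  <- c(0.5, -1.2, 3.0)                        # input vector (3 features)
W1 <- matrix(rnorm(4 * 3), nrow = 4, ncol = 3) # weights: input layer -> 4 hidden units
b1 <- rnorm(4)
W2 <- matrix(rnorm(4), nrow = 1, ncol = 4)     # weights: hidden layer -> 1 output unit
b2 <- rnorm(1)

hidden <- sigmoid(W1 %*% x + b1)       # each hidden unit applies a nonlinear function
output <- sigmoid(W2 %*% hidden + b2)  # value in (0, 1), usable as a class probability
output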

What is Classification?
We use the training data set to learn boundary conditions that can be used to determine each
target class. Once the boundary conditions are determined, the next task is to predict the target
class for new data. The whole process is known as classification.

Target class examples:

 Analysing customer data to predict whether a customer will buy computer accessories (Target
classes: Yes or No)
 Classifying fruits from features like colour, taste, size and weight (Target classes: Apple, Orange,
Cherry, Banana)
 Gender classification from hair length (Target classes: Male or Female)

Let’s understand the concept of classification algorithms with gender classification using hair
length (by no means am I trying to stereotype by gender, this is only an example). To classify
gender (the target class) using hair length as the feature, we could train a model with any
classification algorithm to come up with a set of boundary conditions that differentiate the male
and female genders using hair length as the training feature. In this gender classification case
the boundary condition could be a particular hair length value. Suppose the boundary hair length
value is 15.0 cm; then we can say that if the hair length is less than 15.0 cm the predicted gender
is male, or else female.
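
As a toy illustration of such a boundary condition, the following R snippet applies the hypothetical 15.0 cm threshold to some made-up hair lengths; the threshold and the sample values are assumptions for illustration only.

# Hypothetical hair-length sample (in cm) and a single learned threshold
hair_length <- c(4, 12, 18, 30, 9, 25)
predicted_gender <- ifelse(hair_length < 15.0, "male", "female")
predicted_gender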

Classification Algorithms vs Clustering Algorithms


In clustering, the idea is not to predict the target class as in classification; it is rather to group
similar kinds of things together, under the condition that all the items in the same group should be
similar and items from two different groups should not be similar.

Group items examples:

 Grouping documents by language type (documents in the same language form one group)
 Categorizing news articles (articles in the same news category, e.g. Sport, form one group)

Let’s understand the concept with the example of clustering genders based on hair length. To determine
gender, different similarity measures could be used to group male and female hair lengths. This could
be done by finding the similarity between two hair lengths and keeping them in the same group if the
difference between them is small. The same process could continue until all the hair lengths are
properly grouped into two categories.

Basic Terminology in Classification Algorithms


 Classifier: An algorithm that maps the input data to a specific category.
 Classification model: A classification model tries to draw some conclusion from the input
values given for training. It will predict the class labels/categories for the new data.
 Feature: A feature is an individual measurable property of a phenomenon being observed.
 Binary Classification: Classification task with two possible outcomes. Eg: Gender
classification (Male / Female)
 Multi-class classification: Classification with more than two classes. In multi-class
classification, each sample is assigned to one and only one target label. Eg: An animal can
be a cat or dog but not both at the same time.
 Multi-label classification: Classification task where each sample is mapped to a set of target
labels (more than one class). Eg: A news article can be about sports, a person, and
location at the same time.

Applications of Classification Algorithms


 Email spam classification
 Predicting whether bank customers will repay their loans
 Identification of cancer tumour cells
 Sentiment analysis
 Drug classification
 Facial key-point detection
 Pedestrian detection for autonomous driving

Types of Classification Algorithms


Classification Algorithms could be broadly classified as the following:

 Linear Classifiers
o Logistic regression
o Naive Bayes classifier
o Fisher’s linear discriminant
 Support vector machines
o Least squares support vector machines
 Quadratic classifiers
 Kernel estimation
o k-nearest neighbor
 Decision trees
o Random forests
 Neural networks
 Learning vector quantization

Examples of a few popular Classification Algorithms are given below.

Logistic Regression
As confusing as the name might be, you can rest assured: Logistic Regression is a classification
algorithm, not a regression algorithm. It estimates discrete values (binary values like 0/1, yes/no,
true/false) based on a given set of independent variables. Simply put, it predicts the probability of
occurrence of an event by fitting data to a logit function. Hence, it is also known as logit
regression. The values obtained always lie between 0 and 1 since it predicts a probability.

Let us try and understand this through another example.

Let’s say there’s a problem on your math test. It can only have 2 outcomes, right? Either you solve it or
you don’t (and let’s not assume points for method here). Now imagine that you are being given a
wide range of problems in an attempt to understand which chapters you have understood well. The
outcome of this study would be something like this: if you are given a trigonometry-based problem,
you are 70% likely to solve it. On the other hand, if it is an arithmetic problem, the probability of you
getting an answer is only 30%. This is what Logistic Regression provides you.

If I had to do the math, I would model the log odds of the outcome as a linear combination of the
predictor variables.

odds = p / (1 - p) = probability of event occurrence / probability of event non-occurrence
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk

In the equation given above, p is the probability of the presence of the characteristic of interest. The
model chooses parameters that maximize the likelihood of observing the sample values, rather than
parameters that minimize the sum of squared errors (as in ordinary regression).
Now, a lot of you might wonder, why take a log? For the sake of simplicity, let’s just say that this is
one of the best mathematical ways to replicate a step function. I could go into far more depth on this,
but that would defeat the purpose of this blog.

R-Code:

x <- data.frame(x_train, y_train)   # combine predictors and response into one data frame
# Train the model using the training set and inspect the fit
logistic <- glm(y_train ~ ., data = x, family = "binomial")
summary(logistic)
# Predict output (predicted probabilities for the test set)
predicted <- predict(logistic, x_test, type = "response")
There are many different steps that could be tried in order to improve the model:

 include interaction terms
 remove features
 use regularization techniques (see the sketch below)
 use a non-linear model
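
As a sketch of the regularization option above, the snippet below uses the glmnet package (an L1 penalty is chosen here purely for illustration); it assumes x_train and x_test are numeric matrices and y_train is a binary outcome (0/1 or a two-level factor), as in the earlier snippets.

library(glmnet)
# Cross-validated L1-regularized (lasso) logistic regression
cv_fit <- cv.glmnet(as.matrix(x_train), y_train, family = "binomial", alpha = 1)
# Predicted probabilities at the lambda that minimizes cross-validation error
predicted_prob <- predict(cv_fit, newx = as.matrix(x_test), s = "lambda.min", type = "response")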

Decision Trees
Now, the decision tree is by far one of my favorite algorithms. With versatile features that can handle
both categorical and continuous dependent variables, it is a type of supervised learning algorithm
mostly used for classification problems. What this algorithm does is split the population into two or
more homogeneous sets based on the most significant attributes, making the groups as distinct as
possible.
In the accompanying illustration (not reproduced here), the population is classified into four different
groups based on multiple attributes to identify ‘if they will play or not’.

R-Code:

library(rpart)
x <- data.frame(x_train, y_train)
# Grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output (predicted classes for the test set)
predicted <- predict(fit, x_test, type = "class")
Naive Bayes Classifier
This is a classification technique based on Bayes’ theorem, with an assumption of independence
between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features, a
Naive Bayes classifier would consider all of these properties to contribute independently to the
probability that this fruit is an apple.

A Naive Bayes model is simple to build and particularly useful for enormous data sets. Along with
simplicity, Naive Bayes is known to outperform even sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c).
The expression for the posterior probability is:

P(c|x) = P(x|c) * P(c) / P(x)

Here,

 P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
 P(c) is the prior probability of the class.
 P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
 P(x) is the prior probability of the predictor.

Example: Let’s work through an example to understand this better. Here I have a training data set of
weather conditions (sunny, overcast and rainy) and a corresponding binary variable ‘Play’. Now, we
need to classify whether players will play or not based on the weather condition. Let’s follow the
steps below.

Step 1: Convert the data set to a frequency table.

Step 2: Create a likelihood table by finding the probabilities, e.g. the probability of Overcast = 0.29
and the probability of playing = 0.64.

Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny, is this statement correct?

We can solve this using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36 and P(Yes) = 9/14 = 0.64.

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher of the two posteriors, so the
statement is correct.

Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
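
The same arithmetic can be reproduced in a couple of lines of R, using the counts quoted above:

p_sunny_given_yes <- 3 / 9    # P(Sunny | Yes)
p_yes             <- 9 / 14   # P(Yes)
p_sunny           <- 5 / 14   # P(Sunny)
p_yes_given_sunny <- p_sunny_given_yes * p_yes / p_sunny
p_yes_given_sunny             # ~0.60, so "Yes" is the more probable class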

R-Code:

library(e1071)
x <- data.frame(x_train, y_train)
# Fit the model (y_train should be a factor for classification)
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
KNN (k- Nearest Neighbors)
K nearest neighbors is a simple algorithm used for both classification and regression problems. It
stores all available cases and classifies new cases by a majority vote of their k neighbors: the case is
assigned to the class most common amongst its K nearest neighbors, measured by a distance
function (Euclidean, Manhattan, Minkowski or Hamming).

While the first three distance functions are used for continuous variables, the Hamming distance
function is used for categorical variables. If K = 1, then the case is simply assigned to the class of its
nearest neighbor. At times, choosing K turns out to be a challenge while performing kNN modeling.

You can understand KNN easily by taking an example of our real lives. If you have a crush on a
girl/boy in class, of whom you have no information, you might want to talk to their friends and social
circles to gain access to their information!

R-Code:

library(class)                       # knn() is provided by the 'class' package
# knn() fits and predicts in one step: it takes the training matrix, the test
# matrix, the training labels and k directly (there is no separate model object)
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
Things to consider before selecting KNN:

 KNN is computationally expensive
 Variables should be normalized, otherwise higher-range variables can bias it (see the sketch below)
 More effort is needed at the pre-processing stage (e.g. outlier and noise removal) before applying kNN
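
As a sketch of the normalization point above, the snippet below standardizes the training features with scale() and applies the same centering and scaling to the test set before running kNN; x_train, x_test and y_train are assumed to be the same hypothetical objects as in the earlier snippets.

# Standardize training features, then apply the training set's centering/scaling to the test set
x_train_scaled <- scale(x_train)
x_test_scaled  <- scale(x_test,
                        center = attr(x_train_scaled, "scaled:center"),
                        scale  = attr(x_train_scaled, "scaled:scale"))
library(class)
predicted <- knn(train = x_train_scaled, test = x_test_scaled, cl = y_train, k = 5)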

SVM (Support Vector Machine)


In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of
features you have), with the value of each feature being the value of a particular coordinate.

For example, if we only had two features, such as the height and hair length of an individual, we’d
first plot these two variables in two-dimensional space, where each point has two coordinates. The
points of each class lying closest to the decision boundary are known as support vectors.

Now, we find a line that splits the data between the two differently classified groups. This will be the
line such that the distance from it to the closest point in each of the two groups is as large as
possible.
In the accompanying example (figure not reproduced here), the line that splits the data into two
differently classified groups is the blue line, since the two closest points are the farthest from the
line. This line is our classifier. Then, depending on which side of the line the testing data lands, that
is the class we assign to the new data.

R-Code:

library(e1071)
x <- data.frame(x_train, y_train)
# Fit the model (svm() performs classification when y_train is a factor)
fit <- svm(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)

DECISION TREE
Decision Tree is considered to be one of the most useful Machine Learning algorithms since it can be
used to solve a variety of problems. Here are a few reasons why you should use Decision Tree:

1. It is considered to be one of the most understandable Machine Learning algorithms, and it can be
easily interpreted.
2. It can be used for both classification and regression problems.
3. Unlike many Machine Learning algorithms, it works effectively with non-linear data.
4. Constructing a Decision Tree is a very quick process since it uses only one feature per node
to split the data.
What Is A Decision Tree Algorithm?
A Decision Tree is a Supervised Machine Learning algorithm which looks like
an inverted tree, wherein each node represents a predictor variable (feature),
the link between the nodes represents a Decision and each leaf node
represents an outcome (response variable).

To get a better understanding of a Decision Tree, let’s look at an example:

Let’s say that you hosted a huge party and you want to know how many of your guests were non-
vegetarians. To solve this problem, let’s create a simple Decision Tree.
(Figure: example decision tree classifying a guest as vegetarian or non-vegetarian)

In the above illustration, I’ve created a Decision tree that classifies a guest as either vegetarian or
non-vegetarian. Each node represents a predictor variable that will help to conclude whether or not a
guest is a non-vegetarian. As you traverse down the tree, you must make decisions at each node,
until you reach a dead end.

Now that you know the logic of a Decision Tree, let’s define a set of terms related to a Decision
Tree.

Structure Of A Decision Tree


(Figure: structure of a decision tree)

A Decision Tree has the following structure:

 Root Node: The root node is the starting point of a tree. At this point, the first split is performed.
 Internal Nodes: Each internal node represents a decision point (predictor variable) that
eventually leads to the prediction of the outcome.
 Leaf/ Terminal Nodes: Leaf nodes represent the final class of the outcome and therefore
they’re also called terminating nodes.
 Branches: Branches are connections between nodes, they’re represented as arrows. Each
branch represents a response such as yes or no.

So that is the basic structure of a Decision Tree. Now let’s try to understand the workflow of a Decision
Tree.

How Does The Decision Tree Algorithm Work?


The Decision Tree Algorithm follows the below steps:

Step 1: Select the feature (predictor variable) that best classifies the data set into the desired classes
and assign that feature to the root node.
Step 2: Traverse down from the root node, whilst making relevant decisions at each internal node
such that each internal node best classifies the data.
Step 3: Route back to step 1 and repeat until you assign a class to the input data.

The above-mentioned steps represent the general workflow of a Decision Tree used for classification
purposes.

Now let’s try to understand how a Decision Tree is created.

Build A Decision Tree Using ID3 Algorithm


There are many ways to build a Decision Tree; in this blog we’ll be focusing on how the ID3 algorithm
is used to create one.

What Is The ID3 Algorithm?

ID3 or the Iterative Dichotomiser 3 algorithm is one of the most effective algorithms used to build a
Decision Tree. It uses the concept of Entropy and Information Gain to generate a Decision Tree for a
given set of data.

ID3 Algorithm:

The ID3 algorithm follows the workflow below in order to build a Decision Tree:

1. Select the best attribute (A).
2. Assign A as the decision variable for the root node.
3. For each value of A, build a descendant of the node.
4. Assign classification labels to the leaf nodes.
5. If the data is correctly classified: stop.
6. Else: iterate over the tree.

The first step in this algorithm states that we must select the best attribute. What does that mean?
The best attribute (predictor variable) is the one that separates the data set into the different classes
most effectively; in other words, it is the feature that best splits the data set.

Now the next question in your head must be, “How do I decide which variable/ feature best splits the
data?”

Two measures are used to decide the best attribute:

1. Information Gain
2. Entropy

What Is Entropy?

Entropy measures the impurity or uncertainty present in the data. It is used to
decide how a Decision Tree can split the data.

Equation For Entropy:

Entropy(S) = – Σ p(i) log2 p(i), where p(i) is the proportion of samples in S that belong to class i.

What Is Information Gain?

Information Gain (IG) is the most significant measure used to build a Decision
Tree. It indicates how much “information” a particular feature/variable gives us
about the final outcome.
Information Gain is important because it is used to choose the variable that best splits the data at each
node of a Decision Tree. The variable with the highest IG is used to split the data at the root node.

Equation For Information Gain (IG):

Information Gain = Entropy(parent) – [weighted average] Entropy(children)
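
To make these two measures concrete, here is a small R sketch: entropy() works on a vector of class labels, and info_gain() compares the parent's entropy with the weighted entropy of the children produced by splitting on one feature. The 'steep'/'flat' road-type values are hypothetical, chosen only to reproduce the (slow, slow, fast) / (fast) split used in the example below.

# Entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting 'labels' on the values of 'feature'
info_gain <- function(labels, feature) {
  children <- split(labels, feature)
  weighted <- sum(sapply(children, function(ch) length(ch) / length(labels) * entropy(ch)))
  entropy(labels) - weighted
}

# Parent node from the speed example: two 'slow' and two 'fast' outcomes
speed <- c("slow", "slow", "fast", "fast")
entropy(speed)                         # 1, as computed below
# Hypothetical road-type values producing the (slow, slow, fast) / (fast) split
road  <- c("steep", "steep", "steep", "flat")
info_gain(speed, road)                 # ~0.31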

To better understand how Information Gain and Entropy are used to create a Decision Tree, let’s look
at an example. The below data set represents the speed of a car based on certain parameters.

(Figure: speed data set)

Your problem statement is to study this data set and create a Decision Tree that classifies the speed
of a car (response variable) as either slow or fast, depending on the following predictor variables:

 Road type
 Obstruction
 Speed limit
We’ll be building a Decision Tree using these variables in order to predict the speed of a car. As
mentioned earlier, we must first decide which variable best splits the data set, assign that variable to
the root node, and then repeat the same process for the other nodes.

At this point, you might be wondering how you know which variable best separates the data. The
answer is that the variable with the highest Information Gain best divides the data into the desired
output classes.

So, let’s begin by calculating the Entropy and Information Gain (IG) for each of the predictor variables,
starting with ‘Road type’.

In our data set, there are four observations in the ‘Road type’ column that correspond to four labels
in the ‘Speed of car’ column. We shall begin by calculating the entropy of the parent node (Speed of
car).

Step one is to find out the fraction of the two classes present in the parent node. We know that there
are a total of four values present in the parent node, out of which two samples belong to the ‘slow’
class and the other 2 belong to the ‘fast’ class, therefore:

 P(slow) -> fraction of ‘slow’ outcomes in the parent node
 P(fast) -> fraction of ‘fast’ outcomes in the parent node

The formula to calculate P(slow) is:

p(slow) = no. of ‘slow’ outcomes in the parent node / total number of outcomes

Similarly, the formula to calculate P(fast) is:

p(fast) = no. of ‘fast’ outcomes in the parent node / total number of outcomes

Therefore, the entropy of the parent node is:

Entropy(parent) = – {0.5 log2(0.5) + 0.5 log2(0.5)} = – {-0.5 + (-0.5)} = 1

Now that we know that the entropy of the parent node is 1, let’s see how to calculate the Information
Gain for the ‘Road type’ variable. Remember that the root node is split using the ‘Road type’ variable
only if its Information Gain is greater than that of all the other predictor variables.

In order to calculate the Information Gain of ‘Road type’ variable, we first need to split the root node
by the ‘Road type’ variable.
(Figure: parent node split on the ‘Road type’ variable)

In the above illustration, we’ve split the parent node using the ‘Road type’ variable; the child nodes
contain the corresponding responses as shown in the data set. Now, we need to measure the entropy
of the child nodes.

The entropy of the right-hand child node (fast) is 0, because all of the outcomes in this node belong
to one class (fast). In a similar manner, we must find the entropy of the left-hand child node
(slow, slow, fast).

In this node there are two types of outcomes (fast and slow), therefore, we first need to calculate the
fraction of slow and fast outcomes for this particular node.

P(slow) = 2/3 = 0.667

P(fast) = 1/3 = 0.333

Therefore, the entropy is:

Entropy(left child node) = – {0.667 log2(0.667) + 0.333 log2(0.333)} = – {-0.39 + (-0.53)} ≈ 0.92

Our next step is to calculate the Entropy(children) as a weighted average:

 Total number of outcomes in the parent node: 4
 Total number of outcomes in the left child node: 3
 Total number of outcomes in the right child node: 1

Formula for Entropy(children) with weighted avg.:

[Weighted avg] Entropy(children) = (no. of outcomes in left child node / total no. of outcomes in
parent node) * Entropy(left node) + (no. of outcomes in right child node / total no. of outcomes in
parent node) * Entropy(right node)

By using the above formula you’ll find that the Entropy(children) with weighted avg. is
(3/4) * 0.92 + (1/4) * 0 ≈ 0.69.

Our final step is to substitute the above weighted average into the IG formula in order to calculate the
final IG of the ‘Road type’ variable:

Therefore,

Information gain(Road type) = 1 – 0.69 = 0.31

The information gain of the ‘Road type’ feature is 0.31.

As mentioned earlier, the Decision Tree algorithm selects the variable with the highest Information
Gain to split the Decision Tree. Therefore, by using the above method you need to calculate the
Information Gain for all the predictor variables to check which variable has the highest IG.

So by using the above methodology, you should get the following values for each predictor variable:

1. Information gain(Road type) = 1 – 0.69 = 0.31
2. Information gain(Obstruction) = 1 – 1 = 0
3. Information gain(Speed limit) = 1 – 0 = 1

So, here we can see that the ‘Speed limit’ variable has the highest Information Gain. Therefore, the
final Decision Tree for this data set is built using the ‘Speed limit’ variable.

(Figure: final decision tree split on the ‘Speed limit’ variable)

Now that you know how a Decision Tree is created, let’s run a short demo that solves a real-world
problem by implementing Decision Trees.
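
A short demo along these lines can be run with rpart; the data frame below is a hypothetical reconstruction of the speed data set (the original table is not reproduced here), so the exact values are illustrative only.

library(rpart)

# Hypothetical reconstruction of the speed data set (illustrative values)
speed_data <- data.frame(
  road_type   = c("steep", "steep", "steep", "flat"),
  obstruction = c("yes", "no", "yes", "no"),
  speed_limit = c("yes", "yes", "no", "no"),
  speed       = c("slow", "slow", "fast", "fast"),
  stringsAsFactors = TRUE
)

# minsplit/minbucket/cp are relaxed so rpart will split such a tiny data set at all
fit <- rpart(speed ~ ., data = speed_data, method = "class",
             control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))
print(fit)   # the first split is on speed_limit, the variable with the highest IG

# Predict the class of a new observation
new_day <- data.frame(road_type = "flat", obstruction = "yes", speed_limit = "no",
                      stringsAsFactors = TRUE)
predict(fit, new_day, type = "class")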
Naive Bayes Classifier

What is a classifier?
A classifier is a machine learning model that is used to discriminate
different objects based on certain features.

Principle of Naive Bayes Classifier:


A Naive Bayes classifier is a probabilistic machine learning model that is
used for classification tasks. The crux of the classifier is based on Bayes'
theorem.

Bayes' Theorem:

P(A | B) = P(B | A) * P(A) / P(B)

Using Bayes' theorem, we can find the probability of A happening given that B
has occurred. Here, B is the evidence and A is the hypothesis. The assumption
made here is that the predictors/features are independent, i.e. the presence
of one particular feature does not affect another. Hence it is called naive.

Example:
Let us take an example to get some better intuition. Consider the problem of
deciding whether to play golf. The data set (not reproduced here) is organized
as follows: we classify whether the day is suitable for playing golf, given
the features of the day. The columns represent these features and the rows
represent individual entries. If we take the first row of the data set, we can
observe that it is not suitable for playing golf if the outlook is rainy, the
temperature is hot, the humidity is high and it is not windy. We make two
assumptions here: first, as stated above, we consider these predictors to be
independent, i.e. if the temperature is hot it does not necessarily mean that
the humidity is high; second, all the predictors have an equal effect on the
outcome, i.e. a windy day does not carry more importance in deciding whether
to play golf or not.

According to this example, Bayes' theorem can be rewritten as:

P(y | X) = P(X | y) * P(y) / P(X)

The variable y is the class variable (play golf), which represents whether it
is suitable to play golf or not given the conditions. The variable X
represents the parameters/features.

X is given as X = (x_1, x_2, ..., x_n), where x_1, x_2, ..., x_n represent the
features, i.e. they can be mapped to outlook, temperature, humidity and windy.
By substituting for X and expanding using the chain rule we get:

P(y | x_1, ..., x_n) = P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y) * P(y) / (P(x_1) * P(x_2) * ... * P(x_n))

Now, you can obtain the values for each term by looking at the data set and
substituting them into the equation. For all entries in the data set, the
denominator does not change; it remains constant. Therefore, the denominator
can be removed and a proportionality introduced:

P(y | x_1, ..., x_n) ∝ P(y) * Π P(x_i | y)

In our case, the class variable (y) has only two outcomes, yes or no. There
could be cases where the classification is multi-class. Therefore, we need to
find the class y with the maximum probability:

y = argmax_y P(y) * Π P(x_i | y)

Using the above function, we can obtain the class, given the predictors.
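
As a sketch of this argmax rule, the R snippet below scores each class as P(y) times the product of the per-feature likelihoods on a hypothetical play-golf table (only two features are shown, and the rows are illustrative rather than the original data set).

# Hypothetical play-golf data with two categorical features (illustrative only)
golf <- data.frame(
  outlook = c("rainy", "rainy", "overcast", "sunny", "sunny", "sunny", "overcast",
              "rainy", "rainy", "sunny", "rainy", "overcast", "overcast", "sunny"),
  windy   = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
              FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  play    = c("no", "no", "yes", "yes", "yes", "no", "yes",
              "no", "yes", "yes", "yes", "yes", "yes", "no")
)

# Score each class as P(y) * prod P(x_i | y), ignoring the constant denominator
naive_bayes_score <- function(data, target, new_row) {
  sapply(unique(data[[target]]), function(cls) {
    subset_cls <- data[data[[target]] == cls, ]
    prior <- nrow(subset_cls) / nrow(data)
    likelihood <- prod(sapply(names(new_row), function(f) mean(subset_cls[[f]] == new_row[[f]])))
    prior * likelihood
  })
}

scores <- naive_bayes_score(golf, "play", list(outlook = "sunny", windy = FALSE))
names(which.max(scores))   # predicted class ("yes" on this illustrative table)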

Types of Naive Bayes Classifier:

Multinomial Naive Bayes:


This is mostly used for document classification problems, i.e. whether a
document belongs to the category of sports, politics, technology, etc. The
features/predictors used by the classifier are the frequencies of the words
present in the document.

Bernoulli Naive Bayes:


This is similar to multinomial naive Bayes, but the predictors are Boolean
variables. The parameters that we use to predict the class variable take only
the values yes or no, for example whether a word occurs in the text or not.

Gaussian Naive Bayes:


When the predictors take continuous values and are not discrete, we assume
that these values are sampled from a Gaussian distribution.

Gaussian Distribution (Normal Distribution)

Since the way the values are present in the data set changes, the formula for
the conditional probability changes to:

P(x_i | y) = (1 / sqrt(2 * π * σ_y^2)) * exp(-(x_i - μ_y)^2 / (2 * σ_y^2))

where μ_y and σ_y are the mean and standard deviation of feature x_i for class y.
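
For continuous predictors, the naiveBayes() function in the e1071 package fits this kind of per-class Gaussian model; the snippet below illustrates it on the built-in iris data set (chosen here only as a convenient numeric example).

library(e1071)
# Fit a naive Bayes model with Gaussian likelihoods for the numeric predictors
fit <- naiveBayes(Species ~ ., data = iris)
fit$tables$Petal.Length          # per-class mean and sd used in the Gaussian density
predict(fit, iris[c(1, 51, 101), ])   # predict the class of one sample per species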
Conclusion:
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering,
recommendation systems, etc. They are fast and easy to implement, but their
biggest disadvantage is the requirement that the predictors be independent. In
most real-life cases the predictors are dependent, which hinders the
performance of the classifier.

Standardization VS Normalization

(Figure: feature scaling illustration, via StackExchange)

Standardization
Standardization (or Z-score normalization) is the process of rescaling the
features so that they have the properties of a Gaussian distribution with

μ = 0 and σ = 1

where μ is the mean and σ is the standard deviation from the mean; standard
scores (also called z-scores) of the samples are calculated as follows:

z = (x - μ) / σ
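
In R, this is what scale() does column by column; a quick check on a built-in data set (used here purely for illustration):

# Standardize the four numeric columns of iris
z <- scale(iris[, 1:4])
round(colMeans(z), 10)   # ~0 for every column
apply(z, 2, sd)          # 1 for every column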

Normalization
Normalization, often also simply called min-max scaling, basically shrinks the
range of the data so that it is fixed between 0 and 1 (or -1 to 1 if there are
negative values). It works better for cases in which standardization might not
work so well: if the distribution is not Gaussian or the standard deviation is
very small, the min-max scaler is the better choice.

Normalization is typically done via the following equation:

X_norm = (X - X_min) / (X_max - X_min)
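
A minimal min-max helper in R, again illustrated on a built-in data set:

# Map each column to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(iris[, 1:4], min_max))
summary(scaled)   # every column now runs from 0 to 1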
