Machine Learning Notes Unit 1 To 4
7th Semester
Machine Learning
MACHINE LEARNING-
Machine learning is a growing technology which enables computers to learn automatically from
past data. Machine learning uses various algorithms for building mathematical models and
making predictions using historical data or information. Currently, it is being used for various
tasks such as image recognition, speech recognition, email filtering, Facebook auto-
tagging, recommender system, and many more.
Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can
define it in a summarized way as:
“Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.”
In Victorian England, Lady Ada Lovelace was a friend and collaborator of Charles Babbage, the
inventor of the Analytical Engine: the first-known general-purpose, mechanical computer.
Although visionary and far ahead of its time, the Analytical Engine wasn't meant as a general-
purpose computer when it was designed in the 1830s and 1840s, because the concept of general-
purpose computation was yet to be invented. It was merely meant as a way to use mechanical
operations to automate certain computations from the field of mathematical analysis, hence the name Analytical Engine. In 1843, Ada Lovelace remarked on the invention, "The Analytical Engine
has no pretensions whatever to originate anything. It can do whatever we know how to order it
to perform. Its province is to assist us in making available what we're already acquainted with."
This remark was later quoted by AI pioneer Alan Turing as "Lady Lovelace's objection" in his
landmark 1950 paper "Computing Machinery and Intelligence," which introduced the Turing test
as well as key concepts that would come to shape AI. Turing was quoting Ada Lovelace while
pondering whether general-purpose computers could be capable of learning and originality, and
he came to the conclusion that they could. Machine learning arises from this question: could a
computer go beyond "what we know how to order it to perform" and learn on its own how to
perform a specified task?
A Machine Learning system learns from historical data, builds the prediction models, and
whenever it receives new data, predicts the output for it. The accuracy of predicted output
depends upon the amount of data, as the huge amount of data helps to build a better model
which predicts the output more accurately.
Suppose we have a complex problem where we need to make some predictions. Instead of
writing code for it, we just need to feed the data to generic algorithms, and with the help of
these algorithms, the machine builds the logic as per the data and predicts the output. Machine
learning has changed our way of thinking about such problems. The below block diagram explains
the working of a Machine Learning algorithm:
Supervised learning is a process of providing input data as well as correct output data to
the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
The working of Supervised learning can be easily understood by the below example and
diagram:
For example, suppose a model is trained on labelled records of your daily commute (the weather,
the time you leave work, and how long the drive home took). It ascertains that the more it rains,
the longer you will be driving to get back to your home. It might also see the connection between
the time you leave work and the time you'll be on the road: the closer you are to 6 p.m., the
longer it takes for you to get home. Your machine may find some of these relationships within
your labelled data.
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled
as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.
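As a minimal illustrative sketch (not part of the original notes), the shape example can be mimicked in Python with scikit-learn, assuming we encode each shape by two hypothetical features: the number of sides and whether all sides are equal.

# Supervised learning sketch for the shape example (feature encoding assumed for illustration).
from sklearn.tree import DecisionTreeClassifier

# Features: [number_of_sides, all_sides_equal (1 = yes, 0 = no)]; labels are the shape names.
X_train = [[4, 1], [4, 0], [3, 0], [6, 1]]
y_train = ["square", "rectangle", "triangle", "hexagon"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # training phase: learn the mapping from features to labels

X_test = [[3, 0], [4, 1]]            # new shapes to identify
print(model.predict(X_test))         # -> ['triangle' 'square']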
• Irrelevant input features present in the training data could give inaccurate results.
• Data preparation and pre-processing is always a challenge.
• Accuracy suffers when impossible, unlikely, or incomplete values have been entered as
training data.
• If a domain expert is not available, the other approach is "brute force": you have to guess
which features (input variables) to train the machine on, which could be inaccurate.
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
“Unsupervised learning is a type of machine learning in which models are trained using
unlabelled dataset and are allowed to act on that data without any supervision.”
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like the way a human learns to think through their own experiences,
which makes it closer to real AI.
o Unsupervised learning works on unlabelled and uncategorized data which make
unsupervised learning more important.
o In real-world, we do not always have input data with the corresponding output so to solve
such cases, we need unsupervised learning.
Here, we have taken unlabelled input data, which means it is not categorized and
corresponding outputs are also not given. Now, this unlabelled input data is fed to the
machine learning model in order to train it. Firstly, the model interprets the raw data to find
the hidden patterns in the data and then applies a suitable algorithm such as k-means
clustering, decision trees, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.
Clustering
Clustering is a data mining technique which groups unlabeled data based on their
similarities or differences. Clustering algorithms are used to process raw, unclassified
data objects into groups represented by structures or patterns in the information.
Association Rules
1) Apriori algorithms
Apriori algorithms have been popularized through market basket analyses, leading to
different recommendation engines for music platforms and online retailers. They are
used within transactional datasets to identify frequent itemsets, or collections of
items, to identify the likelihood of consuming a product given the consumption of
another product.
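As a hedged sketch (not from the notes), frequent itemsets and association rules can be mined with the mlxtend library, assuming the transactions have already been one-hot encoded into a boolean table:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a transaction; each column records whether an item was bought (made-up data).
transactions = pd.DataFrame({
    "bread":  [1, 1, 0, 1],
    "butter": [1, 1, 0, 0],
    "milk":   [0, 1, 1, 1],
}).astype(bool)

frequent = apriori(transactions, min_support=0.5, use_colnames=True)          # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)   # e.g. butter -> bread
print(rules[["antecedents", "consequents", "support", "confidence"]])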
While more data generally yields more accurate results, it can also impact the
performance of machine learning algorithms (e.g. overfitting) and it can also make it
difficult to visualize datasets. Dimensionality reduction is a technique used when the
number of features, or dimensions, in a given dataset is too high. It reduces the
number of data inputs to a manageable size while also preserving the integrity of the
dataset as much as possible.
While unsupervised learning has many benefits, some challenges can occur when it
allows machine learning models to execute without any human intervention. Some of
these challenges can include:
Reinforcement Learning
o Reinforcement Learning is a feedback-based Machine learning technique in which an
agent learns to behave in an environment by performing the actions and seeing the results
of actions. For each good action, the agent gets positive feedback, and for each bad action,
the agent gets negative feedback or penalty.
o In Reinforcement Learning, the agent learns automatically using feedback without any
labelled data, unlike supervised learning.
o Since there is no labelled data, so the agent is bound to learn by its experience only.
o RL solves a specific type of problem where decision making is sequential, and the goal is
long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve the performance by getting the maximum
positive rewards.
o The agent learns through trial and error, and based on that experience, it learns to
perform the task in a better way. Hence, we can say that "Reinforcement learning is a
type of machine learning method where an intelligent agent (computer program) interacts
with the environment and learns to act within it."
The accessibility and abundance of data today makes knowledge discovery and Data Mining a
matter of considerable importance and necessity. Given the recent growth of the field, it is not
surprising that a wide variety of methods is now available to the researchers and practitioners. No
one method is superior to others for all cases. The handbook of Data Mining and Knowledge
Discovery from Data aims to organize all significant methods developed in the field into a
coherent and unified catalog; presents performance evaluation approaches and techniques; and
explains with cases and software tools the use of the different methods.
SEMMA Model
SEMMA is a sequential methodology for building machine learning models incorporated in ‘SAS
Enterprise Miner’, a product by SAS Institute Inc., one of the largest producers of commercial
statistical and business intelligence software. These sequential steps guide the
development of a machine learning system. Let’s look at the five sequential steps to understand
it better.
Enterprise Miner software is an integrated product that provides an end-to-end business solution
for data mining.
A graphical user interface (GUI) provides a user-friendly front end to the SEMMA data mining
process:
• Sample: Sample the data by creating one or more data tables. The samples should be
large enough to contain the significant information, yet small enough to process.
• Explore: Explore the data by searching for relationships, trends, and anomalies in order to
gain understanding and ideas.
• Modify: Modify the data by creating, selecting, and transforming the variables to focus the
model-selection process.
• Model: Model the data by using analytical tools to search for a combination of the data
that reliably predicts the desired outcome.
• Assess: Assess the competing models by evaluating their usefulness and reliability.
Scales of Measurement
Data can be classified as being on one of four scales: nominal, ordinal,
interval or ratio. Each level of measurement has some important properties
that are useful to know.
1. Nominal Scale –
Nominal variables can be placed into categories. These don’t have a numeric
value and so cannot be added, subtracted, divided or multiplied. These also
have no order, and nominal scale of measurement only satisfies the identity
property of measurement.
2. Ordinal Scale –
The ordinal scale contains things that you can place in order. It measures a
variable in terms of magnitude, or rank. Ordinal scales tell us relative order,
but give us no information regarding the differences between the categories.
The ordinal scale has the properties of both identity and magnitude.
For example, in a race, if Ram takes first place and Vidur takes second place,
we do not know by how many seconds, i.e., how close the competition was.
3. Interval Scale –
The interval scale has the properties of identity, magnitude, and equal
intervals between adjacent values, but it has no true zero point (for example,
temperature in degrees Celsius), so ratios of values are not meaningful.
4. Ratio Scale –
The ratio scale of measurement is similar to the interval scale in that it also
represents quantity and has equality of units with one major difference: zero
is meaningful (no numbers exist below the zero). The true zero allows us to
know how many times greater one case is than another. Ratio scales have
all of the characteristics of the nominal, ordinal and interval scales.
One of the biggest impacts of missing data is that it can bias the results of
machine learning models or reduce the accuracy of the model.
So, it is very important to handle missing values.
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
o Xn = Value of Normalization
o X = Current value of the feature
o Xmaximum = Maximum value of a feature
o Xminimum = Minimum value of a feature
Example: Let's assume we have a dataset with the maximum and minimum values of a feature
as mentioned above. To normalize it, the values are shifted and rescaled so that they range
between 0 and 1. This technique is also known as Min-Max scaling.
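A minimal NumPy sketch of the Min-Max scaling described above (the feature values are made up for illustration):

import numpy as np

x = np.array([20.0, 35.0, 50.0, 80.0, 100.0])   # hypothetical feature values

x_min, x_max = x.min(), x.max()
x_norm = (x - x_min) / (x_max - x_min)          # Xn = (X - Xminimum) / (Xmaximum - Xminimum)
print(x_norm)                                   # every value now lies between 0 and 1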
Here, the standardized value is computed as X' = (X - µ) / σ, where µ represents the mean
of the feature values and σ represents the standard deviation of the feature values.
However, unlike Min-Max scaling technique, feature values are not restricted to a
specific range in the standardization technique.
This technique is helpful for machine learning algorithms that use distance
measures, such as KNN, K-means clustering, and Principal Component Analysis.
Further, standardization works best when the data is approximately
normally distributed.
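A short sketch of standardization (z-score scaling) using scikit-learn's StandardScaler on made-up data:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: rows are observations, columns are features.
X = np.array([[20.0, 1.0], [35.0, 3.0], [50.0, 5.0], [80.0, 7.0]])

X_std = StandardScaler().fit_transform(X)   # (X - mu) / sigma, applied column by column
print(X_std.mean(axis=0))                   # approximately 0 for each feature
print(X_std.std(axis=0))                    # approximately 1 for each feature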
What is a feature?
Generally, all machine learning algorithms take input data to generate the output.
The input data remains in a tabular form consisting of rows (instances or
observations) and columns (variable or attributes), and these attributes are often
known as features. For example, an image is an instance in computer vision, but a
line in the image could be the feature. Similarly, in NLP, a document can be an
observation, and the word count could be the feature. So, we can say a feature is an
attribute that impacts a problem or is useful for the problem.
Since 2016, automated feature engineering has also been used in different machine learning
software that helps in automatically extracting features from raw data. Feature
engineering in ML contains mainly four processes: Feature Creation,
Transformations, Feature Extraction, and Feature Selection.
ensures that all the features are within the acceptable range to avoid any
computational error.
3. Feature Extraction: Feature extraction is an automated feature engineering
process that generates new variables by extracting them from the raw data.
The main aim of this step is to reduce the volume of data so that it can be
easily used and managed for data modelling. Feature extraction methods
include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
4. Feature Selection: While developing the machine learning model, only a few
variables in the dataset are useful for building the model, and the rest features
are either redundant or irrelevant. If we input the dataset with all these
redundant and irrelevant features, it may negatively impact and reduce the
overall performance and accuracy of the model. Hence it is very important to
identify and select the most appropriate features from the data and remove
the irrelevant or less important features, which is done with the help of feature
selection in machine learning. "Feature selection is a way of selecting the
subset of the most relevant features from the original features set by
removing the redundant, irrelevant, or noisy features."
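As a hedged illustration (not from the notes), one simple way to perform feature selection is a univariate filter such as scikit-learn's SelectKBest, which keeps only the k highest-scoring features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                    # 4 original features

selector = SelectKBest(score_func=f_classif, k=2)    # keep the 2 most relevant features
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)               # (150, 4) -> (150, 2)
print(selector.get_support())                        # boolean mask of the kept features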
1. Correlation :
2. Causation :
Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship
between the dependent variable (y) and the independent variable (x) as an nth-degree
polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
For reference, the three related equation forms are:
Simple Linear Regression: y = b0 + b1x
Multiple Linear Regression: y = b0 + b1x1 + b2x2 + ... + bnxn
Polynomial Regression: y = b0 + b1x + b2x² + ... + bnxⁿ
When we compare these three equations, we can clearly see that all three are polynomial
equations but differ in the degree of the variables. The Simple and Multiple Linear equations
are polynomial equations with a single degree, and the Polynomial Regression equation is a
linear equation in the coefficients with terms of degree up to n. So if we add a degree to our
linear equations, they are converted into Polynomial Linear equations.
Note: To better understand Polynomial Regression, you must have knowledge of Simple
Linear Regression.
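A minimal sketch (with assumed toy data) of polynomial regression in scikit-learn: the features are expanded to the chosen degree and an ordinary linear model is fit on the expanded features.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data: y depends quadratically on x (values invented for illustration).
x = np.arange(10).reshape(-1, 1)
y = 2 + 3 * x.ravel() + 0.5 * x.ravel() ** 2

poly = PolynomialFeatures(degree=2)           # adds x^2 (and a bias column) to the features
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)     # linear regression on the polynomial features
print(model.predict(poly.transform([[12]])))  # prediction for a new x value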
Logistic Regression
o In Logistic Regression, y can be between 0 and 1 only, so to stretch this range, let's divide
the above equation by (1 - y), which gives the odds y / (1 - y).
o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
equation, and it becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
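As an illustrative sketch (using a built-in dataset, not part of the notes), a binomial logistic regression can be fit with scikit-learn as follows:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)            # binary target: 0 or 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)                # larger max_iter helps convergence here
clf.fit(X_train, y_train)

print(clf.predict(X_test[:5]))                         # predicted classes (0 or 1)
print(clf.predict_proba(X_test[:5])[:, 1])             # predicted probabilities of class 1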
Note: For a better understanding of this topic, we suggest you first understand the
Confusion Matrix, as AUC-ROC uses terminology from the Confusion Matrix.
ROC Curve
ROC or Receiver Operating Characteristic curve represents a probability graph
to show the performance of a classification model at different threshold levels.
The curve is plotted between two parameters, which are:
TPR: TPR or True Positive Rate is a synonym for Recall, and can be calculated as
TPR = TP / (TP + FN).
FPR: FPR or False Positive Rate can be calculated as FPR = FP / (FP + TN).
Now, to efficiently summarize the values across all threshold levels, we need a single
measure, which is AUC.
In the ROC curve, AUC computes the performance of the binary classifier across
different thresholds and provides an aggregate measure. The value of AUC ranges
from 0 to 1, which means an excellent model will have AUC near 1, and hence it will
show a good measure of Separability.
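A short sketch (reusing the same kind of fitted binary classifier as in the logistic regression sketch above) of computing the ROC curve and AUC with scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

y_score = clf.predict_proba(X_test)[:, 1]             # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_score)     # FPR and TPR at every threshold
print("AUC =", roc_auc_score(y_test, y_score))        # aggregate measure between 0 and 1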
Applications of AUC-ROC
Although the AUC-ROC curve is primarily used to evaluate classification models, it is widely
used for various applications. Some of the important applications of AUC-ROC are given below:
1. Classification of 3D models
The curve is used to classify a 3D model and separate it from the normal models.
With the specified threshold level, the curve classifies the non-3D models and separates out
the 3D models.
2. Healthcare
The curve has various applications in the healthcare sector. It can be used to detect
cancer in patients. It does this by using the false positive and false negative rates,
and accuracy depends on the threshold value used for the curve.
3. Binary Classification
AUC-ROC curve is mainly used for binary classification problems to evaluate their
performance.
MACHINE LEARNING UNIT-3
Introduction to Machine Learning Algorithms: Decision Trees, Support Vector Machine, k-
Nearest Neighbors, Time-Series Forecasting, Clustering, Principal Component Analysis (PCA)
Dr. Siddhartha Choubey, Mr. Sonu Agrawal, Ms. Khushi Gupta SSTC Bhilai
How can an algorithm be represented as a tree?
For this let’s consider a very basic example that uses titanic data set for
predicting whether a passenger will survive or not. The below model uses 3
features/attributes/columns from the data set, namely sex, age and
sibsp (number of siblings or spouses aboard).
A decision tree is drawn upside down with its root at the top. In the
image on the left, the bold text in black represents a
condition/internal node, based on which the tree splits into
branches/ edges. The end of the branch that doesn’t split anymore is
the decision/leaf, in this case, whether the passenger died or survived,
represented as red and green text respectively.
Although a real dataset will have a lot more features and this would just
be a branch in a much bigger tree, you can't ignore the simplicity of
this algorithm. The feature importance is clear and relations can
be viewed easily. This methodology is more commonly known
as learning decision tree from data and above tree is
called Classification tree as the target is to classify passenger as
survived or died. Regression trees are represented in the same
manner, just they predict continuous values like price of a house. In
general, Decision Tree algorithms are referred to as CART or
Classification and Regression Trees.
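A brief sketch (illustrative, using scikit-learn's built-in iris data rather than the Titanic set) of training and inspecting a CART-style classification tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # limit depth to keep the tree readable
tree.fit(X, y)

print(export_text(tree))       # prints the learned conditions (internal nodes) and leaves
print(tree.predict(X[:5]))     # class predictions for the first few rows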
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Advantages of CART
• Nonlinear relationships between parameters do not affect
tree performance.
Disadvantages of CART
Support Vector Machine-
What is the Support Vector Machine?
Support vectors are simply the coordinates of individual observations. The SVM
classifier is a frontier (hyperplane/line) that best segregates the two classes.
Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn
about the different features of cats and dogs, and then we test it with this strange creature.
Since the SVM creates a decision boundary between these two classes (cat and dog) and chooses
extreme cases (support vectors), it will see the extreme cases of cat and dog. On the basis of the
support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called a Non-linear SVM classifier.
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green
or blue. Consider the below image:
Since it is a 2-D space, we can easily separate these two classes just by using a straight
line. But there can be multiple lines that can separate these classes. Consider the below
image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both the classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third dimension
z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we
convert it into 2-D space with z = 1, then it will become as follows:
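A compact sketch (toy ring-shaped data, not from the notes) showing a linear and a kernelised (non-linear) SVM in scikit-learn; the RBF kernel plays the role of the extra dimension z described above:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy non-linearly separable data: one class forms a ring around the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # no single straight line separates the classes
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel trick implicitly adds a dimension like z = x² + y²

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))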
k-Nearest Neighbors-
• KNN which stands for K Nearest Neighbor is a Supervised Machine Learning algorithm
that classifies a new data point into the target class, depending on the features of its
neighboring data points.
• K nearest neighbors or KNN Algorithm is a simple algorithm which uses the entire dataset
in its training phase. Whenever a prediction is required for an unseen data instance, it
searches through the entire training dataset for k-most similar instances and the data
with the most similar instance is finally returned as the prediction.
• k-NN is often used in search applications where you are looking for similar items, i.e.,
finding items similar to a given one.
Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but
we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of
the new data point that are similar to the cat and dog images, and based on the most similar
features it will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new
data point x1; in which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
For two points (x1, y1) and (x2, y2) it can be calculated as d = √((x2 − x1)² + (y2 − y1)²).
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K are good, but they may cause some difficulties.
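A small sketch (with made-up 2-D points) of the same procedure using scikit-learn's KNeighborsClassifier and k = 5:

from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points labelled "A" or "B" (values invented for illustration).
X_train = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]]
y_train = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=5)   # Step-1: choose K = 5
knn.fit(X_train, y_train)                   # "training" simply stores the dataset

new_point = [[3, 4]]
print(knn.predict(new_point))               # majority vote among the 5 nearest neighbours -> ['A']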
Disadvantages of KNN Algorithm:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.
Time-Series Forecasting:
Time series forecasting is one of the most applied data science techniques in business, finance,
supply chain management, production and inventory planning. Many prediction problems
involve a time component and thus require extrapolation of time series data, or time series
forecasting. Time series forecasting is also an important area of machine learning (ML) and can
be cast as a supervised learning problem. ML methods such as Regression, Neural Networks,
Support Vector Machines, Random Forests and XGBoost can be applied to it. Forecasting
involves taking models fit on historical data and using them to predict future observations.
Time series forecasting means to forecast or to predict the future value over a period of time. It
entails developing models based on previous data and applying them to make observations and
guide future strategic decisions.
The future is forecast or estimated based on what has already happened. Time series adds a
time order dependence between observations. This dependence is both a constraint and a
structure that provides a source of additional information. Before we discuss time series
forecasting methods, let’s define time series forecasting more closely.
Time series forecasting is a technique for the prediction of events through a sequence of time.
It predicts future events by analyzing the trends of the past, on the assumption that future
trends will hold similar to historical trends. It is used across many fields of study in various
applications including:
• Astronomy
• Business planning
• Control engineering
• Earthquake prediction
• Econometrics
• Mathematical finance
• Pattern recognition
• Resources allocation
• Signal processing
• Statistics
• Weather forecasting
When forecasting, it is important to understand your goal. To narrow down the specifics of your
predictive modeling problem, ask questions about:
1. Volume of data available — more data is often more helpful, offering greater
opportunity for exploratory data analysis, model testing and tuning, and model
fidelity.
2. Required time horizon of predictions — shorter time horizons are often easier to
predict — with higher confidence — than longer ones.
3. Forecast update frequency — Forecasts might need to be updated frequently over
time or might need to be made once and remain static (updating forecasts as new
information becomes available often results in more accurate predictions).
4. Forecast temporal frequency — Often forecasts can be made at lower or higher
frequencies, which allows harnessing downsampling and up-sampling of data (this
in turn can offer benefits while modeling).
Moving-average model
In time series analysis, the moving-average model (MA model), also known as moving-average
process, is a common approach for modeling univariate time series. The moving-average model
specifies that the output variable depends linearly on the current and various past values of a
stochastic (imperfectly predictable) term.
Together with the autoregressive (AR) model (covered below), the moving-average model is a
special case and key component of the more general ARMA and ARIMA models of time series,
which have a more complicated stochastic structure.
Contrary to the AR model, the finite MA model is always stationary.
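As a hedged illustration (synthetic data, assumed parameter values), an MA(1) process X_t = mu + e_t + theta * e_{t-1} can be simulated directly with NumPy, showing how each output depends on the current and previous random shocks:

import numpy as np

rng = np.random.default_rng(0)
n, mu, theta = 200, 10.0, 0.6            # series length, mean, MA(1) coefficient (assumed values)

eps = rng.normal(0.0, 1.0, size=n + 1)   # white-noise shocks
x = mu + eps[1:] + theta * eps[:-1]      # X_t = mu + eps_t + theta * eps_{t-1}

print(x[:5])
print("sample mean:", x.mean())          # close to mu, since a finite MA process is stationary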
Exponential Smoothing model
Exponential smoothing is a rule of thumb technique for smoothing time series data using the
exponential window function. Exponential smoothing is an easily learned and easily applied
procedure for making some determination based on prior assumptions by the user, such as
seasonality. Different types of exponential smoothing include single exponential smoothing,
double exponential smoothing, and triple exponential smoothing (also known as the Holt-
Winters method). For tutorials on how to use Holt-Winters out of the box with InfluxDB, see
“When You Want Holt-Winters Instead of Machine Learning” and “Using InfluxDB to Predict The
Next Extinction Event".
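A minimal sketch (assumed smoothing factor) of single exponential smoothing, where each smoothed value is s_t = alpha * x_t + (1 - alpha) * s_{t-1}:

def exponential_smoothing(series, alpha=0.3):
    """Single exponential smoothing; alpha is the smoothing factor in (0, 1]."""
    smoothed = [series[0]]                # initialise with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [12, 15, 14, 18, 22, 21, 25]     # made-up time series
print(exponential_smoothing(demand, alpha=0.3))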
Clustering:
A cluster refers to a group of similar objects, and clustering means grouping those objects into
clusters. In order to learn clustering, it is important to understand the scenarios that lead us to
cluster different objects.
What is Clustering?
• Clustering is dividing data points into homogeneous classes or clusters:
• Points in the same group are as similar as possible.
• Points in different groups are as dissimilar as possible.
• When a collection of objects is given, we put the objects into groups based on similarity.
Clustering Algorithms -
• A Clustering Algorithm tries to analyze natural groups of data on the basis of some
similarity. It locates the centroid of the group of data points. To carry out effective
clustering, the algorithm evaluates the distance between each point from the centroid of
the cluster.
• The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.
Applications of Clustering
• Listed here are a few more applications, which would add to what you have learnt.
1. Clustering helps marketers improve their customer base and work on the target areas. It helps
group people (according to different criteria such as willingness, purchasing power, etc.) based
on their similarity in many ways related to the product under consideration.
2. Clustering helps in identification of groups of houses on the basis of their value, type and
geographical locations.
3. Clustering is used to study earthquakes. Based on the areas hit by an earthquake in a region,
clustering can help analyze the next probable location where an earthquake can occur.
Clustering is a type of unsupervised learning mechanism. It basically analyzes the points and
clusters them based on similarities and dissimilarities.
• Applications of Clustering in different fields:
1. Marketing : It can be used to characterize & discover customer segments for marketing
purposes.
2. Biology : It can be used for classification among different species of plants and animals.
3. Libraries : It is used in clustering different books on the basis of topics and information.
4. Insurance : It is used to understand customers and their policies and to identify frauds.
5. City Planning : It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.
6. Earthquake studies : By learning the earthquake-affected areas, we can determine the
dangerous zones.
Types of Clustering Methods
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known
as the centroid-based method. The most common example of partitioning clustering is
the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the
number of pre-defined groups. The cluster centers are created in such a way that the
distance between the data points within one cluster is minimal compared with the distance
to another cluster centroid.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and
arbitrarily shaped distributions are formed as long as the dense regions can be
connected. The algorithm does this by identifying different clusters in the dataset and
connecting the areas of high density into clusters. The dense areas in the data space are
separated from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability of how likely a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, most commonly the Gaussian distribution.
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there
is no requirement of pre-specifying the number of clusters to be created. In this technique,
the dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. The observations or any number of clusters can be selected by cutting
the tree at the correct level. The most common example of this method is
the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than
one group or cluster. Each data object has a set of membership coefficients, which depend
on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example
of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many different
clustering algorithms have been published, but only a few are commonly used. The choice of
clustering algorithm depends on the kind of data we are using. For example, some algorithms
need to guess the number of clusters in the given dataset, whereas others need to find the
minimum distance between observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in
machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of equal
variances. The number of clusters must be specified in this algorithm. It is fast with fewer
computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth
density of data points. It is an example of a centroid-based model, that works on updating
the candidates for centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications
with Noise. It is an example of a density-based model similar to the mean-shift, but with
some remarkable advantages. In this algorithm, the areas of high density are separated by
the areas of low density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative to the k-means algorithm or for those cases where K-means can fail. In
GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can be
represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require
specifying the number of clusters. In this, data points exchange messages between pairs
of points until convergence. It has O(N²T) time complexity, which is the main
drawback of this algorithm.
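A brief sketch (toy crescent-shaped data, assumed eps and min_samples values) of density-based clustering with scikit-learn's DBSCAN, which needs no preset number of clusters:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy data with two crescent-shaped (non-spherical) clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = neighbourhood radius, min_samples = density threshold
labels = db.fit_predict(X)            # cluster labels; -1 marks noise points

print(set(labels))                    # typically {0, 1}, plus -1 if any noise points were found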
Principal Component Analysis (PCA):
Principal Component Analysis (PCA) is an unsupervised, non-parametric
statistical technique primarily used for dimensionality reduction in
machine learning.
• Principal Component Analysis is an unsupervised learning
algorithm that is used for the dimensionality reduction in machine
learning.
• It is a statistical process that converts the observations of
correlated features into a set of linearly uncorrelated features with
the help of orthogonal transformation. These new transformed
features are called the Principal Components. It is one of the
popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from
the given dataset by reducing the variances.
• PCA generally tries to find the lower-dimensional surface to project
the high-dimensional data.
• High dimensionality means that the dataset has a large number of
features. The primary problem associated with high-dimensionality
in the machine learning field is model overfitting, which reduces
the ability to generalize beyond the examples in the training set.
• Richard Bellman described this phenomenon in 1961 as the Curse
of Dimensionality: "Many algorithms that work fine in low
dimensions become intractable when the input is high-
dimensional."
• The ability to generalize correctly becomes exponentially harder as
the dimensionality of the training dataset grows, as the training set
covers a dwindling fraction of the input space. Models also become
more efficient as the reduced feature set boosts learning rates and
diminishes computation costs by removing redundant features.
• PCA can also be used to filter noisy datasets, such as image
compression. The first principal component expresses the most
amount of variance. Each additional component expresses less
variance and more noise, so representing the data with a smaller
subset of principal components preserves the signal and discards
the noise.
Principal Components in PCA
As described, the transformed new features or the output of PCA are the
Principal Components. The number of these PCs is either equal to or
less than the number of original features present in the dataset. Some properties of
these principal components are given below:
1. The principal component must be the linear combination of the
original features.
2. These components are orthogonal, i.e., the correlation between a pair
of variables is zero.
3. The importance of each component decreases when going from 1 to n; it
means the 1st PC has the most importance, and the nth PC will have the least
importance.
Applications of Principal Component Analysis:
• PCA is mainly used as the dimensionality reduction technique in
various AI applications such as computer vision, image
compression, etc.
• It can also be used for finding hidden patterns if data has high
dimensions. Some fields where PCA is used are Finance, data
mining, Psychology, etc.
Steps for the PCA algorithm
1. Getting the dataset: First, take the input dataset on which PCA is to be performed.
2. Representing the data into a structure: Represent the dataset as a matrix in which each
row corresponds to the data items, and each column corresponds to the features. The number
of columns gives the dimensionality of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. In a particular column, the features
with high variance are more important than the features with lower variance.
If the importance of features is independent of the variance of the feature, then we
divide each data item in a column by the standard deviation of the column. Here we
will name the resulting matrix Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors for the resultant covariance
matrix of Z. Eigenvectors of the covariance matrix are the directions of the axes with the
highest information (variance), and the eigenvalues associated with these eigenvectors
measure the amount of variance along those directions.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and sort them in decreasing order, i.e.,
from largest to smallest, and simultaneously sort the corresponding eigenvectors into a
matrix P. The resultant sorted matrix of eigenvectors is named P*.
7. Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the matrix Z by P*.
In the resultant matrix Z*, each observation is a linear combination of the original features.
The columns of the Z* matrix are independent of each other.
8. Remove less or unimportant features from the new dataset.
Now that the new feature set has been obtained, we decide what to keep and what to
remove: we only keep the relevant or important features in the new dataset, and the
unimportant features are removed.
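A compact NumPy sketch (toy data) that mirrors steps 3 to 8 above: standardize, form the covariance matrix, eigendecompose, sort, and project onto the top components:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.1])   # toy data: 3 features with different spreads

Z = (X - X.mean(axis=0)) / X.std(axis=0)     # step 3: standardize
cov = (Z.T @ Z) / (len(Z) - 1)               # step 4: covariance matrix of Z

eigvals, eigvecs = np.linalg.eigh(cov)       # step 5: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]            # step 6: sort from largest to smallest
P_star = eigvecs[:, order]

Z_star = Z @ P_star                          # step 7: new features (principal components)
Z_reduced = Z_star[:, :2]                    # step 8: keep only the top 2 components
print(Z_reduced.shape)                       # (100, 2)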
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches a leaf node of
the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where you cannot further
classify the nodes; such a final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by ASM). The root node splits further
into the next decision node (distance from the office) and one leaf node based on
the corresponding labels. The next decision node further gets split into one decision
node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. For this, there are two popular Attribute Selection Measures (ASM):
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of
a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]
Where entropy measures the impurity of the data and, for a two-class problem, is given by:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Here S is the total number of samples, P(yes) is the probability of yes, and P(no) is the
probability of no.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)²
where Pj is the proportion of samples belonging to class j at the node.
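A tiny plain-Python sketch (with assumed class counts at a node) that evaluates both measures:

import math

def entropy(counts):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions at a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini Index = 1 - sum(p_j^2) over the class proportions at a node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

node = [9, 5]            # e.g. 9 "yes" and 5 "no" samples at a node (made-up counts)
print(entropy(node))     # about 0.940
print(gini(node))        # about 0.459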
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put a new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a
Support Vector Machine. Consider the below diagram in which there are two
different categories that are classified using a decision boundary or hyperplane:
Possible hyperplanes
To separate the two classes of data points, there are many possible hyperplanes that
could be chosen. Our objective is to find the plane that has the maximum margin, i.e.,
the maximum distance between data points of both classes. Maximizing the margin
distance provides some reinforcement so that future data points can be classified
with more confidence.
Hyperplanes are decision boundaries that help classify the data points. Data points
falling on either side of the hyperplane can be attributed to different classes. Also,
the dimension of the hyperplane depends upon the number of features. If the
number of input features is 2, then the hyperplane is just a line. If the number of
input features is 3, then the hyperplane becomes a two-dimensional plane. It
becomes difficult to imagine when the number of features exceeds 3.
All the previously, recently, and currently collected data is used as input for time series
forecasting where future trends, seasonal changes, irregularities, and such are elaborated
based on complex math-driven algorithms. And with machine learning, time series
forecasting becomes faster, more precise, and more efficient in the long run. ML has proven
to help better process both structured and unstructured data flows, swiftly capturing
accurate patterns within masses of data.
Stock prices forecasting — the data on the history of stock prices combined with the
data on both regular and irregular stock market spikes and drops can be used to gain
insightful predictions of the most probable upcoming stock price shifts.
Demand and sales forecasting — customer behaviour patterns data along with inputs
from the history of purchases, timeline of demand, seasonal impact, etc., enable ML
models to point out the most potentially demanded products and hit the spot in the
dynamic market.
Web traffic forecasting — common data on usual traffic rates among competitor
websites is bunched up with input data on traffic-related patterns in order to predict
web traffic rates during certain periods.
Climate and weather prediction — time-based data is regularly gathered from
numerous interconnected weather stations worldwide, while ML techniques allow to
thoroughly analyze and interpret it for future forecasts based on statistical dynamics.
Demographic and economic forecasting — there are tons of statistical inputs in
demographics and economics, which are most efficiently used for ML-based time-
series predictions. As a result, the most fitting target audience can be picked and the
most efficient ways to interact with that particular TA can be elaborated.
Scientific studies forecasting — ML and deep learning principles accelerate the rates
of polishing up and introducing scientific innovations dramatically. For instance,
science data that requires an indefinite number of analytical iterations can be
processed much faster with the help of patterns automated by machine learning.
Reducing the number of variables of a data set naturally comes at the expense of accuracy,
but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because
smaller data sets are easier to explore and visualize, and they make analyzing the data much
easier and faster for machine learning algorithms without extraneous variables to process.
So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set,
while preserving as much information as possible.
It does it by finding some similar patterns in the unlabelled dataset such as shape,
size, colour, behaviour, etc., and divides them as per the presence and absence of
those similar patterns.
After applying this clustering technique, each cluster or group is provided with a
cluster-ID. ML system can use this id to simplify the processing of large and complex
datasets.
Note: Clustering is somewhere similar to the classification algorithm, but the difference is
the type of dataset that we are using. In classification, we work with the labelled data
set, whereas in clustering, we work with the unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of a
shopping mall: when we visit any shopping mall, we can observe that things with similar
usage are grouped together in the same section.
The clustering technique can be widely used in various tasks. Some most common
uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by the Amazon in its recommendation
system to provide the recommendations as per the past sear search of
products. Netflix also uses this technique to recommend the movies and web
web-series
to its users as per the watch history.
The below diagram explains the working of the clustering algorithm. We can see that
the different fruits are divided into several groups with similar properties.
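As a minimal sketch of how such grouping can be done in code (assuming scikit-learn's
k-means and synthetic data, neither of which the notes prescribe):

# Minimal clustering sketch with k-means; each point receives a cluster-ID.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabelled data containing 3 natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)     # cluster-ID (0, 1 or 2) for every point

print(cluster_ids[:10])                 # first few assigned cluster-IDs
print(kmeans.cluster_centers_)          # one centre per discovered group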
Types of Clustering Methods
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine
Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets
into different groups.
o In Search Engines: Search engines also rely on the clustering technique. The search
results appear based on the objects closest to the search query; this is done by grouping
similar data objects into one group that is far from the dissimilar objects. The accuracy
of the results for a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers
based on their choices and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use
in a GIS database. This is very useful for determining the purpose for which a
particular piece of land is most suitable.
Bias and Variance
Machine learning is a branch of Artificial Intelligence, which allows machines to perform data
analysis and make predictions. However, if the machine learning model is not accurate, it can
make prediction errors, and these prediction errors are usually known as bias and variance.
In machine learning, these errors will always be present, as there is always a slight difference
between the model's predictions and the actual values. The main aim of ML/data science
analysts is to reduce these errors in order to get more accurate results.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it and makes
predictions. While training, the model learns these patterns in the dataset and
applies them to test data for prediction. While making predictions, a difference
occurs between the values predicted by the model and the actual/expected values,
and this difference is known as bias error or error due to bias. It can be defined as
the inability of machine learning algorithms such as Linear Regression to capture
the true relationship between the data points. Each algorithm begins with some
amount of bias, because bias arises from assumptions in the model that make the
target function simpler to learn. A model has either:
o Low Bias: A low-bias model makes fewer assumptions about the form of the
target function.
o High Bias: A high-bias model makes more assumptions and becomes unable to
capture the important features of the dataset. A high-bias model also cannot
perform well on new data.
Generally, a linear algorithm has a high bias, which is what lets it learn fast. The simpler
the algorithm, the more bias it is likely to introduce, whereas a nonlinear algorithm
often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-
Nearest Neighbours and Support Vector Machines. At the same time, algorithms
with high bias include Linear Regression, Linear Discriminant Analysis and Logistic
Regression.
What is Variance?
Low variance means there is a small variation in the prediction of the target function
with changes in the training dataset. At the same time, high variance shows a large
variation in the prediction of the target function with changes in the training dataset.
A model that shows high variance learns a lot and performs well on the training
dataset, but does not generalize well to unseen data. As a result, such a model gives
good results with the training dataset but shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to
overfitting of the model. A high-variance model therefore tends to overfit the training
data and to become more complex than necessary.
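As a hedged illustration (no code appears in the notes here; scikit-learn, the noisy sine
data and the polynomial degrees below are all assumed choices), the contrast between a
high-bias and a high-variance model can be seen by comparing training and test errors:

# High bias vs. high variance: degree-1 (too simple) and degree-15 (too flexible)
# polynomial models fitted to the same noisy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):                  # 1 = high bias, 15 = high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(degree, train_err, test_err)

The high-bias model typically shows high error on both sets (underfitting), while the
high-variance model shows low training error but a noticeably higher test error (overfitting).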
Bagging
Machine Learning uses several techniques to build models and improve their
performance. Ensemble learning methods help improve the accuracy of
classification and regression models. This section discusses one of the most popular
ensemble learning algorithms, i.e., bagging in Machine Learning. Bagging (bootstrap
aggregating) trains several copies of the same base model on random bootstrap
samples of the training data and combines their predictions, as sketched below.
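A minimal sketch of bagging (assuming scikit-learn's BaggingClassifier and an arbitrary
example dataset; the notes do not fix these choices):

# Bagging sketch: many decision trees trained on bootstrap samples of the data,
# with their predictions combined by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The default base learner of BaggingClassifier is a decision tree.
bagging = BaggingClassifier(n_estimators=50, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())   # average accuracy over 5 folds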
AdaBoost (Adaptive boosting) was the first boosting algorithm to combine various
weak classifiers into a single strong classifier in the history of machine learning. It
primarily focuses on solving classification tasks such as binary classification.
Example:
Let's suppose we have three different models with their predictions, and they work in
completely different ways. For example, the linear regression model shows a linear
relationship in the data, while the decision tree model attempts to capture the non-
linearity in the data, as shown in the image below.
Further, instead of using these models separately to predict the outcome, if we use
them in the form of a series or combination, then we get a resulting model with better
information than any of the base models alone. In other words, instead of using each
model's individual prediction, if we use the average prediction from these models,
then we would be able to capture more information from the data. This is referred to
as ensemble learning, and boosting is also based on ensemble methods in machine
learning. It enables us to combine the predictions from various learner models and
build a final predictive model having the correct prediction.
But here one question may arise: if we are applying the same algorithm, then how can
multiple decision trees give better predictions than a single decision tree? Moreover,
how does each decision tree capture different information from the same data?
The answer to these questions is that a different subset of features is taken by the
nodes of each decision tree to select the best split. It means that each tree behaves
differently, and hence captures different signals from the same data.
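Continuing the hedged sketches above (scikit-learn assumed, dataset chosen only for
illustration), boosting can be tried with AdaBoost, whose default weak learner is a
decision stump (a tree of depth 1):

# AdaBoost sketch: weak learners are trained sequentially, each one focusing on
# the examples the previous ones got wrong, and combined into a strong classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_tr, y_tr)
print(boost.score(X_te, y_te))   # accuracy on unseen data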
Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of
two or more base/learner models and a meta-model that combines the predictions
of the base models. These base models are called level-0 models, and the meta-
model is known as the level-1 model. So, the stacking ensemble method
includes the original (training) data, primary-level models, primary-level predictions,
a secondary-level model, and the final prediction. The basic architecture of stacking
can be represented as shown in the image below.
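As a hedged sketch of this architecture (scikit-learn's StackingClassifier assumed; the
level-0 and level-1 models below are arbitrary illustrative choices):

# Stacking sketch: two level-0 (base) models whose predictions are combined by
# a level-1 meta-model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

level0 = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
level1 = LogisticRegression(max_iter=1000)   # meta-model trained on base predictions

stack = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())   # average accuracy over 5 folds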
In machine learning, there is always the need to test the stability of the model. It
means we cannot fit our model on the training dataset and judge it based only on
that same training dataset. For this purpose, we reserve a particular sample of the
dataset which was not part of the training dataset. After that, we test our model on
that sample before deployment, and this complete process comes under
cross-validation. This is something different from the general train-test split.
K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples
of equal sizes. These samples are called folds. For each learning set, the prediction
function uses k-1 folds, and the rest of the folds are used for the test set. This
approach is a very popular CV approach because it is easy to understand, and the
output is less biased than other methods.
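A minimal sketch of this procedure (assuming scikit-learn's KFold and an arbitrary
model and dataset; k = 5 is also an assumed choice):

# K-fold cross-validation sketch: the data is split into k = 5 equal folds; each
# fold is used once as the test set while the remaining k-1 folds train the model.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)

print(scores)          # one accuracy score per fold
print(scores.mean())   # average score, a less biased estimate of performance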
The steps for k-fold cross-validation are: