Machine Learning Notes Unit 1 To 4


Computer Science and Engineering

7th Semester

Machine Learning

Chhattisgarh Swami Vivekananda Technical University, Bhilai (C.G.)

Machine Learning Course Code: D022711(022)


Total / Minimum-Pass Marks: 100 / 35
MACHINE LEARNING NOTES
UNIT- I Introduction: History and Evolution, Machine Learning Categories: Supervised Learning,
Unsupervised Learning, Reinforcement Learning. Knowledge Discovery in Databases, SEMMA
(Sample, Explore, Modify, Model, Assess).

MACHINE LEARNING-

Machine learning is a growing technology that enables computers to learn automatically from
past data. Machine learning uses various algorithms to build mathematical models and
make predictions using historical data or information. Currently, it is being used for various
tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging,
recommender systems, and many more.

Machine learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experiences
on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can
define it in a summarized way as:

“Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.”

In Victorian England, Lady Ada Lovelace was a friend and collaborator of Charles Babbage, the
inventor of the Analytical Engine: the first-known general-purpose, mechanical computer.
Although visionary and far ahead of its time, the Analytical Engine wasn't meant as a general-
purpose computer when it was designed in the 1830s and 1840s, because the concept of general-
purpose computation was yet to be invented. It was merely meant as a way to use mechanical
operations to automate certain computations from the field of mathematical analysis; hence the
name Analytical Engine. In 1843, Ada Lovelace remarked on the invention, "The Analytical Engine
has no pretensions whatever to originate anything. It can do whatever we know how to order it
to perform. Its province is to assist us in making available what we're already acquainted with."
This remark was later quoted by AI pioneer Alan Turing as "Lady Lovelace's objection" in his
landmark 1950 paper "Computing Machinery and Intelligence," which introduced the Turing test
as well as key concepts that would come to shape AI. Turing was quoting Ada Lovelace while
pondering whether general-purpose computers could be capable of learning and originality, and
he came to the conclusion that they could. Machine learning arises from this question: could a
computer go beyond "what we know how to order it to perform" and learn on its own how to
perform a specified task?



Could a computer surprise us? Rather than programmers crafting data-processing rules by hand,
could a computer automatically learn these rules by looking at data? This question opens the
door to a new programming paradigm. In classical programming, the paradigm of symbolic AI,
humans input rules (a program) and data to be processed according to these rules, and out come
answers. With machine learning, humans input data as well as the answers expected from the
data, and out come the rules. These rules can then be applied to new data to produce original
answers.

How does Machine Learning work?

A machine learning system learns from historical data, builds prediction models, and,
whenever it receives new data, predicts the output for it. The accuracy of the predicted output
depends upon the amount of data, as a larger amount of data helps to build a better model
which predicts the output more accurately.

Suppose we have a complex problem in which we need to make some predictions. Instead
of writing code for it, we just need to feed the data to generic algorithms, and with the help of
these algorithms, the machine builds the logic from the data and predicts the output. Machine
learning has changed our way of thinking about such problems.



HISTORY AND EVOLUTION
Machine learning history starts in 1943 with the first mathematical model of neural networks
presented in the scientific paper "A logical calculus of the ideas immanent in nervous activity" by
Walter Pitts and Warren McCulloch.
Then, in 1949, Donald Hebb published the book The Organization of Behavior. The book contained
theories on how behavior relates to neural networks and brain activity and would go on to
become one of the monumental pillars of machine learning development.
In 1950 Alan Turing created the Turing Test to determine if a computer has real intelligence. To
pass the test, a computer must be able to fool a human into believing it is also human. He
presented the principle in his paper Computing Machinery and Intelligence while working at the
University of Manchester. It opens with the words: "I propose to consider the question, 'Can
machines think?'"
Playing games and plotting routes:
The first ever computer learning program was written in 1952 by Arthur Samuel. The program
played the game of checkers, and the IBM computer improved at the game the more it played,
studying which moves made up winning strategies and incorporating those moves into its
program.
Then, in 1957, Frank Rosenblatt designed the first neural network for computers, the perceptron,
which simulated the thought processes of the human brain.
The next significant step forward in ML wasn't until 1967, when the "nearest neighbor" algorithm
was written, allowing computers to begin using very basic pattern recognition. This could be used
to map a route for traveling salesmen, starting at a random city but ensuring they visit all cities
during a short tour.
Twelve years later, in 1979, students at Stanford University invented the "Stanford Cart", which could
navigate obstacles in a room on its own. And in 1981, Gerald Dejong introduced the concept of
Explanation-Based Learning (EBL), in which a computer analyses training data and creates a general
rule it can follow by discarding unimportant data.



Machine learning was first conceived from the mathematical modeling of neural networks. A
paper by logician Walter Pitts and neuroscientist Warren McCulloch, published in 1943,
attempted to mathematically map out thought processes and decision making in human
cognition.
In 1950, Alan Turing proposed the Turing Test, which became the litmus test for whether machines
were deemed "intelligent" or "unintelligent." The criterion for a machine to receive status as an
"intelligent" machine was for it to have the ability to convince a human being that it, the
machine, was also a human being. Soon after, a summer research program at Dartmouth College
became the official birthplace of AI.
From this point on, "intelligent" machine learning algorithms and computer programs started to
appear, doing everything from planning travel routes for salespeople, to playing board games
with humans such as checkers and tic-tac-toe.
Intelligent machines went on to do everything from using speech recognition, to learning to
pronounce words the way a baby would, to defeating a world chess champion at his own
game. In this way, machine learning grew from mathematical models into sophisticated technology.



Machine Learning from Theory to Reality: Year by year

Machine Learning at Present
Machine learning is now responsible for some of the most significant advancements in
technology. It is being used for the new industry of self-driving vehicles, and for exploring the
galaxy as it helps in identifying exoplanets. Recently, Machine learning was defined by Stanford
University as “the science of getting computers to act without being explicitly programmed.”
Machine learning has prompted a new array of concepts and technologies, including supervised
and unsupervised learning, new algorithms for robots, the Internet of Things, analytics tools,
chatbots, and more. Listed below are seven common ways the world of business is currently using
machine learning:
Analyzing Sales Data: Streamlining the data
Real-Time Mobile Personalization: Promoting the experience
Fraud Detection: Detecting pattern changes
Product Recommendations: Customer personalization
Learning Management Systems: Decision-making programs
Dynamic Pricing: Flexible pricing based on a need or demand
Natural Language Processing: Speaking with humans
Machine learning models have become quite adaptive in continuously learning, which makes
them increasingly accurate the longer they operate. ML algorithms combined with new



computing technologies promote scalability and improve efficiency. Combined with business
analytics, machine learning can resolve a variety of organizational complexities. Modern ML
models can be used to make predictions ranging from outbreaks of disease to the rise and fall of
stocks.
Google is currently experimenting with machine learning using an approach called instruction
fine-tuning. The goal is to train an ML model to resolve natural language processing issues in a
generalized way. The process trains the model to solve a broad range of problems, rather than
only one kind of problem.

Machine Learning Categories:


o Supervised Learning
o Unsupervised Learning
o Reinforcement Learning

Supervised Machine Learning


Supervised learning is a type of machine learning in which machines are trained using
well-"labelled" training data, and on the basis of that data, machines predict the output.
Labelled data means that some input data is already tagged with the correct output. In
supervised learning, the training data provided to the machines works as the supervisor
that teaches the machines to predict the output correctly. It applies the same concept as
a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to
the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.



How does Supervised Learning work?
In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on
the basis of held-out test data, and then it predicts the output.

The working of supervised learning can be easily understood by the example below:



Let's see how you can develop a supervised learning model for an example that helps
a user determine their commute time. The first thing you need to create is a training
set. This training set will contain the total commute time and corresponding factors
like weather, time of day, etc. Based on this training set, your machine might see
there's a direct relationship between the amount of rain and the time you will take
to get home.

So, it ascertains that the more it rains, the longer you will be driving to get back to
your home. It might also see the connection between the time you leave work and
the time you'll be on the road.

The closer you are to 6 p.m., the longer it takes for you to get home. Your machine
may find some of these relationships in your labelled data.

Working of Supervised Machine Learning



This is the start of your data model. It begins to learn how rain impacts the way
people drive. It also starts to see that more people travel during a particular time of
day.

Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled
as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training dataset, a test dataset, and a validation dataset.
o Determine the input features of the training dataset, which should carry enough
information so that the model can accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine, decision
tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as
control parameters, which are a subset of the training dataset.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the
correct outputs, then our model is accurate. A minimal end-to-end sketch of this workflow is shown below.
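
The steps above can be illustrated with a minimal, hypothetical scikit-learn workflow; the iris dataset, the 80/20 split, and the choice of a support vector classifier are assumptions made only for illustration, not part of the syllabus:

```python
# A minimal sketch of the supervised learning workflow, assuming scikit-learn
# is available and using its built-in iris dataset purely for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1-2. Gather a labelled dataset (features X, labels y).
X, y = load_iris(return_X_y=True)

# 3. Split into training and test sets (an 80/20 split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4-6. Choose a suitable algorithm (here a support vector machine) and train it.
model = SVC(kernel="rbf")
model.fit(X_train, y_train)

# 7. Evaluate accuracy on the held-out test set.
print("Test accuracy:", model.score(X_test, y_test))
```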



Challenges in Supervised machine learning
Here, are challenges faced in supervised machine learning:

• Irrelevant input features present in the training data can give inaccurate results.
• Data preparation and pre-processing is always a challenge.
• Accuracy suffers when impossible, unlikely, or incomplete values have
been input as training data.
• If a domain expert is not available, the other approach is "brute force":
you have to guess which features (input variables) are the right ones to
train the machine on, which can be inaccurate.

Advantages of Supervised Learning


Here are the advantages of Supervised Machine learning:

• Supervised learning allows you to collect data or produce a data output
from previous experience.
• It helps you to optimize performance criteria using experience.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning models help us solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of Supervised Learning


Below are the disadvantages of supervised machine learning:

• The decision boundary might be overtrained if your training set doesn't
have examples of everything you want to have in a class.
• You need to select lots of good examples from each class while you are
training the classifier.
• Classifying big data can be a real challenge.
• Supervised learning models are not suitable for handling very complex tasks.
• Supervised learning cannot predict the correct output if the test data is different
from the training dataset.
• Training requires a lot of computation time.



• In supervised learning, we need enough knowledge about the classes of objects.

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:

1. Regression: Regression is a supervised learning task where the output has a continuous value.

Regression algorithms are used when there is a relationship between the input variable and
the output variable. They are used for the prediction of continuous variables, such as weather
forecasting, market trends, etc. For example, an output such as wind speed does not have
a discrete value but is continuous within a particular range. The goal here is to
predict a value as close to the actual output value as our model can, and
evaluation is then done by calculating the error value. The smaller the error, the greater the
accuracy of our regression model. Below are some popular regression algorithms which
come under supervised learning (a small sketch follows the list):

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
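
As a rough illustration of regression on a continuous target, the sketch below fits a simple linear regression on synthetic data; the data-generating function and noise level are invented for the example:

```python
# A minimal regression sketch on synthetic data (the data itself is made up).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                # a single input feature
y = 3.5 * X.ravel() + 2.0 + rng.normal(0, 1, 200)    # continuous target with noise

model = LinearRegression()
model.fit(X, y)

predictions = model.predict(X)
# The smaller the error, the better the regression model fits the data.
print("Mean squared error:", mean_squared_error(y, predictions))
print("Learned coefficient and intercept:", model.coef_[0], model.intercept_)
```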

2. Classification: Classification algorithms are used when the output variable is
categorical, which means there are classes such as Yes/No, Male/Female, True/False,
etc. It is a supervised learning task where the output has defined labels (discrete values).
For example, an output such as "Purchased" has defined labels, i.e. 0 or 1; 1 means
the customer will purchase, and 0 means that the customer won't purchase. The goal here
is to predict discrete values belonging to a particular class and evaluate them on the basis
of accuracy.

Classification can be either binary or multi-class. In binary classification, the model
predicts either 0 or 1 (yes or no), but in the case of multi-class classification, the model
predicts more than one class. Example: Gmail classifies mails into more than one class, like
social, promotions, updates, and forums. Some popular classification algorithms are listed
below, followed by a small sketch:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
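
The sketch below shows a binary classification workflow with one of the listed algorithms; the breast-cancer dataset and the random-forest settings are assumptions chosen only as a stand-in for a "purchase / no purchase" style problem:

```python
# A minimal binary-classification sketch, assuming scikit-learn is available.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # y contains discrete labels 0/1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classification models are evaluated on the basis of accuracy.
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```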

Unsupervised Machine Learning


In the previous topic, we learned about supervised machine learning, in which models are trained
using labelled data under supervision. But there may be many cases
in which we do not have labelled data and need to find the hidden patterns in the
given dataset. To solve such cases in machine learning, we need unsupervised
learning techniques.

What is Unsupervised Learning?


As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using a training dataset. Instead, the model itself finds the hidden
patterns and insights in the given data. It can be compared to the learning which takes
place in the human brain while learning new things. It can be defined as:

"Unsupervised learning is a type of machine learning in which models are trained using an
unlabelled dataset and are allowed to act on that data without any supervision."

Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the underlying structure of the dataset,
group the data according to similarities, and represent the dataset in a compressed
format.



Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never trained upon
the given dataset, which means it does not have any idea about the features of the
dataset. The task of the unsupervised learning algorithm is to identify the image features
on its own. The unsupervised learning algorithm will perform this task by clustering the
image dataset into groups according to similarities between the images.

Unsupervised learning, also known as unsupervised machine learning, uses machine
learning algorithms to analyze and cluster unlabeled datasets. These algorithms
discover hidden patterns or data groupings without the need for human intervention. Its
ability to discover similarities and differences in information makes it the ideal solution
for exploratory data analysis, cross-selling strategies, customer segmentation, and
image recognition.

Why use Unsupervised Learning?


Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think from their own experiences,
which makes it closer to real AI.
o Unsupervised learning works on unlabelled and uncategorized data, which makes
unsupervised learning more important.
o In the real world, we do not always have input data with the corresponding output, so to solve
such cases, we need unsupervised learning.



Working of Unsupervised Learning
Here, we take unlabelled input data, which means it is not categorized and the
corresponding outputs are not given. This unlabelled input data is fed to the
machine learning model in order to train it. First, the model interprets the raw data to find the
hidden patterns in the data, and then a suitable algorithm is applied, such as k-means
clustering, hierarchical clustering, etc.

Once a suitable algorithm is applied, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.

Types of Unsupervised Learning Algorithm:


Unsupervised learning models are utilized for three main tasks—clustering,
association, and dimensionality reduction. Below we’ll define each learning method
and highlight common algorithms and approaches to conduct them effectively.

Clustering

Clustering is a data mining technique which groups unlabeled data based on their
similarities or differences. Clustering algorithms are used to process raw, unclassified
data objects into groups represented by structures or patterns in the information.



Clustering algorithms can be categorized into a few types, specifically exclusive,
overlapping, hierarchical, and probabilistic.

1) K-means clustering is a common example of an exclusive clustering method, in which
data points are assigned to K groups, where K represents the number of clusters, based
on the distance from each group's centroid. The data points closest to a given centroid
are clustered under the same category.
2) Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is an unsupervised
clustering algorithm that can be categorized in two ways: it can be agglomerative or divisive.
Agglomerative clustering is considered a "bottom-up" approach. Its data points are isolated as
separate groupings initially, and then they are merged together iteratively on the basis of
similarity until one cluster has been achieved.
3) Probabilistic clustering- A probabilistic model is an unsupervised technique that helps
us solve density estimation or “soft” clustering problems. In probabilistic clustering, data
points are clustered based on the likelihood that they belong to a particular distribution.
The Gaussian Mixture Model (GMM) is one of the most commonly used probabilistic
clustering methods (see the sketch below).
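
The sketch below contrasts exclusive and probabilistic clustering on synthetic data; the blob dataset, the number of clusters, and the use of scikit-learn are assumptions made for illustration:

```python
# A minimal clustering sketch, assuming scikit-learn; the blob data is synthetic.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)  # unlabelled points

# Exclusive clustering: each point is assigned to exactly one of K groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
hard_labels = kmeans.fit_predict(X)

# Probabilistic ("soft") clustering: each point gets a probability per cluster.
gmm = GaussianMixture(n_components=3, random_state=7)
gmm.fit(X)
soft_probabilities = gmm.predict_proba(X)

print(hard_labels[:10])
print(soft_probabilities[:3].round(2))
```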

Association Rules

An association rule is a rule-based method for finding relationships between variables


in a given dataset. These methods are frequently used for market basket analysis,
allowing companies to better understand relationships between different products.
Understanding consumption habits of customers enables businesses to develop better
cross-selling strategies and recommendation engines. Examples of this can be seen in
Amazon’s “Customers Who Bought This Item Also Bought” or Spotify’s "Discover
Weekly" playlist. While there are a few different algorithms used to generate association
rules, such as Apriori, Eclat, and FP-Growth, the Apriori algorithm is most widely used.

1) Apriori algorithms

Apriori algorithms have been popularized through market basket analyses, leading to
different recommendation engines for music platforms and online retailers. They are
used within transactional datasets to identify frequent itemsets, or collections of
items, to identify the likelihood of consuming a product given the consumption of
another product.
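
A very small, hand-rolled sketch of the idea behind frequent-itemset support counting (the core of Apriori-style mining) is shown below; the transactions and the support threshold are invented, and production work would normally rely on a library implementation:

```python
# A toy sketch of support counting for pairs of items, the basic idea behind
# Apriori-style association mining. Transactions and threshold are made up.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 0.6  # an itemset must appear in at least 60% of transactions

# Count how often every pair of items occurs together.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {pair: count / n
                  for pair, count in pair_counts.items()
                  if count / n >= min_support}
print(frequent_pairs)  # e.g. ('bread', 'milk') with its support value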



Dimensionality reduction

While more data generally yields more accurate results, it can also impact the
performance of machine learning algorithms (e.g. overfitting) and it can also make it
difficult to visualize datasets. Dimensionality reduction is a technique used when the
number of features, or dimensions, in a given dataset is too high. It reduces the
number of data inputs to a manageable size while also preserving the integrity of the
dataset as much as possible.
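
The sketch below shows dimensionality reduction with PCA; the digits dataset and the 95% variance target are assumptions chosen only for illustration:

```python
# A minimal dimensionality-reduction sketch with PCA, assuming scikit-learn
# and using its built-in digits dataset (64 features per image) as an example.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # shape (1797, 64)

# Keep enough principal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("Original number of features:", X.shape[1])
print("Reduced number of features:", X_reduced.shape[1])
print("Variance explained:", pca.explained_variance_ratio_.sum().round(3))
```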

Challenges of unsupervised learning

While unsupervised learning has many benefits, some challenges can occur when it
allows machine learning models to execute without any human intervention. Some of
these challenges can include:

• Computational complexity due to a high volume of training data


• Longer training times
• Higher risk of inaccurate results
• Human intervention to validate output variables
• Lack of transparency into the basis on which data was clustered

Applications of Unsupervised Machine Learning


Some application of Unsupervised Learning Techniques are:

• Clustering automatically splits the dataset into groups based on their
similarities.
• Anomaly detection can discover unusual data points in your dataset. It is
useful for finding fraudulent transactions.
• Association mining identifies sets of items which often occur together in
your dataset.
• Latent variable models are widely used for data preprocessing, such as reducing
the number of features in a dataset or decomposing the dataset into
multiple components.

Disadvantages of Unsupervised Learning


• You cannot get precise information regarding data sorting or the output, because
the data used in unsupervised learning is not labelled and not known in advance.
• The results are less accurate because the input data is not known and not
labelled by people in advance. This means that the machine has to do
this itself.
• The spectral classes do not always correspond to informational classes.
• The user needs to spend time interpreting and labelling the classes which follow
that classification.
• Spectral properties of classes can also change over time, so you can't have
the same class information while moving from one image to another.

Advantages of Unsupervised Learning


o Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.

Reinforcement Learning
o Reinforcement Learning is a feedback-based Machine learning technique in which an
agent learns to behave in an environment by performing the actions and seeing the results
of actions. For each good action, the agent gets positive feedback, and for each bad action,
the agent gets negative feedback or penalty.
o In reinforcement learning, the agent learns automatically using feedback, without any
labelled data, unlike supervised learning.
o Since there is no labelled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and the goal is
long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve the performance by getting the maximum
positive rewards.
o The agent learns through a process of trial and error, and based on that experience, it learns to
perform the task in a better way. Hence, we can say that "Reinforcement learning is a
type of machine learning method where an intelligent agent (computer program)
interacts with the environment and learns to act within it." How a robotic dog
learns the movement of its arms is an example of reinforcement learning.
o It is a core part of artificial intelligence, and all AI agents work on the concept of
reinforcement learning. Here we do not need to pre-program the agent, as it learns from
its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its goal
is to find the diamond. The agent interacts with the environment by performing some
actions, and based on those actions, the state of the agent changes, and it also
receives a reward or penalty as feedback.
o The agent continues doing these three things (take an action, change state or remain in the
same state, and get feedback), and by doing these actions, it learns and explores the
environment.
o The agent learns which actions lead to positive feedback or rewards and which actions
lead to negative feedback or penalties. As a positive reward, the agent gets a positive point,
and as a penalty, it gets a negative point.



Key Features of Reinforcement Learning
o In RL, the agent is not instructed about the environment or which actions need to be
taken.
o It is based on a trial-and-error process.
o The agent takes the next action and changes state according to the feedback from the
previous action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it to obtain the
maximum positive reward. A toy sketch of this loop appears below.
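
The reward-driven loop described above can be sketched with a tiny tabular Q-learning update; the five-state "corridor" environment, the learning rate, the discount factor, and the exploration rate below are all invented for illustration:

```python
# A toy sketch of the reinforcement learning loop using tabular Q-learning.
# The 5-state "corridor" environment and all hyperparameters are made up.
import random

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    state = 0                       # the agent starts at the left end
    while state != n_states - 1:    # the rightmost state holds the "diamond"
        # Explore with probability epsilon, otherwise exploit what was learned.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])

        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else -0.01  # delayed reward

        # Q-learning update: move the estimate toward reward + discounted future value.
        best_next = max(Q[next_state])
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state

print([[round(q, 2) for q in row] for row in Q])
```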

Reinforcement Learning Applications



Knowledge Discovery in Databases (KDD)

Knowledge Discovery in Databases (KDD) is an automatic, exploratory analysis and modelling of


large data repositories. KDD is the organized process of identifying valid, novel, useful, and
understandable patterns from large and complex data sets. Data Mining (DM) is the core of the
KDD process, involving the inferring of algorithms that explore the data, develop the model and
discover previously unknown patterns. The model is used for understanding phenomena from the
data, analysis and prediction.

The accessibility and abundance of data today makes knowledge discovery and Data Mining a
matter of considerable importance and necessity. Given the recent growth of the field, it is not
surprising that a wide variety of methods is now available to the researchers and practitioners. No
one method is superior to others for all cases. The handbook of Data Mining and Knowledge
Discovery from Data aims to organize all significant methods developed in the field into a
coherent and unified catalog; presents performance evaluation approaches and techniques; and
explains with cases and software tools the use of the different methods.

SEMMA Model

SEMMA is a sequential methodology for building machine learning models, incorporated in 'SAS
Enterprise Miner', a product by SAS Institute Inc., one of the largest producers of commercial
statistical and business intelligence software. The sequential steps guide the
development of a machine learning system. Let's look at the five sequential steps to understand
it better.



SAS Institute defines data mining as the process of Sampling, Exploring, Modifying, Modelling,
and Assessing (SEMMA) large amounts of data to uncover previously unknown patterns which
can be utilized as a business advantage. The data mining process is applicable across a variety of
industries and provides methodologies for such diverse business problems as fraud detection,
householding, customer retention and attrition, database marketing, market segmentation, risk
analysis, affinity analysis, customer satisfaction, bankruptcy prediction, and portfolio analysis.

Enterprise Miner software is an integrated product that provides an end-to-end business solution
for data mining.

A graphical user interface (GUI) provides a user-friendly front end to the SEMMA data mining
process:

• Sample: Sample the data by creating one or more data tables. The samples should be
large enough to contain the significant information, yet small enough to process.



• Explore: Explore the data by searching for anticipated relationships, unanticipated trends,
and anomalies in order to gain understanding and ideas.
• Modify: Modify the data by creating, selecting, and transforming the variables to focus
the model selection process.
• Model: Model the data by using the analytical tools to search for a combination of the
data that reliably predicts a desired outcome.
• Assess: Assess the data by evaluating the usefulness and reliability of the findings from
the data mining process.



UNIT-2 |MACHINE LEARNING PERSPECTIVE OF DATA| 7th SEM | MACHINE LEARNING

Scales of Measurement
Data can be classified as being on one of four scales: nominal, ordinal,
interval or ratio. Each level of measurement has some important properties
that are useful to know.

Properties of Measurement Scales:


• Identity – Each value on the measurement scale has a unique meaning.
• Magnitude – Values on the measurement scale have an ordered
relationship to one another. That is, some values are larger and some are
smaller.
• Equal intervals – Scale units along the scale are equal to one another.
For example, the difference between 1 and 2 is equal to the
difference between 11 and 12.
• A minimum value of zero – The scale has a true zero point, below which
no values exist.

1. Nominal Scale –

Nominal variables can be placed into categories. These don’t have a numeric
value and so cannot be added, subtracted, divided or multiplied. These also
have no order, and nominal scale of measurement only satisfies the identity
property of measurement.

For example, gender is an example of a variable that is measured on a
nominal scale. Individuals may be classified as "male" or "female", but
neither value represents more or less "gender" than the other.
2. Ordinal Scale –

The ordinal scale contains things that you can place in order. It measures a
variable in terms of magnitude, or rank. Ordinal scales tell us relative order,
but give us no information regarding differences between the categories. The
ordinal scale has the property of both identity and magnitude.

For example, in a race, if Ram takes first place and Vidur takes second place, we
do not know how close the competition was, i.e. by how many seconds.
3. Interval Scale –

An interval scale has ordered numbers with meaningful divisions; the
magnitude between consecutive intervals is equal. Interval scales do
not have a true zero, e.g. in Celsius, 0 degrees does not mean the absence of
heat.


Interval scales have the properties of:


• Identity
• Magnitude
• Equal distance

For example, temperature on a Fahrenheit/Celsius thermometer: 90° is
hotter than 45°, and the difference between 10° and 30° is the same as the
difference between 60° and 80°.
4. Ratio Scale –

The ratio scale of measurement is similar to the interval scale in that it also
represents quantity and has equality of units with one major difference: zero
is meaningful (no numbers exist below the zero). The true zero allows us to
know how many times greater one case is than another. Ratio scales have
all of the characteristics of the nominal, ordinal and interval scales.

The simplest example of a ratio scale is the measurement of length. Having
zero length or zero money means that there is no length and no money, but
zero temperature (in Celsius or Fahrenheit) is not an absolute zero.
Properties of Ratio Scale:
• Identity
• Magnitude
• Equal distance
• Absolute/true zero
For example, in distance, 10 miles is twice as long as 5 miles.

Dealing with Missing Data


Real-world data often has a lot of missing values. If you want
your model to work accurately and without bias, you can't simply
ignore the missing values in your data. One of the most
common problems faced in data cleansing or pre-processing is
handling missing values. The purpose of this section is to discuss
techniques to handle missing data efficiently.


What is Missing Data?

Missing data means the absence of observations in columns. It appears
as values such as "0", "NA", "NaN", "NULL", "Not Applicable", or
"None".

Why dataset has Missing values?

The cause can be data corruption, failure to record data, lack of
information, incomplete results, a person not providing the data
intentionally, some system or equipment failure, etc. There could be any number of
reasons for missing values in your dataset.

Why to handle Missing values?

One of the biggest impacts of missing data is that it can bias the results of
the machine learning models or reduce the accuracy of the model.
So, it is very important to handle missing values.

How to check Missing Data?

The first step in handling missing values is to look at the data
carefully and find all the missing values. To check for
missing values in a Python Pandas DataFrame, we use functions like
isnull() and notnull(), which check whether a value is
NaN and return boolean values.
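
The sketch below shows these checks on a small, invented DataFrame; the fill-with-mean step at the end is one common strategy assumed here for illustration, not the only option:

```python
# A minimal sketch of checking missing values with pandas; the DataFrame
# below is invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 32, 41],
    "salary": [50000, 60000, np.nan, 80000],
})

print(df.isnull())          # True where a value is NaN
print(df.isnull().sum())    # count of missing values per column
print(df.notnull().all())   # True for columns with no missing values

# One common (assumed) strategy: fill missing values with the column mean.
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
```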


Normalization in Machine Learning


Normalization is one of the most frequently used data preparation techniques, which
helps us change the values of numeric columns in the dataset to a common scale.

Although normalization is not mandatory for all datasets in machine learning, it is
used whenever the attributes of the dataset have different ranges. It helps to enhance the
performance and reliability of a machine learning model.

What is Normalization in Machine Learning?


Normalization is a scaling technique in Machine Learning applied during data
preparation to change the values of numeric columns in the dataset to use a
common scale. It is not necessary for all datasets in a model. It is required only when
features of machine learning models have different ranges.

Mathematically, we can calculate normalization with the below formula:

Xn = (X - Xminimum) / ( Xmaximum - Xminimum)

o Xn = Value of Normalization
o Xmaximum = Maximum value of a feature
o Xminimum = Minimum value of a feature

Example: Suppose a feature in our dataset has some maximum and minimum values. To
normalize it, the values are shifted and rescaled so that they range between 0 and 1. This
technique is also known as Min-Max scaling; a small sketch follows.
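
The sketch below applies the Min-Max formula both by hand and with scikit-learn's MinMaxScaler; the feature values are invented:

```python
# A minimal Min-Max normalization sketch; the feature values are invented.
# Xn = (X - Xmin) / (Xmax - Xmin) rescales every value into the range [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([20.0, 30.0, 50.0, 100.0])

# By hand, directly from the formula:
X_normalized = (X - X.min()) / (X.max() - X.min())
print(X_normalized)                       # [0.    0.125 0.375 1.   ]

# The same result using scikit-learn's MinMaxScaler (assumed available):
scaler = MinMaxScaler()
print(scaler.fit_transform(X.reshape(-1, 1)).ravel())
```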

Normalization techniques in Machine Learning


Although there are many feature normalization techniques in Machine Learning,
a few of them are most frequently used. These are as follows:

o Min-Max Scaling: This technique is also referred to as scaling. As we have already


discussed above, the Min-Max scaling method helps the dataset to shift and rescale
the values of their attributes, so they end up ranging between 0 and 1.
o Standardization scaling:


Standardization scaling is also known as Z-score normalization, in which values are
centred on the mean with a unit standard deviation: the mean of the attribute
becomes zero and the resulting distribution has a unit standard deviation.
Mathematically, we can calculate the standardization by subtracting the mean
from the feature value and dividing by the standard deviation.

Hence, standardization can be expressed as follows:

X' = (X - µ) / σ

Here, µ represents the mean of the feature values, and σ represents the standard
deviation of the feature values.

However, unlike the Min-Max scaling technique, feature values are not restricted to a
specific range in the standardization technique.

This technique is helpful for machine learning algorithms that use distance
measures, such as KNN, K-means clustering, and Principal Component Analysis.
Further, standardization assumes that the data is approximately
normally distributed.
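
A small standardization sketch follows; the feature values are invented and scikit-learn's StandardScaler is assumed to be available:

```python
# A minimal standardization (Z-score) sketch; the feature values are invented.
# X' = (X - mean) / std, giving zero mean and unit standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

print(X_standardized.ravel())        # centred on 0 with unit standard deviation
print(X_standardized.mean().round(6), X_standardized.std().round(6))
```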

Feature Engineering for Machine Learning


Feature engineering is the pre-processing step of machine learning, which is used to
transform raw data into features that can be used to create a predictive model using
machine learning or statistical modelling. Feature engineering in machine learning aims
to improve the performance of models. In this topic, we will understand the details of
feature engineering in machine learning. But before going into the details, let's first understand
what a feature is and why feature engineering is needed.

What is a feature?
Generally, all machine learning algorithms take input data to generate the output.
The input data remains in a tabular form consisting of rows (instances or
observations) and columns (variable or attributes), and these attributes are often
known as features. For example, an image is an instance in computer vision, but a
line in the image could be the feature. Similarly, in NLP, a document can be an


observation, and the word count could be a feature. So, we can say a feature is an
attribute that impacts a problem or is useful for the problem.

What is Feature Engineering?


Feature engineering is the pre-processing step of machine learning which
extracts features from raw data. It helps to represent an underlying problem to
predictive models in a better way, which, as a result, improves the accuracy of the
model for unseen data. The predictive model contains predictor variables and an
outcome variable, and the feature engineering process selects the most useful
predictor variables for the model.

Since 2016, automated feature engineering has also been used in different machine learning
software to help automatically extract features from raw data. Feature
engineering in ML contains mainly four processes: Feature Creation,
Transformations, Feature Extraction, and Feature Selection.

These processes are described as below:

1. Feature Creation: Feature creation is finding the most useful variables to be
used in a predictive model. The process is subjective, and it requires human
creativity and intervention. New features are created by combining existing
features using addition, subtraction, and ratios, and these new features have
great flexibility.
2. Transformations: The transformation step of feature engineering involves
adjusting the predictor variables to improve the accuracy and performance of
the model. For example, it ensures that the model is flexible enough to take input
from a variety of data; it ensures that all the variables are on the same scale,
making the model easier to understand. It improves the model's accuracy and
ensures that all the features are within an acceptable range to avoid any
computational error.
3. Feature Extraction: Feature extraction is an automated feature engineering
process that generates new variables by extracting them from the raw data.
The main aim of this step is to reduce the volume of data so that it can be
easily used and managed for data modelling. Feature extraction methods
include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
4. Feature Selection: While developing the machine learning model, only a few
variables in the dataset are useful for building the model, and the rest features
are either redundant or irrelevant. If we input the dataset with all these
redundant and irrelevant features, it may negatively impact and reduce the
overall performance and accuracy of the model. Hence it is very important to
identify and select the most appropriate features from the data and remove
the irrelevant or less important features, which is done with the help of feature
selection in machine learning. "Feature selection is a way of selecting the
subset of the most relevant features from the original features set by
removing the redundant, irrelevant, or noisy features."

Below are some benefits of using feature selection in machine learning (a small sketch follows the list):

o It helps in avoiding the curse of dimensionality.


o It helps in the simplification of the model so that the researchers can easily
interpret it.
o It reduces the training time.
o It reduces overfitting hence enhancing the generalization.
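
The sketch below shows one simple filter-style feature selection method; the iris dataset, the ANOVA F-test scorer, and k = 2 are assumptions chosen only for illustration:

```python
# A minimal feature-selection sketch using a filter method (SelectKBest),
# assuming scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep only the 2 features most related to the target according to an ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Original feature count:", X.shape[1])
print("Selected feature count:", X_selected.shape[1])
print("Kept feature indices:", selector.get_support(indices=True))
```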

Correlation and Causation


1. Correlation :


Correlation is a statistical term which depicts the degree of association between two
random variables. In data analysis, it is often used to determine the extent
to which two variables relate to one another.
There are three types of correlation:
1. Positive correlation –
If random variable B increases as random variable A increases, or vice versa.
2. Negative correlation –
If an increase in random variable A leads to a decrease in B, or vice versa.
3. No correlation –
When the two variables are completely unrelated and a change in one
leads to no change in the other. A small sketch of measuring correlation follows.
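
The sketch below measures the three kinds of correlation on synthetic variables using the Pearson correlation coefficient; the data-generating relationships are invented for the example:

```python
# A minimal sketch of measuring correlation; the two variables are synthetic.
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b_pos = 2 * a + rng.normal(scale=0.5, size=500)   # roughly positively correlated
b_neg = -2 * a + rng.normal(scale=0.5, size=500)  # roughly negatively correlated

print("Positive correlation:", np.corrcoef(a, b_pos)[0, 1].round(2))
print("Negative correlation:", np.corrcoef(a, b_neg)[0, 1].round(2))
print("No correlation:", np.corrcoef(a, rng.normal(size=500))[0, 1].round(2))
```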

2. Causation :

Causation between random variables A and B implies that A and B have a
cause-and-effect relationship with one another. In other words, the existence of
one gives rise to the other, and we say A causes B or vice versa. Causation is
also termed causality.
Correlation does not imply causation.
Correlation and causation can also exist at the same time, so correlation
definitely does not imply causation. The example below shows this difference
more clearly.


A dead battery causes the computer to shut down and also causes the video
player to stop; this shows the causality of the battery over the laptop and the video
player. The fact that the video player stops the moment the computer shuts down
shows that the two are correlated, and more specifically, positively correlated.

Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship
between a dependent(y) and independent variable(x) as nth degree polynomial. The
Polynomial Regression equation is given below:

y = b0 + b1x + b2x^2 + b3x^3 + ...... + bnx^n


o It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modelled using a
linear model."

Need for Polynomial Regression:


The need of Polynomial Regression in ML can be understood in the below points:

o If we apply a linear model to a linear dataset, it gives us a good result, as we
have seen in Simple Linear Regression. But if we apply the same model, without any
modification, to a non-linear dataset, it will produce a drastically worse output: the
loss function will increase, the error rate will be high, and accuracy will
decrease.
o So for such cases, where data points are arranged in a non-linear fashion, we
need the Polynomial Regression model. We can understand this better by
comparing a linear dataset and a non-linear dataset.


o If we take a dataset that is arranged non-linearly and try to cover it with a linear
model, we can clearly see that it hardly covers any data points. On the other hand,
a curve is suitable to cover most of the data points, which is what the Polynomial
model does.
o Hence, if the data points are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.

Note: A Polynomial Regression algorithm is also called Polynomial Linear
Regression because it is linear not in the variables but in the coefficients.

Equation of the Polynomial Regression Model:


Simple Linear Regression equation: y = b0 + b1x .........(a)

Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + .... + bnxn .........(b)

Polynomial Regression equation: y = b0 + b1x + b2x^2 + b3x^3 + .... + bnx^n ..........(c)

When we compare the above three equations, we can clearly see that all three
are polynomial equations but differ by the degree of the variables. The Simple
and Multiple Linear equations are also polynomial equations with a single degree,
and the Polynomial Regression equation is a linear equation with nth-degree terms. So if
we add higher-degree terms to our linear equations, they are converted into Polynomial
Linear equations. A small sketch using polynomial features is shown below.
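
The sketch below converts a single feature into polynomial features and fits them with a linear model; the quadratic data-generating function and the chosen degree (2) are assumptions for illustration:

```python
# A minimal polynomial regression sketch, assuming scikit-learn.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1 + 2 * X.ravel() + 0.5 * X.ravel() ** 2 + rng.normal(0, 0.3, 100)

# The original feature x is expanded into [x, x^2] and fitted with a linear model.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print("R^2 on the training data:", model.score(X, y).round(3))
print("Prediction at x = 2:", model.predict([[2.0]])[0].round(2))
```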


Note: To better understand Polynomial Regression, you must have knowledge of Simple
Linear Regression.

Logistic Regression (Binary Classification) in


Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or
1, true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is much similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify observations using different types of
data and can easily determine the most effective variables for the classification.
The logistic (sigmoid) function is described below.


Note: Logistic regression uses the concept of predictive modeling as regression;
therefore, it is called logistic regression. However, it is used to classify samples;
therefore, it falls under the classification algorithms.

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map predicted values to
probabilities.
o It maps any real value to another value within the range 0 to 1.
o The value of the logistic regression must be between 0 and 1 and cannot go
beyond this limit, so it forms a curve like an "S". The S-shaped curve is called the
sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and
values below the threshold tend to 0.

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:


The Logistic Regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get the Logistic Regression equation are given
below:

o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the
above equation by (1 - y):

y / (1 - y)

o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the
equation, it becomes:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression. A small classification sketch follows.
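
The sketch below fits a logistic regression classifier on a built-in binary dataset; the dataset, the solver iteration limit, and the default 0.5 threshold are assumptions made for illustration:

```python
# A minimal logistic regression sketch, assuming scikit-learn and its
# breast-cancer dataset as an example binary-classification problem.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # max_iter raised so the solver converges
clf.fit(X_train, y_train)

# predict_proba returns probabilities between 0 and 1 (the sigmoid output);
# predict applies the default 0.5 threshold to turn them into class labels 0/1.
print(clf.predict_proba(X_test[:3]).round(3))
print(clf.predict(X_test[:3]))
print("Test accuracy:", clf.score(X_test, y_test).round(3))
```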

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".

AUC-ROC Curve in Machine Learning


In Machine Learning, only developing an ML model is not sufficient as we also need
to see whether it is performing well or not. It means that after building an ML model,
we need to evaluate and validate how good or bad it is, and for such cases, we use
different Evaluation Metrics. AUC-ROC curve is such an evaluation metric that is used
to visualize the performance of a classification model. It is one of the popular and
important metrics for evaluating the performance of the classification model. In this
topic, we are going to discuss more details about the AUC-ROC curve.

Note: For a better understanding of this article, we suggest you first understand the
Confusion Matrix, as AUC-ROC uses terminologies used in the Confusion matrix.


What is AUC-ROC Curve?


AUC-ROC curve is a performance measurement metric of a classification model at
different threshold values. Firstly, let's understand ROC (Receiver Operating
Characteristic curve) curve.

ROC Curve
ROC or Receiver Operating Characteristic curve represents a probability graph
to show the performance of a classification model at different threshold levels.
The curve is plotted between two parameters, which are:

o True Positive Rate or TPR


o False Positive Rate or FPR

In the curve, TPR is plotted on Y-axis, whereas FPR is on the X-axis.

TPR:
TPR or True Positive Rate is a synonym for Recall, and can be calculated as:

TPR = TP / (TP + FN)

FPR or False Positive Rate can be calculated as:

FPR = FP / (FP + TN)

Here, TP: True Positive

FP: False Positive

TN: True Negative

FN: False Negative

Now, to efficiently calculate the values at any threshold level, we need a method,
which is AUC.


AUC: Area Under the ROC Curve

AUC stands for Area Under the ROC curve. As its name suggests, AUC calculates
the two-dimensional area under the entire ROC curve, ranging from (0,0) to (1,1).

In the ROC curve, AUC computes the performance of the binary classifier across
different thresholds and provides an aggregate measure. The value of AUC ranges
from 0 to 1; an excellent model will have an AUC near 1, and hence it will
show a good measure of separability. A small sketch computing the ROC curve and AUC follows.
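
The sketch below computes the ROC curve and the AUC for a simple classifier; the dataset and model are illustrative assumptions, and only the metric calls matter here:

```python
# A minimal AUC-ROC sketch, assuming scikit-learn; dataset and model choices
# are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FPR vs TPR at many thresholds
print("Number of thresholds evaluated:", len(thresholds))
print("AUC:", roc_auc_score(y_test, scores).round(3))  # closer to 1 is better
```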

Applications of the AUC-ROC Curve
Although the AUC-ROC curve is used to evaluate a classification model, it is widely
used for various applications. Some of the important applications of AUC-ROC are
given below:

1. Classification of 3D models

The curve is used to classify a 3D model and separate it from the normal models.
With the specified threshold level, the curve classifies the non-3D models and separates out
the 3D models.

2. Healthcare
The curve has various applications in the healthcare sector. It can be used to detect
cancer in patients. It does this by using false positive and false negative rates,
and accuracy depends on the threshold value used for the curve.
3. Binary Classification
AUC-ROC curve is mainly used for binary classification problems to evaluate their
performance.

MACHINE LEARNING UNIT-3
Introduction to Machine Learning Algorithms: Decision Trees, Support Vector Machine, k-
Nearest Neighbors, Time-Series Forecasting, Clustering, Principal Component Analysis (PCA)

Decision Trees in Machine Learning


In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. As the name goes, it uses a
tree-like model of decisions. Though it is a commonly used tool in data
mining for deriving a strategy to reach a particular goal, it is also widely
used in machine learning, which will be the main focus of this section.
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any
further branches.
o The decisions or the test are performed on the basis of features of the given
dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.

How can an algorithm be represented as a tree?

For this, let's consider a very basic example that uses the Titanic data set for
predicting whether a passenger will survive or not. The model below uses 3
features/attributes/columns from the data set, namely sex, age and
sibsp (the number of siblings/spouses aboard).

A decision tree is drawn upside down with its root at the top. In the
image on the left, the bold text in black represents a
condition/internal node, based on which the tree splits into
branches/ edges. The end of the branch that doesn’t split anymore is
the decision/leaf, in this case, whether the passenger died or survived,
represented as red and green text respectively.

Although a real dataset will have many more features and this would just
be one branch in a much bigger tree, you can't ignore the simplicity of
this algorithm. The feature importance is clear and relations can
be viewed easily. This methodology is more commonly known
as learning decision tree from data and above tree is
called Classification tree as the target is to classify passenger as
survived or died. Regression trees are represented in the same
manner, just they predict continuous values like price of a house. In
general, Decision Tree algorithms are referred to as CART or
Classification and Regression Trees.
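A minimal sketch of learning a classification tree from data, in the spirit of the Titanic example above; the tiny hand-made dataset and its encoding are purely illustrative:

```python
# Minimal sketch: fit a CART classification tree on a Titanic-like toy dataset
# (sex, age, sibsp -> survived). Assumes scikit-learn and pandas; data is illustrative.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "sex":      [0, 1, 1, 0, 0, 1, 0, 1],    # 0 = male, 1 = female (encoded)
    "age":      [22, 38, 26, 35, 54, 2, 27, 14],
    "sibsp":    [1, 1, 0, 0, 0, 3, 0, 1],    # siblings/spouses aboard
    "survived": [0, 1, 1, 0, 0, 1, 0, 1],
})

X, y = data[["sex", "age", "sibsp"]], data["survived"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)   # CART by default
print(export_text(tree, feature_names=["sex", "age", "sibsp"]))        # text view of the splits
```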

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Advantages of CART

• Simple to understand, interpret, visualize.

• Decision trees implicitly perform variable screening or feature selection.

• Can handle both numerical and categorical data. Can also handle multi-output problems.

• Decision trees require relatively little effort from users for data preparation.

• Nonlinear relationships between parameters do not affect
tree performance.

Disadvantages of CART

• Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.

• Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.

• Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.

• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting the decision tree.

Support Vector Machine-
What is the Support Vector Machine?

“Support Vector Machine” (SVM) is a supervised machine learning


algorithm that can be used for both classification and regression challenges.
However, it is mostly used in classification problems. In the SVM algorithm, we
plot each data item as a point in n-dimensional space (where n is a number of
features you have) with the value of each feature being the value of a particular
coordinate. Then, we perform classification by finding the hyper-plane that
differentiates the two classes very well (look at the below snapshot).

Support Vectors are simply the coordinates of individual observation. The SVM
classifier is a frontier that best segregates the two classes (hyper-plane/ line).
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a model that can
accurately identify whether it is a cat or dog, so such a model can be created by using the SVM
algorithm. We will first train our model with lots of images of cats and dogs so that it can learn
about different features of cats and dogs, and then we test it with this strange creature. Since the
SVM creates a decision boundary between these two classes (cat and dog) and chooses the
extreme cases (support vectors), it will look at the extreme cases of cat and dog. On the basis of the
support vectors, it will classify it as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data, and the classifier used is called the Non-linear SVM classifier.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green
or blue. Consider the below image:

Since it is a 2-d space, by just using a straight line we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the below
image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both the classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third dimension
z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in a 3-d space, it looks like a plane parallel to the x-axis. If we
convert it back to 2-d space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
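A minimal sketch of a non-linear SVM on circular data, where no straight line can separate the classes; here the RBF kernel plays the role of the extra z = x² + y² dimension described above. The dataset and parameters are illustrative:

```python
# Minimal sketch: linear vs. non-linear (RBF-kernel) SVM on circular data.
# Assumes scikit-learn is installed; data and parameters are illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # struggles: no separating straight line exists
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel trick lifts the data implicitly

print("Linear SVM accuracy:", linear_svm.score(X, y))
print("RBF SVM accuracy:   ", rbf_svm.score(X, y))
```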

k-Nearest Neighbors-
• KNN which stands for K Nearest Neighbor is a Supervised Machine Learning algorithm
that classifies a new data point into the target class, depending on the features of its
neighboring data points.
• K nearest neighbors or KNN Algorithm is a simple algorithm which uses the entire dataset
in its training phase. Whenever a prediction is required for an unseen data instance, it
searches through the entire training dataset for k-most similar instances and the data
with the most similar instance is finally returned as the prediction.
• k-NN is often used in search applications where you are looking for similar items, like find
items similar to this one.

Features of KNN Algorithm


The KNN algorithm has the following features:
1. KNN is a Supervised Learning algorithm that uses labeled input data set to predict the output
of the data points.
2. It is one of the simplest Machine learning algorithms and it can be easily implemented for a
varied set of problems.
3. It is mainly based on feature similarity. KNN checks how similar a data point is to its neighbor
and classifies the data point into the class it is most similar to.
4. Unlike most algorithms, KNN is a nonparametric model which means that it does not make any
assumptions about the data set. This makes the algorithm more effective since it can handle
realistic data.
5. KNN is a lazy algorithm, this means that it memorizes the training data set instead of learning
a discriminative function from the training data.
6. KNN can be used for solving both classification and regression problems.

Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but
we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new image that are similar to the cat and dog images, and based on the most similar features
it will put it in either the cat or dog category.

Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new
data point x1; in which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:

o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

o As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
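A minimal sketch of the K-NN steps above (k = 5, Euclidean distance) using scikit-learn; the two-feature toy points and category labels are illustrative:

```python
# Minimal sketch: classify a new point with K-NN (k = 5, Euclidean distance).
# Assumes scikit-learn and NumPy; the toy data is illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])   # known data points
y = np.array(["A", "A", "A", "B", "B", "B"])                      # their categories

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
new_point = np.array([[3, 4]])
print("Assigned category:", knn.predict(new_point))   # majority vote among the 5 neighbours
```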

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K can reduce the effect of noise, but they may smooth over local patterns and increase computation.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:
o It always needs to determine the value of K, which may sometimes be complex.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.

Time-Series Forecasting:
Time series forecasting is one of the most applied data science techniques in business, finance,
supply chain management, production and inventory planning. Many prediction problems
involve a time component and thus require extrapolation of time series data, or time series
forecasting. Time series forecasting is also an important area of machine learning (ML) and can
be cast as a supervised learning problem. ML methods such as Regression, Neural Networks,
Support Vector Machines, Random Forests and XGBoost can be applied to it. Forecasting
involves taking models fit on historical data and using them to predict future observations.

Time series forecasting means to forecast or to predict the future value over a period of time. It
entails developing models based on previous data and applying them to make observations and
guide future strategic decisions.

The future is forecast or estimated based on what has already happened. Time series adds a
time order dependence between observations. This dependence is both a constraint and a
structure that provides a source of additional information. Before we discuss time series
forecasting methods, let’s define time series forecasting more closely.

Time series forecasting is a technique for the prediction of events through a sequence of time.
It predicts future events by analyzing the trends of the past, on the assumption that future
trends will hold similar to historical trends. It is used across many fields of study in various
applications including:

• Astronomy
• Business planning
• Control engineering
• Earthquake prediction
• Econometrics
• Mathematical finance
• Pattern recognition
• Resources allocation
• Signal processing
• Statistics
• Weather forecasting

Time series models


Time series models are used to forecast events based on verified historical data. Common types
include ARIMA, smoothing-based, and moving-average models. Not all models will yield the same results for
the same dataset, so it's critical to determine which one works best based on the individual time
series.

When forecasting, it is important to understand your goal. To narrow down the specifics of your
predictive modeling problem, ask questions about:

1. Volume of data available — more data is often more helpful, offering greater
opportunity for exploratory data analysis, model testing and tuning, and model
fidelity.
2. Required time horizon of predictions — shorter time horizons are often easier to
predict — with higher confidence — than longer ones.
3. Forecast update frequency — Forecasts might need to be updated frequently over
time or might need to be made once and remain static (updating forecasts as new
information becomes available often results in more accurate predictions).
4. Forecast temporal frequency — Often forecasts can be made at lower or higher
frequencies, which allows harnessing downsampling and up-sampling of data (this
in turn can offer benefits while modeling).

Types of time series methods used for forecasting


Times series methods refer to different ways to measure timed data. Common types include:
Autoregression (AR), Moving Average (MA), Autoregressive Moving Average (ARMA),
Autoregressive Integrated Moving Average (ARIMA), and Seasonal Autoregressive Integrated
Moving-Average (SARIMA).
The important thing is to select the appropriate forecasting method based on the
characteristics of the time series data.
Smoothing-based models
In time series forecasting, data smoothing is a statistical technique that involves removing
outliers from a time series data set to make a pattern more visible. Inherent in the collection of
data taken over time is some form of random variation. Smoothing data removes or reduces
random variation and shows underlying trends and cyclic components.

Moving-average model
In time series analysis, the moving-average model (MA model), also known as moving-average
process, is a common approach for modeling univariate time series. The moving-average model
specifies that the output variable depends linearly on the current and various past values of a
stochastic (imperfectly predictable) term.

Together with the autoregressive (AR) model (covered below), the moving-average model is a
special case and key component of the more general ARMA and ARIMA models of time series,
which have a more complicated stochastic structure.
Contrary to the AR model, the finite MA model is always stationary.
Exponential Smoothing model
Exponential smoothing is a rule of thumb technique for smoothing time series data using the
exponential window function. Exponential smoothing is an easily learned and easily applied
procedure for making some determination based on prior assumptions by the user, such as
seasonality. Different types of exponential smoothing include single exponential smoothing,
double exponential smoothing, and triple exponential smoothing (also known as the Holt-
Winters method). For tutorials on how to use Holt-Winters out of the box with InfluxDB, see
“When You Want Holt-Winters Instead of Machine Learning” and “Using InfluxDB to Predict The
Next Extinction Event”.
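A minimal sketch of single exponential smoothing implemented directly in plain Python; the smoothing factor alpha and the demand series are illustrative values, not from the notes:

```python
# Minimal sketch: single exponential smoothing of a time series.
# alpha and the demand values are illustrative.
def single_exponential_smoothing(series, alpha=0.3):
    """Return the smoothed series; the last smoothed value is the one-step-ahead forecast."""
    smoothed = [series[0]]                               # initialise with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [120, 132, 128, 141, 150, 147, 160]
smoothed = single_exponential_smoothing(demand, alpha=0.3)
print("One-step-ahead forecast:", round(smoothed[-1], 2))
```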

Clustering:
A cluster refers to a small group of objects. Clustering is grouping those objects into clusters. In
order to learn clustering, it is important to understand the scenarios that lead us to cluster different
objects.
What is Clustering?
• Clustering is dividing data points into homogeneous classes or clusters:
• Points in the same group are as similar as possible.
• Points in different groups are as dissimilar as possible.
• When a collection of objects is given, we put the objects into groups based on similarity.
Clustering Algorithms -

• A Clustering Algorithm tries to analyze natural groups of data on the basis of some
similarity. It locates the centroid of the group of data points. To carry out effective
clustering, the algorithm evaluates the distance between each point from the centroid of
the cluster.
• The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.

Applications of Clustering
• Listed here are few more applications, which would add to what you have learnt.
1. Clustering helps marketers improve their customer base and work on the target areas. It helps
group people (according to different criteria such as willingness, purchasing power, etc.) based
on their similarity in many ways related to the product under consideration.
2. Clustering helps in identification of groups of houses on the basis of their value, type and
geographical locations.
3. Clustering is used to study earthquakes. Based on the areas hit by an earthquake in a region,
clustering can help analyze the next probable location where an earthquake can occur.
Clustering is a type of unsupervised learning mechanism. It basically analyzes the points and
clusters them based on similarities and dissimilarities.
• Applications of Clustering in different fields:
1. Marketing : It can be used to characterize & discover customer segments for marketing
purposes.
2. Biology : It can be used for classification among different species of plants and animals.
3. Libraries : It is used in clustering different books on the basis of topics and information.
4. Insurance : It is used to acknowledge the customers, their policies and identifying the frauds.
5. City Planning : It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.

6. Earthquake studies : By learning the earthquake affected areas we can determine the
dangerous zones.

Types of Clustering Methods


The clustering methods are broadly divided into Hard clustering (each data point belongs to
only one group) and Soft Clustering (a data point can belong to more than one group). But
various other approaches to clustering also exist. Below are the main clustering
methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known
as the centroid-based method. The most common example of partitioning clustering is
the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K is used to define the
number of pre-defined groups. The cluster center is created in such a way that the
distance between the data points of one cluster is minimum as compared to another
cluster centroid.

Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and
the arbitrarily shaped distributions are formed as long as the dense region can be
connected. This algorithm does it by identifying different clusters in the dataset and
connects the areas of high densities into clusters. The dense areas in data space are
divided from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.

Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability of how a dataset belongs to a particular distribution. The grouping is done by
assuming some distributions commonly Gaussian Distribution.

The example of this type is the Expectation-Maximization Clustering algorithm that
uses Gaussian Mixture Models (GMM).

Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there
is no requirement of pre-specifying the number of clusters to be created. In this technique,
the dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. The observations or any number of clusters can be selected by cutting
the tree at the correct level. The most common example of this method is
the Agglomerative Hierarchical algorithm.

Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than
one group or cluster. Each dataset has a set of membership coefficients, which depend
on the degree of membership to be in a cluster. Fuzzy C-means algorithm is the example
of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms
The Clustering algorithms can be divided based on their models that are explained above.
There are different types of clustering algorithms published, but only a few are commonly
used. The clustering algorithm is based on the kind of data that we are using. Such as,
some algorithms need to guess the number of clusters in the given dataset, whereas some
are required to find the minimum distance between the observation of the dataset.

Here we are discussing mainly popular Clustering algorithms that are widely used in
machine learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of equal
variances. The number of clusters must be specified in this algorithm. It is fast with fewer
computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth
density of data points. It is an example of a centroid-based model, that works on updating
the candidates for centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications
with Noise. It is an example of a density-based model similar to the mean-shift, but with
some remarkable advantages. In this algorithm, the areas of high density are separated by
the areas of low density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative for the k-means algorithm or for those cases where K-means can fail. In
GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can be
represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require
specifying the number of clusters. In this, each data point sends messages between
pairs of data points until convergence. It has O(N²T) time complexity, which is the main
drawback of this algorithm.
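A minimal sketch of K-Means, the most common partitioning algorithm listed above, using scikit-learn; the blob data, k = 3 and random_state are illustrative choices:

```python
# Minimal sketch: K-Means clustering of unlabeled points.
# Assumes scikit-learn is installed; data and parameters are illustrative.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabeled points

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster IDs:", kmeans.labels_[:10])             # cluster-ID per point
```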

Principal Component Analysis (PCA):
Principal Component Analysis (PCA) is an unsupervised, non-parametric
statistical technique primarily used for dimensionality reduction in
machine learning.
• Principal Component Analysis is an unsupervised learning
algorithm that is used for the dimensionality reduction in machine
learning.
• It is a statistical process that converts the observations of
correlated features into a set of linearly uncorrelated features with
the help of orthogonal transformation. These new transformed
features are called the Principal Components. It is one of the
popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from
the given dataset by reducing the variances.
• PCA generally tries to find the lower-dimensional surface to project
the high-dimensional data.
• High dimensionality means that the dataset has a large number of
features. The primary problem associated with high-dimensionality
in the machine learning field is model overfitting, which reduces
the ability to generalize beyond the examples in the training set.
• Richard Bellman described this phenomenon in 1961 as the Curse
of Dimensionality: “Many algorithms that work fine in low
dimensions become intractable when the input is high-
dimensional.”

• The ability to generalize correctly becomes exponentially harder as
the dimensionality of the training dataset grows, as the training set
covers a dwindling fraction of the input space. Models also become
more efficient as the reduced feature set boosts learning rates and
diminishes computation costs by removing redundant features.
• PCA can also be used to filter noisy datasets, such as image
compression. The first principal component expresses the most
amount of variance. Each additional component expresses less
variance and more noise, so representing the data with a smaller
subset of principal components preserves the signal and discards
the noise.

The PCA algorithm is based on some mathematical concepts such as:


• Variance and Covariance
• Eigenvalues and Eigenvectors

Principal Components in PCA
As described, the transformed new features or the output of PCA are the
Principal Components. The number of these PCs are either equal to or
less than the original features present in the dataset. Some properties of
these principal components are given below:
1. The principal component must be the linear combination of the
original features.
2. These components are orthogonal, i.e., the correlation between a pair
of variables is zero.
3. The importance of each component decreases when going from 1 to n; it
means the 1st PC has the most importance, and the nth PC will have the least
importance.
Applications of Principal Component Analysis:
• PCA is mainly used as the dimensionality reduction technique in
various AI applications such as computer vision, image
compression, etc.
• It can also be used for finding hidden patterns if data has high
dimensions. Some fields where PCA is used are Finance, data
mining, Psychology, etc.

Steps for PCA algorithm


1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where
X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent the two-
dimensional matrix of independent variable X. Here each row corresponds to the data

items, and the column corresponds to the Features. The number of columns is the
dimensions of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Such as in a particular column, the features
with high variance are more important compared to the features with lower variance.
If the importance of features is independent of the variance of the feature, then we will
divide each data item in a column with the standard deviation of the column. Here we
will name the matrix as Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of Z.
5. Calculating the Eigenvalues and Eigenvectors
Now we need to calculate the eigenvalues and eigenvectors for the resultant covariance
matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes with high
information, and the corresponding eigenvalues measure how much variance lies along each direction.
6. Sorting the Eigenvectors
In this step, we will take all the eigenvalues and sort them in decreasing order, which
means from largest to smallest, and simultaneously sort the eigenvectors accordingly in
a matrix P of eigenvectors. The resultant sorted matrix will be named P*.
7. Calculating the new features or Principal Components
Here we will calculate the new features. To do this, we will multiply the Z matrix by P*.
In the resultant matrix Z*, each observation is a linear combination of the original features.
Each column of the Z* matrix is independent of the others.
8. Remove less important features from the new dataset.
The new feature set is now obtained, so we will decide here what to keep and what to
remove. That is, we will only keep the relevant or important features in the new
dataset, and the unimportant features will be removed.
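A minimal NumPy sketch of the PCA steps above (standardise, covariance, eigen-decomposition, sort, project); the random data, variable names (Z, P*, Z*) and the choice of k = 2 are illustrative:

```python
# Minimal sketch of the PCA steps described above, using NumPy only.
# The 100 x 5 random data matrix and k = 2 are illustrative.
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))        # illustrative data

Z = (X - X.mean(axis=0)) / X.std(axis=0)                   # step 3: standardise
cov = np.cov(Z, rowvar=False)                              # step 4: covariance of Z
eigvals, eigvecs = np.linalg.eigh(cov)                     # step 5: eigenvalues/eigenvectors

order = np.argsort(eigvals)[::-1]                          # step 6: sort, largest first
P_star = eigvecs[:, order]

k = 2                                                      # step 8: keep the top-k components
Z_star = Z @ P_star[:, :k]                                 # step 7: new features (principal components)
print("Reduced shape:", Z_star.shape)                      # (100, 2)
```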

UNIT-3 | Introduction to Machine Learning Algorithms (BIT, Durg)

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a
machine learning model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.

Decision Tree Terminologies


Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.



Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches the leaf node of
the tree. The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in step-3. Continue this process until a stage is reached where you cannot further
classify the nodes; the final node is then called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by ASM). The root node splits further
into the next decision node (distance from the office) and one leaf node based on
the corresponding labels. The next decision node further gets split into one decision
node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:

Attribute Selection Measures


While implementing a Decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. By this
measurement, we can easily select the best attribute for the nodes of the tree. There
are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of
a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
a node/attribute having the highest information gain is split first. It can be calculated
using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − ∑j Pj²
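A minimal sketch of the entropy and Gini formulas above for a binary split; the class proportions used in the examples are illustrative:

```python
# Minimal sketch: entropy and Gini impurity for a node with two classes (yes/no).
# The probabilities passed in the examples are illustrative.
import math

def entropy(p_yes, p_no):
    """Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)."""
    terms = [p for p in (p_yes, p_no) if p > 0]        # treat 0 * log(0) as 0
    return -sum(p * math.log2(p) for p in terms)

def gini(p_yes, p_no):
    """Gini Index = 1 - sum_j P_j^2."""
    return 1 - (p_yes ** 2 + p_no ** 2)

print("Entropy of a 50/50 node:", entropy(0.5, 0.5))   # 1.0 (maximum impurity)
print("Gini of a 50/50 node:   ", gini(0.5, 0.5))      # 0.5
print("Entropy of a pure node: ", entropy(1.0, 0.0))   # 0.0
```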

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed the
Support Vector Machine. Consider the below diagram in which there are two
different categories that are classified using a decision boundary or hyperplane:

The objective of the support vector machine algorithm is to find a hyperplane in an


N-dimensional space(N — the number of features) that distinctly classifies the data
points.

Possible hyperplanes

To separate the two classes of data points, there are many possible hyperplanes that
could be chosen. Our objective is to find a plane that has the maximum margin, i.e
the maximum distance between data points of both classes. Maximizing the margin
distance provides some reinforcement so that future data points can be classified
with more confidence.

Hyperplanes in 2D and 3D feature space

Hyperplanes are decision boundaries that help classify the data points. Data points
falling on either side of the hyperplane can be attributed to different classes. Also,
the dimension of the hyperplane depends upon the number of features. If the
number of input features is 2, then the hyperplane is just a line. If the number of
input features is 3, then the hyperplane becomes a two-dimensional plane. It
becomes difficult to imagine when the number of features exceeds 3.

K-Nearest Neighbor (KNN) Algorithm


o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available
cases and put the new case into the category that is most similar to the available
categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
the similarity. This means when new data appears, it can be easily classified into
a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training
set immediately instead it stores the dataset and at the time of classification, it
performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
but we want to know whether it is a cat or a dog. For this identification, we can use the

KNN algorithm, as it works on a similarity measure. Our KNN model will find the
features of the new image that are similar to the cat and dog images, and based on the
most similar features it will put it in either the cat or dog category.

Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a
new data point x1; in which of these categories will this data point lie? To solve this
type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily
identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already studied
in geometry. It can be calculated as:

o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

o As we can see the 3 nearest neighbors are from category A, hence this new data
point must belong to category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
o Large values for K can reduce the effect of noise, but they may smooth over local patterns and increase computation.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o It always needs to determine the value of K, which may sometimes be complex.
o The computation cost is high because of calculating the distance between the data
points for all the training samples.

Time Series Forecasting


Time Series is a certain sequence of data observations that a system collects within specific
periods of time — e.g., daily, monthly, or yearly. The specialized models are used to analyze
the collected time-series data — describe and interpret them, as well as make certain
assumptions based on shifts and odds in the collection. These shifts and odds may include
the switch of trends, seasonal spikes in demand, certain repetitive changes or non-systematic
shifts in usual patterns, etc.

All the previously, recently, and currently collected data is used as input for time series
forecasting where future trends, seasonal changes, irregularities, and such are elaborated
based on complex math-driven algorithms. And with machine learning, time series
forecasting becomes faster, more precise, and more efficient in the long run. ML has proven
to help better process both structured and unstructured data flows, swiftly capturing
accurate patterns within masses of data.

Applications of Machine Learning Time Series Forecasting

 Stock prices forecasting — the data on the history of stock prices combined with the
data on both regular and irregular stock market spikes and drops can be used to gain
insightful predictions of the most probable upcoming stock price shifts.
 Demand and sales forecasting — customer behaviour patterns data along with inputs
from the history of purchases, timeline of demand, seasonal impact, etc., enable ML
models to point out the most potentially demanded products and hit the spot in the
dynamic market.

 Web traffic forecasting — common data on usual traffic rates among competitor
websites is bunched up with input data on traffic-related patterns in order to predict
web traffic rates during certain periods.
 Climate and weather prediction — time-based data is regularly gathered from
numerous interconnected weather stations worldwide, while ML techniques allow to
thoroughly analyze and interpret it for future forecasts based on statistical dynamics.
 Demographic and economic forecasting — there are tons of statistical inputs in
demographics and economics, which is most efficiently used for ML-based time-
series predictions. As a result, the most fitting target audience can be picked and the
most efficient ways to interact with that particular TA can be elaborated.
 Scientific studies forecasting — ML and deep learning principles accelerate the rates
of polishing up and introducing scientific innovations dramatically. For instance,
science data that requires an indefinite number of analytical iterations can be
processed much faster with the help of patterns automated by machine learning.

Principal Component Analysis (PCA)


Principal component analysis, or PCA, is a dimensionality-reduction method that is often
used to reduce the dimensionality of large data sets, by transforming a large set of variables
into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy,
but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because
smaller data sets are easier to explore and visualize, and they make analyzing data much easier
and faster for machine learning algorithms without extraneous variables to process.

So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set
while preserving as much information as possible.

Clustering in Machine Learning


Clustering or cluster analysis is a machine learning technique, which groups the
unlabelled dataset. It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points. The objects with the possible
similarities remain in a group that has less or no similarities with another
group."

It does it by finding some similar patterns in the unlabelled dataset such as shape,
size, colour, behaviour, etc., and divides them as per the presence and absence of
those similar patterns.

It is an unsupervised learning method; hence no supervision is provided to the
algorithm, and it deals with the unlabeled dataset.

After applying this clustering technique, each cluster or group is provided with a
cluster-ID. ML system can use this id to simplify the processing of large and complex
datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhere similar to the classification algorithm, but the difference is
the type of dataset that we are using. In classification, we work with the labelled data
set, whereas in clustering, we work with the unlabelled dataset.

Example: Let's understand the clustering technique with the real-world example of
Mall: When we visit any shopping mall, we can observe that the things with similar

usage are grouped together. For example, the t-shirts are grouped in one section, and
trousers are in another section; similarly, in the vegetable section, apples, bananas,
mangoes, etc., are grouped in separate sections, so that we can easily find the
things. The clustering technique also works in the same way. Other examples of
clustering are grouping documents according to the topic.

The clustering technique can be widely used in various tasks. Some most common
uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations as per the past searches of
products. Netflix also uses this technique to recommend movies and web series
to its users as per their watch history.

The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several groups with similar properties.

Types of Clustering Methods


The clustering methods are broadly divided into Hard clustering (each data point belongs
to only one group) and Soft Clustering (a data point can belong to more than one group).
But various other approaches to clustering also exist. Below are the
main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Applications of Clustering
Below are some commonly known applications of clustering technique in Machine
Learning:

o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets
into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers
based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
o In Land Use: The clustering technique is used in identifying areas of similar land
use in the GIS database. This can be very useful to find out for what purpose a
particular piece of land should be used, i.e., for which purpose it is more suitable.
Bias and Variance
Machine learning is a branch of Artificial Intelligence, which allows machines to perform data
analysis and make predictions. However, if the machine learning model is not accurate, it can
make prediction errors, and these prediction errors are usually known as Bias and Variance.

In machine learning, these errors will always be present, as there is always a slight difference
between the model predictions and the actual values. The main aim of ML/data science
analysts is to reduce these errors in order to get more accurate results.

What is Bias?
In general, a machine learning model analyses the data, finds patterns in it and makes
predictions. While training, the model learns these patterns in the dataset and
applies them to test data for prediction. While making predictions, a difference
occurs between the prediction values made by the model and the actual/expected
values, and this difference is known as bias error or error due
to bias. It can be defined as the inability of machine learning algorithms such as
Linear Regression to capture the true relationship between the data points. Each
algorithm begins with some amount of bias, because bias arises from assumptions in
the model, which make the target function simpler to learn. A model has either:

o Low Bias: A low bias model will make fewer assumptions about the form of the
target function.
o High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias model
also cannot perform well on new data.
Generally, a linear algorithm has a high bias, as this makes it learn fast. The simpler
the algorithm, the higher the bias that is likely to be introduced, whereas a nonlinear
algorithm often has low bias.

Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. In contrast, algorithms with high bias include Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Ways to Reduce High Bias:

High bias mainly occurs because the model is too simple. Below are some ways to reduce high bias (a small sketch of the last point follows the list):

o Increase the number of input features, as the model is underfitted.
o Decrease the regularization term.
o Use a more complex model, for example by including some polynomial features.
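
As an illustrative sketch of the last point (the synthetic data and scikit-learn setup are assumptions, not part of the original notes), the code below fits a plain linear model and a degree-2 polynomial pipeline to data with a quadratic trend; the extra polynomial features reduce the bias, which shows up as a much lower training error.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Illustrative data with a quadratic relationship
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)

# High-bias model: plain linear regression underfits the curve
linear = LinearRegression().fit(X, y)

# Lower-bias model: polynomial features make the target easier to capture
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear MSE:", mean_squared_error(y, linear.predict(X)))
print("poly   MSE:", mean_squared_error(y, poly.predict(X)))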

What is a Variance Error?

Variance specifies how much the prediction would vary if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between the input and output variables. Variance errors are either low variance or high variance.

Low variance means there is a small variation in the prediction of the target function
with changes in the training data set. At the same time, High variance shows a large
variation in the prediction of the target function with changes in the training dataset.

A model that shows high variance learns a lot from the training dataset and performs well on it, but does not generalize well to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset.

Since a high-variance model learns too much from the dataset, it leads to overfitting. A model with high variance has the following problems:

o A high variance model leads to overfitting.
o It increases model complexity.

Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.

Ways to Reduce High Variance:

o Reduce the number of input features or parameters, as the model is overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term (see the sketch below).
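
For the regularization point, here is a minimal, hedged sketch (an assumed setup using scikit-learn's Ridge regression on synthetic data): refitting a flexible polynomial model on several different training samples shows that a larger regularization term alpha keeps the learned coefficients much more stable, i.e., it reduces the variance.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def fit_coeffs(alpha, seed):
    """Fit a degree-9 polynomial Ridge model on a small random sample."""
    rs = np.random.RandomState(seed)
    X = rs.uniform(0, 1, size=(20, 1))
    y = np.sin(2 * np.pi * X).ravel() + rs.normal(scale=0.2, size=20)
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=alpha))
    model.fit(X, y)
    return model.named_steps["ridge"].coef_

# Coefficients jump around between training samples when alpha is tiny (high variance)
# and stay much more stable when alpha is larger (variance reduced by regularization).
for alpha in (1e-6, 1.0):
    spread = np.std([fit_coeffs(alpha, seed) for seed in range(5)], axis=0).mean()
    print(f"alpha={alpha}: average coefficient spread across samples = {spread:.3f}")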

Bagging
Machine learning uses several techniques to build models and improve their performance. Ensemble learning methods help improve the accuracy of classification and regression models. This section discusses one of the most popular ensemble learning algorithms, i.e., bagging in machine learning.

Bagging, also known as Bootstrap Aggregating, is an ensemble learning technique that helps to improve the performance and accuracy of machine learning algorithms. It is used to deal with the bias-variance trade-off and reduces the variance of a prediction model. Bagging avoids overfitting of data and is used for both regression and classification models, specifically for decision tree algorithms.
Steps to Perform Bagging
o Consider a training set with n observations and m features. Select a random sample from the training dataset with replacement (a bootstrap sample).
o A subset of the m features is chosen randomly to create a model using the sampled observations.
o The feature offering the best split is used to split the nodes.
o The tree is grown, so you have the best root nodes.
o The above steps are repeated n times, and the outputs of the individual decision trees are aggregated to give the best prediction.
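
A minimal sketch of these steps with scikit-learn's BaggingClassifier (the synthetic dataset and parameter values are assumptions for illustration, not prescribed by the notes): each tree is fit on a bootstrap sample and the individual predictions are aggregated by voting.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 100 decision trees, each fit on a bootstrap sample of the training set
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # named base_estimator in older scikit-learn versions
    n_estimators=100,
    bootstrap=True,                       # sample observations with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
print("test accuracy:", bagging.score(X_test, y_test))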

Advantages of Bagging in Machine Learning

o Bagging minimizes the overfitting of data.
o It improves the model's accuracy.
o It deals efficiently with higher-dimensional data.
Gradient Boosting
Machine learning is one of the most popular technologies to build predictive models
for various complex regression and classification tasks. Gradient Boosting
Machine (GBM) is considered one of the most powerful boosting algorithms.

Boosting is a popular ensemble modelling technique used to build strong classifiers from various weak classifiers. It starts by building a primary model from the available training data and then identifies the errors present in that base model. After the errors are identified, a second model is built, then a third, and so on. This process of introducing more models continues until the combined model predicts the training data correctly.

AdaBoost (Adaptive Boosting) was the first boosting algorithm in the history of machine learning to combine various weak classifiers into a single strong classifier. It primarily focuses on classification tasks such as binary classification.

Steps in Boosting Algorithms:

There are a few important steps in a boosting algorithm, as follows:

1. Consider a dataset with different data points and initialize it.
2. Give equal weight to each of the data points.
3. Provide these weights as input to the model.
4. Identify the data points that are incorrectly classified.
5. Increase the weights of the data points identified in step 4.
6. If you get an appropriate output, terminate the process; otherwise repeat steps 2 to 5.
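
Before the worked example below, here is a minimal AdaBoost sketch (assumed synthetic data and scikit-learn; illustrative only) that implements this reweighting idea: each new weak learner concentrates on the points the previous learners misclassified.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Illustrative binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: each new weak learner focuses on the points the previous ones misclassified
# (their weights are increased at every round, as in the steps above).
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))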

Example:
Let's suppose we have three different models with their predictions, and they work in completely different ways. For example, a linear regression model shows a linear relationship in the data, while a decision tree model attempts to capture the non-linearity in the data.

Further, instead of using these models separately to predict the outcome, if we use them in series or in combination, then we get a resulting model with better information than any of the base models. In other words, instead of using each model's individual prediction, if we use the average prediction from these models, we are able to capture more information from the data. This is referred to as ensemble learning, and boosting is also based on ensemble methods in machine learning.

Boosting Algorithms in Machine Learning

There are primarily 4 boosting algorithms in machine learning. These are as follows:

o Gradient Boosting Machine (GBM)
o Extreme Gradient Boosting Machine (XGBM)
o Light GBM
o CatBoost

What is Gradient Boosting Machine (GBM) in Machine Learning?
Gradient Boosting Machine (GBM) is one of the most popular forward learning
ensemble methods in machine learning. It is a powerful technique for building
predictive models for regression and classification tasks.

GBM helps us to get a predictive model in the form of an ensemble of weak prediction models such as decision trees. When decision trees are used as the weak learners, the resulting algorithm is called gradient-boosted trees.

It enables us to combine the predictions from various learner models and build a final predictive model with more accurate predictions.

But here one question may arise: if we are applying the same algorithm, how can multiple decision trees give better predictions than a single decision tree? Moreover, how does each decision tree capture different information from the same data?

The answer is that a different subset of features is taken by the nodes of each decision tree to select the best split. It means that each tree behaves differently, and hence captures different signals from the same data.
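
A minimal gradient-boosted trees sketch with scikit-learn (the synthetic regression data and hyperparameter values are assumptions for illustration): each new shallow tree is fit to the residual errors left by the ensemble built so far, and its contribution is scaled by the learning rate.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Illustrative regression data
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 shallow trees is fit to the residuals left by the trees before it,
# and its contribution is scaled by the learning rate.
gbm = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(X_train, y_train)
print("test R^2:", gbm.score(X_test, y_test))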

Stacking in Machine Learning


There are many ways to ensemble models in machine learning, such as Bagging, Boosting, and Stacking. Stacking is one of the most popular ensemble machine learning techniques, used to combine the predictions of multiple models to build a new model and improve model performance. Stacking enables us to train multiple models to solve similar problems and, based on their combined output, it builds a new model with improved performance.

Stacking is one of the popular ensemble modeling techniques in machine learning. Various weak learners are ensembled in a parallel manner, and by combining them with a meta learner we can obtain better predictions for the future.

This ensemble technique works by feeding the combined predictions of multiple weak learners to a meta learner so that a better output prediction model can be achieved.

In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to best combine the input predictions to make a better output prediction.

Stacking is also known as stacked generalization and is an extended form of the Model Averaging Ensemble technique, in which the sub-models contribute according to their performance to build a new model with better predictions. This new model is stacked on top of the others, which is the reason why it is named stacking.

Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of two or more base/learner models and a meta-model that combines the predictions of the base models. The base models are called level-0 models, and the meta-model is known as the level-1 model. So, the stacking ensemble method involves original (training) data, primary-level models, primary-level predictions, a secondary-level model, and the final prediction. The basic architecture of stacking consists of the following components:

o Original data: This data is divided into n folds and is used as the test data or training data.
o Base models: These models are also referred to as level-0 models. They use the training data and provide their predictions (level-0 predictions) as output.
o Level-0 Predictions: Each base model is trained on some portion of the training data and provides different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model includes one meta-model, which helps to best combine the predictions of the base models. The meta-model is also known as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the predictions of the base models and is trained on the different predictions made by the individual base models; i.e., data not used to train the base models is fed to the meta-model, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model.
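
A minimal sketch of this architecture with scikit-learn's StackingClassifier (the synthetic data and the particular base/meta models are assumptions for illustration, not prescribed by the notes):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0 (base) models produce out-of-fold predictions on the training data...
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# ...and the level-1 (meta) model learns how to combine those predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,                 # internal cross-validation produces the level-0 predictions
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))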
K-Fold Cross-Validation
Cross-validation is a technique for validating model efficiency by training the model on a subset of the input data and testing it on a previously unseen subset of the input data. We can also say that it is a technique to check how a statistical model generalizes to an independent dataset.

In machine learning, there is always the need to test the stability of the model, which means we cannot judge the model based only on the training dataset it was fit on. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from the general train-test split.

Hence the basic steps of cross-validation are:

o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.
o Now, evaluate the model's performance using the validation set. If the model performs well with the validation set, perform the further steps; otherwise check for issues.

Methods used for Cross-Validation

There are some common methods that are used for cross-validation. These methods are given below:

1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave-one-out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation

K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples
of equal sizes. These samples are called folds. For each learning set, the prediction
function uses k-1 folds, and the rest of the folds are used for the test set. This
approach is a very popular CV approach because it is easy to understand, and the
output is less biased than other methods.
The steps for k-fold cross-validation are:

o Split the input dataset into K groups.
o For each group:
o Take one group as the reserve or test data set.
o Use the remaining groups as the training dataset.
o Fit the model on the training set and evaluate the performance of the model using the test set.

Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.
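
A minimal sketch of 5-fold cross-validation with scikit-learn (assumed synthetic data, illustrative only): cross_val_score rotates the held-out fold exactly as described above and returns one score per fold.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5 folds: in each iteration one fold is held out for testing,
# the other 4 are used for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))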
