Machine learning is a branch of Artificial Intelligence that allows machines to analyse data and make predictions. However, if a machine learning model is not accurate, it makes prediction errors, and these prediction errors are usually described in terms of bias and variance. In machine learning, such errors are always present because there is always some difference between the model's predictions and the actual values. The main aim of ML/data science analysts is to reduce these errors in order to get more accurate results. In this topic, we are going to discuss bias, variance, the bias-variance trade-off, and underfitting and overfitting. But before starting, let's first understand what errors in machine learning are.
Errors in machine learning are broadly of two kinds: reducible errors (bias and variance), which can be lowered by improving the model, and irreducible errors, which remain present regardless of which algorithm has been used, because they are caused by unknown variables whose influence on the output cannot be reduced.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes predictions. While training, the model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known as the bias error, or error due to bias. Bias can be defined as the inability of a machine learning algorithm such as Linear Regression to capture the true relationship between the data points. Each algorithm begins with some amount of bias, because bias arises from the assumptions made in the model to make the target function simpler to learn. A model has either:
o Low Bias: A low-bias model makes fewer assumptions about the form of the target function.
o High Bias: A high-bias model makes more assumptions and becomes unable to capture the important features of the dataset. A high-bias model also performs poorly on new data.
Generally, linear algorithms have high bias, which is what makes them learn fast. The simpler the algorithm, the more bias it is likely to introduce, whereas nonlinear algorithms often have low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours, and Support Vector Machines, while algorithms with high bias include Linear Regression, Linear Discriminant Analysis, and Logistic Regression.
High bias mainly occurs when the model is too simple. It can be reduced by using a more complex model, adding more input features, or decreasing regularization.
What is Variance?
Variance specifies how much the estimate of the target function will change if different training data is used.
Low variance means there is only a small variation in the prediction of the target function when the training dataset changes, while high variance means the prediction of the target function varies a lot with changes in the training dataset.
A model with high variance learns a lot from the training dataset and performs well on it, but does not generalize well to unseen data. As a result, such a model gives good results on the training dataset but shows high error rates on the test dataset.
Because a high-variance model learns too much from the dataset, it leads to overfitting: the model becomes overly complex and fits the noise in the training data.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high variance. Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis, while algorithms with high variance include Decision Trees, Support Vector Machines, and k-Nearest Neighbours.
A quick way to tell the two problems apart: high bias shows up as high training error with a test error that is almost the same as the training error, whereas high variance shows up as low training error with a much higher test error.
Bias-Variance Trade-Off
While building a machine learning model, it is really important to take care of bias and variance in order to avoid overfitting and underfitting. If the model is very simple, with few parameters, it will tend to have low variance and high bias; if the model has a large number of parameters, it will tend to have high variance and low bias. So a balance has to be struck between the bias and variance errors, and this balance is known as the bias-variance trade-off.
For accurate predictions, an algorithm would ideally need both low variance and low bias. But this is not possible in practice, because bias and variance are related to each other: decreasing the variance tends to increase the bias, and decreasing the bias tends to increase the variance.
The bias-variance trade-off is a central issue in supervised learning. Ideally, we want a model that accurately captures the regularities in the training data and simultaneously generalizes well to unseen data. Unfortunately, it is not possible to do both perfectly at the same time: a high-variance algorithm may perform well on the training data but overfit to noise, whereas a high-bias algorithm produces a much simpler model that may not capture important regularities in the data. So we need to find a sweet spot between bias and variance to build an optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and
variance errors.
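A minimal sketch of this trade-off, using synthetic data and scikit-learn (the data, noise level, and choice of degrees are illustrative assumptions, not from the text): a degree-1 model underfits (high bias, both errors high), while a degree-15 model fits the training data almost perfectly but generalizes worse (high variance).

# Illustrative sketch: model complexity vs. training / test error (assumed data)
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # low -> balanced -> high model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),   # training error
          mean_squared_error(y_test, model.predict(X_test)))     # test error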
Testing: test the learned model on unseen test data to assess its accuracy.
What do we mean by Learning?
• Given
• a data set D,
• a task T, and
• a performance measure M,
a computer system is said to learn from D to perform the task T if, after learning, the system's performance on T improves as measured by M.
• In other words, the learned model helps the system to perform T better than it would with no learning.
An Example
• Data: Loan application data
• Task: Predict whether a loan should be approved or not.
• Performance measure: accuracy.
No learning: classify all future applications (test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.
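The "no learning" baseline can be written in a couple of lines; the 9 approved / 6 rejected labels below simply restate the counts used above.

# Majority-class baseline (labels restate the 9 Yes / 6 No counts above)
labels = ["Yes"] * 9 + ["No"] * 6
majority = max(set(labels), key=labels.count)
baseline_accuracy = labels.count(majority) / len(labels)
print(majority, baseline_accuracy)   # Yes 0.6 -> 60% accuracy with no learning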
Fundamental Assumption of Learning
Assumption: The distribution of training examples is identical to the distribution of test
examples (including future unseen examples).
• If the shape of the object is rounded and has a depression at the top, is red in color, then it
will be labeled as –Apple.
• If the shape of the object is a long curving cylinder having Green-Yellow color, then it will
be labeled as –Banana.
Now suppose that, after training on this data, the machine is given a new fruit from the basket, say a banana, and asked to identify it.
Since the machine has already learned from the previous data, it now has to apply that knowledge. It will first classify the fruit by its shape and colour, confirm the fruit name as BANANA, and put it in the banana category. Thus the machine learns from the training data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category, such as
“Red” or “blue” , “disease” or “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Steps
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (data held out from the labelled dataset) and then predicts the output.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle,
and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output, as in the sketch below.
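A toy sketch of those rules (the helper name classify_shape is hypothetical) shows how a new shape would be assigned a label from its number of sides:

# Hypothetical rule-based "model" mirroring the training rules above
def classify_shape(sides: int, all_sides_equal: bool) -> str:
    if sides == 4 and all_sides_equal:
        return "Square"
    if sides == 3:
        return "Triangle"
    if sides == 6 and all_sides_equal:
        return "Hexagon"
    return "Polygon"

print(classify_shape(4, True))    # Square
print(classify_shape(3, False))   # Triangle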
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under
supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are
two classes such as Yes-No, Male-Female, True-false, etc.
Spam Filtering,
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Advantages of Supervised learning:
o With the help of supervised learning, the model can predict the output on the basis of prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
o Supervised learning allows collecting data and produces data output from previous
experiences.
o Helps to optimize performance criteria with the help of experience.
o Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages of Supervised learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
o Classifying big data can be challenging.
Unsupervised Learning
For instance, suppose the model is given an image containing both dogs and cats, none of which it has ever seen. The machine has no idea about the features of dogs and cats, so it cannot categorize the image as "dogs and cats". But it can categorize the animals according to their similarities, patterns, and differences; that is, it can easily split the picture into two parts, the first containing all the pictures with dogs in them and the second containing all the pictures with cats in them. Here no learning has happened beforehand, which means there is no training data or labelled examples.
Unsupervised learning allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
• Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Common unsupervised learning algorithms:
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
(Note that the last three are dimensionality-reduction techniques rather than clustering methods.)
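A minimal clustering sketch with scikit-learn's K-means on synthetic, unlabelled data (the dataset and parameter choices are illustrative assumptions):

# K-means on synthetic blobs; the true labels are deliberately ignored
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)   # discovered group centres
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points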
Supervised vs. Unsupervised Machine Learning
Computational complexity: supervised machine learning generally uses simpler methods, whereas unsupervised machine learning is computationally more complex.
For the least squares regression line y = a + bx fitted to n data points:
b (slope) = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)
a (intercept) = (Σy·Σx² − Σx·Σxy) / (nΣx² − (Σx)²)
Where,
x and y are the two variables on the regression line,
b = slope of the line,
a = y-intercept of the line,
x = values of the first data set,
y = values of the second data set.
Solved Examples
Question: Find linear regression equation for the following two sets of data:
x 2 4 6 8
y 3 7 5 10
Solution:
Construct the following table:
x    y    x²    xy
2    3    4     6
4    7    16    28
6    5    36    30
8    10   64    80
Σx = 20   Σy = 25   Σx² = 120   Σxy = 144
b = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²) = (4 × 144 − 20 × 25) / (4 × 120 − 20²) = 76 / 80
b = 0.95
a = (Σy·Σx² − Σx·Σxy) / (nΣx² − (Σx)²) = (25 × 120 − 20 × 144) / (4 × 120 − 20²) = 120 / 80
a = 1.5
Linear regression is given by:
y = a + bx
y = 1.5 + 0.95 x
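As a quick check (not part of the original solution), NumPy's polyfit recovers the same slope and intercept for these four points:

# Verify the worked example: polyfit returns the least squares slope and intercept
import numpy as np

x = np.array([2, 4, 6, 8])
y = np.array([3, 7, 5, 10])
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)   # 0.95 1.5  ->  y = 1.5 + 0.95x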
Linear Regression
Problems with Solutions
Linear regression and modelling problems are presented below along with their solutions.
Review
If a plot of n pairs of data (x, y) from an experiment appears to indicate a "linear relationship" between y and x, then the method of least squares may be used to write a linear relationship between x and y.
The least squares regression line is the line that minimizes the sum of the squares (d1² + d2² + d3² + d4²) of the vertical deviations from each data point to the line (see Figure 1, which uses 4 points as an example).
Figure 1. Linear regression: the sum of squared vertical distances between the observed values and the predicted values (the line) is minimized.
The least squares regression line for a set of n data points is given by the equation of a line in slope-intercept form:
y = ax + b
(note that in this set of problems a denotes the slope and b the y-intercept).
• Problem 1
• Problem 2
a) Find the least square regression line for the following set of data
b) Plot the given points and the regression line in the same rectangular system of axes.
• Problem 3
The values of x and their corresponding values of y are shown in the table below
x 0 1 2 3 4
y 2 3 5 4 6
• Problem 4
The sales of a company (in million dollars) for each year are shown in the table below.
x     y     xy    x²
-2    -1    2     4
1     1     1     1
3     2     6     9
Σx = 2   Σy = 2   Σxy = 9   Σx² = 14
2.
We now use the above formula to calculate a and b as follows
a = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (3×9 − 2×2) / (3×14 − 2²) = 23/38
b) We now graph the regression line given by y = a x + b and the given points.
3.
x     y     xy    x²
-1    0     0     1
0     2     0     0
1     4     4     1
2     5     10    4
Σx = 2   Σy = 11   Σxy = 14   Σx² = 6
b) We now graph the regression line given by y = ax + b and the given points.
5.
x     y     xy    x²
0     2     0     0
1     3     3     1
2     5     10    4
3     4     12    9
4     6     24    16
Σx = 10   Σy = 20   Σxy = 49   Σx² = 30
We now calculate a and b using the least square regression formulas for a and b.
a = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (5×49 − 10×20) / (5×30 − 10²) = 0.9
b = (1/n)(Σy − aΣx) = (1/5)(20 − 0.9×10) = 2.2
b) Now that we have the least squares regression line y = 0.9x + 2.2, substitute x = 10 to find the corresponding value of y:
y = 0.9 × 10 + 2.2 = 11.2
7. a) We first change the variable x into t such that t = x - 2005 and therefore t represents the
number of years after 2005. Using t instead of x makes the numbers smaller and therefore
manageable. The table of values becomes.
t (years after 2005) 0 1 2 3 4
y (sales) 12 19 29 37 45
We now use the table to calculate a and b included in the least regression line formula.
t     y     ty    t²
0     12    0     0
1     19    19    1
2     29    58    4
3     37    111   9
4     45    180   16
Σt = 10   Σy = 142   Σty = 368   Σt² = 30
We now calculate a and b using the least squares regression formulas:
a = (nΣty − ΣtΣy) / (nΣt² − (Σt)²) = (5×368 − 10×142) / (5×30 − 10²) = 8.4
b = (1/n)(Σy − aΣt) = (1/5)(142 − 8.4×10) = 11.6
Example 9.9
Calculate the regression coefficient and obtain the lines of regression for the following data
Solution:
Regression coefficient of X on Y
(i) Regression equation of X on Y
= 0.929X+7.284
Example 9.10
Calculate the two regression equations of X on Y and Y on X from the data given below, taking deviations from the actual means of X and Y.
Solution:
= –0.25 (20)+44.25
= –5+44.25
= 39.25 (when the price is Rs. 20, the likely demand is 39.25)
Example 9.11
Obtain regression equation of Y on X and estimate Y when X=55 from the following
Solution:
(i) Regression coefficients of Y on X
(ii) Regression equation of Y on X
Y – 51.57 = 0.942(X – 48.29)
Y = 0.942X – 45.49 + 51.57
Y = 0.942X + 6.08
When X = 55: Y = 0.942(55) + 6.08 = 57.89
Example 9.12
Find the means of X and Y variables and the coefficient of correlation between them from the following
two regression equations:
2Y–X–50 = 0
3Y–2X–10 = 0.
Solution:
We are given 2Y – X – 50 = 0 … (1) and 3Y – 2X – 10 = 0 … (2).
Since both regression lines pass through the point of means (X̄, Ȳ), solving the two equations simultaneously gives the means. From (1), X = 2Y – 50; substituting into (2), 3Y – 2(2Y – 50) – 10 = 0, which gives Y = 90, and then X = 2(90) – 50 = 130. So X̄ = 130 and Ȳ = 90.
For the correlation coefficient, take (1) as the regression of Y on X: 2Y = X + 50, i.e. Y = 0.5X + 25, so byx = 0.5. Take (2) as the regression of X on Y: 2X = 3Y – 10, i.e. X = 1.5Y – 5, so bxy = 1.5. Since byx × bxy = 0.75 ≤ 1, this assumption is valid, and r = √(byx × bxy) = √0.75 ≈ 0.866.
Example 9.13
Find the means of X and Y variables and the coefficient of correlation between them from the following
two regression equations:
4X–5Y+33 = 0
20X–9Y–107 = 0
Solution:
We are given 4X – 5Y + 33 = 0 … (1) and 20X – 9Y – 107 = 0 … (2).
Solving the two equations simultaneously for the means: from (1), X = (5Y – 33)/4; substituting into (2), 5(5Y – 33) – 9Y – 107 = 0, so 16Y = 272 and Y = 17, and then X = (5 × 17 – 33)/4 = 13. Hence X̄ = 13 and Ȳ = 17.
If we first assume equation (1) to be the regression of X on Y and equation (2) to be the regression of Y on X, we get bxy = 5/4 = 1.25 and byx = 20/9 ≈ 2.22. But this is not possible, because both regression coefficients are greater than 1 and so their product exceeds 1. So our assumption is wrong. Therefore, treating equation (1) as the regression equation of Y on X and equation (2) as the regression equation of X on Y, we get byx = 4/5 = 0.8 and bxy = 9/20 = 0.45, and r = √(byx × bxy) = √0.36 = 0.6.
Example 9.16
For 5 pairs of observations the following results are obtained ∑X=15, ∑Y=25, ∑X2 =55, ∑Y2 =135,
∑XY=83 Find the equation of the lines of regression and estimate the value of X on the first line
when Y=12 and value of Y on the second line if X=8.
Solution:
Here n = 5, so X̄ = ΣX/n = 3 and Ȳ = ΣY/n = 5.
byx = (nΣXY – ΣXΣY) / (nΣX² – (ΣX)²) = (5 × 83 – 15 × 25) / (5 × 55 – 15²) = 40/50 = 0.8
bxy = (nΣXY – ΣXΣY) / (nΣY² – (ΣY)²) = (5 × 83 – 15 × 25) / (5 × 135 – 25²) = 40/50 = 0.8
Regression line of Y on X: Y – 5 = 0.8(X – 3), i.e. Y = 0.8X + 2.6. When X = 8, Y = 0.8 × 8 + 2.6 = 9.
Regression line of X on Y: X – 3 = 0.8(Y – 5), i.e. X = 0.8Y – 1. When Y = 12, X = 0.8 × 12 – 1 = 8.6.
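The arithmetic of Example 9.16 can be checked directly from the given sums (a small verification sketch, not part of the original solution):

# Check Example 9.16 using only the given summary statistics
n, Sx, Sy, Sxx, Syy, Sxy = 5, 15, 25, 55, 135, 83
x_bar, y_bar = Sx / n, Sy / n                       # 3, 5
byx = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)     # 40/50 = 0.8
bxy = (n * Sxy - Sx * Sy) / (n * Syy - Sy ** 2)     # 40/50 = 0.8

# Y on X:  Y - 5 = 0.8(X - 3)  ->  Y = 0.8X + 2.6
print(y_bar + byx * (8 - x_bar))    # Y when X = 8  -> 9.0
# X on Y:  X - 3 = 0.8(Y - 5)  ->  X = 0.8Y - 1
print(x_bar + bxy * (12 - y_bar))   # X when Y = 12 -> 8.6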
Unit-2
Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A that runs various advertisements every year and gets sales from them. The list below shows the advertisements made by the company in the last 5 years and the corresponding sales.
Now the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction of the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modelling, and determining the cause-and-effect relationship between variables.
In regression, we plot a graph between the variables that best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various scenarios in the real world where we need future predictions, such as weather conditions, sales prediction, marketing trends, etc. For such cases we need a technique which can make predictions accurately, and regression analysis is the statistical method used in machine learning and data science for exactly this. Below are some other reasons for using regression analysis:
o Regression estimates the relationship between the target and the independent variables.
o It is used to find trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the other factors.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each type has
its own importance on different scenarios, but at the core, all the regression methods analyze the effect of
the independent variable on dependent variables. Here we are discussing some important types of
regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using the
below image. Here we are predicting the salary of an employee on the basis of the year of
experience.
o Below is the mathematical equation for Linear regression:
Y = aX + b
The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the below image, in which the model is written as:
y = a0 + a1x + ε
Here, a0 is the intercept of the line, a1 is the linear regression coefficient (the slope), x is the independent variable, y is the dependent variable, and ε is the random error term. The x and y values in the training dataset are used to fit the Linear Regression model.
A Linear Regression model's main aim is to find the best fit line, i.e. the optimal values of the intercept and coefficients, such that the error is minimized.
In the above diagram,
• x is the independent variable, plotted on the x-axis, and y is the dependent variable, plotted on the y-axis.
• Black dots are the data points, i.e. the actual values.
• The blue line is the best fit line predicted by the model, i.e. the predicted values lie on the blue line.
The vertical distance between a data point and the regression line is known as the error or residual. Each data point has one residual, and the sum of all of them is known as the Sum of Residuals/Errors.
Mathematical approach: the best fit line is the one that minimizes the sum of squared residuals, i.e. Σ(yᵢ − ŷᵢ)², where yᵢ is an actual value and ŷᵢ is the corresponding predicted value.
Linear Regression: Hypothesis Function, Cost Function, and Gradient Descent
For simplicity, we will first consider Linear Regression with only one variable:-
Model Representation:-
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding y. h(x) is known as the hypothesis function.
Now the picture might seem clearer to you. Our main task is to design the function h.
What we are trying to achieve is to plot all the training examples on a graph, with the input variable on the independent (x) axis and the output on the y axis. In this way we have a direct plot of input against output. For example:
In the above example, we have data for different houses: for different land areas we have different prices. This is our training data. Now sketch this dataset on the graph.
Next time, whenever I enter the area of a new house, the model will automatically tell me the price of that house using this line.
I entered area = 30, and it predicted a price of approximately 195 dollars for us, which, according to our training set, is a reasonable price. So how do you teach your computer to predict a line that fits your dataset? Let's dive into the mathematics.
Don't be overwhelmed if you are not familiar with the hypothesis equation
h(x) = θ0 + θ1x.
Let me dive into the mathematics behind it. Before considering the formula, it helps to have a reference for the different terms used in it. You might be familiar with the formula for a line using the slope and y-intercept:
y = mx + b
(refer to Khan Academy if you are not familiar with this equation). This equation represents a line in slope-intercept form, and our hypothesis function is exactly the same as the equation of a line: θ1 is the slope (m) and θ0 is the intercept (b). Now you are familiar with the hypothesis function and why we are using it: of course we want to fit a line to our graph, and this is the equation of a line.
At this stage, our primary goal is to minimize the difference between the line and each point. This is done by tweaking the slope of the line (θ1) and its y-intercept (θ0). So we have to find the θ0 and θ1 for which the line has the smallest error.
What do I mean by minimum error? Let's consider our prediction above. The vertical segments show the distance of each point from the line; when we sum up these differences over all the points, we get the error of that line. So we have to minimize the error to obtain an optimal solution. We can move the line a little higher or lower, or change its angle, by tweaking the values of θ0 and θ1; but don't worry, our program will do the hard work for us.
To achieve this, we start with dummy values for θ0 and θ1, put them into our hypothesis function, and calculate the cost for that line, repeating this step until we reach the minimum cost. How will we know what the minimum cost is? I will come to that, but first have a look at the function that calculates the cost:
J(θ0, θ1) = (1/2m) Σᵢ₌₁..ₘ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
So we are subtracting each point's y value from the corresponding point on the line; the point on the line directly below a specific data point is found by putting that point's x value into the line equation. Now we sum up all these terms using the summation sigma; the upper limit of the sum equals the number of points, and each index refers to a particular training example, so i varies from 1 to m. We square each difference to account for negative values, and divide the sum by 2m; the extra factor of 2 simply makes the later computation (taking the derivative) easier, and you can also leave it out.
You now have your error (cost) function. Remember what I said about tweaking the values of θ0 and θ1: this is how we will calculate the cost for each pair of values of θ0 and θ1.
Now, our main task is to predict the price of a new house using this dataset. This is achieved using Linear Regression: we fit a line to our dataset in such a way that the total distance from each point is minimized.
One prediction would be the blue line above. Note that I have tried to draw the line so that it is close to all the points; we have to choose a line that fits our data set as well as possible.
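A minimal sketch of the whole idea, assuming made-up area/price numbers (they are illustrative only): the hypothesis h(x) = θ0 + θ1·x, the cost J(θ0, θ1), and a plain gradient descent loop that keeps tweaking θ0 and θ1 until the cost stops improving.

# One-variable linear regression by gradient descent (assumed toy data)
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])     # house area (made up)
y = np.array([70.0, 130.0, 195.0, 260.0, 320.0]) # house price (made up)

def cost(theta0, theta1):
    # J = (1/2m) * sum((h(x) - y)^2)
    m = len(x)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

theta0, theta1, lr = 0.0, 0.0, 0.001    # start from dummy values
for _ in range(50000):                  # repeat until the cost settles
    error = theta0 + theta1 * x - y     # h(x) - y for every training point
    theta0 -= lr * error.mean()         # partial derivative w.r.t. theta0
    theta1 -= lr * (error * x).mean()   # partial derivative w.r.t. theta1

print(theta0, theta1, cost(theta0, theta1))   # roughly 6.0, 6.3, cost about 1.0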
Assumptions of Linear Regression
1. Linearity: It states that the dependent variable Y should be linearly related to independent variables.
This assumption can be checked by plotting a scatter plot between both variables.
2. Normality: The X and Y variables should be normally distributed. Histograms, KDE plots, Q-Q plots
can be used to check the Normality assumption.
3. Homoscedasticity: The variance of the error terms should be constant, i.e. the spread of the residuals should be the same for all values of X. This assumption can be checked by plotting a residual plot; if it is violated, the points form a funnel shape, otherwise the spread stays roughly constant.
4. No multicollinearity: the independent variables should not be highly correlated with each other. This can be checked with the Variance Inflation Factor (VIF). In the VIF method, we pick each feature and regress it against all of the other features. For each such regression, the factor is calculated as
VIF = 1 / (1 − R²)
where R² is the coefficient of determination of that regression; its value lies between 0 and 1. As we see from the formula, the greater the value of R², the greater the VIF. Hence a greater VIF denotes greater correlation, in agreement with the fact that a higher R² denotes stronger collinearity. Generally, a VIF above 5 indicates high multicollinearity.
R-squared is a statistical measure that represents the goodness of fit of a regression model. The ideal value of R-squared is 1: the closer its value is to 1, the better the model fits.
R-squared compares the residual sum of squares (SSres) with the total sum of squares (SStot). The total sum of squares is the sum of squared vertical distances between the data points and the average (mean) line, and the residual sum of squares is the sum of squared vertical distances between the data points and the best-fitted line:
R² = 1 − SSres / SStot
where SSres is the residual sum of squares and SStot is the total sum of squares.
The goodness of fit of regression models can be analysed on the basis of the R-squared method: the closer the value of R-squared is to 1, the better the model.
Note: The value of R-squared can be negative when the fitted model is worse than simply predicting the average.
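A small sketch of the calculation, assuming y_true holds the observed values and y_pred the values predicted by some fitted model (the numbers reuse the earlier solved example y = 1.5 + 0.95x):

# R-squared from the residual and total sums of squares
import numpy as np

y_true = np.array([3.0, 7.0, 5.0, 10.0])
y_pred = np.array([3.4, 5.3, 7.2, 9.1])        # predictions of y = 1.5 + 0.95x

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)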
Limitation of using the R-squared method –
• The value of R-squared always increases or stays the same as new variables are added to the model, regardless of whether the newly added variable is significant (i.e. the value of R-squared never decreases on the addition of new attributes). As a result, non-significant attributes can be added to the model simply to increase the R-squared value.
• This happens because SStot is constant, while the regression model tries to decrease SSres by finding some correlation with the new attribute; hence the overall value of R-squared increases, which can lead to a poor regression model.
5. The error terms should be normally distributed. Q-Q plots and Histograms can be used to check the
distribution of error terms.
6. No autocorrelation: The error terms should be independent of each other. Autocorrelation can be tested using the Durbin-Watson test; the null hypothesis is that there is no autocorrelation. The value of the test statistic lies between 0 and 4, and a value of 2 indicates no autocorrelation.
Violation of the assumptions leads to a decrease in the accuracy of the model, so the predictions are not accurate and the error is high. For example, if the independence assumption is violated, then the relationship between the independent and dependent variables cannot be determined precisely.
There are various methods and techniques available to deal with violations of the assumptions. Some of them are:
• Transforming the variables towards a normal distribution using transformation functions such as the log transformation, the reciprocal transformation, or the Box-Cox transformation.
• Deriving a new feature by linearly combining the independent variables, for example by adding them together or performing some other mathematical operation on them.
• Performing an analysis designed for highly correlated variables, such as principal component analysis.
Types of Linear Regression
Linear regression can be further divided into two types of algorithm: Simple Linear Regression, with a single input variable, and Multiple Linear Regression, with more than one input variable.
A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship: a positive linear relationship, where the dependent variable increases as the independent variable increases, and a negative linear relationship, where the dependent variable decreases as the independent variable increases.
In the figure, the red points are the data points and the blue line is the predicted line for the training data. To get the predicted value, these data points are projected onto the line.
To summarize, our aim is to find the values of the coefficients that minimize the cost function. The most common cost function is the Mean Squared Error (MSE), which is equal to the average squared difference between the observed and predicted values. The coefficient values can be found using Gradient Descent: we start with some random values of the coefficients, compute the gradient of the cost function at these values, update the coefficients, and calculate the cost function again, repeating until we find a minimum of the cost function.
When working with linear regression, our main goal is to find the best fit line, which means that the error between the predicted values and the actual values should be minimized; the best fit line has the least error.
Different values of the weights or line coefficients (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line, and to do this we use a cost function.
Cost function:
o Different values of the weights or line coefficients (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights and measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σᵢ₌₁..ₙ (yᵢ − (a1xᵢ + a0))²
Where N is the total number of observations, yᵢ is the actual value of the i-th observation, and a1xᵢ + a0 is the corresponding predicted value.
Residuals: the distance between an actual value and the predicted value is called a residual. If the observed points are far from the regression line, the residuals will be large and so the cost function will be high; if the scatter points are close to the regression line, the residuals will be small and hence the cost function will be low.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
o This is done by selecting random initial values for the coefficients and then iteratively updating them to reach the minimum of the cost function.
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various candidate models is called optimization. It can be assessed with the below method:
1. R-squared method (discussed in detail in the R-squared section above).
Need for Polynomial Regression
o If we apply a linear model to a linearly distributed dataset, it gives a good result, as we have seen in Simple Linear Regression. But if we apply the same model, without any modification, to a non-linear dataset, the output becomes drastically worse: the loss function increases, the error rate is high, and the accuracy decreases.
o So for cases where the data points are arranged in a non-linear fashion, we need the Polynomial Regression model. We can understand this better using the comparison diagram of a linear dataset and a non-linear dataset.
o In the above image, we have taken a dataset which is arranged non-linearly. If we try to cover it with a linear model, we can clearly see that it hardly covers any data points, whereas a curve, i.e. the Polynomial model, is able to cover most of them.
o Hence, if a dataset is arranged in a non-linear fashion, we should use the Polynomial Regression model instead of Simple Linear Regression.
Note: A Polynomial Regression algorithm is also called Polynomial Linear Regression because it does
not depend on the variables, instead, it depends on the coefficients, which are arranged in a linear
fashion.
The three equations in question are:
Simple Linear Regression: y = b0 + b1x
Multiple Linear Regression: y = b0 + b1x1 + b2x2 + ... + bnxn
Polynomial Regression: y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
When we compare these three equations, we can clearly see that all of them are polynomial equations that differ only in the degree of the variables. The Simple and Multiple Linear equations are polynomial equations of degree one, and the Polynomial regression equation is an equation, still linear in the coefficients, of degree n. So if we add degree to our linear equations, they are converted into Polynomial Linear equations.
Note: To better understand Polynomial Regression, you must have knowledge of Simple Linear
Regression.
Here we will implement Polynomial Regression using Python. We will understand it by comparing the Polynomial Regression model with the Simple Linear Regression model. So first, let's understand the problem for which we are going to build the model.
Problem description: A Human Resources company is going to hire a new candidate. The candidate has stated that his previous salary was 160K per annum, and HR has to check whether he is telling the truth or bluffing. To verify this, they only have a dataset from his previous company, in which the salaries of the top 10 positions are listed together with their levels. By inspecting the dataset, we find that there is a non-linear relationship between the position levels and the salaries. Our goal is to build a bluffing-detector regression model, so that HR can hire an honest candidate. Below are the steps to build such a model.
Steps for Polynomial Regression:
o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.
Note: Here we will build the Linear regression model as well as the Polynomial Regression model, to compare their predictions; the Linear regression model is only for reference.
The data pre-processing step will remain the same as in the previous regression models, except for a few changes. In the Polynomial Regression model we will not use feature scaling, and we will also not split the dataset into a training set and a test set, for two reasons:
o The dataset contains very little data, so splitting it into a test and training set would leave the model unable to find the correlation between the salaries and the levels.
o In this model we want very accurate predictions of the salary, so the model should be given all the available information.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('Position_Salaries.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, 1:2].values
y = data_set.iloc[:, 2].values
Explanation:
o In the above lines of code, we have imported the Python libraries needed to import the dataset and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains three columns (Position, Level, and Salary), of which we will use only two (Level and Salary).
o After that, we have extracted the dependent variable (y) and the independent variable (x) from the dataset. For the x variable we have taken the index range [:, 1:2], because we want index 1 (the levels) and have included :2 so that x is a matrix rather than a vector.
Output:
As we can see in the above output, there are three columns (Position, Level, and Salary), but we consider only two of them, because the Position column is equivalent to the Level column (it can be seen as the encoded form of the positions).
Here we will predict the output for level 6.5, because the candidate has 4+ years of experience as a regional manager, so he must be somewhere between levels 6 and 7.
Now we will build and fit the Linear regression model to the dataset. In building the Polynomial regression model we will take the Linear regression model as a reference and compare both results. The code is given below:
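(A minimal sketch consistent with the description that follows: a LinearRegression object named lin_regs fitted on x and y.)

# Fitting the Simple Linear Regression model to the dataset
from sklearn.linear_model import LinearRegression
lin_regs = LinearRegression()
lin_regs.fit(x, y)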
In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).
Output:
Now we will build the Polynomial Regression model, which is a little different from the Simple Linear model, because here we use the PolynomialFeatures class of the sklearn.preprocessing library to add extra (polynomial) features to our dataset. The code is given below:
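(A minimal sketch consistent with the description below: PolynomialFeatures with degree = 2 producing an x_poly matrix, and a second LinearRegression object named lin_reg_2.)

# Fitting the Polynomial Regression model to the dataset
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_regs = PolynomialFeatures(degree=2)
x_poly = poly_regs.fit_transform(x)      # polynomial feature matrix
lin_reg_2 = LinearRegression()
lin_reg_2.fit(x_poly, y)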
In the above lines of code, we have used poly_regs.fit_transform(x), because first we are converting our
feature matrix into polynomial feature matrix, and then fitting it to the Polynomial regression model. The
parameter value(degree= 2) depends on our choice. We can choose it according to our Polynomial
features.
After executing the code, we will get another matrix x_poly, which can be seen under the variable
explorer option:
Next, we have used another LinearRegression object, namely lin_reg_2, to fit our x_poly vector to the
linear model.
Output:
Now we will visualize the result for Linear regression model as we did in Simple Linear Regression.
Below is the code for it:
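(A minimal plotting sketch consistent with the output described below: blue points for actual values and a red straight line for the linear predictions.)

# Visualising the Simple Linear Regression result
mtp.scatter(x, y, color="blue")                   # actual values
mtp.plot(x, lin_regs.predict(x), color="red")     # linear predictions
mtp.title("Bluff detection model (Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()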
Output:
In the above output image, we can clearly see that the regression line is far from the data points: the predictions lie on a red straight line, while the blue points are the actual values. If we used this output to predict the salary at the CEO level, it would give approximately $600,000, which is far from the real value. So we need a curved model, rather than a straight line, to fit this dataset.
Here we will visualize the result of the Polynomial regression model; the code is only slightly different from that of the model above.
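(A minimal sketch for the polynomial model; the only real change is that predictions are made on the polynomial feature matrix produced by poly_regs.)

# Visualising the Polynomial Regression result
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model (Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()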
Output:
As we can see in the above output image, the predictions are close to the real values. The plot will change as we change the degree.
For degree = 3: if we change the degree to 3, we get an even more accurate plot, as shown in the image below. As we can see there, the predicted salary for level 6.5 is around 170K-190K dollars, which suggests that the future employee is telling the truth about his salary.
Degree = 4: let's change the degree to 4, which gives the most accurate plot of all. Hence we can get more accurate results by increasing the degree of the polynomial.
Now, we will predict the final output using the Linear regression model, to see whether the employee is telling the truth or bluffing. For this, we use the predict() method and pass the value 6.5. Below is the code for it:
lin_pred = lin_regs.predict([[6.5]])
print(lin_pred)
Output:
[330378.78787879]
Now, we will predict the final output using the Polynomial Regression model, to compare it with the Linear model. Below is the code for it:
poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
print(poly_pred)
Output:
[158862.45265153]
As we can see, the predicted output of the Polynomial Regression is [158862.45265153], which is much closer to the real value; hence we can say that the future employee is telling the truth.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary or
discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True
or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function, or logistic function, to map predictions to probabilities (its cost function is correspondingly more complex than MSE). The sigmoid function, used to model the data in logistic regression, can be written as:
f(x) = 1 / (1 + e^(−x))
When we provide the input values (data) to the function, it gives an S-shaped curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear fashion,
so for such case, linear regression will not best fit to those datapoints. To cover such datapoints,
we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features of
given degree and then modeled using a linear model. Which means the datapoints are best
fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation: the linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression coefficients, while x is our independent/input variable.
o The model is still linear because the coefficients enter linearly, even though the features are quadratic or higher-degree terms.
Note: This is different from Multiple Linear regression in such a way that in Polynomial regression, a
single element has different degrees instead of multiple variables with the same degree.
Support Vector Machine is a supervised learning algorithm which can be used for regression as well as
classification problems. So if we use it for regression problems, then it is termed as Support Vector
Regression.
Support Vector Regression is a regression algorithm which works for continuous variables. Below are
some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line
which helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin
for datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and
opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum number of
datapoints are covered in that margin. The main goal of SVR is to consider the maximum datapoints
within the boundary lines and the hyperplane (best-fit line) must contain a maximum number of
datapoints. Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms, capable of performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output based on the average of the individual tree outputs. The combined decision trees are called base models, and the ensemble can be written in terms of them as g(x) = f1(x) + f2(x) + ... + fn(x), where g is the combined model and the fᵢ are the base models (decision trees).
Ridge Regression:
o A general linear or polynomial regression will fail if there is high collinearity between the independent variables; to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called as L2 regularization.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to the Ridge Regression except that penalty term contains only the absolute weights
instead of a square of weights.
o Since it uses absolute values, it can shrink a slope all the way to 0, whereas Ridge Regression can only shrink it close to 0.
o It is also called L1 regularization. The cost minimized by Lasso regression is the residual sum of squares plus a penalty proportional to the sum of the absolute values of the coefficients, i.e. Σ(yᵢ − ŷᵢ)² + λ Σ|bⱼ|.
Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between
0 and 1.
o Logistic Regression is very similar to Linear Regression except in how it is used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification. The below image is
showing the logistic function:
Note: Logistic regression uses the concept of predictive modelling like regression, which is why it is called logistic regression; but since it is used to classify samples, it falls under the classification algorithms.
The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
o We start from the equation of the straight line: y = b0 + b1x1 + b2x2 + ... + bnxn.
o In Logistic Regression, y can only be between 0 and 1, so let's divide the above expression by (1 − y), giving y / (1 − y), which is 0 for y = 0 and infinity for y = 1.
o But we need a range from −infinity to +infinity, so we take the logarithm, and the equation becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
To understand the implementation of Logistic Regression in Python, we will use the below example:
Example: We are given a dataset containing information about various users, obtained from a social networking site. A car manufacturer has recently launched a new SUV, and the company wants to check how many users from the dataset want to purchase the car.
For this problem, we will build a machine learning model using the Logistic Regression algorithm. The dataset is shown in the image below. In this problem, we will predict the purchased variable (dependent variable) using age and salary (independent variables).
Steps in Logistic Regression: to implement Logistic Regression using Python, we will follow the same steps as in the previous regression topics. Below are the steps:
1. Data pre-processing step: in this step, we pre-process/prepare the data so that we can use it efficiently in our code. It is the same as in the data pre-processing topic. The code for this is given below:
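(A minimal pre-processing sketch; the CSV file name is a placeholder for the social-network users dataset described above.)

# importing libraries and the dataset (file name is a placeholder)
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

data_set = pd.read_csv('user_data.csv')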
By executing the above lines of code, we will get the dataset as the output. Consider the given image:
Now, we will extract the dependent and independent variables from the given dataset. Below is the code
for it:
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and salary, which
are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at index 4. The
output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:
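(A minimal sketch of the split; the 25% test share and random_state are assumptions.)

# splitting the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)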
# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
Our dataset is now well prepared, and we will train the model using the training set. To fit the model to the training set, we import the LogisticRegression class of the sklearn library. After importing the class, we create a classifier object and use it to fit the logistic regression model. Below is the code for it:
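(A minimal sketch consistent with the description above: a LogisticRegression object named classifier fitted on the scaled training set.)

# fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)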
Output: By executing the above code, the fitted LogisticRegression classifier object is created and displayed.
Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
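(A minimal sketch producing the y_pred vector referred to below.)

# predicting the test set results
y_pred = classifier.predict(x_test)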
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable explorer
option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not purchase
the car.
Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function of the sklearn library. After importing the function, we call it and store the result in a new variable cm. The function takes two main parameters, y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
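(A minimal sketch consistent with the description above.)

# creating the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)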
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. By above output,
we can interpret that 65+24= 89 (Correct Output) and 8+3= 11(Incorrect Output).
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
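(A minimal sketch consistent with the description below: x_set/y_set aliases, an nm.meshgrid grid over the scaled feature range, a filled contour of the classifier's predictions, and the training points coloured purple/green.)

# visualising the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# region colours come from the class predicted at every grid point
mtp.contourf(x1, x2,
             classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):          # plot the actual observations
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=('purple', 'green')[i], label=j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()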
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables, x_set and y_set, to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid that ranges from the minimum feature value minus 1 to the maximum feature value plus 1, with pixel points at a resolution of 0.01.
To create a filled contour, we have used the mtp.contourf command, which creates regions of the provided colours (purple and green). In this function we have passed classifier.predict to show the class predicted by the classifier at each grid point.
Output: By executing the above code, we will get the below output:
o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result for
purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-axis and Estimated
salary on the y-axis.
o The purple point observations are for which purchased (dependent variable) is probably 0, i.e.,
users who did not purchase the SUV car.
o The green point observations are for which purchased (dependent variable) is probably 1 means
user who purchased the SUV car.
o We can also estimate from the graph that the users who are younger with low salary, did not
purchase the car, whereas older users with high estimated salary purchased the car.
o But there are also some purple points in the green region (predicted as buying the car) and some green points in the purple region (predicted as not buying). These are the exceptions: for example, some younger users with a high estimated salary did purchase the car, whereas some older users with a low estimated salary did not.
We have successfully visualized the training set result for the logistic regression, and our goal for this
classification is to divide the users who purchased the SUV car and who did not purchase the car. So from
the output graph, we can clearly see the two regions (Purple and Green) with the observation points. The
Purple region is for those users who didn't buy the car, and Green Region is for those users who
purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a Straight line or linear in nature as we have used the
Linear model for Logistic Regression. In further topics, we will learn for non-linear Classifiers.
Visualizing the test set result:
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we will
use x_test and y_test instead of x_train and y_train. Below is the code for it:
Output:
Hence our model is pretty good and ready to make new predictions for this classification problem.
1. Linear – if the degree is 1
2. Quadratic – if the degree is 2
Here we are dealing with mathematics; rather than going deep, just understand the basic structure. We all know that the equation of a linear model is a straight line. If we have many features, we opt for multiple regression, which only adds more feature terms. Polynomial regression, however, is not about adding more features but about changing the structure of the equation to a quadratic (or higher-degree) one, as you can see visually in the diagram.
Linear Regression vs Polynomial Regression
Rather than focusing on the distinctions between linear and polynomial regression, we can understand the importance of polynomial regression by starting with linear regression. We build our model and realize that it performs poorly. We examine the difference between the actual values and the best fit line we predicted, and it appears that the true values follow a curve on the graph, while our straight line is nowhere near the centre of the points. This is where polynomial regression comes into play: it predicts the best-fit line that follows the pattern (curve) of the data.
One important distinction between Linear and Polynomial Regression is that Polynomial Regression does not require a linear relationship between the independent and dependent variables in the data set. It is used when the Linear Regression model fails to capture the points in the data and therefore fails to represent the optimum result adequately.
Overfitting vs Underfitting
If we keep increasing the degree, the fit to the training data keeps improving, but eventually we run into the overfitting problem: the R² value on the training data may even reach 100%.
When analysing a dataset with a purely linear model, we may encounter an underfitting problem, which can be corrected using polynomial regression. However, when tuning the degree parameter beyond its optimal value, we encounter an overfitting problem, which shows up as a near-100% R² on the training data. The conclusion is that we must avoid both overfitting and underfitting.