Regression Analysis in Machine Learning
We can understand the concept of regression analysis using the example below:
Example: Suppose a marketing company A runs various advertisements every year and earns sales
from them. The list below shows the advertisement spend of the company over the last 5 years
and the corresponding sales:
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to predict the
sales for that year. To solve this type of prediction problem in machine learning, we need
regression analysis.
Regression is a supervised learning technique that helps find the relationship between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
studying the cause-and-effect relationship between variables.
In regression, we plot a line or curve that best fits the given datapoints. Using this fit, the
machine learning model can make predictions about the data. In simple words, "Regression shows
a line or curve that passes as close as possible to the datapoints on the target-predictor graph,
in such a way that the vertical distance between the datapoints and the regression line is
minimum." The distance between the datapoints and the line tells whether a model has captured a
strong relationship or not.
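As a minimal sketch of the idea above, the snippet below fits a straight line by least squares, which minimizes the sum of squared vertical distances between the points and the line. The advertisement and sales figures are made up for illustration (the original list is not reproduced here), and the helper names `fit_line` and `sum_squared_residuals` are hypothetical.

```python
import numpy as np

# Hypothetical advertisement spend and sales figures (made up for illustration).
ad_spend = np.array([90.0, 120.0, 150.0, 100.0, 130.0])
sales = np.array([1000.0, 1300.0, 1800.0, 1200.0, 1380.0])

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return float(np.sum((y - (slope * x + intercept)) ** 2))

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 200.0 + intercept  # prediction for a spend of 200
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction={predicted_sales:.1f}")
```

Any other line through the same points, for example one with a slightly different slope, gives a larger sum of squared residuals than the fitted one.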
Independent Variable: The factors which affect the dependent variable, or which are used to
predict its values, are called independent variables, also known as predictors.
Outliers: An outlier is an observation whose value is very low or very high in comparison to
the other observed values. An outlier may distort the result, so it should be identified and
handled before fitting.
Multicollinearity: If the independent variables are highly correlated with each other, the
condition is called multicollinearity. It should not be present in the dataset, because it
creates problems when ranking the most influential variables.
Underfitting and Overfitting: If our algorithm works well with the training dataset but not
with the test dataset, the problem is called overfitting. If our algorithm does not perform
well even with the training dataset, the problem is called underfitting.
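The difference between training and test performance can be sketched as follows. This is an illustrative experiment on synthetic data (a noisy straight line, with a fixed random seed), not part of the original example: a degree-9 polynomial fitted to 10 training points reproduces them almost exactly, yet its error on fresh test points is typically much higher, which is the overfitting pattern described above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(x):
    """Synthetic target: a straight line plus random noise (assumed for illustration)."""
    return 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.shape)

x_train = np.linspace(-1.0, 1.0, 10)
y_train = make_data(x_train)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = make_data(x_test)

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

results = {}  # degree -> (train MSE, test MSE)
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_test, np.polyval(coeffs, x_test)),
    )
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```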
Regression estimates the relationship between the target and the independent variables.
By performing regression, we can estimate which factors are the most important, which are the
least important, and how the factors relate to one another.
Types of Regression
There are various types of regression used in data science and machine learning. Each type has
its own importance in different scenarios, but at the core, all regression methods analyze the
effect of the independent variables on the dependent variable. Here we discuss some important
types of regression, which are given below:
Linear Regression
Logistic Regression
Polynomial Regression
Ridge Regression
Lasso Regression
Linear Regression:
Linear regression is a statistical regression method used for predictive analysis.
It is one of the simplest regression algorithms and shows the relationship between continuous
variables.
Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence the name linear regression.
If there is only one input variable (x), it is called simple linear regression. If there is
more than one input variable, it is called multiple linear regression.
The relationship between variables in the linear regression model can be explained using the
image below. Here we are predicting the salary of an employee on the basis of years of
experience.
Y= aX+b
Salary forecasting
Real estate prediction
Logistic Regression:
Logistic regression is another supervised learning algorithm, used to solve classification
problems. In classification problems, the dependent variable is in a binary or discrete format,
such as 0 or 1.
The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
Logistic regression is a type of regression, but it differs from the linear regression
algorithm in how it is used.
Logistic regression uses the sigmoid function, also called the logistic function, which maps
any real-valued input to a value between 0 and 1. This sigmoid function is used to model the
data in logistic regression. The function can be represented as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives the S-curve as follows:
It uses the concept of threshold levels: values above the threshold level are rounded up to 1,
and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
Binary (0/1, pass/fail)
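The sigmoid function and the threshold rule described above can be sketched in a few lines. This is a minimal illustration, assuming a threshold of 0.5 and treating a probability exactly at the threshold as class 1; the function name `classify` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, otherwise 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-2.0, 0.1, 3.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Plotting `sigmoid(z)` over a range of inputs gives the S-shaped curve mentioned in the text.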
Polynomial Regression:
Polynomial regression is a type of regression which models a non-linear dataset using a
linear model.
It is similar to multiple linear regression, but it fits a non-linear curve between the value
of x and the corresponding conditional values of y.
Suppose a dataset consists of datapoints arranged in a non-linear fashion. In such a case,
linear regression will not fit those datapoints well. To cover such datapoints, we need
polynomial regression.
In polynomial regression, the original features are transformed into polynomial features
of a given degree and then modeled using a linear model. This means the datapoints are
best fitted using a polynomial curve.
The equation for polynomial regression is also derived from the linear regression equation:
the linear regression equation Y = b0 + b1x is transformed into the polynomial regression
equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x
is our independent/input variable.
The model is still linear because it is linear in the coefficients, even though it contains
quadratic and higher-degree terms of x.
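To illustrate why polynomial regression is still a linear model, the sketch below expands x into the features 1, x, x^2 and then solves an ordinary linear least-squares problem for the coefficients. The data are synthetic (generated from an assumed quadratic), so the recovered coefficients should match the ones used to create them.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x**2  # synthetic quadratic data (assumed for illustration)

degree = 2
# Transform the single feature x into polynomial features: columns 1, x, x^2.
features = np.vander(x, degree + 1, increasing=True)

# Ordinary linear least squares on the transformed features.
coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
print(coeffs)  # estimates of b0, b1, b2
```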
The linear regression algorithm shows a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression. Because the
relationship is linear, the model describes how the value of the dependent variable changes
according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:
y= a0+a1x+ ε
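Here, y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the
intercept of the line, a1 is the linear regression coefficient (slope), and ε is the random error.
The values of the x and y variables come from the training dataset used to fit the Linear
Regression model.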
If more than one independent variable is used to predict the value of the dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To
calculate these, we use a cost function.
Cost function:
Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, and the cost function is used to estimate the coefficient values for the best fit
line.
The cost function is used to optimize the regression coefficients or weights. It measures how
well a linear regression model is performing.
We can use the cost function to measure the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as the hypothesis
function.
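For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of
the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σ (yi - (a1xi + a0))^2, summed over i = 1, ..., N
Where,
N = total number of observations
yi = actual value
(a1xi + a0) = predicted value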
Residuals: The distance between the actual value and the predicted value is called the residual.
If the observed points are far from the regression line, the residuals will be large, and so the
cost function will be high. If the scatter points are close to the regression line, the residuals
will be small, and hence so will the cost function.
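The MSE cost and its dependence on the residuals can be sketched as below. The data points are made up so that they lie exactly on the line y = 1 + 2x; the function names are hypothetical. With the matching coefficients every residual is zero, while shifting the intercept by 1 makes every residual equal to 1.

```python
def predict(x, a0, a1):
    """Hypothesis function of simple linear regression: y = a0 + a1 * x."""
    return a0 + a1 * x

def mean_squared_error(xs, ys, a0, a1):
    """Average of the squared residuals (actual value minus predicted value)."""
    n = len(xs)
    return sum((y - predict(x, a0, a1)) ** 2 for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # made-up points lying exactly on y = 1 + 2x

print(mean_squared_error(xs, ys, a0=1.0, a1=2.0))  # perfect fit, all residuals are 0
print(mean_squared_error(xs, ys, a0=0.0, a1=2.0))  # every residual is 1
```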
Assumptions of Linear Regression
Below are some important assumptions of Linear Regression. These are formal checks to perform
while building a Linear Regression model, which help to get the best possible result from the
given dataset.
Homoscedasticity Assumption:
Homoscedasticity is a situation in which the variance of the error term is the same for all
values of the independent variables. With homoscedasticity, there should be no clear pattern in
the distribution of the residuals in the scatter plot.
No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. Any correlation in
the error terms can drastically reduce the accuracy of the model. Autocorrelation usually
occurs if there is a dependency between residual errors.
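One common way to check the no-autocorrelation assumption, not named in the text above and shown here only as a sketch, is the Durbin-Watson statistic computed on the residuals in their observed order. It ranges from 0 to 4: values near 2 suggest no lag-1 autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation. The residual lists below are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

alternating = [1.0, -1.0, 1.0, -1.0]  # made-up residuals that flip sign each step
drifting = [1.0, 0.9, 0.8, 0.7]       # made-up residuals that change slowly

print(durbin_watson(alternating))  # above 2: negative autocorrelation
print(durbin_watson(drifting))     # close to 0: positive autocorrelation
```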
Simple Linear Regression in Machine Learning
Simple Linear Regression is a type of regression algorithm that models the relationship between a
dependent variable and a single independent variable. The relationship shown by a Simple Linear
Regression model is linear, i.e. a sloped straight line, hence the name Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real
value. However, the independent variable can be measured on continuous or categorical values.
Model the relationship between the two variables, such as the relationship between income
and expenditure, or experience and salary.
Each feature variable must have a linear relationship with the dependent variable.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor
variables x1, x2, x3, ..., xn. Since it is an extension of Simple Linear Regression, the same form
applies to the multiple linear regression equation, which becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y = Output/Response variable
b0, b1, b2, ..., bn = Coefficients of the model
x1, x2, x3, ..., xn = Independent/feature variables
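As a minimal sketch of a multiple linear regression fit, the snippet below estimates an intercept and two coefficients by least squares. The two predictors and the target are synthetic (the target is generated from assumed coefficients 1, 2 and 3), and the coefficient names b0, b1, b2 are chosen here for illustration.

```python
import numpy as np

# Two synthetic predictor variables and a target built from assumed coefficients.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix with a leading column of ones for the intercept b0.
design = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimate of (b0, b1, b2).
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coeffs
print(b0, b1, b2)
```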
A linear relationship should exist between the target and the predictor variables.
MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
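A quick, informal way to look for multicollinearity is to inspect the pairwise correlations between the predictors. The sketch below uses made-up data in which the second column is roughly twice the first; the cut-off of 0.8 is an arbitrary assumption for illustration, not a rule from this tutorial.

```python
import numpy as np

# Made-up predictors: column 1 is roughly 2 * column 0, column 2 is unrelated.
features = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 3.0],
    [3.0, 6.2, 6.0],
    [4.0, 7.8, 2.0],
    [5.0, 10.1, 4.0],
])

# Pairwise Pearson correlations between the columns.
corr = np.corrcoef(features, rowvar=False)

threshold = 0.8  # assumed cut-off for "highly correlated"
flagged = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) > threshold
]
print(flagged)  # index pairs of predictors that are highly correlated
```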
Problem Description:
We have a dataset of 50 start-up companies. This dataset contains five main fields: R&D Spend,
Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a
model that can determine which company has the maximum profit, and which factor affects the profit
of a company the most.
Since we need to find the Profit, it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of building the MLR model:
The very first step is data pre-processing, which we have already discussed in this tutorial. This process
contains the steps below:
Importing libraries: First, we will import the libraries which will help in building the model. Below is
the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
Importing dataset: Now we will import the dataset (50_CompList), which contains all the variables.
Below is the code for it:
#importing datasets
data_set= pd.read_csv('50_CompList.csv')
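A possible next pre-processing step is sketched below: separating the dependent variable from the independent variables and encoding the categorical State column as dummy variables. Because the CSV file is not available here, the sketch builds a tiny stand-in table with made-up values; the column names are assumed to match the fields listed in the problem description and may differ in the real file.

```python
import pandas as pd

# Tiny stand-in for the real dataset; all values are made up for illustration.
data_set = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 1500.0, 1200.0],
    "Administration Spend": [500.0, 700.0, 600.0, 550.0],
    "Marketing Spend": [300.0, 900.0, 400.0, 350.0],
    "State": ["New York", "California", "Florida", "New York"],
    "Profit": [800.0, 1900.0, 1100.0, 900.0],
})

# Profit is the dependent variable; the remaining columns are independent variables.
x = data_set.drop(columns=["Profit"])
y = data_set["Profit"]

# Encode the categorical State column as dummy variables, dropping one level
# so that the dummy columns are not perfectly collinear with each other.
x = pd.get_dummies(x, columns=["State"], drop_first=True)
print(x.columns.tolist())
```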
Logistic regression is one of the most popular machine learning algorithms and comes
under the supervised learning technique. It is used for predicting a categorical dependent
variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values
which lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output is bounded by two limiting values (0 and 1).
The curve from the logistic function indicates the likelihood of something, such as whether
cells are cancerous or not, or whether a mouse is obese or not based on its weight.
Logistic Regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify observations using different types of data and
can help determine the most effective variables for the classification. The image below
shows the logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression does, and is
therefore called logistic regression; but because it is used to classify samples, it falls under
the classification algorithms.
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within the range of 0 to 1.
The value of the logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the letter "S". This S-shaped curve is called the sigmoid
function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides whether a
probability is mapped to 0 or 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
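The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
We know the equation of a straight line can be written as: y = b0 + b1x1 + b2x2 + ... + bnxn
In Logistic Regression, y can only be between 0 and 1, so we divide y by (1 - y), giving
y / (1 - y), which is 0 for y = 0 and infinity for y = 1.
But we need a range from -infinity to +infinity, so we take the logarithm of the equation, and
it becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn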
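The logarithm step can be checked numerically. The sketch below, with the hypothetical helper name `logit`, computes the log-odds log(y / (1 - y)) and shows that it undoes the sigmoid function: probabilities above 0.5 map to positive values, probabilities below 0.5 map to negative values, and applying it to a sigmoid output recovers the original linear score.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1): log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

score = 2.0                 # example value of the linear part b0 + b1*x1 + ...
probability = sigmoid(score)
recovered = logit(probability)
print(probability, recovered)  # recovered is approximately equal to score
```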
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic Regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered
types of the dependent variable.
Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the
dependent variable, such as "Low", "Medium", or "High".
To understand the implementation of Logistic Regression in Python, we will use the below example: