Regression Analysis in Machine Learning


Regression analysis is a statistical method for modeling the relationship between a dependent (target)
variable and one or more independent (predictor) variables. More specifically, regression analysis
helps us understand how the value of the dependent variable changes with one independent variable
while the other independent variables are held fixed. It predicts continuous (real-valued) quantities
such as temperature, age, salary, or price.

We can understand the concept of regression analysis using the example below:

Example: Suppose a marketing company A runs various advertisements every year and earns sales
from them. The list below shows the advertising spend of the company over the last 5 years
and the corresponding sales:

Now the company wants to spend $200 on advertising in the year 2019 and wants a prediction of
the sales for that year. To solve this type of prediction problem in machine learning, we need
regression analysis.
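
As a minimal sketch of this kind of prediction, the snippet below fits a straight line to advertising
spend and sales and then predicts the sales for a spend of $200. The five data points are hypothetical
placeholders, not the company's real figures, and `numpy.polyfit` is only one convenient way to do a
least-squares line fit.

```python
import numpy as np

# Hypothetical advertising spend (in $) and sales for five past years.
# These numbers are illustrative only; they are not the company's real data.
advertisement = np.array([90.0, 120.0, 150.0, 100.0, 130.0])
sales = np.array([1000.0, 1300.0, 1800.0, 1200.0, 1380.0])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(advertisement, sales, deg=1)


def predict_sales(spend):
    """Predict sales for a given advertising spend using the fitted line."""
    return slope * spend + intercept


predicted_2019 = predict_sales(200.0)
print(f"Predicted sales for $200 of advertising: {predicted_2019:.1f}")
```

Note that $200 lies outside the range of the hypothetical training data, so the prediction is an
extrapolation and should be read with caution.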

Regression is a supervised learning technique that helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining cause-and-effect relationships between variables.

In regression, we plot a line or curve between the variables that best fits the given datapoints. Using
this plot, the machine learning model can make predictions about the data. In simple words, "Regression
finds a line or curve that passes as close as possible to the datapoints on the target-predictor graph,
such that the vertical distance between the datapoints and the regression line is minimum." The
distance between the datapoints and the line tells whether a model has captured a strong relationship or not.

Some examples of regression are:

Prediction of rain using temperature and other factors

Determining Market trends

Prediction of road accidents due to rash driving.


Terminologies Related to Regression Analysis:
Dependent Variable: The main factor in regression analysis that we want to predict or
understand is called the dependent variable. It is also called the target variable.

Independent Variable: The factors that affect the dependent variable, or that are used to
predict its values, are called independent variables, also known as predictors.

Outliers: An outlier is an observation with a very low or very high value in
comparison to the other observed values. An outlier may distort the result, so it should be
identified and handled with care.

Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. Ideally it should not be present in the
dataset, because it makes it hard to rank the variables by how strongly they affect the target.

Underfitting and Overfitting: If our algorithm works well on the training dataset but not
on the test dataset, the problem is called overfitting. If our algorithm does not
perform well even on the training dataset, the problem is called underfitting.
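
To make the underfitting and overfitting definitions concrete, here is a small sketch that fits
polynomials of different degrees to the same training points and compares the error on training and
test points. The data, the noise level, the train/test split, and the chosen degrees are all
assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data: a quadratic trend plus a little noise.
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# Even-indexed points form the training set, odd-indexed points the test set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the points (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


errors = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE={errors[degree][0]:.4f}, "
          f"test MSE={errors[degree][1]:.4f}")
```

A straight line (degree 1) cannot follow the curve, so its error tends to be high on both sets, which
is the underfitting pattern. A high degree can always push the training error down, but a test error
that does not improve along with it is the typical sign of overfitting.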

Why do we use Regression Analysis?


As mentioned above, regression analysis helps in the prediction of a continuous variable. There are
various real-world scenarios where we need future predictions, such as weather conditions,
sales, or marketing trends. In such cases we need a technique that can make accurate predictions,
and regression analysis is a statistical method widely used for this purpose in machine learning
and data science. Below are some other reasons for using regression analysis:

Regression estimates the relationship between the target and the independent variable.

It is used to find the trends in data.

It helps to predict real/continuous values.

By performing the regression, we can estimate which factors are the most important, which are
the least important, and how each factor affects the target.

Types of Regression
There are various types of regression used in data science and machine learning. Each
type has its own importance in different scenarios, but at the core, all regression methods
analyze the effect of the independent variables on the dependent variable. Some important types
of regression are given below:

Linear Regression
Logistic Regression

Polynomial Regression

Support Vector Regression

Decision Tree Regression

Random Forest Regression

Ridge Regression

Lasso Regression

Linear Regression:

Linear regression is a statistical regression method used for predictive analysis.

It is one of the simplest regression algorithms and shows the relationship between
continuous variables.

It is used for solving regression problems in machine learning.

Linear regression models a linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence the name linear regression.

If there is only one input variable (x), it is called simple linear regression. If there is
more than one input variable, it is called multiple linear regression.

The relationship between variables in the linear regression model can be explained using the
image below. Here we are predicting the salary of an employee on the basis of years of
experience.

Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients (a is the slope and b is the intercept).

Some popular applications of linear regression are:

Analyzing trends and sales estimates

Salary forecasting
Real estate prediction

Arriving at ETAs in traffic.

Logistic Regression:

Logistic regression is another supervised learning algorithm, used to solve classification
problems. In classification problems, the dependent variable is in a binary or
discrete format, such as 0 or 1.

The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True
or False, Spam or Not Spam, etc.

It is a predictive analysis algorithm that works on the concept of probability.

Logistic regression is a type of regression, but it differs from the linear regression
algorithm in how it is used.

Logistic regression uses the sigmoid function, also called the logistic function, to model the
data. The function can be represented as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

f(x) = output, a value between 0 and 1

x = input to the function

e = base of the natural logarithm

When we provide the input values (data) to the function, it gives the S-curve as follows:

It uses the concept of a threshold level: values above the threshold are rounded up to 1,
and values below the threshold are rounded down to 0.
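
The sigmoid and threshold steps described above can be sketched in a few lines. The threshold of
0.5 and the choice to assign a value exactly at the threshold to class 1 are assumptions, since the
text does not fix them.

```python
import math


def sigmoid(x):
    """Logistic (sigmoid) function: maps a real input x to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(x, threshold=0.5):
    """Return 1 if the sigmoid output reaches the threshold, else 0.

    The default threshold of 0.5 is an assumed, commonly used value.
    """
    return 1 if sigmoid(x) >= threshold else 0


for value in (-4.0, 0.0, 4.0):
    print(value, round(sigmoid(value), 4), classify(value))
```

This simple form is fine for moderate inputs; for very large negative inputs `math.exp(-x)` can
overflow, so production code usually relies on a numerically stable library routine instead.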
There are three types of logistic regression:

Binary (0/1, pass/fail)

Multinomial (cats, dogs, lions)

Ordinal (low, medium, high)

Polynomial Regression:

Polynomial regression is a type of regression that models a non-linear dataset using a
linear model.

It is similar to multiple linear regression, but it fits a non-linear curve between the value of x
and the corresponding conditional values of y.

Suppose there is a dataset whose datapoints are arranged in a non-linear fashion. In such a
case, linear regression will not fit those datapoints well. To cover such
datapoints, we need polynomial regression.

In polynomial regression, the original features are transformed into polynomial features
of a given degree and then modeled using a linear model. This means the datapoints are
fitted using a polynomial curve.

The equation for polynomial regression is derived from the linear regression equation:
the linear regression equation Y = b0 + b1x is extended into the polynomial regression
equation

$$Y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n$$

Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our
independent/input variable.

The model is still linear as the coefficients are still linear with quadratic
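
The idea of transforming the input into polynomial features and then fitting a linear model can be
sketched as follows. The helper name `polynomial_features`, the degree, and the noise-free data are
assumptions chosen so that the recovered coefficients are easy to check.

```python
import numpy as np


def polynomial_features(x, degree):
    """Build the columns [1, x, x^2, ..., x^degree] from a 1-D input array."""
    return np.vander(x, N=degree + 1, increasing=True)


# Hypothetical noise-free data generated from Y = 1 + 2x + 3x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 + 2.0 * x + 3.0 * x**2

# The model is linear in the coefficients b0..bn, so ordinary least squares applies.
X_poly = polynomial_features(x, degree=2)
coefficients, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

print("b0, b1, b2 =", np.round(coefficients, 6))
```

The curve is non-linear in x, but the unknowns b0, b1, b2 enter only as multipliers of fixed
columns, which is why a linear solver is enough.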

Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis. Linear regression makes predictions for
continuous (real or numeric) variables such as sales, salary, age, product price, etc.

The linear regression algorithm models a linear relationship between a dependent variable (y) and one
or more independent variables (x), hence the name linear regression. Because the relationship is
linear, the model describes how the value of the dependent variable changes with
the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the image below:

Mathematically, we can represent a linear regression as:

$$y = a_0 + a_1 x + \varepsilon$$

Here,

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values of the x and y variables form the training dataset used to fit the linear regression model.

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a linear regression algorithm is called Simple Linear Regression.

Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a linear regression algorithm is called Multiple Linear Regression.

Linear Regression Line


A straight line showing the relationship between the dependent and independent variables is called a
regression line. A regression line can show two types of relationship:

Positive Linear Relationship:

If the dependent variable (Y-axis) increases as the independent variable (X-axis) increases,
the relationship is called a positive linear relationship.

Negative Linear Relationship:

If the dependent variable (Y-axis) decreases as the independent variable (X-axis) increases,
the relationship is called a negative linear relationship.

Finding the best fit line:


When working with linear regression, our main goal is to find the best fit line, which means the error
between the predicted values and the actual values should be minimized. The best fit line has the least
error.

Different values for the weights or coefficients of the line (a0, a1) give different regression lines,
so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate these, we
use a cost function.

Cost function-

Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, and the cost function is used to estimate the coefficient values for the best
fit line.

The cost function is what we minimize to optimize the regression coefficients or weights. It
measures how well a linear regression model is performing.

We can use the cost function to assess the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as the hypothesis
function.

For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the
squared errors between the predicted values and the actual values. For the above linear equation,
the MSE can be calculated as:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (a_1 x_i + a_0) \right)^2$$

Where,

N = total number of observations
y_i = actual value
(a_1 x_i + a_0) = predicted value

Residuals: The distance between the actual value and the predicted value is called the residual. If the
observed points are far from the regression line, the residuals will be large, and so the cost function
will be high. If the scatter points are close to the regression line, the residuals will be small, and
so will the cost function.
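
As a small worked example of the residual and MSE definitions, the sketch below evaluates two
candidate lines on the same points. The data points and the candidate coefficients are hypothetical
and chosen so the results can be checked by hand.

```python
def predict(a0, a1, x):
    """Predicted value a0 + a1 * x for a single input x."""
    return a0 + a1 * x


def residuals(a0, a1, xs, ys):
    """Residuals: actual value minus predicted value for every observation."""
    return [y - predict(a0, a1, x) for x, y in zip(xs, ys)]


def mse(a0, a1, xs, ys):
    """Mean squared error: the average of the squared residuals."""
    res = residuals(a0, a1, xs, ys)
    return sum(r * r for r in res) / len(res)


# Hypothetical observations that lie exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

print(mse(1.0, 2.0, xs, ys))  # the line that generated the data: 0.0
print(mse(0.0, 2.0, xs, ys))  # intercept off by 1, every residual is 1: 1.0
```

The line with the smaller MSE is the better fit, which is the comparison the cost function is
meant to support.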

Assumptions of Linear Regression

Below are some important assumptions of linear regression. These are formal checks to perform while
building a linear regression model, and they help to get the best possible result from the given
dataset.

Linear relationship between the features and target:

Linear regression assumes a linear relationship between the dependent and independent
variables.

Small or no multicollinearity between the features:

Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and the
target variable. In other words, it is difficult to determine which predictor variable is affecting the
target variable and which is not. So, the model assumes either little or no multicollinearity
between the features or independent variables.

Homoscedasticity Assumption:
Homoscedasticity is the situation in which the variance of the error term is the same for all values
of the independent variables. With homoscedasticity, there should be no clear pattern in the
distribution of data in the scatter plot.

Normal distribution of error terms:

Linear regression assumes that the error term follows a normal distribution. If the
error terms are not normally distributed, confidence intervals will become either too wide
or too narrow, which may cause difficulties in estimating the coefficients reliably.
This can be checked using a Q-Q plot. If the plot shows a straight line without any deviation,
the error is normally distributed.

No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is any
correlation in the error term, it can substantially reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
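
Two of the checks above can be sketched with plain NumPy: a correlation matrix of the features as a
rough multicollinearity check, and the Durbin-Watson statistic as one common way to look for lag-1
autocorrelation in residuals. The Durbin-Watson statistic is not named in the text above and is
brought in here as an assumption, and the feature values and residuals are hypothetical.

```python
import numpy as np


def feature_correlation(X):
    """Pairwise Pearson correlation between the columns (features) of X."""
    return np.corrcoef(X, rowvar=False)


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest little lag-1 autocorrelation."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))


# Hypothetical features: the second column is an exact multiple of the first.
X = np.array([[1.0, 2.0, 5.0],
              [2.0, 4.0, 3.0],
              [3.0, 6.0, 8.0],
              [4.0, 8.0, 1.0]])
corr = feature_correlation(X)
print(np.round(corr, 3))

# Hypothetical residuals that alternate in sign (negative autocorrelation).
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(dw)
```

An off-diagonal correlation close to 1 or -1 flags a pair of features that carry nearly the same
information. A Durbin-Watson value well below 2 points to positive autocorrelation, and a value well
above 2 to negative autocorrelation.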

Simple Linear Regression in Machine Learning
Simple Linear Regression is a type of regression algorithm that models the relationship between a
dependent variable and a single independent variable. The relationship shown by a Simple Linear
Regression model is linear, that is, a sloped straight line, hence the name Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real
value. However, the independent variable can be measured on continuous or categorical values.

The Simple Linear Regression algorithm has two main objectives:

Model the relationship between the two variables, such as the relationship between income
and expenditure, experience and salary, etc.

Forecast new observations, such as weather forecasting according to temperature, or the revenue
of a company according to its investments in a year.

Simple Linear Regression Model:


The Simple Linear Regression model can be represented using the equation below:

$$y = a_0 + a_1 x + \varepsilon$$

For Multiple Linear Regression (MLR), the dependent or target variable (Y) must be continuous/real,
but the predictor or independent variables may be of continuous or categorical form.

Each feature variable must model the linear relationship with the dependent variable.

MLR tries to fit a regression line through a multidimensional space of data-points.

MLR equation:

In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor
variables x1, x2, x3, ..., xn. Since it is an extension of Simple Linear Regression, the same form
applies, and the equation becomes:

$$Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$

Where,

Y = output/response variable

b0, b1, b2, b3, ..., bn = coefficients of the model

x1, x2, x3, ..., xn = the independent/feature variables
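
The MLR equation can be fitted with ordinary least squares. The sketch below uses
`numpy.linalg.lstsq` on a design matrix with a leading column of ones for b0; the data are
hypothetical and generated without noise from known coefficients, so the fit should recover them.

```python
import numpy as np

# Hypothetical data generated from Y = 4 + 2*x1 - 1*x2 + 0.5*x3 (no noise).
X = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 4.0],
              [3.0, 5.0, 2.0],
              [4.0, 3.0, 6.0],
              [5.0, 8.0, 8.0],
              [6.0, 2.0, 1.0]])
true_b = np.array([4.0, 2.0, -1.0, 0.5])  # b0, b1, b2, b3
y = true_b[0] + X @ true_b[1:]

# Prepend a column of ones so that the intercept b0 is estimated as well.
X_design = np.column_stack([np.ones(len(X)), X])
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated b0..b3:", np.round(b, 6))

# Predict for a new, hypothetical observation (x1, x2, x3).
new_row = np.array([2.0, 2.0, 2.0])
prediction = b[0] + new_row @ b[1:]
print("prediction:", round(float(prediction), 6))
```

With real, noisy data the estimated coefficients will only approximate the underlying relationship.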


Assumptions for Multiple Linear Regression:

A linear relationship should exist between the Target and predictor variables.

The regression residuals must be normally distributed.

MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.

Implementation of Multiple Linear Regression model using Python:

To implement MLR using Python, we use the problem below:

Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five main pieces of information:
R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is
to create a model that can determine which company has the maximum profit, and which factor affects
the profit of a company the most.

Since we need to find the Profit, it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of building the MLR model:

1. Data Pre-processing Steps

2. Fitting the MLR model to the training set

3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this tutorial. This process
contains the steps below:

Importing libraries: First, we import the libraries that will help in building the model. Below is
the code for it:

# importing libraries
import numpy as nm                # numerical arrays
import matplotlib.pyplot as mtp   # plotting
import pandas as pd               # tabular data handling

Importing dataset: Next, we import the dataset (50_CompList), which contains all the variables.
Below is the code for it:

# importing the dataset
data_set = pd.read_csv('50_CompList.csv')
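
A typical next step is to separate the features from the target and to encode the categorical State
column. The sketch below is one possible way to do this with pandas. It reads a tiny in-memory
stand-in instead of the real file, and the rows, the exact column headers, and the variable names
`x`, `y`, and `x_encoded` are all assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for '50_CompList.csv' (three made-up rows).
csv_text = """R&D Spend,Administration Spend,Marketing Spend,State,Profit
1000.0,500.0,2000.0,New York,900.0
1500.0,450.0,1800.0,California,1100.0
800.0,600.0,1500.0,Florida,700.0
"""
data_set = pd.read_csv(io.StringIO(csv_text))

# Independent variables (features) and the dependent variable (target).
x = data_set.drop(columns=["Profit"])
y = data_set["Profit"]

# One-hot encode the categorical 'State' column.
# drop_first=True removes one dummy column, which would otherwise be redundant.
x_encoded = pd.get_dummies(x, columns=["State"], drop_first=True)
print(x_encoded.columns.tolist())
```

Dropping one dummy column keeps the encoded State columns from being perfectly correlated with each
other, which ties back to the multicollinearity assumption of MLR.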

Logistic Regression in Machine Learning



Logistic regression is one of the most popular machine learning algorithms and comes
under the supervised learning technique. It is used for predicting a categorical dependent
variable using a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values
that lie between 0 and 1.

Logistic regression is very similar to linear regression except in how it is used.
Linear regression is used for solving regression problems, whereas logistic regression is
used for solving classification problems.

In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function,
whose output is bounded by two limiting values (0 and 1).

The curve from the logistic function indicates the likelihood of something, such as whether
cells are cancerous or not, or whether a mouse is obese or not based on its weight.

Logistic regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using continuous and discrete datasets.

Logistic regression can be used to classify observations using different types of data and
can help determine the most effective variables for the classification. The image below
shows the logistic function:

Note: Logistic regression uses the concept of predictive modeling in the same way as regression,
which is why it is called logistic regression. However, it is used to classify samples, so it falls
under the classification algorithms.

Logistic Function (Sigmoid Function):

The sigmoid function is a mathematical function used to map the predicted values to
probabilities.

It maps any real value to another value in the range between 0 and 1.

The output of logistic regression must lie between 0 and 1 and cannot go beyond these
limits, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid function or
the logistic function.

In logistic regression, we use the concept of a threshold value, which decides whether a
probability is mapped to 0 or to 1. Values above the threshold are assigned to 1, and values
below the threshold are assigned to 0.

Assumptions for Logistic Regression:

The dependent variable must be categorical in nature.

The independent variables should not exhibit multicollinearity.

Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation. The
mathematical steps to get the logistic regression equation are given below:

We know the equation of the straight line can be written as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

In logistic regression, y can lie between 0 and 1 only, so let us divide y by (1 - y):

$$\frac{y}{1-y}, \quad \text{which is } 0 \text{ for } y = 0 \text{ and tends to infinity as } y \to 1$$

But we need a range from -infinity to +infinity, so we take the logarithm of this ratio, and the
equation becomes:

$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

The above equation is the final equation for logistic regression.
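
The relationship between the log-odds form and the sigmoid can be checked numerically: applying the
sigmoid to the linear part gives y, and taking the log-odds of y gives the linear part back. The
coefficients and inputs below are hypothetical, and the sketch uses a single predictor for brevity.

```python
import math


def logit(y):
    """Log-odds log(y / (1 - y)); defined only for 0 < y < 1."""
    return math.log(y / (1.0 - y))


def sigmoid(z):
    """Inverse of the logit: maps the linear part z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(b0, b1, x):
    """Probability y for a single predictor x, given coefficients b0 and b1."""
    return sigmoid(b0 + b1 * x)


b0, b1 = -3.0, 1.5  # hypothetical coefficients

p = predict_probability(b0, b1, 2.0)  # linear part is -3 + 1.5 * 2 = 0
print(p, logit(p))

q = predict_probability(b0, b1, 4.0)  # linear part is -3 + 1.5 * 4 = 3
print(round(q, 4), round(logit(q), 4))
```

In both cases the log-odds of the predicted probability equals b0 + b1 * x, which is the final
logistic regression equation read from left to right.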

Type of Logistic Regression:

On the basis of the categories, logistic regression can be classified into three types:

Binomial: In binomial logistic regression, the dependent variable can have only two possible
types, such as 0 or 1, Pass or Fail, etc.

Multinomial: In multinomial logistic regression, the dependent variable can have 3 or more possible
unordered types, such as "cat", "dog", or "sheep".

Ordinal: In ordinal logistic regression, the dependent variable can have 3 or more possible ordered
types, such as "low", "medium", or "high".

Python Implementation of Logistic Regression (Binomial)

To understand the implementation of Logistic Regression in Python, we will use the below example:
