Aiml M3 C3


Chapter 5

Regression Analysis
5.1 INTRODUCTION TO REGRESSION
• Regression analysis is the premier method of supervised
learning.
• It is one of the most popular and oldest supervised
learning techniques.
• Given a training dataset D containing N training points (xi, yi),
where i = 1...N, regression analysis is used to model the
relationship between one or more independent variables xi and
a dependent variable yi.
• The relationship between the dependent and independent
variables can be represented as a function as follows:
y = f(x)
Here, the feature variable x is also known as an explanatory
variable, a predictor variable, an independent variable, a
covariate, or a domain point.
y is the dependent variable. Dependent variables are also called
labels, target variables, or response variables.

Regression analysis determines the change in the response
variable when one explanatory variable is varied while keeping
all other parameters constant.
This is used to determine the relationship each of the explanatory
variables exhibits with the response variable.
Thus, regression analysis is used for prediction and forecasting.
• Regression is used to predict continuous variables or
quantitative variables such as price and revenue.
• Thus, the primary concern of regression analysis is to find
answers to questions such as:
1. What is the relationship between the variables?
2. What is the strength of the relationship?
3. What is the nature of the relationship, such as linear or non-linear?
4. What is the relevance of the attributes?
5. What is the contribution of each attribute?
• There are many applications of regression analysis. Some
applications of regression include predicting:
1. Sales of goods or services
2. Value of bonds in portfolio management
3. Premiums on insurance policies
4. Yield of crops in agriculture
5. Prices of real estate
For Understanding:

Regression:
• A regression model determines a relationship between an independent
variable and a dependent variable by providing a function.
• Formulating a regression analysis helps you predict the effects of
the independent variable on the dependent one.

• Example: age and height can be described using a linear
regression model.
• Since a person's height increases as age increases, they have a
linear relationship.
For Understanding:

• Correlation means there is a relationship or pattern between the
values of two variables. Causation means that one event causes
another event to occur. (OR)
• Correlation means there is a statistical association between
variables. Causation means that a change in one variable causes a
change in another variable.
• Causation is also referred to as cause and effect. Sometimes when
two variables are correlated, the relationship is coincidental, or a
third factor is causing them both to change.
5.2 INTRODUCTION TO LINEARITY,
CORRELATION, AND CAUSATION
• The quality of the regression analysis is determined by factors
such as correlation and causation.
Regression and Correlation
• Correlation between two variables can be assessed effectively using a
scatter plot, which is a plot of the explanatory variable against the
response variable.
• It is a 2D graph showing the relationship between two variables.
• The x-axis of the scatter plot shows the independent (input or
predictor) variable, and the y-axis shows the dependent (output or
predicted) variable.
• The scatter plot is useful in exploring data.
• Some of the scatter plots are shown in Figure 5.1. The Pearson
correlation coefficient is the most common test for determining
whether there is an association between two variables; a small
computation sketch follows these points.
• The correlation coefficient is denoted by r.
• The positive, negative, and random correlations are given in
Figure 5.1
• In positive correlation, an increase in one variable is associated
with an increase in the other variable.
• In negative correlation, the relationship between the variables is
reciprocal, while in random correlation, no relationship exists
between the variables.
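As an illustration, here is a minimal sketch (assuming NumPy and Matplotlib are installed; the data values are invented for the example) that computes Pearson's r and draws the scatter plot:

# Sketch: Pearson's correlation coefficient r and a scatter plot.
# Illustrative data only; a strong upward trend gives r close to +1.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)  # explanatory variable
y = np.array([2, 4, 5, 4, 6, 7], dtype=float)  # response variable

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print("Pearson r =", r)      # close to +1 -> strong positive correlation

plt.scatter(x, y)            # scatter plot: x vs y
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.show()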
For Understanding:

• There are three ways to describe correlations between variables.
• Positive correlation: The more time you spend running on a treadmill,
the more calories you will burn. As the temperature goes up, ice
cream sales also go up.
• Zero correlation: The height of students and their average exam scores
have a correlation of zero. The shoe size of individuals and the
number of movies they watch per year have a correlation of zero.
• Negative correlation: The more it rains, the less you can water the
garden. The more you cook at home, the less you might eat out.

While correlation is about relationships among variables, say x and y, regression is about predicting one
variable given another variable.
Regression and Causation
• Causation is about causal relationship among variables, say x and
y.
• Causation means knowing whether x causes y to happen or vice
versa. x causes y is often denoted as x implies y.
• Correlation and regression relationships are not the same as
causation.
• For example, a correlation between economic background and marks
scored does not imply that economic background causes high marks.
• Similarly, the relationship between higher sales of cool drinks and
a rise in temperature is not a causal relation.
• Even though high temperature contributes to cool drink sales, the
sales depend on other factors too.
Linearity and Non-linearity Relationships
• A linear relationship between the variables means that the
relationship between the dependent and independent variables
can be visualized as a straight line.
• A line of the form y = ax + b can be fitted to the data points
to indicate the relationship between x and y.
• By linearity, it’s meant that as one variable increases, the
corresponding variable also increases in a linear manner.
• A linear relationship is shown in Figure 5.2(a).
• A non-linear relationship exists in functions such as the exponential
function and the power function, as shown in Figures 5.2(b) and 5.2(c).
• Here, the x-axis shows the x data and the y-axis shows the y data.
Types of Regression Methods
• The classification of regression methods is shown in Figure 5.3.
Linear Regression
It is a type of regression where a line is fitted to the given data for
finding the linear relationship between one independent variable and
one dependent variable to describe relationships.
It creates a hypothetical line that best fits all the data points.
Syntax:
• y = θx + b
• where,
• θ – the model weight (slope) parameter
• b – the bias (intercept)
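A minimal sketch of fitting such a line, assuming scikit-learn is available; the data points are invented for illustration:

# Sketch: fitting y = theta*x + b with scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # one independent variable
y = np.array([3, 5, 7, 9, 11])           # one dependent variable

model = LinearRegression().fit(X, y)
print("theta (weight):", model.coef_[0])  # slope, about 2
print("b (bias):", model.intercept_)      # intercept, about 1
print("prediction at x=6:", model.predict([[6]])[0])  # about 13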

Multiple Regression
It is a type of regression where a line is fitted for finding the linear
relationship between two or more independent variables and one
dependent variable to describe relationships among variables. Ex: you
might model salary in terms of education, experience, and proximity
to a metropolitan area, as in the sketch below.
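A minimal sketch of that salary example, assuming scikit-learn; all numbers are invented purely for illustration:

# Sketch: multiple regression on invented data.
# Feature columns: years of education, years of experience,
# distance to a metropolitan area (km).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[16, 2, 10],
              [18, 5, 5],
              [12, 10, 30],
              [16, 8, 15],
              [20, 3, 2]], dtype=float)
y = np.array([55, 80, 60, 75, 90], dtype=float)  # salary in $1000s

model = LinearRegression().fit(X, y)
print("weight per feature:", model.coef_)  # one coefficient per independent variable
print("intercept:", model.intercept_)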
Polynomial Regression
It is a type of non-linear regression method for describing
relationships among variables, where an Nth-degree polynomial is used
to model the relationship between one independent variable and
one dependent variable.
Polynomial multiple regression is used to model two or more
independent variables and one dependent variable.
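A minimal sketch using NumPy's polyfit, fitting a 2nd-degree polynomial to invented data:

# Sketch: polynomial regression with numpy.polyfit (illustrative data).
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 5, 10, 17, 26], dtype=float)  # roughly y = x**2 + 1

coeffs = np.polyfit(x, y, deg=2)  # fit an Nth-degree (here 2nd) polynomial
print("coefficients (highest degree first):", coeffs)  # about [1, 0, 1]

p = np.poly1d(coeffs)  # turn the coefficients into a callable polynomial
print("prediction at x=6:", p(6))  # about 37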

Logistic Regression
It is used for predicting a categorical dependent variable from one or
more independent variables. This is also known as a binary classifier.
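A minimal sketch with scikit-learn; the hours-studied/pass-fail data is invented for illustration:

# Sketch: logistic regression as a binary classifier (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])  # e.g., hours studied
y = np.array([0, 0, 0, 1, 1, 1])              # fail (0) / pass (1)

clf = LogisticRegression().fit(X, y)
print("predicted class at x=3.5:", clf.predict([[3.5]])[0])
print("class probabilities:", clf.predict_proba([[3.5]])[0])  # [P(fail), P(pass)]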
Lasso and Ridge Regression Methods
Ridge regression is another machine learning analysis you might
use when there is a strong correlation between independent
variables, i.e., as one independent variable changes, others tend
to change with it.
Lasso regression, or least absolute shrinkage and selection
operator (LASSO), uses regularization in its objective function to
penalize the size of the regression coefficients.
These are special variants of the regression method where
regularization is used to limit the number and size of the
coefficients of the independent variables.
Multicollinearity is a statistical concept where several independent variables in a model are correlated.
It makes the model hard to interpret and also creates an overfitting problem; a sketch of ridge and lasso on such data follows.
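A minimal sketch, assuming scikit-learn, on two nearly identical (multicollinear) features; alpha is the regularization strength and the data is synthetic:

# Sketch: ridge and lasso on correlated (multicollinear) features.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)  # x2 is nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks both coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # can drive one coefficient to zero
print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)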
5.3 INTRODUCTION TO LINEAR REGRESSION
• In linear regression, a line of the form y = ax + b is fitted to the
data, where a is the slope and b is the intercept.
• The intercept is the value of y when x = 0.
• The computation of this equation can be shown step by step, as in
the sketch below:
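A minimal sketch of that step-by-step computation using the standard least-squares formulas, slope a = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2) and intercept b = y_mean - a * x_mean; the data is illustrative:

# Sketch: slope and intercept of y = a*x + b by least squares (illustrative data).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

x_mean, y_mean = x.mean(), y.mean()  # step 1: means of x and y
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # step 2: slope
b = y_mean - a * x_mean              # step 3: intercept (value of y when x = 0)
print("slope a =", a, " intercept b =", b)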
5.4 VALIDATION OF REGRESSION METHODS
• The regression model should be evaluated using some metrics to
check its correctness. Commonly used metrics for validating the
results of regression include Mean Absolute Error (MAE), Mean
Squared Error (MSE), Root Mean Squared Error (RMSE), and the
coefficient of determination (R^2).
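A minimal sketch computing these metrics with scikit-learn on illustrative values:

# Sketch: common regression validation metrics (illustrative values).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # actual values
y_pred = np.array([2.8, 5.3, 6.9, 9.4])  # model predictions

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # in the same units as y
r2 = r2_score(y_true, y_pred)              # 1.0 means a perfect fit
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R^2={r2:.3f}")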
