5) Multiple Regression
By Patitapaban Sahu
Multiple Linear Regression
Regression models are used to describe relationships between variables by fitting a line to the
observed data. Regression allows you to estimate how a dependent variable changes as the
independent variable(s) change.
Multiple linear regression is used to estimate the relationship between two or more
independent variables and one dependent variable. You can use multiple linear regression
when you want to know:
How strong the relationship is between two or more independent variables and one dependent
variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
The value of the dependent variable at a certain value of the independent variables (e.g. the
expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
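In general, a multiple linear regression model with k independent variables takes the form

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon

where y is the dependent variable, x_1, ..., x_k are the independent variables, \beta_0 is the intercept, \beta_1, ..., \beta_k are the regression coefficients, and \epsilon is the error term.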
Example
You are a public health researcher interested in social factors that influence heart
disease. You survey 500 towns and gather data on the percentage of people in
each town who smoke, the percentage of people in each town who bike to work,
and the percentage of people in each town who have heart disease.
Because you have two independent variables and one dependent variable, and all
your variables are quantitative, you can use multiple linear regression to analyze
the relationship between them.
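A minimal sketch of how such an analysis could be run in Python, assuming the survey results are stored in a hypothetical file heart_data.csv with columns named biking, smoking, and heart_disease (one row per town, all values in percent):

import pandas as pd
import statsmodels.formula.api as smf

# Load the survey data for the 500 towns (hypothetical file and column names).
towns = pd.read_csv("heart_data.csv")

# Fit heart_disease as a linear function of the two independent variables.
model = smf.ols("heart_disease ~ biking + smoking", data=towns).fit()

# The summary reports the intercept, one coefficient per independent variable,
# and the t-statistic and p-value for each coefficient.
print(model.summary())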
Assumptions of multiple linear regression
Multiple linear regression makes all of the same assumptions as simple linear regression:
Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t
change significantly across the values of the independent variables.
Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of
grouping factor.
In addition, in multiple linear regression it is possible that some of the independent variables are
correlated with one another, so it is important to check for such correlations before developing the
regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of
them should be used in the regression model.
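A quick way to screen for this, sketched below using the same hypothetical heart_data.csv columns as the example above, is to look at the pairwise correlation between the independent variables before fitting the model:

import pandas as pd

# Hypothetical file and column names from the heart disease example.
towns = pd.read_csv("heart_data.csv")

# Pearson correlation between the two independent variables.
r = towns["biking"].corr(towns["smoking"])
print("r =", r, " r^2 =", r ** 2)

# If r^2 is above roughly 0.6, keep only one of the two variables
# in the regression model.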
To find the best-fit line for each independent variable, multiple linear regression
calculates three things:
The regression coefficients that lead to the smallest overall model error.
The F-statistic of the overall model.
The associated p-value (how likely it is that the F-statistic would have occurred
by chance if the null hypothesis of no relationship between the independent and
dependent variables were true).
It then calculates the t-statistic and p-value for each regression coefficient in the
model.
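In standard matrix notation (a general sketch of the underlying computation, not specific to any one dataset), with X the matrix of independent-variable values (including a column of ones for the intercept) and y the vector of observed values of the dependent variable, the estimated coefficients are those that minimize the sum of squared errors:

\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^{2} = (X^{\top} X)^{-1} X^{\top} y

Each coefficient's t-statistic is its estimate divided by its standard error,

t_j = \hat{\beta}_j / \mathrm{SE}(\hat{\beta}_j),

and its p-value is obtained by comparing t_j against a t-distribution.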