Regression Monograph
Contents

1. Introduction
2. What is Regression?
3. What is Pairwise Correlation?
4. Simple Linear Regression (SLR)
   4.1 Definition
   4.2 The method of Ordinary Least Squares (OLS)
      4.2.1 Assumptions of the OLS method
   4.3 Examining the statistical significance of the regression model
      4.3.1 Significance of the regression slope
      4.3.2 The coefficient of determination R²
   4.4 Residual Plot
5. Multiple Linear Regression Analysis
   5.1 Definition of Multiple Linear Regression Analysis
   5.2 The method of ordinary least squares
   5.3 When the predictor is categorical
   5.4 Multi-collinearity
      5.4.1 Examining significance of the multiple regression model
      5.4.2 R² and Adjusted R²
      5.4.3 Handling the categorical predictor
      5.4.4 Residual Analysis
6. Further Discussions and Considerations
   6.1 Regression ANOVA
   6.2 Leverage points
List of Tables

Table 1: Observed response, estimated response (based on density) and residuals
Table 2: Observed response, estimated response (based on final MLR model) and residuals
1. Introduction

Linear regression is one of the simplest statistical tools used to analyse the dependence of a response on one or more predictors. In this monograph, we discuss how it works, what information it provides, and what one needs to be careful about while performing linear regression.
2. What is Regression?

Important Note: Regression is a widely applicable tool that can also accommodate many types of non-linear relationships. Linear regression is only a small subset of all possible regression functions used across various domains.
Case Study:

A top wine manufacturer wants to invest in new technologies to improve its wine quality. Wine quality depends directly on the amount of alcohol in the wine and on its smoothness, both of which are controlled by various chemicals that are either added during the manufacturing process or generated through chemical reactions. Wine certification and quality assessment are key elements in wine gradation and pricing, and certification is determined by various physicochemical properties of the wine. Therefore, the company wants to estimate the percentage (%) of alcohol in a bottle of wine as a function of the wine's various chemical components.
Statement of the Problem: Regress alcohol percentage on the chemical components present in
the wines.
It is possible to regress alcohol on each of the 10 continuous predictors one at a time and explore their individual relationships. However, in order to avoid repetition, only one predictor (density) has been considered.
Before a regression model of alcohol on the predictors can be built, it is necessary to investigate
whether there exists any dependence among the observed variables.
3. What is Pairwise Correlation?

The (Pearson) correlation coefficient between two attributes 𝑋 and 𝑌 is defined as

𝑟 = ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)(𝑦ᵢ − 𝑦̅) / √[∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)² · ∑ᵢ₌₁ⁿ (𝑦ᵢ − 𝑦̅)²]

Here, 𝑋 and 𝑌 are the pair of attributes measured on 𝑛 units. From the i-th unit a pair of observations (𝑥ᵢ, 𝑦ᵢ) is obtained, and 𝑥̅ and 𝑦̅ are the respective means. The value of 𝑟 ranges from −1 to +1, both values included.

𝒓 = −𝟏: denotes a perfect negative correlation between 𝑋 and 𝑌. If the (x, y) pairs are plotted, the points fall on a straight line with negative slope: X and Y are perfectly linearly related, but as one variable increases, the other decreases.

𝒓 = 𝟎: denotes that no linear correlation exists between 𝑋 and 𝑌.

𝒓 = +𝟏: denotes a perfect positive correlation between 𝑋 and 𝑌. If the (x, y) pairs are plotted, the points fall on a straight line with positive slope: X and Y are perfectly linearly related, and as one variable increases, the other also increases.
A plot of the (x, y) pairs is known as a scatterplot. In Fig 1, the first scatterplot, corresponding to r = 0, shows no discernible pattern. The other two scatterplots exhibit moderate correlations of equal magnitude (0.6) but in opposite directions.
The primary objective here is to determine whether any dependence exists between alcohol (%) and three selected chemical components, FA, pH and density, individually.
Exploratory Analysis (EDA) on Wine Data
We first load the “WineData” dataset into Python.
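The load step itself is not shown in the monograph; the following is a minimal sketch, assuming the data is available as a CSV file (the file name WineData.csv is an assumption):

import pandas as pd

# Read the wine dataset into a DataFrame (file name assumed)
WineData = pd.read_csv("WineData.csv")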
WineData.shape
(1599, 13)
WineData.columns
Index(['ID', 'Brand', 'FA', 'VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD',
'density', 'sulphate', 'pH', 'alcohol'],dtype='object')
The first five rows of the data:

ID  Brand           FA    VA    CA    RS   chloride  FSD   TSD   density  pH    sulphate  alcohol
1   Seagram          7.4  0.70  0.00  1.9  0.076     11.0  34.0  0.9978   3.51  0.56      9.4
2   Seagram          7.8  0.88  0.00  2.6  0.098     25.0  67.0  0.9968   3.20  0.68      9.8
3   Seagram          7.8  0.76  0.04  2.3  0.092     15.0  54.0  0.9970   3.26  0.65      9.8
4   Sula Vineyards  11.2  0.28  0.56  1.9  0.075     17.0  60.0  0.9980   3.16  0.58      9.8
5   Seagram          7.4  0.70  0.00  1.9  0.076     11.0  34.0  0.9978   3.51  0.56      9.4
The dataset contains 1599 observations on 13 variables. The first column is “ID”, which is just
a label and will not be used in the analysis. The second column is “Brand”, which is the only
categorical variable present in the data.
A summary of each of the quantitative variables (count, mean, standard deviation and the five-number summary) is presented below. For Brand, the count in each category is shown.
# Summary of all variables
WineData[WineData.columns[2:13]].describe().transpose()
WineData['Brand'].value_counts()

Seagram           633
Sula Vineyards    553
Grover Zampa      413
Name: Brand, dtype: int64
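These pairwise correlations can also be computed directly with pandas. A minimal sketch (the column list mirrors the continuous variables in the data):

# Correlation of alcohol (%) with each continuous predictor
continuous_cols = ['FA', 'VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD',
                   'density', 'sulphate', 'pH', 'alcohol']
corr_with_alcohol = WineData[continuous_cols].corr()['alcohol'].drop('alcohol')
print(corr_with_alcohol.sort_values())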
The correlation coefficient, however, cannot determine what value the response will take for a given value of the predictor(s). For that, a regression model must be built.
4. Simple Linear Regression (SLR)

4.1 Definition

The formal definition of simple linear regression: let n pairs of observations (𝑥ᵢ, 𝑦ᵢ), i = 1, 2, …, n, be available on two features, one of which is assumed to depend on the other. Typically Y denotes the dependent variable and X the independent variable. Let 𝐸(𝑌) denote the expected, or mean, value of 𝑌.

Simple linear regression relates the response Y to the single predictor variable X through a straight line. The mathematical formulation of the simple linear regression line is

𝐸(𝑌) = 𝛽₀ + 𝛽₁𝑋

where
𝑌 is the continuous response (dependent) variable,
𝛽₀ and 𝛽₁ are the intercept and slope coefficients, respectively, known as the regression parameters, and
𝑋 is the continuous independent (predictor) variable.
It is assumed that the expected value of the response is a linear function of the predictor. When 𝛽₀ and 𝛽₁ are known, a given value of the predictor specifies the expected value of the response.

However, since Y is a random variable, not all values 𝑦ᵢ will equal 𝐸(𝑌). An equivalent form of the SLR equation is

𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋ᵢ + 𝜀ᵢ,   i = 1, 2, …, n

where 𝜀ᵢ represents the unobservable error term. Note that “error” here does not indicate a mistake; it is simply the difference between the expected and observed values of the response. The error terms provide crucial insight into the regression process.
4.2 The method of Ordinary Least Squares (OLS)

The simple linear regression model involves the unknown parameters 𝛽₀ and 𝛽₁, which need to be estimated from data. There are several methods of estimating the parameters; the simplest and most widely used is the method of Ordinary Least Squares (OLS).

Given 𝑛 pairs of observations on 𝑌 and 𝑋, the objective is to minimise the sum of squared errors and thus obtain appropriate estimates of 𝛽₀ and 𝛽₁. We want to minimise

∑ᵢ₌₁ⁿ 𝜀ᵢ² = ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝛽₀ − 𝛽₁𝑋ᵢ)²

Minimising this sum explicitly yields the following estimates of 𝛽₀ and 𝛽₁:
𝛽̂₁ = ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)(𝑦ᵢ − 𝑦̅) / ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)²,   𝛽̂₀ = 𝑌̅ − 𝛽̂₁𝑋̅

𝛽̂₁ has an equivalent representation 𝛽̂₁ = 𝑟𝑠𝑦/𝑠𝑥, where 𝑟 is the correlation coefficient between 𝑋 and 𝑌, and 𝑠𝑥 and 𝑠𝑦 are the respective standard deviations of 𝑋 and 𝑌. Since 𝑠𝑥 and 𝑠𝑦 are both positive quantities, 𝛽̂₁ has the same sign as 𝑟: for two positively correlated variables 𝛽̂₁ will be positive, and for two negatively correlated variables 𝛽̂₁ will be negative.
The objective here is to determine the dependence of alcohol (%) on density. Though any one of the 10 continuous predictors could have been used, density is chosen because of its relatively high correlation with the response. A scatterplot of alcohol (%) versus density gives a visual impression of whether a linear function of density is at all suitable to describe alcohol (%).
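The fitting code is not reproduced above; the following is a minimal sketch consistent with the summary output reported later (the model name mod matches the later summary() call):

from statsmodels.formula.api import ols

# Fit the simple linear regression of alcohol (%) on density by OLS
mod = ols('alcohol ~ density', data=WineData).fit()
print(mod.params)

# Sanity check: the OLS slope equals r * s_y / s_x
r = WineData['alcohol'].corr(WineData['density'])
print(r * WineData['alcohol'].std() / WineData['density'].std())   # approx -280.16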
We first note that the sign of 𝛽̂₁ is negative. This shows that the two variables are inversely related: as one increases, the other decreases. This confirms our expectation that alcohol (%) and density move in opposite directions, and we get a straight line with negative slope.

The sign of the regression slope and that of the correlation coefficient will always be the same. The regression slope measures the change in the response for a one-unit change in the predictor; its sign indicates the direction of that change.
The value of 𝛽̂₁ indicates that if density increases by 1 unit, the estimated alcohol (%) decreases by 280.16 units. However, the response being a percentage, its values are bounded between 0 and 100. So is our interpretation misleading? From the EDA it can be seen that the values of density lie mostly between 0.99 and 1.00, so the range of density is approximately 0.01. While interpreting a regression parameter it is important to pay heed to the range of values of the predictor. In this case density cannot be expected to change by 1 unit; realistic changes in density are of the order of 1/1000. Hence the following statement is appropriate in this case:

If density increases by 0.001 unit, then the estimated alcohol (%) decreases by 0.28 unit.
The intercept term is the estimated value of the response when the predictor is 0. However, the intercept term is not always interpretable, as in this case.
The following graph shows the OLS regression line (in blue) through the scatterplot.
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Title string built from the fitted coefficients
equation = 'alcohol = 289.68 - 280.16 * density'

a4_dims = (10, 5)
fig, ax = plt.subplots(figsize=a4_dims)
a = sns.regplot(x="density", y="alcohol", data=WineData, ax=ax)
a.set_title('Model: ' + equation, fontsize=15)
# Displaying the plot
plt.show()
Given a particular value of density, and using the estimated values of 𝛽₀ and 𝛽₁, estimated or fitted values of alcohol (%) can be obtained. The fitted values may or may not equal the observed values of the response.
Let us look at alcohol (%) in the wines at several levels of density.
Table 1: Observed response, estimated response (based on density) and residuals

Observation   Density    Observed response (Y)   Estimated response (𝑌̂ = 289.68 − 280.16 × density)   Residual (𝜀̂ = 𝑌 − 𝑌̂)
355           0.9912     11.9                    11.98                                                 −0.08
481           1.0026      9.2                     8.79                                                  0.41
609           1.0026     10.4                     8.79                                                  1.61
954           0.99458    12.1                    11.04                                                  1.06
1198          0.99458     9.8                    11.04                                                 −1.24
1201          0.99458     9.8                    11.04                                                 −1.24
1460          0.99458    11.9                    11.04                                                  0.86
The difference between the observed and fitted values of the response is called the residual. Residuals may be either positive or negative.
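A table such as Table 1 can be assembled directly from the fitted model. A minimal sketch (which rows to display is an illustrative choice):

# Observed values, fitted values and residuals for every observation
fits = pd.DataFrame({'density': WineData['density'],
                     'observed': WineData['alcohol'],
                     'fitted': mod.fittedvalues.round(2),
                     'residual': mod.resid.round(2)})
print(fits.head())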
4.2.1 Assumptions of the OLS method

The linear regression model rests on 4 important assumptions, often remembered by the acronym LINE:

Assumption 1 (Linearity): The regression model is linear in the parameters, i.e. 𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋ᵢ + 𝜀ᵢ.
Assumption 2 (Independence): The observations 𝑌ᵢ (equivalently, the error terms 𝜀ᵢ) are independent.
Assumption 3 (Normality): The error terms 𝜀ᵢ are normally distributed.
Assumption 4 (Equal variance): The errors have no bias (that is, 𝐸(𝜀ᵢ) = 0) and they are homoscedastic, that is, they have equal variance.
4.3 Examining the statistical significance of the regression model

4.3.1 Significance of the regression slope

Estimating the regression coefficients is only the first step of regression model fitting: OLS will produce estimates of the regression intercept and slope for any sample data. However, it is of vital importance to know whether the population regression slope is significantly different from 0. If the population regression slope is not significantly different from 0, then there is no regression of Y on X.
We now need to know: is density at all statistically significant in explaining alcohol (%)? Let us consider the test of hypothesis 𝐻₀: 𝛽₁ = 0 vs. 𝐻₁: 𝛽₁ ≠ 0, where 𝛽₁ is the regression coefficient of density when alcohol (%) is regressed on density.
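The test statistic is t = 𝛽̂₁/se(𝛽̂₁), compared against a t distribution with n − 2 degrees of freedom. A sketch of reading the statistic and p-value off the fitted model:

# t statistic and p-value for the density slope, as reported in the summary below
print(mod.tvalues['density'])   # approx -280.1638 / 12.267 = -22.84
print(mod.pvalues['density'])   # effectively 0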
# Significance of regression
print(mod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: alcohol R-squared: 0.246
Model: OLS Adj. R-squared: 0.246
Method: Least Squares F-statistic: 521.6
Date: Wed, 22 Jan 2020 Prob (F-statistic): 3.94e-100
Time: 16:49:11 Log-Likelihood: -2144.1
No. Observations: 1599 AIC: 4292.
Df Residuals: 1597 BIC: 4303.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 289.6753 12.227 23.691 0.000 265.692 313.659
density -280.1638 12.267 -22.838 0.000 -304.226 -256.102
==============================================================================
Omnibus: 147.785 Durbin-Watson: 1.460
Prob(Omnibus): 0.000 Jarque-Bera (JB): 193.960
Skew: 0.768 Prob(JB): 7.62e-43
Kurtosis: 3.743 Cond. No. 1.06e+03
Observe that the p-value corresponding to density is very small; thus the null hypothesis 𝐻₀: 𝛽₁ = 0 is rejected, which in turn indicates that density is significant in explaining alcohol (%).
Statistical significance alone, however, is not enough to decide whether the predictor is useful in explaining the variability in the response. Is density enough to explain a large part of the variation in alcohol? This leads us to the concept of the coefficient of determination, 𝑅².

4.3.2 The coefficient of determination R²

The coefficient of determination R² is a summary measure of how well the sample regression line fits the data. The rationale and computation of R² are discussed below.
For each observation, the residual is

𝜀̂ᵢ = 𝑌ᵢ − 𝑌̂ᵢ

Residuals are a very important part of regression and have many useful properties. In fact, it can be shown that ∑ᵢ₌₁ⁿ 𝜀̂ᵢ = 0 when the estimation method is OLS.

∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̅)² is the total variation of the actual 𝑌 values about their sample mean, termed the total sum of squares (SST). This is closely linked to the sample variance of Y.

∑ᵢ₌₁ⁿ (𝑌̂ᵢ − 𝑌̅)² is the sum of squares due to regression, called the regression sum of squares (SSR).

∑ᵢ₌₁ⁿ 𝜀̂ᵢ² = ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̂ᵢ)² is the sum of squared differences between the observed and predicted values of the response, known as the residual sum of squares or the error sum of squares (SSE).

These three quantities satisfy the decomposition SST = SSR + SSE. Dividing both sides by SST gives

1 = SSR/SST + SSE/SST

We now define R² as R² = SSR/SST or, equivalently, R² = 1 − SSE/SST.
𝑅² measures the proportion of the total variation in 𝑌 that is explained by the regression model; its value satisfies 0 ≤ 𝑅² ≤ 1. The higher the value of 𝑅², the more powerful the predictor is in predicting the response. A regression model with a high 𝑅² value fits the data well: a high proportion of the variance in the response is explained by the dependence of the response on the predictor.

For the current model, the value of 𝑅² turns out to be approximately 25%. This means density is able to explain around 25% of the total variation in alcohol (%).
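As a cross-check, R² can be recomputed from its definition and compared with the statsmodels attribute; a minimal sketch:

import numpy as np

# Recompute R^2 = 1 - SSE/SST for the SLR model
y = WineData['alcohol']
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
sse = np.sum(mod.resid ** 2)         # residual (error) sum of squares
print(1 - sse / sst, mod.rsquared)   # both approx 0.246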
It is clear that, even though the regression of alcohol (%) on density is significant (p-value < 0.05), density by itself is not sufficient to explain the variation in the response to a satisfactory degree. Later we will investigate whether the other predictors in the data set help explain more of the variation in the response (Multiple Linear Regression).
4.4 Residual Plot

Below we show the residual plot for the regression of alcohol (%) on density.
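The helper regression_plots called below is not defined anywhere in the monograph; the following is a minimal sketch of what such a diagnostic helper might look like (residuals vs fitted values plus a normal Q-Q plot; the plots in the original figure may differ):

import matplotlib.pyplot as plt
import scipy.stats as stats

def regression_plots(model, data):
    # data is kept in the signature to match the call below, though unused here
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    # Residuals vs fitted values: look for curvature or unequal spread
    axes[0].scatter(model.fittedvalues, model.resid, alpha=0.4)
    axes[0].axhline(0, color='red', linestyle='--')
    axes[0].set_xlabel('Fitted values')
    axes[0].set_ylabel('Residuals')
    # Normal Q-Q plot: points near the line support the normality assumption
    stats.probplot(model.resid, dist="norm", plot=axes[1])
    plt.tight_layout()
    plt.show()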
regression_plots(mod,WineData)
From these diagnostic plots, we may infer that the regression assumptions are not violated.

It is clear from the discussion above that density, despite having a significant slope when alcohol (%) is regressed on it, explains only about 25% of the total variability in the response. It is natural, then, to examine whether the inclusion of the other predictors contributes towards explaining the variability in the response, and if so, to what degree.
5. Multiple Linear Regression Analysis

5.1 Definition of Multiple Linear Regression Analysis

Multiple linear regression (MLR) relates the response Y to k predictors 𝑋₁, 𝑋₂, …, 𝑋ₖ through the equation

𝐸(𝑌) = 𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝛽ₖ𝑋ₖ

It is assumed that the expected value of the response is a linear function of all the k predictors. When the regression coefficients 𝛽₀ and 𝛽ⱼ, 𝑗 = 1, …, 𝑘, are all known, or estimated, a given combination of predictor values specifies the expected value of the response.

However, since Y is a random variable, not all values 𝑦ᵢ will equal 𝐸(𝑌). An equivalent form of the MLR equation is

𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋₁ᵢ + 𝛽₂𝑋₂ᵢ + ⋯ + 𝛽ₖ𝑋ₖᵢ + 𝜀ᵢ,   i = 1, 2, …, n

where 𝜀ᵢ represents the unobservable error term.

Formally the parameters 𝛽₁, 𝛽₂, …, 𝛽ₖ are called partial regression coefficients, since the coefficient 𝛽ⱼ measures the change in the value of the response due to a unit change in the value of 𝑋ⱼ, keeping all other variables fixed. We will refer to 𝛽₀, 𝛽₁, 𝛽₂, …, 𝛽ₖ simply as regression coefficients.
5.2 The method of ordinary least squares

As in simple linear regression, the regression coefficients are estimated by the method of ordinary least squares, now minimising the sum of squared errors over all k + 1 parameters:

∑ᵢ₌₁ⁿ 𝜀ᵢ² = ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝛽₀ − 𝛽₁𝑋₁ᵢ − ⋯ − 𝛽ₖ𝑋ₖᵢ)²
5.3 When the predictor is categorical

Before the problem of fitting a multiple linear regression model is taken up, the case of a categorical predictor needs to be explicitly discussed. So far we have tacitly assumed that the response as well as the predictors are all continuous. The case of a categorical response is not covered by MLR. Let us discuss the situation when one or more predictors are categorical. A set of dummy variables, or indicator variables, is introduced corresponding to each categorical variable: a categorical variable with 𝑚 categories is represented by a set of 𝑚 − 1 dummy variables.

Indicator coding: the most common format of dummy variable coding, in which each category of the nominal variable is represented by either 1 or 0.

The regression equation and the concept of indicator coding using dummy variables are explained based on our case study.
If 𝑋₁, …, 𝑋ₖ are all continuous, then we can write the regression model of 𝑌 on them as

𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + ⋯ + 𝛽ₖ𝑋ₖ + 𝜀

But when a predictor is a categorical variable with 𝑚 levels, a small modification of the above is required.

Brand is a categorical predictor with three levels: Grover Zampa, Seagram and Sula Vineyards. Hence 𝑚 − 1 = 3 − 1 = 2 dummy variables need to be introduced in the model. The modified equation is

𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + ⋯ + 𝛽ₖ𝑋ₖ + 𝛽ₖ₊₁𝐷₁ + 𝛽ₖ₊₂𝐷₂ + 𝜀

where 𝐷₁ and 𝐷₂ are dummy variables taking the value 1 or 0 according to the brand of each observation.
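The code that produced the coefficient estimates listed below is not shown; the following is a sketch of a fit consistent with them (mod1 is an assumed name; the formula interface expands Brand into the two dummy variables automatically):

# MLR of alcohol on all continuous predictors plus the categorical Brand
mod1 = ols('alcohol ~ FA + VA + CA + RS + chloride + FSD + TSD + density'
           ' + sulphate + pH + Brand', data=WineData).fit()
print(mod1.params)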
FA 0.509555
VA 0.423617
CA 0.823537
RS 0.272455
chloride -1.303246
FSD -0.002695
TSD -0.001520
density -595.004881
sulphate 1.084578
pH 3.617430
dtype: float64
The regression coefficients corresponding to Brand act as constants: for the three levels of Brand, three different regression equations are obtained.
BrandSeagram takes the value 1 if Brand = Seagram, 0 otherwise
BrandSula Vineyards takes the value 1 if Brand = Sula Vineyards, 0 otherwise
When Brand = Grover Zampa, then both these variables take the value 0.
The explicit forms of the regression equations are:
(1) Brand = Grover Zampa:
𝑌̂ = 585.7 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH

(2) Brand = Seagram:
𝑌̂ = 585.7 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH − 0.23
or
𝑌̂ = 585.47 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH

(3) Brand = Sula Vineyards:
𝑌̂ = 585.7 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH − 0.003
or
𝑌̂ = 585.697 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH
The only difference among the three regression equations is in the intercept terms; the slopes corresponding to the continuous predictors remain the same. The sign of each slope indicates the direction in which the response changes when that predictor increases by one unit. A positive coefficient means that a unit increase in the corresponding predictor increases the response by the numerical value of the coefficient, provided all other predictors are held constant. A negative coefficient means that a unit increase in the corresponding predictor decreases the response by the numerical value of the coefficient, again with all other predictors held constant.
5.4 Multi-collinearity
In multiple regression, if one or more pairs of explanatory variables are highly correlated, the phenomenon is known as multi-collinearity.
Effects of multi-collinearity: Multi-collinearity is not desirable. It inflates the standard errors of the estimates of the regression coefficients, which in turn affects the significance of the regression parameters. The signs of the regression coefficients may even change. As a result the regression model becomes unreliable or loses interpretability.

The first thing one should do in multiple linear regression is to check whether multi-collinearity is present in the data.
Detection of multi-collinearity: There are several ways of detecting multi-collinearity, such as:

Correlation matrix: We can start by computing the pairwise correlations among all the independent variables; the independent variables should not be highly (positively or negatively) correlated. But this by itself is not enough, as the correlation matrix only detects high pairwise correlations. Even when no pairwise correlation is high, several moderately correlated pairs may together give rise to multi-collinearity.
Variance inflation factor: Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. The VIF for the k-th predictor is

VIFₖ = 1 / (1 − Rₖ²)

where Rₖ² is the coefficient of determination obtained by regressing the k-th predictor on the remaining predictors. It measures how much the variance of the estimated regression coefficient 𝛽̂ₖ is “inflated” by the existence of correlation among the predictor variables in the model.

General rule of thumb: If VIF is 1, there is no correlation between the k-th predictor and the remaining predictor variables, and the variance of 𝛽̂ₖ is not inflated at all. If VIF exceeds (or is close to) 5, there is moderate multi-collinearity, and if it reaches 10 or more, there are signs of high multi-collinearity.
# VIF
import pandas as pd
from statsmodels.formula.api import ols

# Expand the categorical Brand into 0/1 dummy columns so every predictor is numeric
df_new = pd.get_dummies(WineData, drop_first=True, dtype=int)

def vif_cal(input_data, dependent_col):
    # For each predictor: regress it on all the others and report 1 / (1 - R^2)
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        # Q() quotes column names containing spaces, e.g. 'Brand_Sula Vineyards'
        rhs = " + ".join(f"Q('{c}')" for c in xvar_names.drop(xvar_names[i]))
        rsq = ols(formula=f"Q('{xvar_names[i]}') ~ {rhs}", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], "VIF =", vif)

vif_cal(input_data=df_new.iloc[:, 1:14], dependent_col="alcohol")
FA VIF = 5.65
VA VIF = 1.79
CA VIF = 3.07
RS VIF = 1.31
chloride VIF = 1.48
FSD VIF = 1.98
TSD VIF = 2.24
density VIF = 2.91
pH VIF = 2.49
sulphate VIF = 1.4
Brand_Seagram VIF = 1.91
Brand_Sula Vineyards VIF = 1.59
Note that Brand is a nominal variable, so the notion of correlation is difficult to define in its case. However, a VIF is still output for the Brand dummies, and it should be ignored.

We observe that among the continuous predictors only FA has a sufficiently high VIF (5.65), indicating that it is substantially correlated with the other predictor variables. FA is therefore removed from the model.
# MLR with FA removed
mod2 = ols('alcohol ~ VA + CA + RS + chloride + FSD + TSD + density + sulphate + pH + Brand',
           data=WineData).fit()
coeff_MLR_wine2 = mod2.params
print(coeff_MLR_wine2)
Intercept 364.757981
Brand[T.Seagram] -0.416173
Brand[T.Sula Vineyards] -0.066258
VA 0.759100
CA 2.454701
RS 0.199998
Note that the coefficients of the remaining predictors have changed. We now check the VIFs of the reduced predictor set.
VA VIF = 1.77
CA VIF = 2.34
RS VIF = 1.23
chloride VIF = 1.32
FSD VIF = 1.96
TSD VIF = 2.04
density VIF = 1.51
pH VIF = 1.54
sulphate VIF = 1.39
Brand_Seagram VIF = 1.86
Brand_Sula Vineyards VIF = 1.58
We can see that after removing FA, all the predictors have low VIF (at most 2.34). So the
problem of multi-collinearity has been eliminated. In all our subsequent discussions, we will
consider the multiple linear regression model with FA removed.
5.4.1 Examining significance of the multiple regression model

It is not enough merely to fit a multiple regression model to the data; it is necessary to check whether the regression coefficients are significant, that is, whether the population regression parameters are significantly different from zero. For the j-th slope parameter, the null hypothesis of interest is H₀: 𝛽ⱼ = 0 versus H₁: 𝛽ⱼ ≠ 0. This test is done separately for each regression coefficient, including the parameters corresponding to the dummy variables where necessary. The test may not always make sense for the intercept parameter.
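The coefficient table this discussion refers to would have been produced by a call such as the following; only its warnings block is reproduced below:

# Coefficient table with t statistics and p-values for the reduced model
print(mod2.summary())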
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.48e+04. This might indicate that there are strong multicollinearity or other numerical problems.
From the above it may be noted that the regression coefficients corresponding to FSD and Brand = Sula Vineyards are not statistically significant at level α = 0.05. In other words, these two coefficients are not significantly different from 0 in the population.

Hence FSD may be eliminated from the multiple regression model. However, it is not recommended to eliminate the Brand level Sula Vineyards directly, because it is part of a system of dummy variables. The required adjustment is explained in Section 5.4.3.
5.4.2 R² and Adjusted R²

A high numerical value of 𝑅² gives an intuitive justification that the model works well, since the observed and predicted responses are close.

One limitation of the coefficient of determination is that its value increases as the number of independent variables in the model increases. A good regression model should include only those predictors whose coefficients are significantly different from 0. However, the numerical value of R² is non-decreasing even when non-significant predictors are added to the model, and adding non-significant predictors adversely affects the predictive quality of the model. Therefore another measure of model adequacy is needed.
Adjusted R² involves an adjustment based on the number of predictors relative to the sample size. It is defined as

Adj R² = 1 − [SSE/(n − p)] / [SST/(n − 1)] = 1 − [∑ᵢ₌₁ⁿ 𝜀̂ᵢ²/(n − p)] / [∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̅)²/(n − 1)]

Here p is the number of parameters in the regression model, including the intercept term.
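Equivalently, Adj R² = 1 − (1 − R²)(n − 1)/(n − p). A minimal sketch checking this identity against statsmodels:

# Adjusted R^2 from R^2, the sample size n and the parameter count p
n = int(mod2.nobs)
p = len(mod2.params)   # includes the intercept
print(1 - (1 - mod2.rsquared) * (n - 1) / (n - p), mod2.rsquared_adj)   # should agree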
Case Study continues

R² and adjusted R² are available from the output of summary(). The values of these two statistics are nearly equal (about 55%), although adjusted R² is slightly smaller than R².
5.4.3 Handling the categorical predictor

In the regression equation including all predictors, it was noted that Brand = Sula Vineyards is non-significant. Before any action is taken, let us understand what significance means for a categorical predictor in this situation.

When dealing with a categorical predictor with m nominal levels, the set of m − 1 indicator variables implicitly fixes one level as the baseline: the level for which all m − 1 indicator variables take the value 0. In this case study the baseline is Brand = Grover Zampa. Significance of a regression coefficient for a categorical variable means that the corresponding level is significantly different from the baseline.

Since the regression coefficient for Brand = Sula Vineyards is non-significant, the effect of Brand = Sula Vineyards is statistically indistinguishable from the baseline level Grover Zampa. This is also evident from the box plot (Fig 9).
#Brand
pd.DataFrame(WineData['Brand'].value_counts()).transpose()
round(WineData.groupby("Brand")["alcohol"].mean(),2)
Grover Zampa 10.88
Seagram 9.91
Sula Vineyards 10.67
Name: alcohol, dtype: float64
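The merging step that creates new_brand is not shown; the following is a sketch consistent with the summary output below (the 0/1 coding and the simultaneous removal of FSD are inferred from that output):

# Merge Grover Zampa and Sula Vineyards into a single baseline level (coded 0);
# Seagram, the only level that differs significantly from the baseline, is coded 1
WineData['new_brand'] = (WineData['Brand'] == 'Seagram').astype(int)

# Refit without FSD, which was also found to be non-significant
MLR_new = ols('alcohol ~ VA + CA + RS + chloride + TSD + density'
              ' + sulphate + pH + new_brand', data=WineData).fit()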
print(MLR_new.summary())
OLS Regression Results
===========================================================================================
Dep. Variable: alcohol R-squared: 0.556
Model: OLS Adj. R-squared: 0.553
Method: Least Squares F-statistic: 220.7
Date: Tue, 28 Jan 2020 Prob (F-statistic): 2.31e-272
Time: 20:52:58 Log-Likelihood: -1721.7
No. Observations: 1599 AIC: 3463.
Df Residuals: 1589 BIC: 3517.
Df Model: 9
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept 365.5364 11.582 31.560 0.000 342.818 388.255
VA 0.7443 0.131 5.698 0.000 0.488 1.000
CA 2.4537 0.138 17.770 0.000 2.183 2.725
RS 0.2025 0.014 14.534 0.000 0.175 0.230
chloride -4.4167 0.435 -10.158 0.000 -5.270 -3.564
TSD -0.0060 0.001 -10.320 0.000 -0.007 -0.005
density -361.9534 11.586 -31.240 0.000 -384.679 -339.228
sulphate 1.0068 0.124 8.152 0.000 0.765 1.249
pH 1.2808 0.142 8.994 0.000 1.001 1.560
new_brand -0.3784 0.040 -9.360 0.000 -0.458 -0.299
===========================================================================================
Omnibus: 118.826 Durbin-Watson: 1.567
Prob(Omnibus): 0.000 Jarque-Bera (JB): 208.529
Skew: 0.533 Prob(JB): 5.23e-46
Kurtosis: 4.412 Cond. No. 5.24e+04
===========================================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.24e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Notice that now all the predictors are statistically significant, with very low p-values. The value of R² has not decreased much, and adjusted R² is almost the same as R² (around 55%). We therefore select this as our final regression model.
# Coefficients
coeff_MLR = MLR_new.params
print(coeff_MLR)
Intercept 365.536428
VA 0.744273
CA 2.453709
RS 0.202516
chloride -4.416669
TSD -0.005956
density -361.953393
sulphate 1.006848
pH 1.280760
new_brand -0.378445
dtype: float64
Below we present some particular values of the predictors, the observed response, predicted response and the residuals.

Table 2: Observed response, estimated response (based on final MLR model) and residuals
5.4.4 Residual Analysis

Residual analysis for the final model proceeds exactly as for the simple linear regression model (Section 4.4). As a final check, the VIFs of the predictors in the final data set are also recomputed.

# VIF
vif_cal(input_data=WineData.iloc[:, 3:14], dependent_col="alcohol")
VA VIF = 1.76
CA VIF = 2.32
RS VIF = 1.23
chloride VIF = 1.32
FSD VIF = 1.94
TSD VIF = 2.04
density VIF = 1.51
pH VIF = 1.54
sulphate VIF = 1.38
new_brand VIF = 1.24
6. Further Discussions and Considerations

6.1 Regression ANOVA

Regression can also be viewed as a technique for reducing the variation in data: the total variation in the response is explained by the regression on one or more predictors. The ANOVA table describes this partition of the variance.

Corresponding to each component of the total variation there is an associated degrees of freedom (df). The total degrees of freedom in a sample of size n is always n − 1. The df for the partitioned sums of squares depend on the number of parameters to be estimated: if there are p parameters in the regression equation, the df corresponding to SSR is p − 1, and the df corresponding to SSE is (n − 1) − (p − 1) = n − p.
We present the ANOVA table corresponding to the final model in our case study.
# ANOVA table
aov_tbl = sm.stats.anova_lm(MLR_new)
print(aov_tbl)
We can see that each continuous predictor has 1 degree of freedom, and since our modified categorical predictor new_brand has 2 levels, it has 2 − 1 = 1 degree of freedom. The residual sum of squares has 1589 degrees of freedom, and the total sum of squares has 1598 degrees of freedom.
6.2 Leverage points

Leverage points are observations whose predictor values lie far from the bulk of the data; such points can exert a disproportionate influence on the fitted regression line and should be examined before the model is finalised.
The overall model-building workflow can be summarised in the following steps:

1. Perform multiple linear regression (e.g., with the ols() function) and compute the VIFs.
2. If any VIF > 5, remove the predictor with the largest VIF and rerun the MLR with the remaining variables; repeat until no VIF exceeds 5.
3. If any predictor is non-significant, remove the non-significant continuous predictors, merge non-significant levels of the categorical predictor with the baseline, and redefine the levels of the categorical predictor; then refit.
4. When no non-significant predictor remains, the model is final.