Regression Monograph
Contents

1. Introduction
2. What is Regression?
3. What is Pairwise Correlation?
4. Simple Linear Regression (SLR)
   4.1 Definition
   4.2 The method of Ordinary Least Squares (OLS)
      4.2.1 Assumptions of the OLS method
   4.3 Examining the statistical significance of the regression model
      4.3.1 Significance of the regression slope
      4.3.2 The coefficient of determination R²
   4.4 Residual Plot
5. Multiple Linear Regression Analysis
   5.1 Definition of Multiple Linear Regression Analysis
   5.2 The method of ordinary least squares
   5.3 When the predictor is categorical
   5.4 Multi-collinearity
      5.4.1 Examining significance of the multiple regression model
      5.4.2 R² and Adjusted R²
      5.4.3 Handling the categorical predictor
      5.4.4 Residual Analysis
6. Further Discussions and Considerations
   6.1 Regression ANOVA
   6.2 Leverage points
List of Tables

Table 1: Observed response, estimated response (based on density) and residuals
Table 2: Observed response, estimated response (based on final MLR model) and residuals
1. Introduction

Linear regression is one of the simplest statistical tools used to analyse the dependence of a response on one or more predictors. In this monograph, we discuss how it works, what information it provides, and what one needs to be careful about while performing linear regression.
2. What is Regression?

Important Note: Regression is a widely applicable tool that can also accommodate many types of non-linear relationships. Linear regression is only a small subset of all possible regression functions used across various domains.
Case Study:

A top wine manufacturer wants to invest in new technologies to improve its wine quality. Wine quality depends directly on the amount of alcohol in the wine and on its smoothness, both of which are controlled by various chemicals that are either added during the manufacturing process or generated through chemical reactions. Wine certification and quality assessment are key elements in wine gradation and pricing, and certification is determined by various physicochemical properties of the wine. Therefore, the company wants to estimate the percentage (%) of alcohol in a bottle of wine as a function of the wine's various chemical components.
Statement of the Problem: Regress alcohol percentage on the chemical components present in
the wines.
It is possible to regress alcohol on each of the 10 continuous predictors one at a time and explore their individual relationships. However, in order to avoid repetition, only one predictor (density) has been considered.
Before a regression model of alcohol on the predictors can be built, it is necessary to investigate
whether there exists any dependence among the observed variables.
3. What is Pairwise Correlation?

The (Pearson) correlation coefficient between two attributes 𝑋 and 𝑌 is defined as

𝑟 = ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)(𝑦ᵢ − 𝑦̅) / √[∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)² · ∑ᵢ₌₁ⁿ (𝑦ᵢ − 𝑦̅)²]

Here, 𝑋 and 𝑌 are the pair of attributes measured on 𝑛 units. From the i-th unit a pair of observations (𝑥ᵢ, 𝑦ᵢ) is obtained, and 𝑥̅ and 𝑦̅ are the respective means. The value of 𝑟 ranges from −1 to +1, both values included.

𝒓 = −𝟏: denotes a perfect negative correlation between 𝑋 and 𝑌. If the (x, y) pairs are plotted, the points fall on a straight line with negative slope: X and Y are perfectly linearly related, but as one variable increases, the other decreases.

𝒓 = 𝟎: denotes that no linear correlation exists between 𝑋 and 𝑌.

𝒓 = +𝟏: denotes a perfect positive correlation between 𝑋 and 𝑌. If the (x, y) pairs are plotted, the points fall on a straight line with positive slope: X and Y are perfectly linearly related, and as one variable increases, the other also increases.
A plot of the (x, y) pairs is known as a scatterplot. In Fig 1, the first scatterplot, corresponding to r = 0, shows no discernible pattern. The other two scatterplots exhibit moderate correlations of equal magnitude (0.6) but in opposite directions.
The primary objective here is to determine whether any dependence exists between alcohol (%) and three selected chemical components, FA, pH and density, individually.
Exploratory Analysis (EDA) on Wine Data
We first load the “WineData” dataset into Python.
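The load step itself is not shown in the monograph; the following is a minimal sketch, assuming the data is available as a CSV file (the file name WineData.csv is an assumption):

import pandas as pd

# Read the wine dataset into a DataFrame (file name assumed)
WineData = pd.read_csv("WineData.csv")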
WineData.shape
(1599, 13)
WineData.columns
Index(['ID', 'Brand', 'FA', 'VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD',
'density', 'sulphate', 'pH', 'alcohol'],dtype='object')
The first five rows of the data:

ID  Brand           FA    VA    CA    RS   chloride  FSD   TSD   density  pH    sulphate  alcohol
1   Seagram          7.4  0.70  0.00  1.9  0.076     11.0  34.0  0.9978   3.51  0.56      9.4
2   Seagram          7.8  0.88  0.00  2.6  0.098     25.0  67.0  0.9968   3.20  0.68      9.8
3   Seagram          7.8  0.76  0.04  2.3  0.092     15.0  54.0  0.9970   3.26  0.65      9.8
4   Sula Vineyards  11.2  0.28  0.56  1.9  0.075     17.0  60.0  0.9980   3.16  0.58      9.8
5   Seagram          7.4  0.70  0.00  1.9  0.076     11.0  34.0  0.9978   3.51  0.56      9.4
The dataset contains 1599 observations on 13 variables. The first column is “ID”, which is just
a label and will not be used in the analysis. The second column is “Brand”, which is the only
categorical variable present in the data.
A summary of each of the quantitative variables (count, mean, standard deviation and the five-number summary) is presented below. For Brand, the count in each category is shown.
# Summary of all variables
WineData[WineData.columns[2:13]].describe().transpose()
WineData['Brand'].value_counts()

Seagram           633
Sula Vineyards    553
Grover Zampa      413
Name: Brand, dtype: int64
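These pairwise correlations can also be computed directly with pandas. A minimal sketch (the column list mirrors the continuous variables in the data):

# Correlation of alcohol (%) with each continuous predictor
continuous_cols = ['FA', 'VA', 'CA', 'RS', 'chloride', 'FSD', 'TSD',
                   'density', 'sulphate', 'pH', 'alcohol']
corr_with_alcohol = WineData[continuous_cols].corr()['alcohol'].drop('alcohol')
print(corr_with_alcohol.sort_values())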
The correlation coefficient, however, cannot determine what value the response will take for a given value of the predictor(s). For that, a regression model must be built.
4. Simple Linear Regression (SLR)

4.1 Definition

The formal definition of simple linear regression: let n pairs of observations (𝑥ᵢ, 𝑦ᵢ), i = 1, 2, …, n, be available on two features, one of which is assumed to depend on the other. Typically Y denotes the dependent variable and X the independent variable. Let 𝐸(𝑌) denote the expected, or mean, value of 𝑌.

Simple linear regression relates the response Y to the single predictor variable X through a straight line. The mathematical formulation of the simple linear regression line is

𝐸(𝑌) = 𝛽₀ + 𝛽₁𝑋

where
𝑌 is the continuous response (dependent) variable,
𝛽₀ and 𝛽₁ are the intercept and slope coefficients, respectively, known as the regression parameters, and
𝑋 is the continuous independent (predictor) variable.
It is assumed that the expected value of the response is a linear function of the predictor. When 𝛽₀ and 𝛽₁ are known, a given value of the predictor specifies the expected value of the response.

However, since Y is a random variable, not all values 𝑦ᵢ will equal 𝐸(𝑌). An equivalent form of the SLR equation is

𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋ᵢ + 𝜀ᵢ,   i = 1, 2, …, n

where 𝜀ᵢ represents the unobservable error term. Note that “error” here does not indicate a mistake; it is simply the difference between the expected and observed values of the response. The error terms provide crucial insight into the regression process.
4.2 The method of Ordinary Least Squares (OLS)

The simple linear regression model involves the unknown parameters 𝛽₀ and 𝛽₁, which need to be estimated from data. There are several methods of estimating the parameters; the simplest and most widely used is the method of Ordinary Least Squares (OLS).

Given 𝑛 pairs of observations on 𝑌 and 𝑋, the objective is to minimise the sum of squared errors and thus obtain appropriate estimates of 𝛽₀ and 𝛽₁. We want to minimise

∑ᵢ₌₁ⁿ 𝜀ᵢ² = ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝛽₀ − 𝛽₁𝑋ᵢ)²

Minimising this sum explicitly yields the following estimates of 𝛽₀ and 𝛽₁:
𝛽̂₁ = ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)(𝑦ᵢ − 𝑦̅) / ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̅)²,   𝛽̂₀ = 𝑌̅ − 𝛽̂₁𝑋̅

𝛽̂₁ has an equivalent representation 𝛽̂₁ = 𝑟𝑠𝑦/𝑠𝑥, where 𝑟 is the correlation coefficient between 𝑋 and 𝑌, and 𝑠𝑥 and 𝑠𝑦 are the respective standard deviations of 𝑋 and 𝑌. Since 𝑠𝑥 and 𝑠𝑦 are both positive quantities, 𝛽̂₁ has the same sign as 𝑟: for two positively correlated variables 𝛽̂₁ will be positive, and for two negatively correlated variables 𝛽̂₁ will be negative.
The objective here is to determine the dependence of alcohol (%) on density. Though any one of the 10 continuous predictors could have been used, density is chosen because of its relatively high correlation with the response. A scatterplot of alcohol (%) versus density gives a visual impression of whether a linear function of density is at all suitable to describe alcohol (%).
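The fitting code is not reproduced above; the following is a minimal sketch consistent with the summary output reported later (the model name mod matches the later summary() call):

from statsmodels.formula.api import ols

# Fit the simple linear regression of alcohol (%) on density by OLS
mod = ols('alcohol ~ density', data=WineData).fit()
print(mod.params)

# Sanity check: the OLS slope equals r * s_y / s_x
r = WineData['alcohol'].corr(WineData['density'])
print(r * WineData['alcohol'].std() / WineData['density'].std())   # approx -280.16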
We first note that the sign of 𝛽̂₁ is negative. This shows that the two variables are inversely related: as one increases, the other decreases. This confirms our expectation that alcohol (%) and density move in opposite directions, and we get a straight line with negative slope.

The sign of the regression slope and that of the correlation coefficient will always be the same. The regression slope measures the change in the response for a one-unit change in the predictor; its sign indicates the direction of that change.
The value of 𝛽̂₁ indicates that if density increases by 1 unit, the estimated alcohol (%) decreases by 280.16 units. However, the response being a percentage, its values are bounded between 0 and 100. So is our interpretation misleading? From the EDA it can be seen that the values of density lie mostly between 0.99 and 1.00, so the range of density is approximately 0.01. While interpreting a regression parameter it is important to pay heed to the range of values of the predictor. In this case density cannot be expected to change by 1 unit; realistic changes in density are of the order of 1/1000. Hence the following statement is appropriate in this case:

If density increases by 0.001 unit, then the estimated alcohol (%) decreases by 0.28 unit.
The intercept term is the estimated value of the response when the predictor is 0. However, the intercept term is not always interpretable, as in this case.
The following graph shows the OLS regression line (in blue) through the scatterplot.
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Title string built from the fitted coefficients
equation = 'alcohol = 289.68 - 280.16 * density'

a4_dims = (10, 5)
fig, ax = plt.subplots(figsize=a4_dims)
a = sns.regplot(x="density", y="alcohol", data=WineData, ax=ax)
a.set_title('Model: ' + equation, fontsize=15)
# Displaying the plot
plt.show()
Given a particular value of density, and using the estimated values of 𝛽₀ and 𝛽₁, estimated or fitted values of alcohol (%) can be obtained. The fitted values may or may not equal the observed values of the response.
Let us look at alcohol (%) in the wines at several levels of density.
Table 1: Observed response, estimated response (based on density) and residuals

Observation   Density    Observed response (Y)   Estimated response (𝑌̂ = 289.68 − 280.16 × density)   Residual (𝜀̂ = 𝑌 − 𝑌̂)
355           0.9912     11.9                    11.98                                                 −0.08
481           1.0026      9.2                     8.79                                                  0.41
609           1.0026     10.4                     8.79                                                  1.61
954           0.99458    12.1                    11.04                                                  1.06
1198          0.99458     9.8                    11.04                                                 −1.24
1201          0.99458     9.8                    11.04                                                 −1.24
1460          0.99458    11.9                    11.04                                                  0.86
The difference between the observed and fitted values of the response is called the residual. Residuals may be either positive or negative.
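A table such as Table 1 can be assembled directly from the fitted model. A minimal sketch (which rows to display is an illustrative choice):

# Observed values, fitted values and residuals for every observation
fits = pd.DataFrame({'density': WineData['density'],
                     'observed': WineData['alcohol'],
                     'fitted': mod.fittedvalues.round(2),
                     'residual': mod.resid.round(2)})
print(fits.head())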
4.2.1 Assumptions of the OLS method

The linear regression model rests on 4 important assumptions, often remembered by the acronym LINE:

Assumption 1 (Linearity): The regression model is linear in the parameters, i.e. 𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋ᵢ + 𝜀ᵢ.
Assumption 2 (Independence): The observations 𝑌ᵢ (equivalently, the error terms 𝜀ᵢ) are independent.
Assumption 3 (Normality): The error terms 𝜀ᵢ are normally distributed.
Assumption 4 (Equal variance): The errors have no bias (that is, 𝐸(𝜀ᵢ) = 0) and they are homoscedastic, that is, they have equal variance.
4.3 Examining the statistical significance of the regression model

4.3.1 Significance of the regression slope

Estimating the regression coefficients is only the first step of regression model fitting: OLS will produce estimates of the regression intercept and slope for any sample data. However, it is of vital importance to know whether the population regression slope is significantly different from 0. If the population regression slope is not significantly different from 0, then there is no regression of Y on X.
We now need to know: is density at all statistically significant in explaining alcohol (%)? Let us consider the test of hypothesis 𝐻₀: 𝛽₁ = 0 vs. 𝐻₁: 𝛽₁ ≠ 0, where 𝛽₁ is the regression coefficient of density when alcohol (%) is regressed on density.
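The test statistic is t = 𝛽̂₁/se(𝛽̂₁), compared against a t distribution with n − 2 degrees of freedom. A sketch of reading the statistic and p-value off the fitted model:

# t statistic and p-value for the density slope, as reported in the summary below
print(mod.tvalues['density'])   # approx -280.1638 / 12.267 = -22.84
print(mod.pvalues['density'])   # effectively 0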
# Significance of regression
print(mod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: alcohol R-squared: 0.246
Model: OLS Adj. R-squared: 0.246
Method: Least Squares F-statistic: 521.6
Date: Wed, 22 Jan 2020 Prob (F-statistic): 3.94e-100
Time: 16:49:11 Log-Likelihood: -2144.1
No. Observations: 1599 AIC: 4292.
Df Residuals: 1597 BIC: 4303.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 289.6753 12.227 23.691 0.000 265.692 313.659
density -280.1638 12.267 -22.838 0.000 -304.226 -256.102
==============================================================================
Omnibus: 147.785 Durbin-Watson: 1.460
Prob(Omnibus): 0.000 Jarque-Bera (JB): 193.960
Skew: 0.768 Prob(JB): 7.62e-43
Kurtosis: 3.743 Cond. No. 1.06e+03
Observe that the p-value corresponding to density is very small; thus the null hypothesis 𝐻₀: 𝛽₁ = 0 is rejected, which in turn indicates that density is significant in explaining alcohol (%).
Statistical significance alone, however, is not enough to decide whether the predictor is useful in explaining the variability in the response. Is density enough to explain a large part of the variation in alcohol? This leads us to the concept of the coefficient of determination, 𝑅².

4.3.2 The coefficient of determination R²

The coefficient of determination R² is a summary measure of how well the sample regression line fits the data. The rationale and computation of R² are discussed below.
For each observation, the residual is

𝜀̂ᵢ = 𝑌ᵢ − 𝑌̂ᵢ

Residuals are a very important part of regression and have many useful properties. In fact, it can be shown that ∑ᵢ₌₁ⁿ 𝜀̂ᵢ = 0 when the estimation method is OLS.

∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̅)² is the total variation of the actual 𝑌 values about their sample mean, termed the total sum of squares (SST). This is closely linked to the sample variance of Y.

∑ᵢ₌₁ⁿ (𝑌̂ᵢ − 𝑌̅)² is the sum of squares due to regression, called the regression sum of squares (SSR).

∑ᵢ₌₁ⁿ 𝜀̂ᵢ² = ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̂ᵢ)² is the sum of squared differences between the observed and predicted values of the response, known as the residual sum of squares or the error sum of squares (SSE).

These three quantities satisfy the decomposition SST = SSR + SSE. Dividing both sides by SST gives

1 = SSR/SST + SSE/SST

We now define R² as R² = SSR/SST or, equivalently, R² = 1 − SSE/SST.
𝑅² measures the proportion of the total variation in 𝑌 that is explained by the regression model; its value satisfies 0 ≤ 𝑅² ≤ 1. The higher the value of 𝑅², the more powerful the predictor is in predicting the response. A regression model with a high 𝑅² value fits the data well: a high proportion of the variance in the response is explained by the dependence of the response on the predictor.

For the current model, the value of 𝑅² turns out to be approximately 25%. This means density is able to explain around 25% of the total variation in alcohol (%).
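As a cross-check, R² can be recomputed from its definition and compared with the statsmodels attribute; a minimal sketch:

import numpy as np

# Recompute R^2 = 1 - SSE/SST for the SLR model
y = WineData['alcohol']
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
sse = np.sum(mod.resid ** 2)         # residual (error) sum of squares
print(1 - sse / sst, mod.rsquared)   # both approx 0.246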
It is clear that, even though the regression of alcohol (%) on density is significant (p-value < 0.05), density by itself is not sufficient to explain the variation in the response to a satisfactory degree. Later we will investigate whether the other predictors in the data set help explain more of the variation in the response (Multiple Linear Regression).
4.4 Residual Plot

Below we show the residual plot for the regression of alcohol (%) on density.
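The helper regression_plots called below is not defined anywhere in the monograph; the following is a minimal sketch of what such a diagnostic helper might look like (residuals vs fitted values plus a normal Q-Q plot; the plots in the original figure may differ):

import matplotlib.pyplot as plt
import scipy.stats as stats

def regression_plots(model, data):
    # data is kept in the signature to match the call below, though unused here
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    # Residuals vs fitted values: look for curvature or unequal spread
    axes[0].scatter(model.fittedvalues, model.resid, alpha=0.4)
    axes[0].axhline(0, color='red', linestyle='--')
    axes[0].set_xlabel('Fitted values')
    axes[0].set_ylabel('Residuals')
    # Normal Q-Q plot: points near the line support the normality assumption
    stats.probplot(model.resid, dist="norm", plot=axes[1])
    plt.tight_layout()
    plt.show()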
regression_plots(mod,WineData)
From these diagnostic plots, we may infer that the regression assumptions are not violated.

It is clear from the discussion above that density, despite having a significant slope when alcohol (%) is regressed on it, explains only about 25% of the total variability in the response. It is natural, then, to examine whether the inclusion of the other predictors contributes towards explaining the variability in the response, and if so, to what degree.
5. Multiple Linear Regression Analysis

5.1 Definition of Multiple Linear Regression Analysis

Multiple linear regression (MLR) relates the response Y to k predictors 𝑋₁, 𝑋₂, …, 𝑋ₖ through the equation

𝐸(𝑌) = 𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝛽ₖ𝑋ₖ

It is assumed that the expected value of the response is a linear function of all the k predictors. When the regression coefficients 𝛽₀ and 𝛽ⱼ, 𝑗 = 1, …, 𝑘, are all known, or estimated, a given combination of predictor values specifies the expected value of the response.

However, since Y is a random variable, not all values 𝑦ᵢ will equal 𝐸(𝑌). An equivalent form of the MLR equation is

𝑌ᵢ = 𝛽₀ + 𝛽₁𝑋₁ᵢ + 𝛽₂𝑋₂ᵢ + ⋯ + 𝛽ₖ𝑋ₖᵢ + 𝜀ᵢ,   i = 1, 2, …, n

where 𝜀ᵢ represents the unobservable error term.

Formally the parameters 𝛽₁, 𝛽₂, …, 𝛽ₖ are called partial regression coefficients, since the coefficient 𝛽ⱼ measures the change in the value of the response due to a unit change in the value of 𝑋ⱼ, keeping all other variables fixed. We will refer to 𝛽₀, 𝛽₁, 𝛽₂, …, 𝛽ₖ simply as regression coefficients.
5.2 The method of ordinary least squares

As in simple linear regression, the regression coefficients are estimated by the method of ordinary least squares, now minimising the sum of squared errors over all k + 1 parameters:

∑ᵢ₌₁ⁿ 𝜀ᵢ² = ∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝛽₀ − 𝛽₁𝑋₁ᵢ − ⋯ − 𝛽ₖ𝑋ₖᵢ)²
5.3 When the predictor is categorical

Before the problem of fitting a multiple linear regression model is taken up, the case of a categorical predictor needs to be explicitly discussed. So far we have tacitly assumed that the response as well as the predictors are all continuous. The case of a categorical response is not covered by MLR. Let us discuss the situation when one or more predictors are categorical. A set of dummy variables, or indicator variables, is introduced corresponding to each categorical variable: a categorical variable with 𝑚 categories is represented by a set of 𝑚 − 1 dummy variables.

Indicator coding: the most common format of dummy variable coding, in which each category of the nominal variable is represented by either 1 or 0.

The regression equation and the concept of indicator coding using dummy variables are explained based on our case study.
If 𝑋₁, …, 𝑋ₖ are all continuous, then we can write the regression model of 𝑌 on them as

𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + ⋯ + 𝛽ₖ𝑋ₖ + 𝜀

But when a predictor is a categorical variable with 𝑚 levels, a small modification of the above is required.

Brand is a categorical predictor with three levels: Grover Zampa, Seagram and Sula Vineyards. Hence 𝑚 − 1 = 3 − 1 = 2 dummy variables need to be introduced in the model. The modified equation is

𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + ⋯ + 𝛽ₖ𝑋ₖ + 𝛽ₖ₊₁𝐷₁ + 𝛽ₖ₊₂𝐷₂ + 𝜀

where 𝐷₁ and 𝐷₂ are dummy variables taking the value 1 or 0 according to the brand of each observation.
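The code that produced the coefficient estimates listed below is not shown; the following is a sketch of a fit consistent with them (mod1 is an assumed name; the formula interface expands Brand into the two dummy variables automatically):

# MLR of alcohol on all continuous predictors plus the categorical Brand
mod1 = ols('alcohol ~ FA + VA + CA + RS + chloride + FSD + TSD + density'
           ' + sulphate + pH + Brand', data=WineData).fit()
print(mod1.params)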
FA 0.509555
VA 0.423617
CA 0.823537
RS 0.272455
chloride -1.303246
FSD -0.002695
TSD -0.001520
density -595.004881
sulphate 1.084578
pH 3.617430
dtype: float64
The regression coefficients corresponding to Brand act as constants: for the three levels of Brand, three different regression equations are obtained.
BrandSeagram takes the value 1 if Brand = Seagram, 0 otherwise
BrandSula Vineyards takes the value 1 if Brand = Sula Vineyards, 0 otherwise
When Brand = Grover Zampa, then both these variables take the value 0.
The explicit forms of the regression equations are:
(1) Brand = Grover Zampa:
𝑌̂ = 585.7 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH

(2) Brand = Seagram:
𝑌̂ = 585.7 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH − 0.23
or
𝑌̂ = 585.47 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH

(3) Brand = Sula Vineyards:
𝑌̂ = 585.7 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH − 0.003
or
𝑌̂ = 585.697 + 0.51 FA + 0.424 VA + 0.824 CA + 0.272 RS − 1.303 chloride − 0.003 FSD − 0.002 TSD − 595.005 density + 1.085 sulphate + 3.617 pH
The only difference among the three regression equations is in the intercept terms; the slopes corresponding to the continuous predictors remain the same. The sign of each slope indicates the direction in which the response changes when that predictor increases by one unit. A positive coefficient means that a unit increase in the corresponding predictor increases the response by the numerical value of the coefficient, provided all other predictors are held constant. A negative coefficient means that a unit increase in the corresponding predictor decreases the response by the numerical value of the coefficient, again with all other predictors held constant.
5.4 Multi-collinearity
In multiple regression, if one or more pairs of explanatory variables are highly correlated, the phenomenon is known as multi-collinearity.
Effects of multi-collinearity: Multi-collinearity is not desirable. It inflates the standard errors of the estimates of the regression coefficients, which in turn affects the significance of the regression parameters. The signs of the regression coefficients may even change. As a result the regression model becomes unreliable or loses interpretability.

The first thing one should do in multiple linear regression is to check whether multi-collinearity is present in the data.
Detection of multi-collinearity: There are several ways of detecting multi-collinearity, such as:

Correlation matrix: We can start by computing the pairwise correlations among all the independent variables; the independent variables should not be highly (positively or negatively) correlated. But this by itself is not enough, as the correlation matrix only detects high pairwise correlations. Even when no pairwise correlation is high, several moderately correlated pairs may together give rise to multi-collinearity.
Variance inflation factor: Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. The VIF for the k-th predictor is

VIFₖ = 1 / (1 − Rₖ²)

where Rₖ² is the coefficient of determination obtained by regressing the k-th predictor on the remaining predictors. It measures how much the variance of the estimated regression coefficient 𝛽̂ₖ is “inflated” by the existence of correlation among the predictor variables in the model.

General rule of thumb: If VIF is 1, there is no correlation between the k-th predictor and the remaining predictor variables, and the variance of 𝛽̂ₖ is not inflated at all. If VIF exceeds (or is close to) 5, there is moderate multi-collinearity, and if it reaches 10 or more, there are signs of high multi-collinearity.
# VIF
import pandas as pd
from statsmodels.formula.api import ols

# Expand the categorical Brand into 0/1 dummy columns so every predictor is numeric
df_new = pd.get_dummies(WineData, drop_first=True, dtype=int)

def vif_cal(input_data, dependent_col):
    # For each predictor: regress it on all the others and report 1 / (1 - R^2)
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        # Q() quotes column names containing spaces, e.g. 'Brand_Sula Vineyards'
        rhs = " + ".join(f"Q('{c}')" for c in xvar_names.drop(xvar_names[i]))
        rsq = ols(formula=f"Q('{xvar_names[i]}') ~ {rhs}", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], "VIF =", vif)

vif_cal(input_data=df_new.iloc[:, 1:14], dependent_col="alcohol")
FA VIF = 5.65
VA VIF = 1.79
CA VIF = 3.07
RS VIF = 1.31
chloride VIF = 1.48
FSD VIF = 1.98
TSD VIF = 2.24
density VIF = 2.91
pH VIF = 2.49
sulphate VIF = 1.4
Brand_Seagram VIF = 1.91
Brand_Sula Vineyards VIF = 1.59
Note that Brand is a nominal variable, so the notion of correlation is difficult to define in its case. However, a VIF is still output for the Brand dummies, and it should be ignored.

We observe that among the continuous predictors only FA has a sufficiently high VIF (5.65), indicating that it is substantially correlated with the other predictor variables. FA is therefore removed from the model.
# MLR with FA removed
mod2 = ols('alcohol ~ VA + CA + RS + chloride + FSD + TSD + density + sulphate + pH + Brand',
           data=WineData).fit()
coeff_MLR_wine2 = mod2.params
print(coeff_MLR_wine2)
Intercept 364.757981
Brand[T.Seagram] -0.416173
Brand[T.Sula Vineyards] -0.066258
VA 0.759100
CA 2.454701
RS 0.199998
Note that the coefficients of the remaining predictors have changed. We now check the VIFs of the reduced predictor set.
VA VIF = 1.77
CA VIF = 2.34
RS VIF = 1.23
chloride VIF = 1.32
FSD VIF = 1.96
TSD VIF = 2.04
density VIF = 1.51
pH VIF = 1.54
sulphate VIF = 1.39
Brand_Seagram VIF = 1.86
Brand_Sula Vineyards VIF = 1.58
We can see that after removing FA, all the predictors have low VIF (at most 2.34). So the
problem of multi-collinearity has been eliminated. In all our subsequent discussions, we will
consider the multiple linear regression model with FA removed.
5.4.1 Examining significance of the multiple regression model

It is not enough merely to fit a multiple regression model to the data; it is necessary to check whether the regression coefficients are significant, that is, whether the population regression parameters are significantly different from zero. For the j-th slope parameter, the null hypothesis of interest is H₀: 𝛽ⱼ = 0 versus H₁: 𝛽ⱼ ≠ 0. This test is done separately for each regression coefficient, including the parameters corresponding to the dummy variables where necessary. The test may not always make sense for the intercept parameter.
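The coefficient table this discussion refers to would have been produced by a call such as the following; only its warnings block is reproduced below:

# Coefficient table with t statistics and p-values for the reduced model
print(mod2.summary())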
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.48e+04. This might indicate that there are strong multicollinearity or other numerical problems.
From the above it may be noted that the regression coefficients corresponding to FSD and Brand = Sula Vineyards are not statistically significant at level α = 0.05. In other words, these two coefficients are not significantly different from 0 in the population.

Hence FSD may be eliminated from the multiple regression model. However, it is not recommended to eliminate the Brand level Sula Vineyards directly, because it is part of a system of dummy variables. The required adjustment is explained in Section 5.4.3.
5.4.2 R² and Adjusted R²

A high numerical value of 𝑅² gives an intuitive justification that the model works well, since the observed and predicted responses are close.

One limitation of the coefficient of determination is that its value increases as the number of independent variables in the model increases. A good regression model should include only those predictors whose coefficients are significantly different from 0. However, the numerical value of R² is non-decreasing even when non-significant predictors are added to the model, and adding non-significant predictors adversely affects the predictive quality of the model. Therefore another measure of model adequacy is needed.
Adjusted R² involves an adjustment based on the number of predictors relative to the sample size. It is defined as

Adj R² = 1 − [SSE/(n − p)] / [SST/(n − 1)] = 1 − [∑ᵢ₌₁ⁿ 𝜀̂ᵢ²/(n − p)] / [∑ᵢ₌₁ⁿ (𝑌ᵢ − 𝑌̅)²/(n − 1)]

Here p is the number of parameters in the regression model, including the intercept term.
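Equivalently, Adj R² = 1 − (1 − R²)(n − 1)/(n − p). A minimal sketch checking this identity against statsmodels:

# Adjusted R^2 from R^2, the sample size n and the parameter count p
n = int(mod2.nobs)
p = len(mod2.params)   # includes the intercept
print(1 - (1 - mod2.rsquared) * (n - 1) / (n - p), mod2.rsquared_adj)   # should agree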
Case Study continues

R² and adjusted R² are available from the output of summary(). The values of these two statistics are nearly equal (about 55%), although adjusted R² is slightly smaller than R².
5.4.3 Handling the categorical predictor

In the regression equation including all predictors, it was noted that Brand = Sula Vineyards is non-significant. Before any action is taken, let us understand what significance means for a categorical predictor in this situation.

When dealing with a categorical predictor with m nominal levels, the set of m − 1 indicator variables implicitly fixes one level as the baseline: the level for which all m − 1 indicator variables take the value 0. In this case study the baseline is Brand = Grover Zampa. Significance of a regression coefficient for a categorical variable means that the corresponding level is significantly different from the baseline.

Since the regression coefficient for Brand = Sula Vineyards is non-significant, the effect of Brand = Sula Vineyards is statistically indistinguishable from the baseline level Grover Zampa. This is also evident from the box plot (Fig 9).
#Brand
pd.DataFrame(WineData['Brand'].value_counts()).transpose()
round(WineData.groupby("Brand")["alcohol"].mean(),2)
Grover Zampa 10.88
Seagram 9.91
Sula Vineyards 10.67
Name: alcohol, dtype: float64
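The merging step that creates new_brand is not shown; the following is a sketch consistent with the summary output below (the 0/1 coding and the simultaneous removal of FSD are inferred from that output):

# Merge Grover Zampa and Sula Vineyards into a single baseline level (coded 0);
# Seagram, the only level that differs significantly from the baseline, is coded 1
WineData['new_brand'] = (WineData['Brand'] == 'Seagram').astype(int)

# Refit without FSD, which was also found to be non-significant
MLR_new = ols('alcohol ~ VA + CA + RS + chloride + TSD + density'
              ' + sulphate + pH + new_brand', data=WineData).fit()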
print(MLR_new.summary())
OLS Regression Results
===========================================================================================
Dep. Variable: alcohol R-squared: 0.556
Model: OLS Adj. R-squared: 0.553
Method: Least Squares F-statistic: 220.7
Date: Tue, 28 Jan 2020 Prob (F-statistic): 2.31e-272
Time: 20:52:58 Log-Likelihood: -1721.7
No. Observations: 1599 AIC: 3463.
Df Residuals: 1589 BIC: 3517.
Df Model: 9
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept 365.5364 11.582 31.560 0.000 342.818 388.255
VA 0.7443 0.131 5.698 0.000 0.488 1.000
CA 2.4537 0.138 17.770 0.000 2.183 2.725
RS 0.2025 0.014 14.534 0.000 0.175 0.230
chloride -4.4167 0.435 -10.158 0.000 -5.270 -3.564
TSD -0.0060 0.001 -10.320 0.000 -0.007 -0.005
density -361.9534 11.586 -31.240 0.000 -384.679 -339.228
sulphate 1.0068 0.124 8.152 0.000 0.765 1.249
pH 1.2808 0.142 8.994 0.000 1.001 1.560
new_brand -0.3784 0.040 -9.360 0.000 -0.458 -0.299
===========================================================================================
Omnibus: 118.826 Durbin-Watson: 1.567
Prob(Omnibus): 0.000 Jarque-Bera (JB): 208.529
Skew: 0.533 Prob(JB): 5.23e-46
Kurtosis: 4.412 Cond. No. 5.24e+04
===========================================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.24e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Notice that now all the predictors are statistically significant, with very low p-values. The value of R² has not decreased much, and adjusted R² is almost the same as R² (around 55%). We therefore select this as our final regression model.
# Coefficients
coeff_MLR = MLR_new.params
print(coeff_MLR)
Intercept 365.536428
VA 0.744273
CA 2.453709
RS 0.202516
chloride -4.416669
TSD -0.005956
density -361.953393
sulphate 1.006848
pH 1.280760
new_brand -0.378445
dtype: float64
Below we present some particular values of the predictors, the observed response, predicted response and the residuals.

Table 2: Observed response, estimated response (based on final MLR model) and residuals
5.4.4 Residual Analysis

Residual analysis for the final model proceeds exactly as for the simple linear regression model (Section 4.4). As a final check, the VIFs of the predictors in the final data set are also recomputed.

# VIF
vif_cal(input_data=WineData.iloc[:, 3:14], dependent_col="alcohol")
VA VIF = 1.76
CA VIF = 2.32
RS VIF = 1.23
chloride VIF = 1.32
FSD VIF = 1.94
TSD VIF = 2.04
density VIF = 1.51
pH VIF = 1.54
sulphate VIF = 1.38
new_brand VIF = 1.24
6. Further Discussions and Considerations

6.1 Regression ANOVA

Regression can also be viewed as a technique for reducing the variation in data: the total variation in the response is explained by the regression on one or more predictors. The ANOVA table describes this partition of the variance.

Corresponding to each component of the total variation there is an associated degrees of freedom (df). The total degrees of freedom in a sample of size n is always n − 1. The df for the partitioned sums of squares depend on the number of parameters to be estimated: if there are p parameters in the regression equation, the df corresponding to SSR is p − 1, and the df corresponding to SSE is (n − 1) − (p − 1) = n − p.
We present the ANOVA table corresponding to the final model in our case study.
# ANOVA table
aov_tbl = sm.stats.anova_lm(MLR_new)
print(aov_tbl)
We can see that each continuous predictor has 1 degree of freedom, and since our modified categorical predictor new_brand has 2 levels, it has 2 − 1 = 1 degree of freedom. The residual sum of squares has 1589 degrees of freedom, and the total sum of squares has 1598 degrees of freedom.
6.2 Leverage points

Leverage points are observations whose predictor values lie far from the bulk of the data; such points can exert a disproportionate influence on the fitted regression line and should be examined before the model is finalised.
The overall model-building workflow can be summarised in the following steps:

1. Perform multiple linear regression (e.g., with the ols() function) and compute the VIFs.
2. If any VIF > 5, remove the predictor with the largest VIF and rerun the MLR with the remaining variables; repeat until no VIF exceeds 5.
3. If any predictor is non-significant, remove the non-significant continuous predictors, merge non-significant levels of the categorical predictor with the baseline, and redefine the levels of the categorical predictor; then refit.
4. When no non-significant predictor remains, the model is final.