Linear Regression
Short Definition
A linear regression is a statistical model that analyzes the relationship between a response variable (often
called Y) and one or more explanatory variables (often called X). We can use this model to
predict Y when only X is known. The mathematical equation can be generalized as follows:
Y = a + bX
where a is the intercept and b is the slope. Collectively, they are called the regression coefficients.
Linear regression assumes that there exists a linear relationship between the response variable Y and the
explanatory variables X. This means that you can fit a straight line through the data relating the two (or more) variables.
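As a minimal sketch of this idea in R (the vectors x and y below are made-up illustration data, not part of the Boston example used in the rest of this document):
# made-up illustration data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# fit Y = a + bX; coef() returns the estimated intercept (a) and slope (b)
coef(lm(y ~ x))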
A package can include one or more datasets. We will use the Boston dataset, which is part of the MASS
package. Let's start by examining the dataset.
# load the MASS package to get access to the Boston dataset
library(MASS)
# examine the structure of the Boston dataset
str(Boston)
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We will seek to predict the medv variable (median house value) using (some of) the other 13 variables.
To find out more about the data set, type ?Boston.
# bring out the docs for the dataset
?Boston
Let’s start by examining which of the 13 predictors might be relevant for predicting the response variable
(medv). One way to do that is to examine the correlation between the predictors and the response variable.
Since we have many variables, examining a raw correlation matrix will not be that easy. So, it is better to plot
the correlations. To that end, we'll use the corrplot package (explore the plotting options offered by this
package in its documentation):
# load the corrplot package
library(corrplot)
# compute the correlation matrix
corr.matrix <- cor(Boston)
# one option for plotting correlations: using colored numbers to represent the extent of correlation
corrplot(corr.matrix, method = "number", type = "upper", diag = FALSE, number.cex=0.75, tl.cex = 0.85)
[Correlation plot of the Boston variables, with numbers showing the pairwise correlation coefficients.]
Predictors lstat (percent of households with low socioeconomic status) and rm (average number of rooms per
house) have the highest correlation (in absolute terms) with the outcome variable.
To examine this further, we can plot variables lstat and rm against the response variable.
# load ggplot2 (for plotting) and plot *lstat* against the response variable
library(ggplot2)
p1 <- ggplot(data = Boston, mapping = aes(x = lstat, y = medv)) +
geom_point(shape = 1)
p1
[Scatterplot of medv against lstat.]
# plot *rm* against the response variable
p2 <- ggplot(data = Boston, mapping = aes(x = rm, y = medv)) +
geom_point(shape = 1)
p2
[Scatterplot of medv against rm.]
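The chunk that fits the model is not visible in the extracted text; a minimal sketch of the call that produces the summary below (the model object is named lm1, as it is referred to later):
# fit a simple linear regression of medv on lstat
lm1 <- lm(medv ~ lstat, data = Boston)
# print the model summary
summary(lm1)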
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
As we see, the summary() function gives us the following:
• p-values and standard errors for the coefficients,
• R-squared (R2 ) statistic,
• F-statistic for the model.
In particular, we can conclude the following:
• based on the coefficient of the lstat variable, each unit increase in lstat, that is, each percentage-point
increase in the households with low socioeconomic status, is associated with a decrease of 0.95005 units in
the median house value.
• based on the R2 value, this model explains 54.4% of the variability in the median house value.
• based on the F-statistic and the associated p-value, there is a significant linear relationship between the
predictor and the response variable.
To find out what other pieces of information are stored in the fitted model (that is, the lm1 object), we can
use the names() function.
# print all attributes stored in the fitted model
names(lm1)
One of these attributes is coefficients; it can be accessed directly from the model object:
# access the coefficients stored in the fitted model
lm1$coefficients
## (Intercept) lstat
## 34.5538409 -0.9500494
Note that there is also the coef() function, which returns the coefficients:
# print the coefficients with the coef() f.
coef(lm1)
## (Intercept) lstat
## 34.5538409 -0.9500494
The residual sum of squares (RSS), also known as the sum of squared residuals, is the sum of the squares
of residuals (deviations predicted from actual empirical values of data). It is a measure of the discrepancy
between the data and the estimation model.
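In symbols, RSS = Σ(yi − ŷi)², where yi is the observed value and ŷi the model's prediction for observation i.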
# compute the RSS
lm1_rss <- sum(lm1$residuals^2)
lm1_rss
## [1] 19472.38
Recall that the obtained coefficient values are just estimates (of the real coefficient values) based on
one particular sample from the target population. If another sample were taken, these estimates might
be somewhat different. So, we usually compute the 95% confidence interval for the coefficients, that is, an
interval of values that we can expect, in 95% of cases (i.e. for 95% of examined samples), to contain the
'true' value of the coefficients.
# compute 95% confidence interval
confint(lm1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
Making predictions
Now that we have a model, we can predict the value of medv based on the given lstat values. To do that, we
will create a tiny test data frame.
# create a test data frame containing only lstat variable and three observations: 5, 10, 15
df.test <- data.frame(lstat=c(5, 10, 15))
# calculate the predictions with the fitted model over the test data
predict(lm1, newdata = df.test)
## 1 2 3
## 29.80359 25.05335 20.30310
We can also include the confidence interval for the predictions:
# calculate the predictions with the fitted model over the test data, including the confidence interval
predict(lm1, newdata = df.test, interval = "confidence")
[Scatterplot of medv against lstat with the fitted regression line.]
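The chunk that produced this plot is not visible in the extracted text; a minimal sketch of one way to obtain it, assuming ggplot2's geom_smooth() is added to the p1 scatterplot created earlier:
# overlay the fitted regression line on the lstat vs. medv scatterplot
p1 + geom_smooth(method = "lm", se = FALSE)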
The plot indicates that there is some non-linearity in the relationship between lstat and medv.
Diagnostic Plots
Next, we will use diagnostic plots to examine the model fit in more detail. The diagnostic plots help in
testing the assumptions of linear regression:
1. Linear relationship between the independent and dependent variables,
2. Normality of residuals,
3. Homoscedasticity, meaning that the residuals have (roughly) constant variance across the range of fitted values,
4. No or little multicollinearity (multicollinearity occurs when the independent variables are too highly
correlated with each other).
Four diagnostic plots are automatically produced by passing the output of the lm() function (e.g. our lm1)
to the plot() function. This produces one plot at a time, and hitting Enter generates the next plot.
However, it is often convenient to view all four plots together. We can achieve this by using the par()
function, which tells R to split the display into separate panels so that multiple plots can be viewed
simultaneously.
# split the plotting area into 4 cells
par(mfrow=c(2,2))
# produce the four diagnostic plots for lm1
plot(lm1)
[Diagnostic plots for lm1: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
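The chunk computing lm1.leverage is not visible in the extracted text; a minimal sketch, assuming the leverage values are obtained with hatvalues():
# compute the leverage statistic for each observation
lm1.leverage <- hatvalues(lm1)
# plot the leverage values against the observation index
plot(lm1.leverage)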
[Plot of the leverage statistic (lm1.leverage) for each observation, against the observation index.]
The plot suggests that there are several observations with high leverage values.
We can check this further by examining the value of the leverage statistic for each observation. The leverage statistic
is always between 1/n and 1 (where n is the number of observations); observations with a leverage statistic considerably
above 2*(p+1)/n (where p is the number of predictors) are often considered high leverage points.
Let’s check this for our data:
# calculate the number of high leverage points
n <- nrow(Boston)
p <- 1
cutoff <- 2*(p+1)/n
length(which(lm1.leverage > cutoff))
## [1] 34
The results confirm that there are several (34) high leverage points.
[Scatterplot matrix of medv, lstat, rm, and ptratio.]
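The chunk that produced this scatterplot matrix is not visible in the extracted text; a minimal sketch using the base pairs() function (the choice of function is an assumption):
# scatterplot matrix of the response and the candidate predictors
pairs(Boston[, c("medv", "lstat", "rm", "ptratio")])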
The relationships are far from perfectly linear, but let's see what the model will look like.
To be able to properly test our model (rather than use fictitious data points as we did in the case of simple linear
regression), we need to split our dataset into:
1. training data that will be used to build the model,
2. test data that will be used to evaluate/test the predictive power of the model.
Typically, 80% of observations are used for training and the rest for testing.
When splitting the dataset, we need to ensure that observations are randomly assigned to the training and
test sets. In addition, we should ensure that the outcome variable has the same distribution in the
training and test sets. This can be easily done using the createDataPartition() function from the caret package.
# install.packages('caret')
library(caret)
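The chunk that performs the split is not visible in the extracted text; a minimal sketch using createDataPartition(), assuming an 80/20 split and the train.boston / test.boston names used below (the seed value is illustrative):
# make the random split reproducible (the seed value is arbitrary)
set.seed(123)
# indices of an 80% sample, stratified on the outcome variable
train.indices <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
# split the data into training and test sets
train.boston <- Boston[train.indices, ]
test.boston <- Boston[-train.indices, ]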
Let's check that the outcome variable (medv) has (roughly) the same distribution in the training and test sets:
# print the summary of both train and test sets
summary(train.boston$medv)
summary(test.boston$medv)
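The chunk that fits the next model is not visible in the extracted text; a minimal sketch of the call that produces the summary below (the object is named lm2, as it is used later):
# fit a model with lstat, rm and ptratio as predictors, using the training data
lm2 <- lm(medv ~ lstat + rm + ptratio, data = train.boston)
# print the model summary
summary(lm2)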
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio, data = train.boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8219 -3.0757 -0.8036 1.7893 29.7479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.11824 4.33535 4.179 3.59e-05 ***
## lstat -0.56496 0.04778 -11.824 < 2e-16 ***
## rm 4.62379 0.45996 10.053 < 2e-16 ***
## ptratio -0.94082 0.13192 -7.132 4.63e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.181 on 403 degrees of freedom
## Multiple R-squared: 0.6935, Adjusted R-squared: 0.6912
## F-statistic: 303.9 on 3 and 403 DF, p-value: < 2.2e-16
From the summary, we can see that:
• R-squared has increased considerably, from 0.544 to 0.694, even though the model was built on a smaller
dataset (407 observations instead of 506),
• all 3 predictors are highly significant.
TASK 1: Interpret the estimated coefficients (see how it was done for the simple linear regression).
TASK 2: Use diagnostic plots to examine how well the model adheres to the assumptions.
Let’s make predictions using this model on the test data set that we have created.
# calculate the predictions with the lm2 model over the test data
lm2.predict <- predict(lm2, newdata = test.boston)
# show the first few predictions
head(lm2.predict)
## 3 5 11 12 14 15
## 32.31678 30.55989 21.75026 24.10511 21.20138 20.75116
To compare the predicted values against the real values of the response variable (medv), we can plot their
distributions against each other.
# combine the test set with the predictions
test.boston.lm2 <- cbind(test.boston, pred = lm2.predict)
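The plotting chunk is not visible in the extracted text; a minimal sketch of one way to draw such a comparison with ggplot2 (the exact aesthetics used in the original are an assumption):
# plot the density of the real and predicted medv values on the test set
ggplot(test.boston.lm2) +
  geom_density(aes(x = medv, colour = "real")) +
  geom_density(aes(x = pred, colour = "predicted")) +
  labs(title = "medv distribution", x = "medv", colour = NULL)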
[Density plot comparing the distributions of the predicted and real medv values on the test set.]
To evaluate the predictive power of the model, we'll compute R-squared on the test data. Recall that
R-squared is computed as 1 − RSS/TSS, where TSS is the total sum of squares.
# calculate RSS
lm2.test.RSS <- sum((lm2.predict - test.boston$medv)^2)
# calculate TSS
lm.test.TSS <- sum((mean(test.boston$medv) - test.boston$medv)^2)
# calculate R-squared on the test set
1 - lm2.test.RSS/lm.test.TSS
## [1] 0.6017718
R2 on the test set is lower than the one obtained on the training set, which is expected.
Let's also compute the Root Mean Squared Error (RMSE) to see how much error we are making with the
predictions. Recall: RMSE = sqrt(RSS/n).
# calculate RMSE
lm2.test.RMSE <- sqrt(lm2.test.RSS/nrow(test.boston))
lm2.test.RMSE
## [1] 5.432056
To get a perspective of how large this error is, let’s check the mean value of the response variable on the test
set.
# compare medv mean to the RMSE
mean(test.boston$medv)
## [1] 21.68384
lm2.test.RMSE/mean(test.boston$medv)
## [1] 0.2505117
So, it's not a small error: it's about 25% of the mean value.
Let’s now build another model using all available predictors:
# build an lm model with the training set using all of the variables
lm3 <- lm(medv ~ ., data = train.boston) # note the use of '.' to mean all variables
# print the model summary
summary(lm3)
##
## Call:
## lm(formula = medv ~ ., data = train.boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1772 -2.6987 -0.5194 1.7225 26.0486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.759e+01 5.609e+00 6.702 7.17e-11 ***
## crim -9.610e-02 4.024e-02 -2.388 0.01741 *
## zn 4.993e-02 1.521e-02 3.283 0.00112 **
## indus -5.789e-03 6.745e-02 -0.086 0.93166
## chas 2.292e+00 1.019e+00 2.250 0.02501 *
## nox -1.723e+01 4.244e+00 -4.059 5.95e-05 ***
## rm 3.784e+00 4.537e-01 8.341 1.26e-15 ***
## age 8.387e-04 1.450e-02 0.058 0.95391
## dis -1.620e+00 2.217e-01 -7.310 1.50e-12 ***
## rad 3.031e-01 7.434e-02 4.078 5.51e-05 ***
## tax -1.316e-02 4.144e-03 -3.176 0.00161 **
## ptratio -9.582e-01 1.473e-01 -6.505 2.37e-10 ***
## black 9.723e-03 2.993e-03 3.249 0.00126 **
## lstat -5.297e-01 5.691e-02 -9.308 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.692 on 393 degrees of freedom
## Multiple R-squared: 0.7549, Adjusted R-squared: 0.7468
## F-statistic: 93.1 on 13 and 393 DF, p-value: < 2.2e-16
Note that even though we now have 13 predictors, the R2 value has not improved that much: in the model
with 3 predictors (lm2) it was 0.693, and now it is 0.755. In addition, it should be recalled that R2 increases
with the number of predictors, no matter how good/useful they are.
The 3 predictors from the previous model are still highly significant, and there are a number of other
significant variables as well.
Let’s do the prediction using the new model:
# calculate the predictions with the lm3 model over the test data
lm3.predict <- predict(lm3, newdata = test.boston)
# show the first few predictions
head(lm3.predict)
## 3 5 11 12 14 15
## 30.70615 28.05079 18.88585 21.53429 19.68122 19.43022
Plot the distribution of predictions against the real values of the response variable (medv).
# combine the test set with the predictions
test.boston.lm3 <- cbind(test.boston, pred = lm3.predict)
[Density plot comparing the distributions of the predicted and real medv values on the test set.]
As before, we’ll compute R2 on the test data.
# calculate RSS
lm3.test.RSS <- sum((lm3.predict - test.boston$medv)^2)
# calculate R-squared on the test set (TSS was computed earlier)
1 - lm3.test.RSS/lm.test.TSS
## [1] 0.6635756
Again, we got a lower R2 than on the training set.
We can also compute RMSE:
# calculate RMSE
lm3.test.RMSE <- sqrt(lm3.test.RSS/nrow(test.boston))
lm3.test.RMSE
## [1] 4.992775
It is lower (therefore, better) than with the previous model.
TASK: use diagnostic plots to examine how well the model adheres to the assumptions.
Considering the number of variables in the model, we should check for multicollinearity (Assumption 4). To
do that, we'll compute the variance inflation factor (VIF):
# load the 'car' package
library(car)
# calculate vif
vif(lm3)