
Applied Regression Analysis

Course notes for STAT 378/502

Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1

November 19, 2018


This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-sa/4.0/.
Contents

Preface

1 Ordinary Least Squares
  1.1 Point Estimation
    1.1.1 Derivation of the OLS estimator
    1.1.2 Maximum likelihood estimate under normality
    1.1.3 Proof of the Gauss-Markov Theorem
  1.2 Hypothesis Testing
    1.2.1 Goodness of fit
    1.2.2 Regression coefficients
    1.2.3 Partial F-test
  1.3 Confidence Intervals
  1.4 Prediction Intervals
    1.4.1 Prediction for an expected observation
    1.4.2 Prediction for a new observation
  1.5 Indicator Variables and ANOVA
    1.5.1 Indicator variables
    1.5.2 ANOVA

2 Model Assumptions
  2.1 Plotting Residuals
    2.1.1 Plotting Residuals
  2.2 Transformations
    2.2.1 Variance Stabilizing
    2.2.2 Linearization
    2.2.3 Box-Cox and the power transform
    2.2.4 Cars Data
  2.3 Polynomial Regression
    2.3.1 Model Problems
    2.3.2 Piecewise Polynomials
    2.3.3 Interacting Regressors
  2.4 Influence and Leverage
    2.4.1 The Hat Matrix
    2.4.2 Cook's D
    2.4.3 DFBETAS
    2.4.4 DFFITS
    2.4.5 Covariance Ratios
    2.4.6 Influence Measures: An Example
  2.5 Weighted Least Squares

3 Model Building
  3.1 Multicollinearity
    3.1.1 Identifying Multicollinearity
    3.1.2 Correcting Multicollinearity
  3.2 Variable Selection
    3.2.1 Subset Models
    3.2.2 Model Comparison
    3.2.3 Forward and Backward Selection
  3.3 Penalized Regressions
    3.3.1 Ridge Regression
    3.3.2 Best Subset Regression
    3.3.3 LASSO
    3.3.4 Elastic Net
    3.3.5 Penalized Regression: An Example

4 Generalized Linear Models
  4.1 Logistic Regression
    4.1.1 Binomial responses
    4.1.2 Testing model fit
    4.1.3 Logistic Regression: An Example
  4.2 Poisson Regression
    4.2.1 Poisson Regression: An Example

A Distributions
  A.1 Normal distribution
  A.2 Chi-Squared distribution
  A.3 t distribution
  A.4 F distribution

B Some Linear Algebra


Preface

I never felt such a glow of loyalty and respect towards the sovereignty and
magnificent sway of mathematical analysis as when his answer reached me
confirming, by purely mathematical reasoning, my various and laborious
statistical conclusions.

Regression towards Mediocrity in Hereditary Stature
Sir Francis Galton, FRS (1886)

The following are lecture notes originally produced for an upper level undergraduate
course on linear regression at the University of Alberta in the fall of 2017.
Regression is one of the main, if not the primary, workhorses of statistical inference.
Hence, I do hope you will find these notes useful in learning about regression.
The goal is to begin with the standard development of ordinary least squares in
the multiple regression setting, then to move onto a discussion of model assumptions
and issues that can arise in practice, and finally to discuss some specific instances of
generalized linear models (GLMs) without delving into GLMs in full generality. Of
course, what follows is by no means a unique exposition but is mostly derived from
three main sources: the text, Linear Regression Analysis, by Montgomery, Peck, and
Vining; the course notes of Dr. Linglong Kong, who lectured this same course in
2013; and whatever remains inside my brain from supervising (TAing) undergraduate
statistics courses at the University of Cambridge during my PhD years.

Adam B Kashlak
Edmonton, Canada
August 2017

Chapter 1

Ordinary Least Squares

Introduction
Linear regression begins with the simple but profound idea that some observed
response variable, Y ∈ R, is a function of p input or regressor variables x1 , . . . , xp
with the addition of some unknown noise variable ε. Namely,

Y = f (x1 , . . . , xp ) + ε

where the noise is generally assumed to have mean zero and finite variance. In this
setting, Y is usually considered to be a random variable while the xi are considered
fixed. Hence, the expected value of Y is in terms of the unknown function f and the
regressors:
E (Y |x1 , . . . , xp ) = f (x1 , . . . , xp ).
While f can be considered to be in some very general classes of functions, we
begin with the standard linear setting. Let β0 , β1 , . . . , βp ∈ R. Then, the multiple
regression model is
Y = β0 + β1 x1 + . . . + βp xp + ε = β^T X + ε
where β = (β0 , . . . , βp )T and X = (1, x1 , . . . , xp )T . The simple regression model is a
submodel of the above where p = 1, which is

Y = β0 + β1 x1 + ε,

and will be treated concurrently with multiple regression.


In the statistics setting, the parameter vector β ∈ R^{p+1} is unknown. The analyst
observes multiple replications of regressor and response pairs, (X1 , Y1 ), . . . , (Xn , Yn )
where n is the sample size, and wishes to choose a “best” estimate for β based on
these n observations. This setup can be concisely written in a vector-matrix form as

Y = Xβ + ε (1.0.1)

Figure 1.1: Comparing chocolate consumption in kilograms per capita to the number
of Nobel prizes received per 10 million citizens.

where

Y = (Y1 , . . . , Yn )^T ,   β = (β0 , β1 , . . . , βp )^T ,   ε = (ε1 , . . . , εn )^T ,

and X is the design matrix whose ith row is (1, xi,1 , . . . , xi,p ).
Note that Y, ε ∈ R^n , β ∈ R^{p+1} , and X ∈ R^{n×(p+1)} .


As Y is a random variable, we can compute its mean vector and covariance
matrix as follows:

EY = E(Xβ + ε) = Xβ

and

Var(Y ) = E[(Y − Xβ)(Y − Xβ)^T ] = E[εε^T ] = Var(ε) = σ^2 In .


An example of a linear regression from a study from the New England Journal of
Medicine can be found in Figure 1.1. This study highlights the correlation between
chocolate consumption and Nobel prizes received in 16 different countries.

Table 1.1: The four variables in the linear regression model of Equation 1.0.1 split
between whether they are fixed or random variables and between whether or not the
analyst knows their value.

            Known    Unknown
  Fixed       X         β
  Random      Y         ε

Definitions
Before continuing, we require the following collection of terminology.
The response Y and the regressors X were already introduced above. These
elements comprise the observed data in our regression. The noise or error variable is
ε. The entries in this vector are usually considered to be independent and identically
distributed (iid) random variables with mean zero and finite variance σ 2 < ∞.
Very often, this vector will be assumed to have a multivariate normal distribution:
ε ∼ N(0, σ^2 In ) where In is the n × n identity matrix. The variance σ^2 is also
generally considered to be unknown to the analyst.
The unknown vector β is our parameter vector. Eventually, we will construct an
estimator β̂ from the observed data. Given such an estimator, the fitted values are
Ŷ := X β̂. These values are what the model believes are the expected values at each
regressor.
Given the fitted values, the residuals are r = Y − Ŷ which is a vector with entries
ri = Yi − Ŷi . This is the difference between the observed response and the expected
response of our model. The residuals are of critical importance to testing how good
our model is and will reappear in most subsequent sections.
Lastly, there is the concept of sums of squares. Letting Ȳ = n^{-1} Σ_{i=1}^n Yi be the
sample mean for Y , the total sum of squares is SStot = Σ_{i=1}^n (Yi − Ȳ)^2 , which can
be thought of as the total variation of the responses. This can be decomposed into a
sum of the explained sum of squares and the residual sum of squares as follows:

SStot = SSexp + SSres = Σ_{i=1}^n (Ŷi − Ȳ)^2 + Σ_{i=1}^n (Yi − Ŷi )^2 .

The explained sum of squares can be thought of as the amount of variation explained
by the model while the residual sum of squares can be thought of as a measure of
the variation that is not yet contained in the model. The sum of squares gives us
an expression for the so called coefficient of determination, R2 = SSexp /SStot =
1 − SSres /SStot ∈ [0, 1], which is treated as a measure of what percentage of the
variation is explained by the given model.

1.1 Point Estimation
In the ordinary least squares setting, our choice of estimator is

β̂ = argmin_{β̃ ∈ R^{p+1}} Σ_{i=1}^n (Yi − Xi,· β̃)^2          (1.1.1)

where Xi,· is the ith row of the matrix X. In the simple regression setting, this
reduces to

(β̂0 , β̂1 ) = argmin_{(β̃0 ,β̃1 ) ∈ R^2} Σ_{i=1}^n (Yi − (β̃0 + β̃1 xi ))^2 .

Note that this is equivalent to choosing a β̂ to minimize the sum of the squared
residuals.
It is perfectly reasonable to consider other criteria beyond minimizing the sum
of squared residuals. However, this approach results in an estimator with many nice
properties. Most notable is the Gauss-Markov theorem:
Theorem 1.1.1 (Gauss-Markov Theorem). Given the regression setting from Equa-
tion 1.0.1 with errors satisfying Eεi = 0 for i = 1, . . . , n, Var(εi ) = σ^2 for
i = 1, . . . , n, and cov(εi , εj ) = 0 for i ≠ j, the least squares estimator has minimal
variance among all linear unbiased estimators. (This is sometimes referred to as the
"Best Linear Unbiased Estimator" or BLUE.)
Hence, it can be shown that the estimator is unbiased, Eβ̂ = β, and that the
constructed least squares line passes through the centre of the data in the sense
that the sum of the residuals is zero, Σ_{i=1}^n ri = 0, and that Ȳ = X̄ · β̂ where
Ȳ = n^{-1} Σ_{i=1}^n Yi is the sample mean of the Yi and where X̄ is the vector of
column means of the matrix X.

1.1.1 Derivation of the OLS estimator [1]


The goal is to derive an explicit solution to Equation 1.1.1. First, consider the
following partial derivative:
∂/∂β̂k Σ_{i=1}^n (Yi − Xi,· β̂)^2 = −2 Σ_{i=1}^n (Yi − Xi,· β̂) Xi,k
                               = −2 Σ_{i=1}^n (Yi − Σ_j Xi,j β̂j ) Xi,k
                               = −2 Σ_{i=1}^n Yi Xi,k + 2 Σ_{i=1}^n Σ_j Xi,j Xi,k β̂j
[1] See Montgomery, Peck, Vining Sections 2.2.1 and 3.2.1 for simple and multiple regression, respectively.

The above is the kth entry in the vector ∇ Σ_{i=1}^n (Yi − Xi,· β̂)^2 . Hence,

∇ Σ_{i=1}^n (Yi − Xi,· β̂)^2 = −2X^T Y + 2X^T X β̂.

Setting this equal to zero results in a critical point at

X T Y = X T X β̂

or β̂ = (X T X)−1 X T Y assuming X T X is invertible. Revisiting the terminology in


the above definitions sections gives the following:

Least Squares Estimator: β̂ = (X T X)−1 X T Y


Fitted Values: Ŷ = X β̂ = X(X T X)−1 X T Y
Residuals: r = Y − Ŷ = (In − X(X T X)−1 X T )Y

In the case that n > p, the matrix PX := X(X^T X)^{-1} X^T is a rank p + 1 projection
matrix. Similarly, In − PX is the complementary rank n − p − 1 projection matrix.
Intuitively, this implies that the fitted values are the projection of the observed
values onto a (p + 1)-dimensional subspace while the residuals arise from a projection
onto the orthogonal subspace. As a result, it can be shown that cov(Ŷ , r) = 0.
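As a quick illustration of these formulas, here is a sketch on simulated data (the
variable names are illustrative, not from the notes) computing β̂, the fitted values,
and the residuals directly and checking the estimator against lm():

set.seed(1)
n <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T Y
y_hat <- X %*% beta_hat                     # fitted values
r     <- y - y_hat                          # residuals
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))   # TRUE: same estimator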
Now that we have an explicit expression for the least squares estimator β̂, we
can show that it is unbiased.

Eβ̂ = E (X T X)−1 X T Y = (X T X)−1 X T EY = (X T X)−1 X T Xβ = β.




Following that, we can compute its variance.


Var(β̂) = E[(β̂ − β)(β̂ − β)^T ]
        = E[β̂ β̂^T ] − ββ^T
        = E[(X^T X)^{-1} X^T Y ((X^T X)^{-1} X^T Y )^T ] − ββ^T
        = (X^T X)^{-1} X^T E[Y Y^T ] X(X^T X)^{-1} − ββ^T
        = (X^T X)^{-1} X^T (σ^2 In + Xββ^T X^T ) X(X^T X)^{-1} − ββ^T
        = σ^2 (X^T X)^{-1} .

Thus far, we have only assumed that ε is a random vector with iid entries with
mean zero and variance σ^2 . If, in addition, we assume that ε has a normal or
Gaussian distribution, then

ε ∼ N(0, σ^2 In ),   Y ∼ N(Xβ, σ^2 In ),   and   β̂ ∼ N(β, σ^2 (X^T X)^{-1} ).
Furthermore, with a little work, one can show that the fitted values and residuals
also have normal distributions in this setting:

Ŷ ∼ N(Xβ, σ^2 PX )   and   r ∼ N(0, σ^2 (In − PX )).

Notice that the two covariance matrices above are not generally of full rank. The
assumption that the errors follow a normal distribution is very commonly made in
practice.

1.1.2 Maximum likelihood estimate under normality [2]


In the previous section, the OLS estimator is derived by minimizing the sum of the
squared errors. Now, given the additional assumption that the errors have a normal
distribution, we can compute an alternative estimator for β: the maximum likelihood
estimate (MLE). We can also use this to simultaneously compute the MLE for σ 2 .
From above we have that Y ∼ N(Xβ, σ^2 In ), and hence the likelihood is

L(β, σ^2 ; X, Y ) = (2πσ^2 )^{-n/2} exp( −(1/(2σ^2)) (Y − Xβ)^T (Y − Xβ) ).
The log likelihood is then

ℓ(β, σ^2 ; X, Y ) = log L(β, σ^2 ; X, Y )
                 = −(n/2) log 2π − (n/2) log σ^2 − (1/(2σ^2)) (Y − Xβ)^T (Y − Xβ).
This implies that the MLE for β comes from solving
0 = ∂ℓ/∂β = ∂/∂β (Y − Xβ)^T (Y − Xβ),
which is solved by the OLS estimator from above. Hence, the MLE under normality
is the least squares estimator.
For the variance term σ 2 , the MLE is similarly found by solving
0 = ∂ℓ/∂σ^2 = −(n/2)(σ^2 )^{-1} + ((σ^2 )^{-2}/2) (Y − Xβ)^T (Y − Xβ).

This occurs for σ̂^2 = n^{-1} (Y − X β̂)^T (Y − X β̂), which is just the average sum of
squares of the residuals: σ̂^2 = n^{-1} Σ_{i=1}^n ri^2 . However, this is a biased estimator of
the variance as the residuals are not independent and have a degenerate covariance
matrix of rank n − p − 1. Intuitively, this implies that the sum of squared residuals
has n − p − 1 degrees of freedom resulting in

SSres /σ^2 = (1/σ^2 ) Σ_{i=1}^n ri^2 ∼ χ^2 (n − p − 1)
[2] See Montgomery, Peck, Vining Section 3.2.6.

and the unbiased estimator of σ 2 being SSres /(n − p − 1) = σ̂ 2 (n/(n − p − 1)).
For a more precise explanation of where this comes from, see Cochran’s theorem
(https://en.wikipedia.org/wiki/Cochran%27s_theorem), which is beyond the
scope of this course.
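As a sketch of the distinction between the two variance estimators (using R's built-in
cars data purely so the lines run as written), the MLE divides SSres by n while the
unbiased estimator divides by n − p − 1; the latter is what summary() reports as sigma:

fit <- lm(dist ~ speed, data = cars)
r   <- resid(fit)
c(mle = mean(r^2),                           # SSres / n
  unbiased = sum(r^2) / df.residual(fit),    # SSres / (n - p - 1)
  summary  = summary(fit)$sigma^2)           # matches the unbiased estimate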

Chocolate-Nobel Data
Running a regression in R on the chocolate consumption vs Nobel prize data from
Figure 1.1 results in

β̂ = (−0.9910, 1.3545)^T .
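The country-level data themselves are not reproduced in these notes, but assuming a
hypothetical data frame nobel_df with columns choco and nobel, the fit would be
obtained along these lines:

# 'nobel_df', 'choco', and 'nobel' are hypothetical names for the data of Figure 1.1.
fit <- lm(nobel ~ choco, data = nobel_df)
coef(fit)      # roughly (-0.9910, 1.3545) as reported above
summary(fit)   # coefficient table, F statistic, and R^2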

1.1.3 Proof of the Gauss-Markov Theorem


Proof. Any linear estimator can be written as AY for some non-random matrix
A ∈ R(p+1)×n . We can in turn write A = (X T X)−1 X T + D for some matrix
D ∈ R(p+1)×n . Then, as

E(AY ) = AXβ
       = ((X^T X)^{-1} X^T + D) Xβ
       = β + DXβ,

the unbiased condition implies that DXβ = 0 for any β ∈ Rp+1 and hence that
DX = 0.
Next, we compute the variance of the arbitrary linear unbiased estimator to get

Var(AY ) = A Var(Y ) A^T
         = σ^2 ((X^T X)^{-1} X^T + D)((X^T X)^{-1} X^T + D)^T
         = σ^2 ((X^T X)^{-1} + (X^T X)^{-1} X^T D^T + DX(X^T X)^{-1} + DD^T )
         = σ^2 ((X^T X)^{-1} + DD^T ),

where the final equality uses DX = 0.
 

Hence, to minimize the variance, we must minimize DD^T as DD^T is necessarily a
positive semi-definite matrix. This is achieved by setting D = 0 and arriving at
(X^T X)^{-1} X^T Y having minimal variance.

Remark 1.1.2. Note that DD^T is positive semi-definite for any choice of D as for
any w ∈ R^{p+1} , we have

w^T (DD^T )w = (D^T w)^T (D^T w) = ‖D^T w‖^2 ≥ 0.

Remark 1.1.3. While β̂ has minimal variance over all unbiased estimators, we can
lessen the variance further if we allow for biased estimators. This is considered in
many more advanced regression methods such as ridge regression.

1.2 Hypothesis Testing
1.2.1 Goodness of fit [3]
We now have a model for our data, and in some sense, this model is optimal as it
minimizes the squared errors. However, even being optimal, we are still interested
in knowing whether or not this is a good model for our data. This is a question of
goodness of fit.
The first question to ask is, do any of the regressors provide information about
the response in the linear model framework? This can be written mathematically as

H0 : β1 = . . . = βp = 0     H1 : ∃i ≥ 1 s.t. βi ≠ 0,          (1.2.1)

which asks whether there is at least one βi that we can claim is non-zero, which
would imply that the regressor xi has some nontrivial influence over y.
To test this hypothesis, we revisit the explained and residual sums of squares
introduced in the Definitions section. Specifically, we already have that SSres /σ 2 ∼
χ2 (n − p − 1) from above. Similarly, SSexp /σ 2 ∼ χ2 (p) under the null hypothesis
where β1 = . . . = βp = 0, and hence any variation in those terms should be pure
noise. Lastly, it can be demonstrated that SSres and SSexp are independent random
variables, which intuitively follows from the orthogonality of the fitted values and
the errors.
The usual test statistic for Hypothesis 1.2.1 is

F = (SSexp /p) / (SSres /(n − p − 1)) ∼ F (p, n − p − 1),

which leads to an F test. If the test statistic is large, then the explained variation
is larger than the noise resulting in a small p-value and a rejection of the null
hypothesis.
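In R, the overall F statistic and its p-value can be pulled out of the summary()
object; a sketch (here fitting the built-in cars data so the lines run as written):

fit   <- lm(dist ~ speed, data = cars)
fstat <- summary(fit)$fstatistic               # c(value, numdf, dendf)
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)   # p-value of the overall F test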

F test on Chocolate-Nobel data


From the final line of the R output from the summary() command, we have a test
statistic value of 15.45 with degrees of freedom 1 and 14. This results in a very small
p-value of 0.001508.
If you were to run the regression in R without the intercept term, which is fixing
β0 = 0, then the result is β̂1 = 1.22, a value for the test statistic for the F test
of 44.24, now with degrees of freedom 1 and 15, and an even smaller p-value of
7.7 × 10−06 .
[3] See Montgomery, Peck, Vining Sections 2.3.2 and 3.3.1 for simple and multiple regression, respectively.

1.2.2 Regression coefficients [4]
Given that the previous F test results in a significant p-value, the subsequent question
is to ask which of the p regressors are significant? Hence, we have the following
hypotheses for j = 0, 1, . . . , p.

H0,j : βj = 0     H1,j : βj ≠ 0.
 
Each individual β̂j ∼ N( βj , σ^2 (X^T X)^{-1}_{j,j} ) where (X^T X)^{-1}_{j,j} is the jth entry in the
diagonal of (X^T X)^{-1} .[5] Thus, under the null hypothesis that βj = 0, we have that

β̂j / √( σ^2 (X^T X)^{-1}_{j,j} ) ∼ N(0, 1).

However, we cannot perform a z test as σ^2 is unknown. To rectify this, the unbiased
estimator for σ^2 is used in its place resulting in

β̂j / √( (X^T X)^{-1}_{j,j} SSres /(n − p − 1) ) ∼ t(n − p − 1),

and a t test can be performed. If the value of the test statistic is large, then there
may be sufficient evidence to reject the null that βj = 0. The denominator is often
referred to as the standard error. To simplify future formulae, this will be denoted
as se(βj ).
It is worth noting that this test looks for significant influence of the jth regressor
on the response given all of the other regressors. Hence, it quantifies the marginal as
opposed to the absolute effect of that variable on the model. These ideas will be
investigated further when discussing variable selection in Section 3.2. However, as a
quick word of caution, when p hypothesis tests are performed, the analyst needs to
consider multiple testing corrections.

t test on Chocolate-Nobel data


The R commands lm() and summary() will return a table of regression coefficients, t
test statistics and p-values associated with each coefficient. For the Chocolate-Nobel
prize data, the table looks like

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -0.9910      2.1327   -0.465   0.64932
choco          1.3545      0.3446    3.931   0.00151 **
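A table like the one above is stored in summary(fit)$coefficients; as a sketch, the t
values can also be reproduced by hand from the estimated standard errors (a stand-in
model on the cars data is used here so the lines run as written):

fit <- lm(dist ~ speed, data = cars)
summary(fit)$coefficients            # Estimate, Std. Error, t value, Pr(>|t|)
se   <- sqrt(diag(vcov(fit)))        # vcov(fit) = (SSres/(n-p-1)) (X^T X)^{-1}
tval <- coef(fit) / se
2 * pt(abs(tval), df.residual(fit), lower.tail = FALSE)   # two-sided p-values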
[4] See Montgomery, Peck, Vining Sections 2.3.1 and 3.3.2 for simple and multiple regression, respectively.
[5] We will index the entries of the matrix from 0, 1, . . . , p to conform with the indexing of the β's. Note that this is a (p + 1) × (p + 1) matrix.

1.2.3 Partial F-test [6]
In the previous two sections, we first tested as to whether or not there exists at least
one βj , j = 1, . . . , p, that is non-zero. Then, we tested whether or not a specific βj
is non-zero. The next logical question is whether or not some collection of βj ’s of
size strictly between 1 and p has a non-zero element. That is, for a fixed q,
H0 : βp−q+1 = . . . = βp = 0     H1 : ∃i ≥ p − q + 1 s.t. βi ≠ 0.          (1.2.2)
Here, we are comparing two different models, which are the partial and full models,
respectively,
Y = β0:p−q X + ε, and Y = β0:p−q X + βp−q+1:p X + ε,
and want to know whether the final q regressors add any significant explanation to
our model given the other p − q. For the above notation,
βi:j = (0, . . . , 0, βi , βi+1 , . . . , βj , 0, . . . , 0)T .
To run the hypothesis test 1.2.2, we would have to compute the least squares
estimator in the partial model, β̂1:p−q , and the standard least squares estimator in
the full model, β̂. Then, we will have to compute the additional explained sum of
squares gained from adding the q extra regressors to our model, which is
SSexp (βp−q+1:p |β1:p−q ) = SSexp (β) − SSexp (β1:p−q ),
the explained sum of squares from the full model minus the explained sum of squares
from the partial model.
Similarly to the full F-test from above, we have under the null hypothesis that
SSexp (βp−q+1:p |β1:p−q )/σ^2 ∼ χ^2 (q). Hence,

( SSexp (βp−q+1:p |β1:p−q )/q ) / ( SSres /(n − p − 1) ) ∼ F (q, n − p − 1),
so if this test statistic is large, then we have evidence to suggest that at least one
of the additional q regressors adds some explanatory power to our model.
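In R, a partial F-test is carried out by fitting the partial and full models and passing
both to anova(); a sketch with simulated data and illustrative variable names:

set.seed(2)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), x4 = rnorm(50))
dat$y <- 1 + dat$x1 - 2 * dat$x2 + rnorm(50)
fit_partial <- lm(y ~ x1 + x2, data = dat)             # first p - q regressors
fit_full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)   # all p regressors
anova(fit_partial, fit_full)    # F statistic with (q, n - p - 1) degrees of freedom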

1.3 Confidence Intervals [7]


Confidence intervals play a complementary role with hypothesis testing. From the
development of the above test for an individual βj , we have that

(β̂j − βj ) / se(βj ) ∼ t(n − p − 1).
[6] See Montgomery, Peck, Vining, the latter half of Section 3.3.2.
[7] See Montgomery, Peck, Vining Sections 2.4.1 and 3.4.1 for simple and multiple regression, respectively.

Hence, a 1 − α confidence interval for the parameter βj is

β̂j − tα/2,n−p−1 se(βj ) ≤ βj ≤ β̂j + tα/2,n−p−1 se(βj )

where tα/2,n−p−1 ∈ R+ is such that P(T > tα/2,n−p−1 ) = α/2 when T ∼ t(n − p − 1).




While the above can be used to produce a confidence interval for each individual
parameter, combining these intervals will not result in a 1 − α confidence set for the
entire parameter vector. To construct such a confidence region, a little more care is
required.[8] Also, we will construct a confidence set for the entire vector (β0 , β1 , . . . , βp ),
which results in p + 1 degrees of freedom in what follows. As β̂ ∼ N(β, σ^2 (X^T X)^{-1} ),
we have that

σ^{-2} (β̂ − β)^T X^T X(β̂ − β) ∼ χ^2 (p + 1).
From before, we have that SSres /σ^2 ∼ χ^2 (n − p − 1). Hence

( (β̂ − β)^T X^T X(β̂ − β)/(p + 1) ) / ( SSres /(n − p − 1) ) ∼ F (p + 1, n − p − 1).

Thus, a 1 − α confidence ellipsoid can be constructed as

( (β̂ − β)^T X^T X(β̂ − β)/(p + 1) ) / ( SSres /(n − p − 1) ) ≤ Fα,p+1,n−p−1 .

A 95% and a 99% confidence ellipsoid for the Chocolate-Nobel prize data are
displayed in Figure 1.2. Notice that both ellipses contain β0 = 0, which had a t
statistic p-value of 0.649. Meanwhile, neither contains β1 = 0, whose p-value was the
very significant 0.0015. The confidence ellipses were plotted with help from the R
library ellipse.
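In R, the marginal t intervals come from confint(), and a joint confidence ellipse for
a pair of coefficients can be drawn with the ellipse package mentioned above; a sketch
using the built-in cars data as a stand-in model:

fit <- lm(dist ~ speed, data = cars)
confint(fit, level = 0.95)                # marginal 95% intervals for each beta_j
library(ellipse)
plot(ellipse(fit, which = c(1, 2), level = 0.95), type = "l")   # joint 95% ellipse
points(coef(fit)[1], coef(fit)[2])        # centred at the least squares estimates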

1.4 Prediction Intervals


1.4.1 Prediction for an expected observation [9]
Given the least squares model, the analyst may be interested in estimating the
expected value of Y given some specific input x = (1, x1 , . . . , xp ). Our new random
variable is Ŷ0 = β̂ · x where x is fixed and β̂ is random. Of course, the expected
value is just

E( Ŷ0 | X = x ) = Eβ̂ · x = β0 + Σ_{i=1}^p βi xi .
[8] See Montgomery, Peck, Vining Section 3.4.3.
[9] See Montgomery, Peck, Vining Sections 2.4.2 and 3.4.2 for simple and multiple regression, respectively.

Figure 1.2: Confidence ellipses at the 95% and 99% level for the chocolate consump-
tion vs Nobel prizes data, plotted in the (Intercept, choco) plane.

To find a 1 − α interval estimate for Ŷ0 at X = x, recall once again that
β̂ ∼ N(β, σ^2 (X^T X)^{-1} ). Thus,

Ŷ0 |(X = x) ∼ N( β · x, σ^2 x^T (X^T X)^{-1} x ).

Hence,

( β̂ · x − E( Ŷ0 | X = x ) ) / √( σ^2 x^T (X^T X)^{-1} x ) ∼ N(0, 1),

and

( β̂ · x − E( Ŷ0 | X = x ) ) / √( (SSres /(n − p − 1)) x^T (X^T X)^{-1} x ) ∼ t(n − p − 1),

which results in the following 1 − α confidence interval:

β̂ · x − tα/2,n−p−1 √( x^T (X^T X)^{-1} x SSres /(n − p − 1) )
   ≤ E( Ŷ0 | X = x ) = β · x ≤
β̂ · x + tα/2,n−p−1 √( x^T (X^T X)^{-1} x SSres /(n − p − 1) ).

1.4.2 Prediction for a new observation [10]
In the previous subsection, we asked for a confidence interval for the expected value
of the response given a new vector of regressors, which was a confidence interval
for E( Ŷ0 | X = x ) = β · x based on β̂ · x. Now, we want to determine a prediction
interval for a future response given a vector of regressors. That is, we want an
interval for Y0 = β · x + ε0 ∼ N(β · x, σ^2 ), but, as usual, β is unknown. To circumvent
this, note that

Y0 − Ŷ0 = (β · x + ε0 ) − β̂ · x ∼ N( 0, σ^2 (1 + x^T (X^T X)^{-1} x) ),

because the variances of ε0 and β̂ · x sum as these are independent random variables.
Hence, applying the usual rearrangement of terms and replacement of σ^2 with
SSres /(n − p − 1) results in

β̂ · x − tα/2,n−p−1 √( (1 + x^T (X^T X)^{-1} x) SSres /(n − p − 1) )
   ≤ Y0 ≤
β̂ · x + tα/2,n−p−1 √( (1 + x^T (X^T X)^{-1} x) SSres /(n − p − 1) ).

Intervals for the Chocolate-Nobel prize data for both the expected mean and for
a new observation are displayed in Figure 1.3.
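Both formulas are implemented in predict(): interval = "confidence" gives the interval
for the expected response and interval = "prediction" the wider interval for a new
observation. A sketch on the built-in cars data:

fit  <- lm(dist ~ speed, data = cars)
newx <- data.frame(speed = 21)
predict(fit, newdata = newx, interval = "confidence", level = 0.95)  # for E(Y0 | x)
predict(fit, newdata = newx, interval = "prediction", level = 0.95)  # for a new Y0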

1.5 Indicator Variables and ANOVA [11]


1.5.1 Indicator Variables [12]
Thus far, we have considered models of the form

y = β0 + β1 x1 + . . . + βp xp + ε

where the regressors x1 , . . . , xp ∈ R can take on any real value. However, very often
in practice, we have regressors that take on categorical values. For example, male vs
female, employed vs unemployed, treatment vs placebo, Edmonton vs Calgary, etc.
When there is a binary choice as in these examples, we can choose one category to
correspond to zero and the other category to correspond to one.
As an example, consider

y = β0 + β1 x1 + β2 x2 + ε (1.5.1)
[10] See Montgomery, Peck, Vining Section 3.5.
[11] See Montgomery, Peck, and Vining Chapter 8.
[12] See Montgomery, Peck, and Vining Sections 8.1 and 8.2.

Figure 1.3: A 95% confidence interval (in blue) for the expected value and a 95%
prediction interval (in red) for a new observation for the chocolate consumption vs
Nobel prizes data.

where x1 ∈ R and x2 ∈ {0, 1}. Then, we effectively have two models:

y = β0 + β1 x1 + 0 + ε
y = β0 + β1 x1 + β2 + ε = (β0 + β2 ) + β1 x1 + ε.

What we have is two models with the same slope β1 but with two different intercepts
β0 and β0 + β2 , which are two parallel lines.

Remark 1.5.1. A first thought is to merely split the data and train two separate
models. However, we want to use the entire dataset at once specifically to estimate
the common slope β1 with as much accuracy as possible.

While the range of the regressors has changed, we will fit the least squares
estimate to the model precisely as before. Now, considering the model 1.5.1, assume
that we have m samples with x2 = 0 and n samples with x2 = 1. Our design matrix
takes on a new form: Y = (Y1 , . . . , Ym , Ym+1 , . . . , Ym+n )^T , β = (β0 , β1 , β2 )^T ,
ε = (ε1 , . . . , εm+n )^T , and X is the (m + n) × 3 matrix whose ith row is (1, xi , 0)
for i = 1, . . . , m and (1, xi , 1) for i = m + 1, . . . , m + n.

However, the least squares estimate is computed as before as β̂ = (X T X)−1 X T Y .
Furthermore, we can perform hypothesis tests on the fitted model such as

H0 : β2 = 0 H1 : β2 6= 0,

which is equivalently asking whether or not the regression lines have the same
intercept.
Models can be expanded to include multiple indicator variables as long as the
matrix X T X is still invertible. For example, let’s suppose we want to look at wages
in Alberta with respect to age but partitioned for male vs female and for Edmonton
vs Calgary. Then, the model would look like

(wage) = β0 + β1 (age) + β2 (Is male?) + β3 (Is from Edmonton?).

In the silly case that our data only consisted of men from Calgary and women from
Edmonton, then the final regressor is redundant and X T X will not be invertible.
While this extreme case should not occur, it is possible to have an imbalance in the
categories, which we will discuss later.
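In R, categorical regressors are stored as factors and lm() creates the 0/1 indicator
columns automatically; a sketch of the wage example with simulated (entirely made
up) data:

set.seed(3)
n <- 200
dat <- data.frame(
  age  = runif(n, 20, 65),
  sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  city = factor(sample(c("Calgary", "Edmonton"), n, replace = TRUE)))
dat$wage <- 20 + 0.5 * dat$age + 3 * (dat$sex == "male") +
  2 * (dat$city == "Edmonton") + rnorm(n, sd = 4)
fit <- lm(wage ~ age + sex + city, data = dat)
head(model.matrix(fit))   # shows the automatically constructed indicator columns
summary(fit)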

1.5.2 ANOVA [13]
ANOVA, or the Analysis of Variance, is a slightly overloaded term in statistics. We
already considered ANOVA tables when comparing nested models in the hypothesis
tests of Section 1.2. However, ANOVA can also be used in the setting of the so-called
One-Way Analysis of Variance.[14] In this case, we want to compare k samples for
equality of the means. For example, we take height measurements from randomly
selected citizens from different countries and ask whether or not there is significant
evidence to reject the claim that all nations have roughly the same height distribution.
The reason for discussing ANOVA in these notes is that it can be written in
a linear regression context as follows. Imagine that we have k different groups of
observations with sample sizes nj , j = 1, . . . , k for each group. Let yi,j be the ith
observation from the jth group where i ∈ {1, . . . , nj } and j ∈ {1, . . . , k}. The model
is
yi,j = µj + εi,j ,

which says that each observation is just some group mean, µj , with the addition of
random noise.
From here, one can show that the fitted values are just ŷi,j = nj^{-1} Σ_{l=1}^{nj} yl,j ,
which is the jth sample mean. Then, an F-test can be performed similar to that of
Section 1.2 to test

H0 : µ1 = . . . = µk     H1 : ∃j1 ≠ j2 s.t. µj1 ≠ µj2 .


[13] See Montgomery, Peck, and Vining Section 8.3.
[14] https://en.wikipedia.org/wiki/One-way_analysis_of_variance

To reformulate the model to align with our F-test from before, we rewrite it as a
linear regression with indicator variables for the regressors

y = β0 + β1 x1 + . . . + βk−1 xk−1 + εi,j

with β0 = µk and βj = µj − µk for j = 1, . . . , k − 1. Then, we can test for whether
or not there exists at least one βj ≠ 0 for j = 1, . . . , k − 1. Here, the degrees of
freedom for the explained sum of squares is k − 1 and the degrees of freedom for the
residual sum of squares is N − (k − 1) − 1 = N − k with N = Σ_{j=1}^k nj . If all of the
nj are equal, this reduces to k(n − 1).
In this case, the vector Y and the design matrix X will take on the form
Y = (y1,1 , y2,1 , . . . , yn1,1 , y1,2 , . . . , ynk,k )^T , and X is the N × k matrix whose first
column is all ones and whose (j + 1)th column, for j = 1, . . . , k − 1, indicates
membership in group j. That is, a row for an observation from group j < k is
(1, 0, . . . , 0, 1, 0, . . . , 0) with the second 1 in position j + 1, and a row for an
observation from group k is (1, 0, . . . , 0).
Chapter 2

Model Assumptions

Introduction
Thus far, we have considered linear regression given some key assumptions. Namely,
for models of the form

y = β0 + β1 x1 + . . . + βp xp + ε

we assume that the noise or errors ε have zero mean, constant finite variance, and
are uncorrelated. Furthermore, in order to perform hypothesis tests (F tests, t tests,
partial F-tests) and to construct confidence and prediction intervals as we did in the
previous chapter, we further assume that ε has a normal distribution.
In this chapter, we will consider deviations from these assumptions, which will
lead to questions such as
1. What happens if the variance of ε is not constant?
2. What happens if ε has heavier tails than those of a normal distribution?
3. What effect can outliers have on our model?
4. How can we transform our data to correct for some of these deviations?
5. What happens if the true model is not linear?

2.1 Plotting Residuals [1]


In the last chapter, we constructed the residuals as follows. Assume we have a sample
of n observations (Yi , Xi ) where Yi ∈ R and Xi ∈ Rp with p being the number of
regressors and p < n. The model, as before, is

Y = Xβ + ε
[1] See Montgomery, Peck, Vining Section 4.2.

where Y ∈ Rn , X ∈ Rn×(p+1) , β ∈ Rp+1 , and ε ∈ Rn . The least squares estimator is
β̂ = (X T X)−1 X T Y and the vector of residuals is thus

r = (I − P )Y = (I − X(X T X)−1 X T )Y.

We can use the residuals to look for problems in our data with respect to
deviations from the assumptions. But first, they should be normalized in some way.
As we know from the previous chapter, the covariance of the residuals is (I − P )σ 2
and that SSres /(n − p − 1) is an unbiased estimator for the unknown variance σ 2 .
This implies that while the errors εi are assumed to be uncorrelated, the residuals
are, in fact, correlated. Explicitly,

Var(ri ) = (1 − Pi,i )σ^2 ,   and   cov(ri , rj ) = −Pi,j σ^2 for i ≠ j

where Pi,j is the (i, j)th entry of the matrix P . Hence, a standard normalization
technique is to write

si = ri / √( (1 − Pi,i ) SSres /(n − p − 1) ),

which are denoted as the studentized residuals. For a linear model fit in R by the
function lm(), we can extract the residuals with resid() and extract the studentized
residuals with rstudent().
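As a sketch, the si above can be computed directly from the diagonal of the hat
matrix; they coincide with R's rstandard(), while rstudent() gives the externally
studentized variant mentioned in Remark 2.1.1 below. Using the cars data as a
stand-in model:

fit <- lm(dist ~ speed, data = cars)
h   <- hatvalues(fit)                          # the diagonal entries P_{i,i}
s2  <- sum(resid(fit)^2) / df.residual(fit)    # SSres / (n - p - 1)
s   <- resid(fit) / sqrt((1 - h) * s2)         # studentized residuals s_i
all.equal(unname(s), unname(rstandard(fit)))   # TRUE
head(rstudent(fit))                            # externally studentized version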

2.1.1 Plotting Residuals


Studentized Residuals
A plot of studentized residuals from a simple linear regression is displayed in Figure 2.1.
Generally, abnormally large studentized residuals indicate that an observation may
be an outlier.

Remark 2.1.1. There are other types of residuals that can be computed such as
standardized residuals, PRESS residuals, and externally studentized residuals. These
are also used to look for outliers.

Residuals vs Fitted Values


 
The residuals and the fitted values are necessarily uncorrelated. That is, cov(r, Ŷ ) = 0.
However, if ε is not normally distributed, then they may not be independent.
Plotting the fitted values against the residuals can give useful diagnostic information
about your data.
Figure 2.2 gives four examples of plots of the residuals against the fitted values.
The first plot in the top left came from a simple linear regression model where all of
the standard assumptions are met. The second plot in the top right came from a

Figure 2.1: On the left, a plot of some data with an outlier labelled in blue and a
least squares regression line in red. On the right, the studentized residuals for each
of the 51 data points.

simple linear regression but with errors εi ∼ N(0, σi^2 ) where σi^2 was increasing in
i. Hence, the plot has an expanding look to it. The third plot in the bottom left
came from a simple linear regression with the addition of a quadratic term. Fitting
a model without the quadratic term still yielded significant test statistics, but failed
to account for the nonlinear interaction between x and y. The final plot in the
bottom right came from a simple linear regression where the errors were correlated.
Specifically, cov(εi , εj ) = min{xi , xj }.[2]

Normal Q-Q plots


Another type of plot that can offer insight into your data is the so-called Normal Q-Q
plot. This tool plots the studentized residuals against the quantiles of a standard
normal distribution.

Definition 2.1.2 (Quantile Function). The quantile function is the inverse of the
cumulative distribution function. That is, in the case of the normal distribution, let
Z ∼ N (0, 1). Then the CDF is Φ(z) = P (Z < z) ∈ (0, 1) for z ∈ R. The quantile
function is Φ^{-1} (t) ∈ [−∞, ∞] for t ∈ [0, 1].[3]

For a normal Q-Q plot, let s1 , . . . , sn be the ordered studentized residuals,


so that s1 ≤ . . . ≤ sn . The theoretical quantiles are denoted q1 , . . . , qn where
[2] Fun fact: This is actually the covariance of Brownian motion.
[3] For more details, see https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot

Figure 2.2: Four examples of plotting residuals vs fitted values. Top left is when
assumptions hold. Top right has a non-constant variance. Bottom left has nonlinear
terms. Bottom right has correlated errors.
Figure 2.3: On the left, a normal Q-Q plot for data from a simple linear regression
with ε having a normal distribution. On the right, a normal Q-Q plot for data from a
simple linear regression with ε having a t distribution.

qi = Φ−1 (i/(n + 1)). In R, a slightly different formula is used. Figure 2.3 compares
two normal Q-Q plots. On the left, the errors are normally distributed and the
ordered residuals roughly follow the gray line. On the right, the errors have a t
distribution which has heavier tails than expected. Hence, we see deviation from the
gray line for the extreme residuals.

Remark 2.1.3. As noted in Montgomery, Peck, & Vining, these are not always
easy to interpret and often fail to capture non-normality in the model. However, they
seem to be very popular nonetheless.
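A normal Q-Q plot can be produced by hand from the quantile formula above or
with the built-in qqnorm(); a sketch, again using a stand-in fit on the cars data:

fit <- lm(dist ~ speed, data = cars)
s <- sort(rstudent(fit))                      # ordered studentized residuals
q <- qnorm(seq_along(s) / (length(s) + 1))    # theoretical quantiles Phi^{-1}(i/(n+1))
plot(q, s, xlab = "Theoretical Quantiles", ylab = "Studentized residuals")
abline(0, 1, col = "gray")
# qqnorm(rstudent(fit)); qqline(rstudent(fit))  # built-in version (slightly different quantiles)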

2.2 Transformations [4]
Often, data does not follow all of the assumptions of the ordinary least squares
regression model. However, it is often possible to transform data in order to correct
for these deviations. We will consider such methods in the following subsections. One
dissatisfying remark about such methods is that they often are applied "empirically",
which less euphemistically means in an ad hoc way. Sometimes there may be
genuine information suggesting certain methods for transforming data. Other times,
transforms are chosen because they seem to work.
[4] See Montgomery, Peck, Vining Chapter 5.

2.2.1 Variance Stabilizing [5]
One of the major requirements of the least squares model is that the variance of the
errors is constant, which is that Var (yi ) = Var (εi ) = σ 2 for i = 1, . . . , n. Mainly,
when σ 2 is non-constant in y, problems can occur. Our goal is thus to find some
transformation T (y) so that Var (T (yi )) is constant for all i = 1, . . . , n.
Such a transformation T (·) can be determined through a tool known as the delta
method,[6] which is beyond the scope of these notes. However, we will consider a
simplified version for our purposes. For simplicity of notation, we write EY = µ
instead of EY = β0 + β1 x1 + . . . + βp xp . Furthermore, assume that T is twice
differentiable.[7] Then, Taylor's theorem says that

T (Y ) = T (µ) + T′(µ)(Y − µ) + (T″(ξ)/2)(Y − µ)^2

for some ξ between Y and µ. We can, with a little hand-waving, ignore the higher
order remainder term and just write

T (Y ) ≈ T (µ) + T′(µ)(Y − µ),

which implies that

ET (Y ) ≈ T (µ)   and   Var(T (Y )) ≈ T′(µ)^2 Var(Y ).

We want a transformation such that Var(T (Y )) = 1 is constant. Meanwhile, we
assume that the variance of Y is a function of the mean µ, which is Var(Y ) = s(µ)^2 .
Hence, we need to solve

1 = T′(µ)^2 s(µ)^2   or   T (µ) = ∫ (1/s(µ)) dµ.
Example 2.2.1. For a trivial example, assume that s(µ) = σ is already constant.
Then,

T (µ) = ∫ (1/σ) dµ = µ/σ.

Thus, T is just scaling by σ to achieve a unit variance.

Example 2.2.2. Now, consider the nontrivial example with s(µ) = √µ, which is
to say that the variance of Y is a linear function of µ. Then,

T (µ) = ∫ (1/√µ) dµ = 2√µ.

This is the square root transformation, which is applied to, for example, Poisson data.
The coefficient of 2 in the above derivation can be dropped as we are not concerned
with the scaling.
[5] See Montgomery, Peck, Vining Section 5.2.
[6] See Chapter 3, Asymptotic Statistics, A. W. van der Vaart, or https://en.wikipedia.org/wiki/Delta_method.
[7] We only require the second derivative for a more pleasant exposition.

Given that we have found a suitable transform, we can then apply it to the data
y1 , . . . , yn to get a model of the form
T (y) = β0 + β1 x1 + . . . + βp xp + ε
where the variance of ε is constant, i.e., not a function of the regressors.
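As a sketch of the square root transform in action, consider simulated counts whose
variance grows linearly with the mean; fitting on √y roughly stabilizes the spread of
the residuals:

set.seed(5)
x <- runif(200, 1, 10)
y <- rpois(200, lambda = 2 + 3 * x)      # Var(y) equals the mean, which grows with x
fit_raw  <- lm(y ~ x)
fit_sqrt <- lm(sqrt(y) ~ x)              # variance stabilizing square root transform
par(mfrow = c(1, 2))
plot(fitted(fit_raw),  resid(fit_raw),  main = "raw response")
plot(fitted(fit_sqrt), resid(fit_sqrt), main = "square root response")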

2.2.2 Linearization [8]
Another key assumption is that the relationship between the regressors and response,
x and y, is linear. If there is reason to believe that
y = f (β0 + β1 x1 + . . . + βp xp + ε)
and that f (·) is invertible, then we can rewrite this model as
y′ = f^{-1} (y) = β0 + β1 x1 + . . . + βp xp + ε
and apply our usual linear regression tools.
An example from the textbook, which is also quite common in practice, is to
assume that
y = c e^{β1 x1 + ... + βp xp + ε}

and apply a logarithm to transform to

y′ = log y = β0 + β1 x1 + . . . + βp xp + ε

where β0 = log c. Furthermore, if ε has a normal distribution then e^ε has a log-normal
distribution. This model is particularly useful when one is dealing with exponential
growth in some population.
Remark 2.2.3. Linearization by applying a function to y looks very similar to the
variance stabilizing transforms of the previous section. In fact, such transforms have
an effect on both the linearity and the variance of the model and should be used with
care. Often non-linear methods are preferred.
Sometimes it is beneficial to transform the regressors, the x's, as well. As we are not
treating them as random variables, there are fewer problems to consider:

y = β0 + β1 f (x) + ε   ⟹   y = β0 + β1 x′ + ε,   x = f^{-1} (x′).

Examples include

y = β0 + β1 log x + ε   ⟹   y = β0 + β1 x′ + ε,   x = e^{x′}
y = β0 + β1 x^2 + ε     ⟹   y = β0 + β1 x′ + ε,   x = √x′.

This second example can alternatively be dealt with by employing polynomial
regression, to be discussed in a subsequent section.
[8] See Montgomery, Peck, Vining Section 5.3.

2.2.3 Box-Cox and the power transform [9]
In short, Box-Cox is a family of transforms parametrized by some λ ∈ R, which can
be optimized via maximum likelihood. Specifically, for yi > 0, we aim to choose the
best transform of the form
yi → yi^{(λ)} = (yi^λ − 1)/λ   for λ ≠ 0,          yi^{(λ)} = log yi   for λ = 0,
by maximizing the likelihood as we did in Chapter 1, but with parameters β, σ 2 ,
and λ.
To do this, we assume that the transformed variables follow all of the usual least
squares regression assumptions and hence have a joint normal distribution with
f (Y^{(λ)} ) = (2πσ^2 )^{-n/2} exp( −(1/(2σ^2)) Σ_{i=1}^n (yi^{(λ)} − Xi,· β)^2 ).

Transforming Y → Y^{(λ)} is a change of variables with Jacobian

Π_{i=1}^n dyi^{(λ)}/dyi = Π_{i=1}^n yi^{λ−1} .

Hence, the likelihood function in terms of X and y is

L(β, σ^2 , λ | y, X) = (2πσ^2 )^{-n/2} exp( −(1/(2σ^2)) Σ_{i=1}^n (yi^{(λ)} − Xi,· β)^2 ) Π_{i=1}^n yi^{λ−1}

with log likelihood

log L = −(n/2) log(2πσ^2 ) − (1/(2σ^2)) Σ_{i=1}^n (yi^{(λ)} − Xi,· β)^2 + (λ − 1) Σ_{i=1}^n log yi .

From here, the MLEs for β and σ^2 are solved for as before but are now in terms of
the transformed Y^{(λ)} :

β̂ = (X^T X)^{-1} X^T Y^{(λ)}

σ̂^2 = (1/n) Σ_{i=1}^n (yi^{(λ)} − Xi,· β̂)^2 = SSres^{(λ)} / n .

Plugging these into the log likelihood and replacing all of the constants with some
C gives

log L = −(n/2) log(2πσ̂^2 ) − n/2 + (λ − 1) Σ_{i=1}^n log yi
      = C − (n/2) log σ̂^2 + log( (Π_{i=1}^n yi )^{λ−1} ).
[9] See Montgomery, Peck, Vining Section 5.4.

Defining the geometric mean of the yi to be γ = (Π_{i=1}^n yi )^{1/n} , we have

log L = C − (n/2) log σ̂^2 + (n/2) log( γ^{2(λ−1)} )
      = C − (n/2) log( σ̂^2 / γ^{2(λ−1)} ).

Considering the term inside the log, we have that it is just the residual sum of
squares from the least squares regression

Y^{(λ)} / γ^{λ−1} = Xθ + ε

where θ ∈ Rp+1 is a transformed version of the original β. Hence, we can choose


λ̂ by maximizing the log likelihood above, which is equivalent to minimizing the
residual sum of squares for this new model. This can be calculated numerically in
statistical programs like R.
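In R, the profile log likelihood in λ is computed by boxcox() from the MASS package;
a sketch using the cars data of the next subsection:

library(MASS)
bc <- boxcox(dist ~ speed, data = cars, lambda = seq(-1, 1.5, 0.01), plotit = FALSE)
bc$x[which.max(bc$y)]    # the lambda maximizing the profile log likelihood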

2.2.4 Cars Data [10]


To test some of these linearization techniques, we consider the cars dataset, which is
included in the standard distribution of R. It consists of 50 observations of a car’s
speed and a car’s stopping distance. The goal is to model and predict the stopping
distance given the speed of the car. Such a study could be used, for example, for
influencing speed limits and other road rules for the sake of public safety. The
observed speeds range from 4 to 25 mph.
We can first fit a simple regression to the data with lm(dist ~ speed, data=cars).
This results in a significant p-value, an R2 = 0.651, and an estimated model of

(dist) = −17.6 + 3.9(speed) + ε.

If we wanted to extrapolate a bit, we can use this model to predict the stopping
distance for a speed of 50 mph, which is 179 feet.
We could stop here and be happy with a significant fit. However, looking at the
data, there seems to be a nonlinear relationship between speed and stopping distance.
Hence, we could try to fit a model with the response being the square root of the
stopping distance: lm(sqrt(dist) ~ speed, data=cars). Doing so results in

√(dist) = 1.28 + 0.32(speed) + ε.

In this case, we similarly get a significant p-value and a slightly higher R2 = 0.709.
The prediction for the stopping distance for a speed of 50 mph is now the much
higher 302 feet.
[10] https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/cars.html

Figure 2.4: A plot of the cars dataset with three different regressions fit to it: simple
regression; square root transform; log-log transform.

We can further apply a log-log transform with lm(log(dist) ~ log(speed), data=cars).
This results in an even higher R^2 = 0.733. The fitted model is

log(y) = −0.73 + 1.6 log(x) + ε,

and the predicted stopping distance for a car travelling at 50 mph is 254 feet. Note
that the square root transform is modelling y ∝ x^2 whereas the log-log transform is
modelling y ∝ x^{1.6} , which is a slower rate of increase.
The three models considered are plotted in Figure 2.4. As we move away from
the range of the regressors (i.e., 4 to 25 mph), the models begin to diverge, making
extrapolation a bit dangerous.
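The three fits and their 50 mph extrapolations can be reproduced in a few lines; after
undoing the transforms, the predictions should land near the 179, 302, and 254 feet
quoted above:

fit_lin  <- lm(dist ~ speed, data = cars)
fit_sqrt <- lm(sqrt(dist) ~ speed, data = cars)
fit_log  <- lm(log(dist) ~ log(speed), data = cars)
new <- data.frame(speed = 50)
predict(fit_lin, new)        # linear model
predict(fit_sqrt, new)^2     # square to return to the original scale
exp(predict(fit_log, new))   # exponentiate to return to the original scale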

2.3 Polynomial Regression [11]


In the following subsections, we will only consider models with a single regressor
x ∈ R until Section 2.3.3 where we will consider polynomials in multiple variables
x1 , . . . , xp .
One of the examples of atypical residual behaviour from Figure 2.2 occurs
when there is an unaccounted for quadratic term in the model, such as trying to fit a
[11] See Montgomery, Peck, Vining Chapter 7.

model of the form

y = β0 + β1 x1 + ε

to data generated by

y = β0 + β1 x1^2 + ε.

If this is suspected, we can look at the residuals and try to transform x, such as
x → √x, in order to put this back into the linear model framework. However, we
can also just fit a polynomial model to our data.
Consider the setting where we observe n data pairs (yi , xi ) ∈ R2 . We can then
attempt to fit a model of the form

y = β0 + β1 x + β2 x^2 + . . . + βp x^p + ε

to the observations. The n × (p + 1) design matrix[12] will look like

X = (Xi,j ), with ith row (1, xi , xi^2 , . . . , xi^p ) for i = 1, . . . , n,

and the parameters can be estimated as usual: β̂ = (X T X)−1 X T Y .

Remark 2.3.1. While the columns of X are, in general, linearly independent, as


p gets larger, many problems can occur. In particular, the columns become near
linearly dependent resulting in instability when computing (X T X)−1 .
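One way to see this issue, and the usual remedy, is to compare the condition number
of a raw polynomial design matrix with that of the orthogonal basis produced by
poly(); a sketch:

set.seed(6)
x <- runif(50, 0, 2)
X_raw  <- cbind(1, x, x^2, x^3, x^4, x^5)   # raw (Vandermonde-style) columns
X_orth <- cbind(1, poly(x, 5))              # orthogonal polynomial columns
kappa(X_raw)     # very large condition number: near linear dependence
kappa(X_orth)    # small: (X^T X)^{-1} is computed stably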

2.3.1 Model Problems [13]


Polynomial regression is very powerful for modelling, but can also lead to very
erroneous results if used incorrectly. What follows are some potential issues to take
into consideration.

Overfitting
As a general rule, the degree of the polynomial model should be kept as low as
possible. High order polynomials can be misleading as they can often fit the data
quite well. In fact, given n data points, it is possible to fit an n − 1 degree polynomial
that passes through each data point. In this extreme case, all of the residuals would
be zero, but we would never expect this to be the correct model for the data.
[12] This matrix arises in linear algebra as the Vandermonde matrix: https://en.wikipedia.org/wiki/Vandermonde_matrix.
[13] See Montgomery, Peck, Vining Section 7.2.1.

Problems can occur even when p is much smaller than n. As an example, two
models were fit to n = 50 data points generated from the model

y = 3x2 + ε

with ε ∼ N (0, 1). The first was a cubic model. The second was a degree 20 model.
The first regression resulted in three significant regressors. Note that in these two
cases, orthogonal polynomials were used to maintain numerical stability.
Estimate Std. Error t value Pr(>|t|)
Intercept 4.0856 0.1498 27.276 < 2e-16 ***
Degree 1 25.0368 1.0592 23.638 < 2e-16 ***
Degree 2 4.6890 1.0592 4.427 5.84e-05 ***
The second regression resulted in five significant regressors.
Estimate Std. Error t value Pr(>|t|)
Intercept 4.08562 0.13253 30.829 < 2e-16 ***
Degree 1 25.03676 0.93709 26.717 < 2e-16 ***
Degree 2 4.68903 0.93709 5.004 2.51e-05 ***
Degree 15 -2.79288 0.93709 -2.980 0.00578 **
Degree 17 -2.67700 0.93709 -2.857 0.00784 **
Furthermore, the ANOVA table seems to indicate that the addition of these variables
may be useful.
Model 1: y ∼ poly(x, 3)
Model 2: y ∼ poly(x, 20)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 51.606
2 29 25.466 17 26.139 1.751 0.08954 .
However, plotting the data in Figure 2.5 with the two different regression lines
demonstrates the overfitting problem with the very high order second model.
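A sketch of the kind of code behind this comparison (the random seed, and hence the
exact numbers, will differ from the output shown above):

set.seed(7)
x <- runif(50, 0, 2)
y <- 3 * x^2 + rnorm(50)
fit3  <- lm(y ~ poly(x, 3))      # cubic model with orthogonal polynomials
fit20 <- lm(y ~ poly(x, 20))     # degree 20 model
summary(fit3); summary(fit20)    # coefficient tables as above
anova(fit3, fit20)               # partial F-test comparing the nested models
plot(x, y)
lines(sort(x), fitted(fit3)[order(x)])
lines(sort(x), fitted(fit20)[order(x)], lty = 2)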

Extrapolating
Figure 2.5 also indicates problems that can occur when extrapolating with polynomial
models. When high degrees are present, the best fit curve can change directions
quickly and even make impossible or illogical predictions.
Beyond that, even when the fitted polynomial models all trend in the same
general direction, polynomials of different degree will diverge quickly from one
another. Consider Figure 2.6 where the data was generated from

y = 2x + 3x3 + ε.

All four models displayed generated very significant F tests. However, each one will
give very different answers to predicting the value of y when x = 5.

Figure 2.5: A degree 3 and a degree 20 model fit to the same data.

Hierarchy
A hierarchical polynomial model is one such that if it contains a term of degree k,
then it will contain all terms of order i = 0, 1, . . . , k − 1 as well. In practice, it is not
strictly necessary to do this. However, doing so will maintain invariance to linear
shifts in the data.
Consider the simple linear regression model y = β0 + β1 x + ε. If we were to shift
the values of x by some constant a, then
y = β0 + β1 (x + a) + ε
= (β0 + aβ1 ) + β1 x + ε
= β0′ + β1 x + ε
and we still have the same model but with a modified intercept term.
Now consider the polynomial regression model y = β0 + β2 x^2 + ε. If we were to
similarly shift the values of x by some constant a, then

y = β0 + β2 (x + a)^2 + ε
  = (β0 + a^2 β2 ) + 2aβ2 x + β2 x^2 + ε
  = β0′ + β1′ x + β2 x^2 + ε.
Now, our model has a linear term, that is β1′ x, in it, which was not there before.
In general, for a degree p model, if x is shifted by some constant a, then

y = β0 + Σ_{i=1}^p βi x^i   ⟹   y = β0′ + Σ_{i=1}^p βi′ x^i .

Figure 2.6: Top: four different regressions fit to the data of increasing degree.
Middle: the problem of extrapolating with higher degree polynomial regressions.
Bottom: a confidence interval for the mean of the cubic model.
Thus, the model is invariant under linear translation.

2.3.2 Piecewise Polynomials [14]


While a single high degree polynomial can be fit very closely to the observed data,
there exist the already discussed problems of overfitting and subsequent extrapolation.
Hence, an alternative to capture the behaviour of highly nonlinear data is to apply a
piecewise polynomial model, which is often referred to as a spline model.
To begin, assume that the observed regressors take values in the interval [a, b].
Then, we partition the interval with k + 1 knots by a = t1 < t2 < . . . < tk+1 = b.
The general spline model of order p takes on the form
y = Σ_{j=1}^k β0,j 1[x ≥ tj ] + Σ_{i=1}^p Σ_{j=1}^k βi,j (x − tj )_+^i          (2.3.1)

where 1[x ≥ tj ] is the indicator function that takes on a value of 0 when x < tj and
a value of 1 otherwise, and where (x − tj )_+^i = (x − tj )^i 1[x ≥ tj ].

Remark 2.3.2. Equation 2.3.1 is equivalent to fitting a separate pth order polynomial
to the data in each interval [tj , tj+1 ]. While doing so does result in many parameters to
estimate, it will allow us to perform hypothesis tests for continuity and differentiability
of the data, which we will discuss in the next section.

Submodels of 2.3.1 have a collection of interesting properties. If the coefficients
β0,j for j ≥ 2 are set to 0, then we will fit a model that ensures continuity at the
knots. For example, the model

y = β0,1 + Σ_{j=1}^k β1,j (x − tj )_+

is a piecewise linear model with continuity at the knots. Conversely, a model of the
form

y = Σ_{j=1}^k β0,j 1[x ≥ tj ]

fits a piecewise constant model and can be used to look for change points in an
otherwise constant process.
In practice, spline models with only cubic terms are generally preferred as they
have a high enough order to ensure a good amount of smoothness–i.e. the curves are
twice continuously differentiable–but generally do not lead to overfitting of the data.
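Models of the form 2.3.1 can be fit with lm() by building the truncated power terms
(x − tj )_+^i by hand; a sketch of a piecewise linear and a cubic-only spline, including
the partial F-test for discontinuities discussed below:

set.seed(8)
x <- runif(50, 0, 2)
y <- 2 + 3 * x - 4 * x^5 + x^7 + rnorm(50, sd = 2)
knots <- seq(0, 1.6, by = 0.4)                           # t_1, ..., t_5
L  <- sapply(knots, function(t) pmax(x - t, 0))          # (x - t_j)_+ terms
C3 <- sapply(knots, function(t) pmax(x - t, 0)^3)        # (x - t_j)_+^3 terms
J  <- sapply(knots[-1], function(t) as.numeric(x >= t))  # jump (indicator) terms
fit_linear <- lm(y ~ L)             # continuous piecewise linear spline
fit_cubic  <- lm(y ~ C3)            # cubic-only spline
anova(fit_cubic, lm(y ~ C3 + J))    # partial F-test for discontinuities at the knots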
[14] See Montgomery, Peck, Vining Section 7.2.2.

An Example
Consider a model of the form
y = 2 + 3x − 4x5 + x7 + ε
with ε ∼ N (0, 4), n = 50 observations, and regressor x ∈ [0, 2]. A degree 4 and a
degree 7 polynomial regression were fit to the data, and the results are displayed in
the top plot of Figure 2.7. Both deviate quickly from the true curve outside of the
range of the data. The middle plot of Figure 2.7 fits a piecewise linear spline with 6
knots–i.e. k = 5. The model is of the form

y = β0,1 + β1,1 (x − 0.0)_+ + β1,2 (x − 0.4)_+ + . . . + β1,5 (x − 1.6)_+ .

The bottom plot of Figure 2.7 fits a spline with only piecewise cubic terms and a
spline with cubic and all lower order terms. The cubic-only model provides a very
reasonable approximation to the data, while the full spline model visibly overfits.
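As a rough illustration, the piecewise linear spline in the middle plot of Figure 2.7
can be fit with lm() by constructing the (x − tj)_+ terms by hand. The following
sketch assumes a simulated dataset matching the description above; the seed is
arbitrary, so the fitted coefficients will not match the figure exactly.

set.seed(1)                                    # assumed seed, for reproducibility
n <- 50
x <- sort(runif(n, 0, 2))
y <- 2 + 3*x - 4*x^5 + x^7 + rnorm(n, sd = 2)  # epsilon ~ N(0, 4)
knots <- seq(0, 1.6, by = 0.4)                 # knots used in the (x - t_j)_+ terms

basis <- sapply(knots, function(t) pmax(x - t, 0))   # (x - t_j)_+ columns
colnames(basis) <- paste0("knot", seq_along(knots))
fit <- lm(y ~ basis)                           # beta_{0,1} is the intercept
summary(fit)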

Hypothesis testing for spline models


It is useful to consider what the hypothesis tests mean in the context of spline models.
For example, consider fitting the piecewise constant model to some data:
y = Σ_{j=1}^{k} β0,j 1[x ≥ tj] .

The usual F-test will consider the hypotheses15


H0 : β0,2 = . . . = β0,k = 0    vs    H1 : ∃i ∈ {2, 3, . . . , k} s.t. β0,i ≠ 0.
This hypothesis is asking whether or not we believe the mean of the observations
changes as x increases.
We can also compare two different spline models with a partial F-test. For
example,
Model 1:  y = β0,1 + Σ_{j=1}^{k} β3,j (x − tj)^3_+
Model 2:  y = β0,1 + Σ_{j=2}^{k} β0,j 1[x ≥ tj] + Σ_{j=1}^{k} β3,j (x − tj)^3_+
The partial F-test between models 1 and 2 asks whether or not the addition of the
piecewise constant terms adds any explanatory power to our model. Equivalently, it
is testing for whether or not the model has any discontinuities in it. Similarly, first
order terms can be used to test for differentiability.
Instead of constructing a bigger model by adding polynomial terms of different
orders, it is also possible to increase the number of knots in the model. Similar to
all of the past examples, we will require that the new set of knots contains the old
set–that is, {t1 , . . . , tk } ⊂ {s1 , . . . , sm }. This requirement ensures that our models
are nested and thus can be compared in an ANOVA table.
15
Note that β0,1 is the overall intercept term in this model.
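As a sketch of how such a test can be run in R, the two models above can be fit with
lm() and compared with anova(), which performs the partial F-test; the objects x,
y, and knots are assumed to exist as in the running example.

cubic <- sapply(knots, function(t) pmax(x - t, 0)^3)        # (x - t_j)_+^3 terms
jumps <- sapply(knots[-1], function(t) as.numeric(x >= t))  # 1[x >= t_j] for j >= 2

m1 <- lm(y ~ cubic)            # Model 1: continuous cubic spline
m2 <- lm(y ~ cubic + jumps)    # Model 2: adds piecewise constant jumps
anova(m1, m2)                  # partial F-test for discontinuities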

Figure 2.7: Top: A degree 4 and a degree 7 polynomial fit to data generated by
y = 2 + 3x − 4x⁵ + x⁷ + ε. Middle: A linear spline model. Bottom: A spline model
with only cubic terms and another including all lower order terms.
B-Splines
In what we considered above, the spline models were comprised of polynomials
with supports of the form [tj , ∞) for knots t1 < . . . < tk . However, there are many
different families of splines with different desirable properties. Some such families are
the B-splines, Non-uniform rational B-splines (NURBS), box splines, Bézier splines,
and many others. Here we will briefly consider the family of B-splines due to their
simplicity and popularity. Note that in the spline literature, sometimes the knots
are referred to as control points.
The ultimate goal of the B-splines is to construct a polynomial basis where
the constituent polynomials have local support, i.e. each basis function is nonzero
only on a few adjacent knot intervals. Specifically, for an interval [a, b]
and knots a = t1 < t2 < . . . < tk < tk+1 = b, the constant polynomials will have
support on two knots such as [t1 , t2 ], the linear terms on three knots such as [t1 , t3 ],
and so on up to degree p terms, which require p + 2 knots.
B-splines can be defined recursively starting with the constant, or degree 0, terms:
Bj,0 (x) = 1x∈[tj ,tj+1 ] ,
which takes a value of 1 on the interval [tj , tj+1 ] and is 0 elsewhere. From here, the
higher order terms can be written as

Bj,i (x) = ( (x − tj) / (tj+i − tj) ) Bj,i−1 (x) + ( (tj+i+1 − x) / (tj+i+1 − tj+1) ) Bj+1,i−1 (x).
For example, with knots {0, 1, 2, 3}, we have

Constant:  Bj,0 (x) = 1_{x∈[j,j+1]},   j = 0, 1, 2

Linear:    Bj,1 (x) = { (x − j),        x ∈ [j, j + 1]
                      { (j + 2 − x),    x ∈ [j + 1, j + 2],   j = 0, 1

Quadratic: B0,2 (x) = { x²/2,                 x ∈ [0, 1]
                      { (−2x² + 6x − 3)/2,    x ∈ [1, 2]
                      { (3 − x)²/2,           x ∈ [2, 3].

The linear and quadratic splines are displayed in Figure 2.8.


Just as we fit a spline model above, we could use the linear regression tools to fit
a B-spline model of the form

y = Σ_{i=0}^{p} Σ_{j=1}^{k−i} βi,j Bj,i (x).

Here, we require that k > p since a pth degree B-spline requires p + 2 knots. The
total number of terms in the regression, which is the number of parameters βi,j to
estimate, is

k + (k − 1) + . . . + (k − p).
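The recursion above can be coded directly and the resulting basis fed to lm(); the
sketch below assumes the x and y from the running example and is only meant to
illustrate the construction (the splines package's bs() function provides a production
implementation of B-spline bases).

bspline <- function(x, knots, j, i) {
  # B_{j,i}(x) via the recursion above; knots are t_1 < ... < t_{k+1}
  if (i == 0) return(as.numeric(x >= knots[j] & x <= knots[j + 1]))
  w1 <- (x - knots[j]) / (knots[j + i] - knots[j])
  w2 <- (knots[j + i + 1] - x) / (knots[j + i + 1] - knots[j + 1])
  w1 * bspline(x, knots, j, i - 1) + w2 * bspline(x, knots, j + 1, i - 1)
}

knots <- seq(0, 2, length.out = 6)   # k = 5
p <- 3                               # up to cubic terms
B <- do.call(cbind, lapply(0:p, function(i)
  sapply(seq_len(length(knots) - 1 - i), function(j) bspline(x, knots, j, i))))
fit <- lm(y ~ B - 1)                 # the degree 0 pieces already act as an intercept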

35
1.0
0.8
0.6
y

0.4
0.2
0.0

0.0 0.5 1.0 1.5 2.0 2.5 3.0

x
Figure 2.8: An example of linear and quadratic B-splines.

2.3.3 Interacting Regressors16


Thus far in this section, we have considered only polynomial models with a single
regressor. However, it is certainly possible to fit a polynomial model for more than
one regressor such as

y = β0 + β1 x1 + β2 x2 + β11 x1² + β12 x1 x2 + β22 x2² + ε.

Here we have linear and quadratic terms for x1 and x2 as well as an interaction term
β12 x1 x2 .
Fitting such models to the data follows from what we did for single variable
polynomials. In this case, the number of interaction terms can grow quite large in
practice. With k regressors, x1 , . . . , xk , there will be (k choose p) interaction terms of
degree p, assuming p < k. This leads to the topic of Response Surface Methodology,17 which
is a subtopic of the field of experimental design.
We will not consider this topic further in these notes. However, it is worth noting
that in R, it is possible to fit a linear regression with interaction terms such as

y = β0 + β1 x1 + β2 x2 + β3 x3 + β12 x1 x2 + β13 x1 x3 + β23 x2 x3 + β123 x1 x2 x3 + ε

with the simple syntax lm( y∼x1*x2*x3 ) where the symbol * replaces the usual +
from before.
16
See Montgomery, Peck, Vining Section 7.4
17
https://en.wikipedia.org/wiki/Response_surface_methodology
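For concreteness, a sketch of the corresponding R calls is below; the data frame dat
and its column names are hypothetical.

fit_full <- lm(y ~ x1 * x2 * x3, data = dat)     # main effects plus all interactions
fit_int  <- lm(y ~ x1 + x2 + x1:x2, data = dat)  # ':' adds only the x1*x2 product term
fit_quad <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = dat)  # quadratic surface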

2.4 Influence and Leverage18
The overall intuition for this section is that each observation does not have an equal
influence on the estimation of β̂. If a given observed regressor x lies far from the
other observed values, it can have a strong effect on the least squares regression line.
The goal of this section is to identify such points or subsets of points that have a
large influence on the regression.
If we use the R command lm() to fit a linear model to some data, then we
can use the command influence.measures() to compute an array of diagnostic
metrics for each observation to test its influence on the regression. The function
influence.measures() computes DFBETAS, DFFITS, covariance ratios, Cook’s
distances and the diagonal elements of the so-called hat matrix. We will look at each
of these in the following subsections.

Remark 2.4.1 (Warning). You may notice that the word “heuristic” appears often
in the following subsections when it comes to identifying observations with significant
influence. Ultimately, these are rough guidelines based on the intuition of past
statisticians and should not be taken as strict rules.

2.4.1 The Hat Matrix19


The projection or “hat” matrix, P = X(X T X)−1 X T , directly measures the influence
of one point on another. This is because under the usual model assumptions, the
fitted values have a covariance matrix var(Ŷ ) = σ 2 P . Hence, the i, jth entry in P is
a measure of the covariance between the fitted values Ŷi and Ŷj .
The function influence.measures() reports the diagonal entries of the matrix
P . Heuristically, any entries that are much larger than the rest will have a strong
influence on the regression. More precisely, we know that for a linear model

y = β0 + β1 x1 + . . . + βp xp ,

we have that rank(P ) = p + 1. Hence, trace(P ) = p + 1 where the trace of a matrix


is the sum of the diagonal entries. Thus, as P is an n × n matrix, we roughly expect
the diagonal entries to be approximately (p + 1)/n. Large deviations from this
value should be investigated. For example, Montgomery, Peck, & Vining recommend
looking at observations with Pi,i > 2(p + 1)/n.

Remark 2.4.2. The ith diagonal entry Pi,i is referred to as the leverage of the ith
observation in Montgomery, Peck, & Vining. However, in R, leverage is Pi,i /(1−Pi,i ).
This is sometimes referred to as the leverage factor.
18
See Montgomery, Peck, Vining Chapter 6
19
See Montgomery, Peck, Vining Section 6.2
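A minimal sketch of extracting the Pi,i and applying the 2(p + 1)/n heuristic,
assuming fit is an existing lm() object:

h <- hatvalues(fit)             # diagonal entries P_{i,i}
p <- length(coef(fit)) - 1      # number of regressors, excluding the intercept
n <- length(h)
which(h > 2 * (p + 1) / n)      # candidate high leverage observations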

2.4.2 Cook’s D20
Cook’s D or distance computes the distance between the vector of estimated pa-
rameters on all n data points, β̂, and the vector of estimated parameters on n − 1
data points, β̂(i) where the ith observation has been removed. Intuitively, if the ith
observation has a lot of influence on the estimation of β, then the distance between
β̂ and β̂(i) should be large.
For a linear model with p + 1 parameters and a sample size of n observations,
the usual form for Cook’s D is
Di = [ (β̂(i) − β̂)ᵀ XᵀX (β̂(i) − β̂)/(p + 1) ] / [ SSres /(n − p − 1) ].

This is very similar to the confidence ellipsoid from Section 1.3. However, this is
not a usual F statistic, so we do not compute a p-value as we have done before.
Instead, some heuristics are used to determine what a large value is. Some authors
suggest looking for Di greater than 1 or greater than 4/n.21
Cook’s D can be written in different forms. One, in terms of the diagonal entries
of P , is
Di = ( si² / (p + 1) ) · ( Pi,i / (1 − Pi,i) ),
where si is the ith studentized residual. Another form of this measure compares the
distance between the usual fitted values on all of the data Ŷ = X β̂ and the fitted
values based on all but the ith observation, Ŷ(i) = X β̂(i) . That is,
Di = [ (Ŷ(i) − Ŷ)ᵀ (Ŷ(i) − Ŷ)/(p + 1) ] / [ SSres /(n − p − 1) ].

Note that like Ŷ , the vector Ŷ(i) ∈ Rn . The ith entry in the vector Ŷ(i) is the
predicted value of y given xi .
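A short sketch comparing R's built-in helper with the hat-matrix form above, assuming
fit is an lm() object and that the studentized residual si corresponds to rstandard():

D1 <- cooks.distance(fit)
s  <- rstandard(fit)                        # internally studentized residuals
h  <- hatvalues(fit)
D2 <- s^2 / length(coef(fit)) * h / (1 - h) # s_i^2/(p + 1) * P_ii/(1 - P_ii)
all.equal(unname(D1), unname(D2))           # the two forms agree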

2.4.3 DFBETAS22
The intuition behind DFBETAS is similar to that for Cook’s D. In this case, we
consider the normalized difference between β̂ and β̂(i) . What results is an n × (p + 1)
matrix whose i, jth entry is

DFBETASi,j = ( β̂j − β̂(i),j ) / √( (XᵀX)⁻¹_{j,j} SSres(i) /(n − p − 2) )
20
See Montgomery, Peck, Vining Section 6.3
21
https://en.wikipedia.org/wiki/Cook%27s_distance
22
See Montgomery, Peck, Vining Section 6.4

where SSres(i) is the sum of the squared residuals for the model fit after removing the
ith data point and where (XᵀX)⁻¹_{j,j} is the jth diagonal entry of the matrix (XᵀX)⁻¹.
The recommended heuristic is to consider the ith observation as an influential point
if the i, jth entry of DFBETAS has a magnitude greater than 2/√n.

2.4.4 DFFITS23
The DFFITS value is very similar to the previously discussed DFBETAS. In this
case, we are concerned with how much the fitted values change when the ith
observation is removed. Explicitly, the ith entry of DFFITS ∈ Rⁿ is

DFFITSi = ( Ŷi − Ŷ(i),i ) / √( Pi,i SSres(i) /(n − p − 2) ).

The claim is that DFFITS is affected by both leverage and prediction error. The
heuristic is to investigate any observation with DFFITS greater in magnitude than
2√((p + 1)/n).

2.4.5 Covariance Ratios24


The covariance ratio measures whether the precision of the model increases or decreases
when the ith observation is included. This measure is based on the idea that a small
value for det[(XᵀX)⁻¹ SSres] indicates high precision in our model. Hence, the covariance
ratio considers a ratio of determinants

covratioi = det[ (X(i)ᵀ X(i))⁻¹ SSres(i) /(n − p − 2) ] / det[ (XᵀX)⁻¹ SSres /(n − p − 1) ]
          = ( [SSres(i) /(n − p − 2)] / [SSres /(n − p − 1)] )^{p+1} · ( 1 / (1 − Pi,i) ).
If the value is greater than 1, then inclusion of the ith point has increased the
model’s precision. If it is less than 1, then the precision has decreased. The suggested
heuristic threshold for this measure of influence is when the value is greater than
1 + 3(p + 1)/n or less than 1 − 3(p + 1)/n. Though, it is noted that this is only valid
for large enough sample sizes.

2.4.6 Influence Measures: An Example


To test these different measures, we create a dataset of n = 50 observations from the
model
y = 3 + 2x + ε
23
See Montgomery, Peck, Vining Section 6.4
24
See Montgomery, Peck, Vining Section 6.5

where ε ∼ N (0, 1) and x ∈ [0, 2]. To this dataset, we add 2 anomalous points at
(4, 8) and at (1, 0). Thus, we fit a simple linear regression to the original 50 data
points and also to the new set of 52 data points, resulting in the red and blue lines,
respectively, in Figure 2.9.

Figure 2.9: A simple regression in red fit to the original 50 data points, and a simple
regression in blue fit to the original 50 and 2 anomalous data points.
We can use the R function influence.measures() to compute a matrix contain-
ing the DFBETAS, DFFITS, covariance ratios, Cook’s D, and leverage for each data
point. Applying the recommended thresholds in the previous sections results in the
following extreme points, which are labelled in Figure 2.9:

Measure    DFBETAS (β0)    DFBETAS (β1)    DFFITS
Extremes   10, 51, 52      37, 51          37, 51, 52

Measure    Cov Ratio >     Cov Ratio <     Cook's D      Leverage
Extremes   52              51              37, 51, 52    51

The interpretation of the table is that points. . .

• 10, 51, 52 have a strong influence on the estimation of β̂0 ;

• 37, 51 have a strong influence on the estimation of β̂1 ;

• 37, 51, 52 have a strong influence on the estimation of Ŷ ;

• 52 significantly increases the precision of the model;

• 51 significantly decreases the precision of the model;

• 37, 51, 52 have large Cook’s D values;

• 51 has a large leverage value.


Note that points 10 and 37 are just part of the randomly generated data while
points 51 and 52 were purposefully added to be anomalous. Hence, just because an
observation is beyond one of these thresholds does not necessarily imply that it lies
outside of the model.
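A sketch reproducing this analysis is below; the seed and the exact x values are
assumptions, so the flagged observations may differ slightly from the table above.

set.seed(378)                                 # assumed seed
x <- runif(50, 0, 2)
y <- 3 + 2 * x + rnorm(50)
x <- c(x, 4, 1); y <- c(y, 8, 0)              # anomalous points 51 and 52

fit <- lm(y ~ x)
im  <- influence.measures(fit)
head(im$infmat)                               # DFBETAS, DFFITS, covratio, Cook's D, hat
which(apply(im$is.inf, 1, any))               # observations flagged by any heuristic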

2.5 Weighted Least Squares


A key assumption of the Gauss-Markov theorem is that Var(εi) = σ² for all i = 1, . . . , n.
What happens when Var(εi) = σi²–i.e. when the variance can differ for each observation?
Normalizing the errors εi can be done by

εi /σi = (Yi − Xi,· β)/σi ∼ N (0, 1).

In Chapter 1, we computed the least squares estimator as the vector β̂ such that
β̂ = arg min_{β̃∈R^{p+1}} Σ_{i=1}^{n} (Yi − Xi,· β̃)².

Now, we will solve the slightly modified equation

β̂ = arg min_{β̃∈R^{p+1}} Σ_{i=1}^{n} (Yi − Xi,· β̃)² / σi².

In this setting, dividing by σi2 is the “weight” that gives this method the name
weighted least squares.
Proceeding as in Chapter 1, we take the derivative with respect to the jth coordinate β̃j:

∂/∂β̃j Σ_{i=1}^{n} (Yi − Xi,· β̃)²/σi² = −2 Σ_{i=1}^{n} (Yi − Xi,· β̃) Xi,j /σi²
                                      = −2 Σ_{i=1}^{n} Yi Xi,j /σi² + 2 Σ_{i=1}^{n} Xi,· β̃ Xi,j /σi²
                                      = −2 Σ_{i=1}^{n} Y′i X′i,j + 2 Σ_{i=1}^{n} X′i,· β̃ X′i,j ,

where Y′i = Yi /σi and X′i,j = Xi,j /σi. Setting this to zero for each j, the least squares
estimator is as before,

β̂ = (X′ᵀ X′)⁻¹ X′ᵀ Y′ = (Xᵀ W X)⁻¹ Xᵀ W Y,

where W ∈ R^{n×n} is the diagonal matrix with entries Wi,i = 1/σi².
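In R, weighted least squares is available directly through the weights argument of
lm(), which takes the weights Wi,i = 1/σi². A minimal sketch, assuming sigma2 is a
vector of known per-observation variances:

w   <- 1 / sigma2
fit <- lm(y ~ x, weights = w)    # minimizes sum_i w_i (y_i - x_i beta)^2
coef(fit)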


In practice, we do not know the values for σi². Methods to find a good matrix of
weights W exist, such as Iteratively Reweighted Least Squares,25 which can, for
example, be used to compute the estimator

β̂ = arg min_{β̃∈R^{p+1}} Σ_{i=1}^{n} |Yi − Xi,· β̃|^q

for some q ∈ [1, ∞).

25
https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares

Chapter 3

Model Building

Introduction
The following sections will discuss various topics regarding constructing a good
representative regression model for your data. Three main topics will be considered.
First, multicollinearity deals with the problem of linear relationships between
your regressors. We already know that we require the columns of the design matrix
to be linearly independent in order to solve for the least squares estimate. However, it
is possible to have near dependencies among the columns. This can lead to numerical
stability issues and unnecessary redundancy among the regressors.
Second, there are many different variable selection techniques in existence. Given
a large number of regressors that can be included in a model, the question is which
should and which should not be included? We will discuss various techniques such
as forward and backward selection as well as different tools for comparing models.
Third, penalized regression will be discussed. This section introduces two modern
and quite powerful approaches to linear regression: ridge regression from the 1970’s
and LASSO from the 1990’s. Both arise from modifying how we estimate the
parameter vector β̂. Up until now, we have chosen β̂ to minimize the sum of the
squared error. Now, we will add a penalty term to this optimization problem, which
will encourage choices of β̂ with small-in-magnitude or just zero entries.

3.1 Multicollinearity1
The concept of multicollinearity is intuitively simple. Say we have a model of the
form
y = β0 + β1 x1 + β2 x2 + ε.
1
See Montgomery, Peck, Vining Section 9.3

This results in a design matrix of the form
 
    X = ( 1   x1,1   x1,2 )
        ( ⋮     ⋮      ⋮  )
        ( 1   xn,1   xn,2 ),

i.e. the n × 3 matrix whose ith row is (1, xi,1 , xi,2).

Then, we can consider a new model of the form

x2 = α0 + α1 x1 + ε.

If this simple regression has a strong fit,2 then the addition of the regressor x2 to the
original model is unnecessary as almost all of the explanatory information provided
by x2 with regards to predicting y is already provided by x1 . Hence, the inclusion of
x2 in our model is superfluous.
Taking a more mathematical approach, it can be shown that such near lin-
ear dependencies lead to a very high variance for the least squares estimator β̂.
Furthermore, the magnitude of the vector is much larger than it should be.
Assuming that the errors have a covariance matrix Var (ε) = σ 2 In , then we have
from before that
 
Var(β̂) = Var( (XᵀX)⁻¹ Xᵀ Y ) = (XᵀX)⁻¹ Xᵀ Var(Y) X (XᵀX)⁻¹ = σ² (XᵀX)⁻¹.

With some effort, it can be shown that the diagonal entries of the matrix (X T X)−1
are equal to (1 − R02 )−1 , . . . , (1 − Rp2 )−1 where Rj2 is the coefficient of determination
for the model

xj = α0 + α1 x1 + . . . + αj−1 xj−1 + αj+1 xj+1 + . . . + αp xp + ε,

which is trying to predict the jth regressor by the other p − 1 regressors. If the
remaining regressors are good predictors for xj , then the value Rj2 will be close to 1.
Hence,
Var(β̂j) = σ² / (1 − Rj²)

will be very large.
Furthermore, this implies that the expected Euclidean distance between β̂ and β
will be quite large as well. Indeed, we have
E[ (β̂ − β)ᵀ(β̂ − β) ] = Σ_{i=0}^{p} E(β̂i − βi)² = Σ_{i=0}^{p} Var(β̂i) = σ² tr( (XᵀX)⁻¹ )

2
A significant F-test or R² value, for example.
where tr (·) denotes the trace of a matrix–i.e. the sum of the diagonal entries. Hence,
if at least one of the Rj2 is close to 1, then the expected distance from our estimator
to the true β will be quite large.
The trace of a matrix is also equal to the sum of its eigenvalues. Hence, if we
denote the eigenvalues of X T X by λ1 , . . . , λp+1 , then
tr( (XᵀX)⁻¹ ) = Σ_{i=1}^{p+1} λi⁻¹.

Hence, an equivalent condition to check for multicollinearity is the presence of


eigenvalues of X T X very close to zero, which would make the above sum very large.

3.1.1 Identifying Multicollinearity3


To identify the presence of multicollinearity in our linear regression, there are many
measures to consider.
We already established that near linear dependencies will result in large values for
the diagonal entries of (X T X)−1 . These values are known as the Variance Inflation
Factors and sometimes written as VIFi = (1 − Ri2 )−1 .
An interesting interpretation of the VIF is in terms of confidence intervals. Recall
that for βj , we can construct a 1 − α confidence interval as
−tα/2,n−p−1 √( (XᵀX)⁻¹_{j,j} SSres /(n − p − 1) ) ≤ βj − β̂j ≤ tα/2,n−p−1 √( (XᵀX)⁻¹_{j,j} SSres /(n − p − 1) ).

If all i ≠ j regressors are orthogonal to the jth regressor, then Rj² = 0 and the
term (XᵀX)⁻¹_{j,j} = 1. Under multicollinearity, (XᵀX)⁻¹_{j,j} ≫ 1. Hence, the confidence
interval is expanded by a factor of √( (XᵀX)⁻¹_{j,j} ) when the regressors are not orthogonal.
We can alternatively examine the eigenvalues of the matrix X T X. Recall that
finding the least squares estimator is equivalent to solving a system of linear equations
of the form
X T X β̂ = X T Y.
To measure the stability of a solution to a system of equations under small perturbations,
a term referred to as the condition number is used.4 It is

κ = λmax /λmin

where λmax and λmin are the maximal and minimal eigenvalues, respectively. Ac-
cording to Montgomery, Peck, & Vining, values of κ less than 100 are not significant
whereas values greater than 1000 indicate severe multicollinearity.
3
See Montgomery, Peck, Vining Sections 9.4.2 & 9.4.3
4
This term arises in more generality in numerical analysis https://en.wikipedia.org/wiki/
Condition_number

If the minimal eigenvalue is very small, we can use the corresponding eigenvector
to understand the nature of the linear dependency. That is, consider the eigenvector
u = (u0 , u1 , . . . , up ) for the matrix X T X corresponding to the eigenvalue λmin . Recall
that this implies that
(X T X)u = λmin u ≈ 0,
which is approximately zero because λmin is close to zero. Hence, for regressors
1, x1 , . . . , xp ,
u0 + u1 x1 + . . . + up xp ≈ 0.
Thus, we can use the eigenvectors with small eigenvalues to get a linear relationship
between the regressors.

Remark 3.1.1. If you are familiar with the concept of the Singular Value Decom-
position, then you could alternatively consider the ratio between the maximal and
minimal singular values of the design matrix X.5 Furthermore, you can also analyze
the singular vectors instead of the eigenvectors.
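A minimal sketch of these diagnostics, assuming X is a matrix whose columns are the
regressors (no intercept column); the VIFs appear on the diagonal of the inverse
correlation matrix, and car::vif() offers the same computation from a fitted model.

R_mat <- cor(X)
diag(solve(R_mat))                    # VIF_j = 1/(1 - R_j^2)
ev <- eigen(crossprod(scale(X)))      # eigen-decomposition of (scaled) X^T X
max(ev$values) / min(ev$values)       # condition number kappa
ev$vectors[, which.min(ev$values)]    # near-null direction: the linear dependency
# kappa(X, exact = TRUE) gives the singular-value ratio of Remark 3.1.1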

3.1.2 Correcting Multicollinearity


Ideally, we would design a model such that the columns of the design matrix X
are linearly independent. Of course, in practise, this is often not achievable. When
confronted with real world data, there are still some options available.
First, the regressors can be respecified. That is, if x1 and x2 are near linearly
related, then instead of including both terms in the model, we can include a single
combination term like x1 x2 or (x1 + x2 )/2. Second, one of the two variables can be
dropped from the model, which will be discussed below when we consider variable
selection in Section 3.2.
More sophisticated solutions to this problem include penalized regression tech-
niques, which we will discuss in Section 3.3. Also, principal components regression6
and partial least squares are two other methods that can be applied to deal with
multicollinear data.

Remark 3.1.2. A common thread among all of these alternatives is that they result
in a biased estimate for β unlike the usual least squares estimator. Often in statistics,
we begin with unbiased estimators, but can often achieve a better estimator by adding
a small amount of bias. This is the so-called bias-variance tradeoff.7
5
https://en.wikipedia.org/wiki/Singular-value_decomposition
6
See Montgomery, Peck, Vining Sections 9.5.4 for more on PC regression
7
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

3.2 Variable Selection8
In general, if we have p regressors, we may want to build a model consisting only
of the best regressors for modelling the response variable. In some sense, we could
compare all possible subset models. However, there are many issues with this, which
we will address in the following subsections. First, what are the effects of removing
regressors from your model? Second, how do we compare models if they are not
nested? Third, there are 2^p possible models to consider. Exhaustively fitting and
comparing all of these models may be computationally impractical or impossible.
Hence, how do we find a good subset of the regressors?

3.2.1 Subset Models9


What happens to the model when we remove some regressors? Assume we have a
sample of n observations and p + q regressors and want to remove q of them. The
full model would be

y = β0 + β1 x1 + . . . + βp+q xp+q + ε.

This can be written in terms of the design matrix and partitioned over the two sets
of regressors as

Y = Xβ + ε
= Xp βp + Xq βq + ε

where Xp ∈ Rn×p , Xq ∈ Rn×q , βp ∈ Rp , βq ∈ Rq , and


 
X = [ Xp  Xq ],    β = (βpᵀ, βqᵀ)ᵀ.

We have two models to compare. The first is the full model, Y = Xβ + ε,


where we denote the least squares estimator as β̂ = (XᵀX)⁻¹ Xᵀ Y as usual with
components β̂ = (β̂pᵀ, β̂qᵀ)ᵀ.
The second is the reduced model obtained by deleting q regressors: Y = Xp βp + ε.
The least squares estimator for this model will be denoted as β̃p = (XpT Xp )−1 XpT Y
8
See Montgomery, Peck, Vining Section 10.1.3
9
See Montgomery, Peck, Vining Section 10.1.2

Bias may increase
The first concern with the reduced model is that the estimator β̃p can be biased as

Eβ̃p = (XpᵀXp)⁻¹ Xpᵀ EY = (XpᵀXp)⁻¹ Xpᵀ (Xp βp + Xq βq) = βp + (XpᵀXp)⁻¹ Xpᵀ Xq βq = βp + Aβq.
Hence, our reduced estimator is only unbiased in two cases. Case one is when
A = 0, which occurs if the p regressors and q regressors are orthogonal resulting
in XpT Xq = 0. Case two is when βq = 0, which occurs if those regressors have no
effect on the given response. If neither of these cases occurs, then Aβq ≠ 0 and
represents the bias in our estimator β̃p . Note that the matrix A is referred to as the
alias matrix.

Variance may decrease


While deleting regressors can result in the addition of bias to our estimate, it can
also result in a reduction in the variance of our estimator. Namely,

Var(β̃p) = σ² (XpᵀXp)⁻¹,  while
Var(β̂p) = σ² (XpᵀXp)⁻¹ + σ² A[Xqᵀ(I − Pp)Xq]⁻¹ Aᵀ,

where Pp = Xp(XpᵀXp)⁻¹Xpᵀ.10 The matrix A[Xqᵀ(I − Pp)Xq]⁻¹ Aᵀ is symmetric
positive semi-definite, so the variance of β̂p can only be larger than that of β̃p.

MSE may or may not improve


Generally in statistics, when deciding whether or not the increase in the bias is
worth the decrease in the variance, we consider the change in the mean squared error
(MSE) of our estimate. This is,
MSE(β̃p) = E[ (β̃p − βp)(β̃p − βp)ᵀ ]
         = E[ (β̃p − Eβ̃p + Eβ̃p − βp)(β̃p − Eβ̃p + Eβ̃p − βp)ᵀ ]
         = Var(β̃p) + bias(β̃p) bias(β̃p)ᵀ
         = σ² (XpᵀXp)⁻¹ + A βq βqᵀ Aᵀ.

For the full model,

MSE(β̂p) = Var(β̂p) = σ² (XpᵀXp)⁻¹ + σ² A[Xqᵀ(I − Pp)Xq]⁻¹ Aᵀ.

If MSE(β̂p) − MSE(β̃p) is positive semi-definite, then the mean squared error has
decreased upon the removal of the regressors in Xq.
10
This expression can be derived via the formula for inverting a block matrix https://en.
wikipedia.org/wiki/Invertible_matrix#Blockwise_inversion

3.2.2 Model Comparison11
We have already compared models in Chapter 1 with the partial F-test. However,
for that test to make sense, we require the models to be nested–i.e. the larger model
must contain all of the parameters of the smaller model. But, given a model

y = β0 + β1 x1 + . . . + βp xp + ε,

we may want to compare two different subset models that are not nested. Hence, we
have some different measures to consider.
Note that ideally, we would compare all possible subset models. However,
given p regressors, there are 2^p different models to consider, which will often be
computationally infeasible. Hence, in Section 3.2.3, we will consider two approaches
to model selection that avoid this combinatorial problem.
Remark 3.2.1. To avoid confusion and awkward notation, assume that all subset
models will always contain the intercept term β0.

Residual Sum of Squares


For two subset models with p1 and p2 regressors, respectively, with p1 < p and
p2 < p, we can compare the mean residual sum of squares for each
SSres(p1) / (n − p1 − 1)    vs    SSres(p2) / (n − p2 − 1)
and choose the model with the smaller value.
We know from before that the mean of the residual sum of squares for the full
model, SSres /(n − p − 1), is an unbiased estimator for σ 2 . Similar to the calculations
in the previous section, we can show that
   
E[ SSres(p1) / (n − p1 − 1) ] ≥ σ²   and   E[ SSres(p2) / (n − p2 − 1) ] ≥ σ²,

that is, these estimators for subset models are upwardly biased.

Mallows’ Cp
We can also compare different models by computing Mallows’ Cp . The goal of this
value is to choose the model that minimizes the mean squared prediction error, which
is

MSPE = Σ_{i=1}^{n} E(ỹi − Eyi)² / σ²

where ỹi is the ith fitted value of the submodel and Eyi is the ith fitted value of the
true model. Furthermore, let ŷi be the ith fitted value for the full model. This is the
11
See Montgomery, Peck, Vining Section 10.1.3

expected squared difference between what the submodel predicts and what the real
value is. As usual with mean squared errors in statistics, we rewrite this in terms of
the variance plus the squared bias, which is
MSPE = (1/σ²) Σ_{i=1}^{n} E[ (ỹi − Eỹi + Eỹi − Eyi)² ]
      = (1/σ²) Σ_{i=1}^{n} [ E(ỹi − Eỹi)² + (Eỹi − Eyi)² ]
      = (1/σ²) Σ_{i=1}^{n} [ Var(ỹi) + bias(ỹi)² ].

Recall that the variance of the fitted values for the full model is Var (ŷ) = σ 2 Px
where Px = X(X T X)−1 X T . For a submodel with p1 < p regressors and design matrix
Xp1 , we get the similar Var (ỹ) = σ 2 Xp1 (XpT1 Xp1 )−1 XpT1 . As Xp1 (XpT1 Xp1 )−1 XpT1 is a
rank p1 + 1 projection matrix, we have that
Σ_{i=1}^{n} Var(ỹi) = σ² tr( Xp1 (Xp1ᵀ Xp1)⁻¹ Xp1ᵀ ) = σ² (p1 + 1).

For the bias term, consider the expected residual sum of squares for the submodel:

E(SSres(p1)) = E Σ_{i=1}^{n} (yi − ỹi)²
             = E Σ_{i=1}^{n} (yi − Eỹi + Eỹi − Eyi + Eyi − ỹi)²
             = Σ_{i=1}^{n} [ Var(r̃i) + (Eỹi − Eyi)² ]
             = (n − p1 − 1)σ² + Σ_{i=1}^{n} bias(ỹi)²,

where r̃i = yi − ỹi denotes the ith residual of the submodel.

Hence, rearranging the terms above gives


Σ_{i=1}^{n} bias(ỹi)² = E(SSres(p1)) − (n − p1 − 1)σ².

Combining the bias and the variance terms derived above results in Mallows’ Cp
statistic for a submodel with p1 < p regressors:
Cp1 = E(SSres(p1)) / σ² − n + 2p1 + 2 ≈ SSres(p1) / [ SSres /(n − p − 1) ] − n + 2p1 + 2.

Here, we estimate E(SSres(p1)) by SSres(p1) and estimate σ² by SSres /(n − p − 1).

Remark 3.2.2. Note that if we compute Mallows’ Cp for the full model, we get
Cp = SSres / [ SSres /(n − p − 1) ] − n + 2p + 2 = p + 1.
Hence, Mallows’ Cp in this case is just the number of parameters in the model. In
general, we want to find submodels with Cp value smaller than p + 1.
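A sketch of computing Cp by hand, assuming full and sub are lm() fits of the full
model and a candidate submodel on the same response:

n      <- length(resid(full))
sigma2 <- sum(resid(full)^2) / df.residual(full)   # SS_res/(n - p - 1)
p1     <- length(coef(sub)) - 1
Cp     <- sum(resid(sub)^2) / sigma2 - n + 2 * p1 + 2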

Information Criteria
Information criteria are concerned with quantifying the amount of information in a
model. With such a measure, we can choose a model that optimizes this measurement.
A main requirement for these methods is that the response y is the same. Hence,
we should not use the measures below when comparing transformed models–e.g.
different linearized models–without the necessary modifications.
The first such measure is the Akaike Information Criterion or AIC, which is a
measure of the entropy of a model. Its general definition is

AIC = −2 log(Likelihood) + 2(# parameters)

where p is the number of parameters in the model. This can be thought of as a


measurement of how much information is lost when modelling complex data with
a p parameter model. Hence, the model with the minimal AIC will be optimal in
some sense.
In our least squares regression case with normally distributed errors,

AIC = n log(SSres /n) + 2(p + 1)

where p + 1 is for the p regressors and 1 intercept term. Thus, adding more regressors
will decrease SSres but will increase p. The goal is to find a model with the minimal
AIC. This can be shown to give the same ordering as Mallows’ Cp when the errors
are normally distributed.
The second such measure is the closely related Bayesian Information Criterion or
BIC, which, in general, is

BIC = −2 log(Likelihood) + (# parameters) log n.

In the linear regression setting with normally distributed errors,

BIC = n log(SSres /n) + (p + 1) log n.

Remark 3.2.3. Using AIC versus using BIC for model selection can sometimes
result in different final choices. In some cases, one may be preferred, but often both
can be tried and discrepancies, if they exist, can be reported.
There are also other information criteria that are not as common in practice,
such as the Deviance Information Criterion (DIC) and the Focused Information
Criterion (FIC).
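In R, extractAIC()–the criterion used internally by step()–matches the
n log(SSres/n) + 2(p + 1) form above, while AIC() and BIC() include the constant
terms of the full Gaussian log-likelihood; for linear models on the same data the
orderings of candidate models are unchanged. A sketch, assuming fit is an lm() object:

n <- length(resid(fit))
extractAIC(fit)               # c(edf, AIC) with edf = p + 1
extractAIC(fit, k = log(n))   # the BIC analogue
AIC(fit); BIC(fit)            # same orderings, different constants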

3.2.3 Forward and Backward Selection12
Ideally, we choose a measure for model selection from the previous section and then
compare all possible models. However, for p possible regressors, this will result in 2^p
models to check, which may be computationally infeasible. Hence, there are iterative
approaches that can be effective.
Forward selection is the process of starting with the constant model

y = β0 + ε

and choosing the best of the p regressors with respect to the model selection criterion.
This gives
y = β0 + β1 x1 + ε.
This process will continue to add terms to the model as long as it results in an
improvement in the criterion. For example, computing the AIC at each step.
Backwards selection is the reverse of forward selection. In this case, the algorithm
begins with the full model,

y = β0 + β1 x1 + . . . + βp xp + ε,

and iteratively removes the regressor that gives the biggest improvement in the model
selection criterion. If the best choice is to remove no regressors, then the process
terminates.
A third option is stepwise selection, which incorporates both forward and back-
ward steps. In this case, we begin with the constant model as in forward selection.
However, at every step, we choose either to add a new regressor to our model or
remove one that is already in the model depending on which choice improves the
criterion the most.

Variable Selection Example


Consider the same example as in the spline section where x ∈ [0, 2] and

y = 2 + 3x − 4x5 + x7 + ε

with a sample of n = 50 observations. We can fit two regression models, an empty


and a saturated model, respectively,
y = β0 + ε    and    y = β0 + Σ_{i=1}^{7} βi xⁱ + ε,

and use the R function step( ) to choose a best model with respect to AIC.
12
See Montgomery, Peck, Vining Section 10.2.2

For our simulated data, the backward selection method, step( md1, direction=’backward’
), decided that the best AIC of 78.154 is achieved by not removing any regressors
from the model.
The forward selection method,

step( md0, direction=’forward’, scope=list(lower=md0,upper=md1) ),

sequentially added three regressors to the model as follows:


Step   Model                              AIC
0      y = β0                             142.5
1      y = β0 + β1 x                      118.8
2      y = β0 + β1 x + β7 x⁷              109.8
3      y = β0 + β1 x + β5 x⁵ + β7 x⁷       73.4
This method was able to achieve a better AIC value than the backward selection
method.
The stepwise selection method,

step( md0, direction=’both’, scope=list(upper=md1) )

runs very similarly to the forward method, but makes one deletion in the final step:
Step   Model                              AIC
0      y = β0                             142.5
1      y = β0 + β1 x                      118.8
2      y = β0 + β1 x + β7 x⁷              109.8
3      y = β0 + β1 x + β5 x⁵ + β7 x⁷       73.4
4      y = β0 + β5 x⁵ + β7 x⁷              71.5
This method has more choices than forward selection, which can only grow the model,
and hence is able to achieve an even better AIC value.
The R package leaps is able to consider all possible subset models. Running this
process on the same data yields the top three best models in descending order as

Model 1:  y = β0 + β5 x⁵ + β7 x⁷
Model 2:  y = β0 + β6 x⁶ + β7 x⁷
Model 3:  y = β0 + β5 x⁵ + β6 x⁶

Hence, in this case, the stepwise method was able to find the optimal model.

Remark 3.2.4. The VIF for x⁵ versus x⁷ in this example is 42.43, which is quite
large. Hence, model selection did not eliminate all of the multicollinearity present in
the model.

3.3 Penalized Regressions
No matter how we design our model, thus far we have always computed the least
squares estimator, β̂, by minimizing the sum of squared errors
β̂ = arg min_{β∈R^{p+1}} Σ_{i=1}^{n} (yi − Xiᵀβ)².
This is an unbiased estimator for β. However, as we have seen previously, the variance
of this estimator can be quite large. Hence, we shrink the estimator towards zero
adding bias but decreasing the variance.13 In the context of regression, we add a
penalty term to the above minimization to get a new estimator
β̂^pen = arg min_{β∈R^{p+1}} { Σ_{i=1}^{n} (yi − Xiᵀβ)² + penalty(β) },

where the penalty grows as the entries of β grow in magnitude, thus encouraging
smaller choices for the estimated parameters. We will consider some different types
of penalized regression.14
Remark 3.3.1. We generally do not want to penalize the intercept term β0 . Often
to account for this, the regressors and response are centred–i.e. Y is replaced with
Y − Ȳ and each Xj is replaced with Xj − X̄j for j = 1, . . . , p–in order to set the
intercept term to zero.

3.3.1 Ridge Regression15


The first method we consider is ridge regression, which arose in statistics in the
1970’s,16 but similar techniques arise in other areas of computational mathematics.
In short, a quadratic penalty is applied to the least squares estimator resulting in
 
β̂λ^R = arg min_{β∈Rᵖ} { Σ_{i=1}^{n} (yi − Xiᵀβ)² + λ Σ_{j=1}^{p} βj² }

for any λ ≥ 0. When λ = 0, we have the usual least squares estimator. As λ grows,
the β’s are more strongly penalized.
To solve for β̂λR , we proceed as before with the least squares estimator β̂ by
setting the partial derivatives equal to zero
 
0 = ∂/∂βk { Σ_{i=1}^{n} (yi − β1 xi,1 − . . . − βp xi,p)² + λ Σ_{j=1}^{p} βj² }.

13
General idea of shrinkage is attributed to Stein (1956) https://en.wikipedia.org/wiki/
James%E2%80%93Stein_estimator
14
In R, the glmnet package has a lot of functionality to fit different types of penalized general
linear models http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
15
See Montgomery, Peck, Vining Section 9.5.3
16
Hoerl, A.E.; R.W. Kennard (1970).

This results in the system of equations
Xᵀ Y − (XᵀX) β̂λ^R − λ β̂λ^R = 0,

with the ridge estimator being β̂λ^R = (XᵀX + λIp)⁻¹ Xᵀ Y.
The matrix XᵀX is positive semi-definite even when p > n–i.e. the number of
parameters exceeds the sample size. Hence, any positive value λ will make XᵀX + λIp
invertible as it adds the positive constant λ to all of the eigenvalues. Increasing the
value of λ will increase the numerical stability of the estimator–i.e. decrease the
condition number of the matrix. Furthermore, it will decrease the variance of the
estimator while increasing the bias. It can also be shown that the bias of β̂λ^R is

Eβ̂λ^R − β = −λ(XᵀX + λIp)⁻¹ β,

which implies that the estimator does, in fact, shrink towards zero as λ increases.
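A sketch of the closed form, assuming X is a centred design matrix without an
intercept column and y a centred response; glmnet::glmnet(X, y, alpha = 0) fits the
same penalty with its own scaling of λ.

ridge <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))  # (X^T X + lambda I_p)^{-1} X^T y
}
ridge(X, y, lambda = 1)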

3.3.2 Best Subset Regression


Another type of penalty related to the variable selection techniques from the previous
section is the Best Subset Regression approach, which counts the number of non-
zero β’s and adds a larger penalty as more terms are included in the model. The
optimization looks like
 
Xn X p 
β̂λB = arg min (yi − XiT β)2 + λ 1[βj 6= 0] .
β∈Rp  i=1 j=1

The main problem with this method is that the optimization is non-convex and
becomes severely difficult to compute in practice. This is why the forwards and
backwards selection methods are used for variable selection.

3.3.3 LASSO
The last method we consider is the Least Absolute Shrinkage and Selection Operator,
which is commonly referred to as just LASSO. This was introduced by Tibshirani
(1996) and has since been applied to countless areas of statistics. The form is quite
similar to ridge regression with one small but profound modification,
 
β̂λ^L = arg min_{β∈Rᵖ} { Σ_{i=1}^{n} (yi − Xiᵀβ)² + λ Σ_{j=1}^{p} |βj| },

which is that the penalty term is now the sum of the absolute values instead of a
sum of squares.
The main reason for why this technique is popular is that it combines shrinkage
methods like ridge regression with variable selection and still results in a convex
optimization problem. Delving into the properties of this estimator requires convex
analysis and will be left for future investigations.

3.3.4 Elastic Net17
The elastic net regularization method combines both ridge and lasso regression into
one methodology. Here, we include a penalty term for each of the two methods:
 
β̂λ^EN = arg min_{β∈Rᵖ} { Σ_{i=1}^{n} (yi − Xiᵀβ)² + λ1 Σ_{j=1}^{p} |βj| + λ2 Σ_{j=1}^{p} βj² }.

This method has two tuning parameters λ1 ≥ 0 and λ2 ≥ 0. In the R library glmnet,
a mixing parameter α and a scale parameter λ is specified to get
 
β̂λ^EN = arg min_{β∈Rᵖ} { Σ_{i=1}^{n} (yi − Xiᵀβ)² + λ Σ_{j=1}^{p} [ α|βj| + ((1 − α)/2) βj² ] }.

The intuition behind this approach is to combine the strengths of both ridge and lasso
regression. Namely, ridge regression shrinks the coefficients towards zero reducing
the variance while lasso selects a subset of the parameters to remain in the model.
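A sketch of fitting these penalties with the glmnet package, assuming X is an n × p
matrix of regressors and y the response vector:

library(glmnet)
fit_lasso <- glmnet(X, y, alpha = 1)     # pure lasso penalty
fit_enet  <- glmnet(X, y, alpha = 0.5)   # an equal mix of lasso and ridge
coef(fit_lasso, s = 0.1)                 # coefficients at lambda = 0.1
plot(fit_lasso, xvar = "norm")           # the coefficient paths, as in Figure 3.1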

3.3.5 Penalized Regression: An Example


Consider a sample of size n = 100 generated by the model
y = β0 + β1 x1 + . . . + β50 x50 + ε
where β27 = 2, β34 = −2, all other βi = 0, and ε ∼ N (0, 16). Even though only two
of the regressors have any effect on the response y, feeding all 50 regressors into R’s
lm() function can result in many false positives as in the following table where ten
of the regressors have a moderate to very significant p-value.
Estimate Std. Error t value Pr(>|t|)
x4 -1.01 0.42 -2.42 0.019 *
x16 -1.54 0.56 -2.75 0.008 **
x18 1.16 0.53 2.20 0.033 *
x19 1.15 0.45 2.55 0.014 *
x22 -1.58 0.50 -3.17 0.003 **
x24 -1.01 0.60 -1.69 0.097 .
x27 1.52 0.46 3.27 0.002 **
x29 -1.02 0.50 -2.04 0.047 *
x32 0.95 0.49 1.93 0.060 .
x34 -2.44 0.45 -5.47 0.0000015 ***
We could attempt one of the stepwise variable selection procedures from the
previous section. Running backwards and forwards selection results in the following
regressors being retained in the model.
17
https://en.wikipedia.org/wiki/Elastic_net_regularization

Figure 3.1: Plots of the so-called Ridge and LASSO paths.

Method     Regressors Kept
Backward   x2, x3, x4, x6, x7, x10, x15, x16, x18, x19,
           x22, x25, x27, x28, x29, x32, x34, x43, x47
Forward    x4, x7, x10, x15, x16, x18, x19, x25, x27, x28,
           x29, x32, x34, x43, x47
Hence, both of these procedures retain too many regressors in the final model. The
stepwise selection method was also run, but returned results equivalent to forward
selection.
Applying ridge regression to this dataset will result in all 50 of the estimated
parameters being shrunk towards zero. The plot on the left hand side of Figure 3.1
demonstrates this behaviour. The vertical axis corresponds to the values of β1 , . . . , β50 .
The horizontal axis corresponds to increasing values of the penalization parameter λ.
As λ increases, the estimates for the β’s tend towards zero. Hence we see all 50 of
the curves bending towards zero.
Applying LASSO to the dataset results in a different set of paths from the ridge
regression. The right hand plot in Figure 3.1 displays the LASSO paths. In this
case, the horizontal axis corresponds to some bound K such that ‖β̂λ^L‖1 < K, which is
equivalent to adding the penalty term λ Σ_{j=1}^{p} |βj|. As this bound K grows, more
variables will enter the model. The blue and green lines represent the regressors x34
and x27, which are the first two terms to enter the model. Eventually, as the penalty
is relaxed, many more terms begin to enter the model. Hence, choosing a suitable
K, or equivalently λ, is a critical problem for this method.
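One common way to choose λ is cross-validation; a sketch using cv.glmnet() on the
simulated X and y from this example:

library(glmnet)
cv <- cv.glmnet(X, y, alpha = 1)   # 10-fold CV over a grid of lambda values
cv$lambda.min                      # lambda minimizing the CV error
cv$lambda.1se                      # largest lambda within one SE of the minimum
coef(cv, s = "lambda.1se")         # ideally a sparse fit retaining x27 and x34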

Chapter 4

Generalized Linear Models

Introduction
A generalized linear model (GLM) extends the usual linear model to cases where
the response y has a distribution that is very different from the normal
distribution. In this case, we assume that y comes from an exponential family1 , and
the form of the model is

g(Ey) = β0 + β1 x1 + . . . + βp xp .
Here, g(·) is referred to as the link function. As these are still parametric models, the
parameters β0 , . . . , βp are often solved for via maximum likelihood. These models
can be fit in R via the glm() function. While such models can be studied in full
generality, for this course, we will only consider two specific GLMs: the logistic
regression and the Poisson regression.

4.1 Logistic Regression2


One of the most useful GLMs is the logistic regression. This is applied in the case of
a binary response–i.e when y ∈ {0, 1}. This could, for example, be used for diagnosis
of a disease where x1 , . . . , xp are predictors and y corresponds to the presence or
absence of the disease.
The usual setup is to treat the observed responses yi ∈ {0, 1} as Bernoulli (πi )
random variables, which is
P (yi = 1) = πi and P (yi = 0) = 1 − πi .
This implies that the mean is Eyi = πi and the variance is Var (yi ) = πi (1 − πi ).
Hence, the variance is a function of the mean, which violates the assumption of
constant variance in the Gauss-Markov theorem.
1
https://en.wikipedia.org/wiki/Exponential_family
2
See Montgomery, Peck, Vining Section 13.2.

The goal is then to model Eyi = πi as a function of xiᵀβ. While different link
functions are possible,3 the standard choice is the logistic response function–whose
inverse is the logit link–which is

Eyi = πi = exp(xiᵀβ) / (1 + exp(xiᵀβ))

and results in an S-shaped curve. Rearranging the equation results in


 
log( πi / (1 − πi) ) = xiᵀβ = β0 + β1 xi,1 + . . . + βp xi,p

where the ratio πi /(1 − πi ) is the odds ratio or simply the odds. Furthermore, the
entire response log (πi /(1 − πi )) is often referred to as the log odds.
The estimator β̂ can be computed numerically via maximum likelihood as we
assume the underlying distribution of the data. Hence, we also get fitted values of
the form
ŷi = π̂i = exp(xiᵀβ̂) / (1 + exp(xiᵀβ̂)).
However, as noted already, the variance is not constant. Hence, for residuals of the
form yi − π̂i to be useful, we will need to normalize them. The Pearson residual for
the logistic regression is
ri = (yi − π̂i) / √( π̂i (1 − π̂i) ).

4.1.1 Binomial responses


In some cases, multiple observations can be made for each value of predictors x. For
example, xi ∈ R+ could correspond to a dosage of medication and yi ∈ {0, 1, . . . , mi }
could correspond to the number of the mi subjects treated with dosage xi that are
cured of whatever ailment the medication was supposed to treat.
In this setting, we can treat the observed proportions yi /mi as estimates of πi . Often when fitting a
logistic regression model to binomial data, the data points are weighted with respect
to the number of observations mi at each regressor xi . That is, if m1 is very large
and m2 is small, then the estimation of π1 = y1 /m1 is more accurate–i.e. lower
variance–than π2 = y2 /m2 .
3
See, for example, probit regression https://en.wikipedia.org/wiki/Probit_model
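A sketch of how such grouped binomial data can be passed to glm() in R; the variable
names y (successes), m (trials), and x (dosage) are assumptions.

fit <- glm(cbind(y, m - y) ~ x, family = binomial(link = "logit"))
summary(fit)
# equivalently, modelling the proportions with weights:
fit2 <- glm(I(y / m) ~ x, family = binomial, weights = m)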

4.1.2 Testing model fit
Estimation of the parameters β is achieved by finding the maximum likelihood
estimator. The log likelihood in the logistic regression with Bernoulli data is
log L(β) = log Π_{i=1}^{n} πi^{yi} (1 − πi)^{1−yi}
          = Σ_{i=1}^{n} [ yi log πi + (1 − yi) log(1 − πi) ]
          = Σ_{i=1}^{n} [ yi log( πi / (1 − πi) ) + log(1 − πi) ]
          = Σ_{i=1}^{n} [ yi xiᵀβ − log(1 + e^{xiᵀβ}) ].

The MLE β̂ is solved for numerically.


Beyond finding the MLEs, the log likelihood is also used to test for the goodness-
of-fit of the regression model. This comes from a result known as Wilks’ theorem.

Theorem 4.1.1 (Wilks’ Theorem). Consider two statistical models with parameter spaces
Θ0 and Θ1 such that Θ0 ⊂ Θ1 4 and with likelihoods L0 (θ) and L1 (θ). Then −2 times the
log likelihood ratio has an asymptotic chi-squared distribution with degrees of freedom
equal to the difference in the dimensionality of the parameter spaces. That is,

−2 log(LR) = −2 log( sup_{θ∈Θ0} L0(θ) / sup_{θ∈Θ1} L1(θ) ) →d χ²( |Θ1| − |Θ0| ).

In the case of the logistic regression, we can use this result to construct an
analogue to the F-test when the errors have a normal distribution. That is, we can
compute the likelihood of the constant model
 
log( πi / (1 − πi) ) = β0                                  (4.1.1)

and the likelihood of the full model


 
log( πi / (1 − πi) ) = β0 + β1 x1 + . . . + βp xp .        (4.1.2)

and claim from Wilks’ theorem that −2 log(LR) is approximately distributed χ2 (p)
assuming the sample size is large enough.
4
i.e. the models are nested

In R, the glm() function does not return a p-value as the lm() function does for
the F test. Instead, it provides two values, the null and residual deviances, which
are respectively
 
null deviance = −2 log( L(null model) / L(saturated model) ),  and
residual deviance = −2 log( L(full model) / L(saturated model) ).
Here, the null model is from Equation 4.1.1 and the full model is from Equation 4.1.2.
The saturated model is the extreme model with p = n parameters, which perfectly
models the data. It is, in some sense, the largest possible model used as a baseline.
In practice, the deviances can be thought of like the residual sum of squares
from the ordinary linear regression setting. If the decrease is significant, then the
regressors have predictive power over the response. Furthermore,
 
(null deviance) − (residual deviance) = −2 log( L(null model) / L(full model) ),
which has an asymptotic χ2 (p) distribution. Such a goodness of fit test can be run
in R using anova( model, test="LRT" ). Note that there are other goodness-of-fit
tests possible for such models.

4.1.3 Logistic Regression: An Example


A sample of size n = 21 was randomly generated with xi ∈ [−1, 1] and yi such that
yi ∼ Bernoulli( e^{2xi} / (1 + e^{2xi}) ).
A logistic regression was fit to the data in R using the glm() function with the
family=binomial(logit) argument to specify that the distribution is Bernoulli–i.e.
binomial with n = 1–and that we want a logistic link function.
The result from summary() is the model
 
log( π / (1 − π) ) = −0.4 + 2.16x

with a p-value of 0.03 for β̂1 = 2.16. A plot of the true and predicted curves is
displayed in Figure 4.1.
We could consider the alternative case where

yi ∼ Binomial( mi , e^{2xi} / (1 + e^{2xi}) )
for some mi ∈ Z+ , which is mi Bernoulli observations at each xi . In this case, we
have a much larger dataset, which results in the better fitting curve displayed in the
bottom plot of Figure 4.1.
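A sketch reproducing the Bernoulli version of this example; the seed is an assumption,
so the fitted coefficients will not equal −0.4 and 2.16 exactly.

set.seed(502)                                   # assumed seed
n <- 21
x <- seq(-1, 1, length.out = n)
prob <- exp(2 * x) / (1 + exp(2 * x))
y <- rbinom(n, size = 1, prob = prob)

fit <- glm(y ~ x, family = binomial(link = "logit"))
summary(fit)
pi_hat <- predict(fit, type = "response")  # fitted probabilities, as plotted in Figure 4.1
anova(fit, test = "LRT")                   # likelihood ratio test for beta_1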

Figure 4.1: Plots of the 21 data points, the true Bernoulli probabilities in red,
and the predicted probabilities in blue. Top plot, yi ∼ Bernoulli (πi ). Bottom plot,
yi ∼ Binomial (mi , pi ).

4.2 Poisson Regression5
The Poisson regression is another very useful GLM to know, which models counts or
occurrences of rare events. Examples include modelling the number of defects in a
manufacturing line or predicting lightning strikes across a large swath of land.
The Poisson distribution with parameter λ > 0 has probability mass function
f(x) = e^{−λ} λˣ/x!. This can often be thought of as a limiting case of the binomial
distribution as n → ∞ and p → 0 such that np → λ. If we want a linear model for a response variable y
such that yi has a Poisson (λi ) distribution, then, similarly to the logistic regression,
we apply a link function. In this case, one of the most common link functions used
is the log. The resulting model is

log Eyi = β0 + β1 x1 + . . . + βp xp

where Eyi = λi .
Similarly to the logistic regression case–and to GLMs in general–the parameters
β are estimated via maximum likelihood based on the assumption that y has a
Poisson distribution. Furthermore, the deviances can be computed and a likelihood
ratio test can be run to assess the fit of the model. The log likelihood for a Poisson
regression is
log L(β) = log Π_{i=1}^{n} e^{−λi} λi^{yi} / yi!
          = Σ_{i=1}^{n} [ −λi + yi log λi − log(yi!) ]
          = Σ_{i=1}^{n} [ −e^{xiᵀβ} + yi xiᵀβ − log(yi!) ].

4.2.1 Poisson Regression: An Example


As an example, consider a sample of size n = 21 generated randomly by

yi ∼ Poisson( e^{2xi} ).


A Poisson regression with log link can be fit to the data using R’s glm() function
with the argument family=poisson(link=log).
The result from summary() is

log(λ) = 0.16 + 1.95x

with a p-value of 10⁻⁸ for β̂1 = 1.95. A plot of the true and predicted curves is
displayed in Figure 4.2.
5
See Montgomery, Peck, Vining Section 13.3.
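A sketch reproducing this example; again the seed is an assumption, so the fitted
coefficients will differ slightly from those above.

set.seed(378)                                   # assumed seed
n <- 21
x <- seq(-1, 1, length.out = n)
y <- rpois(n, lambda = exp(2 * x))

fit <- glm(y ~ x, family = poisson(link = log))
summary(fit)
lambda_hat <- predict(fit, type = "response")   # fitted rates, as plotted in Figure 4.2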

Figure 4.2: Poisson regression (blue) and true rate parameter (red) fit to n = 21
data points.

Appendix A

Distributions

Introduction
The following is a short overview of some of the most common probability distributions
encountered in the context of linear regression.

A.1 Normal distribution


The normal or Gaussian distribution is of principal importance in probability and
statistics. In its univariate form, we say that X ∼ N(µ, σ²) with mean µ ∈ R and
variance σ² ∈ R+ if it has the following probability density function (pdf):

f_{N(µ,σ²)}(x) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ).
Such a random variable X can be centred and scaled into a standardized form as
(X − µ)/σ ∼ N (0, 1). Furthermore, the normal distribution is very stable in the
sense that a sum of independent normal random variables has a normal distribution,
as does a normal random variable scaled by a real valued constant.
Part of its importance stems from the central limit theorem, which in its simplest
form states that the standardized sum of a collection of independent random variables
converges in distribution to a standard normal random variable. However, it is worth
noting that there are many other central limit theorems in existence.
Theorem A.1.1 (Central Limit Theorem). Let Y1 , . . . , Yn be a sample of n iid
random variables with mean µ ∈ R and finite variance σ² < ∞. Letting
Ȳ = n⁻¹ Σ_{i=1}^{n} Yi be the sample mean, then

√n (Ȳ − µ)/σ →d N (0, 1)

where →d denotes convergence in distribution.

The normal distribution can be extended to the multivariate normal distribution
for a vector X ∈ Rp where we write X ∼ N (µ, Σ). Here, µ ∈ Rp is the mean vector
while Σ ∈ Rp×p is a p × p matrix that is symmetric and positive semi-definite. This
distribution also has elliptical symmetry. The form of the pdf, assuming Σ is positive
definite, is
 
f_{N(µ,Σ)}(x) = (2π)^{−p/2} det(Σ)^{−1/2} exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) )

Otherwise, X will have a degenerate normal distribution, which is still normal but
supported on a subspace of dimension less than p.
The multivariate normal (MVN) distribution has some very nice characterizations.
A vector X = (X1 , . . . , Xp ) is MVN if and only if every linear combination of the Xi
is univariate normal. That is, for all a ∈ Rᵖ, a · X ∼ N(µ̃, σ̃²). Hence, it is worth
emphasizing that if X is a vector that is marginally normal (i.e. every component
Xi is univariate normal), then the joint distribution is not necessarily multivariate
normal.
In general if two random variables X, Y ∈ R are independent, then we have
that cov (X, Y ) = 0. However, the reverse implication is not necessarily true. One
exception to this is the MVN case: if (X, Y) is MVN and cov(X, Y) = 0, then X and Y
are independent.

A.2 Chi-Squared distribution


The chi-squared distribution arises throughout the field of statistics usually in the
context of goodness-of-fit testing. Let Z ∼ N (0, 1); then Z² ∼ χ²(1), which is said
to be chi-squared with one degree of freedom. Furthermore, chi-squared random
variables are additive in the sense that if X ∼ χ2 (ν) and Y ∼ χ2 (η) are independent,
then X + Y ∼ χ2 (ν + η). A chi-squared random variable is supported on the positive
real line and the pdf for X ∼ χ2 (ν) with ν > 0 is
f_{χ²(ν)}(x) = ( 1 / (2^{ν/2} Γ(ν/2)) ) x^{ν/2−1} e^{−x/2}

Its mean and variance are ν and 2ν, respectively. Note that while very often the
degrees of freedom parameter is a positive integer, it can take on any positive real
value and still be valid. In fact, the chi-squared distribution is a specific type of the
more general gamma distribution, which will not be discussed in these notes.
This random variable is quite useful in the context of linear regression with respect
to how it leads to the t and F distributions discussed in the following subsections.
However, it also arises in many other areas of statistics including Wilks’ Theorem
regarding the asymptotic distribution of the log likelihood ratio, goodness-of-fit in
multinomial hypothesis testing, and testing for independence in a contingency table.

A.3 t distribution
The t distribution is sometimes referred to as Student’s t distribution in recognition
of its, at the time, anonymous progenitor William Sealy Gosset working at the
Guinness brewery. It most notably arises in the context of estimation of a normal
random variable when the variance is unknown as occurs frequently in these lecture
notes.
Let Z ∼ N (0, 1) and let V ∼ χ2 (ν) be independent random variables. Then, we
say that
T = Z / √(V/ν) ∼ t(ν)
has a t distribution with ν degrees of freedom. Such a distribution can be thought
of as a heavier tailed version of the standard normal distribution. In fact, t (1) is the
Cauchy distribution with pdf

ft(1) (x) = (π(1 + x2 ))−1

while T ∼ t (ν) converges to a standard normal distribution as ν → ∞.


A noteworthy property of a t distributed random variable is that it will only
have moments up to but not including order ν. That is, for T ∼ t (ν), ET k < ∞ for
k < ν. For k ≥ ν, the moments do not exist.

A.4 F distribution
The F distribution arises often in the context of linear regression when a comparison
is made between two sources of variation. Let X ∼ χ2 (ν) and Y ∼ χ2 (η) be
independent random variables, then we write that

F = (X/ν) / (Y/η) ∼ F(ν, η)

has an F distribution with degrees of freedom ν and η. Due to this form of the
distribution, if F ∼ F (ν, η), then F −1 ∼ F (η, ν). The F distribution is supported
on the positive real line. The F and t distributions are related by the fact that if
T ∼ t (ν), then T 2 ∼ F (1, ν).

Appendix B

Some Linear Algebra

To successfully understand linear regression, we will require some basic notation
regarding the manipulation of vectors and matrices. First, we will denote an n
dimensional vector as Y ∈ Rⁿ and an n × p matrix as X ∈ R^{n×p}. The following is a
list of definitions and results:

1. For a matrix A ∈ Rn×p with row i and column j entry denoted ai,j , then the
transpose of A is AT ∈ Rp×n with row i and column j entry aj,i . That is, the
indices have swapped.

2. For matrices A ∈ Rm×n and B ∈ Rn×p , we have that (AB)T = B T AT


3. For an invertible matrix A ∈ R^{n×n}, we have that (A⁻¹)ᵀ = (Aᵀ)⁻¹.

4. A matrix A is square if the number of rows equals the number of columns.


That is, A ∈ Rn×n .

5. A square matrix A ∈ Rn×n is symmetric if A = AT .

6. A symmetric matrix A ∈ Rn×n necessarily has real eigenvalues.

7. A symmetric matrix A ∈ R^{n×n} is positive definite if for all x ∈ Rⁿ with x ≠ 0,
we have that xᵀAx > 0.

8. A symmetric matrix A ∈ Rn×n is also positive definite if all of its eigenvalues


are positive real valued.

9. A symmetric matrix A ∈ R^{n×n} is positive semi-definite (also non-negative
definite) if for all x ∈ Rⁿ with x ≠ 0, we have that xᵀAx ≥ 0. Or alternatively,
all of the eigenvalues are non-negative real valued.

10. Covariance matrices are always positive semi-definite. If a covariance matrix


has some zero valued eigenvalues, then it is called degenerate.

11. If X, Y ∈ Rn are random vectors, then
 
cov(X, Y) = E[ (X − EX)(Y − EY)ᵀ ] ∈ R^{n×n}.

12. If X, Y ∈ Rn are random vectors and A, B ∈ Rm×n are non-random real valued
matrices, then

cov (AX, BY ) = Acov (X, Y ) B T ∈ Rm×m .

13. If Y ∈ Rⁿ is multivariate normal–i.e. Y ∼ N(µ, Σ)–and A ∈ R^{m×n}, then AY
is also multivariate normal with AY ∼ N(Aµ, AΣAᵀ).

14. A square matrix A ∈ Rn×n is idempotent if A2 = A.
