Stat 378
Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1
Contents
Preface
2 Model Assumptions
2.1 Plotting Residuals
2.1.1 Plotting Residuals
2.2 Transformations
2.2.1 Variance Stabilizing
2.2.2 Linearization
2.2.3 Box-Cox and the power transform
2.2.4 Cars Data
2.3 Polynomial Regression
2.3.1 Model Problems
2.3.2 Piecewise Polynomials
2.3.3 Interacting Regressors
2.4 Influence and Leverage
2.4.1 The Hat Matrix
2.4.2 Cook's D
2.4.3 DFBETAS
2.4.4 DFFITS
2.4.5 Covariance Ratios
2.4.6 Influence Measures: An Example
2.5 Weighted Least Squares
3 Model Building
3.1 Multicollinearity
3.1.1 Identifying Multicollinearity
3.1.2 Correcting Multicollinearity
3.2 Variable Selection
3.2.1 Subset Models
3.2.2 Model Comparison
3.2.3 Forward and Backward Selection
3.3 Penalized Regressions
3.3.1 Ridge Regression
3.3.2 Best Subset Regression
3.3.3 LASSO
3.3.4 Elastic Net
3.3.5 Penalized Regression: An Example
A Distributions
A.1 Normal distribution
A.2 Chi-Squared distribution
A.3 t distribution
A.4 F distribution
Preface
The following are lecture notes originally produced for an upper level undergraduate course on linear regression at the University of Alberta in the fall of 2017.
Regression is one of the main, if not the primary, workhorses of statistical inference.
Hence, I do hope you will find these notes useful in learning about regression.
The goal is to begin with the standard development of ordinary least squares in
the multiple regression setting, then to move onto a discussion of model assumptions
and issues that can arise in practice, and finally to discuss some specific instances of
generalized linear models (GLMs) without delving into GLMs in full generality. Of
course, what follows is by no means a unique exposition but is mostly derived from
three main sources: the text, Linear Regression Analysis, by Montgomery, Peck, and
Vining; the course notes of Dr. Linglong Kong who lectured this same course in
2013; whatever remains inside my brain from supervising (TAing) undergraduate
statistics courses at the University of Cambridge during my PhD years.
Adam B Kashlak
Edmonton, Canada
August 2017
Chapter 1
Introduction
Linear regression begins with the simple but profound idea that some observed
response variable, Y ∈ R, is a function of p input or regressor variables x1 , . . . , xp
with the addition of some unknown noise variable ε. Namely,
Y = f (x1 , . . . , xp ) + ε
where the noise is generally assumed to have mean zero and finite variance. In this
setting, Y is usually considered to be a random variable while the xi are considered
fixed. Hence, the expected value of Y is in terms of the unknown function f and the
regressors:
E (Y |x1 , . . . , xp ) = f (x1 , . . . , xp ).
While f can be considered to be in some very general classes of functions, we
begin with the standard linear setting. Let β0 , β1 , . . . , βp ∈ R. Then, the multiple
regression model is
Y = β0 + β1 x1 + . . . + βp xp + ε = β^T X + ε
where β = (β0 , . . . , βp )T and X = (1, x1 , . . . , xp )T . The simple regression model is a
submodel of the above where p = 1, which is
Y = β0 + β1 x1 + ε.
Given n observations of the response and regressors, the model can be written in matrix form as
Y = Xβ + ε (1.0.1)
[Figure 1.1: Nobel prizes per 10 million population plotted against chocolate consumption for 16 countries.]
where
Y = (Y1, . . . , Yn)^T, β = (β0, . . . , βp)^T, ε = (ε1, . . . , εn)^T,
and X is the n × (p + 1) design matrix whose ith row is (1, x_{i,1}, . . . , x_{i,p}).
An example of a linear regression from a study from the New England Journal of
Medicine can be found in Figure 1.1. This study highlights the correlation between
chocolate consumption and Nobel prizes received in 16 different countries.
Table 1.1: The four variables in the linear regression model of Equation 1.0.1 split
between whether they are fixed or random variables and between whether or not the
analyst knows their value.
         Known    Unknown
Fixed    X        β
Random   Y        ε
Definitions
Before continuing, we require the following collection of terminology.
The response Y and the regressors X were already introduced above. These
elements comprise the observed data in our regression. The noise or error variable is
ε. The entries in this vector are usually considered to be independent and identically
distributed (iid) random variables with mean zero and finite variance σ 2 < ∞.
Very often, this vector will be assumed to have a multivariate normal distribution: ε ∼ N(0, σ² I_n) where I_n is the n × n identity matrix. The variance σ² is also
generally considered to be unknown to the analyst.
The unknown vector β is our parameter vector. Eventually, we will construct an
estimator β̂ from the observed data. Given such an estimator, the fitted values are
Ŷ := X β̂. These values are what the model believes are the expected values at each
regressor.
Given the fitted values, the residuals are r = Y − Ŷ which is a vector with entries
ri = Yi − Ŷi . This is the difference between the observed response and the expected
response of our model. The residuals are of critical importance to testing how good
our model is and will reappear in most subsequent sections.
Lastly, there is the concept of sums of squares. Letting Ȳ = n^{−1} Σ_{i=1}^n Yi be the sample mean for Y, the total sum of squares is SStot = Σ_{i=1}^n (Yi − Ȳ)², which can be thought of as the total variation of the responses. This can be decomposed into a sum of the explained sum of squares and the residual sum of squares as follows:
SStot = SSexp + SSres = Σ_{i=1}^n (Ŷi − Ȳ)² + Σ_{i=1}^n (Yi − Ŷi)².
The explained sum of squares can be thought of as the amount of variation explained
by the model while the residual sum of squares can be thought of as a measure of
the variation that is not yet contained in the model. The sum of squares gives us
an expression for the so called coefficient of determination, R2 = SSexp /SStot =
1 − SSres /SStot ∈ [0, 1], which is treated as a measure of what percentage of the
variation is explained by the given model.
1.1 Point Estimation
In the ordinary least squares setting, our choice of estimator is
β̂ = arg min_{β̃ ∈ R^{p+1}} Σ_{i=1}^n (Yi − Xi,· β̃)²    (1.1.1)
where Xi,· is the ith row of the matrix X. In the simple regression setting, this reduces to
(β̂0, β̂1) = arg min_{(β̃0, β̃1) ∈ R²} Σ_{i=1}^n (Yi − (β̃0 + β̃1 xi))².
Note that this is equivalent to choosing a β̂ to minimize the sum of the squared
residuals.
It is perfectly reasonable to consider other criteria beyond minimizing the sum of squared residuals. However, this approach results in an estimator with many nice properties. Most notable is the Gauss-Markov theorem:
Theorem 1.1.1 (Gauss-Markov Theorem). Given the regression setting from Equation 1.0.1 and that the errors satisfy Eεi = 0 for i = 1, . . . , n, Var(εi) = σ² for i = 1, . . . , n, and cov(εi, εj) = 0 for i ≠ j, the least squares estimator has minimal variance over all linear unbiased estimators. (This is sometimes referred to as the “Best Linear Unbiased Estimator” or BLUE.)
Hence, it can be shown that the estimator is unbiased, Eβ̂ = β, and that the constructed least squares line passes through the centre of the data in the sense that the sum of the residuals is zero, Σ_{i=1}^n ri = 0, and that Ȳ = β̂ · X̄ where Ȳ = n^{−1} Σ_{i=1}^n Yi is the sample mean of the Yi and where X̄ is the vector of column means of the matrix X.
The above is the kth entry in the vector ∇ Σ_{i=1}^n (Yi − Xi,· β̂)². Hence,
∇ Σ_{i=1}^n (Yi − Xi,· β̂)² = −2X^T Y + 2X^T X β̂.
Setting this gradient to zero yields the normal equations
X^T Y = X^T X β̂,
and hence β̂ = (X^T X)^{−1} X^T Y whenever X^T X is invertible.
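As a quick computational sketch in R (using the built-in cars dataset, which is not any dataset from these notes), the normal equations can be solved directly and checked against lm():
X <- cbind(1, cars$speed)                 # design matrix with an intercept column
Y <- cars$dist
beta_hat <- solve(t(X) %*% X, t(X) %*% Y) # solves X^T X beta = X^T Y
beta_hat
coef(lm(dist ~ speed, data = cars))       # agrees with the normal equations solution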
In the case that n > p, the matrix PX := X(X T X)−1 X T is a rank p+1 projection
matrix. Similarly, In − PX is the complementary rank n − p − 1 projection matrix.
Intuitively, this implies that the fitted values are the projection of the observed values onto a (p + 1)-dimensional subspace while the residuals arise from a projection onto the orthogonal subspace. As a result, it can be shown that cov(Ŷ, r) = 0.
Now that we have an explicit expression for the least squares estimator β̂, we
can show that it is unbiased.
Indeed, Eβ̂ = (X^T X)^{−1} X^T EY = (X^T X)^{−1} X^T Xβ = β. The covariance matrix can be computed similarly:
Var(β̂) = E[β̂ β̂^T] − ββ^T
= E[((X^T X)^{−1} X^T Y)((X^T X)^{−1} X^T Y)^T] − ββ^T
= (X^T X)^{−1} X^T E[Y Y^T] X(X^T X)^{−1} − ββ^T
= σ²(X^T X)^{−1}.
Thus far, we have only assumed that ε is a random vector with iid entries with mean zero and variance σ². If, in addition, we assume that ε has a normal or Gaussian distribution, then
β̂ ∼ N(β, σ²(X^T X)^{−1}).
Furthermore, with a little work, one can show that the fitted values and residuals also have normal distributions in this setting:
Ŷ ∼ N(Xβ, σ² P_X),  and  r ∼ N(0, σ²(I_n − P_X)).
Notice that the two above covariance matrices are not generally of full rank. This
assumption that the errors follow a normal distribution is a very common assumption
to make in practice.
In particular, SSres/σ² ∼ χ²(n − p − 1), which leads to the unbiased estimator of σ² being SSres/(n − p − 1) = σ̂²(n/(n − p − 1)), where σ̂² = SSres/n is the maximum likelihood estimator.
For a more precise explanation of where this comes from, see Cochran’s theorem
(https://en.wikipedia.org/wiki/Cochran%27s_theorem), which is beyond the
scope of this course.
Chocolate-Nobel Data
Running a regression in R on the chocolate consumption vs Nobel prize data from Figure 1.1 results in
β̂ = (β̂0, β̂1)^T = (−0.9910, 1.3545)^T.
To see why the Gauss-Markov theorem holds, consider an arbitrary linear unbiased estimator AY of β and write A = (X^T X)^{−1} X^T + D for some matrix D. Since
E(AY) = AXβ = ((X^T X)^{−1} X^T + D)Xβ = β + DXβ,
the unbiasedness condition implies that DXβ = 0 for any β ∈ R^{p+1} and hence that DX = 0.
Next, we compute the variance of the arbitrary linear unbiased estimator to get
Var(AY) = σ² AA^T = σ²((X^T X)^{−1} X^T + D)((X^T X)^{−1} X^T + D)^T = σ²((X^T X)^{−1} + DD^T),
where the cross terms vanish because DX = 0.
Remark 1.1.2. Note that DD^T is positive semi-definite for any choice of D as for any w ∈ R^{p+1}, we have
w^T(DD^T)w = (D^T w)^T(D^T w) = ‖D^T w‖² ≥ 0.
Remark 1.1.3. While β̂ has minimal variance over all unbiased estimators, we can
lessen the variance further if we allow for biased estimators. This is considered in
many more advanced regression methods such as ridge regression.
1.2 Hypothesis Testing
1.2.1 Goodness of fit
We now have a model for our data, and in some sense, this model is optimal as it
minimizes the squared errors. However, even being optimal, we are still interested
in knowing whether or not this is a good model for our data. This is a question of
goodness of fit.
The first question to ask is, do any of the regressors provide information about
the response in the linear model framework? This can be written mathematically as
H0 : β1 = . . . = βp = 0    H1 : ∃ i ≥ 1 s.t. βi ≠ 0,    (1.2.1)
which asks whether there is at least one βi that we can claim is non-zero, which would imply that the regressor xi has some nontrivial influence over y.
To test this hypothesis, we revisit the explained and residual sums of squares
introduced in the Definitions section. Specifically, we already have that SSres /σ 2 ∼
χ2 (n − p − 1) from above. Similarly, SSexp /σ 2 ∼ χ2 (p) under the null hypothesis
where β1 = . . . = βp = 0, and hence any variation in those terms should be pure
noise. Lastly, it can be demonstrated that SSres and SSexp are independent random
variables, which intuitively follows from the orthogonality of the fitted values and
the errors.
The usual test statistic for Hypothesis 1.2.1 is
F = (SSexp/p) / (SSres/(n − p − 1)) ∼ F(p, n − p − 1),
which leads to an F test. If the test statistic is large, then the explained variation
is larger than the noise resulting in a small p-value and a rejection of the null
hypothesis.
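As a sketch in R (again with the built-in cars data rather than the chocolate data), the overall F statistic and its p-value appear in the last line of summary(), or can be obtained by comparing against the intercept-only model:
fit <- lm(dist ~ speed, data = cars)
summary(fit)                              # reports the F statistic on p and n - p - 1 df
anova(lm(dist ~ 1, data = cars), fit)     # the same F test as a nested model comparison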
1.2.2 Regression coefficients
Given that the previous F test results in a significant p-value, the subsequent question
is to ask which of the p regressors are significant? Hence, we have the following
hypotheses for j = 0, 1, . . . , p.
H0,j : βj = 0 H1,j : βj 6= 0.
Each individual β̂j ∼ N(βj, σ²(X^T X)^{−1}_{j,j}) where (X^T X)^{−1}_{j,j} is the jth entry in the diagonal of (X^T X)^{−1}. Thus, under the null hypothesis that βj = 0, we have that
β̂j / √(σ²(X^T X)^{−1}_{j,j}) ∼ N(0, 1).
Replacing the unknown σ² with the unbiased estimator SSres/(n − p − 1) gives
β̂j / √((X^T X)^{−1}_{j,j} SSres/(n − p − 1)) ∼ t(n − p − 1),
and a t test can be performed. If the value of the test statistic is large, then there
may be sufficient evidence to reject the null that βj = 0. The denominator is often
referred to as the standard error. To simplify future formulae, this will be denoted
as se(βj ).
It is worth noting that this test looks for significant influence of the jth regressor
on the response given all of the other regressors. Hence, it quantifies the marginal as
opposed to the absolute effect of that variable on the model. These ideas will be
investigated further when discussing variable selection in Section 3.2. However, as a
quick word of caution, when p hypothesis tests are performed, the analyst needs to
consider multiple testing corrections.
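A sketch of the coefficient t tests in R, using the built-in mtcars data as a stand-in; the Bonferroni adjustment shown is just one simple multiple testing correction:
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
coef(summary(fit))                                          # estimates, standard errors, t values, p-values
p.adjust(coef(summary(fit))[-1, 4], method = "bonferroni")  # adjust the p regressor p-values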
1.2.3 Partial F-test
In the previous two sections, we first tested as to whether or not there exists at least
one βj , j = 1, . . . , p, that is non-zero. Then, we tested whether or not a specific βj
is non-zero. The next logical question is whether or not some collection of βj ’s of
size strictly between 1 and p has a non-zero element. That is, for a fixed q,
H0 : βp−q+1 = . . . = βp = 0    H1 : ∃ i ≥ p − q + 1 s.t. βi ≠ 0.    (1.2.2)
Here, we are comparing two different models, which are the partial and full models,
respectively,
Y = β0:p−q X + ε, and Y = β0:p−q X + βp−q+1:p X + ε,
and want to know whether the final q regressors add any significant explanation to
our model given the other p − q. For the above notation,
βi:j = (0, . . . , 0, βi , βi+1 , . . . , βj , 0, . . . , 0)T .
To run the hypothesis test 1.2.2, we would have to compute the least squares
estimator in the partial model, β̂1:p−q , and the standard least squares estimator in
the full model, β̂. Then, we will have to compute the additional explained sum of
squares gained from adding the q extra regressors to our model, which is
SSexp (βp−q+1:p |β1:p−q ) = SSexp (β) − SSexp (β1:p−q ),
the explained sum of squares from the full model minus the explained sum of squares
from the partial model.
Similarly to the full F-test from above, we have under the null hypothesis that
SSexp (βp−q+1:p |β1:p−q )/σ 2 ∼ χ2 (q) . Hence,
F = (SSexp(βp−q+1:p | β1:p−q)/q) / (SSres/(n − p − 1)) ∼ F(q, n − p − 1),
so if this test statistic is large, then we have evidence to suggest that at least one of the additional q regressors adds some explanatory power to our model.
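In R, the partial F-test is the comparison of two nested fits with anova(); a sketch with mtcars, where q = 2 regressors are added to a single-regressor model:
fit_partial <- lm(mpg ~ wt, data = mtcars)
fit_full    <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(fit_partial, fit_full)   # F test on q = 2 and n - p - 1 degrees of freedom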
Recall that
(β̂j − βj)/se(βj) ∼ t(n − p − 1).
[See Montgomery, Peck, Vining, the latter half of Section 3.3.2 for the partial F-test, and Sections 2.4.1 and 3.4.1 for simple and multiple regression, respectively.]
Hence, a 1 − α confidence interval for the parameter βj is
β̂j − t_{α/2,n−p−1} se(βj) ≤ βj ≤ β̂j + t_{α/2,n−p−1} se(βj).
While the above can be used to produce a confidence interval for each individual
parameter, combining these intervals will not result in a 1 − α confidence set for the
entire parameter vector. To construct such a confidence region, a little more care is
required. Here, we will construct a confidence set for the entire vector (β0, β1, . . . , βp), which results in p + 1 degrees of freedom in what follows. As β̂ ∼ N(β, σ²(X^T X)^{−1}), we have that
σ^{−2}(β̂ − β)^T X^T X(β̂ − β) ∼ χ²(p + 1).
From before, we have that SSres/σ² ∼ χ²(n − p − 1). Hence,
((β̂ − β)^T X^T X(β̂ − β)/(p + 1)) / (SSres/(n − p − 1)) ∼ F(p + 1, n − p − 1).
A 95% and a 99% confidence ellipsoid for the Chocolate-Nobel prize data are displayed in Figure 1.2. Notice that both ellipses contain β0 = 0, which had a t statistic p-value of 0.649. Meanwhile, neither contains β1 = 0, whose p-value was the very significant 0.0015. The confidence ellipses were plotted with help from the R library ellipse.
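A sketch of how such an ellipse can be drawn with the ellipse package; choco_nobel is a hypothetical data frame standing in for the chocolate data, which is not reproduced in these notes:
library(ellipse)
fit <- lm(nobel ~ choco, data = choco_nobel)    # hypothetical data frame
plot(ellipse(fit, level = 0.99), type = "l")    # 99% joint confidence ellipse
lines(ellipse(fit, level = 0.95), lty = 2)      # 95% ellipse
points(coef(fit)[1], coef(fit)[2], pch = 19)    # the least squares estimate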
Figure 1.2: Confidence ellipses at the 95% and 99% level for the chocolate consump-
tion vs Nobel prizes data.
Hence,
(β̂ · x − E[Ŷ0 | X = x]) / √(σ² x^T(X^T X)^{−1} x) ∼ N(0, 1),
and
(β̂ · x − E[Ŷ0 | X = x]) / √((SSres/(n − p − 1)) x^T(X^T X)^{−1} x) ∼ t(n − p − 1),
which results in the following 1 − α confidence interval:
β̂ · x − t_{α/2,n−p−1} √(x^T(X^T X)^{−1} x · SSres/(n − p − 1)) ≤ E[Ŷ0 | X = x] = β · x ≤ β̂ · x + t_{α/2,n−p−1} √(x^T(X^T X)^{−1} x · SSres/(n − p − 1)).
1.4.2 Prediction for a new observation
In the previous subsection, we asked for a confidence interval for the expected value of the response given a new vector of regressors, which was a confidence interval for E[Ŷ0 | X = x] = β · x based on β̂ · x. Now, we want to determine a confidence interval for the future response given a vector of regressors. That is, we want an interval for Y0 = β · x + ε0 ∼ N(β · x, σ²), but, as usual, β is unknown. To circumvent this, note that
Y0 − Ŷ0 = (β · x + ε0) − β̂ · x ∼ N(0, σ²(1 + x^T(X^T X)^{−1} x)),
because the variances of ε0 and β̂ · x sum as these are independent random variables.
Hence, applying the usual rearrangement of terms and replacement of σ² with SSres/(n − p − 1) results in
β̂ · x − t_{α/2,n−p−1} √((1 + x^T(X^T X)^{−1} x) SSres/(n − p − 1)) ≤ Y0 ≤ β̂ · x + t_{α/2,n−p−1} √((1 + x^T(X^T X)^{−1} x) SSres/(n − p − 1)).
Intervals for the Chocolate-Nobel prize data for both the expected mean and for
a new observation are displayed in Figure 1.3.
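In R, both intervals are available from predict(); a sketch with the cars data at a new speed value:
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 21)
predict(fit, newdata = new, interval = "confidence")  # interval for the mean response
predict(fit, newdata = new, interval = "prediction")  # wider interval for a new observation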
Thus far, we have considered models of the form
y = β0 + β1 x1 + . . . + βp xp + ε
where the regressors x1, . . . , xp ∈ R can take on any real value. However, very often
in practice, we have regressors that take on categorical values. For example, male vs
female, employed vs unemployed, treatment vs placebo, Edmonton vs Calgary, etc.
When there is a binary choice as in these examples, we can choose one category to
correspond to zero and the other category to correspond to one.
As an example, consider
y = β0 + β1 x1 + β2 x2 + ε (1.5.1)
[See Montgomery, Peck, Vining Section 3.5 for prediction intervals, and Chapter 8, Sections 8.1 and 8.2, for indicator variables.]
Figure 1.3: A 95% confidence interval (in blue) for the expected value and a 95% prediction interval (in red) for a new observation for the chocolate consumption vs Nobel prizes data.
When x2 = 0, the model in Equation 1.5.1 reduces to
y = β0 + β1 x1 + 0 + ε,
and when x2 = 1, it becomes
y = β0 + β1 x1 + β2 + ε = (β0 + β2) + β1 x1 + ε.
What we have is two models with the same slope β1 but with two different intercepts
β0 and β0 + β2 , which are two parallel lines.
Remark 1.5.1. A first thought is to merely split the data and train two separate
models. However, we want to use the entire dataset at once specifically to estimate
the common slope β1 with as much accuracy as possible.
While the range of the regressors has changed, we will fit the least squares
estimate to the model precisely as before. Now, considering the model 1.5.1, assume
that we have m samples with x2 = 0 and n samples with x2 = 1. Our design matrix
takes on a new form:
Y = (Y1, . . . , Ym, Ym+1, . . . , Ym+n)^T, β = (β0, β1, β2)^T, ε = (ε1, . . . , εm+n)^T,
and X is the (m + n) × 3 matrix whose ith row is (1, xi, 0) for i = 1, . . . , m and (1, xi, 1) for i = m + 1, . . . , m + n.
However, the least squares estimate is computed as before as β̂ = (X T X)−1 X T Y .
Furthermore, we can perform hypothesis tests on the fitted model such as
H0 : β2 = 0 H1 : β2 6= 0,
which is equivalently asking whether or not the regression lines have the same
intercept.
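A sketch in R, assuming a hypothetical data frame dat with a response y, a continuous regressor x1, and a 0/1 indicator x2 (or, equivalently, a factor column group):
fit <- lm(y ~ x1 + x2, data = dat)       # x2 shifts the intercept for the second group
summary(fit)                             # the t test on x2 tests H0: beta_2 = 0
# a factor column is expanded into indicator columns automatically:
fit2 <- lm(y ~ x1 + factor(group), data = dat)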
Models can be expanded to include multiple indicator variables as long as the
matrix X T X is still invertible. For example, let’s suppose we want to look at wages
in Alberta with respect to age but partitioned for male vs female and for Edmonton
vs Calgary. Then, the model would look like
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε,
where x1 is age, x2 is an indicator for male vs female, and x3 is an indicator for Edmonton vs Calgary.
In the silly case that our data only consisted of men from Calgary and women from
Edmonton, then the final regressor is redundant and X T X will not be invertible.
While this extreme case should not occur, it is possible to have an imbalance in the
categories, which we will discuss later.
1.5.2 ANOVA
ANOVA, or the Analysis of Variance, is a slightly overloaded term in statistics. We
already considered ANOVA tables when comparing nested models in the hypothesis
tests of Section 1.2. However, ANOVA can also be used in the setting of the so-called
One-Way Analysis of Variance. In this case, we want to compare k samples for
equality of the means. For example, we take height measurements from randomly
selected citizens from different countries and ask whether or not there is significant
evidence to reject the claim that all nations have roughly the same height distribution.
The reason for discussing ANOVA in these notes is that it can be written in
a linear regression context as follows. Imagine that we have k different groups of
observations with sample sizes nj , j = 1, . . . , k for each group. Let yi,j be the ith
observation from the jth group where i ∈ {1, . . . , nj } and j ∈ {1, . . . , k}. The model
is
yi,j = µj + εi,j ,
which says that each observation is just some group mean, µj, with the addition of random noise.
From here, one can show that the fitted values are just ŷ_{i,j} = n_j^{−1} Σ_{l=1}^{n_j} y_{l,j}, which is the jth sample mean. Then, an F-test can be performed similar to that of
Section 1.2 to test
H0 : µ1 = . . . = µk    H1 : ∃ i, j s.t. µi ≠ µj.
To reformulate the model to align with our F-test from before, we rewrite it as a linear regression with indicator variables for group membership as the regressors; a brief sketch in R follows.
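In R, this reformulation happens automatically when the grouping variable is a factor; a sketch using the built-in PlantGrowth data, which has a response weight and a three-level factor group:
fit <- lm(weight ~ group, data = PlantGrowth)     # group is expanded into indicator variables
anova(fit)                                        # F test of equal group means
summary(aov(weight ~ group, data = PlantGrowth))  # the same one-way ANOVA via aov()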
Chapter 2
Model Assumptions
Introduction
Thus far, we have considered linear regression given some key assumptions. Namely,
for models of the form
y = β0 + β1 x1 + . . . + βp xp + ε
we assume that the noise or errors ε have zero mean, constant finite variance, and
are uncorrelated. Furthermore, in order to perform hypothesis tests (F tests, t tests, partial F-tests) and to construct confidence and prediction intervals as we did in the previous chapter, we further assume that ε has a normal distribution.
In this chapter, we will consider deviations from these assumptions, which will
lead to questions such as
1. What happens if the variance of ε is not constant?
2. What happens if ε has heavier tails than those of a normal distribution?
3. What effect can outliers have on our model?
4. How can we transform our data to correct for some of these deviations?
5. What happens if the true model is not linear?
Throughout, recall the linear model in matrix form,
Y = Xβ + ε
[See Montgomery, Peck, Vining Section 4.2.]
where Y ∈ Rn , X ∈ Rn×(p+1) , β ∈ Rp+1 , and ε ∈ Rn . The least squares estimator is
β̂ = (X^T X)^{−1} X^T Y and the vector of residuals is thus r = Y − Ŷ = (I_n − P)Y where P = X(X^T X)^{−1} X^T.
We can use the residuals to look for problems in our data with respect to
deviations from the assumptions. But first, they should be normalized in some way.
As we know from the previous chapter, the covariance of the residuals is (I − P )σ 2
and that SSres /(n − p − 1) is an unbiased estimator for the unknown variance σ 2 .
This implies that while the errors εi are assumed to be uncorrelated, the residuals
are, in fact, correlated. Explicitly, Var(ri) = σ²(1 − Pi,i) and cov(ri, rj) = −σ² Pi,j for i ≠ j,
where Pi,j is the (i, j)th entry of the matrix P. Hence, a standard normalization
technique is to write
si = ri / √((1 − Pi,i) SSres/(n − p − 1)),
which are denoted as the studentized residuals. For a linear model fit in R by the
function lm(), we can extract the residuals with resid() and extract the studentized
residuals with rstudent().
Remark 2.1.1. There are other types of residuals that can be computed such as
standardized residuals, PRESS residuals, and externally studentized residuals. These
are also used to look for outliers.
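A sketch of extracting and plotting residuals in R with the cars data; note that rstudent() returns externally studentized residuals, while rstandard() gives the internally studentized (standardized) version:
fit <- lm(dist ~ speed, data = cars)
r <- resid(fit)                          # raw residuals
s <- rstudent(fit)                       # studentized residuals
plot(fitted(fit), s, xlab = "Fitted values", ylab = "Studentized residual")
abline(h = c(-2, 2), lty = 2)            # rough flags for unusually large residuals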
Figure 2.1: On the left, a plot of some data with an outlier labelled in blue and a
least squares regression line in red. On the right, the studentized residuals for each
of the 51 data points.
The second plot, in the top right, came from a simple linear regression but with errors εi ∼ N(0, σi²) where σi² was increasing in i. Hence, the plot has an expanding look to it. The third plot in the bottom left came from a simple linear regression with the addition of a quadratic term. Fitting a model without the quadratic term still yielded significant test statistics, but failed to account for the nonlinear relationship between x and y. The final plot in the bottom right came from a simple linear regression where the errors were correlated. Specifically, cov(εi, εj) = min{xi, xj}.
Definition 2.1.2 (Quantile Function). The quantile function is the inverse of the
cumulative distribution function. That is, in the case of the normal distribution, let
Z ∼ N (0, 1). Then the CDF is Φ(z) = P (Z < z) ∈ (0, 1) for z ∈ R. The quantile
function is Φ^{−1}(t) ∈ [−∞, ∞] for t ∈ [0, 1].
Figure 2.2: Four examples of plotting residuals vs fitted values. Top left is when assumptions hold. Top right has a non-constant variance. Bottom left has nonlinear terms. Bottom right has correlated errors.
Figure 2.3: On the left, a normal Q-Q plot for data from a simple linear regression with ε having a normal distribution. On the right, a normal Q-Q plot for data from a simple linear regression with ε having a t distribution.
A normal Q-Q plot compares the ordered residuals against the theoretical normal quantiles qi = Φ^{−1}(i/(n + 1)). In R, a slightly different formula is used. Figure 2.3 compares
two normal Q-Q plots. On the left, the errors are normally distributed and the
ordered residuals roughly follow the gray line. On the right, the errors have a t
distribution which has heavier tails than expected. Hence, we see deviation from the
gray line for the extreme residuals.
Remark 2.1.3. As noted in Montgomery, Peck, & Vining, these are not always
easy to interpret and often fail to capture non-normality in the model. However, they
seem to be very popular nonetheless.
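A sketch of producing a normal Q-Q plot of the residuals in R:
fit <- lm(dist ~ speed, data = cars)
qqnorm(rstandard(fit))    # ordered standardized residuals vs normal quantiles
qqline(rstandard(fit))    # reference line through the quartiles
# plot(fit, which = 2) produces essentially the same plot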
2.2 Transformations
Often, data does not follow all of the assumptions of the ordinary least squares regression model. However, it is often possible to transform the data in order to correct for these deviations. We will consider such methods in the following subsections. One dissatisfying remark about such methods is that they are often applied “empirically”, which less euphemistically means in an ad-hoc way. Sometimes there may be genuine information suggesting certain methods for transforming data. Other times, transforms are chosen because they seem to work.
[See Montgomery, Peck, Vining Chapter 5.]
2.2.1 Variance Stabilizing
One of the major requirements of the least squares model is that the variance of the
errors is constant, which is that Var (yi ) = Var (εi ) = σ 2 for i = 1, . . . , n. Mainly,
when σ 2 is non-constant in y, problems can occur. Our goal is thus to find some
transformation T (y) so that Var (T (yi )) is constant for all i = 1, . . . , n.
Such a transformation T (·) can be determined through a tool known as the delta
method, which is beyond the scope of these notes. However, we will consider a
simplified version for our purposes. For simplicity of notation, we write EY = µ
instead of EY = β0 + β1 x1 + . . . + βp xp . Furthermore, assume that T is twice
differentiable. Then, Taylor's theorem says that
T(Y) = T(µ) + T′(µ)(Y − µ) + (T″(ξ)/2)(Y − µ)²
for some ξ between Y and µ. We can, with a little hand-waving, ignore the higher
order remainder term and just write
T (Y ) ≈ T (µ) + T 0 (µ)(Y − µ),
which implies that
ET (Y ) ≈ T (µ) and Var (T (Y )) ≈ T 0 (µ)2 Var (Y ) .
We want a transformation such that Var (T (Y )) = 1 is constant. Meanwhile, we
assume that the variance of Y is a function of the mean µ, which is Var (Y ) = s(µ)2 .
Hence, we need to solve
1 = T′(µ)² s(µ)², or equivalently T(µ) = ∫ (1/s(µ)) dµ.
Example 2.2.1. For a trivial example, assume that s(µ) = σ is already constant.
Then,
T(µ) = ∫ (1/σ) dµ = µ/σ.
Thus, T is just scaling by σ to achieve a unit variance.
Example 2.2.2. Now, consider the nontrivial example with s(µ) = √µ, which is to say that the variance of Y is a linear function of µ. Then,
T(µ) = ∫ (1/√µ) dµ = 2√µ.
This is the square root transformation, which is applied to, for example, Poisson data.
The coefficient of 2 in the above derivation can be dropped as we are not concerned
with the scaling.
[See Montgomery, Peck, Vining Section 5.2. For the delta method, see Chapter 3 of Asymptotic Statistics by A. W. van der Vaart or https://en.wikipedia.org/wiki/Delta_method. The second derivative is only required here for a more pleasant exposition.]
Given that we have found a suitable transform, we can then apply it to the data
y1 , . . . , yn to get a model of the form
T (y) = β0 + β1 x1 + . . . + βp xp + ε
where the variance of ε is constant, i.e., not a function of the regressors.
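A small simulated sketch of the square root transform from Example 2.2.2, using Poisson responses whose variance equals their mean:
set.seed(1)
x <- runif(100, 1, 10)
y <- rpois(100, lambda = 2 * x)          # Var(y) grows linearly with the mean
fit_raw  <- lm(y ~ x)                    # residual spread fans out
fit_sqrt <- lm(sqrt(y) ~ x)              # variance-stabilized response
plot(fitted(fit_raw), resid(fit_raw))
plot(fitted(fit_sqrt), resid(fit_sqrt))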
2.2.2 Linearization
Another key assumption is that the relationship between the regressors and response,
x and y, is linear. If there is reason to believe that
y = f (β0 + β1 x1 + . . . + βp xp + ε)
and that f (·) is invertible, then we can rewrite this model as
y 0 = f −1 (y) = β0 + β1 x1 + . . . + βp xp + ε
and apply our usual linear regression tools.
An example from the textbook, which is also quite common in practice, is to
assume that
y = ceβ1 x1 +...+βp xp +ε
and apply a logarithm to transform to
y 0 = log y = β0 + β1 x1 + . . . + βp xp + ε
where β0 = log c. Furthermore, if ε has a normal distribution then eε has a log-normal
distribution. This model is particularly useful when one is dealing with exponential
growth in some population.
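A simulated sketch of linearizing exponential growth by a log transform:
set.seed(2)
x <- 1:50
y <- 3 * exp(0.1 * x + rnorm(50, sd = 0.2))   # y = c * exp(beta_1 x + eps)
fit <- lm(log(y) ~ x)                         # log y = log c + beta_1 x + eps
exp(coef(fit)[1])                             # estimate of c
coef(fit)[2]                                  # estimate of beta_1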
Remark 2.2.3. Linearization by applying a function to y looks very similar to the
variance stabilizing transforms of the previous section. In fact, such transforms have
an effect on both the linearity and the variance of the model and should be used with
care. Often non-linear methods are preferred.
Sometimes it is beneficial to transform the regressors, the x's, as well. As we are not treating them as random variables, there are fewer problems to consider.
2.2.3 Box-Cox and the power transform
In short, Box-Cox is a family of transforms parametrized by some λ ∈ R, which can
be optimized via maximum likelihood. Specifically, for yi > 0, we aim to choose the
best transform of the form
yi → yi^(λ) = (yi^λ − 1)/λ for λ ≠ 0, and yi^(λ) = log yi for λ = 0,
by maximizing the likelihood as we did in Chapter 1, but with parameters β, σ 2 ,
and λ.
To do this, we assume that the transformed variables follow all of the usual least
squares regression assumptions and hence have a joint normal distribution with
f(Y^(λ)) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (yi^(λ) − Xi,· β)²).
From here, the MLEs for β and σ² are solved for as before but are now in terms of the transformed Y^(λ):
β̂ = (X^T X)^{−1} X^T Y^(λ),
σ̂² = (1/n) Σ_{i=1}^n (yi^(λ) − Xi,· β̂)² = SSres^(λ)/n.
Plugging these into the log likelihood and replacing all of the constants with some C gives
log L = −(n/2) log(2πσ̂²) − n/2 + (λ − 1) Σ_{i=1}^n log yi
      = C − (n/2) log σ̂² + log(Π_{i=1}^n yi)^{λ−1}.
[See Montgomery, Peck, Vining Section 5.4.]
Defining the geometric mean of the yi to be γ = (Π_{i=1}^n yi)^{1/n}, we have
log L = C − (n/2) log σ̂² + (n/2) log γ^{2(λ−1)}
      = C − (n/2) log(σ̂²/γ^{2(λ−1)}).
Considering the term inside the log, we have that it is just the residual sum of squares, divided by n, from the least squares regression
Y^(λ)/γ^{λ−1} = Xθ + ε.
Hence, maximizing the log likelihood over λ amounts to minimizing this residual sum of squares over λ.
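In R, the profile log likelihood over λ can be computed with boxcox() from the MASS package; a sketch using the cars data that is discussed next:
library(MASS)
fit <- lm(dist ~ speed, data = cars)
bc <- boxcox(fit, lambda = seq(-1, 1.5, by = 0.05))  # plots the profile log likelihood
bc$x[which.max(bc$y)]                                # lambda maximizing the likelihood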
2.2.4 Cars Data
As a running example, consider the cars dataset built into R, which records the speed (in mph) and stopping distance (in feet) of 50 cars. A simple linear regression of stopping distance on speed, lm(dist∼speed, data=cars), gives a significant fit. If we wanted to extrapolate a bit, we can use this model to predict the stopping
distance for a speed of 50 mph, which is 179 feet.
We could stop here and be happy with a significant fit. However, looking at the
data, there seems to be a nonlinear relationship between speed and stopping distance.
Hence, we could try to fit a model with the response being the square root of the
stopping distance: lm( sqrt(dist)∼speed, data=cars). Doing so results in
√(dist) = 1.28 + 0.32 (speed) + ε.
In this case, we similarly get a significant p-value and a slightly higher R2 = 0.709.
The prediction for the stopping distance for a speed of 50 mph is now the much
higher 302 feet.
[The cars dataset is documented at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/cars.html.]
Figure 2.4: A plot of the cars dataset with three different regressions fit to it: simple
regression; square root transform; log-log transform.
A third option is a log-log model, lm(log(dist)∼log(speed), data=cars), and the predicted stopping distance for a car travelling at 50 mph is then 254 feet. Note
that the square root transform is modelling y ∝ x² whereas the log-log transform is modelling y ∝ x^1.6, which is a slower rate of increase.
The three models considered are plotted in Figure 2.4. As we move away from the range of the regressors, i.e., 4 to 25 mph, the models begin to diverge, making extrapolation a bit dangerous. A sketch of the three fits in R follows.
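The back-transformed point predictions at 50 mph should land near the values quoted above:
fit_lin  <- lm(dist ~ speed, data = cars)
fit_sqrt <- lm(sqrt(dist) ~ speed, data = cars)
fit_log  <- lm(log(dist) ~ log(speed), data = cars)
new <- data.frame(speed = 50)
predict(fit_lin, new)            # prediction on the original scale
predict(fit_sqrt, new)^2         # undo the square root
exp(predict(fit_log, new))       # undo the log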
2.3 Polynomial Regression
A common problem is fitting a linear model of the form
y = β0 + β1 x1 + ε
to data generated by
y = β0 + β1 x1² + ε.
If this is suspected, we can look at the residuals and try to transform x, such as x → √x, in order to put this back into the linear model framework. However, we can also just fit a polynomial model to our data.
Consider the setting where we observe n data pairs (yi , xi ) ∈ R2 . We can then
attempt to fit a model of the form
y = β0 + β1 x + β2 x² + . . . + βp x^p + ε,
in which case the design matrix is the n × (p + 1) Vandermonde matrix whose ith row is (1, xi, xi², . . . , xi^p).
Overfitting
As a general rule, the degree of the polynomial model should be kept as low as
possible. High order polynomials can be misleading as they can often fit the data
quite well. In fact, given n data points, it is possible to fit an n − 1 degree polynomial
that passes through each data point. In this extreme case, all of the residuals would
be zero, but we would never expect this to be the correct model for the data.
[This design matrix arises in linear algebra as the Vandermonde matrix: https://en.wikipedia.org/wiki/Vandermonde_matrix. See also Montgomery, Peck, Vining Section 7.2.1.]
Problems can occur even when p is much smaller than n. As an example, two
models were fit to n = 50 data points generated from the model
y = 3x² + ε
with ε ∼ N (0, 1). The first was a cubic model. The second was a degree 20 model.
The first regression resulted in three significant regressors. Note that in these two
cases, orthogonal polynomials were used to maintain numerical stability.
Estimate Std. Error t value Pr(>|t|)
Intercept 4.0856 0.1498 27.276 < 2e-16 ***
Degree 1 25.0368 1.0592 23.638 < 2e-16 ***
Degree 2 4.6890 1.0592 4.427 5.84e-05 ***
The second regression resulted in five significant regressors.
Estimate Std. Error t value Pr(>|t|)
Intercept 4.08562 0.13253 30.829 < 2e-16 ***
Degree 1 25.03676 0.93709 26.717 < 2e-16 ***
Degree 2 4.68903 0.93709 5.004 2.51e-05 ***
Degree 15 -2.79288 0.93709 -2.980 0.00578 **
Degree 17 -2.67700 0.93709 -2.857 0.00784 **
Furthermore, the ANOVA table seems to indicate that the addition of these variables
may be useful.
Model 1: y ∼ poly(x, 3)
Model 2: y ∼ poly(x, 20)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 46 51.606
2 29 25.466 17 26.139 1.751 0.08954 .
However, plotting the data in Figure 2.5 with the two different regression lines
demonstrates the overfitting problem with the very high order second model.
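A sketch reproducing the flavour of this comparison with freshly simulated data (exact estimates will differ); poly() uses orthogonal polynomials for numerical stability:
set.seed(3)
x <- runif(50, 0, 2)
y <- 3 * x^2 + rnorm(50)
fit3  <- lm(y ~ poly(x, 3))
fit20 <- lm(y ~ poly(x, 20))
summary(fit3)
anova(fit3, fit20)    # does the degree 20 model add real explanatory power?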
Extrapolating
Figure 2.5 also indicates problems that can occur when extrapolating with polynomial
models. When high degrees are present, the best fit curve can change directions
quickly and even make impossible or illogical predictions.
Beyond that, even when the fitted polynomial models all trend in the same
general direction, polynomials of different degree will diverge quickly from one
another. Consider Figure 2.6 where the data was generated from
y = 2x + 3x³ + ε.
All four models displayed generated very significant F tests. However, each one will
give very different answers to predicting the value of y when x = 5.
Figure 2.5: A degree 3 and a degree 20 model fit to the same data.
Hierarchy
A hierarchical polynomial model is one such that if it contains a term of degree k, then it will contain all terms of order i = 0, 1, . . . , k − 1 as well. In practice, it is not
strictly necessary to do this. However, doing so will maintain invariance to linear
shifts in the data.
Consider the simple linear regression model y = β0 + β1 x + ε. If we were to shift
the values of x by some constant a, then
y = β0 + β1 (x + a) + ε
= (β0 + aβ1 ) + β1 x + ε
= β00 + β1 x + ε
and we still have the same model but with a modified intercept term.
Now consider the polynomial regression model y = β0 + β2 x2 + ε. If we were to
similarly shift the values of x by some constant a, then
y = β0 + β2 (x + a)2 + ε
= (β0 + a2 β2 ) + 2β2 ax + β2 x2 + ε
= β00 + β10 x + β2 x2 + ε.
Now, our model has a linear term, that is β10 x, in it, which was not there before.
In general, for a degree p model, if x is shifted by some constant a, then
y = β0 + Σ_{i=1}^p βi x^i becomes y = β0′ + Σ_{i=1}^p βi′ x^i.
Figure 2.6: Top: four different regressions fit to the data of increasing degree.
Middle: the problem of extrapolating with higher degree polynomial regressions.
Bottom: a confidence interval for the mean of the cubic model.
Thus, the hierarchical model is invariant under linear translation.
2.3.2 Piecewise Polynomials
Instead of fitting a single polynomial over the whole range of x, we can place knots t1 < . . . < tk and fit a piecewise polynomial of degree p,
y = Σ_{i=0}^p βi,0 x^i + Σ_{j=1}^k Σ_{i=0}^p βi,j (x − tj)_+^i + ε,    (2.3.1)
where 1[x > tj] is the indicator function that takes on a value of 0 when x < tj and a value of 1 otherwise, and where (x − tj)_+^i = (x − tj)^i 1[x > tj].
Remark 2.3.2. Equation 2.3.1 is equivalent to fitting a separate pth order polynomial
to the data in each interval [tj , tj+1 ]. While doing so does result in many parameters to
estimate, it will allow us to perform hypothesis tests for continuity and differentiability
of the data, which we will discuss in the next section.
For example, a model of the form
y = β0,0 + β1,0 x + Σ_{j=1}^k β1,j (x − tj)_+
is a piecewise linear model with continuity at the knots. Conversely, a model of the form
y = Σ_{j=1}^k β0,j 1[x ≥ tj]
fits a piecewise constant model and can be used to look for change points in an otherwise constant process.
In practice, spline models with only cubic terms are generally preferred as they
have a high enough order to ensure a good amount of smoothness–i.e. the curves are
twice continuously differentiable–but generally do not lead to overfitting of the data.
[See Montgomery, Peck, Vining Section 7.2.2.]
An Example
Consider a model of the form
y = 2 + 3x − 4x^5 + x^7 + ε
with ε ∼ N(0, 4), n = 50 observations, and regressor x ∈ [0, 2]. A degree 4 and
degree 7 polynomial regression were fit to the data, and the results are displayed in
the top plot of Figure 2.7. These deviate quickly from the true curve outside of the
range of the data. The middle plot of Figure 2.7 fits a piecewise linear spline with 6
knots–i.e. k = 5. The model is of the form
y = β0,1 + β1,1 (x − 0.0)+ + β1,2 (x − 0.4)+ + . . . + β1,5 (x − 1.6)+ .
The bottom plot of Figure 2.7 fits a spline with only piecewise cubic terms and a
spline with cubic and all lower order terms. The cubic-only model provides a very reasonable approximation to the data. The full spline model becomes a bit crazy.
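A sketch of fitting the linear spline by building the truncated power basis directly; the data are re-simulated here so the block is self-contained:
set.seed(4)
x <- runif(50, 0, 2)
y <- 2 + 3 * x - 4 * x^5 + x^7 + rnorm(50, sd = 2)
knots <- c(0, 0.4, 0.8, 1.2, 1.6)
basis <- sapply(knots, function(t) pmax(x - t, 0))   # columns of (x - t_j)_+ terms
fit <- lm(y ~ basis)
summary(fit)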
Figure 2.7: Top: A degree 4 and a degree 7 polynomial fit to data generated by
y = 2 + 3x − 4x^5 + x^7 + ε. Middle: A linear spline model. Bottom: A spline model
with only cubic terms and another including all lower order terms.
B-Splines
In what we considered above, the spline models were comprised of polynomials
with supports of the form [tj , ∞) for knots t1 < . . . < tk . However, there are many
different families of splines with different desirable properties. Some such families are
the B-splines, Non-uniform rational B-splines (NURBS), box splines, Bézier splines,
and many others. Here we will briefly consider the family of B-splines due to their
simplicity and popularity. Note that in the spline literature, sometimes the knots
are referred to as control points.
The ultimate goal of the B-splines is to construct a polynomial basis where
the constituent polynomials have finite support. Specifically, for an interval [a, b]
and knots a = t1 < t2 < . . . < tk < tk+1 = b, the constant polynomials will have
support on two knots such as [t1 , t2 ], the linear terms on three knots such as [t1 , t3 ],
and so on up to degree p terms, which require p + 2 knots.
B-splines can be defined recursively starting with the constant, or degree 0, terms:
Bj,0(x) = 1_{x ∈ [tj, tj+1]},
which takes a value of 1 on the interval [tj , tj+1 ] and is 0 elsewhere. From here, the
higher order terms can be written as
Bj,i(x) = ((x − tj)/(tj+i − tj)) Bj,i−1(x) + ((tj+i+1 − x)/(tj+i+1 − tj+1)) Bj+1,i−1(x).
For example, with knots {0, 1, 2, 3}, we have
Constant: Bj,0(x) = 1_{x ∈ [j, j+1]} for j = 0, 1, 2.
Linear: Bj,1(x) = (x − j) for x ∈ [j, j + 1] and (j + 2 − x) for x ∈ [j + 1, j + 2], for j = 0, 1.
Quadratic: B0,2(x) = x²/2 for x ∈ [0, 1], (−2x² + 6x − 3)/2 for x ∈ [1, 2], and (3 − x)²/2 for x ∈ [2, 3].
Here, we require that k > p as we will at least need knots t1 , . . . , tp+1 for a pth
degree polynomial. The total number of terms in the regression, which is the number of parameters βi,j to estimate, is
k + (k − 1) + . . . + (k − p).
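In R, a B-spline basis is available through bs() in the splines package; a sketch with the same kind of simulated data and three interior knots:
library(splines)
set.seed(5)
x <- runif(50, 0, 2)
y <- 2 + 3 * x - 4 * x^5 + x^7 + rnorm(50, sd = 2)
fit <- lm(y ~ bs(x, knots = c(0.5, 1.0, 1.5), degree = 3))  # cubic B-spline basis
summary(fit)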
Figure 2.8: An example of linear and quadratic B-splines.
2.3.3 Interacting Regressors
Polynomial models can also involve more than one regressor, for example the second order model
y = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 + ε.
Here we have linear and quadratic terms for x1 and x2 as well as an interaction term β12 x1 x2.
Fitting such models to the data follows from what we did for single variable polynomials. In this case, the number of interaction terms can grow quite large in practice. With k regressors, x1, . . . , xk, there will be (k choose p) interaction terms of degree p assuming p < k. This leads to the topic of Response Surface Methodology, which is a subtopic of the field of experimental design.
We will not consider this topic further in these notes. However, it is worth noting that in R, it is possible to fit a linear regression with interaction terms using the simple syntax lm( y∼x1*x2*x3 ), where the symbol * replaces the usual + from before; a small sketch follows.
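A tiny sketch with the built-in mtcars data; the * syntax expands into main effects plus the interaction:
fit <- lm(mpg ~ wt * hp, data = mtcars)   # same as mpg ~ wt + hp + wt:hp
summary(fit)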
[See Montgomery, Peck, Vining Section 7.4 and https://en.wikipedia.org/wiki/Response_surface_methodology.]
2.4 Influence and Leverage
The overall intuition for this section is that each observation does not have an equal
influence on the estimation of β̂. If a given observed regressor x lies far from the
other observed values, it can have a strong effect on the least squares regression line.
The goal of this section is to identify such points or subset of points that have a
large influence on the regression.
If we use the R command lm() to fit a linear model to some data, then we
can use the command influence.measures() to compute an array of diagnostic
metrics for each observation to test its influence on the regression. The function
influence.measures() computes DFBETAS, DFFITS, covariance ratios, Cook’s
distances and the diagonal elements of the so-called hat matrix. We will look at each
of these in the following subsections.
Remark 2.4.1 (Warning). You may notice that the word “heuristic” appears often
in the following subsections when it comes to identifying observations with significant
influence. Ultimately, these are rough guidelines based on the intuition of past
statisticians and should not be taken as strict rules.
2.4.1 The Hat Matrix
For the linear model
y = β0 + β1 x1 + . . . + βp xp,
recall the projection, or hat, matrix P = X(X^T X)^{−1} X^T from Chapter 1, which maps the observed responses onto the fitted values, Ŷ = PY.
Remark 2.4.2. The ith diagonal entry Pi,i is referred to as the leverage of the ith
observation in Montgomery, Peck, & Vining. However, in R, leverage is Pi,i /(1−Pi,i ).
This is sometimes referred to as the leverage factor.
[See Montgomery, Peck, Vining Chapter 6 and Section 6.2.]
2.4.2 Cook's D
Cook’s D or distance computes the distance between the vector of estimated pa-
rameters on all n data points, β̂, and the vector of estimated parameters on n − 1
data points, β̂(i) where the ith observation has been removed. Intuitively, if the ith
observation has a lot of influence on the estimation of β, then the distance between
β̂ and β̂(i) should be large.
For a linear model with p + 1 parameters and a sample size of n observations,
the usual form for Cook’s D is
Di = ((β̂(i) − β̂)^T X^T X(β̂(i) − β̂)/(p + 1)) / (SSres/(n − p − 1)).
This is very similar to the confidence ellipsoid from Section 1.3. However, this is
not a usual F statistic, so we do not compute a p-value as we have done before. Instead, some heuristics are used to determine what a large value is. Some authors suggest looking for Di greater than 1 or greater than 4/n.
Cook's D can be written in different forms. One, in terms of the diagonal entries of P, is
Di = (si²/(p + 1)) (Pi,i/(1 − Pi,i))
where si is the ith studentized residual. Another form of this measure compares the
distance between the usual fitted values on all of the data Ŷ = X β̂ and the fitted
values based on all but the ith observation, Ŷ(i) = X β̂(i) . That is,
Di = ((Ŷ(i) − Ŷ)^T(Ŷ(i) − Ŷ)/(p + 1)) / (SSres/(n − p − 1)).
Note that like Ŷ , the vector Ŷ(i) ∈ Rn . The ith entry in the vector Ŷ(i) is the
predicted value of y given xi .
2.4.3 DFBETAS
The intuition behind DFBETAS is similar to that for Cook’s D. In this case, we
consider the normalized difference between β̂ and β̂(i) . What results is an n × (p + 1)
matrix whose ith row is
DFBETASi = (β̂ − β̂(i)) / √((X^T X)^{−1}_{i,i} SSres(i)/(n − p − 2)) ∈ R^{p+1}
[See Montgomery, Peck, Vining Sections 6.3 and 6.4, and https://en.wikipedia.org/wiki/Cook%27s_distance.]
where SSres(i) is the sum of the squared residuals for the model fit after removing the ith data point and where (X^T X)^{−1}_{i,i} is the ith diagonal entry of the matrix (X^T X)^{−1}.
2.4.4 DFFITS
The DFFITS value is very similar to the previously discussed DFBETAS. In this
case, we are concerned with by how much the fitted values change when the ith
observation is removed. Explicitly,
DFFIT = (Ŷ − Ŷ(i)) / √((X^T X)^{−1}_{i,i} SSres(i)/(n − p − 2)) ∈ R^n.
The claim is that DFFIT is affected by both leverage and prediction error. The heuristic is to investigate any observation with DFFIT greater in magnitude than 2√((p + 1)/n).
Figure 2.9: A simple regression in red fit to the original 50 data points, and a simple
regression in blue fit to the original 50 and 2 anomalous data points.
2.4.6 Influence Measures: An Example
As an example, 50 data points were generated from a simple linear regression model where ε ∼ N(0, 1) and x ∈ [0, 2]. To this dataset, we add 2 anomalous points at
(4, 8) and at (1, 0). Thus, we fit a simple linear regression to the original 50 data
points and also to the new set of 52 data points resulting in the red and blue lines,
respectively, in Figure 2.9.
We can use the R function influence.measures() to compute a matrix contain-
ing the DFBETAS, DFFITS, covariance ratios, Cook’s D, and leverage for each data
point. Applying the recommended thresholds in the previous sections results in the
following extreme points, which are labelled in Figure 2.9:
Measure: flagged observations
DFBETAS (β0): 10, 51, 52
DFBETAS (β1): 37, 51
DFFITS: 37, 51, 52
Covariance ratio (large): 52
Covariance ratio (small): 51
Cook's D: 37, 51, 52
Leverage: 51
From these diagnostics, we see, for example, that observations 37, 51, and 52 have a strong influence on the estimation of Ŷ.
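A sketch of the computation in R, where x and y stand for the 52 observations described above:
fit <- lm(y ~ x)                  # the 52-point fit (blue line in Figure 2.9)
infl <- influence.measures(fit)   # DFBETAS, DFFITS, covariance ratios, Cook's D, hat values
summary(infl)                     # prints only the observations flagged as influential
which(cooks.distance(fit) > 4 / length(y))   # the 4/n heuristic for Cook's D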
2.5 Weighted Least Squares
In Chapter 1, we computed the least squares estimator as the vector β̂ such that
β̂ = arg min_{β̃ ∈ R^{p+1}} Σ_{i=1}^n (Yi − Xi,· β̃)².
If instead the errors have non-constant variances, Var(εi) = σi², then each term can be weighted by its variance:
β̂ = arg min_{β̃ ∈ R^{p+1}} Σ_{i=1}^n (Yi − Xi,· β̃)²/σi².
In this setting, dividing by σi² is the “weight” that gives this method the name weighted least squares.
Proceeding as in Chapter 1, we take a derivative with respect to the jth β̃j to get
∂/∂β̃j Σ_{i=1}^n (Yi − Xi,· β̃)²/σi² = −2 Σ_{i=1}^n (Yi − Xi,· β̃) Xi,j/σi²
= −2 Σ_{i=1}^n Yi Xi,j/σi² + 2 Σ_{i=1}^n Xi,· β̃ Xi,j/σi²
= −2 Σ_{i=1}^n Yi′ Xi,j′ + 2 Σ_{i=1}^n Xi,·′ β̃ Xi,j′
where Yi′ = Yi/σi and Xi,j′ = Xi,j/σi. Hence, the least squares estimator is as before,
β̂ = (X′^T X′)^{−1} X′^T Y′ = (X^T W X)^{−1} X^T W Y,
where W is the diagonal weight matrix with Wi,i = 1/σi².
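A simulated sketch of weighted least squares via the weights argument of lm(), with weights taken proportional to 1/σi²:
set.seed(6)
x <- runif(100, 1, 5)
y <- 1 + 2 * x + rnorm(100, sd = 0.5 * x)   # error standard deviation grows with x
fit_wls <- lm(y ~ x, weights = 1 / x^2)     # weights proportional to 1 / Var(eps_i)
summary(fit_wls)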
[See https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares.]
Chapter 3
Model Building
Introduction
The following sections will discuss various topics regarding constructing a good
representative regression model for your data. Three main topics will be considered.
First, multicollinearity deals with the problem of linear relationships between your regressors. We already know that we require the columns of the design matrix
to be linearly independent in order to solve for the least squares estimate. However, it
is possible to have near dependencies among the columns. This can lead to numerical
stability issues and unnecessary redundancy among the regressors.
Second, there are many different variable selection techniques in existence. Given a large number of regressors that can be included in a model, the question is, which should and which should not be included? We will discuss various techniques such
as forward and backward selection as well as different tools for comparing models.
Third, penalized regression will be discussed. This section introduces two modern
and quite powerful approaches to linear regression: ridge regression from the 1970’s
and LASSO from the 1990’s. Both arise from modifying how we estimate the
parameter vector β̂. Up until now, we have chosen β̂ to minimize the sum of the
squared error. Now, we will add a penalty term to this optimization problem, which
will encourage choices of β̂ with small-in-magnitude or just zero entries.
3.1 Multicollinearity
The concept of multicollinearity is intuitively simple. Say we have a model of the
form
y = β0 + β1 x1 + β2 x2 + ε.
[See Montgomery, Peck, Vining Section 9.3.]
This results in a design matrix X whose ith row is (1, xi,1, xi,2).
Now, suppose that the regressor x2 is itself well modelled by the simple regression
x2 = α0 + α1 x1 + ε.
If this simple regression has a strong fit, e.g. a significant F-test or R² value, then the addition of the regressor x2 to the
original model is unnecessary as almost all of the explanatory information provided
by x2 with regards to predicting y is already provided by x1 . Hence, the inclusion of
x2 in our model is superfluous.
Taking a more mathematical approach, it can be shown that such near lin-
ear dependencies lead to a very high variance for the least squares estimator β̂.
Furthermore, the magnitude of the vector is much larger than it should be.
Assuming that the errors have a covariance matrix Var (ε) = σ 2 In , then we have
from before that
Var(β̂) = Var((X^T X)^{−1} X^T Y) = σ²(X^T X)^{−1}.
With some effort, it can be shown that the diagonal entries of the matrix (X T X)−1
are equal to (1 − R02 )−1 , . . . , (1 − Rp2 )−1 where Rj2 is the coefficient of determination
for the model
xj = α0 + Σ_{i≠j} αi xi + ε,
which is trying to predict the jth regressor from the other p − 1 regressors. If the
remaining regressors are good predictors for xj , then the value Rj2 will be close to 1.
Hence,
Var(β̂j) = σ²/(1 − Rj²)
will be very large.
Furthermore, this implies that the expected Euclidean distance between β̂ and β
will be quite large as well. Indeed, we have
E[(β̂ − β)^T(β̂ − β)] = Σ_{i=0}^p E[(β̂i − βi)²] = Σ_{i=0}^p Var(β̂i) = σ² tr((X^T X)^{−1})
where tr (·) denotes the trace of a matrix–i.e. the sum of the diagonal entries. Hence,
if at least one of the Rj2 is close to 1, then the expected distance from our estimator
to the true β will be quite large.
The trace of a matrix is also equal to the sum of its eigenvalues. Hence, if we
denote the eigenvalues of X T X by λ1 , . . . , λp+1 , then
tr((X^T X)^{−1}) = Σ_{i=1}^{p+1} λi^{−1}.
If all regressors i ≠ j are orthogonal to the jth regressor, then Rj² = 0 and the term (X^T X)^{−1}_{j,j} = 1. Under multicollinearity, (X^T X)^{−1}_{j,j} ≫ 1. Hence, the confidence interval is expanded by a factor of √((X^T X)^{−1}_{j,j}) when the regressors are not orthogonal.
We can alternatively examine the eigenvalues of the matrix X T X. Recall that
finding the least squares estimator is equivalent to solving a system of linear equations
of the form
X T X β̂ = X T Y.
To measure the stability of a solution to a system of equations under small perturbations, a term referred to as the condition number is used. It is
κ = λmax /λmin
where λmax and λmin are the maximal and minimal eigenvalues, respectively. Ac-
cording to Montgomery, Peck, & Vining, values of κ less than 100 are not significant
whereas values greater than 1000 indicate severe multicollinearity.
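A sketch of these diagnostics in R with the mtcars data; kappa() works from the singular values of X, so squaring it gives roughly the eigenvalue ratio of X^T X:
X <- model.matrix(mpg ~ wt + hp + disp, data = mtcars)   # design matrix with intercept
ev <- eigen(crossprod(X))$values
max(ev) / min(ev)                         # condition number of X^T X
kappa(X, exact = TRUE)^2                  # the same quantity via singular values of X
summary(lm(disp ~ wt + hp, data = mtcars))$r.squared     # R_j^2 for the regressor disp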
[See Montgomery, Peck, Vining Sections 9.4.2 & 9.4.3. The condition number arises in more generality in numerical analysis: https://en.wikipedia.org/wiki/Condition_number.]
If the minimal eigenvalue is very small, we can use the corresponding eigenvector
to understand the nature of the linear dependency. That is, consider the eigenvector
u = (u0 , u1 , . . . , up ) for the matrix X T X corresponding to the eigenvalue λmin . Recall
that this implies that
(X T X)u = λmin u ≈ 0,
which is approximately zero because λmin is close to zero. Hence, for regressors
1, x1 , . . . , xp ,
u0 + u1 x1 + . . . + up xp ≈ 0.
Thus, we can use the eigenvectors with small eigenvalues to get a linear relationship
between the regressors.
Remark 3.1.1. If you are familiar with the concept of the Singular Value Decomposition, then you could alternatively consider the ratio between the maximal and minimal singular values of the design matrix X. Furthermore, you can also analyze the singular vectors instead of the eigenvectors.
Remark 3.1.2. A common thread among all of these alternatives is that they result
in a biased estimate for β unlike the usual least squares estimator. Often in statistics,
we begin with unbiased estimators, but can often achieve a better estimator by adding
a small amount of bias. This is the so-called bias-variance tradeoff.7
[See https://en.wikipedia.org/wiki/Singular-value_decomposition, Montgomery, Peck, Vining Section 9.5.4 for more on PC regression, and https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff.]
3.2 Variable Selection
In general, if we have p regressors, we may want to build a model consisting only
of the best regressors for modelling the response variable. In some sense, we could
compare all possible subset models. However, there are many issues with this, which
we will address in the following subsections. First, what are the effects of removing
regressors from your model? Second, how do we compare models if they are not
nested? Third, there are 2p possible models to consider. Exhaustively fitting and
comparing all of these models may be computationally impractical or impossible.
Hence, how do we find a good subset of the regressors?
3.2.1 Subset Models

Suppose we have p + q candidate regressors and consider the full model

y = β0 + β1 x1 + . . . + β_{p+q} x_{p+q} + ε.

This can be written in terms of the design matrix and partitioned over the two sets
of regressors as

Y = Xβ + ε = Xp βp + Xq βq + ε,

where Xp contains the columns for the intercept and the first p regressors, Xq contains
the columns for the remaining q regressors, and βp and βq are partitioned accordingly.
Bias may increase

The first concern with the reduced model is that the estimator β̃p, obtained by regressing
y on Xp alone, can be biased. Writing A = (Xp^T Xp)^{-1} Xp^T Xq, we have

E β̃p = βp + A βq,

so β̃p is unbiased only if βq = 0 or if the two sets of regressors are orthogonal, i.e. Xp^T Xq = 0.
On the other hand, letting β̂p denote the corresponding block of the full-model least
squares estimator, the variances satisfy

Var(β̂p) − Var(β̃p) = σ² A [Xq^T (I − Pp) Xq]^{-1} A^T,

where Pp = Xp(Xp^T Xp)^{-1} Xp^T. The matrix A[Xq^T (I − Pp) Xq]^{-1} A^T is symmetric
positive semi-definite, so the variance of β̂p can only be larger than that of β̃p.
If MSE(β̂p) − MSE(β̃p) is positive semi-definite, then the mean squared error has
decreased upon the removal of the regressors in Xq.
Footnote 10: This expression can be derived via the formula for inverting a block matrix; see https://en.wikipedia.org/wiki/Invertible_matrix#Blockwise_inversion
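To make the bias formula above concrete, here is a small simulation sketch; the sample size, coefficients, and correlation structure below are arbitrary illustrative choices, not part of the original notes.

    ## Omitted-variable bias: dropping a correlated regressor biases the estimate.
    set.seed(1)
    n  <- 200
    x1 <- rnorm(n)
    x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)      # x2 correlated with x1
    b  <- c(1, 2, 3)                         # intercept, beta_p = 2, beta_q = 3

    est <- replicate(2000, {
      y <- b[1] + b[2] * x1 + b[3] * x2 + rnorm(n)
      coef(lm(y ~ x1))["x1"]                 # reduced model: x2 omitted
    })
    mean(est)                                # differs noticeably from b[2] = 2

    ## Theoretical mean beta_p + A beta_q with A = (Xp^T Xp)^{-1} Xp^T Xq.
    Xp <- cbind(1, x1); Xq <- cbind(x2)
    A  <- solve(crossprod(Xp), crossprod(Xp, Xq))
    b[2] + A[2, 1] * b[3]                    # close to the simulated mean above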
3.2.2 Model Comparison
Footnote 11: See Montgomery, Peck, Vining Section 10.1.3.
We have already compared models in Chapter 1 with the partial F-test. However,
for that test to make sense, we require the models to be nested–i.e. the larger model
must contain all of the parameters of the smaller model. But, given a model
y = β0 + β1 x1 + . . . + βp xp + ε,
we may want to compare two different subset models that are not nested. Hence, we
have some different measures to consider.
Note that ideally, we would compare all possible subset models. However,
given p regressors, there are 2p different models to consider, which will often be
computationally infeasible. Hence, in Section 3.2.3, we will consider two approaches
to model selection that avoid this combinatorial problem.
Remark 3.2.1. To avoid confusion and awkward notation, assume that all subset
models will always contain the intercept term β0.
Mallows’ Cp
We can also compare different models by computing Mallows' Cp. The goal of this
value is to choose the model that minimizes the mean squared prediction error,

MSPE = (1/σ²) Σ_{i=1}^{n} E[(ỹ_i − E y_i)²],
where ỹ_i is the ith fitted value from the submodel and E y_i is the mean response under
the true model. Furthermore, let ŷ_i be the ith fitted value for the full model. The MSPE
is thus the
expected squared difference between what the submodel predicts and what the real
value is. As usual with mean squared errors in statistics, we rewrite this in terms of
the variance plus the squared bias, which is
MSPE = (1/σ²) Σ_{i=1}^{n} E[(ỹ_i − E ỹ_i + E ỹ_i − E y_i)²]
     = (1/σ²) Σ_{i=1}^{n} [ E(ỹ_i − E ỹ_i)² + (E ỹ_i − E y_i)² ]
     = (1/σ²) Σ_{i=1}^{n} [ Var(ỹ_i) + bias(ỹ_i)² ].
Recall that the variance of the fitted values for the full model is Var(ŷ) = σ² Px
where Px = X(X^T X)^{-1} X^T. For a submodel with p1 < p regressors and design matrix
Xp1, we similarly get Var(ỹ) = σ² Xp1 (Xp1^T Xp1)^{-1} Xp1^T. As Xp1 (Xp1^T Xp1)^{-1} Xp1^T is a
rank p1 + 1 projection matrix, we have that

Σ_{i=1}^{n} Var(ỹ_i) = σ² tr( Xp1 (Xp1^T Xp1)^{-1} Xp1^T ) = σ² (p1 + 1).
For the bias term, consider the expected residual sum of squares for the submodel:
E(SSres(p1)) = E Σ_{i=1}^{n} (y_i − ỹ_i)²
             = E Σ_{i=1}^{n} (y_i − E ỹ_i + E ỹ_i − E y_i + E y_i − ỹ_i)²
             = Σ_{i=1}^{n} [ Var(r̃_i) + (E ỹ_i − E y_i)² ]
             = (n − p1 − 1)σ² + Σ_{i=1}^{n} bias(ỹ_i)².
Combining the bias and the variance terms derived above results in Mallows’ Cp
statistic for a submodel with p1 < p regressors:
Cp1 = E(SSres(p1))/σ² − n + 2p1 + 2 ≈ SSres(p1) / ( SSres/(n − p − 1) ) − n + 2p1 + 2.

Here, we estimate E(SSres(p1)) by SSres(p1) and estimate σ² by SSres/(n − p − 1).
50
Remark 3.2.2. Note that if we compute Mallows’ Cp for the full model, we get
Cp = SSres / ( SSres/(n − p − 1) ) − n + 2p + 2 = (n − p − 1) − n + 2p + 2 = p + 1.
Hence, Mallows’ Cp in this case is just the number of parameters in the model. In
general, we want to find submodels with Cp value smaller than p + 1.
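As a sketch, Mallows' Cp can be computed directly from two fitted lm objects by evaluating the approximation above; the function and the model objects named full and sub are hypothetical illustrations, not part of the original notes.

    ## Mallows' Cp for a submodel `sub` relative to the full model `full`.
    mallows_cp <- function(sub, full) {
      ss_sub <- sum(resid(sub)^2)                        # SS_res(p1)
      sigma2 <- sum(resid(full)^2) / df.residual(full)   # SS_res / (n - p - 1)
      ss_sub / sigma2 - nobs(sub) + 2 * length(coef(sub))
    }

    ## Usage with hypothetical fits:
    ## full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
    ## sub  <- lm(y ~ x1 + x3, data = dat)
    ## mallows_cp(sub, full)    # compare against p + 1 = 5 for the full model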
Information Criteria
Information criteria are concerned with quantifying the amount of information in a
model. With such a measure, we can choose a model that optimizes this measurement.
A main requirement for these methods is that the response y is the same. Hence,
we should not use the measures below when comparing transformed models–e.g.
different linearized models–without the necessary modifications.
The first such measure is the Akaike Information Criterion or AIC, which is a
measure of the entropy of a model. Its general definition is

AIC = −2 log(likelihood) + 2(number of parameters),

which, for the linear model with normally distributed errors, becomes (up to an additive
constant)

AIC = n log( SSres / n ) + 2(p + 1),

where p + 1 is for the p regressors and 1 intercept term. Thus, adding more regressors
will decrease SSres but will increase p. The goal is to find a model with the minimal
AIC. This can be shown to give the same ordering as Mallows' Cp when the errors
are normally distributed.
The second such measure is the closely related Bayesian Information Criterion or
BIC, which, in general, is

BIC = −2 log(likelihood) + log(n)(number of parameters),

so that, compared to the AIC, the penalty per parameter grows with the sample size.
Remark 3.2.3. Using AIC versus using BIC for model selection can sometimes
result in different final choices. In some cases, one may be preferred, but often both
can be tried and discrepancies, if they exist, can be reported.
There are also other information criteria that are not as common in practice,
such as the Deviance Information Criterion (DIC) and the Focused Information
Criterion (FIC).
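In R, the built-in AIC() and BIC() functions compute these criteria for fitted models; the two submodels and the data frame dat below are hypothetical and only illustrate the comparison.

    ## Comparing two non-nested submodels of the same response (illustrative).
    md_a <- lm(y ~ x1 + x3, data = dat)
    md_b <- lm(y ~ x2 + x4, data = dat)

    AIC(md_a, md_b)    # penalty of 2 per parameter; smaller is better
    BIC(md_a, md_b)    # penalty of log(n) per parameter; smaller is better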
3.2.3 Forward and Backward Selection
Footnote 12: See Montgomery, Peck, Vining Section 10.2.2.
Ideally, we choose a measure for model selection from the previous section and then
compare all possible models. However, for p possible regressors, this will result in 2p
models to check, which may be computationally infeasible. Hence, there are iterative
approaches that can be effective.
Forward selection is the process of starting with the constant model
y = β0 + ε
and choosing the best of the p regressors with respect to the model selection criterion.
This gives
y = β0 + β1 x1 + ε.
This process will continue to add terms to the model as long as doing so improves
the criterion, for example, by computing the AIC at each step.
Backwards selection is the reverse of forward selection. In this case, the algorithm
begins with the full model,
y = β0 + β1 x1 + . . . + βp xp + ε,
and iteratively removes the regressor that gives the biggest improvement in the model
selection criterion. If the best choice is to remove no regressors, then the process
terminates.
A third option is stepwise selection, which incorporates both forward and back-
ward steps. In this case, we begin with the constant model as in forward selection.
However, at every step, we choose either to add a new regressor to our model or
remove one that is already in the model depending on which choice improves the
criterion the most.
As an example, suppose we simulate data from the model

y = 2 + 3x1 − 4x5 + x7 + ε

with candidate regressors x1, . . . , x7, and use the R function step( ) to choose a best
model with respect to AIC.
For our simulated data, the backward selection method, step( md1, direction='backward' ),
decided that the best AIC of 78.154 is achieved by not removing any regressors
from the model. The forward selection method adds one regressor at a time for as long
as the AIC improves. The stepwise method runs very similarly to the forward method,
but makes one deletion in the final step:
Step   Model                              AIC
0      y = β0                             142.5
1      y = β0 + β1 x1                     118.8
2      y = β0 + β1 x1 + β7 x7             109.8
3      y = β0 + β1 x1 + β5 x5 + β7 x7     73.4
4      y = β0 + β5 x5 + β7 x7             71.5
This method has more choices than forward selection, which can only grow the model,
and hence is able to achieve an even better AIC value.
The R package leaps is able to consider all possible subset models. Running this
procedure on the same data yields the top three models, from best to third best, as

Model 1:  y = β0 + β5 x5 + β7 x7
Model 2:  y = β0 + β6 x6 + β7 x7
Model 3:  y = β0 + β5 x5 + β6 x6
Hence, in this case, the stepwise method was able to find the optimal model.
Remark 3.2.4. The VIF for x5 versus x7 in this example is 42.43, which is quite
large. Hence, model selection did not eliminate all of the multicollinearity present in
the model.
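The selection runs described above might be carried out roughly as follows; the data frame dat and the exact calls are an illustrative sketch, not a record of the output reported above.

    ## Forward, backward, and stepwise selection by AIC (illustrative calls).
    md0 <- lm(y ~ 1, data = dat)                 # constant model
    md1 <- lm(y ~ ., data = dat)                 # full model with all regressors

    step(md1, direction = "backward")
    step(md0, direction = "forward", scope = formula(md1))
    step(md0, direction = "both",    scope = formula(md1))

    ## Exhaustive comparison of all subset models with the leaps package.
    library(leaps)
    all_sub <- regsubsets(y ~ ., data = dat, nvmax = 7)
    summary(all_sub)$which                       # best model of each size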
3.3 Penalized Regressions
No matter how we design our model, thus far we have always computed the least
squares estimator, β̂, by minimizing the sum of squared errors
β̂ = arg min_{β ∈ R^{p+1}} Σ_{i=1}^{n} (y_i − X_i^T β)².
This is an unbiased estimator for β. However, as we have seen previously, the variance
of this estimator can be quite large. Hence, we shrink the estimator towards zero,
adding bias but decreasing the variance. In the context of regression, we add a
penalty term to the above minimization to get a new estimator

β̂^pen = arg min_{β ∈ R^{p+1}} { Σ_{i=1}^{n} (y_i − X_i^T β)² + penalty(β) },

where the penalty grows as the entries of β grow, thus encouraging smaller choices for
the estimated parameters. We will consider some different types of penalized regression.
Remark 3.3.1. We generally do not want to penalize the intercept term β0 . Often
to account for this, the regressors and response are centred–i.e. Y is replaced with
Y − Ȳ and each Xj is replaced with Xj − X̄j for j = 1, . . . , p–in order to set the
intercept term to zero.
3.3.1 Ridge Regression

Ridge regression penalizes the sum of the squared coefficients. The ridge estimator is

β̂_λ^R = arg min_{β} { Σ_{i=1}^{n} (y_i − X_i^T β)² + λ Σ_{j=1}^{p} β_j² }

for any λ ≥ 0. When λ = 0, we have the usual least squares estimator. As λ grows,
the β's are more strongly penalized.
To solve for β̂_λ^R, we proceed as before with the least squares estimator β̂ by
setting the partial derivatives equal to zero:

0 = ∂/∂β_k [ Σ_{i=1}^{n} (y_i − β_1 x_{i,1} − . . . − β_p x_{i,p})² + λ Σ_{j=1}^{p} β_j² ].
Footnote 13: The general idea of shrinkage is attributed to Stein (1956); see https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator
Footnote 14: In R, the glmnet package has a lot of functionality for fitting different types of penalized generalized linear models; see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
Footnote 15: See Montgomery, Peck, Vining Section 9.5.3.
Footnote 16: Hoerl, A.E. and R.W. Kennard (1970).
This results in the system of equations

X^T Y − (X^T X) β̂_λ^R − λ β̂_λ^R = 0

with the ridge estimator being β̂_λ^R = (X^T X + λI)^{-1} X^T Y, where I is the identity
matrix of the same dimension as X^T X.
The matrix X^T X is positive semi-definite even when p > n, i.e. when the number of
parameters exceeds the sample size. Hence, any positive value λ will make X^T X + λI
invertible as it adds the positive constant λ to all of the eigenvalues. Increasing the
value of λ will increase the numerical stability of the estimator, i.e. decrease the
condition number of the matrix. Furthermore, it will decrease the variance of the
estimator while increasing the bias. It can also be shown that the bias of β̂_λ^R is

E β̂_λ^R − β = −λ (X^T X + λI)^{-1} β,
which implies that the estimator does, in fact, shrink towards zero as λ increases.
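As a sketch, the closed form above can be evaluated directly; here X is assumed to be a matrix of centred regressors and y a centred response, in line with Remark 3.3.1, and the names are illustrative.

    ## Ridge estimator from its closed form (X and y assumed centred).
    ridge_beta <- function(X, y, lambda) {
      solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
    }

    ## The coefficients shrink towards zero as lambda grows:
    ## sapply(c(0, 1, 10, 100), function(l) ridge_beta(X, y, l))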
3.3.2 Best Subset Regression

Best subset selection can be written as a penalized regression by penalizing the number
of nonzero coefficients, i.e. taking penalty(β) = λ Σ_{j=1}^{p} 1(β_j ≠ 0). The main problem
with this method is that the optimization is non-convex and becomes severely difficult
to compute in practice. This is why the forwards and backwards selection methods
are used for variable selection.
3.3.3 LASSO
The last method we consider is the Least Absolute Shrinkage and Selection Operator,
which is commonly referred to as just LASSO. This was introduced by Tibshirani
(1996) and has since been applied to countless areas of statistics. The form is quite
similar to ridge regression with one small but profound modification,
β̂_λ^L = arg min_{β ∈ R^p} { Σ_{i=1}^{n} (y_i − X_i^T β)² + λ Σ_{j=1}^{p} |β_j| },
which is that the penalty term is now the sum of the absolute values instead of a
sum of squares.
The main reason why this technique is popular is that it combines shrinkage
methods like ridge regression with variable selection and still results in a convex
optimization problem. Delving into the properties of this estimator requires convex
analysis and will be left for future investigations.
55
3.3.4 Elastic Net
The elastic net regularization method combines both ridge and lasso regression into
one methodology. Here, we include a penalty term for each of the two methods:
β̂_λ^{EN} = arg min_{β ∈ R^p} { Σ_{i=1}^{n} (y_i − X_i^T β)² + λ1 Σ_{j=1}^{p} |β_j| + λ2 Σ_{j=1}^{p} β_j² }.

This method has two tuning parameters λ1 ≥ 0 and λ2 ≥ 0. In the R library glmnet,
a mixing parameter α and a scale parameter λ are specified to get

β̂_λ^{EN} = arg min_{β ∈ R^p} { Σ_{i=1}^{n} (y_i − X_i^T β)² + λ Σ_{j=1}^{p} [ α|β_j| + ((1 − α)/2) β_j² ] }.
The intuition behind this approach is to combine the strengths of both ridge and lasso
regression. Namely, ridge regression shrinks the coefficients towards zero reducing
the variance while lasso selects a subset of the parameters to remain in the model.
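Using the glmnet package mentioned in the footnote above, all three penalties can be fit through the mixing parameter α; the matrix x and vector y below are hypothetical inputs, so this is only a sketch of the workflow.

    ## Ridge, LASSO, and elastic net via glmnet (illustrative inputs x, y).
    library(glmnet)
    fit_ridge <- glmnet(x, y, alpha = 0)      # pure squared-coefficient penalty
    fit_lasso <- glmnet(x, y, alpha = 1)      # pure absolute-value penalty
    fit_enet  <- glmnet(x, y, alpha = 0.5)    # an equal mix of the two

    ## Choose lambda by cross-validation and inspect the selected coefficients.
    cv_lasso <- cv.glmnet(x, y, alpha = 1)
    coef(cv_lasso, s = "lambda.min")

    ## Coefficient paths as in the figure below.
    plot(fit_ridge, xvar = "lambda")
    plot(fit_lasso, xvar = "norm")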
[Figure: Coefficient paths for ridge regression (coefficients versus λ) and for the LASSO (coefficients versus the L1 norm of the coefficient vector).]
Chapter 4
Introduction
A generalized linear model (GLM) extends the usual linear model to cases where
the response has a proposed distribution that can be very different from the normal
distribution. In this case, we assume that the response comes from an exponential
family, and the form of the model is

g(Ey) = β0 + β1 x1 + . . . + βp xp.

Here, g(·) is referred to as the link function. As these are still parametric models, the
parameters β0, . . . , βp are typically estimated via maximum likelihood. These models
can be fit in R via the glm() function. While such models can be studied in full
generality, for this course, we will only consider two specific GLMs: the logistic
regression and the Poisson regression.
4.1 Logistic Regression

In a logistic regression, the response y_i takes values in {0, 1} and is modelled as a
Bernoulli(π_i) random variable. The goal is then to model E y_i = π_i as a function of
x_i^T β. While different link functions are possible, the standard choice is the logit link
g(π) = log( π/(1 − π) ), under which

E y_i = π_i = exp(x_i^T β) / (1 + exp(x_i^T β)),

where the ratio π_i/(1 − π_i) is the odds ratio or simply the odds. Furthermore, the
quantity log( π_i/(1 − π_i) ) is often referred to as the log odds.
The estimator β̂ can be computed numerically via maximum likelihood as we
assume the underlying distribution of the data. Hence, we also get fitted values of
the form

ŷ_i = π̂_i = exp(x_i^T β̂) / (1 + exp(x_i^T β̂)).

However, as noted already, the variance is not constant. Hence, for residuals of the
form y_i − π̂_i to be useful, we will need to normalize them. The Pearson residual for
the logistic regression is

r_i = (y_i − π̂_i) / √( π̂_i (1 − π̂_i) ).
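A minimal sketch of fitting such a model in R, assuming a hypothetical data frame dat with a 0/1 response y and regressors x1 and x2:

    ## Logistic regression fit, fitted probabilities, and Pearson residuals.
    fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

    pi_hat <- fitted(fit)                         # estimated probabilities pi_hat_i
    r_pear <- residuals(fit, type = "pearson")    # (y_i - pi_hat_i) / sqrt(pi_hat_i (1 - pi_hat_i))
    summary(fit)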
4.1.2 Testing model fit
Estimation of the parameters β is achieved by finding the maximum likelihood
estimator. The log likelihood in the logistic regression with Bernoulli data is
log L(β) = log Π_{i=1}^{n} π_i^{y_i} (1 − π_i)^{1 − y_i}
         = Σ_{i=1}^{n} [ y_i log π_i + (1 − y_i) log(1 − π_i) ]
         = Σ_{i=1}^{n} [ y_i log( π_i/(1 − π_i) ) + log(1 − π_i) ]
         = Σ_{i=1}^{n} [ y_i x_i^T β − log(1 + e^{x_i^T β}) ].
Theorem 4.1.1 (Wilks' Theorem). Consider two statistical models with parameter
spaces Θ0 and Θ1 such that Θ0 ⊂ Θ1 (i.e. the models are nested) and with likelihoods
L0(θ) and L1(θ). Then −2 times the log likelihood ratio has an asymptotic chi-squared
distribution with degrees of freedom equal to the difference in the dimensionality of
the parameter spaces. That is,

−2 log(LR) = −2 log( sup_{θ∈Θ0} L0(θ) / sup_{θ∈Θ1} L1(θ) ) →_d χ²( |Θ1| − |Θ0| ).
In the case of the logistic regression, we can use this result to construct an
analogue of the F-test from the normal-errors setting. That is, we can compute the
likelihood of the constant model

log( π_i/(1 − π_i) ) = β0                                  (4.1.1)

and compare it to the likelihood of the full model

log( π_i/(1 − π_i) ) = β0 + β1 x1 + . . . + βp xp,         (4.1.2)

and claim from Wilks' theorem that −2 log(LR) is approximately distributed χ²(p)
assuming the sample size is large enough.
In R, the glm() function does not return a p-value as the lm() function does for
the F test. Instead, it provides two values, the null and residual deviances, which
are respectively
null deviance = −2 log( L(null model) / L(saturated model) ),  and
residual deviance = −2 log( L(full model) / L(saturated model) ).
Here, the null model is from Equation 4.1.1 and the full model is from Equation 4.1.2.
The saturated model is the extreme model with p = n parameters, which perfectly
models the data. It is, in some sense, the largest possible model used as a baseline.
In practice, the deviances can be thought of like the residual sum of squares
from the ordinary linear regression setting. If the decrease is significant, then the
regressors have predictive power over the response. Furthermore,
(null deviance) − (residual deviance) = −2 log( L(null model) / L(full model) ),
which has an asymptotic χ2 (p) distribution. Such a goodness of fit test can be run
in R using anova( model, test="LRT" ). Note that there are other goodness-of-fit
tests possible for such models.
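As a sketch, the deviance comparison can be carried out directly on a fitted glm object; fit below denotes the hypothetical logistic fit from the earlier sketch.

    ## Likelihood ratio (deviance) test of the full model against the null model.
    dev_drop <- fit$null.deviance - fit$deviance     # (null) - (residual) deviance
    df_drop  <- fit$df.null - fit$df.residual        # number of regressors p
    pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

    ## Equivalently, via anova() against the intercept-only model.
    fit0 <- update(fit, . ~ 1)
    anova(fit0, fit, test = "LRT")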
Figure 4.1: Plots of the 21 data points, the true Bernoulli probabilities in red,
and the predicted probabilities in blue. Top plot, yi ∼ Bernoulli (πi ). Bottom plot,
yi ∼ Binomial (mi , pi ).
4.2 Poisson Regression
The Poisson regression is another very useful GLM to know, which models counts or
occurrences of rare events. Examples include modelling the number of defects in a
manufacturing line or predicting lightning strikes across a large swath of land.
The Poisson distribution with parameter λ > 0 has probability mass function
f(x) = e^{−λ} λ^x / x! for x = 0, 1, 2, . . .. It can often be thought of as a limiting case
of the binomial distribution as n → ∞ and p → 0 such that np → λ. If we want a
linear model for a response variable y
such that yi has a Poisson (λi ) distribution, then, similarly to the logistic regression,
we apply a link function. In this case, one of the most common link functions used
is the log. The resulting model is
log Eyi = β0 + β1 x1 + . . . + βp xp
where Eyi = λi .
Similarly to the logistic regression case, and to GLMs in general, the parameters
β are estimated via maximum likelihood based on the assumption that y has a
Poisson distribution. Furthermore, the deviances can be computed and a likelihood
ratio test can be run to assess the fit of the model. The log likelihood for a Poisson
regression is
log L(β) = log Π_{i=1}^{n} e^{−λ_i} λ_i^{y_i} / y_i!
         = Σ_{i=1}^{n} [ −λ_i + y_i log λ_i − log(y_i!) ]
         = Σ_{i=1}^{n} [ −e^{x_i^T β} + y_i x_i^T β − log(y_i!) ].
As an example, consider n = 21 simulated data points with

y_i ∼ Poisson( e^{2 x_i} ).

A Poisson regression with log link can be fit to the data using R's glm() function
with the argument family=poisson(link=log). The summary() output reports a
p-value of 10^{-8} for β̂1 = 1.95. A plot of the true and predicted curves is
displayed in Figure 4.2.
Footnote 5: See Montgomery, Peck, Vining Section 13.3.
Figure 4.2: Poisson regression (blue) and true rate parameter (red) fit to n = 21
data points.
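A sketch along the lines of the example above; the grid of x values and the random seed are arbitrary illustrative choices, so the fitted numbers will not exactly match those reported.

    ## Simulate and fit a Poisson regression with log link (illustrative).
    set.seed(1)
    x <- seq(-1, 1, length.out = 21)
    y <- rpois(21, lambda = exp(2 * x))      # y_i ~ Poisson(exp(2 x_i))

    fit <- glm(y ~ x, family = poisson(link = log))
    summary(fit)                             # estimated slope should be near 2

    ## Predicted mean curve on a fine grid, as in Figure 4.2.
    x_new      <- data.frame(x = seq(-1, 1, length.out = 200))
    lambda_hat <- predict(fit, newdata = x_new, type = "response")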
Appendix A
Distributions
Introduction
The following is a short overview of some of the most common probability distributions
encountered in the context of linear regression.
A.1 Normal distribution

The normal distribution can be extended to the multivariate normal distribution
for a vector X ∈ Rp where we write X ∼ N (µ, Σ). Here, µ ∈ Rp is the mean vector
while Σ ∈ Rp×p is a p × p matrix that is symmetric and positive semi-definite. This
distribution also has elliptical symmetry. The form of the pdf, assuming Σ is positive
definite, is
f_{N(µ,Σ)}(x) = ( (2π)^p det(Σ) )^{-1/2} exp( −(1/2) (x − µ)^T Σ^{-1} (x − µ) ).
Otherwise, X will have a degenerate normal distribution, which is still normal but
supported on a subspace of dimension less than p.
The multivariate normal (MVN) distribution has some very nice characterizations.
A vector X = (X1 , . . . , Xp ) is MVN if and only if every linear combination of the Xi
is univariate normal. That is, for all a ∈ R^p, a · X ∼ N(µ̃, σ̃²). Hence, it is worth
emphasizing that if X is a vector that is marginally normal (i.e. every component
Xi is univariate normal), then the joint distribution is not necessarily multivariate
normal.
In general, if two random variables X, Y ∈ R are independent, then we have
that cov(X, Y) = 0. However, the reverse implication is not necessarily true. One
exception is the multivariate normal case: if (X, Y) is jointly MVN and cov(X, Y) = 0,
then X and Y are independent.
A.2 Chi-Squared distribution

A chi-squared random variable with ν degrees of freedom, written χ²(ν), arises for
integer ν as the distribution of a sum of ν independent squared standard normal
random variables. Its mean and variance are ν and 2ν, respectively. Note that while
the degrees of freedom parameter is very often a positive integer, it can take on any
positive real value and still be valid. In fact, the chi-squared distribution is a specific
type of the more general gamma distribution, which will not be discussed in these notes.
This random variable is quite useful in the context of linear regression with respect
to how it leads to the t and F distributions discussed in the following subsections.
However, it also arises in many other areas of statistics including Wilks’ Theorem
regarding the asymptotic distribution of the log likelihood ratio, goodness-of-fit in
multinomial hypothesis testing, and testing for independence in a contingency table.
A.3 t distribution
The t distribution is sometimes referred to as Student’s t distribution in recognition
of its, at the time, anonymous progenitor William Sealy Gosset working at the
Guinness brewery. It most notably arises in the context of estimation of a normal
random variable when the variance is unknown as occurs frequently in these lecture
notes.
Let Z ∼ N(0, 1) and let V ∼ χ²(ν) be independent random variables. Then, we
say that

T = Z / √(V/ν) ∼ t(ν)
has a t distribution with ν degrees of freedom. Such a distribution can be thought
of as a heavier tailed version of the standard normal distribution. In fact, t(1) is the
Cauchy distribution with pdf f(x) = 1/( π(1 + x²) ) for x ∈ R.
A.4 F distribution
The F distribution arises often in the context of linear regression when a comparison
is made between two sources of variation. Let X ∼ χ2 (ν) and Y ∼ χ2 (η) be
independent random variables, then we write that
F = (X/ν) / (Y/η) ∼ F(ν, η)
has an F distribution with degrees of freedom ν and η. Due to this form of the
distribution, if F ∼ F (ν, η), then F −1 ∼ F (η, ν). The F distribution is supported
on the positive real line. The F and t distributions are related by the fact that if
T ∼ t (ν), then T 2 ∼ F (1, ν).
Appendix B
1. For a matrix A ∈ R^{n×p} with row i and column j entry denoted a_{i,j}, the
transpose of A is A^T ∈ R^{p×n} with row i and column j entry a_{j,i}. That is, the
indices are swapped.
11. If X, Y ∈ Rn are random vectors, then
cov(X, Y) = E[ (X − EX)(Y − EY)^T ] ∈ R^{n×n}.
12. If X, Y ∈ R^n are random vectors and A, B ∈ R^{m×n} are non-random real valued
matrices, then
cov(AX, BY) = A cov(X, Y) B^T ∈ R^{m×m}.