Chapter 2
2. The interest here is to see if a mother’s height can be used to predict her daughter’s
height.
3. For the ith pair of mother and daughter, let Yi denote the response variable, the daugh-
ter’s height (which we want to predict), and Xi the observed covariate or explanatory
variable, the mother’s height (which we want to use in the prediction), i = 1, . . . , n.
The problem is to model the relationship between Yi and Xi so that we can use X to
predict Y .
4. The first step of any statistical modelling exercise is to examine the data graphically,
or by calculating informative summaries.
5. The scatter plot of the data is provided in Figure 5. What conclusions can we draw
from it?
7. ...
8. ...
9. To interpret a scatterplot, think of dividing the scatterplot into narrow vertical strips so that each vertical strip contains the observations on samples which have nearly the same mother's height. Within each vertical strip, we see a distribution of daughter's height, which is the conditional distribution of daughter's height given that the mother's height is in the strip.
[Figure: scatterplot of daughter's height (hts$Dheight) against mother's height (hts$Mheight).]
10. We need to describe how these distributions change as we move across the vertical
strips. The most e↵ective graphical way of seeing how the distributions change is to
draw a boxplot for the data in each strip. Examining these boxplots shows the features
we noted above.
[Figure: side-by-side boxplots of daughter's height within vertical strips of mother's height (Mheight).]
12. What can you conclude from these side by side boxplots?
13. ...
14. ...
15. In practice, we learn to interpret scatterplots without going through this formal exer-
cise, but it does give useful insight into the conditional nature of the interpretation.
16. Return to the original scatterplot. If there were no variability, the points in the plot would fall exactly on a curve describing the exact relationship between $Y$ and $X$; the variability is reflected in the scatter of points about this curve.
17. An interesting question is: what is the source of the variability? One explanation is that the variability is due to the measurement process and stochastic inaccuracies which arise in it. Alternatively, the daughter's height may be affected by other variables which we have not held constant, and their hidden variation induces variation in the data. An entirely pragmatic view is that the source of the variability does not matter; variability is simply there and we must deal with it.
• the intercept $\beta_0$ is the mean value of daughter's height when mother's height is 0 (not really a practical thing to interpret here, though).
• the slope $\beta_1$ is the effect on the mean daughter's height of a unit change in mother's height (e.g. how many inches we expect the daughter's height to increase if the mother is an inch taller).
4. To complete the model, we need to say more about the random variation.
• Provided the observations have been made from unrelated experiments, it seems
reasonable to begin by assuming independence.
• It is difficult to say much about the shape of the conditional distribution of Yi
given Xi : we usually start by making the optimistic assumption that it is normal.
The discrepant points may make this somewhat implausible – it is important to
check later.
5. Putting the four assumptions together, we have
$$Y_i \mid X_i \stackrel{\text{ind}}{\sim} N(\beta_0 + \beta_1 X_i,\ \sigma^2),$$
or, equivalently,
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
with $\epsilon_i \stackrel{\text{ind}}{\sim} N(0, \sigma^2)$ independent of $X_i$. Here $\epsilon_i$ is the random difference between the observation $Y_i$ and its conditional mean $\beta_0 + \beta_1 X_i$. Thus $\epsilon_i$ is a random variable called an error (not meaning a mistake, though) with a distribution called the error distribution. In this form, the model is a decomposition of the observation into a structural part (the regression line) plus random variation (the error).
[Figure: data points scattered about the population regression line $E(Y \mid X) = \beta_0 + \beta_1 X$, with $\epsilon_i$ the vertical deviation of an observation from the line.]
Figure 2.4: The structural and random components of the underlying population
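To make this decomposition concrete, the following R sketch simulates data from the model; the parameter values ($\beta_0 = 30$, $\beta_1 = 0.5$, $\sigma = 2$) are arbitrary illustrative choices, not estimates from the heights data.

# Simulate Y_i = beta0 + beta1*X_i + eps_i with eps_i ~ N(0, sigma^2)
set.seed(1)
n     <- 200
beta0 <- 30    # illustrative intercept
beta1 <- 0.5   # illustrative slope
sigma <- 2     # illustrative error standard deviation
x   <- runif(n, 55, 70)                 # explanatory variable values
eps <- rnorm(n, mean = 0, sd = sigma)   # random errors
y   <- beta0 + beta1 * x + eps          # structural part plus error
plot(x, y)
abline(a = beta0, b = beta1, lty = 2)   # the true line E(Y | X) = beta0 + beta1 * X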
8. Therefore, given a pair of variables, it matters which we choose as the response and which as the explanatory variable. Often, this choice is clear, as for example when
• the aim of the modelling exercise is to predict one of the variables using the other
(as in the current example).
• one of the variables is controlled (fixed by the experimenter) while the variability
of the other is observed.
2. Least Squares Estimation works well when the error distribution is normal. This entails estimating $(\beta_0, \beta_1)$ by $(\hat\beta_0, \hat\beta_1)$ which minimises the sum $S$ of squared vertical distances between the observed response values $y_i$ and the regression line $\beta_0 + \beta_1 x_i$, at the corresponding observed values of the explanatory variable. Clearly
$$S = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2,$$
so we have
$$\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)$$
and
$$\frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^n x_i\,(y_i - \beta_0 - \beta_1 x_i).$$
Setting these derivatives to zero at $(\hat\beta_0, \hat\beta_1)$ gives the normal equations
$$\bar y - \hat\beta_0 - \hat\beta_1 \bar x = 0 \quad\text{or}\quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
and
$$\sum_{i=1}^n x_i y_i - n\bar x\,\hat\beta_0 - \hat\beta_1 \sum_{i=1}^n x_i^2 = 0.$$
3. The first equation shows that the fitted line passes through the point $(\bar x, \bar y)$.
4. It follows that
$$0 = \sum_{i=1}^n x_i y_i - n\bar y\bar x - \hat\beta_1\!\left(\sum_{i=1}^n x_i^2 - n\bar x^2\right)$$
$$= \sum_{i=1}^n (x_i - \bar x)\,y_i - \hat\beta_1 \sum_{i=1}^n (x_i - \bar x)^2$$
$$= \sum_{i=1}^n (x_i - \bar x)\,y_i - \hat\beta_1 (n-1)s_X^2,$$
say, where $s_X^2 = (n-1)^{-1}\sum_{i=1}^n (x_i - \bar x)^2$, and hence
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)\,y_i}{(n-1)s_X^2} = \sum_{i=1}^n w_{1i}\,y_i,$$
where
$$w_{1i} = \frac{x_i - \bar x}{(n-1)s_X^2}.$$
Substituting back into the first equation, we obtain
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = \sum_{i=1}^n w_{0i}\,y_i$$
where
$$w_{0i} = \frac{1}{n} - \frac{\bar x\,(x_i - \bar x)}{(n-1)s_X^2}.$$
• Similarly
$$S_{XY} = \sum_{i=1}^n (x_i - \bar x)\,y_i = \sum_{i=1}^n x_i y_i - n\bar x\bar y,$$
so that $\hat\beta_1 = S_{XY}/\{(n-1)s_X^2\}$ and
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$
The following data set is used as a numerical example in later calculations:
 y     x
 4    30
10    50
15    70
21    90
21   100
22   120
29   140
34   160
39   180
41   200
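These data can be used to check the formulas above numerically; the sketch below computes $\hat\beta_1$ and $\hat\beta_0$ directly and compares them with R's lm() fit (the same x, y and fit objects are reused in later sketches in this chapter).

x <- c(30, 50, 70, 90, 100, 120, 140, 160, 180, 200)
y <- c(4, 10, 15, 21, 21, 22, 29, 34, 39, 41)
Sxy <- sum((x - mean(x)) * y)     # sum of (x_i - xbar) y_i
Sxx <- sum((x - mean(x))^2)       # (n - 1) s_X^2
beta1.hat <- Sxy / Sxx            # approximately 0.215
beta0.hat <- mean(y) - beta1.hat * mean(x)
fit <- lm(y ~ x)
coef(fit)                         # agrees with beta0.hat and beta1.hat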
$$\frac{(n-2)S^2}{\sigma^2} \sim \chi^2_{n-2}.$$
3. In practice we use
$$(n-2)s^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 = S_{YY} - \hat\beta_1^2\,S_{XX},$$
where $S_{YY} = \sum_{i=1}^n y_i^2 - n\bar y^2$ and, similarly, $S_{XX} = \sum_{i=1}^n x_i^2 - n\bar x^2$.
• Fitted Values are the estimates $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ of the conditional means at the actual observed values of the explanatory variable.
• Residuals are the differences $r_i = y_i - \hat y_i$ between the observed responses and the fitted values.
2. The decomposition
$$y_i = \hat\beta_0 + \hat\beta_1 x_i + r_i,$$
is analogous to the model decomposition into a structural part (regression line) plus
random variation (error) and shows that the residuals can be regarded as estimates of
the errors.
3. Note that the least squares estimates given above are single observations of corresponding random variables (estimators) given by
$$\hat\beta_0 = \sum_{i=1}^n w_{0i} Y_i, \qquad \hat\beta_1 = \sum_{i=1}^n w_{1i} Y_i.$$
4. These are linear functions of the responses so it will be straightforward to write down their sampling distributions. Similarly, the fitted values $\hat y_i$ and residuals $r_i$ are observed values of random variables
$$\hat Y_i = \sum_{j=1}^n w_{0j} Y_j + x_i \sum_{j=1}^n w_{1j} Y_j, \qquad i = 1, \dots, n$$
and
$$R_i = Y_i - \hat Y_i, \qquad i = 1, \dots, n,$$
respectively.
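The 'linear function of the responses' point can be checked directly: the sketch below rebuilds $\hat\beta_0$ and $\hat\beta_1$ from the weights $w_{0i}$ and $w_{1i}$, reusing the x, y and fit objects from the numerical example sketch above.

w1 <- (x - mean(x)) / sum((x - mean(x))^2)     # w_1i
w0 <- 1 / length(x) - mean(x) * w1             # w_0i
c(sum(w0 * y), sum(w1 * y))                    # equal to coef(fit)
c(sum(w1), sum(w1 * x), sum(w0), sum(w0 * x))  # 0, 1, 1, 0 -- identities used later for the sampling distributions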
2.6 Diagnostics
Once we have fitted the model and computed the residuals from our fit, we can use the
residuals to explore the quality of the fit and the properties of the error distribution. The
most effective methods for carrying out this exploration are graphical. These are collectively
known as diagnostic methods.
(a) Curvature: the band of points does not remain horizontal but curves as X
changes. This indicates that the conditional means are not linearly related to the
explanatory variable.
Figure 2.5: Curvature in a residual plot
(b) Heteroscedasticity: a fan-shaped residual scatter indicates that the variability in the error term $\epsilon_i$ in the model is not constant.
(c) Discrepant responses (outliers): Y-values which do not follow the same linear model as the rest of the sample Y-values. The effect of such values can be to bias the sample estimates away from the parameters $\beta_0$ and $\beta_1$ of the linear model satisfied by the rest of the sample values and to inflate the error.
2. How can you interpret the Anscombe plot for the heights data set? Is it a random
scatter?
3. ...
4. The information in the residual plot $(\hat y, y - \hat y)$ is also contained in the scatterplot $(x, y)$.
Residual plots are graphically more effective than scatterplots because the dominating
linear relationship has been removed and it is easier to look for deviations from the
horizontal rather than from the fitted regression line.
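For example, the residual plot in Figure 2.8 below can be reproduced with a few lines of R, assuming the heights data frame hts with columns Mheight and Dheight as in the earlier scatterplot.

hts.lm <- lm(Dheight ~ Mheight, data = hts)    # fit the simple linear regression
plot(fitted(hts.lm), resid(hts.lm), xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 2)                         # a roughly horizontal random scatter supports the model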
5. Residual plots are also useful in problems involving more than two variables where
scatterplots of pairs of variables are ine↵ective. In such cases, non-linearity of the
relationship between the response and an explanatory variable may not be obvious
Figure 2.7: Discrepant points in a residual plot
Figure 2.8: The Anscombe residual plot for the heights data set
from the Anscombe plot, and this plot should be supplemented by plotting the residuals
against each of the individual explanatory variables – more details later.
approximately a straight line, if the normal model holds. Departures from linearity
correspond to departures from the model.
[NB: Always check which axis the observed residuals are on, since this affects interpretation. The R software puts the ordered observed residuals, $r_i^{\text{ord}}$, on the y-axis and the theoretical quantiles, $\Phi^{-1}\!\left(\frac{i - \frac{3}{8}}{n + \frac{1}{4}}\right)$, on the x-axis. Thus, from a plot in R we can estimate $\sigma$ as the slope and estimate $\mu$ as the intercept.]
Figure 2.10: The normal probability plot of the residuals for the heights data set
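Figure 2.10 can be reproduced as follows (hts.lm is the heights fit from the residual-plot sketch above).

qqnorm(resid(hts.lm))   # ordered residuals against theoretical normal quantiles
qqline(resid(hts.lm))   # reference line through the quartiles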
4. How can you interpret the normal probability plot for the heights data set? Do the points lie on a straight line?
5. ...
2.6.5 Remedies
1. Transformation: If the residual plot shows nonlinearity and/or heteroscedasticity or
the normal probability plot shows a long-tailed distribution, we may be able to find
a transformation which linearises the relationship, stabilises the variability and makes
the error distribution closer to normal. The most widely used transformations (in increasing
order of strength) are the cube root, square root (for count data), logarithmic (for
multiplicative relationships) or negative reciprocal (for time and rates). However, there
is no guarantee that a transformation can be found to remedy any of these defects,
never mind all simultaneously. Also, when a transformation has been used, the final
model fitted is no longer in terms of the original variables. This latter objection is
often unimportant but there are calibration problems in which it may matter.
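As a small sketch of the idea (reusing the x and y vectors of the numerical example purely for illustration, since both are positive), compare the residual plots before and after a log transformation of both variables.

raw.lm <- lm(y ~ x)             # raw-scale fit
log.lm <- lm(log(y) ~ log(x))   # log-log fit, appropriate for multiplicative relationships
par(mfrow = c(1, 2))
plot(fitted(raw.lm), resid(raw.lm), main = "raw scale")
plot(fitted(log.lm), resid(log.lm), main = "log-log scale")
par(mfrow = c(1, 1))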
2. Reacting to Discrepant observations: First try to establish whether the points
are mistakes, observations from a distinct, identifiable subgroup in the population or
due to changes in the data collection process. Often, unless we have access to the
people who collected the data, there is no additional information available. On the
basis of whatever information we can collect, we can correct mistakes, build more
complicated models (which either change the distributional assumptions or treat the
subsets separately) or exclude the subsets from the analysis. The choice between these
depends on whether we want to try to mix the possibly disparate structures into one
model or we are interested in the pattern followed by the bulk (majority) of the data.
We can exclude points in two ways.
• Robust methods: Use robust fitting methods to fit the model to the bulk of
the data.
• Exclusion: Remove the points and then refit using least squares. (The disad-
vantage of this approach is that the identification of the points is not always
straightforward and the inferences do not then reflect this facet of our analysis.)
3. Multiple analyses can be presented to show the range of possible scientific outcomes.
It is important to identify discrepant points and then explain and justify the actions
we take in relation to them.
2. We have
• $|\rho| \le 1$;
• $|\rho| \approx 1$ indicates a strong linear relationship,
• $|\rho| \approx 0$ indicates a weak linear relationship.
3. Positive (negative) values of $\rho$ indicate that $Y$ tends to increase (decrease) with increasing $X$. The quantity $\rho$ is a population parameter (a property of the joint probability distribution of $Y$ and $X$) which can be estimated, given a representative sample from the population, by
$$\hat\rho = \frac{s_{XY}}{(n-1)\,s_Y s_X}$$
where $s_X^2$ and $s_Y^2$ are the sample variances, and
$$s_{XY} = \sum_{i=1}^n (y_i - \bar y)(x_i - \bar x),$$
whatever the nature of the explanatory variable. The algebraic relationship ignores differences in context, interpretation etc. so should not be overinterpreted.
6. The correlation coefficient is sometimes used in conjunction with a regression model
to summarise the quality of the fit. A low value of s2 implies small estimated error
variance, and hence a good fit. It is more meaningful to look at s2 relative to how
well we could fit the data if we ignored the explanatory variable than to compare s2 to
the theoretical but unattainable ‘ideal value’ of zero. Without information about the
explanatory variable, we would have to model the response values as identically distributed about a common mean, and the error variance estimate would be $s_Y^2$, the sample variance of $Y$, as in Chapter 1. We can therefore measure the fit by
$$\text{Adj}\,R^2 = \frac{s_Y^2 - s^2}{s_Y^2} = 1 - \frac{s^2}{s_Y^2}.$$
7. The linear relationship is strong if Adj $R^2 \approx 1$ and weak if Adj $R^2 \approx 0$. Using the analysis of variance decomposition,
$$s^2 = \frac{n-1}{n-2}\left(s_Y^2 - \hat\beta_1^2 s_X^2\right) = \frac{n-1}{n-2}\,(1 - \hat\rho^2)\,s_Y^2$$
and hence
$$\text{Adj}\,R^2 = \frac{n-1}{n-2}\,\hat\rho^2 - \frac{1}{n-2}.$$
8. It follows that when $n$ is large, Adj $R^2 \approx \hat\rho^2$. It is common to denote $\hat\rho^2$ by $R^2$, and then
$$R^2 = \frac{n-2}{n-1}\,\text{Adj}\,R^2 + \frac{1}{n-1} = 1 - \frac{(n-2)s^2}{(n-1)s_Y^2}.$$
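These identities are easy to verify numerically; the sketch below uses the numerical example objects (x, y, fit) from earlier.

r   <- cor(x, y)                                   # sample correlation rho-hat
n   <- length(y)
s2  <- sum(resid(fit)^2) / (n - 2)                 # error variance estimate s^2
s2y <- var(y)                                      # sample variance s_Y^2
c(R2 = r^2, lm.R2 = summary(fit)$r.squared)        # R^2 equals rho-hat^2
c(adjR2 = 1 - s2 / s2y, lm.adjR2 = summary(fit)$adj.r.squared)
(n - 1) / (n - 2) * r^2 - 1 / (n - 2)              # same value as adjusted R^2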
9. The sample correlation coefficient $\hat\rho$ (and consequently Adj $R^2$ and $R^2$) has limitations as a summary of the strength of relationships. In particular,
• $\hat\rho$ is sensitive to outliers,
• $\hat\rho$ is sensitive to skewed data distributions,
• $\hat\rho$ takes the same numerical value for very different scatterplots, and
• a correlation of zero means no linear relationship but does not mean no relationship at all.
[Figure: scatterplots of $Y$ against $X$ illustrating the limitations listed above.]
10. Thus the summary provided by a correlation coefficient can only be meaningfully in-
terpreted if we also have a scatterplot of the data. This problem can be overcome by
taking the initial steps in the modelling process (plotting the data, choosing trans-
formations etc.) but this is rarely done as the correlation coefficient is widely, and
incorrectly, viewed as a statistic which alleviates the need to examine the data. In any
case, once we have begun the modelling process, we may as well complete it properly.
3. Since $Y_i \mid X_i \stackrel{\text{ind}}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)$ and $\hat\beta_1 = \sum_{i=1}^n w_{1i} Y_i$, we have:
Shape $\hat\beta_1$ is normally distributed, as $\hat\beta_1$ is a linear combination of the normally distributed random variables $Y_1, \dots, Y_n$.
Mean
$$E(\hat\beta_1) = \sum_{i=1}^n w_{1i} E(Y_i) = \sum_{i=1}^n w_{1i} (\beta_0 + \beta_1 x_i) = \beta_1$$
as
$$\sum_{i=1}^n w_{1i} = 0, \qquad \sum_{i=1}^n w_{1i} x_i = 1,$$
so $\hat\beta_1$ is an unbiased estimator of $\beta_1$.
Variance
$$\mathrm{Var}(\hat\beta_1) = \sum_{i=1}^n w_{1i}^2\,\mathrm{Var}(Y_i) = \sigma^2 \sum_{i=1}^n w_{1i}^2 = \frac{\sigma^2}{(n-1)s_X^2}$$
as
$$\sum_{i=1}^n w_{1i}^2 = \frac{1}{(n-1)s_X^2}.$$
5. The variance of the slope estimator is made as small as possible for a fixed $n$ by making $s_X^2$ as large as possible. This can be achieved by putting the $X$s at either end of their feasible range. Note however that this optimal design leaves no possibility of checking the linearity of the model. The effect of discrepant explanatory variables is to increase $s_X^2$ and hence decrease $\mathrm{Var}(\hat\beta_1)$. This will result in a misleadingly small variance if the corresponding $Y$ value is incorrectly measured.
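A small Monte Carlo sketch (with arbitrary illustrative parameter values) confirms the variance formula: the empirical variance of the slope estimates over repeated samples should be close to $\sigma^2/\{(n-1)s_X^2\}$.

set.seed(2)
xd    <- c(30, 50, 70, 90, 100, 120, 140, 160, 180, 200)  # a fixed design (the numerical example x's)
beta0 <- -1; beta1 <- 0.2; sigma <- 2                      # assumed true values (illustrative)
slopes <- replicate(5000, {
  ysim <- beta0 + beta1 * xd + rnorm(length(xd), sd = sigma)
  coef(lm(ysim ~ xd))[2]
})
c(empirical = var(slopes), theoretical = sigma^2 / sum((xd - mean(xd))^2))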
as a measure of precision of the estimator $\hat\beta_1$. As this standard deviation depends on the unknown parameter $\sigma$, we use the standard error
$$\mathrm{s.e.}(\hat\beta_1) = \sqrt{\frac{s^2}{(n-1)s_X^2}},$$
where $\sigma$ has been replaced by the estimate $s$ (see Section 2.4), to summarise the precision of $\hat\beta_1$.
7. Return to the numerical example introduced before. Show that $\mathrm{s.e.}(\hat\beta_0) = 1.214$ and $\mathrm{s.e.}(\hat\beta_1) = 0.00964$.
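A sketch of the calculation (x, y and fit as in the numerical example):

s   <- summary(fit)$sigma                               # residual standard error
Sxx <- sum((x - mean(x))^2)                             # (n - 1) s_X^2
c(se.beta1 = s / sqrt(Sxx),                             # approximately 0.00964
  se.beta0 = s * sqrt(1 / length(x) + mean(x)^2 / Sxx)) # approximately 1.214
coef(summary(fit))[, "Std. Error"]                      # the same values from lm()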
as
$$\sum_{i=1}^n w_{0i} = 1, \qquad \sum_{i=1}^n w_{0i}\,x_i = \bar x - \bar x = 0,$$
and
$$\mathrm{Var}(\hat\beta_0) = \sum_{i=1}^n w_{0i}^2\,\mathrm{Var}(Y_i) = \sigma^2 \sum_{i=1}^n w_{0i}^2 = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{(n-1)s_X^2}\right)$$
as
$$\sum_{i=1}^n w_{0i}^2 = \sum_{i=1}^n \left(\frac{1}{n^2} - \frac{2\,\bar x (x_i - \bar x)}{n(n-1)s_X^2} + \frac{\bar x^2 (x_i - \bar x)^2}{(n-1)^2 s_X^4}\right) = \frac{1}{n} + \frac{\bar x^2}{(n-1)s_X^2}.$$
9. It follows that
$$\hat\beta_0 \sim N\!\left(\beta_0,\ \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{(n-1)s_X^2}\right)\right)$$
and
$$\mathrm{s.e.}(\hat\beta_0) = s\left(\frac{1}{n} + \frac{\bar x^2}{(n-1)s_X^2}\right)^{1/2}.$$
10. Finally,
$$\sum_{i=1}^n w_{0i} w_{1i} = \sum_{i=1}^n \left(\frac{1}{n} - \frac{\bar x (x_i - \bar x)}{(n-1)s_X^2}\right)\frac{x_i - \bar x}{(n-1)s_X^2} = -\frac{\bar x}{(n-1)s_X^2},$$
so that
$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \mathrm{Cov}\!\left(\sum_{i=1}^n w_{0i} Y_i,\ \sum_{i=1}^n w_{1i} Y_i\right) = \sum_{i=1}^n w_{0i} w_{1i}\,\mathrm{Var}(Y_i) = \sigma^2 \sum_{i=1}^n w_{0i} w_{1i} = -\frac{\sigma^2\,\bar x}{(n-1)s_X^2}.$$
4. However, it is usually of greater interest to evaluate a confidence interval for one of the regression parameters $\beta_0$ or (particularly) $\beta_1$.
5. The sampling distributions of $\hat\beta_0$ and $\hat\beta_1$ derived in Section 2.8 cannot be used directly in confidence interval construction, as they both depend on the unknown variance $\sigma^2$, as well as the parameter of interest. However, because we have the result
$$\frac{(n-2)S^2}{\sigma^2} \sim \chi^2_{n-2}$$
6. As, for any given $n$, the $t_{n-2}$ distribution is known, we can find $c$ such that
$$P(-c \le t(\hat\beta_1) \le c) = 1 - \alpha.$$
7. Hence, as
$$-c \le t(\hat\beta_1) \le c \ \Rightarrow\ \hat\beta_1 - c\,\mathrm{s.e.}(\hat\beta_1) \le \beta_1 \le \hat\beta_1 + c\,\mathrm{s.e.}(\hat\beta_1),$$
we have constructed a confidence interval for $\beta_1$. Exactly the same argument can be applied for $\beta_0$. This is of the general form:
8. Interpretation: The random interval will contain the true value of $\beta_1$ with probability $1 - \alpha$.
9. Any particular interval does or does not contain $\beta_1$; the probability is over the set of possible intervals.
10. Practical consequence: $100(1-\alpha)\%$ of confidence intervals contain $\beta_1$ and $100\alpha\%$ do not.
11. Return to the numerical example introduced before. Show that the 95% confidence interval for $\beta_1$ is $(0.193, 0.237)$.
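In R this is a one-liner (fit is the numerical example fit):

confint(fit, level = 0.95)["x", ]   # approximately (0.193, 0.237)
# or by hand, with n - 2 = 8 degrees of freedom:
coef(fit)["x"] + c(-1, 1) * qt(0.975, df = 8) * coef(summary(fit))["x", "Std. Error"]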
$$E(\hat Y_0) = \beta_0 + \beta_1 x_0$$
so this estimator is unbiased, and
$$\mathrm{Var}(\hat Y_0) = \mathrm{Var}(\hat\beta_0) + x_0^2\,\mathrm{Var}(\hat\beta_1) + 2x_0\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1)$$
$$= \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{(n-1)s_X^2}\right) + \frac{x_0^2\,\sigma^2}{(n-1)s_X^2} - \frac{2x_0\bar x\,\sigma^2}{(n-1)s_X^2}$$
$$= \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_X^2}\right) = \sigma^2 h_{00},$$
with $h_{00} = \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_X^2}$, so that
$$\hat\beta_0 + \hat\beta_1 x_0 \sim N(\beta_0 + \beta_1 x_0,\ \sigma^2 h_{00})$$
and
$$\frac{\hat\beta_0 + \hat\beta_1 x_0 - (\beta_0 + \beta_1 x_0)}{s\,h_{00}^{1/2}} \sim t_{n-2}.$$
$$\left[\hat\beta_0 + \hat\beta_1 x_0 - c\,s\,h_{00}^{1/2},\ \ \hat\beta_0 + \hat\beta_1 x_0 + c\,s\,h_{00}^{1/2}\right]$$
4. Notice that the width of these confidence intervals depends on x0 and increases as x0
gets further from the sample mean x̄. That is, the best estimates are made in the
centre of the observed explanatory variables.
5. The fitted values are estimates of the conditional mean function at the observed values of covariates, so
$$E(\hat Y_i) = \beta_0 + \beta_1 x_i, \qquad \mathrm{Var}(\hat Y_i) = \sigma^2 h_{ii}$$
and
2.11 Predicting observations
1. Suppose that we want to predict the observation $Y_0 = \beta_0 + \beta_1 x_0 + \epsilon_0$ at a value $x_0$ of $X$, where $\epsilon_0$ is independent of the errors in the sample data observations. This is different from the inference problems we have considered so far because in those we estimated a fixed but unknown constant parameter whereas here we are 'estimating' a random variable. Nevertheless, the basic approach only needs minor modification.
2. Notice that $\hat Y_0 = \hat\beta_0 + \hat\beta_1 x_0$ satisfies
$$E(\hat Y_0 - Y_0) = 0,$$
so $\hat Y_0$ is an unbiased predictor of $Y_0$. Making use of the fact that $Y_0$, and hence $\epsilon_0$, is independent of $Y_1, \dots, Y_n$, the prediction variance is
$$E(\hat Y_0 - Y_0)^2 = E(\hat\beta_0 + \hat\beta_1 x_0 - \beta_0 - \beta_1 x_0 - \epsilon_0)^2 = E(\hat\beta_0 + \hat\beta_1 x_0 - \beta_0 - \beta_1 x_0)^2 + E(\epsilon_0^2) = \sigma^2(1 + h_{00}).$$
3. Using a similar approach to that in Section 2.10, we can obtain $100(1-\alpha)\%$ prediction intervals of the form
$$\left[\hat\beta_0 + \hat\beta_1 x_0 - c\,s\,(1 + h_{00})^{1/2},\ \ \hat\beta_0 + \hat\beta_1 x_0 + c\,s\,(1 + h_{00})^{1/2}\right]$$
where $P(t_{n-2} < -c) = \alpha/2$.
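Both kinds of interval are available from predict(); the sketch below uses the numerical example fit at the arbitrary illustrative value $x_0 = 100$.

new <- data.frame(x = 100)                              # the new value x0
predict(fit, newdata = new, interval = "confidence")    # CI for the conditional mean at x0
predict(fit, newdata = new, interval = "prediction")    # wider interval for a new observation Y0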
4. This is of the general form:
2.12 Residuals
1. We can also write down the properties of the residuals
$$R_i = Y_i - \hat Y_i = Y_i - \hat\beta_0 - \hat\beta_1 x_i.$$
2. This looks like but is not the same as the prediction problem, because the observation $Y_i$ is in the sample and hence cannot be independent of $(\hat\beta_0, \hat\beta_1)$, which depend explicitly on $Y_1, \dots, Y_n$. The mean and variance are
$$E(R_i) = 0$$
and
$$\mathrm{Var}(R_i) = \mathrm{Var}(Y_i) + \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_i) - 2\,\mathrm{Cov}(Y_i, \hat\beta_0 + \hat\beta_1 x_i).$$
3. Now
$$\mathrm{Cov}(Y_i, \hat\beta_0 + \hat\beta_1 x_j) = \mathrm{Cov}\!\left(Y_i,\ \sum_{k=1}^n (w_{0k} + w_{1k} x_j) Y_k\right) = (w_{0i} + w_{1i} x_j)\,\mathrm{Var}(Y_i)$$
$$= \sigma^2\left(\frac{1}{n} - \frac{\bar x (x_i - \bar x)}{(n-1)s_X^2} + \frac{x_j (x_i - \bar x)}{(n-1)s_X^2}\right) = \sigma^2\left(\frac{1}{n} + \frac{(x_i - \bar x)(x_j - \bar x)}{(n-1)s_X^2}\right) = \sigma^2 h_{ij},$$
so
$$\mathrm{Var}(R_i) = \sigma^2(1 + h_{ii} - 2h_{ii}) = \sigma^2(1 - h_{ii})$$
and, for $i \ne j$,
$$\mathrm{Cov}(R_i, R_j) = \mathrm{Cov}(Y_i - \hat\beta_0 - \hat\beta_1 x_i,\ Y_j - \hat\beta_0 - \hat\beta_1 x_j)$$
$$= -\mathrm{Cov}(Y_i, \hat\beta_0 + \hat\beta_1 x_j) - \mathrm{Cov}(Y_j, \hat\beta_0 + \hat\beta_1 x_i) + \mathrm{Cov}(\hat\beta_0 + \hat\beta_1 x_i,\ \hat\beta_0 + \hat\beta_1 x_j)$$
$$= \sigma^2(-h_{ij} - h_{ij} + h_{ij}) = -\sigma^2 h_{ij}.$$
4. Note that, in general, $h_{ij} \ne 0$, which implies that $R_i$ and $R_j$ may be correlated even though we assumed that $\epsilon_i$ and $\epsilon_j$ are independent. This is because $R_i$ and $R_j$ both depend on $\hat\beta_0$ and $\hat\beta_1$. The correlation between $R_i$ and $R_j$ is
$$\mathrm{Corr}(R_i, R_j) = \frac{\mathrm{Cov}(R_i, R_j)}{\sqrt{\mathrm{Var}(R_i)\,\mathrm{Var}(R_j)}} = \frac{-\sigma^2 h_{ij}}{\sqrt{\sigma^2(1 - h_{ii})\,\sigma^2(1 - h_{jj})}} = \frac{-h_{ij}}{\sqrt{(1 - h_{ii})(1 - h_{jj})}}.$$
6. The residuals are useful for examining patterns and have nice simple distributional properties. It is sometimes useful to scale them, giving the standardised residuals
$$R_i/s.$$
We expect about 5% of the $|R_i/s|$ to exceed 2 and very few to exceed 3 in normally distributed data. The fact that the variance of the residuals depends on $i$ does not usually matter (and this dependence diminishes as $n$ increases). When this is a concern, we may use the studentised residuals
$$R_i/\{s(1 - h_{ii})^{1/2}\}$$
instead.
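In R (hts.lm as in the diagnostics sketches above), the two kinds of scaled residual can be obtained as follows.

s        <- summary(hts.lm)$sigma
std.res  <- resid(hts.lm) / s     # standardised residuals R_i / s
stud.res <- rstandard(hts.lm)     # studentised residuals R_i / {s (1 - h_ii)^(1/2)}
mean(abs(std.res) > 2)            # should be roughly 0.05 under normality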
where $\beta_1^*$ is the conjectured value.
2. The basic idea of testing is to take our estimator of $\beta_1$ and see how close it is to $\beta_1^*$. If it is close to $\beta_1^*$, then we conclude that the data support $H_0$; if it is far from $\beta_1^*$, then we conclude there is evidence against $H_0$. As described in Section 1.3.2, we summarise the evidence against $H_0$ by computing a p-value, which is the probability under (the model and) $H_0$ of observing a value of $\hat\beta_1$ as far from $\beta_1^*$ or further than that actually observed.
$$t(\hat\beta_1) = \frac{\hat\beta_1 - \beta_1}{\mathrm{s.e.}(\hat\beta_1)} \sim t_{n-2}$$
5. We compute the p-value for the test by examining how probable it is that a $t_{n-2}$ distribution would generate a value of $t^*(\hat\beta_1)$ as, or more, extreme than the observed data value. We can write the calculation succinctly as
6. This is a two-sided test (we are checking for departures from $H_0$ in either direction).
Directional tests are possible but relatively uncommon in practice, and we will not
develop them here.
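A sketch of the calculation for $H_0: \beta_1 = 0$ in the numerical example (fit and x as before):

est   <- coef(summary(fit))["x", "Estimate"]
se    <- coef(summary(fit))["x", "Std. Error"]
tstat <- (est - 0) / se                           # conjectured value beta1* = 0
pval  <- 2 * pt(-abs(tstat), df = length(x) - 2)  # two-sided p-value
c(t = tstat, p = pval)                            # matches the slope row of summary(fit)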
2.14 Model comparison
1. The most common test of interest for a simple linear regression model is the test of the
hypothesis that $\beta_1 = 0$. The reason that this test is so important is that it provides a
formal way of comparing the linear regression model
$$Y_i \stackrel{\text{ind}}{\sim} N(\beta_0 + \beta_1 x_i,\ \sigma^2), \qquad i = 1, \dots, n$$
where the distribution of the response variable Y explicitly depends on the explanatory
variable X through a linear relationship in its conditional mean, with the simpler (more
parsimonious) model
$$Y_i \stackrel{\text{ind}}{\sim} N(\beta_0,\ \sigma^2), \qquad i = 1, \dots, n$$
where there is no relationship between the variables Y and X.
2. These models can be compared by testing the hypothesis $H_0: \beta_1 = 0$ within the linear regression model. If this hypothesis is not rejected, then we can draw the conclusion that the simpler model is adequate for the data, and there is no reason to entertain the more complex model – there is no apparent relationship between $Y$ and $X$ (or at least not one which can be explained by a simple linear regression).
3. Here, and more generally when using a hypothesis test to compare two models, the
null hypothesis corresponds to the simpler model. Furthermore, the null hypothesis has
special status in a hypothesis test – it is not rejected unless the data provide significant
evidence against it. This is consistent with a desire for parsimony in modelling. We
do not reject the simpler model in favour of the more complex model unless the data
provide significant evidence that we should (i.e. that the simpler model is a poor fit to the data in comparison with the more complex model).
4. The two models above can be compared from a different perspective which will be
useful later (where we want to perform more general model comparisons).
• When we fit the model $Y_i \stackrel{\text{ind}}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)$, $i = 1, \dots, n$, the quality of the fit is reflected by the residual sum of squares (RSS) $(n-2)s^2 = \sum_{i=1}^n r_i^2$.
• Under the simpler model (which we still denote $H_0$) $Y_i \stackrel{\text{ind}}{\sim} N(\beta_0, \sigma^2)$, $i = 1, \dots, n$, the least squares estimator of $\beta_0$ is $\hat\beta_0 = \bar Y$, so the residual sum of squares (RSS) is $\sum_{i=1}^n (y_i - \bar y)^2 = (n-1)s_Y^2$.
– Although we use the same symbol $\beta_0$ in both models, we have $\beta_0 = E(Y_i \mid X_i = 0)$, i.e. $\beta_0$ is the intercept, in the simple regression model, and $\beta_0 = E(Y_i)$, i.e. $\beta_0$ is the mean, in the reduced model. This raises a point which is very important in general: the interpretation of a parameter depends on the context, and this is not reflected in the notation.
• We can compare the two models by examining
$$\text{RSS under } H_0 - \text{RSS under regression model} = (n-1)s_Y^2 - (n-2)s^2 = \sum_{i=1}^n (y_i - \bar y)^2 - \sum_{i=1}^n r_i^2.$$
Now we have
$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i + \hat y_i - \hat\beta_0 - \hat\beta_1 \bar x)^2$$
$$= \sum_{i=1}^n \left(r_i + \hat\beta_1 (x_i - \bar x)\right)^2$$
$$= \sum_{i=1}^n r_i^2 + \hat\beta_1^2 \sum_{i=1}^n (x_i - \bar x)^2 + 2\hat\beta_1 \sum_{i=1}^n r_i (x_i - \bar x)$$
$$= \hat\beta_1^2 (n-1)s_X^2 + \sum_{i=1}^n r_i^2,$$
since the cross term vanishes: $\sum_{i=1}^n r_i (x_i - \bar x) = 0$ by the normal equations. Hence,
$$(n-1)s_Y^2 = \hat\beta_1^2 (n-1)s_X^2 + (n-2)s^2,$$
which is sometimes referred to as the analysis of variance decomposition.
5. For the linear regression model, we call the first term the total sum of squares, the
second the regression sum of squares and the third the residual sum of squares. Hence,
we have
6. Now, as suggested above, we can reformulate the comparison of the two models by considering a test statistic based on the improvement in fit provided by the regression model over the simpler model:
$$\sum_{i=1}^n (y_i - \bar y)^2 - \sum_{i=1}^n r_i^2 = \hat\beta_1^2 (n-1)s_X^2 = \text{Regression SS}.$$
If the regression model fits the data significantly better than the simple (null) model $H_0$, then the regression sum of squares will be large. But how do we determine whether the regression sum of squares is so large that we should reject $H_0$?
7. It is clear that, under $H_0: \beta_1 = 0$, the regression sum of squares $\hat\beta_1^2(n-1)s_X^2$, divided by $\sigma^2$, has a $\chi^2_1$ distribution. Furthermore, as we shall prove later, it is independent of $S^2$. Hence, we consider the test statistic
$$F = \frac{\text{Regression Mean Square}}{\text{Residual Mean Square}} = \frac{\hat\beta_1^2(n-1)s_X^2}{s^2} = \frac{\hat\beta_1^2(n-1)s_X^2/\sigma^2}{s^2/\sigma^2} = \frac{X_1^2/1}{X_{n-2}^2/(n-2)}$$
where $X_1^2$ has a $\chi^2_1$ distribution and $X_{n-2}^2$ has a $\chi^2_{n-2}$ distribution. We call the distribution of such a ratio an $F$ distribution. The $F$ distribution depends on two parameters corresponding to the degrees of freedom in the numerator and denominator. Hence, we denote the $F$ distribution above by $F_{1,n-2}$. It is a 'known' distribution, properties of which can be obtained from statistical software, or statistical tables.
8. The comparison of the null model and regression models based on the F-ratio is identical to a test of the same hypothesis based on the t-statistic because
• $F = \dfrac{\hat\beta_1^2(n-1)s_X^2}{s^2} = \left(\dfrac{\hat\beta_1}{s/[(n-1)s_X^2]^{1/2}}\right)^2 = \left(\dfrac{\hat\beta_1}{\mathrm{s.e.}(\hat\beta_1)}\right)^2$
• $F_{1,k} \sim \dfrac{\chi_1^2/1}{\chi_k^2/k} \sim \dfrac{[N(0,1)]^2}{\chi_k^2/k} \sim \left(\dfrac{N(0,1)}{[\chi_k^2/k]^{1/2}}\right)^2 \sim t_k^2$
9. The analysis of variance decomposition and the associated $F$-test are often presented in an ANOVA table. The ANOVA table for comparing the simple linear regression model ($Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$) with the null model ($Y_i = \beta_0 + \epsilon_i$) is given by:

Source       Df      Sum of Squares                    Mean SS                      F Value                            P Value
Regression   1       $\hat\beta_1^2(n-1)s_x^2$         $\hat\beta_1^2(n-1)s_x^2$    $\hat\beta_1^2(n-1)s_x^2/s^2$      $\Pr(F > \text{F Value})$
Residuals    n-2     $(n-2)s^2$                        $s^2$
Total        n-1     $\sum_{i=1}^n (y_i - \bar y)^2$
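The table, and the $F = t^2$ identity, can be checked on the numerical example fit:

anova(fit)                                       # ANOVA table for the simple linear regression
tstat <- coef(summary(fit))["x", "t value"]
c(F = anova(fit)["x", "F value"], t2 = tstat^2)  # the two agree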
12. The P-value, observed to be 0.0000 (i.e. zero to four decimal places), indicates that the linear regression model is significant.
3. Present, interpret and use the model to make inferences and predictions, remembering
always to appropriately summarise uncertainty.
2. Dr R. Telford collected skinfold (the sum of four skinfold measurements) and percent
body fat measurements on 102 elite male athletes training at the Australian Institute
of Sport.
3. The scatterplot of the data shows that the percent body fat increases linearly with the
skinfold measurement and that the variation in percent body fat also increases with
the skinfold measurement. There do not seem to be any unusual data points and there
are no other obvious patterns to note.
4. It is often the case that we can simplify the relationships exhibited by data by trans-
forming one or more of the variables. In the example, we want to preserve the simple
linear relationship for the conditional mean but stabilise the variability (i.e. make it
constant) so that it is easier to describe. We therefore try transforming both vari-
ables. There is no theoretical argument to suggest a transformation here but empirical
experience suggests that we try the log transformation.
5. The scatterplot on the log scale shows that the log transformation of both variables
preserves the linear conditional location relationship and stabilises the conditional vari-
ability.
6. Let Yi denote the log percent body fat measurement and Xi the log skinfold measure-
ment on athlete i, where i = 1, . . . , 102. Then we have
$$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i \quad\text{and}\quad \mathrm{Var}(Y_i \mid x_i) = \sigma^2.$$
7. After examining the diagnostics, we conclude that the linear regression model provides
a good approximation to the underlying population. Specifically,
$$\hat Y = -1.25 + 0.88x$$
(estimated standard errors 0.097 and 0.025, respectively).
Parameter Estimates
Term Estimate Std Error t Ratio Prob>|t|
Intercept –1.250 0.097 –12.940 0.000
log skinfold 0.882 0.024 35.590 0.000
Analysis of Variance
Source DF SS MS F Prob>F
Model 1 8.5240 8.5240 1266.3 0.0000
Error 100 0.6731 0.0067
C Total 101 9.1971
8. A 95% confidence interval for the population slope parameter $\beta_1$ is (0.83, 0.93).
9. A common test of interest is the test of the hypothesis that $\beta_1 = 0$, i.e. that skinfold is not useful for predicting body fat. The 95% confidence interval for $\beta_1$ does not contain zero, so we can conclude that there is evidence against the hypothesis and that skinfold is a useful variable in the model for predicting body fat. A formal hypothesis test can be carried out by computing the t-ratio 35.59, which has p-value zero (to 4 decimal places). We conclude that $\beta_1$ is significantly different from zero.
11. A 95% prediction interval for the log body fat percentage of a single male athlete with a
skinfold of 70 is (2.334, 2.663).
12. Although in our body fat example we have worked on the simpler log scale, it may be useful to make predictions on the raw scale. Note that we can transform the fitted model to show that, on the raw scale, the fitted model is
$$\widehat{\text{bodyfat}} = e^{-1.25}\,\text{skinfold}^{0.88} \approx 0.29 \times \text{skinfold}^{0.88}.$$
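For example, a raw-scale prediction interval at a skinfold of 70 can be obtained by exponentiating the log-scale interval (bfat.lm is the fit produced in the R code further below; the variable names logskin and logbfat are those used there).

new <- data.frame(logskin = log(70))                           # skinfold of 70 on the log scale
exp(predict(bfat.lm, newdata = new, interval = "prediction"))  # back-transformed 95% interval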
13. The validity of the inferences depends on the representativeness of the sample and the validity
of the model. It is vital when collecting data to ensure that it is representative of the
population of interest and before making inferences to ensure that the model is appropriate.
Strictly speaking, model assessment is a form of informal inference and has an impact on
other inferences but it is not sensible to simply hope that a model is valid when empirical
verification is available.
bfat.lm <- lm(logbfat ~ logskin, data=bodyfat)
qqnorm(bfat.lm$res)
qqline(bfat.lm$res)
par(mfrow=c(1,1))
x <- seq(from=-5, to=5, length=500)
y <- dt(x, df=100)
plot(x, y, xlab="", ylab="", type="l")
title("T-density with df=100")
tstat <- coef(summary(bfat.lm))["logskin", "t value"]  # the observed t-ratio (about 35.59)
abline(v=abs(tstat))
abline(h=0)
# Create a new data set called new
new <- data.frame(logskin = seq(3, 5.5, length = 50))  # grid of log-skinfold values (illustrative range)
# (the plotting commands that draw the fitted line and interval curves are omitted here)
symb <- c("Fitted line", "95% CI for mean", "95% CI for observation")
cat("Please click on an empty space on the graphics window\n")
legend(locator(1), legend = symb, lty = c(1, 2, 4), col = c(5, 2, 3))
# R output
> summary(bfat.lm)
Call:
lm(formula = logbfat ~ logskin, data = bodyfat)
Residuals:
Min 1Q Median 3Q Max
-2.234e-01 -5.566e-02 5.226e-05 5.639e-02 1.530e-01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.24988 0.09661 -12.94 <2e-16 ***
logskin 0.88225 0.02479 35.59 <2e-16 ***
---
> anova(bfat.lm)
Analysis of Variance Table
Response: logbfat
Df Sum Sq Mean Sq F value Pr(>F)
logskin 1 8.5240 8.5240 1266.3 < 2.2e-16 ***
Residuals 100 0.6731 0.0067
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 0.1 1
2.18 Chapter Recap
1. The simple linear regression model is given by:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \stackrel{\text{ind}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n.$$
All these formulae are equivalent for $\hat\beta_1$. We use the form which is most convenient for the purpose, e.g. for proving theorems we use the form in the middle, $\sum_{i=1}^n w_{1i} y_i$, and for numerical calculation we use the last form.
3. Fitted values and residuals are defined as: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$, and $r_i = y_i - \hat y_i$.
(a) We plot the residuals against the fitted values to check for homoscedasticity. The plot should look like a random scatter if this assumption holds.
(b) We order and plot the residuals against the normal order statistics to check for normality. This plot should be approximately a straight line if the normality assumption holds.
5. Define notations:
$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2, \qquad s_y^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2,$$
$$s_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y), \qquad s_{xx} = \sum_{i=1}^n (x_i - \bar x)^2, \qquad s_{yy} = \sum_{i=1}^n (y_i - \bar y)^2.$$
6. Define the sample correlation coefficient, $r = \frac{s_{xy}}{(n-1)s_x s_y}$. Now,
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{s_{xy}}{(n-1)s_x^2} = \frac{s_{xy}}{(n-1)s_x s_y}\cdot\frac{s_y}{s_x} \equiv r\,\frac{s_y}{s_x}.$$
7. We have:
$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \mathrm{Var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right), \qquad \mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2\,\bar x}{\sum_{i=1}^n (x_i - \bar x)^2}.$$
11. The residual standard error is $\hat\sigma$, which is given by $s$.
12. Estimate the conditional mean of $Y_0$ for a new $x_0$ by $\hat\beta_0 + \hat\beta_1 x_0$. Its variance is
$$\sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right) = \sigma^2 h_{00}.$$
14. The standard error of an estimator of a parameter is the estimated value of the square root of the variance of that estimator.
15. To construct a confidence interval, use the general formula: Estimate ± Critical Value × Standard Error. Use this to find intervals for $\beta_0$ and $\beta_1$, the conditional mean, and the future observation $Y_0$ for a new $x_0$.
16. The critical value is an appropriate quantile of the t-distribution with $n-2$ df. For 95% intervals, the R command is qt(0.975, df=n-2).
17. The critical value can be found from the table in the Formula Sheet. Choose the row using the residual degrees of freedom and the column using the confidence level. For example, when $n = 12$, for a 95% confidence interval use the value 2.23, which is in the row for $k = 10$ and the column for $\alpha = 0.025$.
19. The ANOVA table for comparing the simple linear regression model with the null model ($Y_i = \beta_0 + \epsilon_i$) is given by:
22. Adjusted $R^2$ is the proportionate reduction in error variance achieved by the model:
$$\text{adj } R^2 = 1 - \frac{s^2}{s_y^2}.$$