Notes 1
E[Yi] = β0 + β1 Xi
where β0 is the mean of Y when X = 0 (assuming this is a reasonable level of X), or more
generally the Y –intercept of the regression line; β1 is the change in the mean of Y as X increases
by a single unit, or the slope of the regression line. Note that in practice β0 and β1 are unknown
parameters that will be estimated from sample data.
Individual measurements are assumed to be independent, and normally distributed around the
mean at their corresponding X level, with standard deviation σ. This can be stated as below:
Yi = β0 + β1 Xi + εi      εi ∼ NID(0, σ²)
where εi is a random error term and NID(0, σ²) means normally and independently distributed with mean 0 and variance σ².
1.1.1 Examples
The following two examples are based on applications of regression in pharmacodynamics and
microeconomics.
The following data were published by J.G. Wagner, et al, in the 1968 article: “Correlation
of Performance Test Scores with ‘Tissue Concentration’ of Lysergic Acid Diethylamide in Human
Subjects,” (Clinical Pharmacology & Therapeutics, 9:635–638).
X — LSD tissue concentration for the group at the time of the test.
Y — Mean score on math test (relative to control) for a group of five male volunteers.
A sample of n = 7 points were selected, with Xi and Yi being measured at each point in time.
These 7 observations are treated as a sample from all possible realizations from this experiment.
The parameter β1 represents the systematic change in mean score as tissue concentration increases
by one unit, and β0 represents the true mean score when the concentration is 0. The data are given
in Table 1.1.1.
i Xi Yi
1 1.17 78.93
2 2.97 58.20
3 3.26 67.47
4 4.69 37.47
5 5.83 45.65
6 6.00 32.92
7 6.41 29.97
The following (approximate) data were published by Joel Dean, in the 1941 article: “Statistical
Cost Functions of a Hosiery Mill,” (Studies in Business Administration, vol. 14, no. 3).
A sample of n = 48 months of data were used, with Xi and Yi being measured for each month.
The parameter β1 represents the change in mean cost per unit increase in output (the unit variable cost), and β0 represents the true mean cost when output is 0 without shutting down the plant (the fixed cost). The data are given in Table 1.1.1 (the order is arbitrary, as the data are printed in table form and were obtained from visual inspection/approximation of a plot).
2. Specify the levels of Xi , i = 1, . . . , n. This can be done easily with do loops or by brute force.
3. Obtain n standard normal errors Zi ∼ N (0, 1), i = 1, . . . , n. Statistical routines have them
built in, or transformations of uniform random variates can be obtained.
5. For the case of random Xi, these steps are first completed for Xi in 2), then continued for Yi. The Zi used for Xi must be independent of that used for Yi (a minimal simulation sketch follows this list).
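A minimal sketch of these simulation steps in Python/NumPy is given below. The particular X levels and the parameter values (β0 = 90, β1 = −10, σ² = 50) are illustrative choices borrowed from the LSD example later in these notes, not part of the original simulation description.

    import numpy as np

    rng = np.random.default_rng(0)                 # reproducible stream of random numbers

    # Step 2: specify the levels of X_i (fixed levels here; illustrative values)
    X = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
    n = X.size

    # Illustrative parameter values
    beta0, beta1, sigma = 90.0, -10.0, np.sqrt(50.0)

    # Step 3: obtain n standard normal errors Z_i ~ N(0, 1)
    Z = rng.standard_normal(n)

    # Form the responses Y_i = beta0 + beta1 * X_i + sigma * Z_i
    Y = beta0 + beta1 * X + sigma * Z
    print(np.column_stack([X, Y]))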
The least squares criterion chooses the estimates that minimize the sum of squared distances between the observed responses (Yi) and the fitted regression line (Ŷi = β̂0 + β̂1 Xi), where β̂0 and β̂1 are sample–based estimates of β0 and β1, respectively.
Mathematically, we can label the error sum of squares as the sum of squared distances between
the observed data and their mean values based on the model:
Q = Σ_{i=1}^n (Yi − E[Yi])² = Σ_{i=1}^n (Yi − (β0 + β1 Xi))²
The least squares estimates of β0 and β1 are the values that minimize Q; they are obtained by taking derivatives, setting them equal to 0, and solving for β̂0 and β̂1.
∂Q/∂β0 = 2 Σ_{i=1}^n (Yi − β0 − β1 Xi)(−1) = 0

⇒ Σ_{i=1}^n (Yi − β0 − β1 Xi) = Σ_{i=1}^n Yi − nβ0 − β1 Σ_{i=1}^n Xi = 0      (1)

∂Q/∂β1 = 2 Σ_{i=1}^n (Yi − β0 − β1 Xi)(−Xi) = 0

⇒ Σ_{i=1}^n (Yi − β0 − β1 Xi)Xi = Σ_{i=1}^n XiYi − β0 Σ_{i=1}^n Xi − β1 Σ_{i=1}^n Xi² = 0      (2)
From equations (1) and (2) we obtain the so–called “normal equations”:
nβ̂0 + β̂1 Σ_{i=1}^n Xi = Σ_{i=1}^n Yi      (3)

β̂0 Σ_{i=1}^n Xi + β̂1 Σ_{i=1}^n Xi² = Σ_{i=1}^n XiYi      (4)
Multiplying equation (3) by Σ_{i=1}^n Xi and equation (4) by n, we obtain the following two equations:
nβ̂0 Σ_{i=1}^n Xi + β̂1 (Σ_{i=1}^n Xi)² = (Σ_{i=1}^n Xi)(Σ_{i=1}^n Yi)      (5)

nβ̂0 Σ_{i=1}^n Xi + nβ̂1 Σ_{i=1}^n Xi² = n Σ_{i=1}^n XiYi      (6)
Subtracting equation (5) from (6), we get:
β̂1 (n Σ_{i=1}^n Xi² − (Σ_{i=1}^n Xi)²) = n Σ_{i=1}^n XiYi − (Σ_{i=1}^n Xi)(Σ_{i=1}^n Yi)

⇒ β̂1 = [Σ_{i=1}^n XiYi − (Σ_{i=1}^n Xi)(Σ_{i=1}^n Yi)/n] / [Σ_{i=1}^n Xi² − (Σ_{i=1}^n Xi)²/n] = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²      (7)
Now, from equation (1), we get:
nβ̂0 = Σ_{i=1}^n Yi − β̂1 Σ_{i=1}^n Xi   ⇒   β̂0 = Ȳ − β̂1 X̄      (8)
The residuals are defined as the differences between the observed responses (Yi) and their predicted values (Ŷi); the residuals are denoted ei and are estimates of εi:

ei = Yi − Ŷi = Yi − (β̂0 + β̂1 Xi)
1.2.1 Examples
Numerical results for the two examples described before are given below.
This dataset has n = 7 observations with a mean LSD tissue content of X̄ = 4.3329, and a mean math score of Ȳ = 50.0871.
Σ_{i=1}^n Xi = 30.33      Σ_{i=1}^n Xi² = 153.8905      Σ_{i=1}^n Yi = 350.61      Σ_{i=1}^n Yi² = 19639.2365      Σ_{i=1}^n XiYi = 1316.6558
β̂1 = [Σ XiYi − (Σ Xi)(Σ Yi)/n] / [Σ Xi² − (Σ Xi)²/n] = [1316.6558 − (30.33)(350.61)/7] / [153.8905 − (30.33)²/7] = −202.4872/22.4749 = −9.0095

β̂0 = Ȳ − β̂1 X̄ = 50.0871 − (−9.0095)(4.3329) ≈ 89.12
i Xi Yi Ŷi ei
1 1.17 78.93 78.5820 0.3480
2 2.97 58.20 62.3656 -4.1656
3 3.26 67.47 59.7529 7.7171
4 4.69 37.47 46.8699 -9.3999
5 5.83 45.65 36.5995 9.0505
6 6.00 32.92 35.0680 -2.1480
7 6.41 29.97 31.3743 -1.4043
Table 3: LSD concentrations, math scores, fitted values and residuals – Wagner, et al (1968)
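A short NumPy sketch (not part of the original notes) that reproduces the estimates and the fitted values and residuals of Table 3 from the raw data:

    import numpy as np

    X = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
    Y = np.array([78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97])
    n = X.size

    Sxy = np.sum(X * Y) - X.sum() * Y.sum() / n     # corrected cross-product
    Sxx = np.sum(X ** 2) - X.sum() ** 2 / n         # corrected sum of squares, about 22.47
    b1 = Sxy / Sxx                                  # slope, about -9.0095
    b0 = Y.mean() - b1 * X.mean()                   # intercept, about 89.12

    Yhat = b0 + b1 * X                              # fitted values
    e = Y - Yhat                                    # residuals (sum to approximately 0)
    print(np.column_stack([X, Y, Yhat, e]))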
This dataset has n = 48 observations with a mean output (in 1000s of dozens) of X̄ = 31.0673, and a mean monthly cost (in $1000s) of Ȳ = 65.4329.
Σ Xi = 1491.23      Σ Xi² = 54067.42      Σ Yi = 3140.78      Σ Yi² = 238424.46      Σ XiYi = 113095.80
β̂1 = [Σ XiYi − (Σ Xi)(Σ Yi)/n] / [Σ Xi² − (Σ Xi)²/n] = [113095.80 − (1491.23)(3140.78)/48] / [54067.42 − (1491.23)²/48] = 15520.27/7738.94 = 2.0055

β̂0 = Ȳ − β̂1 X̄ = 65.4329 − (2.0055)(31.0673) ≈ 3.13
[Figures: scatterplot of math score versus LSD tissue concentration (conc) for Example 1, and scatterplot of monthly cost versus production output for Example 2.]
Yi = Ŷi + ei i = 1, . . . , n
Table 4: Approximated Monthly Outputs, total costs, fitted values and residuals – Dean (1941)
Here is a proof that the cross–product term 2 Σ_{i=1}^n Ŷi ei in the expansion Σ Yi² = Σ Ŷi² + Σ ei² + 2 Σ Ŷi ei is 0 (the result is very easy to obtain in matrix algebra):

Ŷi = β̂0 + β̂1 Xi = (Ȳ − β̂1 X̄) + β̂1 Xi = Ȳ + β̂1 (Xi − X̄)      (11)

ei = Yi − Ŷi = (Yi − Ȳ) − β̂1 (Xi − X̄)      (12)

where β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)². Multiplying (11) and (12) and summing over i:

Σ Ŷi ei = Ȳ Σ (Yi − Ȳ) − β̂1 Ȳ Σ (Xi − X̄) + β̂1 Σ (Xi − X̄)(Yi − Ȳ) − β̂1² Σ (Xi − X̄)²

The first two sums are 0, and making use of the identity

Σ (Xi − X̄)(Yi − Ȳ) = Σ XiYi − X̄ Σ Yi − Ȳ Σ Xi + nX̄Ȳ = Σ XiYi − nX̄Ȳ = β̂1 Σ (Xi − X̄)²

the last two terms cancel, so Σ Ŷi ei = β̂1² Σ (Xi − X̄)² − β̂1² Σ (Xi − X̄)² = 0.
Thus we can partition the total (uncorrected) sum of squares into the sum of squares of the
predicted values (the model sum of squares) and the sum of squares of the errors (the residual sum
of squares).
Σ_{i=1}^n Yi² = Σ_{i=1}^n Ŷi² + Σ_{i=1}^n ei²
SS(MODEL) = Σ Ŷi² = nβ̂0² + 2β̂0β̂1 Σ Xi + β̂1² Σ Xi² = n(Ȳ − β̂1X̄)² + 2(Ȳ − β̂1X̄)β̂1 Σ Xi + β̂1² Σ Xi²

= nȲ² + nβ̂1²X̄² − 2nβ̂1ȲX̄ + 2nβ̂1ȲX̄ − 2nβ̂1²X̄² + β̂1² Σ Xi²

= nȲ² − nβ̂1²X̄² + β̂1² Σ Xi² = nȲ² + β̂1² (Σ Xi² − nX̄²)

= nȲ² + β̂1² Σ (Xi − X̄)²

SS(RESIDUAL) = Σ ei² = Σ (Yi − Ŷi)² = SS(TOTAL UNCORRECTED) − SS(MODEL)
The total (uncorrected) sum of squares is of little interest by itself, since it depends on the
level of the data, but not the variability (around the mean). The total (corrected) sum of squares
measures the sum of the squared deviations around the mean.
SS(TOTAL CORRECTED) = Σ (Yi − Ȳ)² = Σ Yi² − nȲ²

= (SS(MODEL) − nȲ²) + SS(RESIDUAL) = β̂1² Σ (Xi − X̄)² + Σ ei²

= SS(REGRESSION) + SS(RESIDUAL)
Note that the model sum of squares considers both β0 and β1 , while the regression sum of
squares considers only the slope β1 .
For a general regression model with p independent variables, we have the following model:

Yi = β0 + β1 Xi1 + · · · + βp Xip + εi      i = 1, . . . , n

which contains p′ = p + 1 model parameters. The Analysis of Variance is given in Table 1.3, which contains sources of variation, their degrees of freedom, and sums of squares.
The mean squares for regression and residuals are the corresponding sums of squares divided by their respective degrees of freedom:

MS(REGRESSION) = SS(REGRESSION)/p

MS(RESIDUAL) = SS(RESIDUAL)/(n − p′)
The expected values of the mean squares are given below (for the case where there is a single independent variable); the proof is given later. These are based on the assumption that the model being fit is the correct model.

E[MS(REGRESSION)] = σ² + β1² Σ_{i=1}^n (Xi − X̄)²

E[MS(RESIDUAL)] = σ²
The Coefficient of Determination (R2 ) is the ratio of the regression sum of squares to the
total (corrected) sum of squares, and represents the fraction of the variation in Y that is “explained”
by the set of independent variables X1 , . . . , Xp .
R² = SS(REGRESSION)/SS(TOTAL CORRECTED) = 1 − SS(RESIDUAL)/SS(TOTAL CORRECTED)
1.3.1 Examples
Numerical results for the two examples described before are given below.
The Analysis of Variance for the LSD/math score data are given in Table 1.3.1. Here, n = 7, p = 1, and p′ = 2. All relevant sums are obtained from previous examples.
MS(REGRESSION) = SS(REGRESSION)/p = 1824.30/1 = 1824.30

MS(RESIDUAL) = SS(RESIDUAL)/(n − p′) = 253.92/5 = 50.78
The coefficient of determination for this data is:
R² = SS(REGRESSION)/SS(TOTAL CORRECTED) = 1824.30/2078.22 = 0.8778
Approximately 88% of the variation in math scores is “explained” by the linear relation between
math scores and LSD concentration.
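A sketch of the same ANOVA computations in NumPy (self-contained, so the estimates are recomputed) is given below; the printed values should match the ones quoted above up to rounding.

    import numpy as np

    X = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
    Y = np.array([78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97])
    n, p = X.size, 1

    Sxx = ((X - X.mean()) ** 2).sum()
    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
    b0 = Y.mean() - b1 * X.mean()
    Yhat = b0 + b1 * X

    ss_total_corr = ((Y - Y.mean()) ** 2).sum()   # about 2078.22
    ss_reg = b1 ** 2 * Sxx                        # about 1824.30
    ss_resid = ((Y - Yhat) ** 2).sum()            # about 253.92

    ms_reg = ss_reg / p
    ms_resid = ss_resid / (n - p - 1)             # s^2, about 50.78
    r2 = ss_reg / ss_total_corr                   # about 0.8778
    print(ms_reg, ms_resid, r2)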
Example 2 – Estimating Cost Function of a Hosiery Mill
The Analysis of Variance for the output/cost data are given in Table 1.3.1. Here, n = 48, p = 1, and p′ = 2. All relevant sums are obtained from previous examples and computer output.
Table 7: The Analysis of Variance for the hosiery mill cost function data
MS(REGRESSION) = SS(REGRESSION)/p = 31125.98/1 = 31125.98

MS(RESIDUAL) = SS(RESIDUAL)/(n − p′) = 1788.19/46 = 38.87
The coefficient of determination for this data is:
R² = SS(REGRESSION)/SS(TOTAL CORRECTED) = 31125.98/32914.17 = 0.9457
Approximately 95% of the variation in monthly production costs is “explained” by the linear relation between costs and production output.
Writing β̂1 = Σ_{i=1}^n [(Xi − X̄)/Σ(Xj − X̄)²] Yi, its expectation is

E[β̂1] = Σ_{i=1}^n [(Xi − X̄)/Σ(Xj − X̄)²] (β0 + β1 Xi)

= [β0/Σ(Xi − X̄)²] Σ (Xi − X̄) + [β1/Σ(Xi − X̄)²] Σ (Xi − X̄)Xi = [β1/Σ(Xi − X̄)²] Σ (Xi − X̄)Xi

= [β1/Σ(Xi − X̄)²] [Σ Xi² − X̄ Σ Xi] = [β1/Σ(Xi − X̄)²] [Σ Xi² − nX̄²]

= [β1/Σ(Xi − X̄)²] Σ (Xi − X̄)² = β1

and, since the Yi are independent with Var[Yi] = σ²,

Var[β̂1] = [1/(Σ(Xi − X̄)²)²] σ² Σ (Xi − X̄)² = σ²/Σ_{i=1}^n (Xi − X̄)²
With the further assumption that the Yi (and, more specifically, the εi) are normally distributed, we can obtain the specific distribution of β̂1.
E[e^{tβ̂1}] = E[exp{t Σ_{i=1}^n [(Xi − X̄)/Σ(Xi − X̄)²] Yi}] = E[exp{Σ_{i=1}^n ti* Yi}] = Π_{i=1}^n E[e^{ti* Yi}]      (16)

where ti* = t(Xi − X̄)/Σ(Xi − X̄)², and the last equality follows from the independence of the Yi. For Y ∼ N(μ, σ²),

mY(t) = E[e^{tY}] = e^{μt + σ²t²/2}

⇒ E[e^{ti* Yi}] = exp{(β0 + β1 Xi) [(Xi − X̄)/Σ(Xi − X̄)²] t + [σ²t²/2] [(Xi − X̄)/Σ(Xi − X̄)²]²}

Summing the mean terms over i:

Σ_{i=1}^n (β0 + β1 Xi)(Xi − X̄) t / Σ(Xi − X̄)² = [t/Σ(Xi − X̄)²] Σ_{i=1}^n {0 + β1 (Xi − X̄)²} = β1 t      (17)

Summing the variance terms over i:

Σ_{i=1}^n [(Xi − X̄)²/(Σ(Xi − X̄)²)²] σ²t²/2 = [σ²t²/2] Σ(Xi − X̄)²/(Σ(Xi − X̄)²)² = σ²t²/[2 Σ(Xi − X̄)²]      (18)

Putting equations (17) and (18) back into equation (16), we get:

m_{β̂1}(t) = E[e^{tβ̂1}] = exp{β1 t + σ²t²/[2 Σ(Xi − X̄)²]}

which is the moment generating function of a normally distributed random variable with mean β1 and variance σ²/Σ_{i=1}^n (Xi − X̄)². Thus, we have the complete sampling distribution of β̂1 under the model's assumptions.
Consider two linear functions of the Yi, U = Σ_{i=1}^n ai Yi and W = Σ_{i=1}^n di Yi, where {ai} and {di} are constants and {Yi} are random variables. Then:

Cov[U, W] = Σ_{i=1}^n ai di V(Yi) + Σ_{i=1}^n Σ_{j≠i} ai dj Cov[Yi, Yj]
The expected values of the two linear functions of the Yi in equation (19) are as follows:
E[U] = Σ_{i=1}^n (1/n)(β0 + β1 Xi) = (1/n)(nβ0) + (1/n)β1 Σ_{i=1}^n Xi = β0 + β1 X̄
E[W] = Σ_{i=1}^n [X̄(Xi − X̄)/Σ(Xi − X̄)²] (β0 + β1 Xi)

= [X̄/Σ(Xi − X̄)²] Σ_{i=1}^n [β0 Xi + β1 Xi² − β0 X̄ − β1 X̄ Xi]

= [X̄/Σ(Xi − X̄)²] [β0 Σ (Xi − X̄) + β1 (Σ Xi² − nX̄²)]

= [X̄/Σ(Xi − X̄)²] [0 + β1 Σ (Xi − X̄)²] = β1 X̄
Now to get the variance of β̂0 (again assuming that Cov[Yi, Yj] = 0 for i ≠ j):
Var[U] = Var[(1/n) Σ Yi] = Σ (1/n)² Var[Yi] = n(1/n)²σ² = σ²/n

Var[W] = Var[β̂1 X̄] = X̄² Var[β̂1] = X̄² σ²/Σ(Xi − X̄)²

Cov[U, W] = Σ_{i=1}^n (1/n)[X̄(Xi − X̄)/Σ(Xi − X̄)²] Var[Yi] = [σ²X̄/(n Σ(Xi − X̄)²)] Σ (Xi − X̄) = 0

⇒ Var[β̂0] = Var[U] + Var[W] − 2 Cov[U, W] = σ²/n + X̄²σ²/Σ(Xi − X̄)² = σ² [1/n + X̄²/Σ(Xi − X̄)²]
Since Cov[U, W] = Cov[Ȳ, β̂1 X̄] = 0 and both are linear functions of the normal Yi, Ȳ and β̂1 are independent. We can also write β̂0 as a single linear function of the Yi, allowing use of the moment generating function method to determine its normal sampling distribution:
" 2
#
Xn
1 X (Xi − X) X
n
β̂0 = − Pn Yi = ai Yi
i=1
n i=1 (Xi − X)
2
i=1
1.4.3 Distribution of Ŷi
The distribution of Ŷi, which is an estimate of the population mean of Y at the level Xi of the independent variable, is obtained as follows:
Writing Ŷi = β̂0 + β̂1 Xi = Ȳ + β̂1 (Xi − X̄),

Var[Ŷi] = Var[Ȳ] + (Xi − X̄)² Var[β̂1] + 2(Xi − X̄) Cov[Ȳ, β̂1] = σ²/n + (Xi − X̄)² σ²/Σ(Xi − X̄)² + 0

= σ² [1/n + (Xi − X̄)²/Σ(Xi − X̄)²]
Further, since Ŷi is a linear function of the Yi , then Ŷi has a normal sampling distribution.
E[Ŷ0 − Y0 ] = 0
Var[Ŷpred0] = Var[Ŷ0 − Y0] = σ² [1/n + (X0 − X̄)²/Σ(Xi − X̄)²] + σ² = σ² [1 + 1/n + (X0 − X̄)²/Σ(Xi − X̄)²]
• s²(Ŷi) = s² [1/n + (Xi − X̄)²/Σ(Xi − X̄)²]      Estimated variance of the estimated mean at Xi

• s²(Ŷpred0) = s² [1 + 1/n + (X0 − X̄)²/Σ(Xi − X̄)²]      Estimated variance of a prediction at X0
1.4.6 Examples
Estimated variances are computed for both of the previous examples.
Here we obtain estimated variances for β̂1 , β̂0 , the true mean, and a future score when the tissue
concentration is 5.0:
s²(β̂1) = s²/Σ(Xi − X̄)² = 50.78/22.48 = 2.26

s²(β̂0) = s² [1/n + X̄²/Σ(Xi − X̄)²] = 50.78 [1/7 + 4.3329²/22.48] = 49.66

s²(Ŷ5) = s² [1/n + (5 − X̄)²/Σ(Xi − X̄)²] = 50.78 [1/7 + (5 − 4.3329)²/22.48] = 8.26

s²(Ŷpred0) = s² [1 + 1/n + (5 − X̄)²/Σ(Xi − X̄)²] = 50.78 [1 + 1/7 + (5 − 4.3329)²/22.48] = 59.04
Here we obtain estimated variances for β̂1 , β̂0 , the true mean, and a future cost when the
production output is 30.0:
s²(β̂1) = s²/Σ(Xi − X̄)² = 38.87/7738.94 = 0.0050

s²(β̂0) = s² [1/n + X̄²/Σ(Xi − X̄)²] = 38.87 [1/48 + 31.0673²/7738.94] = 5.66

s²(Ŷ30) = s² [1/n + (30 − X̄)²/Σ(Xi − X̄)²] = 38.87 [1/48 + (30 − 31.0673)²/7738.94] = 0.82

s²(Ŷpred0) = s² [1 + 1/n + (30 − X̄)²/Σ(Xi − X̄)²] = 38.87 [1 + 1/48 + (30 − 31.0673)²/7738.94] = 39.69
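A minimal sketch of these estimated-variance formulas for the LSD example (s² = 50.78, X0 = 5.0) is shown below; the same code applies to the hosiery mill example with its own s², X values, and X0 = 30.

    import numpy as np

    X = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
    n = X.size
    s2 = 50.78                                    # MS(RESIDUAL) from the ANOVA
    Sxx = ((X - X.mean()) ** 2).sum()             # about 22.48
    x0 = 5.0

    var_b1 = s2 / Sxx                                         # about 2.26
    var_b0 = s2 * (1 / n + X.mean() ** 2 / Sxx)               # about 49.66
    var_mean = s2 * (1 / n + (x0 - X.mean()) ** 2 / Sxx)      # about 8.26
    var_pred = s2 * (1 + 1 / n + (x0 - X.mean()) ** 2 / Sxx)  # about 59.04
    print(var_b1, var_b0, var_mean, var_pred)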
To determine whether there is a negative association between math scores and LSD concentration, we conduct the following test at the α = 0.05 significance level. Note that s(β̂1) = √(s²(β̂1)) = √2.26 = 1.50.
H0 : β1 = 0 Ha : β1 < 0
TS: t0 = (β̂1 − 0)/s(β̂1) = −9.01/1.50 = −6.01      RR: t0 ≤ −t(.05,5) = −2.015      P-value = P(t ≤ −6.01)
A 95% confidence interval for the mean score when the tissue concentration is 5.0 uses Ŷ5 and its estimated standard error √8.26 = 2.87:

Ŷ5 = 89.12 − 9.01(5) = 44.07      44.07 ± 2.571(2.87) ≡ 44.07 ± 7.38 ≡ (36.69, 51.45)
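A sketch of the one-sided t test and the 95% interval for the mean at a concentration of 5.0, using SciPy only for the t quantiles (the summary numbers are taken from the calculations above):

    import numpy as np
    from scipy import stats

    b1, se_b1 = -9.0095, np.sqrt(2.26)      # slope estimate and its standard error
    df = 5                                   # n - 2 for the LSD data

    t0 = b1 / se_b1                          # about -6.01
    crit = -stats.t.ppf(0.95, df)            # about -2.015; reject if t0 <= crit
    p_val = stats.t.cdf(t0, df)              # P(t <= -6.01)

    yhat5, se_mean5 = 44.07, np.sqrt(8.26)   # estimated mean at X = 5 and its SE
    tq = stats.t.ppf(0.975, df)              # about 2.571
    ci = (yhat5 - tq * se_mean5, yhat5 + tq * se_mean5)   # about (36.69, 51.45)
    print(t0, crit, p_val, ci)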
Here, we use the F -test to determine whether there is an association between product costs and
the production output at α = 0.05 significance level.
H0: β1 = 0      HA: β1 ≠ 0
TS: F0 = MS(REGRESSION)/MS(RESIDUAL) = 31125.98/38.87 = 800.77      RR: F0 ≥ F(.05,1,46) ≈ 4.05      P-value = P(F ≥ 800.77)
Unit variable cost is the average increment in total production cost per unit increase in produc-
tion output (β1 ). We obtain a 95% confidence interval for this parameter:
β̂1 = 2.0055      s²(β̂1) = .0050      s(β̂1) = √.0050 = .0707      t(.025,46) ≈ 2.015

2.0055 ± 2.015(.0707) ≡ 2.0055 ± 0.1425 ≡ (1.8630, 2.1480)
As the production output increases by 1000 dozen pairs, we are very confident that mean costs increase by between 1.86 and 2.15 thousand dollars. The large sample size n = 48 makes our estimate very precise.
1.6 Regression Through the Origin
In some practical situations, the regression line is expected (theoretically) to pass through the origin. It is important that X = 0 is a reasonable level of X in practice for this to be the case. For instance, in the hosiery mill example, if the firm knows in advance that production will be 0, it can close the plant and incur no costs if it is able to operate in the “short run”; however, most firms still have “long–run” costs if they know they will produce in the future. If a theory does imply that the mean response (Y) is 0 when X = 0, we have a new model:
Yi = β1 Xi + εi      i = 1, . . . , n
The least squares estimate is obtained by taking the derivative of Q = Σ_{i=1}^n (Yi − β1 Xi)² with respect to β1, and setting it equal to 0. The value β̂1 that solves that equality is the least squares estimate of β1.
∂Q/∂β1 = 2 Σ_{i=1}^n (Yi − β1 Xi)(−Xi) = 0

⇒ Σ_{i=1}^n Yi Xi = β̂1 Σ_{i=1}^n Xi²   ⇒   β̂1 = Σ_{i=1}^n Xi Yi / Σ_{i=1}^n Xi²
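A one-line NumPy sketch of this no-intercept estimator, on made-up data used purely for illustration:

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])      # illustrative data only
    Y = np.array([2.1, 3.9, 6.2, 7.8])

    b1 = np.sum(X * Y) / np.sum(X ** 2)     # least squares slope through the origin
    e = Y - b1 * X                          # residuals; note that e.sum() need not be 0
    print(b1, e.sum())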
Note that for this model, the residuals do not necessarily sum to 0:
ei = Yi − Ŷi = Yi − β̂1 Xi = Yi − [Σ Xj Yj / Σ Xj²] Xi

⇒ Σ_{i=1}^n ei = Σ_{i=1}^n Yi − [Σ Xj Yj / Σ Xj²] Σ_{i=1}^n Xi
This last term is not necessarily (and will probably rarely, if ever, in practice be) 0.
The uncorrected sum of squares is:
n
X n
X n
X n
X
Yi2 = Ŷi2 + e2i + 2 Ŷi ei
i=1 i=1 i=1 i=1
The last term (the cross–product term) is still 0 under the no–intercept model:
Σ_{i=1}^n ei Ŷi = Σ_{i=1}^n (Yi − [Σ Xj Yj / Σ Xj²] Xi)([Σ Xj Yj / Σ Xj²] Xi)

= [Σ Xj Yj / Σ Xj²] Σ_{i=1}^n Xi Yi − [Σ Xj Yj / Σ Xj²]² Σ_{i=1}^n Xi² = 0
So, we obtain the same partitioning of the total sum of squares as before:
Σ_{i=1}^n Yi² = Σ_{i=1}^n Ŷi² + Σ_{i=1}^n ei²
The model sum of squares is based on only one parameter, so it is not broken into the components
of mean and regression as it was before. Similarly, the residual sum of squares has n − 1 degrees of
freedom. Assuming the model is correct:
E[MS(MODEL)] = σ² + β1² Σ_{i=1}^n Xi²

E[MS(RESIDUAL)] = σ²
The variance of the estimator β̂1 is:
"P # n
n
X i Yi 1 X
V ar[β̂1 ] = V ar Pi=1
n 2 = Pn 2 2 Xi2 V ar[Yi ]
i=1 Xi ( i=1 Xi ) i=1
n
X
1 σ2
= Pn 2 2
( Xi2 )σ 2 = Pn 2
( i=1 Xi ) i=1 i=1 Xi
σ 2 X02
V ar[Ŷ0 ] = V ar[X0 β̂1 ] = X02 V ar[β̂1 ] = Pn 2
i=1 Xi
SS(RESIDUAL) = Σ Yi² − Σ Ŷi² = 5992.48 − 1268.73 = 4723.75

s² = MS(RESIDUAL) = SS(RESIDUAL)/(n − 1) = 4723.75/927 = 5.10

s²(β̂1) = s²/Σ Xi² = 5.10/3044.92 = .0017      s(β̂1) = .0409
Note that there is a positive association between adult children's heights and their parents' heights. However, as the parents' height increases by 1 inch, the adult child's height increases by between 0.5633 and 0.7257 inches on average. This is an example of regression to the mean.
The least squares estimates β̂0 , β̂1 , . . . , β̂p are the values that minimize the residual sum of
squares:
SS(RESIDUAL) = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n (Yi − (β̂0 + β̂1 Xi1 + · · · + β̂p Xip))²
We will obtain these estimates after we write the model in matrix notation.
Diagonal Matrix – Square matrix where all elements off main diagonal are 0
Scalar – Matrix with one row and one column (single element)
The transpose of a matrix is the matrix generated by interchanging the rows and columns of the matrix. If the original matrix is A, then its transpose is labelled A′. For example:
" # 2 1
2 4 7
A= ⇒ A0 = 4 7
1 7 2
7 2
Matrix addition (subtraction) can be performed on two matrices as long as they are of
equal order (dimension). The new matrix is obtained by elementwise addition (subtraction) of the
two matrices. For example:
" # " # " #
2 4 7 1 3 0 3 7 7
A= B= ⇒ A+B=
1 7 2 2 4 8 3 11 10
Matrix multiplication can be performed on two matrices as long as the number of columns of the first matrix equals the number of rows of the second matrix. The resulting matrix has the same number of rows as the first matrix and the same number of columns as the second matrix. If C = AB and A has s columns and B has s rows, the element in the ith row and jth column of C, which we denote cij, is obtained as follows (with similar definitions for aij and bij):
cij = ai1 b1j + ai2 b2j + · · · + ais bsj = Σ_{k=1}^s aik bkj
For example:
" # 1 5 6
2 4 7
A= B= 2 0 1 ⇒
1 7 2
3 3 3
" # " #
2(1) + 4(2) + 7(3) 2(5) + 4(0) + 7(3) 2(6) + 4(1) + 7(3) 31 31 37
C = AB = =
1(1) + 7(2) + 2(3) 1(5) + 7(0) + 2(3) 1(6) + 7(1) + 2(3) 21 11 19
Note that C has the same number of rows as A and the same number of columns as B. Note that in general AB ≠ BA; in fact, the second product may not even exist due to the dimensions of the matrices. However, the following equality does hold: (AB)′ = B′A′.
Scalar Multiplication can be performed between any scalar and any matrix. Each element
of the matrix is multiplied by the scalar. For example:
" # " #
2 4 7 4 8 14
A= ⇒ 2A =
1 7 2 2 14 4
The determinant is a scalar computed from the elements of a matrix via well–defined (although rather painful) rules. Determinants only exist for square matrices. The determinant of a matrix A is denoted as |A|.
4. The determinant is obtained by summing the products of the elements and their cofactors for any row or column of A. Using row i of A, we get |A| = Σ_{j=1}^n aij θij.

For the matrix A = | 10 5 2 ; 6 8 0 ; 2 5 1 | used below, expanding along the first row:

a11 = 10      A11 = | 8 0 |      |A11| = 8(1) − 0(5) = 8      θ11 = (−1)^{1+1}(8) = 8
                    | 5 1 |

a12 = 5      A12 = | 6 0 |      |A12| = 6(1) − 0(2) = 6      θ12 = (−1)^{1+2}(6) = −6
                   | 2 1 |

a13 = 2      A13 = | 6 8 |      |A13| = 6(5) − 8(2) = 14      θ13 = (−1)^{1+3}(14) = 14
                   | 2 5 |

|A| = a11 θ11 + a12 θ12 + a13 θ13 = 10(8) + 5(−6) + 2(14) = 78
Note that we would have computed 78 regardless of which row or column we used.
An important result in linear algebra states that if |A| = 0, then A is singular, otherwise A is
nonsingular (full rank).
The inverse of a square matrix A, denoted A−1 , is a matrix such that A−1 A = I = AA−1
where I is the identity matrix of the same dimension as A. A unique inverse exists if A is square
and full rank.
The identity matrix, when multiplied by any matrix (such that matrix multiplication exists)
returns the same matrix. That is: AI = A and IA = A, as long as the dimensions of the matrices
conform to matrix multiplication.
θ11 = 8   θ12 = −6   θ13 = 14   θ21 = 5   θ22 = 6   θ23 = −40   θ31 = −16   θ32 = 12   θ33 = 50

A⁻¹ = (1/78) |  8    5  −16 |
             | −6    6   12 |
             | 14  −40   50 |

As a check:

A⁻¹A = (1/78) |  8    5  −16 | | 10 5 2 |   | 1 0 0 |
              | −6    6   12 | |  6 8 0 | = | 0 1 0 | = I3
              | 14  −40   50 | |  2 5 1 |   | 0 0 1 |
To obtain the inverse of a diagonal matrix, simply compute the reciprocal of each diagonal element.
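These hand computations can be checked with NumPy; a sketch for the 3 × 3 matrix used above:

    import numpy as np

    A = np.array([[10.0, 5.0, 2.0],
                  [ 6.0, 8.0, 0.0],
                  [ 2.0, 5.0, 1.0]])

    print(np.linalg.det(A))          # about 78
    A_inv = np.linalg.inv(A)         # equals (1/78) times the adjoint shown above
    print(A_inv * 78)                # recovers the integer entries 8, 5, -16, ...
    print(A_inv @ A)                 # the 3x3 identity, up to rounding error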
The following results are very useful for matrices A, B, C and scalar λ, as long as the matrices’
dimensions are conformable to the operations in use:
1. A + B = B + A
2. (A + B) + C = A + (B + C)
3. (AB)C = A(BC)
4. C(A + B) = CA + CB
5. λ(A + B) = λA + λB
6. (A′)′ = A
7. (A + B)′ = A′ + B′
8. (AB)′ = B′A′
9. (ABC)′ = C′B′A′
A system of linear equations Ax = y can have:

• No solution

• A unique solution

• Infinitely many solutions
A set of linear equations is consistent if any linear dependencies among rows of A also appear
in the rows of y. For example, the following system is inconsistent:
| 1 2 3 | | x1 |   |  6 |
| 2 4 6 | | x2 | = | 10 |
| 3 3 3 | | x3 |   |  9 |
This is inconsistent because the coefficients in the second row of A are twice those in the first row,
but the element in the second row of y is not twice the element in the first row. There will be no
solution to this system of equations.
A set of equations is consistent if r(A) = r([Ay]) where [Ay] is the augmented matrix [A|y].
When r(A) equals the number of unknowns, and A is square:
x = A−1 y
where tr(A) is the trace of A. The subspace of a projection is defined, or spanned, by the columns
or rows of the projection matrix P.
Ŷ = PY is the vector in the subspace spanned by P that is closest to Y in distance. That is:
Ŷ + e = PY + (I − P)Y = Y
“Proof” – Consider p = 3:

a′x = a1x1 + a2x2 + a3x3   ⇒   ∂(a′x)/∂xi = ai   ⇒   ∂(a′x)/∂x = a

x′Ax = [ x1a11 + x2a21 + x3a31   x1a12 + x2a22 + x3a32   x1a13 + x2a23 + x3a33 ] | x1 |
                                                                                 | x2 |
                                                                                 | x3 |

= x1²a11 + x1x2a21 + x1x3a31 + x1x2a12 + x2²a22 + x2x3a32 + x1x3a13 + x2x3a23 + x3²a33

⇒ ∂x′Ax/∂xi = 2aii xi + 2 Σ_{j≠i} aij xj      (aij = aji)

⇒ ∂x′Ax/∂x = | ∂x′Ax/∂x1 |   | 2a11x1 + 2a12x2 + 2a13x3 |
             | ∂x′Ax/∂x2 | = | 2a21x1 + 2a22x2 + 2a23x3 | = 2Ax
             | ∂x′Ax/∂x3 |   | 2a31x1 + 2a32x2 + 2a33x3 |
where i represents the observational unit, the second subscript on X represents the independent variable number, p is the number of independent variables, and p′ = p + 1 is the number of model parameters (including the intercept term). For the model to have a unique set of regression coefficients, we need n > p′.

X — n × p′ model matrix containing a column of 1's and p columns of levels of the independent variables X1, . . . , Xp
Y = Xβ + ε

| Y1 |   | 1  X11  · · ·  X1p | | β0 |   | ε1 |
| Y2 |   | 1  X21  · · ·  X2p | | β1 |   | ε2 |
| .. | = | ..  ..   ..    ..  | | .. | + | .. |
| Yn |   | 1  Xn1  · · ·  Xnp | | βp |   | εn |
For our models, X will be of full column rank, meaning r(X) = p′.
The elements of β are referred to as partial regression coefficients; βj represents the change in E(Y) as the jth independent variable is increased by 1 unit, while all other variables are held constant. The phrases “controlling for all other variables” and “ceteris paribus” are also used to describe this effect.
We will be working with many different models (that is, many different sets of independent
variables). Often we will need to be more specific of which independent variables are in our model.
We denote the partial regression coefficient for X2 in a model containing X1 , X2 , and X3 as β2·13 .
εi ∼ NID(0, σ²)      i = 1, . . . , n
This implies that the joint probability density function for the random errors is:
f(ε1, ε2, . . . , εn) = Π_{i=1}^n fi(εi) = Π_{i=1}^n (2π)^{−1/2} σ^{−1} exp{−εi²/(2σ²)} = (2π)^{−n/2} σ^{−n} exp{−Σ_{i=1}^n εi²/(2σ²)}
In terms of the observed responses Y1 , . . . , Yn , we have:
The least squares estimates are Best Linear Unbiased Estimates (B.L.U.E.). Under the normality assumption, the maximum likelihood estimates are Minimum Variance Unbiased Estimates (M.V.U.E.). In either event, the estimate of β is:

β̂ = (X′X)⁻¹X′Y
For the LSD concentration/math score example, we have the following model for Y = X β + ε:
| 78.93 |   | 1 1.17 |          | ε1 |
| 58.20 |   | 1 2.97 |          | ε2 |
| 67.47 |   | 1 3.26 | | β0 |   | ε3 |
| 37.47 | = | 1 4.69 | | β1 | + | ε4 |
| 45.65 |   | 1 5.83 |          | ε5 |
| 32.92 |   | 1 6.00 |          | ε6 |
| 29.97 |   | 1 6.41 |          | ε7 |
For least squares estimation, we minimize Q(β), the error sum of squares, with respect to β:

Q(β) = (Y − Xβ)′(Y − Xβ) = Y′Y − Y′Xβ − β′X′Y + β′X′Xβ = Y′Y − 2Y′Xβ + β′X′Xβ

Taking the derivative with respect to β and setting it equal to 0 gives the normal equations and the least squares estimate:

X′Xβ̂ = X′Y   ⇒   β̂ = (X′X)⁻¹X′Y
For the LSD concentration/math score example, we have the following normal equations and
least squares estimates:
" #" # " #
0 0 7 30.33 β̂0 350.61
X X β̂ = XY ⇒ =
30.33 153.8905 β̂1 1316.6558
" #
1 153.8905 −30.33
(X0 X)−1 =
7(153.8905) − (30.33)2 −30.33 7
" #" # " # " #
1 153.8905 −30.33 350.61 1 14021.3778 89.129
β̂ = = =
7(157.3246 −30.33 7 1316.6558 7(157.3246 −1417.4607 −9.0095
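A sketch of the same computation in NumPy, solving the normal equations for the LSD data:

    import numpy as np

    x = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
    Y = np.array([78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97])
    X = np.column_stack([np.ones_like(x), x])    # model matrix with a column of 1's

    XtX = X.T @ X                          # [[7, 30.33], [30.33, 153.8905]]
    XtY = X.T @ Y                          # [350.61, 1316.6558]
    beta_hat = np.linalg.solve(XtX, XtY)   # about [89.12, -9.01]
    print(beta_hat)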
Ŷ = Xβ̂ = X(X′X)⁻¹X′Y = PY

Here, P is the projection or hat matrix, and is of dimension n × n. The hat matrix is symmetric and idempotent: P′ = P and PP = P.
For the LSD concentration/math score example, we have the following hat matrix (generated in a computer matrix language):

    | 1 1.17 |
    | 1 2.97 |
    | 1 3.26 |                | 153.8905  −30.33 | |  1     1     1     1     1     1     1   |
P = | 1 4.69 | (1/157.3246)   |  −30.33       7  | | 1.17  2.97  3.26  4.69  5.83  6.00  6.41 |
    | 1 5.83 |
    | 1 6.00 |
    | 1 6.41 |

  |  0.58796  0.33465  0.29384  0.09260 −0.06783 −0.09176 −0.14946 |
  |  0.33465  0.22550  0.20791  0.12120  0.05207  0.04176  0.01690 |
  |  0.29384  0.20791  0.19407  0.12581  0.07139  0.06327  0.04370 |
= |  0.09260  0.12120  0.12581  0.14853  0.16665  0.16935  0.17586 |
  | −0.06783  0.05207  0.07139  0.16665  0.24259  0.25391  0.28122 |
  | −0.09176  0.04176  0.06327  0.16935  0.25391  0.26652  0.29694 |
  | −0.14946  0.01690  0.04370  0.17586  0.28122  0.29694  0.33483 |
The vector of residuals, e is the vector generated by elementwise subtraction between the data
vector Y and the fitted vector Ŷ . It can be written as follows:
e = Y − Ŷ = Y − PY = (I − P)Y
Also, note:
Ŷ + e = PY + (I − P)Y = (P + I − P)Y = Y
For the LSD concentration/math score example, we have the following fitted and residual vec-
tors:
          | 89.12 − 9.01(1.17) = 78.58 |                | 78.93 − 78.58 =  0.35 |
          | 89.12 − 9.01(2.97) = 62.36 |                | 58.20 − 62.36 = −4.16 |
          | 89.12 − 9.01(3.26) = 59.75 |                | 67.47 − 59.75 =  7.72 |
Ŷ = Xβ̂ = | 89.12 − 9.01(4.69) = 46.86 |   e = Y − Ŷ = | 37.47 − 46.86 = −9.39 |
          | 89.12 − 9.01(5.83) = 36.59 |                | 45.65 − 36.59 =  9.06 |
          | 89.12 − 9.01(6.00) = 35.06 |                | 32.92 − 35.06 = −2.14 |
          | 89.12 − 9.01(6.41) = 31.37 |                | 29.97 − 31.37 = −1.40 |
The expectation vector is the vector made up of the elementwise expected values of the
elements of the random vector.
E[Z] = | E(z1) |   | μ1 |
       | E(z2) | = | μ2 | = μz
       | E(z3) |   | μ3 |
The variance-covariance matrix is the 3 × 3 matrix made up of variances (on the main
diagonal) and the covariances (off diagonal) of the elements of Z.
Var[Z] = | Var(z1)      Cov(z1, z2)  Cov(z1, z3) |
         | Cov(z2, z1)  Var(z2)      Cov(z2, z3) | = Vz
         | Cov(z3, z1)  Cov(z3, z2)  Var(z3)     |

       = | E[(z1 − μ1)²]          E[(z1 − μ1)(z2 − μ2)]  E[(z1 − μ1)(z3 − μ3)] |
         | E[(z2 − μ2)(z1 − μ1)]  E[(z2 − μ2)²]          E[(z2 − μ2)(z3 − μ3)] |
         | E[(z3 − μ3)(z1 − μ1)]  E[(z3 − μ3)(z2 − μ2)]  E[(z3 − μ3)²]         |

       = E[(Z − μz)(Z − μz)′] = Vz
Now let A be a k × n matrix of constants and z be an n × 1 random vector with mean vector μz and variance-covariance matrix Vz. Suppose further that we can write A and U = Az as follows:

A = | a1′ |            U = Az = | a1′z |   | u1 |
    | a2′ |                     | a2′z | = | u2 |
    | ... |                     |  ... |   | ...|
    | ak′ |                     | ak′z |   | uk |

where ai′ is a 1 × n row vector of constants.
To obtain E[U] = μu, consider each element of U, namely ui, and its expectation E(ui):

E[ui] = E[ai′z] = E[ai1 z1 + ai2 z2 + · · · + ain zn] = ai1 E[z1] + ai2 E[z2] + · · · + ain E[zn] = ai′μz      i = 1, . . . , k
To obtain the variance covariance matrix of U = Az, first consider the definition of V[U], then
write in terms of U = Az:
E[Yi] = μ   i = 1, . . . , n      Var[Yi] = σ²   i = 1, . . . , n      Var[Y] = σ²I

Writing Ȳ = a′Y with a = (1/n, 1/n, . . . , 1/n)′:

Var[Ȳ] = a′ Var[Y] a = [1/n 1/n · · · 1/n] σ²I [1/n 1/n · · · 1/n]′ = σ² a′a = σ² Σ_{i=1}^n (1/n)² = σ² (n/n²) = σ²/n
Now consider the least squares estimator β̂ of the regression coefficient parameter vector β.
Var[β̂] = Var[A′Y] = A′ Var[Y] A = (X′X)⁻¹X′ σ²I X(X′X)⁻¹ = σ²(X′X)⁻¹(X′X)(X′X)⁻¹ = σ²(X′X)⁻¹
3.6 Multivariate Normal Distribution
Suppose a n × 1 random vector Z has a multivariate normal distribution with mean vector 0 and
variance-covariance matrix σ 2 I. This would occur if we generated n independent standard normal
random variables and put them together in vector form. The density function for Z, evaluated any
fixed point z is:
Z ∼ N(0, σ²I)   ⇒   fZ(z) = (2π)^{−n/2} |σ²I|^{−1/2} exp{−(1/2) z′(σ²I)⁻¹z}
Assuming the model is correct, we've already obtained the mean and variance of β̂. We further know that its distribution is multivariate normal: β̂ ∼ N(β, σ²(X′X)⁻¹).
The vector of fitted values Ŷ = Xβ̂ = X(X′X)⁻¹X′Y = PY is also a linear function of Y and thus also normally distributed, with the following mean vector and variance-covariance matrix:

E[Ŷ] = X(X′X)⁻¹X′ Xβ = Xβ

Var[Ŷ] = X(X′X)⁻¹X′ σ²I X(X′X)⁻¹X′ = σ² X(X′X)⁻¹X′X(X′X)⁻¹X′ = σ² X(X′X)⁻¹X′ = σ²P

That is, Ŷ ∼ N(Xβ, σ²P).
The vector of residuals e = Y − Ŷ = (I − P)Y is also normal, with mean vector and variance-covariance matrix:

ε ∼ N(0, σ²I)   ⇒   e ∼ N(0, σ²(I − P))
Often the goal is to predict a future outcome when the independent variables are at a given setting, x0′ = [1 x01 · · · x0p]. The future observation Y0 and its predicted value based on the estimated regression equation are:
Y0 = x0′β + ε0      ε0 ∼ NID(0, σ²)
It is assumed ε0 ∼ N (0, σ 2 ) and is independent from the errors in the observations used to fit
the regression model (ε1 , . . . , εn ).
The prediction error is Y0 − Ŷ0, with variance:

V[Y0 − Ŷ0] = V[Y0] + V[Ŷ0] = σ² + σ² x0′(X′X)⁻¹x0 = σ² [1 + x0′(X′X)⁻¹x0]
2. The degrees of freedom associated with any quadratic form is equal to the rank of the defining
matrix, which is equal to its trace when the defining matrix is idempotent.
3. Two quadratic forms are orthogonal if the product of their defining matrices is 0.
Note that Y′Y = Y′IY, so that I is the defining matrix, which is symmetric and idempotent. The degrees of freedom for SS(TOTAL UNCORRECTED) is then the rank of I, which is its trace, or n.
Now, we decompose the total uncorrected sum of squares into its model and error components.
Y′Y = [PY + (I − P)Y]′[PY + (I − P)Y]

= (PY)′(PY) + (PY)′(I − P)Y + [(I − P)Y]′PY + [(I − P)Y]′[(I − P)Y]

= Y′P′PY + Y′P′(I − P)Y + Y′(I − P)′PY + Y′(I − P)′(I − P)Y

= Y′PPY + (Y′PY − Y′PPY) + (Y′PY − Y′PPY) + (Y′IY − Y′IPY − Y′PIY + Y′PPY)

= Y′PY + (Y′PY − Y′PY) + (Y′PY − Y′PY) + (Y′Y − Y′PY − Y′PY + Y′PY)

= Y′PY + 0 + 0 + (Y′Y − Y′PY) = Y′PY + Y′(I − P)Y

= Y′P′PY + Y′(I − P)′(I − P)Y = Ŷ′Ŷ + e′e
We obtain the degrees of freedom as follows, making use of the following identities regarding the trace of matrices:

tr(AB) = tr(BA)      tr(A + B) = tr(A) + tr(B)
SS(MODEL) = Ŷ′Ŷ = Y′PY

df(MODEL) = tr(P) = tr(X(X′X)⁻¹X′) = tr((X′X)⁻¹X′X) = tr(Ip′) = p′ = p + 1

SS(RESIDUAL) = e′e = Y′(I − P)Y

df(RESIDUAL) = tr(I − P) = tr(In) − tr(P) = n − p′ = n − p − 1
Table 8 gives the Analysis of Variance including degrees of freedom, and sums of squares (both
definitional and computational forms).
Source of Variation      Degrees of Freedom      Sum of Squares (Definitional)      Sum of Squares (Computational)
TOTAL (UNCORRECTED)      n                       Y′Y                                Y′Y
MODEL                    p′ = p + 1              Ŷ′Ŷ = Y′PY                         β̂′X′Y
ERROR                    n − p′                  e′e = Y′(I − P)Y                   Y′Y − β̂′X′Y
The total uncorrected sum of squares represents variation (in Y ) around 0. We are usually
interested in variation around the sample mean Y . We will partition the model sum of squares into
two components: SS(REGRESSION) and SS(µ). The first sum of squares is associated with β1
and the second one is associated with β0 .
Consider the following model; we obtain the least squares estimate and the model sum of squares.

Yi = β0 + εi = μ + εi

Y = Xβ + ε      X = | 1 | = 1      β = [μ] = [β0]
                    | 1 |
                    | ..|
                    | 1 |

β̂ = (X′X)⁻¹X′Y = (1′1)⁻¹1′Y      1′1 = n      1′Y = Σ_{i=1}^n Yi

⇒ β̂ = (1′1)⁻¹1′Y = Σ_{i=1}^n Yi / n = Ȳ

SS(μ) = β̂′X′Y = Ȳ (Σ_{i=1}^n Yi) = nȲ²

Equivalently, as a quadratic form in Y:

SS(μ) = Y′1(1′1)⁻¹1′Y = Y′((1/n)11′)Y = Y′((1/n)J)Y

where J = 11′ is the n × n matrix whose elements are all 1 (so (1/n)J has every element equal to 1/n).
To demonstrate that the defining matrix for SS(REGRESSION) is idempotent and that the three sums of squares are orthogonal, consider the following algebra, where X∗ is the matrix made up of the columns of X associated with the p independent variables (and not the column for the intercept).
X = [1|X∗]      PX = P[1|X∗] = X = [1|X∗]   ⇒   P1 = 1   ⇒   PJ = J

X′ = [1|X∗]′      X′P = [1|X∗]′P = X′ = [1|X∗]′   ⇒   1′P = 1′   ⇒   JP = J

(P − (1/n)J)(P − (1/n)J) = PP − P((1/n)J) − ((1/n)J)P + ((1/n)J)((1/n)J) = P − (1/n)J − (1/n)J + (1/n)J = P − (1/n)J
Summarizing what we have obtained so far (where all defining matrices are idempotent):
SS(μ) = Y′((1/n)J)Y      df(μ) = tr((1/n)J) = (1/n)(n) = 1

SS(REGRESSION) = Y′(P − (1/n)J)Y      df(REGRESSION) = tr(P − (1/n)J) = tr(P) − tr((1/n)J) = p′ − 1 = p + 1 − 1 = p

SS(RESIDUAL) = Y′(I − P)Y      df(RESIDUAL) = tr(I − P) = tr(I) − tr(P) = n − p′
To show that the sums of squares for the mean, regression, and residual are pairwise orthogonal,
consider the products of their defining matrices: First for SS(µ) and SS(REGRESSION):
((1/n)J)(P − (1/n)J) = (1/n)JP − ((1/n)J)((1/n)J) = (1/n)J − (1/n)J = 0
For the LSD concentration/math score example, we have the ANOVA in Table 9.
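A sketch (not from the original notes) that builds these defining matrices for the LSD data and checks the traces and the orthogonal partition numerically:

    import numpy as np

    x = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
    Y = np.array([78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97])
    n = x.size
    X = np.column_stack([np.ones(n), x])

    P = X @ np.linalg.inv(X.T @ X) @ X.T    # hat (projection) matrix
    J = np.ones((n, n))                     # J = 1 1', the n x n matrix of 1's
    I = np.eye(n)

    ss_mean = Y @ (J / n) @ Y               # n * Ybar^2
    ss_reg  = Y @ (P - J / n) @ Y           # about 1824.30
    ss_res  = Y @ (I - P) @ Y               # about 253.92

    print(np.trace(J / n), np.trace(P - J / n), np.trace(I - P))   # 1, p, n - p'
    print(ss_mean + ss_reg + ss_res, Y @ Y)                        # equal: partition of Y'Y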
• A unique least squares solution exists iff the model is full rank.
• All defining matrices in the Analysis of Variance are idempotent.
• The defining matrices for the mean, regression, and residual are pairwise orthogonal and
sum to I. Thus they partition the total uncorrected sum of squares into orthogonal sums of
squares.
• Degrees of freedom for quadratic forms are the ranks of their defining matrices; when idempotent, the trace of a matrix is its rank.
For a random vector Y with E[Y] = μ and Var[Y] = VY = Vσ², the expected value of a quadratic form is E[Y′AY] = σ² tr(AV) + μ′Aμ. For our model, E[Y] = Xβ and Var[Y] = σ²In, so

E[SS(REGRESSION)] = E[Y′(P − (1/n)J)Y] = σ² tr(P − (1/n)J) + β′X′(P − (1/n)J)Xβ

= σ²(p′ − 1) + β′X′Xβ − (1/n)β′X′JXβ = pσ² + β′X′(I − (1/n)J)Xβ
This last matrix can be seen to involve the regression coefficients: β1 , . . . , βp , and not β0 as
follows:
X′(I − (1/n)J)X = X′X − (1/n)X′JX, where all sums below run over i = 1, . . . , n.

X′X = |   n        Σ Xi1          · · ·   Σ Xip      |
      | Σ Xi1      Σ Xi1²         · · ·   Σ Xi1 Xip  |
      |  ..         ..             ..      ..        |
      | Σ Xip      Σ Xip Xi1      · · ·   Σ Xip²     |

(1/n)X′JX = (1/n) |   n²         n Σ Xi1           · · ·   n Σ Xip          |
                  | n Σ Xi1      (Σ Xi1)²          · · ·   (Σ Xi1)(Σ Xip)   |
                  |  ..           ..                ..      ..              |
                  | n Σ Xip      (Σ Xip)(Σ Xi1)    · · ·   (Σ Xip)²         |

           = |   n        Σ Xi1                · · ·   Σ Xip             |
             | Σ Xi1      (Σ Xi1)²/n           · · ·   (Σ Xi1)(Σ Xip)/n  |
             |  ..         ..                   ..      ..               |
             | Σ Xip      (Σ Xip)(Σ Xi1)/n     · · ·   (Σ Xip)²/n        |

⇒ X′(I − (1/n)J)X = X′X − (1/n)X′JX

= | 0      0                             · · ·   0                                |
  | 0      Σ Xi1² − (Σ Xi1)²/n           · · ·   Σ Xi1 Xip − (Σ Xi1)(Σ Xip)/n     |
  | ..      ..                            ..      ..                              |
  | 0      Σ Xip Xi1 − (Σ Xip)(Σ Xi1)/n  · · ·   Σ Xip² − (Σ Xip)²/n              |

= | 0      0                            · · ·   0                          |
  | 0      Σ (Xi1 − X̄1)²                · · ·   Σ (Xi1 − X̄1)(Xip − X̄p)     |
  | ..      ..                           ..      ..                        |
  | 0      Σ (Xip − X̄p)(Xi1 − X̄1)       · · ·   Σ (Xip − X̄p)²              |
Thus, E[SS(REGRESSION)] involves a quadratic form in β1, . . . , βp, and not in β0, since the first row and column of the previous matrix are made up of 0's. Now, we return to E[SS(RESIDUAL)]:
Now we can obtain the expected values of the mean squares from the Analysis of Variance:
MS(REGRESSION) = SS(REGRESSION)/p   ⇒   E[MS(REGRESSION)] = σ² + (1/p) β′X′(I − (1/n)J)Xβ

MS(RESIDUAL) = SS(RESIDUAL)/(n − p′)   ⇒   E[MS(RESIDUAL)] = σ²
Note that the second term in E[MS(REGRESSION)] is a quadratic form in β; if any βi ≠ 0 (i = 1, . . . , p), then E[MS(REGRESSION)] > E[MS(RESIDUAL)], otherwise they are equal.
Y = Xβ + Zγ + ε      ε ∼ N(0, σ²I)
E[MS(RESIDUAL)] = E[SS(RESIDUAL)]/(n − p′) = σ² + [1/(n − p′)] γ′Z′(I − P)Zγ
which is larger than σ 2 if the elements of γ are not all equal to 0 (which would make our fitted
model correct).
Estimator      Theoretical Variance      Estimated Variance
β̂              σ²(X′X)⁻¹                s²(X′X)⁻¹

Table 10: Theoretical and estimated variances of regression estimators in matrix form
Y ∼ N(μ, Vσ²)      r(V) = n

then:

1. Y′(1/σ²)AY is distributed noncentral χ² with:
Ω(MEAN) = [1/(2σ²)] β′X′(1/n)JXβ = (1′Xβ)²/(2σ²n)
The last equality is obtained by recalling that J = 11′, and:
Since we have already shown that the quadratic forms for SS(µ), SS(REGRESSION), and
SS(RESIDUAL) are all pairwise orthogonal, and in our current model V = I, then these sums of
squares are all independent due to the second part of Cochran’s Theorem.
Consider any linear function K′β̂ = K′(X′X)⁻¹X′Y = BY. Then, by the last part of Cochran's Theorem, K′β̂ is independent of SS(RESIDUAL):
B = K′(X′X)⁻¹X′      V = I      A = (I − P)

⇒ BVA = K′(X′X)⁻¹X′ − K′(X′X)⁻¹X′P = K′(X′X)⁻¹X′ − K′(X′X)⁻¹X′X(X′X)⁻¹X′ = K′(X′X)⁻¹X′ − K′(X′X)⁻¹X′ = 0
F = (X1²/ν1) / (X2²/ν2)

where X1² is distributed noncentral χ² with ν1 degrees of freedom and noncentrality parameter Ω1, and X2² is distributed central χ² with ν2 degrees of freedom. Further, assume that X1² and X2² are independent. Then F is distributed noncentral F with ν1 numerator and ν2 denominator degrees of freedom, and noncentrality parameter Ω1.
• Under this model, F0 = MS(REGRESSION)/MS(RESIDUAL) is distributed noncentral F with p numerator and n − p′ denominator degrees of freedom, and noncentrality parameter Ω = β′X′(P − (1/n)J)Xβ/(2σ²).
• The noncentrality parameter for SS(REGRESSION) does not involve β0 , and for full rank X, Ω =
0 ⇐⇒ β1 = β2 = · · · = βp = 0, otherwise Ω > 0
This theory leads to the F -test to determine whether the set of p regression coefficients β1 , β2 , . . . , βp
are all equal to 0:
• H0: β∗ = 0   where   β∗ = | β1 |
                            | β2 |
                            | .. |
                            | βp |
• HA: β∗ ≠ 0
• TS: F0 = [SS(REGRESSION)/p] / [SS(RESIDUAL)/(n − p′)] = MS(REGRESSION)/MS(RESIDUAL)
• RR: F0 ≥ F(α, p, n−p′)
• The power of the test under a specific alternative can be found by finding the area under
the relevant noncentral-F distribution to the right of the critical value defining the rejection
region.
Example 1 – LSD Pharmacodynamics
Suppose that the true parameter values are: β0 = 90, β1 = −10, and σ² = 50 (these are consistent with the least squares estimates). Recall that the fact that β0 ≠ 0 has no bearing on the F-test, only that β1 ≠ 0. Then:
Ω = β′X′(P − (1/n)J)Xβ / (2σ²) = 22.475
Figure 3 gives the central-F distribution (the distribution of the test statistic under the null hy-
pothesis) and the noncentral-F distribution (the distribution of the test statistic under this specific
alternative hypothesis). Further, the power of the test under these specific parameter levels is the
area under the noncentral-F distribution to the right of Fα,1,5 . Table 4.3.1 gives the power (the
probability we reject H0 under H0 and several sets of values in the alternative hypothesis) for three
levels of α, where F(.100,1,5) = 4.06, F(.050,1,5) = 6.61, and F(.010,1,5) = 16.26. (The reason that the column for the noncentrality parameter is labelled 2Ω is that SAS's function for returning a tail area from a noncentral-F distribution uses a noncentrality parameter that is twice the Ω used in this section's notation.)
[Figure 3: densities of the central F (null) and noncentral F (alternative) distributions of the test statistic, plotted over 0 ≤ f ≤ 100.]
Note that as the true slope parameter moves further away from 0 for a fixed α level, the power
increases. Also, as α (the size of the rejection region) decreases, so does the power of the test.
Under the null hypothesis (β1 = 0), the size of the rejection region is the power of the test (by
definition).
β1       2Ω       Power at α = 0.100     α = 0.050     α = 0.010
 0        0       0.100                  0.050         0.010
−2       1.80     0.317                  0.195         0.053
−4       7.19     0.744                  0.579         0.239
−6      16.18     0.962                  0.890         0.557
−8      28.77     0.998                  0.987         0.831
−10     44.95     1.000                  0.999         0.959
Table 11: Power = Pr(Reject H0 ) under several configurations of Type I error rate (α) and slope
parameter (β1 ) for σ 2 = 50
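A sketch that reproduces the entries of Table 11 with SciPy is given below. It assumes that SciPy's noncentral-F routines use the conventional noncentrality parameter λ, which is 2Ω in this section's notation (the same convention attributed to SAS above); the Sxx value 22.475 is the corrected sum of squares of X from the LSD data.

    import numpy as np
    from scipy import stats

    Sxx, sigma2 = 22.475, 50.0           # corrected SS of X; true error variance
    df1, df2 = 1, 5

    for beta1 in [0, -2, -4, -6, -8, -10]:
        lam = beta1 ** 2 * Sxx / sigma2            # lambda = 2 * Omega
        powers = []
        for alpha in [0.10, 0.05, 0.01]:
            crit = stats.f.ppf(1 - alpha, df1, df2)                 # central-F critical value
            power = stats.ncf.sf(crit, df1, df2, lam) if lam > 0 else alpha
            powers.append(round(power, 3))
        print(beta1, round(lam, 2), powers)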
H0: K′β = m      HA: K′β ≠ m

Parameter – K′β − m

Estimator – K′β̂ − m      E[K′β̂ − m] = K′β − m

Q = (K′β̂ − m)′[K′(X′X)⁻¹K]⁻¹(K′β̂ − m)

That is, Q is a quadratic form in K′β̂ − m with A = [K′(X′X)⁻¹K]⁻¹ = V⁻¹. Making use of the earlier result regarding expectations of quadratic forms, namely:

E[Q] = σ² tr[(K′(X′X)⁻¹K)⁻¹(K′(X′X)⁻¹K)] + (K′β − m)′[K′(X′X)⁻¹K]⁻¹(K′β − m)

= σ² tr[Ik] + (K′β − m)′[K′(X′X)⁻¹K]⁻¹(K′β − m) = kσ² + (K′β − m)′[K′(X′X)⁻¹K]⁻¹(K′β − m)
Now, AV = Ik is idempotent and r(A) = r(K) = k (with the restrictions on K′ stated above). So, as long as ε satisfies our usual assumptions (normality, constant variance, independent elements), then Q/σ² is distributed noncentral χ² with k degrees of freedom and noncentrality parameter:
So, as before, for the test of β ∗ = 0, we have a sum of squares for a hypothesis that is
noncentral-χ2 , in this case having k degrees of freedom. Now, we show that Q is independent
of SS(RESIDUAL), for the case m = 0 (it holds regardless, but the math is messier otherwise).
Q = (K′β̂)′[K′(X′X)⁻¹K]⁻¹(K′β̂)      SS(RESIDUAL) = Y′(I − P)Y

Q = β̂′K[K′(X′X)⁻¹K]⁻¹K′β̂ = Y′X(X′X)⁻¹K[K′(X′X)⁻¹K]⁻¹K′(X′X)⁻¹X′Y

Recall that Y′AY and Y′BY are independent if BVA = 0. Here, V = I.
• H0: K′β − m = 0

• HA: K′β − m ≠ 0

• TS: F0 = (Q/k)/s² = (K′β̂ − m)′[K′(X′X)⁻¹K]⁻¹(K′β̂ − m)/(k s²), where s² = MS(RESIDUAL)

• RR: F0 ≥ F(α, k, n−p′)
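A sketch of this test in NumPy for the LSD example, taking K′ = [0 1] and m = 0 (so the hypothesis is H0: β1 = 0 and k = 1); the resulting F0 is the square of the t statistic obtained earlier.

    import numpy as np

    x = np.array([1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41])
    Y = np.array([78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97])
    n = x.size
    X = np.column_stack([np.ones(n), x])

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    s2 = np.sum((Y - X @ beta_hat) ** 2) / (n - 2)    # MS(RESIDUAL)

    K = np.array([[0.0], [1.0]])        # K' beta picks out beta1
    m = np.array([0.0])
    k = K.shape[1]

    d = K.T @ beta_hat - m
    Q = d @ np.linalg.inv(K.T @ XtX_inv @ K) @ d
    F0 = (Q / k) / s2                   # roughly 36, the square of t0 = -6.01
    print(F0)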
In this case, K′(X′X)⁻¹K is a scalar, with an inverse that is its reciprocal. Also, K′β̂ − m is a scalar.
Q = (K′β̂ − m)′[K′(X′X)⁻¹K]⁻¹(K′β̂ − m) = (K′β̂ − m)² / [K′(X′X)⁻¹K]

⇒ F0 = (K′β̂ − m)² / (s²[K′(X′X)⁻¹K]) = { (K′β̂ − m) / √(s²[K′(X′X)⁻¹K]) }² = t0²
Case 2 – Testing k Specific βj's = 0

In this case, K′β − m is simply a “subvector” of the vector β, and K′(X′X)⁻¹K is a “submatrix” of (X′X)⁻¹. Be careful of row and column labels because of β0.
Suppose we wish to test that the last q < p elements of β are 0, controlling for the remaining
p − q independent variables:
⇒ F0 = (Q/q) / s²
This is a simplification of Case 2, with K′(X′X)⁻¹K being the (j + 1)st diagonal element of (X′X)⁻¹, which we denote cjj, and K′β̂ = β̂j.
" #2
(β̂j )2 (β̂j )2 β̂j
Q= ⇒ F0 = 2 = p = t20
cjj s cjj s2 cjj
H0: K′β = m      HA: K′β ≠ m
First, the Full Model is fit, which allows all parameters to be free (HA), and the least squares estimates are obtained. The residual sum of squares is obtained and labelled SS(RESIDUAL_FULL). Under the full model, with no restriction on the parameters, p′ parameters are estimated and this sum of squares has n − p′ degrees of freedom.
Next, the Reduced Model is fit, which places k ≤ p′ constraints on the parameters (H0). Any remaining parameters are estimated by least squares. The residual sum of squares is obtained and labelled SS(RESIDUAL_REDUCED). Note that since we are forcing certain parameters to take on specific values, SS(RESIDUAL_REDUCED) ≥ SS(RESIDUAL_FULL), with equality only taking place if the estimates from the full model exactly equal the constrained values under H0. With the k constraints, only p′ − k parameters are being estimated and the residual sum of squares has n − (p′ − k) degrees of freedom.
We obtain the sum of squares and degrees of freedom for the test by taking the difference in the residual sums of squares and in their corresponding degrees of freedom:
Q = SS(RESIDUAL_REDUCED) − SS(RESIDUAL_FULL)      df(Q) = (n − (p′ − k)) − (n − p′) = k
As before:
F = (Q/k)/s² = { [SS(RESIDUAL_REDUCED) − SS(RESIDUAL_FULL)] / [(n − (p′ − k)) − (n − p′)] } / { SS(RESIDUAL_FULL) / (n − p′) }
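A generic sketch of the full-versus-reduced comparison is given below; X_full and X_reduced are placeholders for whatever model matrices encode HA and H0 in a given application (the usage example fits the β1 = β2 constraint illustrated next, on simulated data).

    import numpy as np

    def resid_ss(Xmat, Y):
        """Residual sum of squares from the least squares fit of Y on Xmat."""
        beta_hat, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
        r = Y - Xmat @ beta_hat
        return r @ r

    def full_vs_reduced_F(X_full, X_reduced, Y, k):
        """F statistic for k constraints; compare with F(alpha, k, n - p')."""
        n, p_prime = X_full.shape
        ss_full = resid_ss(X_full, Y)
        ss_red = resid_ss(X_reduced, Y)
        return ((ss_red - ss_full) / k) / (ss_full / (n - p_prime))

    # Usage: testing H0: beta1 = beta2 by replacing X1, X2 with X1 + X2 (simulated data)
    rng = np.random.default_rng(1)
    X1, X2, X3 = rng.normal(size=(3, 30))
    Y = 2 + 1.5 * X1 + 1.5 * X2 - X3 + rng.normal(size=30)
    X_full = np.column_stack([np.ones(30), X1, X2, X3])
    X_reduced = np.column_stack([np.ones(30), X1 + X2, X3])
    print(full_vs_reduced_F(X_full, X_reduced, Y, k=1))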
H0: β1 = β2      (k = 1)

Yi = β0 + β1 Xi1 + β1 Xi2 + β3 Xi3 + β4 Xi4 + εi = β0 + β1 (Xi1 + Xi2) + β3 Xi3 + β4 Xi4 + εi = β0 + β1 Xi1∗ + β3 Xi3 + β4 Xi4 + εi      where Xi1∗ = Xi1 + Xi2
H0: β0 = 100, β1 = 5      (k = 2)

Yi = 100 + 5Xi1 + β2 Xi2 + β3 Xi3 + β4 Xi4 + εi   ⇒   Yi − 100 − 5Xi1 = β2 Xi2 + β3 Xi3 + β4 Xi4 + εi   ⇒   Yi∗ = β2 Xi2 + β3 Xi3 + β4 Xi4 + εi      where Yi∗ = Yi − 100 − 5Xi1
• When β0 is in the model, and is not involved in H0: K′β = 0, then we can use Q = SS(MODEL_FULL) − SS(MODEL_REDUCED).

• When β0 ≠ 0 is in the reduced model, you cannot use the difference in regression sums of squares, since SS(TOTAL UNCORRECTED) differs between the two models.
Best practice is always to use the error sums of squares.
4.4.3 R-Notation to Label Sums of Squares
Many times in practice, we wish to test that a subset of the partial regression coefficients are all
equal to 0. We can write the model sum of squares for a model containing β0 , β1 , . . . , βp as:
R(β0 , β1 , . . . , βp ) = SS(MODEL)
The logic is to include all βi in R(·) that are in the model being fit. Returning to the case of testing
the last q < p regression coefficients are 0:
Partial (TYPE III) Sums of Squares: R(β0, . . . , βi−1, βi, βi+1, . . . , βp) − R(β0, . . . , βi−1, βi+1, . . . , βp) = R(βi | β0, . . . , βi−1, βi+1, . . . , βp)
Sequential (TYPE I) Sums of Squares: R(β0 , . . . , βi−1 , βi ) − R(β0 , . . . , βi−1 ) = R(βi |β0 , . . . , βi−1 )
SS(REGRESSION) = R(β1 , . . . , βp |β0 ) = R(β1 |β0 ) + R(β2 |β1 , β0 ) + · · · + R(βp |β1 , . . . , βp−1 )
The last statement shows that sequential sums of squares (corrected for the mean) sum to the
regression sum of squares. The partial sums of squares do not sum to the regression sum of squares
unless the last p columns of X are mutually pairwise orthogonal, in which case the partial and
sequential sums of squares are identical.
(β̂j − βj)/√(s² cjj) ∼ t(n−p′)   ⇒   Pr{β̂j − t(α/2, n−p′) √(s² cjj) ≤ βj ≤ β̂j + t(α/2, n−p′) √(s² cjj)} = 1 − α

⇒ (1 − α)100% Confidence Interval for βj:   β̂j ± t(α/2, n−p′) √(s² cjj)
By a similar argument, we have a (1 − α)100% confidence interval for the mean at a given combination of levels of the independent variables, where Ŷ0 = x0′β̂:

Ŷ0 ± t(α/2, n−p′) √(s² x0′(X′X)⁻¹x0)
From the section on the general linear tests, if we set K′ = Ip′, we have the following distributional property:

(β̂ − β)′[(X′X)⁻¹]⁻¹(β̂ − β) / (p′s²) ∼ F(p′, n−p′)

⇒ Pr{(β̂ − β)′(X′X)(β̂ − β) ≤ p′ s² F(1−α, p′, n−p′)} = 1 − α
Values of β in this set constitute a joint (1 − α)100% confidence region for β.
4.6 A Test for Model Fit
A key assumption for the model is that the relation between Y and X is linear (that is E[Y] = X β).
However, the relationship may be nonlinear. S-shaped functions are often seen in biological and
business applications, as well as the general notion of “diminishing marginal returns,” for instance.
A test can be conducted, when replicates are obtained at various combinations of levels of the
independent variables. Suppose we have c unique levels of x in our sample. It’s easiest to consider
the test when there is a single independent variable (but it generalizes straightforwardly). We
obtain the sample size (nj), mean (Ȳj) and variance (s²j) at each unique level of X (Ȳ is the overall sample mean for Y). We obtain a partition of SS(TOTAL CORRECTED) to test:
H0: E[Yi] = β0 + β1 Xi      HA: E[Yi] = μi ≠ β0 + β1 Xi
where µi is the mean of all observations at the level Xi and is not linear in Xi . The alternative
can be interpreted as a 1–way ANOVA where the means are not necessarily equal. The partition
is given in Table 12, with the following identities with respect to sums of squares:
where SS(RESIDUAL) = SS(LF) + SS(PE). Intuitively, these sums of squares and their degrees
of freedom can be written as:
SS(LF) = Σ_{i=1}^n (Ȳ(i) − Ŷi)² = Σ_{j=1}^c nj (Ȳj − Ŷ(j))²      df_LF = c − 2

SS(PE) = Σ_{i=1}^n (Yi − Ȳ(i))² = Σ_{j=1}^c (nj − 1) s²j      df_PE = n − c
where Ȳ(i) is the mean for the group of observations at the same level of X as observation i and Ŷ(j) is the fitted value at the jth level of X. The test for goodness-of-fit is conducted as follows:
Source                df         SS
Regression (REG)      1          Σ_{i=1}^n (Ŷi − Ȳ)²
Lack of Fit (LF)      c − 2      Σ_{j=1}^c nj (Ȳj − Ŷ(j))²
Pure Error (PE)       n − c      Σ_{j=1}^c (nj − 1) s²j

• TS: F0 = MS(LF)/MS(PE) = [SS(LF)/(c − 2)] / [SS(PE)/(n − c)]

• RR: F0 ≥ F(α, c−2, n−c)
[Figure: plot of the two generating mean functions, Model 1 and Model 2, as functions of size over the range 1000 to 3000.]
Data were generated from each of these models, and the simple linear regression model
was fit. The least squares estimates for model 1 (correctly specified) and for model 2 (incorrectly
fit) were obtained:
Model 1: Ŷ = 5007.22 + 10.0523X
Model 2: Ŷ = 4638.99 + 10.1252X
The observed values, group means, and fitted values from the simple linear regression model are
obtained for both the correct model (1) and the incorrect model (2) in Table 4.6.
The sums of squares for lack of fit are obtained by taking deviations between the group means
(which are estimates of E[Yi ] under HA ) and the fitted values (which are estimates of E[Yi ] under
H0 ).
Model 1:   SS(LF) = Σ_{j=1}^c nj (Ȳ(j) − Ŷ(j))² = 2(15303.96 − 15059.49)² + · · · + 2(35496.89 − 35164.04)² = 631872.71
                          Correct Model (1)                    Incorrect Model (2)
i      Xi        Yi          Ȳ(i)          Ŷi          Yi          Ȳ(i)          Ŷi
1 1000 14836.70 15303.96 15059.49 12557.72 12429.03 14764.23
2 1000 15771.22 15303.96 15059.49 12300.33 12429.03 14764.23
3 1500 20129.51 19925.80 20085.63 21770.87 21045.46 19826.86
4 1500 19722.08 19925.80 20085.63 20320.05 21045.46 19826.86
5 2000 25389.36 25030.88 25111.77 27181.07 27242.41 24889.48
6 2000 24672.40 25030.88 25111.77 27303.75 27242.41 24889.48
7 2500 29988.95 29801.31 30137.90 31139.83 30931.27 29952.10
8 2500 29613.68 29801.31 30137.90 30722.71 30931.27 29952.10
9 3000 35362.12 35496.89 35164.04 33335.17 32799.24 35014.73
10 3000 35631.66 35496.89 35164.04 32263.31 32799.24 35014.73
Table 13: Observed, fitted, and group mean values for the lack of fit test
Model 2:   SS(LF) = Σ_{j=1}^c nj (Ȳ(j) − Ŷ(j))² = 2(12429.03 − 14764.23)² + · · · + 2(32799.24 − 35014.73)² = 36683188.83
The sums of squares for pure error are obtained by taking deviations between the observed outcomes and their group means (this is used as an unbiased estimate of σ² under HA, after dividing through by its degrees of freedom).
Model 1:   SS(PE) = Σ_{i=1}^n (Yi − Ȳ(i))² = (14836.70 − 15303.96)² + · · · + (35631.66 − 35496.89)² = 883418.93

Model 2:   SS(PE) = Σ_{i=1}^n (Yi − Ȳ(i))² = (12557.72 − 12429.03)² + · · · + (32263.31 − 32799.24)² = 1754525.81
The F-statistics for testing between H0 (that the linear model is the true model) and HA (that the true model is not linear) are:

Model 1:   F0 = [SS(LF)/(c − 2)] / [SS(PE)/(n − c)] = (631872.71/3)/(883418.93/5) = 210624.24/176683.79 = 1.19

Model 2:   F0 = (36683188.83/3)/(1754525.81/5) = 12227729.61/350905.16 = 34.85
The rejection region for these tests, based on α = 0.05 significance level is:
RR : F0 ≥ F(.05,3,5) = 5.41
Thus, we fail to reject the null hypothesis that the model is correct when the data were generated
from the correct model. Further, we do reject the null hypothesis when data were generated from
the incorrect model.
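A sketch that reproduces the Model 1 lack-of-fit computations from the data in Table 13; swapping in the Model 2 response column gives the second set of sums of squares.

    import numpy as np

    X = np.repeat([1000.0, 1500.0, 2000.0, 2500.0, 3000.0], 2)
    Y = np.array([14836.70, 15771.22, 20129.51, 19722.08, 25389.36,
                  24672.40, 29988.95, 29613.68, 35362.12, 35631.66])
    n, c = Y.size, 5

    # Simple linear regression fit
    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    b0 = Y.mean() - b1 * X.mean()
    Yhat = b0 + b1 * X                                # about 5007.22 + 10.0523 X

    # Group mean at the level of X for each observation
    Ybar_i = np.array([Y[X == xi].mean() for xi in X])

    ss_lf = np.sum((Ybar_i - Yhat) ** 2)              # about 631872.7
    ss_pe = np.sum((Y - Ybar_i) ** 2)                 # about 883418.9
    F0 = (ss_lf / (c - 2)) / (ss_pe / (n - c))        # about 1.19, below F(.05,3,5) = 5.41
    print(ss_lf, ss_pe, F0)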