Multiple Linear Regression


Multiple Regression

A regression model that involves more than one regressor variable is called a multiple regression model.

Suppose we have k regressors or predictor variables X1, . . . , Xk and it is known that the conditional expectation of Y given X1, . . . , Xk is a linear function of the X's as well as the β's, i.e.,

$$E[Y \mid X_1, \dots, X_k] = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k. \tag{1}$$

Now for n observations on Y and the X's, the model (1) can be expressed as

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n \tag{2}$$

where β0 is the intercept of the regression plane (hyperplane) and βj (j = 1, 2, . . . , k) are

called the partial regression coefficients. (The parameters βj (j = 0, 1, 2, . . . , k) are called the

regression coefficients. This model describes a hyperplane in the k-dimensional space of the

regressor variables Xj .)

• The parameter βj represents the expected change in the response Y per unit change in Xj when all of the remaining regressor variables Xl (l ≠ j) are held constant. For this reason the parameters βj (j = 1, 2, . . . , k) are often called partial regression coefficients.

1 Model in matrix notation

It is often convenient to deal with multiple regression models if they are expressed in matrix

notation. This allows a very compact display of the model, data and results. In matrix

notation, the model given by equation (2) is

$$y = X\beta + \varepsilon \tag{3}$$

(Bold italic symbols denote vectors.)

       
where

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \qquad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
$$

In general, y is an n × 1 vector of the observations, X is an n × p matrix of the levels of

the regressor variables, β is a p × 1 (where p = k + 1) vector of the regression coefficients and

 is an n × 1 vector of random errors.

2 Assumptions

1. Conditional expectation of Y given X1 , X2 , . . . , Xk is a linear function of the parameters

β0 , β1 , β2 , . . . , βk .

2. E(ε) = 0 or E(y) = Xβ.

3. Cov(ε) = σ²I or Cov(y) = σ²I.

4. ε has a multivariate normal distribution, i.e., ε ∼ N(0, σ²I).

5. X is a non-stochastic (non-random/fixed) matrix.

6. n > p and rank(X) = p, i.e., the columns of X matrix are linearly independent.

• If n < p or if there is a linear relationship among the x’s, for example x5 = 2x2 , then X

will not have full column rank.

• If the values of the xij ’s are planned (chosen by the researcher), then the X matrix

essentially contains the experimental design and is sometimes called the design matrix.

3 Ordinary Least-Squares (OLS) estimation

We wish to find the least-squares estimator, β̂, that minimizes the error sum of squares

$$
S = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon^\top \varepsilon = (y - X\beta)^\top (y - X\beta)
= y^\top y - \beta^\top X^\top y - y^\top X\beta + \beta^\top X^\top X\beta
= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta \tag{4}
$$

since β⊤X⊤y is a 1 × 1 matrix, or a scalar, and its transpose (β⊤X⊤y)⊤ = y⊤Xβ is the same scalar. The least-squares estimator must satisfy

$$\frac{\partial S}{\partial \beta}\bigg|_{\hat\beta} = -2X^\top y + 2X^\top X\hat\beta = 0,$$

which simplifies to

$$X^\top X\hat\beta = X^\top y. \tag{5}$$

Equations (5) are the least-squares estimating equations. To solve them, multiply both sides of (5) by the inverse of X⊤X. Thus, the least-squares estimator of β is

$$\hat\beta = (X^\top X)^{-1} X^\top y \tag{6}$$

provided that the inverse matrix (X⊤X)⁻¹ exists. The matrix (X⊤X)⁻¹ will always exist if the regressors are linearly independent, i.e., if no column of the X matrix is a linear combination of the other columns.
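The computation in (6) is easy to carry out numerically. Below is a minimal NumPy sketch (not part of the original notes): it simulates a small data set and solves the estimating equations (5) directly; all names and values are illustrative.

```python
import numpy as np

# Minimal sketch of the OLS estimator in equation (6): beta_hat = (X'X)^{-1} X'y.
# The data below are simulated purely for demonstration.
rng = np.random.default_rng(0)
n, k = 50, 3                                                  # n observations, k regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])    # n x p design matrix, p = k + 1
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the estimating equations X'X beta_hat = X'y; numerically this is preferable to
# forming (X'X)^{-1} explicitly, and gives the same answer when X has full column rank.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)          # should be close to beta_true
```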

• The diagonal elements of X> X are the sums of squares of the elements in the columns

of X, and the off-diagonal elements are the sums of cross products of the elements in

the columns of X. Furthermore, note that the elements of X> y are the sums of cross

products of the columns of X and the observations yi .

• For the simple linear regression model y = α + βx + ε, show that

$$
\hat\beta = (\hat\alpha, \hat\beta)^\top = (X^\top X)^{-1} X^\top y =
\begin{pmatrix}
\bar y - \hat\beta \bar x \\[6pt]
\dfrac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}
\end{pmatrix}.
$$

• The fitted regression model corresponding to the levels of the regressor variables x⊤ = (1, x1, x2, . . . , xk) is

$$\hat y_x = x^\top \hat\beta = \hat\beta_0 + \sum_{j=1}^{k} \hat\beta_j x_j.$$

• The vector of fitted values ŷi corresponding to the observed values yi is

$$\hat y = X\hat\beta = X(X^\top X)^{-1} X^\top y = Hy.$$

• The n × n matrix H = X(X⊤X)⁻¹X⊤ is usually called the hat matrix. It maps the vector of observed values into a vector of fitted values. The hat matrix and its properties play a central role in regression analysis.
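A short NumPy sketch (illustrative only, with simulated data) of the hat matrix and its basic properties:

```python
import numpy as np

# Sketch of the hat matrix H = X (X'X)^{-1} X' and the fitted values y_hat = H y.
rng = np.random.default_rng(1)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # n x n hat (projection) matrix
y_hat = H @ y                              # fitted values

# H is symmetric and idempotent, and tr(H) = p = k + 1.
print(np.allclose(H, H.T), np.allclose(H @ H, H), np.trace(H))
```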

4 Properties of the least-squares estimator β̂

1. Linearity: β̂ is a linear function of the observations.

Proof. We have,

$$\hat\beta_{OLS} = (X^\top X)^{-1} X^\top y = Ay \quad \text{with } A = (a_{ij}),$$

or

$$
\hat\beta_{OLS} =
\begin{pmatrix}
a_{10} & a_{20} & \dots & a_{n0} \\
a_{11} & a_{21} & \dots & a_{n1} \\
\vdots & \vdots & & \vdots \\
a_{1j} & a_{2j} & \dots & a_{nj} \\
\vdots & \vdots & & \vdots \\
a_{1k} & a_{2k} & \dots & a_{nk}
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
\;\Longrightarrow\;
\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_j \\ \vdots \\ \hat\beta_k \end{pmatrix}
=
\begin{pmatrix} \sum a_{i0} y_i \\ \sum a_{i1} y_i \\ \vdots \\ \sum a_{ij} y_i \\ \vdots \\ \sum a_{ik} y_i \end{pmatrix}
\;\Longrightarrow\;
\hat\beta_j = \sum_{i=1}^{n} a_{ij} y_i, \qquad j = 0, 1, 2, \dots, k.
$$

Since the aij's are constants, the elements of β̂ are linear functions of the observations.

2. Unbiasedness: E(β̂) = β.

Proof. We have,

$$
\hat\beta = (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top (X\beta + \varepsilon)
= (X^\top X)^{-1}(X^\top X)\beta + (X^\top X)^{-1} X^\top \varepsilon
= \beta + (X^\top X)^{-1} X^\top \varepsilon.
$$

$$\therefore\; E(\hat\beta) = \beta + (X^\top X)^{-1} X^\top E(\varepsilon) = \beta + 0 = \beta.$$

Thus, β̂ is an unbiased estimator of β irrespective of any distributional properties of ε, provided only that E(ε) = 0.

3. Variance: var(β̂) = σ²(X⊤X)⁻¹.

Proof. We have,

$$\hat\beta = (X^\top X)^{-1} X^\top y = Ay \quad \text{where } A = (X^\top X)^{-1} X^\top.$$

Now,

$$
\begin{aligned}
\operatorname{var}(\hat\beta) &= \operatorname{var}(Ay) = A\operatorname{var}(y)A^\top = A\,\sigma^2 I\,A^\top = \sigma^2 AA^\top \\
&= \sigma^2 \{(X^\top X)^{-1} X^\top\}\{(X^\top X)^{-1} X^\top\}^\top \\
&= \sigma^2 (X^\top X)^{-1} (X^\top X)\{(X^\top X)^{-1}\}^\top \\
&= \sigma^2 \{(X^\top X)^{-1}\}^\top = \sigma^2 (X^\top X)^{-1}.
\end{aligned}
$$

• If we let C = (X⊤X)⁻¹, the variance of β̂j is σ²Cjj and the covariance between β̂j and β̂f is σ²Cjf.

• For the simple linear regression model y = α + βx + ε,

$$
\hat\beta = \begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix}, \qquad
\operatorname{var}(\hat\beta) =
\begin{pmatrix}
\operatorname{var}(\hat\alpha) & \operatorname{cov}(\hat\alpha, \hat\beta) \\
\operatorname{cov}(\hat\alpha, \hat\beta) & \operatorname{var}(\hat\beta)
\end{pmatrix}
= \sigma^2
\begin{pmatrix}
\dfrac{\sum x_i^2}{n \sum (x_i - \bar x)^2} & \dfrac{-\bar x}{\sum (x_i - \bar x)^2} \\[10pt]
\dfrac{-\bar x}{\sum (x_i - \bar x)^2} & \dfrac{1}{\sum (x_i - \bar x)^2}
\end{pmatrix}.
$$

• $$X^\top X = \begin{pmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{pmatrix}
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
= \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}.$$

• $$X^\top y = \begin{pmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}.$$

• $$(X^\top X)^{-1} = \frac{1}{n \sum (x_i - \bar x)^2}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
= \begin{pmatrix}
\dfrac{\sum x_i^2}{n \sum (x_i - \bar x)^2} & \dfrac{-\bar x}{\sum (x_i - \bar x)^2} \\[10pt]
\dfrac{-\bar x}{\sum (x_i - \bar x)^2} & \dfrac{1}{\sum (x_i - \bar x)^2}
\end{pmatrix}.$$
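As a rough illustration of var(β̂) = σ²(X⊤X)⁻¹, the following NumPy sketch (not from the notes; the design and σ are invented) compares the theoretical covariance matrix for a simple linear regression with the empirical covariance of β̂ over repeated simulated samples:

```python
import numpy as np

# Monte Carlo check (illustrative only) that Cov(beta_hat) = sigma^2 (X'X)^{-1}.
rng = np.random.default_rng(2)
n, sigma = 40, 0.7
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])               # simple linear regression design
theory = sigma**2 * np.linalg.inv(X.T @ X)         # sigma^2 (X'X)^{-1}

reps = 5000
estimates = np.empty((reps, 2))
for r in range(reps):
    y = 1.0 + 2.0 * x + rng.normal(scale=sigma, size=n)
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(np.cov(estimates, rowvar=False))             # empirical covariance of (alpha_hat, beta_hat)
print(theory)                                      # should be close to the empirical matrix
```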

4. Minimum variance: Among all linear unbiased estimators of β, the OLS estimator of β has the minimum variance, i.e., β̂ is BLUE (best linear unbiased estimator).

Proof. Since β̂ is a vector, its variance is actually a matrix. Consequently, we want to show that β̂ minimizes the variance for any linear combination of the estimated coefficients, l⊤β̂.

We have

$$\operatorname{var}(l^\top \hat\beta) = l^\top \operatorname{var}(\hat\beta)\, l = l^\top [\sigma^2 (X^\top X)^{-1}] l = \sigma^2 l^\top (X^\top X)^{-1} l,$$

which is a scalar.

Let β̃ = Cy be another linear unbiased estimator of β, with C = (X⊤X)⁻¹X⊤ + D, where D is a p × n non-zero matrix.

Now

$$
E(\tilde\beta) = E[Cy] = E[\{(X^\top X)^{-1} X^\top + D\}(X\beta + \varepsilon)]
= [(X^\top X)^{-1} X^\top + D]\,E(X\beta + \varepsilon)
= [(X^\top X)^{-1} X^\top + D]X\beta
= (X^\top X)^{-1}(X^\top X)\beta + DX\beta = \beta + DX\beta.
$$

Therefore, β̃ is unbiased if and only if DX = 0.

The variance of β̃ is

$$
\begin{aligned}
\operatorname{var}(\tilde\beta) &= \operatorname{var}[\{(X^\top X)^{-1} X^\top + D\}y] = [(X^\top X)^{-1} X^\top + D]\operatorname{var}(y)[(X^\top X)^{-1} X^\top + D]^\top \\
&= [(X^\top X)^{-1} X^\top + D]\,\sigma^2 I\,[(X^\top X)^{-1} X^\top + D]^\top \\
&= \sigma^2 [(X^\top X)^{-1} X^\top + D][X(X^\top X)^{-1} + D^\top] \\
&= \sigma^2 [(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1} + DX(X^\top X)^{-1} + (X^\top X)^{-1} X^\top D^\top + DD^\top] \\
&= \sigma^2 [(X^\top X)^{-1} + DX(X^\top X)^{-1} + (X^\top X)^{-1} X^\top D^\top + DD^\top] \\
&= \sigma^2 [(X^\top X)^{-1} + DD^\top]
\end{aligned}
$$

because DX = 0, which in turn implies that (DX)⊤ = X⊤D⊤ = 0. As a result,

$$
\begin{aligned}
\operatorname{var}(l^\top \tilde\beta) &= l^\top \operatorname{var}(\tilde\beta)\, l = l^\top [\sigma^2 \{(X^\top X)^{-1} + DD^\top\}] l \\
&= \sigma^2 l^\top (X^\top X)^{-1} l + \sigma^2 l^\top DD^\top l \\
&= l^\top \operatorname{var}(\hat\beta)\, l + \sigma^2 l^\top DD^\top l \\
&= \operatorname{var}(l^\top \hat\beta) + \sigma^2 l^\top DD^\top l \;\ge\; \operatorname{var}(l^\top \hat\beta),
\end{aligned}
$$

since DD⊤ is at least a positive semidefinite matrix and hence σ²l⊤DD⊤l ≥ 0.

Let us define l* = D⊤l. Hence

$$l^\top DD^\top l = l^{*\top} l^* = \sum_{i=1}^{n} l_i^{*2},$$

which must be strictly greater than 0 for some l ≠ 0 unless D = 0.
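The construction used in this proof can be imitated numerically. In the sketch below (illustrative only, simulated data), D is built so that DX = 0 by projecting a random matrix onto I − H, and the resulting estimator's theoretical variances σ²[(X⊤X)⁻¹ + DD⊤] are compared with the OLS variances:

```python
import numpy as np

# Sketch of the BLUE property: beta_tilde = [(X'X)^{-1}X' + D] y with DX = 0.
# Here D = D0 (I - H), which guarantees DX = 0 because (I - H)X = 0.
rng = np.random.default_rng(3)
n, sigma = 30, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
D = rng.normal(scale=0.05, size=(p, n)) @ (np.eye(n) - H)    # DX = 0 by construction
print(np.allclose(D @ X, 0))

# Theoretical variances: var(beta_tilde) = sigma^2[(X'X)^{-1} + DD'] >= var(beta_hat).
var_ols   = sigma**2 * np.diag(XtX_inv)
var_tilde = sigma**2 * np.diag(XtX_inv + D @ D.T)
print(var_ols)
print(var_tilde)      # element-wise at least as large as var_ols
```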

5. Maximum Likelihood Estimation: If the errors are normally and independently distributed, i.e., ε ∼ N(0, σ²I), then β̂_OLS is the MLE of β.

Proof. The normal density function for the errors is

$$f(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).$$

The likelihood function is the joint density of ε1, ε2, . . . , εn, i.e., ∏ᵢ f(εᵢ). Therefore, the likelihood function is

$$L(\varepsilon, \beta, \sigma^2) = \prod_{i=1}^{n} f(\varepsilon_i) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\!\left(-\frac{\varepsilon^\top \varepsilon}{2\sigma^2}\right).$$

Now since ε = y − Xβ, the likelihood function becomes

$$L(y, X, \beta, \sigma^2) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\!\left(-\frac{(y - X\beta)^\top (y - X\beta)}{2\sigma^2}\right).$$

Now β̂_MLE is that estimator of β which maximizes L(y, X, β, σ²). Again, maximization of L(y, X, β, σ²) is equivalent to minimization of the quantity ε⊤ε = (y − Xβ)⊤(y − Xβ), which is the principle of least squares.

So, under the normality assumption on ε, β̂_OLS is the MLE of β.

• It is convenient to work with the log-likelihood

$$\ln L(y, X, \beta, \sigma) = \ell(y, X, \beta, \sigma) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2\sigma^2}(y - X\beta)^\top (y - X\beta).$$

Now

$$\frac{\partial \ell}{\partial \sigma}\bigg|_{\hat\beta,\,\tilde\sigma} = -\frac{n}{\tilde\sigma} + \frac{1}{\tilde\sigma^3}(y - X\hat\beta)^\top (y - X\hat\beta) = 0
\;\Longrightarrow\; \tilde\sigma^2 = \frac{(y - X\hat\beta)^\top (y - X\hat\beta)}{n}.$$
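A small sketch (simulated data, names illustrative) contrasting the MLE of σ², which divides the residual sum of squares by n, with the unbiased estimator MSE = RSS/(n − p) derived later in these notes:

```python
import numpy as np

# Compare sigma^2_MLE = RSS/n with MSE = RSS/(n - p) on one simulated data set.
rng = np.random.default_rng(4)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=n)   # true sigma = 2

p = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = float((y - X @ beta_hat) @ (y - X @ beta_hat))

sigma2_mle = rss / n          # MLE; biased downward, noticeably so for small n
mse        = rss / (n - p)    # unbiased estimator of sigma^2
print(sigma2_mle, mse)
```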

5 Residual

The difference between the observed value yi and the corresponding fitted value ŷi is the residual ei = yi − ŷi. The n residuals may be conveniently written in matrix notation as

$$e = y - \hat y = y - X\hat\beta = y - Hy = (I - H)y.$$

5.1 Properties of Residuals

1. Residuals are orthogonal to the predicted values as well as to the design matrix X in the model y = Xβ + ε, i.e., (a) X⊤e = 0, (b) ŷ⊤e = 0.

Proof. (a) We have

$$
\begin{aligned}
e &= y - \hat y = y - X\hat\beta = y - X(X^\top X)^{-1} X^\top y \\
&= y - Hy = (I - H)y = My \quad \text{with } M = I - H \\
&= M(X\beta + \varepsilon) = MX\beta + M\varepsilon = (I - H)X\beta + M\varepsilon \\
&= X\beta - HX\beta + M\varepsilon = X\beta - X(X^\top X)^{-1}(X^\top X)\beta + M\varepsilon \\
&= X\beta - X\beta + M\varepsilon = M\varepsilon.
\end{aligned}
$$

Now

$$X^\top e = X^\top M\varepsilon = X^\top (I - H)\varepsilon = (X^\top - X^\top X(X^\top X)^{-1} X^\top)\varepsilon = (X^\top - X^\top)\varepsilon = 0.$$

(b) ŷ⊤e = (Xβ̂)⊤e = β̂⊤X⊤e = β̂⊤0 = 0, from (a).

2. If there is a β0 term in the model, then the residuals sum to zero: Σ eᵢ = 0.

Proof. We know X⊤e = 0, i.e.,

$$
\begin{pmatrix}
1 & 1 & \dots & 1 \\
x_{11} & x_{21} & \dots & x_{n1} \\
\vdots & \vdots & & \vdots \\
x_{1k} & x_{2k} & \dots & x_{nk}
\end{pmatrix}
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\quad\text{or}\quad
\begin{pmatrix} \sum e_i \\ \sum x_{i1} e_i \\ \vdots \\ \sum x_{ik} e_i \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
$$

From the first element, we can say that Σ eᵢ = 0.

3. var(e) = σ²M.

Proof. var(e) = var(Mε) = M var(ε) M⊤ = M(σ²I)M⊤ = σ²MM⊤ = σ²M, since M is symmetric and idempotent.
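The residual properties above are easy to verify numerically. A minimal NumPy sketch on simulated data (illustrative only):

```python
import numpy as np

# Numerical check of the residual properties: e = (I - H) y, X'e = 0, y_hat'e = 0,
# and sum(e) = 0 when the model contains an intercept column.
rng = np.random.default_rng(5)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, 0.0, -1.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
e = y - y_hat                                 # (I - H) y

print(np.allclose(X.T @ e, 0))                # residuals orthogonal to the columns of X
print(np.isclose(y_hat @ e, 0))               # residuals orthogonal to the fitted values
print(np.isclose(e.sum(), 0))                 # residuals sum to zero (intercept present)
```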

6 Partitioning of Total SS into components

The residual sum of squares is

$$
\begin{aligned}
SSE = \sum e_i^2 = e^\top e &= (y - \hat y)^\top (y - \hat y) = (y - X\hat\beta)^\top (y - X\hat\beta) \\
&= y^\top y - y^\top X\hat\beta - \hat\beta^\top X^\top y + \hat\beta^\top X^\top X\hat\beta \\
&= y^\top y - 2\hat\beta^\top X^\top y + \hat\beta^\top X^\top X\hat\beta \qquad (\text{because } \hat\beta^\top X^\top y \text{ is a scalar}).
\end{aligned}
$$

Since X⊤Xβ̂ = X⊤y (equivalently, e = y − Xβ̂ gives X⊤e = X⊤y − X⊤Xβ̂ = 0), this last equation becomes

$$SSE = y^\top y - 2\hat\beta^\top X^\top y + \hat\beta^\top X^\top y = y^\top y - \hat\beta^\top X^\top y.$$

$$
\Longrightarrow\; y^\top y = \hat\beta^\top X^\top y + SSE = \hat\beta^\top X^\top y + e^\top e
\;\Longrightarrow\; y^\top y - n\bar y^2 = \hat\beta^\top X^\top y - n\bar y^2 + e^\top e
\;\Longrightarrow\; SST = SSR + SSE.
$$
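A quick numerical check of the partition SST = SSR + SSE on simulated data (a sketch, not part of the notes):

```python
import numpy as np

# Verify SST = SSR + SSE for one simulated data set with an intercept in the model.
rng = np.random.default_rng(6)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.5, -0.8]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

sst = float(((y - y.mean()) ** 2).sum())       # y'y - n*ybar^2
ssr = float(((y_hat - y.mean()) ** 2).sum())   # beta_hat'X'y - n*ybar^2 (intercept present)
sse = float(((y - y_hat) ** 2).sum())          # e'e
print(sst, ssr + sse)                          # equal up to rounding error
```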

7 Sums of squares as quadratic forms

1. $$SST = \sum (y_i - \bar y)^2 = \sum y_i^2 - n\bar y^2 = y^\top y - \frac{(\sum y_i)^2}{n} = y^\top y - \frac{1}{n} y^\top J y = y^\top \Big(I - \frac{1}{n}J\Big) y,$$
   where J is a square matrix with all elements 1.

2. $$SSR = \hat\beta^\top X^\top y - n\bar y^2 = [(X^\top X)^{-1} X^\top y]^\top X^\top y - \frac{1}{n} y^\top J y = y^\top X(X^\top X)^{-1} X^\top y - \frac{1}{n} y^\top J y = y^\top H y - \frac{1}{n} y^\top J y = y^\top \Big(H - \frac{1}{n}J\Big) y.$$

3. $$SSE = e^\top e = [(I - H)y]^\top [(I - H)y] = y^\top (I - H)^\top (I - H) y = y^\top (I - H) y.$$

Since each of the matrices I − (1/n)J, H − (1/n)J and I − H is symmetric, SST, SSR and SSE are quadratic forms.

• SST , as usual, has n − 1 degrees of freedom associated with it. SSE has n − p degrees

of freedom associated with it since p parameters need to be estimated in the regression

function. Finally, SSR has p − 1 degrees of freedom associated with it, representing the

number of X- variables, X1 , . . . , Xk .
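The quadratic-form expressions and the associated degrees of freedom can be checked by computing the matrices I − (1/n)J, H − (1/n)J and I − H directly; the sketch below (simulated data, illustrative only) also shows that their traces give n − 1, p − 1 and n − p:

```python
import numpy as np

# Evaluate SST, SSR, SSE as quadratic forms and read off degrees of freedom as traces.
rng = np.random.default_rng(7)
n, k = 35, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.4, -0.7, 2.0]) + rng.normal(size=n)

p = X.shape[1]
I, J = np.eye(n), np.ones((n, n))
H = X @ np.linalg.inv(X.T @ X) @ X.T

sst = y @ (I - J / n) @ y
ssr = y @ (H - J / n) @ y
sse = y @ (I - H) @ y
print(np.isclose(sst, ssr + sse))
print(np.trace(I - J / n), np.trace(H - J / n), np.trace(I - H))   # n-1, p-1, n-p
```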

8 Distribution of a quadratic form

Theorem 1. If y is a k × 1 random vector with mean µ and nonsingular variance–covariance matrix Σ, and A is a k × k symmetric matrix of constants, then E(y⊤Ay) = tr(AΣ) + µ⊤Aµ.

Corollary 1.1. If y has mean 0 and variance σ²I, then E(y⊤Ay) = tr(Aσ²I) + 0 = σ²tr(A).

Theorem 2. If y is a k × 1 random vector with mean µ and variance–covariance matrix σ²I, and A is a k × k symmetric matrix of constants, then

$$\frac{y^\top A y}{\sigma^2} \sim \chi'^2(r, \lambda)$$

if and only if A is idempotent with rank(A) = r, where the non-centrality parameter is

$$\lambda = \frac{\mu^\top A \mu}{\sigma^2}.$$

9 Unbiased estimator of σ²

We have

$$SSE = y^\top (I_n - H) y.$$

$$\therefore\; E(SSE) = \operatorname{tr}\{(I_n - H)\sigma^2 I\} + (X\beta)^\top (I_n - H) X\beta, \quad \text{as } E(y) = X\beta.$$

Now

$$
\begin{aligned}
(X\beta)^\top (I_n - H) X\beta &= \beta^\top X^\top (I_n - H) X\beta \\
&= \beta^\top X^\top X\beta - \beta^\top X^\top H X\beta \\
&= \beta^\top X^\top X\beta - \beta^\top X^\top X (X^\top X)^{-1} X^\top X\beta \\
&= \beta^\top X^\top X\beta - \beta^\top X^\top X\beta = 0.
\end{aligned}
$$

Hence

$$
\begin{aligned}
E(SSE) &= \sigma^2\{\operatorname{tr}(I_n) - \operatorname{tr}(H)\} = \sigma^2\{n - \operatorname{tr}[X(X^\top X)^{-1} X^\top]\} \\
&= \sigma^2\{n - \operatorname{tr}[(X^\top X)^{-1}(X^\top X)]\} \qquad \text{since } \operatorname{tr}(AB) = \operatorname{tr}(BA) \\
&= \sigma^2\{n - \operatorname{tr}(I_p)\} = \sigma^2(n - p)
\end{aligned}
$$

$$\Longrightarrow\; E\!\left(\frac{SSE}{n - p}\right) = \sigma^2 \;\Longrightarrow\; E(MSE) = \sigma^2.$$

Thus MSE is an unbiased estimator of σ².
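A rough Monte Carlo illustration (not from the notes) that MSE is unbiased for σ²: with a fixed design and known σ, the average MSE over many simulated data sets should be close to σ².

```python
import numpy as np

# Average MSE over repeated samples approximates sigma^2 (here sigma^2 = 2.25).
rng = np.random.default_rng(8)
n, sigma = 20, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p = X.shape[1]
beta = np.array([1.0, 2.0, -1.0])

mses = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    mses.append((e @ e) / (n - p))
print(np.mean(mses), sigma**2)     # the average MSE should be close to 2.25
```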

10 Coefficient of multiple determination

The coefficient of multiple determination for a subset regression model with p terms (p − 1 regressors and an intercept term β0), denoted by Rp², is defined as

$$R_p^2 = \frac{SSR(p)}{SST} = 1 - \frac{SSE(p)}{SST}$$

where SSR(p) and SSE(p) denote the regression sum of squares and the residual sum of squares, respectively, for a p-term subset model. Rp² is often called the proportion of variation explained by the p − 1 regressors.

Range: Because 0 ≤ SSE(p) ≤ SST, it follows that 0 ≤ Rp² ≤ 1.

• Rp² = 0: Rp² assumes the value 0 when all β̂j = 0 (j = 1, 2, . . . , k). In that case, the fitted model is ŷ = β̂0 = ȳ, i.e., ŷi = ȳ (i = 1, 2, . . . , n).

• Rp² = 1: Rp² takes the value 1 when all y observations fall directly on the fitted regression plane, i.e., when yi = ŷi for all i.

• Rp² is the square of the multiple correlation coefficient, i.e., Rp = corr(y, ŷ).

• Rp² increases as p increases and is a maximum when p = k + 1.

11 Adjusted Ra.p²

A large Rp² does not necessarily imply that the fitted model is a useful one. Adding more regressors to the model can only increase Rp² and never reduce it, because SSE can never become larger with more regressors. Thus it is possible for models that have large values of Rp² to perform poorly in prediction or estimation.

Some analysts prefer to use the adjusted Rp² statistic, defined for a p-term equation as

$$R_{a.p}^2 = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - \frac{n-1}{n-p}\cdot\frac{SSE}{SST} = 1 - \frac{n-1}{n-p}\,(1 - R_p^2).$$

The range of Ra.p² is 1 − (n − 1)/(n − p) ≤ Ra.p² ≤ 1, i.e., −k/(n − p) ≤ Ra.p² ≤ 1. It can be negative, and its value will always be less than or equal to that of Rp². The Ra.p² statistic does not necessarily increase as additional regressors are introduced into the model.

• Ra.p² penalizes us for adding terms that are not helpful, so it is very useful in evaluating and comparing candidate regression models (it allows comparing equations fitted not only to a specific set of data but also to two or more entirely different data sets).

• Ra.p² tells you the percentage of variation explained by only those regressors that actually affect the response variable.

• The Ra.p² criterion for selecting the best model is equivalent to the residual mean square criterion.
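The following sketch (simulated data, illustrative only) computes Rp² and Ra.p² for a model with one useful regressor and then adds a pure-noise regressor: Rp² can only increase, while Ra.p² typically does not.

```python
import numpy as np

# Compute R^2 and adjusted R^2 before and after adding an irrelevant regressor.
rng = np.random.default_rng(9)
n = 40
x1 = rng.normal(size=n)
noise_col = rng.normal(size=n)                 # unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def r2_and_adjusted(X, y):
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = float(((y - X @ beta_hat) ** 2).sum())
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - sse / sst
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
    return r2, r2_adj

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise_col])
print(r2_and_adjusted(X_small, y))
print(r2_and_adjusted(X_big, y))    # R^2 rises slightly; adjusted R^2 usually drops
```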

12 Residual Mean Square

The residual mean square for a p-term regression model is

$$MSE(p) = \frac{SSE(p)}{n - p}.$$

Because SSE(p) always decreases as p increases, MSE(p) initially decreases, then stabilizes, and eventually may increase. The eventual increase in MSE(p) occurs when the reduction in SSE(p) from adding a regressor to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator.

13 Sequential Sums of Squares

Let us suppose we want to fit the model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon \tag{7}$$

and that the X's are entered in the order given in the model. The extra sums of squares

SSR1 = SSR(β̂1 | β̂0)                      (sequential sum of squares for X1)
SSR2 = SSR(β̂2 | β̂1, β̂0)                  (sequential sum of squares for X2)
SSR3 = SSR(β̂3 | β̂2, β̂1, β̂0)              (sequential sum of squares for X3)
SSR4 = SSR(β̂4 | β̂3, β̂2, β̂1, β̂0)          (sequential sum of squares for X4)
SSR5 = SSR(β̂5 | β̂4, β̂3, β̂2, β̂1, β̂0)      (sequential sum of squares for X5)

are often called the sequential sums of squares.

Suppose the reduced model is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \tag{8}$$

If SSR(β̂0, β̂1, β̂2, β̂3, β̂4, β̂5) = S1 and SSR(β̂0, β̂1, β̂2) = S2, then

$$SSR(\hat\beta_3, \hat\beta_4, \hat\beta_5 \mid \hat\beta_0, \hat\beta_1, \hat\beta_2) = S_1 - S_2$$

is the extra sum of squares due to X3, X4, X5, given that the intercept, X1 and X2 are already in the model. Since

$$S_1 - S_2 = (S_1 - n\bar y^2) - (S_2 - n\bar y^2),$$

it can also be written as a difference between regression sums of squares corrected for β̂0, i.e.,

$$SSR(\hat\beta_1, \hat\beta_2, \hat\beta_3, \hat\beta_4, \hat\beta_5 \mid \hat\beta_0) - SSR(\hat\beta_1, \hat\beta_2 \mid \hat\beta_0).$$

We can also rewrite S1 − S2 as

$$S_1 - S_2 = (y^\top y - S_2) - (y^\top y - S_1),$$

a difference of residual sums of squares but in reverse order, because the regression with the larger SSR (S1) must have the smaller SSE, and vice versa for S2.

We could also get S1 − S2 by summing the sequential sums of squares, i.e.,

$$S_1 - S_2 = SSR_5 + SSR_4 + SSR_3.$$
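Sequential sums of squares can be computed by fitting nested models and differencing the corrected regression sums of squares, as in this NumPy sketch (simulated data; names are illustrative):

```python
import numpy as np

# Fit nested models, adding one regressor at a time, and record the increase in the
# (corrected) regression sum of squares at each step.
rng = np.random.default_rng(10)
n, k = 80, 5
Z = rng.normal(size=(n, k))                     # columns play the roles of X1,...,X5
y = 3.0 + 1.0 * Z[:, 0] - 2.0 * Z[:, 1] + 0.5 * Z[:, 2] + rng.normal(size=n)

def ssr_corrected(X, y):
    """Regression sum of squares corrected for the intercept: beta_hat'X'y - n*ybar^2."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(beta_hat @ X.T @ y - len(y) * y.mean() ** 2)

prev = 0.0
for j in range(1, k + 1):
    X_j = np.column_stack([np.ones(n), Z[:, :j]])    # intercept plus the first j regressors
    ssr_j = ssr_corrected(X_j, y)
    print(f"sequential SS for X{j}: {ssr_j - prev:.3f}")
    prev = ssr_j
```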

14 Partial Sums of Squares

If we have several terms in a regression model, we can think of them as entering the equation in any desired sequence. If we find

$$SSR(\hat\beta_j \mid \hat\beta_0, \hat\beta_1, \dots, \hat\beta_{j-1}, \hat\beta_{j+1}, \dots, \hat\beta_k), \qquad j = 1, 2, \dots, k,$$

we shall have a one-degree-of-freedom sum of squares, which measures the contribution to the regression sum of squares of each coefficient β̂j given that all the terms that did not involve βj were already in the model. Another way of saying this is that we have a measure of the value of βj as though it were added to the model last. The corresponding mean square, equal to the sum of squares since it has one degree of freedom, can be compared by an F-test for βj. This particular type of F-test is often called a partial F-test for βj.

When a suitable model is being built, the partial F-test is a useful criterion for adding or removing terms from the model. The effect of an X-variable (Xq, say) in determining a response may be large when the regression equation includes only Xq. However, when the same variable is entered into the equation after other variables, it may affect the response very little, because Xq is highly correlated with variables already in the regression equation.
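A sketch of a partial F-test for a single coefficient using the extra-sum-of-squares idea (simulated data; SciPy is assumed to be available only for the F reference distribution):

```python
import numpy as np
from scipy import stats

# Partial F-test for beta_2: compare the extra sum of squares from adding x2 last
# against the MSE from the full model.
rng = np.random.default_rng(11)
n = 60
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)         # x2 highly correlated with x1
y = 1.0 + 1.5 * x1 + rng.normal(size=n)           # x2 has no effect of its own

def sse(X, y):
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(((y - X @ beta_hat) ** 2).sum())

X_full    = np.column_stack([np.ones(n), x1, x2])
X_reduced = np.column_stack([np.ones(n), x1])     # drop x2, i.e. test beta_2 = 0

p = X_full.shape[1]
extra_ss = sse(X_reduced, y) - sse(X_full, y)     # SSR(b2 | b1, b0), one degree of freedom
mse_full = sse(X_full, y) / (n - p)
F = extra_ss / mse_full
p_value = stats.f.sf(F, 1, n - p)
print(F, p_value)
```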

Table 1 and Table 2 below present the ANOVA tables for the partial F-test with only two regressors.

Table 1: ANOVA table for the partial F-test with two regressors (X1 entered first)

SV                      SS                   df      MS                   F
Regression | β̂0         SSR(β̂1, β̂2 | β̂0)    2       MSR(β̂1, β̂2 | β̂0)    MSR(β̂1, β̂2 | β̂0)/MSE
Due to β̂1 | β̂0          SSR(β̂1 | β̂0)        1       MSR(β̂1 | β̂0)        MSR(β̂1 | β̂0)/MSE
Due to β̂2 | β̂1, β̂0      SSR(β̂2 | β̂1, β̂0)    1       MSR(β̂2 | β̂1, β̂0)    MSR(β̂2 | β̂1, β̂0)/MSE
Residual                SSE                  n − 3   MSE
Total                   SST                  n − 1

Table 2: ANOVA table for the partial F-test with two regressors (X2 entered first)

SV                      SS                   df      MS                   F
Regression | β̂0         SSR(β̂1, β̂2 | β̂0)    2       MSR(β̂1, β̂2 | β̂0)    MSR(β̂1, β̂2 | β̂0)/MSE
Due to β̂2 | β̂0          SSR(β̂2 | β̂0)        1       MSR(β̂2 | β̂0)        MSR(β̂2 | β̂0)/MSE
Due to β̂1 | β̂2, β̂0      SSR(β̂1 | β̂2, β̂0)    1       MSR(β̂1 | β̂2, β̂0)    MSR(β̂1 | β̂2, β̂0)/MSE
Residual                SSE                  n − 3   MSE
Total                   SST                  n − 1

15 Orthogonal Columns in the X matrix

Consider the regression model with k regressors, y = Xβ + ε, where y is n × 1, X is n × p, β is p × 1, and p = k + 1. We would like to determine if some subset of r < k regressors contributes significantly to the regression model. Let us partition β as

$$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$$

where β1 is (p − r) × 1 and β2 is r × 1. The corresponding partitioning of X will be

$$X = (X_1 \;\vdots\; X_2).$$

The model may be written as

$$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon$$

where the n × (p − r) matrix X1 represents the columns of X associated with β 1 and the n × r

matrix X2 represents the columns of X associated with β 2 . This is called the full model.

The extra sum of squares method allows us to measure the effect of the regressors in X2

conditional on those in X1 by computing SSR(β̂2 | β̂1). However, if the columns in X1 are

orthogonal to the columns in X2, we can determine a sum of squares due to β̂2 that is free of any dependence on the regressors in X1.

The estimating equations for the model y = Xβ + ε are (X⊤X)β̂ = X⊤y. These estimating equations can be written as

$$
\begin{pmatrix} X_1^\top X_1 & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{pmatrix}
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
=
\begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix}.
$$

Now if the columns of X1 are orthogonal to the columns in X2, then X1⊤X2 = 0 and X2⊤X1 = 0. Then the estimating equations become

$$X_1^\top X_1 \hat\beta_1 = X_1^\top y, \qquad X_2^\top X_2 \hat\beta_2 = X_2^\top y$$

with solution

$$\hat\beta_1 = (X_1^\top X_1)^{-1} X_1^\top y, \qquad \hat\beta_2 = (X_2^\top X_2)^{-1} X_2^\top y.$$

Note that the least-squares estimator of β1 is β̂1 regardless of whether or not X2 is in the model, and the least-squares estimator of β2 is β̂2 regardless of whether or not X1 is in the model.

The regression sum of squares for the full model is

$$
SSR(\hat\beta) = \hat\beta^\top X^\top y
= (\hat\beta_1^\top, \ \hat\beta_2^\top)
\begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix}
= \hat\beta_1^\top X_1^\top y + \hat\beta_2^\top X_2^\top y.
$$

However, the estimating equations form two sets, and for each set we have

$$SSR(\hat\beta_1) = \hat\beta_1^\top X_1^\top y, \qquad SSR(\hat\beta_2) = \hat\beta_2^\top X_2^\top y.$$

Thus

$$SSR(\hat\beta) = SSR(\hat\beta_1) + SSR(\hat\beta_2).$$

Therefore,

$$SSR(\hat\beta_1 \mid \hat\beta_2) = SSR(\hat\beta) - SSR(\hat\beta_2) = SSR(\hat\beta_1)$$

and

$$SSR(\hat\beta_2 \mid \hat\beta_1) = SSR(\hat\beta) - SSR(\hat\beta_1) = SSR(\hat\beta_2).$$

Consequently, SSR(β̂1) measures the contribution of the regressors in X1 to the model unconditionally, and SSR(β̂2) measures the contribution of the regressors in X2 to the model unconditionally.
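The unconditional decomposition can be demonstrated numerically by constructing X2 orthogonal to X1, as in this sketch (simulated data, illustrative only):

```python
import numpy as np

# With orthogonal blocks X1 and X2, the regression sum of squares splits:
# SSR(beta_hat) = SSR(beta_hat_1) + SSR(beta_hat_2).
rng = np.random.default_rng(12)
n = 50
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
# Make X2 orthogonal to X1 by projecting a random column onto the orthogonal complement.
H1 = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
X2 = (np.eye(n) - H1) @ rng.normal(size=(n, 1))
print(np.allclose(X1.T @ X2, 0))                 # X1'X2 = 0

X = np.hstack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def ssr(Xb, y):
    b = np.linalg.lstsq(Xb, y, rcond=None)[0]
    return float(b @ Xb.T @ y)                   # uncorrected SSR = beta_hat'X'y

print(ssr(X, y), ssr(X1, y) + ssr(X2, y))        # equal up to rounding
```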
