Multiple Linear Regression

1 Multiple Linear Regression
A regression model that involves more than one regressor variable is called a multiple regression model. With k regressors the model is

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon. \qquad (1) \]

The parameters β_j (j = 0, 1, 2, . . . , k) are called the regression coefficients. This model describes a hyperplane in the k-dimensional space of the regressor variables X_j.

• The parameter β_j represents the expected change in the response Y per unit change in X_j when all of the remaining regressor variables X_l (l ≠ j) are held constant. For this reason the parameters β_j (j = 1, 2, . . . , k) are often called the partial regression coefficients.

Now for n observations on Y and the X's, the model (1) can be expressed as

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n. \qquad (2) \]
It is often convenient to deal with multiple regression models if they are expressed in matrix notation. This allows a very compact display of the model, data and results. In matrix notation, the model (2) is

\[ y = X\beta + \varepsilon \qquad (3) \]

where the vectors y, β, ε and the matrix X are

\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1k} \\
1 & x_{21} & x_{22} & \dots & x_{2k} \\
\vdots & \vdots & \vdots &  & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nk}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
\]
2 Assumptions
1. The model y = Xβ + ε is linear in the parameters β0, β1, β2, . . . , βk.
2. E(ε) = 0, i.e., E(y) = Xβ.
3. Cov(ε) = σ²I or Cov(y) = σ²I.
4. The errors are normally distributed, ε ∼ N(0, σ²I).
5. X is a non-stochastic (non-random, fixed) matrix.
6. n > p and rank(X) = p, i.e., the columns of the X matrix are linearly independent, where p = k + 1 is the number of parameters.
• If n < p or if there is a linear relationship among the x's, for example x5 = 2x2, then X is not of full column rank; X⊤X is then singular and a unique least-squares estimator of β does not exist.
• If the values of the xij ’s are planned (chosen by the researcher), then the X matrix
essentially contains the experimental design and is sometimes called the design matrix.
3 Ordinary Least-Squares (OLS) estimation
The method of least squares chooses β̂ to minimize

\[
S(\beta) = \sum_{i=1}^{n}\varepsilon_i^2 = \varepsilon^\top\varepsilon = (y - X\beta)^\top(y - X\beta)
= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta, \qquad (4)
\]

since β⊤X⊤y is a 1 × 1 matrix, or a scalar, and its transpose (β⊤X⊤y)⊤ = y⊤Xβ is the same scalar. The least-squares estimator β̂ must satisfy

\[
\frac{\partial S}{\partial \beta}\bigg|_{\hat{\beta}} = -2X^\top y + 2X^\top X\hat{\beta} = 0,
\]

which simplifies to

\[ X^\top X \hat{\beta} = X^\top y. \qquad (5) \]
Equations (5) are the least-squares estimating (normal) equations. To solve them, multiply both sides of (5) by the inverse of X⊤X. Thus, the least-squares estimator of β is

\[ \hat{\beta} = (X^\top X)^{-1} X^\top y, \]

provided that the inverse matrix (X⊤X)⁻¹ exists. The (X⊤X)⁻¹ matrix will always exist if the regressors are linearly independent, i.e., if no column of the X matrix is a linear combination of the other columns.
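As a concrete illustration, the following sketch (using NumPy and a small made-up data set; the numbers and variable names are illustrative, not from the notes) builds the design matrix X, forms X⊤X and X⊤y, and solves the normal equations (5) for β̂.

```python
import numpy as np

# Small synthetic data set: n = 6 observations, k = 2 regressors (illustrative only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

XtX = X.T @ X   # diagonal: sums of squares of each column; off-diagonal: cross products
Xty = X.T @ y   # cross products of the columns of X with y

# Solve the normal equations X'X beta_hat = X'y (preferred over forming the inverse explicitly)
beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)  # (beta0_hat, beta1_hat, beta2_hat)
```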
• The diagonal elements of X⊤X are the sums of squares of the elements in the columns of X, and the off-diagonal elements are the sums of cross products of the elements in the columns of X. Furthermore, note that the elements of X⊤y are the sums of cross products of the columns of X with the observations y.
• For the simple linear regression model y = α + βx + ε, show that

\[
\hat{\beta} = (\hat{\alpha}, \hat{\beta})^\top = (X^\top X)^{-1} X^\top y
= \begin{pmatrix}
\bar{y} - \hat{\beta}\bar{x} \\[6pt]
\dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{pmatrix}.
\]
• The fitted regression model corresponding to the levels of the regressor variables x⊤ = (1, x1, x2, . . . , xk) is

\[ \hat{y}_x = x^\top \hat{\beta} = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j x_j. \]
• The n × n matrix H = X(X⊤X)⁻¹X⊤ is usually called the hat matrix. It maps the vector of observed values into the vector of fitted values, ŷ = Hy. The hat matrix and its properties play a central role in regression analysis.
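The sketch below (again NumPy with synthetic data; illustrative only) computes H = X(X⊤X)⁻¹X⊤ and checks numerically that Hy equals the fitted values Xβ̂ and that H is symmetric and idempotent with trace p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T                # hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(H @ y, X @ beta_hat))             # H maps y to the fitted values
print(np.allclose(H, H.T), np.allclose(H @ H, H))   # symmetric and idempotent
print(np.isclose(np.trace(H), k + 1))               # trace(H) = p = k + 1
```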
1. Linearity: β̂ is a linear function of the observations.
Proof. We have

\[ \hat{\beta}_{OLS} = (X^\top X)^{-1} X^\top y = Ay \quad \text{with } A = (a_{ij}), \]

or

\[
\hat{\beta}_{OLS} =
\begin{pmatrix}
a_{10} & a_{20} & \dots & a_{n0} \\
a_{11} & a_{21} & \dots & a_{n1} \\
\vdots & \vdots &  & \vdots \\
a_{1j} & a_{2j} & \dots & a_{nj} \\
\vdots & \vdots &  & \vdots \\
a_{1k} & a_{2k} & \dots & a_{nk}
\end{pmatrix}
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix}
\implies
\begin{pmatrix}
\hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_j \\ \vdots \\ \hat{\beta}_k
\end{pmatrix}
=
\begin{pmatrix}
\sum a_{i0} y_i \\ \sum a_{i1} y_i \\ \vdots \\ \sum a_{ij} y_i \\ \vdots \\ \sum a_{ik} y_i
\end{pmatrix}
\]

\[ \implies \hat{\beta}_j = \sum_{i=1}^{n} a_{ij} y_i, \qquad j = 0, 1, 2, \dots, k, \]

so each β̂_j is a linear function of the observations.
2. Unbiasedness: E(β̂) = β.

Proof. We have

\[
\hat{\beta} = (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top (X\beta + \varepsilon)
= (X^\top X)^{-1}(X^\top X)\beta + (X^\top X)^{-1} X^\top \varepsilon = \beta + (X^\top X)^{-1} X^\top \varepsilon.
\]

Taking expectations and using E(ε) = 0,

\[ E(\hat{\beta}) = \beta + (X^\top X)^{-1} X^\top E(\varepsilon) = \beta. \]

Thus, β̂ is an unbiased estimator of β irrespective of any distribution properties of the errors ε.
3. Variance: var(β̂) = σ²(X⊤X)⁻¹.

Proof. We have

\[
\mathrm{var}(\hat{\beta}) = \mathrm{var}\big[(X^\top X)^{-1} X^\top y\big]
= (X^\top X)^{-1} X^\top \mathrm{var}(y)\, X (X^\top X)^{-1}
= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1}
= \sigma^2 (X^\top X)^{-1},
\]

since var(y) = σ²I.

• If we let C = (X⊤X)⁻¹, the variance of β̂_j is σ²C_jj and the covariance between β̂_j and β̂_l is σ²C_jl.
4. Minimum variance: Among all linear and unbiased estimators of β, the OLS estimator β̂ has the smallest variance (Gauss-Markov theorem).

Proof. Since β̂ is a vector, its variance is actually a matrix. Consequently, we want to show that β̂ minimizes the variance of any linear combination of the estimated coefficients, l⊤β̂. We have

\[ \mathrm{var}(l^\top \hat{\beta}) = l^\top \mathrm{var}(\hat{\beta})\, l = l^\top [\sigma^2 (X^\top X)^{-1}] l = \sigma^2 l^\top (X^\top X)^{-1} l, \]

which is a scalar.

Now let β̃ = [(X⊤X)⁻¹X⊤ + D]y be any other linear estimator of β, where D is a p × n matrix of constants. Then

\[ E(\tilde{\beta}) = [(X^\top X)^{-1} X^\top + D] X\beta = (X^\top X)^{-1}(X^\top X)\beta + DX\beta = \beta + DX\beta. \]

Therefore, β̃ is unbiased for every β only if DX = 0.

The variance of β̃ is

\[
\mathrm{var}(\tilde{\beta}) = \mathrm{var}\big[\{(X^\top X)^{-1} X^\top + D\} y\big]
= [(X^\top X)^{-1} X^\top + D]\, \mathrm{var}(y)\, [(X^\top X)^{-1} X^\top + D]^\top
\]
\[
= \sigma^2 [(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1} + DX(X^\top X)^{-1} + (X^\top X)^{-1} X^\top D^\top + DD^\top]
= \sigma^2 [(X^\top X)^{-1} + DD^\top],
\]

because DX = 0, which in turn implies that (DX)⊤ = X⊤D⊤ = 0. As a result,

\[
\mathrm{var}(l^\top \tilde{\beta}) = l^\top \mathrm{var}(\tilde{\beta})\, l
= l^\top [\sigma^2 \{(X^\top X)^{-1} + DD^\top\}] l
= l^\top \mathrm{var}(\hat{\beta})\, l + \sigma^2 l^\top DD^\top l
= \mathrm{var}(l^\top \hat{\beta}) + \sigma^2 l^\top DD^\top l \ge \mathrm{var}(l^\top \hat{\beta}),
\]

since DD⊤ is positive semidefinite, so l⊤DD⊤l ≥ 0.
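The Gauss-Markov argument can also be checked numerically. In the sketch below (synthetic design matrix, assumed σ² = 1; building D from rows of I − H is just one convenient way to satisfy DX = 0, not part of the notes), the variance of l⊤β̃ computed from σ²[(X⊤X)⁻¹ + DD⊤] is never smaller than that of l⊤β̂.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 15, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
sigma2 = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
M = np.eye(n) - H

# Any D with DX = 0 gives another linear unbiased estimator beta_tilde = ((X'X)^{-1}X' + D) y.
# Rows of M = I - H are orthogonal to the columns of X, so p of them can serve as D.
D = 0.3 * M[:p, :]
print(np.allclose(D @ X, 0))                  # unbiasedness condition DX = 0 holds

var_ols   = sigma2 * XtX_inv                  # var(beta_hat)
var_tilde = sigma2 * (XtX_inv + D @ D.T)      # var(beta_tilde)

l = np.array([1.0, -2.0, 0.5])                # an arbitrary linear combination
print(l @ var_tilde @ l >= l @ var_ols @ l)   # True: OLS has the smaller variance
```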
Under the normality assumption the errors are independent N(0, σ²), so the likelihood function is

\[
L(\varepsilon, \beta, \sigma^2) = \prod_{i=1}^{n} f(\varepsilon_i)
= \frac{1}{(2\pi)^{n/2}\sigma^{n}}\, e^{-\frac{1}{2\sigma^2}\varepsilon^\top\varepsilon}
= \frac{1}{(2\pi)^{n/2}\sigma^{n}}\, e^{-\frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta)}.
\]

Now β̂_MLE is that estimator of β which maximizes L(y, X, β, σ²). Again, for any fixed σ², maximization of L is equivalent to minimization of the exponent (y − Xβ)⊤(y − Xβ), which is exactly the least-squares criterion S(β). So, under the normality assumption on ε, β̂_OLS is the MLE of β.
The log-likelihood is

\[
\ln L(y, X, \beta, \sigma) = l(y, X, \beta, \sigma) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta).
\]

Now

\[
\frac{\partial l}{\partial \sigma}\bigg|_{\hat{\beta},\, \tilde{\sigma}^2}
= -\frac{n}{\tilde{\sigma}} + \frac{1}{\tilde{\sigma}^3}(y - X\hat{\beta})^\top(y - X\hat{\beta}) = 0
\quad \implies \quad
\tilde{\sigma}^2 = \frac{(y - X\hat{\beta})^\top(y - X\hat{\beta})}{n}.
\]
5 Residual
The difference between the observed value y_i and the corresponding fitted value ŷ_i is the residual, e_i = y_i − ŷ_i. The n residuals may be written in matrix notation as

\[ e = y - \hat{y} = y - X\hat{\beta} = y - Hy = (I - H)y. \]
1. Residuals are orthogonal to the predicted values as well as to the design matrix X in the model y = Xβ + ε, i.e., (a) X⊤e = 0, (b) ŷ⊤e = 0.

Proof. (a) We have

\[ e = y - \hat{y} = y - Hy = (I - H)y = My \quad \text{with } M = I - H, \]

so that

\[ e = M(X\beta + \varepsilon) = X\beta - HX\beta + M\varepsilon = X\beta - X\beta + M\varepsilon = M\varepsilon, \]

since HX = X(X⊤X)⁻¹X⊤X = X. Now

\[ X^\top e = X^\top M\varepsilon = (X^\top - X^\top H)\varepsilon = (X^\top - X^\top)\varepsilon = 0. \]

(b) ŷ⊤e = (Xβ̂)⊤e = β̂⊤X⊤e = β̂⊤0 = 0, from (a).
2. If there is a β0 term in the model, then Σⁿᵢ₌₁ e_i = 0.

Proof. We know X⊤e = 0, or

\[
\begin{pmatrix}
1 & 1 & \dots & 1 \\
x_{11} & x_{21} & \dots & x_{n1} \\
\vdots & \vdots &  & \vdots \\
x_{1k} & x_{2k} & \dots & x_{nk}
\end{pmatrix}
\begin{pmatrix}
e_1 \\ e_2 \\ \vdots \\ e_n
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ 0
\end{pmatrix},
\]

or

\[
\begin{pmatrix}
\sum e_i \\ \sum x_{i1} e_i \\ \vdots \\ \sum x_{ik} e_i
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ 0
\end{pmatrix}.
\]

The first row gives Σⁿᵢ₌₁ e_i = 0.
3. var(e) = σ²M.

Proof. var(e) = var(Mε) = M var(ε) M⊤ = M σ²I M⊤ = σ²MM⊤ = σ²M, since M is symmetric and idempotent.
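Properties 1-3 are easy to verify numerically; the sketch below (synthetic data, NumPy; illustrative only) checks that X⊤e = 0, Σe_i = 0 and ŷ⊤e = 0, and that M = I − H is idempotent.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                      # M = I - H
e = M @ y                              # residuals
y_hat = H @ y                          # fitted values

print(np.allclose(X.T @ e, 0))         # residuals orthogonal to the columns of X
print(np.isclose(e.sum(), 0))          # sum of residuals is zero (intercept in the model)
print(np.isclose(y_hat @ e, 0))        # residuals orthogonal to fitted values
print(np.allclose(M @ M, M))           # M is idempotent, so var(e) = sigma^2 * M
```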
The residual (error) sum of squares is

\[
SSE = \sum e_i^2 = e^\top e = (y - \hat{y})^\top(y - \hat{y}) = (y - X\hat{\beta})^\top(y - X\hat{\beta})
= y^\top y - 2\hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top X\hat{\beta}.
\]

Since X⊤Xβ̂ = X⊤y (equivalently, X⊤e = X⊤y − X⊤Xβ̂ = 0), this last equation becomes

\[
SSE = y^\top y - 2\hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top y = y^\top y - \hat{\beta}^\top X^\top y.
\]
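A quick numerical check of this identity (synthetic data; illustrative only): e⊤e and y⊤y − β̂⊤X⊤y agree.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

sse_direct  = e @ e                            # e'e
sse_formula = y @ y - beta_hat @ (X.T @ y)     # y'y - beta_hat' X'y
print(np.isclose(sse_direct, sse_formula))     # True
```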
1. The total sum of squares is

\[
SST = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2 = y^\top y - \frac{(\sum y_i)^2}{n}
= y^\top y - \frac{1}{n} y^\top J y = y^\top \Big(I - \frac{1}{n}J\Big) y,
\]

where J is the n × n matrix of ones.

2. The regression sum of squares is

\[
SSR = \sum (\hat{y}_i - \bar{y})^2 = y^\top \Big(H - \frac{1}{n}J\Big) y.
\]

3. The residual sum of squares is

\[
SSE = e^\top e = y^\top (I - H)\, y.
\]

Since each of the matrices (I − (1/n)J), (H − (1/n)J) and (I − H) is symmetric, SST, SSR and SSE are quadratic forms.
• SST, as usual, has n − 1 degrees of freedom associated with it. SSE has n − p degrees of freedom associated with it, since p parameters are estimated in the regression function. Finally, SSR has p − 1 degrees of freedom associated with it, representing the number of X-variables X1, . . . , Xk.
Theorem 1. If y is a k × 1 random vector with mean µ and covariance matrix Σ and A is a k × k symmetric matrix of constants, then E(y⊤Ay) = tr(AΣ) + µ⊤Aµ.

Corollary 1.1. If y has mean 0 and variance σ²I, then E(y⊤Ay) = tr(Aσ²I) + 0 = σ²tr(A).

Theorem 2. If y ∼ N(µ, σ²I), then

\[ \frac{y^\top A y}{\sigma^2} \sim \chi'^2(r, \lambda) \]

if and only if A is idempotent with rank(A) = r, where the non-centrality parameter

\[ \lambda = \frac{\mu^\top A \mu}{\sigma^2}. \]
9 Unbiased estimator of σ²

We have SSE = e⊤e = y⊤(I − H)y. Now, applying Theorem 1 with A = I − H, E(y) = Xβ and Cov(y) = σ²I,

\[
E(SSE) = E[y^\top (I - H) y] = \mathrm{tr}[(I - H)\sigma^2 I] + (X\beta)^\top (I - H) X\beta = \sigma^2\{\mathrm{tr}(I) - \mathrm{tr}(H)\} + 0,
\]

since (I − H)X = X − X = 0. Hence

\[
E(SSE) = \sigma^2\{n - \mathrm{tr}(X(X^\top X)^{-1} X^\top)\} = \sigma^2\{n - \mathrm{tr}((X^\top X)^{-1} X^\top X)\} = \sigma^2\{n - \mathrm{tr}(I_p)\} = \sigma^2 (n - p)
\]

\[
\implies E\Big(\frac{SSE}{n - p}\Big) = \sigma^2 \implies E(MSE) = \sigma^2.
\]
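The result E(MSE) = σ² can be illustrated by simulation. The sketch below (a Monte Carlo run with an assumed true σ = 2 and an arbitrary β; the setup is illustrative, not from the notes) averages SSE/(n − p) and the MLE SSE/n over many simulated data sets: the former is centred at σ², the latter is biased downward by the factor (n − p)/n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 3
p = k + 1
sigma = 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.5, -1.0, 2.0])

mse_vals, mle_vals = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = np.sum((y - X @ beta_hat) ** 2)
    mse_vals.append(sse / (n - p))   # unbiased estimator of sigma^2
    mle_vals.append(sse / n)         # MLE of sigma^2 (biased)

print(np.mean(mse_vals))  # close to sigma^2 = 4
print(np.mean(mle_vals))  # close to sigma^2 * (n - p) / n = 3.2
```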
10 Coefficient of multiple determination
The coefficient of multiple determination for a subset regression model with p terms (p − 1 regressors and an intercept) is

\[ R_p^2 = \frac{SSR(p)}{SST} = 1 - \frac{SSE(p)}{SST}, \]

where SSR(p) and SSE(p) denote the regression sum of squares and the residual sum of squares, respectively, for a p-term subset model. R²_p is often called the proportion of the total variation in y explained by the fitted model.

• R²_p = 0: R²_p assumes the value 0 when all β̂_j = 0 (j = 1, 2, . . . , k). In that case, the fitted model reduces to ŷ_i = ȳ and SSR(p) = 0.

• R²_p = 1: R²_p takes the value 1 when all y observations fall directly on the fitted regression surface, i.e., when SSE(p) = 0.

• R²_p is the square of the multiple correlation coefficient, i.e., R_p = corr(y, ŷ).
11 Adjusted R²_{a.p}
A large R²_p does not necessarily imply that the fitted model is a useful one. Adding more regressors to the model can only increase R²_p and never reduce it, because SSE can never become larger with more regressors. Thus it is possible for models that have large values of R²_p to give poor predictions of new observations. Some analysts prefer to use the adjusted R²_{a.p} statistic, defined for a p-term equation as
\[
R^2_{a.p} = 1 - \frac{SSE/(n - p)}{SST/(n - 1)} = 1 - \frac{n - 1}{n - p}\cdot\frac{SSE}{SST} = 1 - \frac{n - 1}{n - p}\,(1 - R_p^2).
\]
The range of R²_{a.p} is

\[ 1 - \frac{n - 1}{n - p} \le R^2_{a.p} \le 1 \implies -\frac{k}{n - p} \le R^2_{a.p} \le 1. \]

It can be negative, and its value will always be less than or equal to that of R²_p. The R²_{a.p} statistic does not necessarily increase when additional regressors are introduced into the model.
• The R²_{a.p} penalizes us for adding terms that are not helpful, so it is very useful in evaluating and comparing candidate regression models (comparing equations fitted not only to a specific set of data but also to two or more entirely different data sets).

• The R²_{a.p} tells you the percentage of variation explained by only those regressors that actually help to explain the response.

• The R²_{a.p} criterion for selecting the best model is equivalent to the residual mean square criterion,

\[ MSE(p) = \frac{SSE(p)}{n - p}. \]

Because SSE(p) always decreases as p increases, MSE(p) initially decreases, then stabilizes, and eventually may increase. The eventual increase in MSE(p) occurs when the reduction in SSE(p) from adding a regressor to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator n − p.
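The sketch below (synthetic data with one genuine regressor and one pure-noise regressor; illustrative only, not from the notes) computes R²_p, R²_{a.p} and MSE(p) for two nested models: R²_p always rises when the noise column is added, while R²_{a.p} and MSE(p) typically get worse.

```python
import numpy as np

def fit_stats(X, y):
    """Return R^2_p, adjusted R^2_{a.p} and MSE(p) for the model with design matrix X."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = np.sum((y - X @ beta_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
    return r2, r2_adj, sse / (n - p)

rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)                   # unrelated to y
y = 1 + 2 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, x_noise])

print(fit_stats(X_small, y))
print(fit_stats(X_big, y))   # R^2 rises slightly; adjusted R^2 and MSE(p) typically worsen
```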
Suppose we fit the model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon \qquad (1) \]

and enter the X's in the order given in the model. We then obtain extra sums of squares such as

\[ SSR_4 = SSR(\hat{\beta}_4 \mid \hat{\beta}_3, \hat{\beta}_2, \hat{\beta}_1, \hat{\beta}_0), \quad \text{the sequential sum of squares for } X_4, \]
\[ SSR_5 = SSR(\hat{\beta}_5 \mid \hat{\beta}_4, \hat{\beta}_3, \hat{\beta}_2, \hat{\beta}_1, \hat{\beta}_0), \quad \text{the sequential sum of squares for } X_5. \]

Now consider instead the smaller model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \qquad (2) \]

If SSR(β̂0, β̂1, β̂2, β̂3, β̂4, β̂5) = S1 and SSR(β̂0, β̂1, β̂2) = S2, then the extra sum of squares due to X3, X4 and X5, given that X1 and X2 are already in the model, is

\[ S_1 - S_2 = SSR(\hat{\beta}_3, \hat{\beta}_4, \hat{\beta}_5 \mid \hat{\beta}_2, \hat{\beta}_1, \hat{\beta}_0). \]

We can rewrite S1 − S2 as

\[ S_1 - S_2 = (SST - SSE_1) - (SST - SSE_2) = SSE_2 - SSE_1, \]

when it becomes a difference of residual sums of squares, but in reverse order, because the regression with the larger SSR (S1) must have the smaller SSE (SSE1), and vice versa for S2.
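The identity S1 − S2 = SSE2 − SSE1 can be verified directly by fitting the full and reduced models; the sketch below uses synthetic data with five regressors (illustrative only, not from the notes).

```python
import numpy as np

def ssr_sse(X, y):
    """Return the (corrected) regression sum of squares and the residual sum of squares."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = np.sum((y - X @ beta_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return sst - sse, sse

rng = np.random.default_rng(5)
n = 40
Z = rng.normal(size=(n, 5))                         # X1, ..., X5
y = 1 + Z @ np.array([1.0, -0.5, 0.8, 0.0, 0.3]) + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), Z])           # model (1): X1, ..., X5
X_red = np.column_stack([np.ones(n), Z[:, :2]])     # model (2): X1, X2 only

S1, SSE1 = ssr_sse(X_full, y)
S2, SSE2 = ssr_sse(X_red, y)

extra_ss = S1 - S2                         # SSR(b3, b4, b5 | b2, b1, b0)
print(np.isclose(extra_ss, SSE2 - SSE1))   # True: difference of SSEs in reverse order
```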
14 Partial Sums of Squares
If we have several terms in a regression model, we can think of them as entering the equation in any desired order. For each coefficient we shall then have a one-degree-of-freedom sum of squares, which measures the contribution to the regression sum of squares of each coefficient β̂_j given that all the terms that did not involve β_j were already in the model. Another way of saying that is that we have a measure of the value of β_j as though it were added to the model last. The corresponding mean square, equal to the sum of squares since it has one degree of freedom, can be compared, by means of an F-test, with the residual mean square MSE.
When a suitable model is being built, the partial F-test is a useful criterion for adding or removing terms from the model. The effect of an X-variable (X_q, say) in determining a response may be large when the regression equation includes only X_q. However, when the same variable is entered into the equation after other variables, it may affect the response very little, because X_q is highly correlated with variables already in the regression equation.

Table 1 and Table 2 below present the ANOVA tables for partial F-tests with only two regressors, for the two possible orders of entry.
Table 1: ANOVA table for partial F-test having two regressors

SV                      SS                  df      MS                  F
Regression | β̂0         SSR(β̂1, β̂2 | β̂0)   2       MSR(β̂1, β̂2 | β̂0)   MSR(β̂1, β̂2 | β̂0)/MSE
  Due to β̂1 | β̂0        SSR(β̂1 | β̂0)       1       MSR(β̂1 | β̂0)       MSR(β̂1 | β̂0)/MSE
  Due to β̂2 | β̂1, β̂0    SSR(β̂2 | β̂1, β̂0)   1       MSR(β̂2 | β̂1, β̂0)   MSR(β̂2 | β̂1, β̂0)/MSE
Residual                SSE                 n − 3   MSE
Total                   SST                 n − 1
Table 2: ANOVA table for partial F-test having two regressors, with the order of entry reversed

SV                      SS                  df      MS                  F
Regression | β̂0         SSR(β̂1, β̂2 | β̂0)   2       MSR(β̂1, β̂2 | β̂0)   MSR(β̂1, β̂2 | β̂0)/MSE
  Due to β̂2 | β̂0        SSR(β̂2 | β̂0)       1       MSR(β̂2 | β̂0)       MSR(β̂2 | β̂0)/MSE
  Due to β̂1 | β̂2, β̂0    SSR(β̂1 | β̂2, β̂0)   1       MSR(β̂1 | β̂2, β̂0)   MSR(β̂1 | β̂2, β̂0)/MSE
Residual                SSE                 n − 3   MSE
Total                   SST                 n − 1
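The entries of these tables can be computed from two fits. The sketch below (synthetic data; SciPy's F distribution is used only for the p-value, and all names are illustrative) forms SSR(β̂2 | β̂1, β̂0) as a difference of residual sums of squares and divides by the full-model MSE.

```python
import numpy as np
from scipy import stats

def sse_of(X, y):
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return np.sum((y - X @ beta_hat) ** 2)

rng = np.random.default_rng(6)
n = 25
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 1.5 * x1 + 0.2 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_b0_b1 = np.column_stack([np.ones(n), x1])

# Partial (extra) sum of squares for X2 given X1 and the intercept
ssr_x2_given = sse_of(X_b0_b1, y) - sse_of(X_full, y)   # SSR(b2 | b1, b0)
mse_full = sse_of(X_full, y) / (n - 3)                  # MSE of the full model

F = (ssr_x2_given / 1) / mse_full                       # 1 df for the added term
p_value = stats.f.sf(F, 1, n - 3)
print(F, p_value)
```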
Consider the regression model y = Xβ + ε with k regressors, and suppose we wish to determine whether some subset of r < k regressors contributes significantly to the regression model. Let us partition the vector of regression coefficients as

\[ \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \]

where β1 is (p − r) × 1 and β2 is r × 1. The model may then be written as

\[ y = X\beta + \varepsilon = X_1 \beta_1 + X_2 \beta_2 + \varepsilon, \]

where the n × (p − r) matrix X1 represents the columns of X associated with β1 and the n × r matrix X2 represents the columns of X associated with β2. This is called the full model.

The extra sum of squares method allows us to measure the effect of the regressors in X2 conditional on those in X1 by computing SSR(β̂2 | β̂1). However, if the columns in X1 are orthogonal to the columns in X2, we can determine a sum of squares due to β̂2 that is free of any dependence on the regressors in X1.
The estimating equations for the model y = Xβ + ε are (X⊤X)β̂ = X⊤y. If the columns of X1 are orthogonal to the columns of X2, then X1⊤X2 = 0 and X2⊤X1 = 0, and these estimating equations separate into the two sets

\[ X_1^\top X_1 \hat{\beta}_1 = X_1^\top y, \qquad X_2^\top X_2 \hat{\beta}_2 = X_2^\top y, \]

with solution

\[ \hat{\beta}_1 = (X_1^\top X_1)^{-1} X_1^\top y, \qquad \hat{\beta}_2 = (X_2^\top X_2)^{-1} X_2^\top y. \]

Note that the least-squares estimator β̂1 is then the same regardless of whether or not X2 is in the model.
For the full model,

\[
SSR(\hat{\beta}) = \hat{\beta}^\top X^\top y
= (\hat{\beta}_1^\top, \hat{\beta}_2^\top)
\begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix}
= \hat{\beta}_1^\top X_1^\top y + \hat{\beta}_2^\top X_2^\top y.
\]
However, the estimating equations form two sets, and for each set we have

\[ SSR(\hat{\beta}_1) = \hat{\beta}_1^\top X_1^\top y, \qquad SSR(\hat{\beta}_2) = \hat{\beta}_2^\top X_2^\top y. \]
Thus

\[ SSR(\hat{\beta}) = SSR(\hat{\beta}_1) + SSR(\hat{\beta}_2). \]

Therefore,

\[ SSR(\hat{\beta}_1 \mid \hat{\beta}_2) = SSR(\hat{\beta}) - SSR(\hat{\beta}_2) = SSR(\hat{\beta}_1) \]

and

\[ SSR(\hat{\beta}_2 \mid \hat{\beta}_1) = SSR(\hat{\beta}) - SSR(\hat{\beta}_1) = SSR(\hat{\beta}_2). \]
Consequently, SSR(β̂1) measures the contribution of the regressors in X1 to the model unconditionally, and likewise SSR(β̂2) measures the contribution of the regressors in X2 unconditionally.
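The orthogonal case is easy to demonstrate numerically. In the sketch below (synthetic columns, centred and orthogonalized so that X1⊤X2 = 0; illustrative only, not from the notes), the regression sums of squares are additive, SSR(β̂) = SSR(β̂1) + SSR(β̂2).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
x1 = rng.normal(size=n)
x1 = x1 - x1.mean()                                   # center x1
x2 = rng.normal(size=n)
x2 = x2 - x2.mean()
x2 = x2 - (x1 @ x2) / (x1 @ x1) * x1                  # orthogonalize x2 against x1
y = 3 + 1.2 * x1 - 0.7 * x2 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])                # columns associated with beta_1
X2 = x2.reshape(-1, 1)                                # column associated with beta_2
X = np.column_stack([X1, X2])                         # full model

def ssr_uncorrected(Xm, y):
    b = np.linalg.solve(Xm.T @ Xm, Xm.T @ y)
    return b @ (Xm.T @ y)                             # SSR(beta_hat) = beta_hat' X'y

print(np.allclose(X1.T @ X2, 0))                      # the two blocks are orthogonal
print(np.isclose(ssr_uncorrected(X, y),
                 ssr_uncorrected(X1, y) + ssr_uncorrected(X2, y)))  # additive SSR
```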