Multiple Linear Regression

1 Multiple Linear Regression
A regression model that involves more than one regressor variable is called a multiple regression model. With k regressors the model is

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon. \qquad (1) \]

The parameters β_j (j = 0, 1, 2, . . . , k) are called the regression coefficients. This model describes a hyperplane in the k-dimensional space of the regressor variables X_j.

• The parameter β_j represents the expected change in the response Y per unit change in X_j when all of the remaining regressor variables X_l (l ≠ j) are held constant. For this reason the parameters β_j (j = 1, 2, . . . , k) are often called the partial regression coefficients.

Now for n observations on Y and the X's, the model (1) can be expressed as

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad i = 1, 2, \dots, n. \qquad (2) \]
It is often convenient to deal with multiple regression models if they are expressed in matrix notation. This allows a very compact display of the model, data and results. In matrix notation, the model (2) is

\[ y = X\beta + \varepsilon \qquad (3) \]

where the vectors y, β, ε and the matrix X are

\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1k} \\
1 & x_{21} & x_{22} & \dots & x_{2k} \\
\vdots & \vdots & \vdots &  & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nk}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
\]
2 Assumptions
1. The model y = Xβ + ε is linear in the parameters β0, β1, β2, . . . , βk.
2. E(ε) = 0, i.e., E(y) = Xβ.
3. Cov(ε) = σ²I or Cov(y) = σ²I.
4. The errors are normally distributed, ε ∼ N(0, σ²I).
5. X is a non-stochastic (non-random, fixed) matrix.
6. n > p and rank(X) = p, i.e., the columns of the X matrix are linearly independent, where p = k + 1 is the number of parameters.
• If n < p or if there is a linear relationship among the x's, for example x5 = 2x2, then X is not of full column rank; X⊤X is then singular and a unique least-squares estimator of β does not exist.
• If the values of the xij ’s are planned (chosen by the researcher), then the X matrix
essentially contains the experimental design and is sometimes called the design matrix.
3 Ordinary Least-Squares (OLS) estimation
The method of least squares chooses β̂ to minimize

\[
S(\beta) = \sum_{i=1}^{n}\varepsilon_i^2 = \varepsilon^\top\varepsilon = (y - X\beta)^\top(y - X\beta)
= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta, \qquad (4)
\]

since β⊤X⊤y is a 1 × 1 matrix, or a scalar, and its transpose (β⊤X⊤y)⊤ = y⊤Xβ is the same scalar. The least-squares estimator β̂ must satisfy

\[
\frac{\partial S}{\partial \beta}\bigg|_{\hat{\beta}} = -2X^\top y + 2X^\top X\hat{\beta} = 0,
\]

which simplifies to

\[ X^\top X \hat{\beta} = X^\top y. \qquad (5) \]
Equations (5) are the least-squares estimating (normal) equations. To solve them, multiply both sides of (5) by the inverse of X⊤X. Thus, the least-squares estimator of β is

\[ \hat{\beta} = (X^\top X)^{-1} X^\top y, \]

provided that the inverse matrix (X⊤X)⁻¹ exists. The (X⊤X)⁻¹ matrix will always exist if the regressors are linearly independent, i.e., if no column of the X matrix is a linear combination of the other columns.
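As a concrete illustration, the following sketch (using NumPy and a small made-up data set; the numbers and variable names are illustrative, not from the notes) builds the design matrix X, forms X⊤X and X⊤y, and solves the normal equations (5) for β̂.

```python
import numpy as np

# Small synthetic data set: n = 6 observations, k = 2 regressors (illustrative only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

XtX = X.T @ X   # diagonal: sums of squares of each column; off-diagonal: cross products
Xty = X.T @ y   # cross products of the columns of X with y

# Solve the normal equations X'X beta_hat = X'y (preferred over forming the inverse explicitly)
beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)  # (beta0_hat, beta1_hat, beta2_hat)
```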
• The diagonal elements of X⊤X are the sums of squares of the elements in the columns of X, and the off-diagonal elements are the sums of cross products of the elements in the columns of X. Furthermore, note that the elements of X⊤y are the sums of cross products of the columns of X with the observations y.
• For the simple linear regression model y = α + βx + ε, show that

\[
\hat{\beta} = (\hat{\alpha}, \hat{\beta})^\top = (X^\top X)^{-1} X^\top y
= \begin{pmatrix}
\bar{y} - \hat{\beta}\bar{x} \\[6pt]
\dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{pmatrix}.
\]
• The fitted regression model corresponding to the levels of the regressor variables x⊤ = (1, x1, x2, . . . , xk) is

\[ \hat{y}_x = x^\top \hat{\beta} = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j x_j. \]
• The n × n matrix H = X(X⊤X)⁻¹X⊤ is usually called the hat matrix. It maps the vector of observed values into the vector of fitted values, ŷ = Hy. The hat matrix and its properties play a central role in regression analysis.
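The sketch below (again NumPy with synthetic data; illustrative only) computes H = X(X⊤X)⁻¹X⊤ and checks numerically that Hy equals the fitted values Xβ̂ and that H is symmetric and idempotent with trace p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T                # hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(H @ y, X @ beta_hat))             # H maps y to the fitted values
print(np.allclose(H, H.T), np.allclose(H @ H, H))   # symmetric and idempotent
print(np.isclose(np.trace(H), k + 1))               # trace(H) = p = k + 1
```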
1. Linearity: β̂ is a linear function of the observations.
Proof. We have

\[ \hat{\beta}_{OLS} = (X^\top X)^{-1} X^\top y = Ay \quad \text{with } A = (a_{ij}), \]

or

\[
\hat{\beta}_{OLS} =
\begin{pmatrix}
a_{10} & a_{20} & \dots & a_{n0} \\
a_{11} & a_{21} & \dots & a_{n1} \\
\vdots & \vdots &  & \vdots \\
a_{1j} & a_{2j} & \dots & a_{nj} \\
\vdots & \vdots &  & \vdots \\
a_{1k} & a_{2k} & \dots & a_{nk}
\end{pmatrix}
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix}
\implies
\begin{pmatrix}
\hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_j \\ \vdots \\ \hat{\beta}_k
\end{pmatrix}
=
\begin{pmatrix}
\sum a_{i0} y_i \\ \sum a_{i1} y_i \\ \vdots \\ \sum a_{ij} y_i \\ \vdots \\ \sum a_{ik} y_i
\end{pmatrix}
\]

\[ \implies \hat{\beta}_j = \sum_{i=1}^{n} a_{ij} y_i, \qquad j = 0, 1, 2, \dots, k, \]

so each β̂_j is a linear function of the observations.
2. Unbiasedness: E(β̂) = β.

Proof. We have

\[
\hat{\beta} = (X^\top X)^{-1} X^\top y = (X^\top X)^{-1} X^\top (X\beta + \varepsilon)
= (X^\top X)^{-1}(X^\top X)\beta + (X^\top X)^{-1} X^\top \varepsilon = \beta + (X^\top X)^{-1} X^\top \varepsilon.
\]

Taking expectations and using E(ε) = 0,

\[ E(\hat{\beta}) = \beta + (X^\top X)^{-1} X^\top E(\varepsilon) = \beta. \]

Thus, β̂ is an unbiased estimator of β irrespective of any distribution properties of the errors ε.
3. Variance: var(β̂) = σ²(X⊤X)⁻¹.

Proof. We have

\[
\mathrm{var}(\hat{\beta}) = \mathrm{var}\big[(X^\top X)^{-1} X^\top y\big]
= (X^\top X)^{-1} X^\top \mathrm{var}(y)\, X (X^\top X)^{-1}
= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1}
= \sigma^2 (X^\top X)^{-1},
\]

since var(y) = σ²I.

• If we let C = (X⊤X)⁻¹, the variance of β̂_j is σ²C_jj and the covariance between β̂_j and β̂_l is σ²C_jl.
4. Minimum variance: Among all linear and unbiased estimators of β, the OLS estimator β̂ has the smallest variance (Gauss-Markov theorem).

Proof. Since β̂ is a vector, its variance is actually a matrix. Consequently, we want to show that β̂ minimizes the variance of any linear combination of the estimated coefficients, l⊤β̂. We have

\[ \mathrm{var}(l^\top \hat{\beta}) = l^\top \mathrm{var}(\hat{\beta})\, l = l^\top [\sigma^2 (X^\top X)^{-1}] l = \sigma^2 l^\top (X^\top X)^{-1} l, \]

which is a scalar.

Now let β̃ = [(X⊤X)⁻¹X⊤ + D]y be any other linear estimator of β, where D is a p × n matrix of constants. Then

\[ E(\tilde{\beta}) = [(X^\top X)^{-1} X^\top + D] X\beta = (X^\top X)^{-1}(X^\top X)\beta + DX\beta = \beta + DX\beta. \]

Therefore, β̃ is unbiased for every β only if DX = 0.

The variance of β̃ is

\[
\mathrm{var}(\tilde{\beta}) = \mathrm{var}\big[\{(X^\top X)^{-1} X^\top + D\} y\big]
= [(X^\top X)^{-1} X^\top + D]\, \mathrm{var}(y)\, [(X^\top X)^{-1} X^\top + D]^\top
\]
\[
= \sigma^2 [(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1} + DX(X^\top X)^{-1} + (X^\top X)^{-1} X^\top D^\top + DD^\top]
= \sigma^2 [(X^\top X)^{-1} + DD^\top],
\]

because DX = 0, which in turn implies that (DX)⊤ = X⊤D⊤ = 0. As a result,

\[
\mathrm{var}(l^\top \tilde{\beta}) = l^\top \mathrm{var}(\tilde{\beta})\, l
= l^\top [\sigma^2 \{(X^\top X)^{-1} + DD^\top\}] l
= l^\top \mathrm{var}(\hat{\beta})\, l + \sigma^2 l^\top DD^\top l
= \mathrm{var}(l^\top \hat{\beta}) + \sigma^2 l^\top DD^\top l \ge \mathrm{var}(l^\top \hat{\beta}),
\]

since DD⊤ is positive semidefinite, so l⊤DD⊤l ≥ 0.
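The Gauss-Markov argument can also be checked numerically. In the sketch below (synthetic design matrix, assumed σ² = 1; building D from rows of I − H is just one convenient way to satisfy DX = 0, not part of the notes), the variance of l⊤β̃ computed from σ²[(X⊤X)⁻¹ + DD⊤] is never smaller than that of l⊤β̂.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 15, 2
p = k + 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
sigma2 = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
M = np.eye(n) - H

# Any D with DX = 0 gives another linear unbiased estimator beta_tilde = ((X'X)^{-1}X' + D) y.
# Rows of M = I - H are orthogonal to the columns of X, so p of them can serve as D.
D = 0.3 * M[:p, :]
print(np.allclose(D @ X, 0))                  # unbiasedness condition DX = 0 holds

var_ols   = sigma2 * XtX_inv                  # var(beta_hat)
var_tilde = sigma2 * (XtX_inv + D @ D.T)      # var(beta_tilde)

l = np.array([1.0, -2.0, 0.5])                # an arbitrary linear combination
print(l @ var_tilde @ l >= l @ var_ols @ l)   # True: OLS has the smaller variance
```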
Under the normality assumption the errors are independent N(0, σ²), so the likelihood function is

\[
L(\varepsilon, \beta, \sigma^2) = \prod_{i=1}^{n} f(\varepsilon_i)
= \frac{1}{(2\pi)^{n/2}\sigma^{n}}\, e^{-\frac{1}{2\sigma^2}\varepsilon^\top\varepsilon}
= \frac{1}{(2\pi)^{n/2}\sigma^{n}}\, e^{-\frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta)}.
\]

Now β̂_MLE is that estimator of β which maximizes L(y, X, β, σ²). Again, for any fixed σ², maximization of L is equivalent to minimization of the exponent (y − Xβ)⊤(y − Xβ), which is exactly the least-squares criterion S(β). So, under the normality assumption on ε, β̂_OLS is the MLE of β.
The log-likelihood is

\[
\ln L(y, X, \beta, \sigma) = l(y, X, \beta, \sigma) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta).
\]

Now

\[
\frac{\partial l}{\partial \sigma}\bigg|_{\hat{\beta},\, \tilde{\sigma}^2}
= -\frac{n}{\tilde{\sigma}} + \frac{1}{\tilde{\sigma}^3}(y - X\hat{\beta})^\top(y - X\hat{\beta}) = 0
\quad \implies \quad
\tilde{\sigma}^2 = \frac{(y - X\hat{\beta})^\top(y - X\hat{\beta})}{n}.
\]
5 Residual
The difference between the observed value y_i and the corresponding fitted value ŷ_i is the residual, e_i = y_i − ŷ_i. The n residuals may be written in matrix notation as

\[ e = y - \hat{y} = y - X\hat{\beta} = y - Hy = (I - H)y. \]
1. Residuals are orthogonal to the predicted values as well as to the design matrix X in the model y = Xβ + ε, i.e., (a) X⊤e = 0, (b) ŷ⊤e = 0.

Proof. (a) We have

\[ e = y - \hat{y} = y - Hy = (I - H)y = My \quad \text{with } M = I - H, \]

so that

\[ e = M(X\beta + \varepsilon) = X\beta - HX\beta + M\varepsilon = X\beta - X\beta + M\varepsilon = M\varepsilon, \]

since HX = X(X⊤X)⁻¹X⊤X = X. Now

\[ X^\top e = X^\top M\varepsilon = (X^\top - X^\top H)\varepsilon = (X^\top - X^\top)\varepsilon = 0. \]

(b) ŷ⊤e = (Xβ̂)⊤e = β̂⊤X⊤e = β̂⊤0 = 0, from (a).
2. If there is a β0 term in the model, then Σⁿᵢ₌₁ e_i = 0.

Proof. We know X⊤e = 0, or

\[
\begin{pmatrix}
1 & 1 & \dots & 1 \\
x_{11} & x_{21} & \dots & x_{n1} \\
\vdots & \vdots &  & \vdots \\
x_{1k} & x_{2k} & \dots & x_{nk}
\end{pmatrix}
\begin{pmatrix}
e_1 \\ e_2 \\ \vdots \\ e_n
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ 0
\end{pmatrix},
\]

or

\[
\begin{pmatrix}
\sum e_i \\ \sum x_{i1} e_i \\ \vdots \\ \sum x_{ik} e_i
\end{pmatrix}
=
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ 0
\end{pmatrix}.
\]

The first row gives Σⁿᵢ₌₁ e_i = 0.
3. var(e) = σ²M.

Proof. var(e) = var(Mε) = M var(ε) M⊤ = M σ²I M⊤ = σ²MM⊤ = σ²M, since M is symmetric and idempotent.
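Properties 1-3 are easy to verify numerically; the sketch below (synthetic data, NumPy; illustrative only) checks that X⊤e = 0, Σe_i = 0 and ŷ⊤e = 0, and that M = I − H is idempotent.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                      # M = I - H
e = M @ y                              # residuals
y_hat = H @ y                          # fitted values

print(np.allclose(X.T @ e, 0))         # residuals orthogonal to the columns of X
print(np.isclose(e.sum(), 0))          # sum of residuals is zero (intercept in the model)
print(np.isclose(y_hat @ e, 0))        # residuals orthogonal to fitted values
print(np.allclose(M @ M, M))           # M is idempotent, so var(e) = sigma^2 * M
```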
The residual (error) sum of squares is

\[
SSE = \sum e_i^2 = e^\top e = (y - \hat{y})^\top(y - \hat{y}) = (y - X\hat{\beta})^\top(y - X\hat{\beta})
= y^\top y - 2\hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top X\hat{\beta}.
\]

Since X⊤Xβ̂ = X⊤y (equivalently, X⊤e = X⊤y − X⊤Xβ̂ = 0), this last equation becomes

\[
SSE = y^\top y - 2\hat{\beta}^\top X^\top y + \hat{\beta}^\top X^\top y = y^\top y - \hat{\beta}^\top X^\top y.
\]
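A quick numerical check of this identity (synthetic data; illustrative only): e⊤e and y⊤y − β̂⊤X⊤y agree.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

sse_direct  = e @ e                            # e'e
sse_formula = y @ y - beta_hat @ (X.T @ y)     # y'y - beta_hat' X'y
print(np.isclose(sse_direct, sse_formula))     # True
```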
1. The total sum of squares is

\[
SST = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2 = y^\top y - \frac{(\sum y_i)^2}{n}
= y^\top y - \frac{1}{n} y^\top J y = y^\top \Big(I - \frac{1}{n}J\Big) y,
\]

where J is the n × n matrix of ones.

2. The regression sum of squares is

\[
SSR = \sum (\hat{y}_i - \bar{y})^2 = y^\top \Big(H - \frac{1}{n}J\Big) y.
\]

3. The residual sum of squares is

\[
SSE = e^\top e = y^\top (I - H)\, y.
\]

Since each of the matrices (I − (1/n)J), (H − (1/n)J) and (I − H) is symmetric, SST, SSR and SSE are quadratic forms.
• SST, as usual, has n − 1 degrees of freedom associated with it. SSE has n − p degrees of freedom associated with it, since p parameters are estimated in the regression function. Finally, SSR has p − 1 degrees of freedom associated with it, representing the number of X-variables X1, . . . , Xk.
Theorem 1. If y is a k × 1 random vector with mean µ and covariance matrix Σ and A is a k × k symmetric matrix of constants, then E(y⊤Ay) = tr(AΣ) + µ⊤Aµ.

Corollary 1.1. If y has mean 0 and variance σ²I, then E(y⊤Ay) = tr(Aσ²I) + 0 = σ²tr(A).

Theorem 2. If y ∼ N(µ, σ²I), then

\[ \frac{y^\top A y}{\sigma^2} \sim \chi'^2(r, \lambda) \]

if and only if A is idempotent with rank(A) = r, where the non-centrality parameter

\[ \lambda = \frac{\mu^\top A \mu}{\sigma^2}. \]
9 Unbiased estimator of σ²

We have SSE = e⊤e = y⊤(I − H)y. Now, applying Theorem 1 with A = I − H, E(y) = Xβ and Cov(y) = σ²I,

\[
E(SSE) = E[y^\top (I - H) y] = \mathrm{tr}[(I - H)\sigma^2 I] + (X\beta)^\top (I - H) X\beta = \sigma^2\{\mathrm{tr}(I) - \mathrm{tr}(H)\} + 0,
\]

since (I − H)X = X − X = 0. Hence

\[
E(SSE) = \sigma^2\{n - \mathrm{tr}(X(X^\top X)^{-1} X^\top)\} = \sigma^2\{n - \mathrm{tr}((X^\top X)^{-1} X^\top X)\} = \sigma^2\{n - \mathrm{tr}(I_p)\} = \sigma^2 (n - p)
\]

\[
\implies E\Big(\frac{SSE}{n - p}\Big) = \sigma^2 \implies E(MSE) = \sigma^2.
\]
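The result E(MSE) = σ² can be illustrated by simulation. The sketch below (a Monte Carlo run with an assumed true σ = 2 and an arbitrary β; the setup is illustrative, not from the notes) averages SSE/(n − p) and the MLE SSE/n over many simulated data sets: the former is centred at σ², the latter is biased downward by the factor (n − p)/n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 3
p = k + 1
sigma = 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.5, -1.0, 2.0])

mse_vals, mle_vals = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = np.sum((y - X @ beta_hat) ** 2)
    mse_vals.append(sse / (n - p))   # unbiased estimator of sigma^2
    mle_vals.append(sse / n)         # MLE of sigma^2 (biased)

print(np.mean(mse_vals))  # close to sigma^2 = 4
print(np.mean(mle_vals))  # close to sigma^2 * (n - p) / n = 3.2
```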
10 Coefficient of multiple determination
The coefficient of multiple determination for a subset regression model with p terms (p − 1 regressors and an intercept) is

\[ R_p^2 = \frac{SSR(p)}{SST} = 1 - \frac{SSE(p)}{SST}, \]

where SSR(p) and SSE(p) denote the regression sum of squares and the residual sum of squares, respectively, for a p-term subset model. R²_p is often called the proportion of the total variation in y explained by the fitted model.

• R²_p = 0: R²_p assumes the value 0 when all β̂_j = 0 (j = 1, 2, . . . , k). In that case, the fitted model reduces to ŷ_i = ȳ and SSR(p) = 0.

• R²_p = 1: R²_p takes the value 1 when all y observations fall directly on the fitted regression surface, i.e., when SSE(p) = 0.

• R²_p is the square of the multiple correlation coefficient, i.e., R_p = corr(y, ŷ).
11 Adjusted R²_{a.p}
A large R²_p does not necessarily imply that the fitted model is a useful one. Adding more regressors to the model can only increase R²_p and never reduce it, because SSE can never become larger with more regressors. Thus it is possible for models that have large values of R²_p to give poor predictions of new observations. Some analysts prefer to use the adjusted R²_{a.p} statistic, defined for a p-term equation as
\[
R^2_{a.p} = 1 - \frac{SSE/(n - p)}{SST/(n - 1)} = 1 - \frac{n - 1}{n - p}\cdot\frac{SSE}{SST} = 1 - \frac{n - 1}{n - p}\,(1 - R_p^2).
\]
The range of R²_{a.p} is

\[ 1 - \frac{n - 1}{n - p} \le R^2_{a.p} \le 1 \implies -\frac{k}{n - p} \le R^2_{a.p} \le 1. \]

It can be negative, and its value will always be less than or equal to that of R²_p. The R²_{a.p} statistic does not necessarily increase when additional regressors are introduced into the model.
• The R²_{a.p} penalizes us for adding terms that are not helpful, so it is very useful in evaluating and comparing candidate regression models (comparing equations fitted not only to a specific set of data but also to two or more entirely different data sets).

• The R²_{a.p} tells you the percentage of variation explained by only those regressors that actually help to explain the response.

• The R²_{a.p} criterion for selecting the best model is equivalent to the residual mean square criterion,

\[ MSE(p) = \frac{SSE(p)}{n - p}. \]

Because SSE(p) always decreases as p increases, MSE(p) initially decreases, then stabilizes, and eventually may increase. The eventual increase in MSE(p) occurs when the reduction in SSE(p) from adding a regressor to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator n − p.
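The sketch below (synthetic data with one genuine regressor and one pure-noise regressor; illustrative only, not from the notes) computes R²_p, R²_{a.p} and MSE(p) for two nested models: R²_p always rises when the noise column is added, while R²_{a.p} and MSE(p) typically get worse.

```python
import numpy as np

def fit_stats(X, y):
    """Return R^2_p, adjusted R^2_{a.p} and MSE(p) for the model with design matrix X."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = np.sum((y - X @ beta_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
    return r2, r2_adj, sse / (n - p)

rng = np.random.default_rng(4)
n = 30
x1 = rng.normal(size=n)
x_noise = rng.normal(size=n)                   # unrelated to y
y = 1 + 2 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, x_noise])

print(fit_stats(X_small, y))
print(fit_stats(X_big, y))   # R^2 rises slightly; adjusted R^2 and MSE(p) typically worsen
```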
Suppose we fit the model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon \qquad (1) \]

and enter the X's in the order given in the model. We then obtain extra sums of squares such as

\[ SSR_4 = SSR(\hat{\beta}_4 \mid \hat{\beta}_3, \hat{\beta}_2, \hat{\beta}_1, \hat{\beta}_0), \quad \text{the sequential sum of squares for } X_4, \]
\[ SSR_5 = SSR(\hat{\beta}_5 \mid \hat{\beta}_4, \hat{\beta}_3, \hat{\beta}_2, \hat{\beta}_1, \hat{\beta}_0), \quad \text{the sequential sum of squares for } X_5. \]

Now consider instead the smaller model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \qquad (2) \]

If SSR(β̂0, β̂1, β̂2, β̂3, β̂4, β̂5) = S1 and SSR(β̂0, β̂1, β̂2) = S2, then the extra sum of squares due to X3, X4 and X5, given that X1 and X2 are already in the model, is

\[ S_1 - S_2 = SSR(\hat{\beta}_3, \hat{\beta}_4, \hat{\beta}_5 \mid \hat{\beta}_2, \hat{\beta}_1, \hat{\beta}_0). \]

We can rewrite S1 − S2 as

\[ S_1 - S_2 = (SST - SSE_1) - (SST - SSE_2) = SSE_2 - SSE_1, \]

when it becomes a difference of residual sums of squares, but in reverse order, because the regression with the larger SSR (S1) must have the smaller SSE (SSE1), and vice versa for S2.
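The identity S1 − S2 = SSE2 − SSE1 can be verified directly by fitting the full and reduced models; the sketch below uses synthetic data with five regressors (illustrative only, not from the notes).

```python
import numpy as np

def ssr_sse(X, y):
    """Return the (corrected) regression sum of squares and the residual sum of squares."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sse = np.sum((y - X @ beta_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return sst - sse, sse

rng = np.random.default_rng(5)
n = 40
Z = rng.normal(size=(n, 5))                         # X1, ..., X5
y = 1 + Z @ np.array([1.0, -0.5, 0.8, 0.0, 0.3]) + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), Z])           # model (1): X1, ..., X5
X_red = np.column_stack([np.ones(n), Z[:, :2]])     # model (2): X1, X2 only

S1, SSE1 = ssr_sse(X_full, y)
S2, SSE2 = ssr_sse(X_red, y)

extra_ss = S1 - S2                         # SSR(b3, b4, b5 | b2, b1, b0)
print(np.isclose(extra_ss, SSE2 - SSE1))   # True: difference of SSEs in reverse order
```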
14 Partial Sums of Squares
If we have several terms in a regression model, we can think of them as entering the equation in any desired order. For each coefficient we shall then have a one-degree-of-freedom sum of squares, which measures the contribution to the regression sum of squares of each coefficient β̂_j given that all the terms that did not involve β_j were already in the model. Another way of saying that is that we have a measure of the value of β_j as though it were added to the model last. The corresponding mean square, equal to the sum of squares since it has one degree of freedom, can be compared, by means of an F-test, with the residual mean square MSE.
When a suitable model is being built, the partial F-test is a useful criterion for adding or removing terms from the model. The effect of an X-variable (X_q, say) in determining a response may be large when the regression equation includes only X_q. However, when the same variable is entered into the equation after other variables, it may affect the response very little, because X_q is highly correlated with variables already in the regression equation.

Table 1 and Table 2 below present the ANOVA tables for partial F-tests with only two regressors, for the two possible orders of entry.
Table 1: ANOVA table for partial F-test having two regressors

SV                      SS                  df      MS                  F
Regression | β̂0         SSR(β̂1, β̂2 | β̂0)   2       MSR(β̂1, β̂2 | β̂0)   MSR(β̂1, β̂2 | β̂0)/MSE
  Due to β̂1 | β̂0        SSR(β̂1 | β̂0)       1       MSR(β̂1 | β̂0)       MSR(β̂1 | β̂0)/MSE
  Due to β̂2 | β̂1, β̂0    SSR(β̂2 | β̂1, β̂0)   1       MSR(β̂2 | β̂1, β̂0)   MSR(β̂2 | β̂1, β̂0)/MSE
Residual                SSE                 n − 3   MSE
Total                   SST                 n − 1
Table 2: ANOVA table for partial F-test having two regressors, with the order of entry reversed

SV                      SS                  df      MS                  F
Regression | β̂0         SSR(β̂1, β̂2 | β̂0)   2       MSR(β̂1, β̂2 | β̂0)   MSR(β̂1, β̂2 | β̂0)/MSE
  Due to β̂2 | β̂0        SSR(β̂2 | β̂0)       1       MSR(β̂2 | β̂0)       MSR(β̂2 | β̂0)/MSE
  Due to β̂1 | β̂2, β̂0    SSR(β̂1 | β̂2, β̂0)   1       MSR(β̂1 | β̂2, β̂0)   MSR(β̂1 | β̂2, β̂0)/MSE
Residual                SSE                 n − 3   MSE
Total                   SST                 n − 1
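The entries of these tables can be computed from two fits. The sketch below (synthetic data; SciPy's F distribution is used only for the p-value, and all names are illustrative) forms SSR(β̂2 | β̂1, β̂0) as a difference of residual sums of squares and divides by the full-model MSE.

```python
import numpy as np
from scipy import stats

def sse_of(X, y):
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return np.sum((y - X @ beta_hat) ** 2)

rng = np.random.default_rng(6)
n = 25
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 1.5 * x1 + 0.2 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_b0_b1 = np.column_stack([np.ones(n), x1])

# Partial (extra) sum of squares for X2 given X1 and the intercept
ssr_x2_given = sse_of(X_b0_b1, y) - sse_of(X_full, y)   # SSR(b2 | b1, b0)
mse_full = sse_of(X_full, y) / (n - 3)                  # MSE of the full model

F = (ssr_x2_given / 1) / mse_full                       # 1 df for the added term
p_value = stats.f.sf(F, 1, n - 3)
print(F, p_value)
```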
Consider the regression model y = Xβ + ε with k regressors, and suppose we wish to determine whether some subset of r < k regressors contributes significantly to the regression model. Let us partition the vector of regression coefficients as

\[ \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \]

where β1 is (p − r) × 1 and β2 is r × 1. The model may then be written as

\[ y = X\beta + \varepsilon = X_1 \beta_1 + X_2 \beta_2 + \varepsilon, \]

where the n × (p − r) matrix X1 represents the columns of X associated with β1 and the n × r matrix X2 represents the columns of X associated with β2. This is called the full model.

The extra sum of squares method allows us to measure the effect of the regressors in X2 conditional on those in X1 by computing SSR(β̂2 | β̂1). However, if the columns in X1 are orthogonal to the columns in X2, we can determine a sum of squares due to β̂2 that is free of any dependence on the regressors in X1.
The estimating equations for the model y = Xβ + ε are (X⊤X)β̂ = X⊤y. If the columns of X1 are orthogonal to the columns of X2, then X1⊤X2 = 0 and X2⊤X1 = 0, and these estimating equations separate into the two sets

\[ X_1^\top X_1 \hat{\beta}_1 = X_1^\top y, \qquad X_2^\top X_2 \hat{\beta}_2 = X_2^\top y, \]

with solution

\[ \hat{\beta}_1 = (X_1^\top X_1)^{-1} X_1^\top y, \qquad \hat{\beta}_2 = (X_2^\top X_2)^{-1} X_2^\top y. \]

Note that the least-squares estimator β̂1 is then the same regardless of whether or not X2 is in the model.
For the full model,

\[
SSR(\hat{\beta}) = \hat{\beta}^\top X^\top y
= (\hat{\beta}_1^\top, \hat{\beta}_2^\top)
\begin{pmatrix} X_1^\top y \\ X_2^\top y \end{pmatrix}
= \hat{\beta}_1^\top X_1^\top y + \hat{\beta}_2^\top X_2^\top y.
\]
However, the estimating equations form two sets, and for each set we have

\[ SSR(\hat{\beta}_1) = \hat{\beta}_1^\top X_1^\top y, \qquad SSR(\hat{\beta}_2) = \hat{\beta}_2^\top X_2^\top y. \]
Thus

\[ SSR(\hat{\beta}) = SSR(\hat{\beta}_1) + SSR(\hat{\beta}_2). \]

Therefore,

\[ SSR(\hat{\beta}_1 \mid \hat{\beta}_2) = SSR(\hat{\beta}) - SSR(\hat{\beta}_2) = SSR(\hat{\beta}_1) \]

and

\[ SSR(\hat{\beta}_2 \mid \hat{\beta}_1) = SSR(\hat{\beta}) - SSR(\hat{\beta}_1) = SSR(\hat{\beta}_2). \]
Consequently, SSR(β̂1) measures the contribution of the regressors in X1 to the model unconditionally, and likewise SSR(β̂2) measures the contribution of the regressors in X2 unconditionally.
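The orthogonal case is easy to demonstrate numerically. In the sketch below (synthetic columns, centred and orthogonalized so that X1⊤X2 = 0; illustrative only, not from the notes), the regression sums of squares are additive, SSR(β̂) = SSR(β̂1) + SSR(β̂2).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
x1 = rng.normal(size=n)
x1 = x1 - x1.mean()                                   # center x1
x2 = rng.normal(size=n)
x2 = x2 - x2.mean()
x2 = x2 - (x1 @ x2) / (x1 @ x1) * x1                  # orthogonalize x2 against x1
y = 3 + 1.2 * x1 - 0.7 * x2 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])                # columns associated with beta_1
X2 = x2.reshape(-1, 1)                                # column associated with beta_2
X = np.column_stack([X1, X2])                         # full model

def ssr_uncorrected(Xm, y):
    b = np.linalg.solve(Xm.T @ Xm, Xm.T @ y)
    return b @ (Xm.T @ y)                             # SSR(beta_hat) = beta_hat' X'y

print(np.allclose(X1.T @ X2, 0))                      # the two blocks are orthogonal
print(np.isclose(ssr_uncorrected(X, y),
                 ssr_uncorrected(X1, y) + ssr_uncorrected(X2, y)))  # additive SSR
```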