Appliedstat 2017 Chapter 4 5
The simple linear regression model is
$$y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1, 2, \ldots, n. \tag{4.1}$$
4.2 Estimation
In the least-squares approach, we seek estimators $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of squared deviations $\sum_{i=1}^n (y_i - \hat y_i)^2$ of the $n$ observed $y_i$'s from their predicted values.
• Normal equations:
$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0.$$
Note that $\sum_{i=1}^n a_i = 0$ and $\sum_{i=1}^n a_i^2 = 1/S_{XX}$, where $a_i = (x_i - \bar x)/S_{XX}$ are the weights in $\hat\beta_1 = \sum_{i=1}^n a_i y_i$ and $S_{XX} = \sum_{i=1}^n (x_i - \bar x)^2$.
Solving the normal equations gives
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$
$\hat\beta_1$ can also be expressed as $\hat\beta_1 = S_{Xy}/S_{XX} = r\sqrt{S_{yy}/S_{XX}} = r\,\widehat{SD}_y/\widehat{SD}_x$, i.e., the sample correlation times the ratio of the sample standard deviations of $y$ and $x$.
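A minimal R sketch, using simulated data with illustrative names (x, y, b0, b1 are not from the notes' example), computes the closed-form estimates and checks them against lm():

set.seed(1)
n = 30
x = runif(n, 0, 10)
y = 2 + 0.5*x + rnorm(n)                   # simulated responses
Sxx = sum((x - mean(x))^2)                 # S_XX
Sxy = sum((x - mean(x))*(y - mean(y)))     # S_Xy
b1 = Sxy/Sxx                               # slope estimate
b0 = mean(y) - b1*mean(x)                  # intercept estimate
c(b0, b1)
coef(lm(y ~ x))                            # agrees with (b0, b1)
r = cor(x, y); Syy = sum((y - mean(y))^2)
r*sqrt(Syy/Sxx)                            # equals b1, illustrating beta1-hat = r*sqrt(Syy/Sxx)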
• $E(\hat\beta_1) = \beta_1$ and $E(\hat\beta_0) = \beta_0$, with
$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \mathrm{Var}(\hat\beta_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right], \qquad \mathrm{cov}(\bar y, \hat\beta_1) = 0.$$
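Continuing the simulated example above (a sketch, not the notes' data), the variance formulas with $\sigma^2$ replaced by its estimate MSE can be checked against vcov():

fit0 = lm(y ~ x)                            # simple regression fit from the sketch above
MSE = sum(resid(fit0)^2)/(n - 2)            # sigma^2-hat = SSE/(n-2)
var_b1 = MSE/Sxx                            # estimated Var(beta1-hat)
var_b0 = MSE*(1/n + mean(x)^2/Sxx)          # estimated Var(beta0-hat)
c(var_b0, var_b1)
diag(vcov(fit0))                            # same values reported by lm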
• Partition of the sum of squares: using the orthogonality between $\hat e$ and $\hat y$,
$$\sum_{i=1}^n y_i^2 = \sum_{i=1}^n \hat y_i^2 + \sum_{i=1}^n (y_i - \hat y_i)^2,$$
or
$$\underbrace{\sum_{i=1}^n (y_i - \bar y)^2}_{SST} = \underbrace{\sum_{i=1}^n (\hat y_i - \bar y)^2}_{SSR} + \underbrace{\sum_{i=1}^n (y_i - \hat y_i)^2}_{SSE}.$$
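The partition can be verified numerically, again continuing the simulated sketch above:

yhat = fitted(fit0)
SST = sum((y - mean(y))^2)
SSR = sum((yhat - mean(y))^2)
SSE = sum((y - yhat)^2)
c(SST, SSR + SSE)                           # the two numbers coincide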
• $H_0: \beta_1 = \beta_{10}$. The test statistic
$$T = \frac{(\hat\beta_1 - \beta_{10})/(\sigma/\sqrt{S_{XX}})}{\sqrt{\hat\sigma^2/\sigma^2}} = \frac{\hat\beta_1 - \beta_{10}}{\hat\sigma/\sqrt{S_{XX}}}$$
is distributed as $t$ with $n-2$ degrees of freedom under the null, where $\hat\sigma^2 = SSE/(n-2) = MSE$.
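Continuing the same simulated sketch, the $t$ statistic for $H_0: \beta_1 = \beta_{10}$ (here $\beta_{10} = 0$) can be formed by hand and compared with the value reported by summary():

beta10 = 0                                          # hypothesized value under H0
sigma_hat = sqrt(sum(resid(fit0)^2)/(n - 2))        # sqrt(MSE)
Tstat = (coef(fit0)["x"] - beta10)/(sigma_hat/sqrt(Sxx))
Tstat
summary(fit0)$coefficients["x", "t value"]          # same t value from lm
2*pt(-abs(Tstat), df = n - 2)                       # two-sided p-value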
• $H_0: \beta_0 = \beta_{00}$. $\hat\beta_0$ is distributed as normal with mean $E(\hat\beta_0) = \beta_0$ and variance $\mathrm{var}(\bar y - \hat\beta_1\bar x) = \sigma^2\big(1/n + \bar x^2/\sum_{i=1}^n (x_i - \bar x)^2\big)$; note that $\mathrm{cov}(\bar y, \hat\beta_1\bar x) = 0$. The test statistic is
$$T = \frac{\hat\beta_0 - \beta_{00}}{\sqrt{\hat\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar x^2}{S_{XX}}\right)}},$$
which again follows a $t$ distribution with $n-2$ degrees of freedom under the null.
• The variance of a fitted value is $\mathrm{var}(\hat y_i) = \sigma^2\big[1/n + (x_i - \bar x)^2/S_{XX}\big]$.
4.5 Residuals
• The model errors are $e_i = y_i - \beta_0 - \beta_1 x_i$, and the residuals are
$$\hat e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i,$$
with $\mathrm{var}(\hat e_i) = \sigma^2\big[1 - 1/n - (x_i - \bar x)^2/S_{XX}\big]$.
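Continuing the simulated sketch, the leverages $h_{ii} = 1/n + (x_i - \bar x)^2/S_{XX}$ and the residual variances can be checked against hatvalues() and rstandard():

h = 1/n + (x - mean(x))^2/Sxx                       # leverages h_ii
range(h - hatvalues(fit0))                          # matches R's hat values (numerically zero)
var_res = sigma_hat^2*(1 - h)                       # estimated var(e_i-hat)
range(resid(fit0)/sqrt(var_res) - rstandard(fit0))  # internally studentized residuals match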
Multiple Regression
• The model and assumptions: the multiple linear regression model can be expressed as
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + e_i = X_i\beta + e_i,$$
or in matrix form,
$$y = X\beta + e.$$
The assumptions for $e_i$ or $y_i$ are essentially the same as those for simple linear regression:
1. $E(e_i) = 0$, or $E(y_i \mid X_i) = X_i\beta$.
2. $\mathrm{cov}(y) = \sigma^2 I_n$.
• Gauss–Markov property: let $\tilde\beta = Ay$ be any linear unbiased estimator of $\beta$, so that $AX = I$. Then $\mathrm{var}(\tilde\beta) \ge \mathrm{var}(\hat\beta)$ in the sense that the difference is nonnegative definite: the least-squares estimator $\hat\beta$ has the smallest variance, and equality holds when $A = (X^TX)^{-1}X^T$.
5.2 Estimation of $\sigma^2$
• Note that $\mathrm{var}(\hat\beta) = \sigma^2(X^TX)^{-1}$ is unknown. Its estimator can be obtained by
$$\widehat{\mathrm{var}}(\hat\beta) = s^2(X^TX)^{-1}.$$
• Under normally distributed errors,
$$\hat\beta \sim N\big(\beta,\ \sigma^2(X^TX)^{-1}\big), \qquad \frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2(n-p-1).$$
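A minimal multiple-regression sketch with simulated data (redefining X, y, and n with illustrative names; not the gas example used later) checks $\hat\beta = (X^TX)^{-1}X^Ty$ and $\widehat{\mathrm{var}}(\hat\beta) = s^2(X^TX)^{-1}$ against lm():

set.seed(2)
n = 40
Xc = matrix(rnorm(n*3), n, 3, dimnames = list(NULL, c("x1","x2","x3")))
X = cbind(1, Xc)                                    # design matrix with intercept column
beta = c(1, 2, 0, -1)                               # true coefficients (assumed for simulation)
y = drop(X %*% beta + rnorm(n))
bhat = solve(t(X) %*% X, t(X) %*% y)                # (X'X)^{-1} X'y
s2 = sum((y - X %*% bhat)^2)/(n - 3 - 1)            # s^2 = SSE/(n - p - 1), p = 3
V = s2*solve(t(X) %*% X)                            # var-hat(beta-hat) = s^2 (X'X)^{-1}
fitm = lm(y ~ Xc)                                   # the same fit via lm
cbind(drop(bhat), coef(fitm))                       # identical estimates
range(V - vcov(fitm))                               # identical variance estimates (numerically zero)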
• Geometry of least squares: consider the subspace of $E^n$ for which the columns of $X$ provide the basis vectors. We may denote the $k$ columns of $X$ as $x_1, x_2, \cdots, x_k$. Then the subspace associated with these $k$ basis vectors consists of all linear combinations of the $x_i$, $i = 1, \cdots, k$. Formally, it is defined as
$$S(x_1, \cdots, x_k) \equiv \Big\{ z \in E^n : z = \sum_{i=1}^k b_i x_i,\; b_i \in \mathbb{R} \Big\}.$$
We would like to represent $y$ using $Xb$ and find $b$ such that the Euclidean distance between $y$ and $Xb$ is the shortest. Such an $X\hat\beta$ should be the projection of $y$ onto $S(x_1, \cdots, x_k)$. Let $\hat e = y - X\hat\beta$; i.e., for every element of $S(x_1, \cdots, x_k)$, represented by $Xb$, it should satisfy
$$\langle Xb, \hat e\rangle = (Xb)^T\hat e = b^T X^T \hat e = 0.$$
The following two figures (from the textbook) illustrate the geometry of least squares: the unknown parameter vector is viewed as a point in the $(k+1)$-dimensional parameter space, the data vector $y$ as a point in the $n$-dimensional data space, and the subspace spanned by the columns of $X$, called the prediction space, contains the fitted vector $X\hat\beta$.
Figure 7.3 Parameter space, data space, and prediction space with representative elements.
Figure 7.4 Geometric relationships of vectors associated with the multiple linear regression model.
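The orthogonality condition $X^T\hat e = 0$ can be checked numerically, continuing the simulated multiple-regression sketch above:

ehat = y - X %*% bhat                               # residual vector e-hat
t(X) %*% ehat                                       # X' e-hat = 0 (numerically zero)
crossprod(X %*% bhat, ehat)                         # <Xb, e-hat> = 0 for b = bhat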
• Distributions of the sums of squares: under the normal model,
$$\frac{1}{\sigma^2}SSE \sim \chi^2(n-p-1).$$
Let $\lambda_T = \frac{1}{2\sigma^2}\mu^T\big(I_n - \tfrac{1}{n}J_n\big)\mu = \frac{1}{2\sigma^2}\beta^TX^T\big(I_n - \tfrac{1}{n}J_n\big)X\beta$. Then
$$\frac{1}{\sigma^2}SST \sim \chi^2(n-1, \lambda_T).$$
Let $\lambda_R = \frac{1}{2\sigma^2}\mu^T\big(H - \tfrac{1}{n}J_n\big)\mu = \frac{1}{2\sigma^2}\beta^TX^T\big(H - \tfrac{1}{n}J_n\big)X\beta$. Then
$$\frac{1}{\sigma^2}SSR \sim \chi^2(p, \lambda_R).$$
• Properties of $R^2 = SSR/SST$:
1. $0 \le R^2 \le 1$.
2. $R = \widehat{\mathrm{corr}}(y_i, \hat y_i)$, where $\widehat{\mathrm{corr}}$ is the sample correlation.
3. If $\beta_1 = \beta_2 = \cdots = \beta_p = 0$, then $E(R^2) = p/(n-1)$.
• Adjusted $R_a^2 = \dfrac{\big(R^2 - \frac{p}{n-1}\big)(n-1)}{n-p-1} = \dfrac{(n-1)R^2 - p}{n-p-1}$. Details will be discussed later.
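Continuing the simulated multiple-regression sketch, property 2 and the adjusted $R^2$ formula can be checked against summary():

R2 = summary(fitm)$r.squared
c(sqrt(R2), cor(y, fitted(fitm)))                   # property 2: R equals corr-hat(y, y-hat)
R2adj = 1 - (1 - R2)*(n - 1)/(n - 3 - 1)            # adjusted R^2 with p = 3 covariates
c(R2adj, summary(fitm)$adj.r.squared)               # agrees with lm's adjusted R^2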
• Compare the full model $y = X_1\beta_1 + X_2\beta_2 + e$ with the reduced model
$$y = X_1\beta_1^* + e^*.$$
(i) Let $A = (X_1^TX_1)^{-1}X_1^TX_2$ and $B = X_2^TX_2 - X_2^TX_1A$. Then $\mathrm{var}(\hat\beta_j) > \mathrm{var}(\hat\beta_j^*)$, where the $\hat\beta_j$'s are the entries of $\hat\beta_1$ in the full model and the $\hat\beta_j^*$'s are the entries of $\hat\beta_1^*$ in the reduced model.
(ii) $\mathrm{var}(x_0^T\hat\beta) \ge \mathrm{var}(x_{01}^T\hat\beta_1^*)$, where $x_{01}$ is the part of $x_0$ that corresponds to $\beta_1$.
(Proof) (i) We can verify (i) by directly applying the result on the inverse of a partitioned matrix.
On the other hand, when the reduced model is fitted but the full model holds,
$$s^2 = \frac{(y - X_1\hat\beta_1^*)^T(y - X_1\hat\beta_1^*)}{n - q}$$
has
$$E(s^2) = \sigma^2 + \frac{\beta_2^TX_2^T\big(I - X_1(X_1^TX_1)^{-1}X_1^T\big)X_2\beta_2}{n - q},$$
where $q$ is the dimension of $\beta_1$.
(Proof) $(n - q)s^2 = (y - X_1\hat\beta_1^*)^T(y - X_1\hat\beta_1^*) = y^T\big(I_n - X_1(X_1^TX_1)^{-1}X_1^T\big)y$, and the result follows by taking the expectation of this quadratic form under $E(y) = X_1\beta_1 + X_2\beta_2$.
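A rough numerical illustration of this bias, continuing the simulated sketch above and treating the intercept and x1 as $X_1$ and (x2, x3) as the omitted $X_2$ (an arbitrary split chosen for illustration):

X1 = X[, 1:2]; X2 = X[, 3:4]; beta2 = beta[3:4]     # split of the simulated design
q = ncol(X1)                                        # q = dimension of beta1
H1 = X1 %*% solve(t(X1) %*% X1) %*% t(X1)
bias = drop(t(beta2) %*% t(X2) %*% (diag(n) - H1) %*% X2 %*% beta2)/(n - q)
bias                                                # amount by which E(s^2) exceeds sigma^2 = 1
fit_red = lm(y ~ X[, 2])                            # reduced model with x1 only
sum(resid(fit_red)^2)/(n - q)                       # one realization of s^2 from the reduced fit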
5.6 Orthogonalization
• If we regress each column $x_{2j}$ of $X_2$ on $X_1$, then the regression coefficient vector for $x_{2j}$ on $X_1$ is $b_j = (X_1^TX_1)^{-1}X_1^Tx_{2j}$, and $\hat x_{2j} = X_1b_j = X_1(X_1^TX_1)^{-1}X_1^Tx_{2j}$. For all columns of $X_2$, this becomes $\hat X_2(X_1) = X_1(X_1^TX_1)^{-1}X_1^TX_2 = X_1A$, where $A = (X_1^TX_1)^{-1}X_1^TX_2$. Note that
$$X_{2.1} = X_2 - \hat X_2(X_1)$$
is orthogonal to $X_1$: $X_1^TX_{2.1} = 0$. Since $X_{2.1}$ is orthogonal to $X_1$, regressing $y$ on $X_{2.1}$ gives the same $\hat\beta_2$ as in the full model $\hat y = X_1\hat\beta_1 + X_2\hat\beta_2$.
• Remark 1: The above three steps imply that the estimation of $\beta_2$ can be interpreted through the partial correlation between $y$ and $X_2$ after removing the effect of $X_1$. In other words, the estimation of $\beta_2$ accounts for the variation in $y$ due to $X_2$ after the effect of $X_1$ has been removed, thereby correcting for the relationship between $X_1$ and $X_2$.
• Remark 2: One can directly show that $\hat\beta_2$ and $\tilde\beta_2$ are equivalent, where $\tilde\beta_2$ is the estimator obtained by regressing $y - \hat y(X_1)$ on $X_{2.1}$:
$$\tilde\beta_2 = (X_{2.1}^TX_{2.1})^{-1}X_{2.1}^T\big(y - \hat y(X_1)\big) = (X_{2.1}^TX_{2.1})^{-1}X_{2.1}^Ty.$$
Transforming the normal equations of the full model gives
$$\begin{pmatrix} X_2^T(I - H_1)X_2 & 0 \\ X_1^TX_2 & X_1^TX_1 \end{pmatrix}\begin{pmatrix}\beta_2 \\ \beta_1\end{pmatrix} = \begin{pmatrix} X_2^T(I - H_1)y \\ X_1^Ty\end{pmatrix}.$$
Thus,
$$\hat\beta_2 = \big[X_2^T(I - H_1)X_2\big]^{-1}X_2^T(I - H_1)y = \tilde\beta_2.$$
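Continuing the same partition of the simulated design, a quick check that $\tilde\beta_2$ computed from $X_{2.1}$ reproduces $\hat\beta_2$ from the full fit:

A = solve(t(X1) %*% X1) %*% t(X1) %*% X2            # A = (X1'X1)^{-1} X1'X2
X2.1 = X2 - X1 %*% A                                # X2 adjusted for X1
max(abs(t(X1) %*% X2.1))                            # X1'X2.1 = 0 (numerically zero)
beta2_tilde = solve(t(X2.1) %*% X2.1, t(X2.1) %*% y)
cbind(beta2_tilde, coef(fitm)[3:4])                 # equals beta2-hat from the full model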
• Centered model: writing the model as $y = \alpha 1_n + X_c\beta_c + e$, where $X_c$ contains the centered covariates, the normal equations become
$$\begin{pmatrix} n & 0 \\ 0 & X_c^TX_c \end{pmatrix}\begin{pmatrix}\hat\alpha \\ \hat\beta_c\end{pmatrix} = \begin{pmatrix} n\bar y \\ X_c^Ty\end{pmatrix}.$$
Then $\hat\alpha = \bar y$ and $\hat\beta_c = (X_c^TX_c)^{-1}X_c^Ty$.
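A quick check, continuing the simulated sketch: centering the covariates leaves the slopes unchanged and makes $\hat\alpha = \bar y$:

Xcen = scale(Xc, center = TRUE, scale = FALSE)      # centered covariate matrix
fit_cen = lm(y ~ Xcen)
c(coef(fit_cen)[1], mean(y))                        # alpha-hat equals y-bar
range(coef(fit_cen)[-1] - coef(fitm)[-1])           # slopes are unchanged by centering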
5.8 Adjusted $R^2$
• $R^2_{adj,p} = 1 - (1 - R_p^2)\dfrac{n-1}{n-p} = 1 - \dfrac{SSE_p/(n-p)}{SST/(n-1)} = 1 - (n-1)\,MSE_p/SST$, where $MSE_p = SSE_p/(n-p)$ is the mean squared error when $p$ covariates are considered. Note that the model with the largest $R^2_{adj,p}$ is equivalent to the model with the smallest $MSE_p$.
• For a larger model with $k > p$ covariates, letting $F$ denote the $F$ statistic for testing the additional $k - p$ covariates,
$$R^2_{adj,p} = 1 - (1 - R_k^2)\,\frac{n-1}{n-k}\cdot\frac{(n-k) + F(k-p)}{n-p}.$$
One can also show that $F \ge 1$ and $R^2_{adj,p} \le R^2_{adj,k}$ are equivalent. This implies that model selection using the adjusted $R^2$ tends to overfit: the larger model is preferred whenever $F \ge 1$, a much weaker requirement than exceeding the usual $F$ critical value.
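Continuing the simulated sketch (not the gas data below), dropping one covariate illustrates the relation between the $F$ statistic and the adjusted $R^2$ comparison:

fit_p = lm(y ~ Xc[, c("x1", "x2")])                 # smaller model with p = 2 covariates
Fstat = anova(fit_p, fitm)$F[2]                     # F statistic for the dropped covariate x3
Fstat
summary(fitm)$adj.r.squared >= summary(fit_p)$adj.r.squared   # TRUE exactly when F >= 1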
x1 = tank temperature (°F)
x2 = gasoline temperature (°F)
Estimation of β

attach(gas)
fit = lm(y ~ x1 + x2 + x3 + x4)           # fitted model (assumed form; the original command is not shown)
X = model.matrix(fit)                     # design matrix including the intercept column (assumed definition)
t(chol2inv(chol(t(X)%*%X))%*%t(X)%*%y)    # direct computation of (X'X)^{-1} X'y
(Intercept) x1 x2 x3 x4
coefficients(fit)
(Intercept) x1 x2 x3 x4
Sums of squares

SST= sum((y-mean(y))^2)
SSR= sum((fit$fitted-mean(y))^2)
SSE= sum((y-fit$fitted)^2)    ### or sum((fit$res)^2)
SSR
[1] 2520
SSE
[1] 201.2
SST-SSR
[1] 201.2
H=X%*%solve(t(X)%*%X)%*%t(X)     # hat matrix
n=length(y)
J=rep(1,n)%*%t(rep(1,n))         # n x n matrix of ones
SSE2= t(y)%*%(diag(n)-H)%*%y     # SSE as a quadratic form
SSE2
     [,1]
[1,] 201.2
SSR2= t(y)%*%(H-J/n)%*%y         # SSR as a quadratic form (assumed command matching the output below)
SSR2
     [,1]
[1,] 2520
R2
summary(fit)$r.squared
[1] 0.9261
SSR/SST
[1] 0.9261
R2=SSR/SST
summary(fit)$adj.r.squared
[1] 0.9151
1-(1-R2)*(n-1)/(n-4-1)           # adjusted R-squared computed by hand (assumed command; 4 covariates)
[1] 0.9151
summary(fit)
anova(fit)
attributes(summary(fit))      # attributes (names, class) of the summary object
attributes(anova(fit))        # attributes of the anova table
par(mfrow=c(2,2))             # 2 x 2 panel of diagnostic plots
plot(fit,which=1)             # residuals vs fitted values
plot(fit,which=2)             # normal Q-Q plot
plot(fit,which=3)             # scale-location plot
plot(fit,which=4)             # Cook's distance