Appliedstat 2017 Chapter 4 5

Chapter 4
Simple Linear Regression
4.1 The model
The simple linear regression model for n observations can be written as
y i = β 0 + β 1 x i + ei , (4.1)
i = 1, 2, ..., n.
To complete the model in (4.1), we make the following additional assumptions:
1. E(ei | xi ) = 0 for all i = 1, 2, ..., n, or, equivalently, E(yi | xi ) = β 0 + β 1 xi .
2. var (ei | xi ) = σ2 for all i = 1, 2, ..., n, or, equivalently, var (yi | xi ) = σ2 .
3. cov(ei , e j | xi , x j ) = 0 for all i 6= j or, equivalently, cov(yi , y j | xi , x j ) = 0.
Conditioning arguments may be dropped in the lecture notes.
4.2 Estimation
In the least-squares approach, we seek estimators βb0 and βb1 that minimize the
sum of squares of the deviations (yi − ybi )2 of the n observed yi ’s from their
predicted values.
CHAPTER 4. SIMPLE LINEAR REGRESSION 28
• Normal equations:
n
∑ ( yi − β 0 − β 1 xi ) = 0
i =1
n
∑ xi ( yi − β 0 − β 1 xi ) = 0
i =1
∑in=1 ( xi − x̄ )(yi −ȳ) ( x − x̄ )

• βb1 = ∑in=1 ( xi − x̄ )2
We will use expression βb1 = ∑in=1 ai yi , where ai = ∑n (i x − x̄)2 .
i =1 i
Note that ∑in=1 ai = 0 and ∑in=1 a2i = 1/SXX , where SXX = ∑in=1 ( xi − x̄ )2 .
βb0 = ȳ − βb1 x̄
p
βb1 can be also expressed as βb1 = SXy /SXX = r Syy /SXX = rSdyy /Sd XX .
• E( βb1 ) = β 1
E( βb0 ) = β 0
σ2
Var ( βb1 ) = n
∑i=1 ( xi − x̄ )2

1 x̄2
Var ( βb0 ) = σ2 n + n
∑i=1 ( xi − x̄ )2
cov(ȳ, βb1 ) = 0
σ2 = (n − 2)−1 ∑in=1 (yi − βb0 − βb1 xi )2 .

b
e and yb
• Partition of sum of squares: Using orthogonality between b
||y||2 = || X βb||2 + ||b

e||2 ,
that is
n n n
∑ y2i = ∑ yb2i + ∑ (yi − ybi )2,
i =1 i =1 i =1
or
n n n
∑ (yi − ȳ)2 = ∑ (ybi − ȳ)2 + ∑ (yi − ybi )2
i =1 i =1 i =1
SST SSR SSE
• Coefficient of Determination: R2 = r2 = SSR/SST = 1 − SSE/SST = S2Xy /(SXX Syy ).
The proportion of variation in y that is explained by the model.
r is sample correlation coefficient between y and x.

• SST = SSR + SSE, and SSE and SSR are independent.
• Suppose that e = (e1 , · · · , en ) T ∼ Nn (0, σ2 I ). Under H0 : β 1 = 0, SSE/σ2 ∼

χ2 (n − 2), SSR/σ2 ∼ χ2 (1).
4.3 Testing hyporthesis
• For the distributions of estimators, we assume e = (e1 , · · · , en ) T ∼ Nn (0, σ2 I ).
• H0 : β 1 = β 10
√
βb1 − β 10 /(σ/ SXX ) βb − β
T= √ = 1 √ 10
σ2 /σ2
b σ/ SXX
b
σ2 =
is distributed with t with n − 2 degrees of freedom under the null, where b
SSE/(n − 1) = MSE.
• H0 : β 0 = β 00
Recall βb0 = ȳ − βb1 x̄.
βb0 is distributed as normal with mean E βb0 = β 0 and var (ȳ − βb1 x̄ ) = σ2 (1/n +
x̄2 / ∑in=1 ( xi − x̄ )2 . Note that cov(ȳ, βb1 x̄ ) = 0.
βb0 − β 00
T= q
x̄2
σ2 ( n1 +
b SXX )
is distributed with t with n − 2 degrees of freedom under the null.
4.4 Estimating E(yi | xi )
• An estimator of the regression line, E(y| x ) = β 0 + β 1 x is yb = βb0 + βb1 x. We can
check its unbiasedness, E( βb0 + βb1 x ) = β 0 + β 1 x.

• Confidence interval can be placed using
var ( βb0 + βb1 xi ) = var {ȳ + βb1 ( xi − x̄ )}
= var (ȳ) + var ( βb1 )( xi − x̄ )2 + 2( xi − x̄ )cov(ȳ, βb1 )
= σ2 [1/n + ( xi − x̄ )2 /SXX ]
• Under the normality of the data, yb is also normally distributed.
4.5 Residuals
• ei = y i − β 0 − β 1 x i . b
ei = yi − βb0 − βb1 xi
ei is also normally distributed with mean

• Under the normality of the data, b
E (b
ei ) = 0 and variance
var (yi − βb0 − βb1 xi ) = var {yi − ȳ − βb1 ( xi − x̄ )}
= var (yi − ȳ) + ( xi − x̄ )2 var ( βb1 ) − 2( xi − x̄ )cov(yi − ȳ, βb1 )
= σ2 [1 − 1/n − ( xi − x̄ )2 /SXX ],
where cov(yi − ȳ, βb1 ) = ai σ2 is used. Recall that ai = ( xi − x̄ )/ ∑nj=1 ( x j − x̄ )2 .

Chapter 5
Multiple Regression
5.1 The model
• The model and assumptions: The multiple linear regression model can be ex-
pressed as
yi = β 0 + β 1 x1i + · · · + β p x pi + ei = Xi β + ei
for i = 1, · · · , n. The matrix form is
y = Xβ + e.
The assumptions for ei or yi are essentially the same as those for simple linear
regression:
1. E(ei ) = 0 or E(yi | Xi ) = Xi β.
2. cov(y) = σ2 In .
• Estimation of β: The estimator of β minimizes (y − Xβ) T (y − Xβ), the solution
of X T Xβ = X T Y. When X is of full column rank, βb = ( X T X )−1 ( X T Y ).
b E( βb) = β, Var ( βb) = σ2 ( X T X )−1 . Under the normality of data,

• Properties of β:
βb is normally distributed as well.
• Gauss Markov Theorem: If E(y) = Xβ and cov(y) = σ2 In , the least-squares

estimators are best linear unbiased estimators (BLUE).
CHAPTER 5. MULTIPLE REGRESSION 32
(Proof) Let β̃ = Ay representing linear estimators. To restrict among unbiased

estimators, we impose E β̃ = Ay = AXβ = β, which implies A should satisfy
AX = I. Then
β̃ = Ay = { A − ( X T X )−1 X T + ( X T X )−1 X T }y = { A − ( X T X )−1 X T }y + βb
var ( β̃) = ( A − ( X T X )−1 X T )( A T − X ( X T X )−1 )σ2
+ 2( A − ( X T X )−1 X T ) X ( X T X )−1 σ2 + var ( βb)
var ( βb)
βb has the smallest variance and equality holds when A = ( X T X )−1 X T .
The above theorem holds for c T βb in estimating c T β.
5.2 Estimation of σ2
• Let s2 = (n − p − 1)−1 (y − X βb) T (y − X βb). We can see that E(s2 ) = σ2 , and

( n − p −1) s2
under normality assumption, σ2
∼ χ2 ( n − p − 1).
• Note that var ( βb) = σ2 ( X T X )−1 is unknown. Its estimator can be obtained by
c ( βb) = s2 ( X T X )−1 .
var
• MLE: Assuming y ∼ N ( Xβ, σ2 I ), the maximum likelihood estimator for β is
βb = ( X T X )−1 X T y, and for σ2 , b

σ2 = n−1 (y − X βb) T (y − X βb)
βb ∼ N ( β, σ2 ( X T X )−1 )
nbσ2
σ2
∼ χ2 ( n − p − 1)
σ2 (or s2 ) are independent.

βb and b
5.3 Geometry of least squares
: Denote by En a Euclidean space in n dimensions. Consider a subspace in k
dimensions where n > k. A subspace that is of particular interest to us is the

one for which the columns of X provide the basis vectors. We may denote the k
columns of X as x1 , x2 , · · · , xk . Then the subspace associated with these k basis
vectors will be denoted by S( X ) or S( x1 , · · · , xk ). The basis vectors are said

to span this subspace, which will in general be a k-dimensional subspace. The
subspace S( x1152
, · · · MULTIPLE
, xk ) consists
REGRESSION:of every vector that can be formed as a linear
ESTIMATION
It is important to clarify first what the geometric approach to least squares is not. In
combination of
twothe xi , i =
dimensions, we 1, · · · k. the
illustrated Formally, it issquares
principle of least defined as
by creating a two-
dimensional scatter plot (Fig. 6.1) of the n points (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ). We
then visualized the least-squares regression line as the best-fitting straight line to
the data. This approach can be generalized to present the least-squares estimate in
multiple linear regression on thebasis of the best-fitting k hyperplane in (k þ 1)-
S( x1 , · · · , xk ) ≡ z ∈ E z =
dimensional space to the n points (x 11 , x12 , .
∑
n. . , x1k , y1 ), (x21 , x22 , . . . , x2k , y2 ), . . . ,
bi x i , bi ∈ R
(xn1 , xn2 , . . . , xnk , yn ). Although this approach is somewhat useful in visualizing
multiple linear regression, the geometric approach toi =least-squares 1 estimation in
multiple linear regression does not involve this high-dimensional generalization.
We would like to represent y using Xb and find
The geometric approach to be discussed below is appealing because of its math-
b such that euclidian distance
ematical elegance. For example, the estimator is derived without the use of matrix cal-
between y and culus.Xb Also,isthethe geometricshortest.
approach providesSuch deeper X βbinsight
should projection of y onto
be inference.
into statistical
Several advanced statistical methods including kernel smoothing (Eubank and
S( x1 , · · · , xk ).Eubank
Let b e1999),
= (Fourier
y−X analysis (Bloomfield
βb). i.e., for every 2000), and
1997) can be understood as generalizations of this geometric approach. The geo-
wavelet analysis
element in S( (Ogden
x1 , · · · , xk ) that is
metric approach to linear models was first proposed by Fisher (Mahalanobis 1964).
represented byChristensenXb, it (1996) should satisfy and Sengupta (2003) discuss the linear stat-
and Jammalamadaka
istical model almost completely from the geometric perspective.
< Xb,
7.4.1 Parameter >=Space,
eData
Space,b )T b
( Xband = b T XSpace
ePrediction T
e=
b 0.
The geometric approach to least squares begins with two high-dimensional spaces, a
(k þ 1)-dimensional space and an n-dimensional space. The unknown parameter
vector b can be viewed as a single point in (k þ 1)-dimensional space, with axes cor-
Following tworesponding
figuresto (from the textbook)
the k þ 1 regression coefficients billustrate the geometry of
0 , b1 , b0 , . . . , bk . Hence we call this
least squares.
space the parameter space (Fig. 7.3). Similarly, the data vector y can be viewed as a
Figure 7.3 Parameter space, data space, and prediction space with representative elements.
actually has the status of a subspace because it is closed under addition and scalar
multiplication (Harville 1997, pp. 28 – 29). This subset is said to be the subspace gen-
erated or spanned by the columns of X, and we will call this subspace the prediction
space. The columns of X constitute a basis set for the prediction space.
7.4.2 Geometric Interpretation of the Multiple Linear Regression Model

CHAPTER 5. MULTIPLE
The multipleREGRESSION
linear regression model (7.4) states that y is equal to a vector in the 34
prediction space, E(y) ¼ Xb, plus a vector of random errors, 1 (Fig. 7.4). The
Figure 7.4 Geometric relationships of vectors associated with the multiple linear regression
model.
5.4 Sum of squares
• Let H = X ( X T X )−1 X T . We can see that H Jn = Jn , from postmultiplying 1T to

both sides of X T X ( X T X )−1 X T 1 = X T 1. Therefore ( In − H )( H − n1 Jn ) = 0. Then
n
1
SST = ∑ (yi − ȳ) = y 2 T
In − Jn y
i =1
n

T 1
= y In − H + H − Jn y
n

T T 1
= y ( In − H )y + y H − Jn y
n
= SSE + SSR,
where SST is “total sum of squares,” SSE is “sum of squares of errors” and SSR
is “regression sum of squares.”
• Consider the distribution of SSE when y ∼ N ( Xβ, σ2 I ). Note that ( I − H ) X = 0,

X T ( I − H ) = 0, therefore SSE = y T ( In − H )y = (y − Xβ) T ( In − H )(y − Xβ).
Also In − H is idempotent and rank( I − H ) = tr ( I − H ) = n − p − 1. Then
1
2
SSE ∼ χ2 (n − p − 1).
σ
• Now consider the distribution of SST = y T ( In − n1 Jn )y. ( In − n1 Jn ) is idempotent

and rank( In − n1 Jn ) = tr ( In − n1 Jn ) = n − 1.
1 T
Let λ T = 2σ2
µ ( In − n1 Jn )µ = 1 T T
2σ2
β X ( In − n1 Jn ) Xβ. Then
1
SST ∼ χ2 (n − 1, λ T ).
σ2
• Now consider the distribution of SSR = y T ( H − n1 Jn )y. First note that ( In −

H )( H − n1 Jn ) = 0 and ( H − n1 Jn ) is idempotent. Let
1 T
λR = 2σ2
µ (H − n1 Jn )µ = 1 T T
2σ2
β X (H − n1 Jn ) Xβ. Then
1
2
SSR ∼ χ2 ( p, λ R ).
σ
• Coefficient of determination: R2 = SSR/SST

Properties of R2 are:
1. 0 ≤ R2 ≤ 1
2. R = corr
d (yi , ybi ), where corr
d is the sample correlation.
3. If β 1 = β 2 = · · · = β p = 0
R2 = SSR/SST = SSR/(SSE + SSR)

1
σ2
SSR
= 1
σ2
SSE + σ12 SSR
W1
∼ ,
W1 + W2
where W1 ∼ G ( p/2, 2), W2 ∼ G ((n − p − 1)/2, 2) and W1 ⊥ W2
∼ Beta( p/2, (n − p − 1)/2)

p/2 p
Therefore, E( R2 ) = p/2+(n− p−1)/2
= ( n −1)
.
4. R2 is invariant with respect to linear transformation x to Ax where A is of

full rank.
p
( R2 − n−1 )(n−1) ( n −1) R2 − p
• Adjusted R2a = n − p −1 = n − p −1 . Details will be discussed later.
5.5 Model misspecification
• Case of underfitting: Let the true model be

y = Xβ + e

β
= ( X1 , X2 ) 1 +e
β2
= X1 β 1 + X2 β 2 + e
Consider model misspecification where a fitted model includes X1 only, i.e.,
y = X1 β∗1 + e∗ .
Then, we have βb∗1 = ( X1T X1 )−1 X1T y
E( βb∗1 ) = β 1 + Aβ 2 where A = ( X1T X1 )−1 X1T X2 .
E( βb∗1 ) = E(( X1T X1 )−1 X1T y)
= ( X1T X1 )−1 X1T ( X1 β 1 + X2 β 2 )
= β 1 + ( X1T X1 )−1 X1T X2 β 2
var ( βb∗1 ) = σ2 ( X1T X1 )−1
βb∗1 is a biased estimator of β 1 .
• Theorem ! b = ( X T X )−1 X T y from the full model be partitioned as

7.9c. Let β
b = β1 , and let β b ∗ = ( X T X1 )−1 X T y be the estimator from the reduced
b
β 1 1 1
β2
b
model. Then
b ∗ ) = σ2 AB−1 A T , which is a positive definite matrix, where

b ) − cov( β
(i) cov( β 1 1
A = ( X1T X1 )−1 X1T X2 and B = X2T X2 − X2T X1 A. Thus var ( βbj ) > var ( βb∗j ), where
βbj ’s are entries in β
b and
1
∗
(ii) var ( x0T β
b ) ≥ var ( x T β
01 1 ), where x01 is a part of x0 that corresponds to β1 .
b
(Proof) (i) We can verify (i) by directly applying the result of the inverse of the
partitioned matrix. On the other hand
B = X2T X2 − X2T X1 ( X1T X1 )−1 X1T X2 = X2T ( In − X1 ( X1T X1 )−1 X1T ) X2 0

and so is AB−1 A T positive definite.
• Theorem 7.9d When y = X1 β∗1 + e is fitted, the estimator of σ2 ,
(y − X1 βb∗1 )T (y − X1 βb∗1 )
s2 = ,
n−q
has
β T2 X2T ( I − X1 ( X1T X1 )−1 X1T ) X2 β 2
E ( s2 ) = σ 2 + ,
n−q
where q is the dimension of β1 .
(Proof)
SSE1 = y T y − y T X1 ( X1T X1 )−1 X1T y
= y T ( In − X1 ( X1T X1 )−1 X1 )y
E(SSE1 ) = tr {( In − X1 ( X1T X1 )−1 X1T )σ2 I }
+ β T X T [ In − X1 ( X1T X1 )−1 X1T ] Xβ
= (n − q)σ2 + β T2 X2T [ In − X1 ( X1T X1 )−1 X1T ] X2 β 2
5.6 Orthogonalization
• Theorem 7.10. If X1T X2 = O, then the estimator of β 1 in the full model y =

X1 β 1 + X2 β 2 + e is the same as the estimator of β∗1 in the reduced model y =
X1 β∗1 + e∗ .
• Estimation of β 2 using orthogonalization: Let y = X1 β 1 + X2 β 2 + e be the full
model and y = X1 β∗1 + e∗ be the reduced model.
Consider the following three steps:
1. Regress y on X1 and calculate residuals y − yb( X1 ), where yb( X1 ) = X1 βb∗1 =
X1 ( X1T X1 )−1 X1T y.
2. Regress the columns of X2 on X1 and obtain residuals X2.1 = X2 − X

b 2 ( X1 ) .
If X2 is written in terms of its columns as X2 = ( x21 , · · · , x2j , · · · , x2( p−q+1) ),

then the regression coefficient vector for x2j on X1 is b j = ( X1T X1 )−1 X1T x2j ,
and xb2j = X1 b j = X1 ( X1T X1 )−1 X1T x2j . For all columns of X2 , this becomes
b 2 ( X1 ) = X1 ( X T X1 )−1 X T X2 = X1 A, where A = ( X T X1 )−1 X T X2 . Note
X 1 1 1 1
that X2.1 = X2 − X
b 2 ( X1 ) is orthogonal to X1 :
X1T X2.1 = 0.
Using the alias matrix A, the residual matrix can be expressed as
X2.1 = X2 − X
b 2 ( X1 )
= X2 − X1 ( X1T X1 )−1 X1T X2 = X2 − X1 A
3. Regress y − yb( X1 ) on X2.1 = X2 − X

b 2 ( X1 ) .
Since X2.1 is orthogonal to X1 , we obtain the same βb2 as in the full model
yb = X1 βb1 + X2 βb2 .
• Remark 1: The above three steps imply that the estimation of β 2 can be viewed
as the partial correlation between y and X2 after removing the effect of X1 . In
other words, the estimation of β 2 takes into account the amount of variation in
y due to X2 after the effect of X1 has been accounted for, and the relationship
between X1 and X2 should be corrected.
• Remark 2: One can directly show that βb2 and β̃ 2 are equivalent, where β̃ 2 is the
estimator obtained in step 3.
T
β̃ 2 = ( X2.1 X2.1 )−1 X2.1
T T
(y − yb( X1 )) = ( X2.1 X2.1 )−1 X2.1
T
y
where X2.1 = X2 − X1 ( X1T X1 )−1 X1T X2 = ( I − H1 ) X2 and H1 = X1 ( X1T X1 )−1 X1T .
β̃ 2 = ( X2T ( I − H1 ) X2 )−1 X2T ( I − H1 )y.
On the other hand, βb is the solution of

X2T X2 X2T X1 X2T y

β2
= .
X1T X2 X1T X1 β1 X1T y
Then
I − X2T X1 ( X1T X1 )−1 X2T X2 X2T X1 I − X2T X1 ( X1T X1 )−1 X2T y

β2
= ,
0 I X1T X2 X1T X1 β1 0 I X1T y
X2T X2 − X2T H1 X2 X2T X1 − X2T H1 X1 X2T y − X2T H1 y

β2
= .
Using X2T X1 − X2T H1 X1 = 0, we have
X2T ( I − H1 ) X2 X2T ( I − H1 )y

0 β2
= .
Thus,
βb2 = [ X2T ( I − H1 ) X2 ]−1 X2T ( I − H1 )y = β̃ 2
5.7 Centered covariates
• Reparameterize the model such that yi = α + β 1 ( xi1 − x̄1 ) + · · · + β p ( xip − x̄ p ) +
ei , where α = β 0 + β 1 x̄1 + · · · + β p x̄ p . Let β c = ( β 1 , · · · β p ) T and Xc = (( xij −

α
x̄ j )). We can rewrite y = (1, Xc ) + e.
βc
Since column 1 is perpendicular to columns of centered X, Xc , the normal equa-
tion is

T α
= (1, Xc )T y
b
(1, Xc ) (1, Xc ) b
βc

n 0 α
b nȳ
T T =
0 Xc Xc βc
b XcT y
α = ȳ
Then b
βbc = ( XcT Xc )−1 XcT y = S− 1

XX S Xy
α − βb1 x̄1 · · · − βbp x̄ p = ȳ − βbTc x̄

βb0 = b
5.8 Adjusted R2
• It is well known that the coefficient of determination, R2 , increases as the num-
ber of variables increases. To address such issue, an adjusted R2 is developled.
SSE p /(n− p)

−1
• R2adj,p = 1 − 1 − R2p nn− p = 1− SST/(n−1)
= 1 − (n − 1) MSE p /SST, where
MSE p = SSE p /(n − p) is the mean squared error when p covariates are consid-
ered. Note that the model with largest R2adj,p is equivalent to the model with the
smallest MSE p .
• Relation to F statistics: When the model with p coefficients is a submodel of
the model with k coefficients, R2adj,p is related to the F statistic.

(SSE p −SSEk )/(k− p)
Recall that F = SSEk /(n−k)
.
Then, one can verify
n − 1 (n − k) + F (k − p)
R2adj,p = 1 − (1 − R2k )
n−k n−p
by using SSE p = SST (1 − R2p ).
One can also show that F ≥ 1 and R2adj,p ≤ R2adj,k are equivalent. This implies
that model selection using the adjusted R2 tends to overfit.
5.9 Numerical examples
• 7.53: gas vapor example

When gasoline is pumped into the tank of a car, vapors are vented into the
atmosphere. An experiment was conducted to determine whether y, the amount

of vapor, can be predicted using the following four variables based on initial
conditions of the tank and the dispensed gasoline:
x1 = tank temperature (◦ F)
x2 = gasoline temperature (◦ F)
x3 = vapor pressure in tank ( psi)
x4 = vapor pressure of gasoline ( psi)
Estimation of β
attach(gas)
fit=lm(y˜ . ,data=gas ) ## or fit=lm(y˜x1+x2+x3+x4,data=gas)

X=model.matrix(fit)
t(solve(t(X)%*%X)%*%t(X)%*%y)
(Intercept) x1 x2 x3 x4
[1,] 1.015 -0.02861 0.2158 -4.32 8.975
t(chol2inv(chol(t(X)%*%X))%*%t(X)%*%y)
[,1] [,2] [,3] [,4] [,5]

[1,] 1.015 -0.02861 0.2158 -4.32 8.975
coefficients(fit)
(Intercept) x1 x2 x3 x4
1.01502 -0.02861 0.21582 -4.32005 8.97489
sume of squares
SST= sum((y-mean(y))ˆ2)
SSR= sum((fit$fitted-mean(y))ˆ2)
SSE= sum((y-fit$fitted)ˆ2) ### or sum((fit$res)ˆ2)
SSR
[1] 2520
SSE
[1] 201.2
SST-SSR
[1] 201.2
H=X%*%solve(t(X)%*%X)%*%t(X)
n=length(y)
J=rep(1,n)%*%t(rep(1,n))
SSE2= t(y)%*%(diag(n)-H)%*%y
SSE2
[,1]
[1,] 201.2
SSR2= t(y)%*%(H - J/n)%*%y

SSR2
[,1]
[1,] 2520
R2
summary(fit)$r.squared
[1] 0.9261
SSR/SST
[1] 0.9261
R2=SSR/SST
summary(fit)$adj.r.squared
[1] 0.9151
((n-1)*R2 - 4)/(n-5) # p=4
[1] 0.9151
Some other commands
summary(fit)
anova(fit)
attributes(summary)
attributes(anova)
par(mfrow=c(2,2))
plot(fit,which=1)
plot(fit,which=2)
plot(fit,which=3)
plot(fit,which=4)

Appliedstat 2017 Chapter 4 5

Uploaded by

Copyright:

Available Formats

Appliedstat 2017 Chapter 4 5

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Appliedstat 2017 Chapter 4 5

Uploaded by

Copyright:

Available Formats

Chapter 4

Simple Linear Regression

4.1 The model

The simple linear regression model for n observations can be written as

To complete the model in (4.1), we make the following additional assumptions:

1. E(ei | xi ) = 0 for all i = 1, 2, ..., n, or, equivalently, E(yi | xi ) = β 0 + β 1 xi .

2. var (ei | xi ) = σ2 for all i = 1, 2, ..., n, or, equivalently, var (yi | xi ) = σ2 .

3. cov(ei , e j | xi , x j ) = 0 for all i 6= j or, equivalently, cov(yi , y j | xi , x j ) = 0.

Conditioning arguments may be dropped in the lecture notes.

∑in=1 ( xi − x̄ )(yi −ȳ) ( x − x̄ )

σ2 = (n − 2)−1 ∑in=1 (yi − βb0 − βb1 xi )2 .

||y||2 = || X βb||2 + ||b

• Coefficient of Determination: R2 = r2 = SSR/SST = 1 − SSE/SST = S2Xy /(SXX Syy ).

The proportion of variation in y that is explained by the model.

r is sample correlation coefficient between y and x.

• SST = SSR + SSE, and SSE and SSR are independent.

• Suppose that e = (e1 , · · · , en ) T ∼ Nn (0, σ2 I ). Under H0 : β 1 = 0, SSE/σ2 ∼

4.3 Testing hyporthesis

• For the distributions of estimators, we assume e = (e1 , · · · , en ) T ∼ Nn (0, σ2 I ).

Recall βb0 = ȳ − βb1 x̄.

is distributed with t with n − 2 degrees of freedom under the null.

4.4 Estimating E(yi | xi )

• An estimator of the regression line, E(y| x ) = β 0 + β 1 x is yb = βb0 + βb1 x. We can

check its unbiasedness, E( βb0 + βb1 x ) = β 0 + β 1 x.

• Confidence interval can be placed using

var ( βb0 + βb1 xi ) = var {ȳ + βb1 ( xi − x̄ )}

= var (ȳ) + var ( βb1 )( xi − x̄ )2 + 2( xi − x̄ )cov(ȳ, βb1 )

• Under the normality of the data, yb is also normally distributed.

ei is also normally distributed with mean

var (yi − βb0 − βb1 xi ) = var {yi − ȳ − βb1 ( xi − x̄ )}

= var (yi − ȳ) + ( xi − x̄ )2 var ( βb1 ) − 2( xi − x̄ )cov(yi − ȳ, βb1 )

where cov(yi − ȳ, βb1 ) = ai σ2 is used. Recall that ai = ( xi − x̄ )/ ∑nj=1 ( x j − x̄ )2 .

5.1 The model

for i = 1, · · · , n. The matrix form is

• Estimation of β: The estimator of β minimizes (y − Xβ) T (y − Xβ), the solution

of X T Xβ = X T Y. When X is of full column rank, βb = ( X T X )−1 ( X T Y ).

b E( βb) = β, Var ( βb) = σ2 ( X T X )−1 . Under the normality of data,

• Gauss Markov Theorem: If E(y) = Xβ and cov(y) = σ2 In , the least-squares

(Proof) Let β̃ = Ay representing linear estimators. To restrict among unbiased

β̃ = Ay = { A − ( X T X )−1 X T + ( X T X )−1 X T }y = { A − ( X T X )−1 X T }y + βb

var ( β̃) = ( A − ( X T X )−1 X T )( A T − X ( X T X )−1 )σ2

+ 2( A − ( X T X )−1 X T ) X ( X T X )−1 σ2 + var ( βb)

The above theorem holds for c T βb in estimating c T β.

• Let s2 = (n − p − 1)−1 (y − X βb) T (y − X βb). We can see that E(s2 ) = σ2 , and

• MLE: Assuming y ∼ N ( Xβ, σ2 I ), the maximum likelihood estimator for β is

βb = ( X T X )−1 X T y, and for σ2 , b

σ2 (or s2 ) are independent.

5.3 Geometry of least squares

: Denote by En a Euclidean space in n dimensions. Consider a subspace in k

dimensions where n > k. A subspace that is of particular interest to us is the

vectors will be denoted by S( X ) or S( x1 , · · · , xk ). The basis vectors are said

7.4.2 Geometric Interpretation of the Multiple Linear Regression Model

5.4 Sum of squares

• Let H = X ( X T X )−1 X T . We can see that H Jn = Jn , from postmultiplying 1T to

is “regression sum of squares.”

• Consider the distribution of SSE when y ∼ N ( Xβ, σ2 I ). Note that ( I − H ) X = 0,

• Now consider the distribution of SST = y T ( In − n1 Jn )y. ( In − n1 Jn ) is idempotent

• Now consider the distribution of SSR = y T ( H − n1 Jn )y. First note that ( In −

• Coefficient of determination: R2 = SSR/SST

R2 = SSR/SST = SSR/(SSE + SSR)

B = X2T X2 − X2T X1 ( X1T X1 )−1 X1T X2 = X2T ( In − X1 ( X1T X1 )−1 X1T ) X2 0

SSR2= t(y)%%(H - J/n)%%y