Gr5205 Midterm Key


LINEAR REGRESSION MODELS (Columbia University in the City of New York)


STATGR5205 Midterm - Fall 2017 - October 18

Name: Bruce Banner

UNI:

The GU5205 midterm is closed notes and closed book. Calculators are allowed. Tablets,
phones, computers and other equivalent forms of technology are strictly prohibited. Students
are not allowed to communicate with anyone with the exception of the TA and the professor.
If students violate these guidelines, they will receive a zero on this exam and potentially
face more severe consequences. Students must include all relevant work in the handwritten
problems to receive full credit.

Problem 1 [65 pts]


Consider the simple linear regression model
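\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \dots, n, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \tag{1} \]

and least squares estimators

\[ \hat\beta_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}. \]
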
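For this problem, you can use the following results:

\[ E[\hat\beta_0] = \beta_0, \quad E[\hat\beta_1] = \beta_1, \quad Var[\hat\beta_0] = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right), \quad Var[\hat\beta_1] = \frac{\sigma^2}{S_{xx}}. \tag{2} \]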

For this exercise, use the scalar form of the simple linear regression model, i.e.,
don’t use matrices.

Part A (5 pts)
Under model (1), prove that $\hat\beta_0 - \hat\beta_1$ is an unbiased estimator of $\beta_0 - \beta_1$. Note that you can
directly use the relations from (2).

\[ E[\hat\beta_0 - \hat\beta_1] = E[\hat\beta_0] - E[\hat\beta_1] = \beta_0 - \beta_1 \]


Part B (20 pts)


Under model (1), derive an expression for $Cov(\hat\beta_0, \hat\beta_1)$, where $\hat\beta_0$ and $\hat\beta_1$ are the least squares
estimators. Simplify the result as much as possible. Note that you can directly use the
relations from (2).

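Note that

\[ \hat\beta_1 = \sum_{i=1}^{n} K_i Y_i, \quad K_i = \frac{x_i - \bar{x}}{S_{xx}}. \]

Note that, under model (1),

\[ Cov(\bar{Y}, \hat\beta_1) = Cov\left(\frac{1}{n}\sum_{i=1}^{n} Y_i, \ \sum_{j=1}^{n} K_j Y_j\right) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} K_j \, Cov(Y_i, Y_j) \]

\[ = \frac{1}{n}\sum_{i=1}^{n} K_i \, Var(Y_i) + \frac{1}{n}\sum_{i \neq j} K_j \, Cov(Y_i, Y_j) = \frac{\sigma^2}{n}\sum_{i=1}^{n} K_i = 0. \]

Then

\[ Cov(\hat\beta_0, \hat\beta_1) = Cov(\bar{Y} - \hat\beta_1 \bar{x}, \ \hat\beta_1) = Cov(\bar{Y}, \hat\beta_1) - \bar{x} \, Cov(\hat\beta_1, \hat\beta_1) \]

\[ = 0 - \bar{x} \, Var(\hat\beta_1) = -\frac{\bar{x}\sigma^2}{S_{xx}}. \]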
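As a rough numerical check of the covariance result $Cov(\hat\beta_0, \hat\beta_1) = -\bar{x}\sigma^2/S_{xx}$, the sketch below compares the formula (with $\sigma^2$ replaced by the MSE) against the off-diagonal entry of `vcov()` from `lm`. The simulated data, the seed, and all variable names are assumptions made for illustration; none of them come from the exam.

```r
# Hypothetical simulated data (not the exam's dataset)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
Y <- 1 + 0.7 * x + rnorm(n, sd = 1.5)

fit <- lm(Y ~ x)
mse <- sum(residuals(fit)^2) / (n - 2)    # estimate of sigma^2
Sxx <- sum((x - mean(x))^2)

# Part B formula with sigma^2 replaced by the MSE
cov.formula <- -mean(x) * mse / Sxx

# Estimated covariance of the two coefficients as reported by lm
cov.lm <- vcov(fit)["(Intercept)", "x"]

print(all.equal(cov.formula, cov.lm))
```

The two quantities should agree up to floating-point error, since `vcov()` returns the MSE times the inverse of X'X.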


Part C (10 pts)


Under model (1), derive an expression for $Var[\hat\beta_0 - \hat\beta_1]$, where $\hat\beta_0$ and $\hat\beta_1$ are the least
squares estimators. Simplify the result as much as possible. Note that you can directly use
the relations from (2).
Note: if you cannot complete Part B, then express the solution to Part C in
terms of $Cov(\hat\beta_0, \hat\beta_1)$.

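\[ Var[\hat\beta_0 - \hat\beta_1] = Var[\hat\beta_0] + Var[\hat\beta_1] - 2\,Cov(\hat\beta_0, \hat\beta_1) \]

\[ = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) + \frac{\sigma^2}{S_{xx}} + \frac{2\bar{x}\sigma^2}{S_{xx}} \]

\[ = \sigma^2\left(\frac{1}{n} + \frac{1}{S_{xx}}\left[\bar{x}^2 + 2\bar{x} + 1\right]\right) \]

\[ = \sigma^2\left(\frac{1}{n} + \frac{(\bar{x} + 1)^2}{S_{xx}}\right) \]

Give partial credit if students wrote

\[ Var[\hat\beta_0 - \hat\beta_1] = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2 + 1}{S_{xx}}\right) - 2\,Cov(\hat\beta_0, \hat\beta_1). \]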
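The variance result $\sigma^2\left(1/n + (\bar{x}+1)^2/S_{xx}\right)$ can be sanity-checked numerically in the same spirit. The sketch below, which again uses hypothetical simulated data and illustrative variable names, compares the formula (with $\sigma^2$ replaced by the MSE) against the quadratic form a' V a, where a = (1, -1) and V is the estimated covariance matrix from `vcov()`.

```r
# Hypothetical simulated data (not the exam's dataset)
set.seed(2)
n <- 50
x <- runif(n, 0, 10)
Y <- 1 + 0.7 * x + rnorm(n, sd = 1.5)

fit <- lm(Y ~ x)
mse <- sum(residuals(fit)^2) / (n - 2)    # estimate of sigma^2
Sxx <- sum((x - mean(x))^2)

# Part C formula with sigma^2 replaced by the MSE
var.formula <- mse * (1 / n + (mean(x) + 1)^2 / Sxx)

# Same quantity from the contrast a = (1, -1) applied to vcov(fit)
a <- c(1, -1)
var.lm <- drop(t(a) %*% vcov(fit) %*% a)

print(all.equal(var.formula, var.lm))
```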


Part D (15 pts)


Although not a very useful or common approach, we now consider a testing procedure to see
if the intercept statistically differs from the slope, i.e., consider testing the null/alternative
pair
$H_0: \beta_0 = \beta_1$ versus $H_A: \beta_0 \neq \beta_1$.
Write down both the T-statistic and F-statistic for testing the above null/alternative pair.
When constructing the F-statistic, also identify the full and reduced models. Write your
solution on pages 4 and 5.
Note: when specifying the full and reduced models, you do not have to derive
the maximum likelihood estimators but make sure to identify them.

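T-stat:

If $H_0$ is true, $\beta_0 - \beta_1 = 0$, so

\[ T = \frac{\hat\beta_0 - \hat\beta_1 - E[\hat\beta_0 - \hat\beta_1]}{\sqrt{\widehat{Var}(\hat\beta_0 - \hat\beta_1)}} = \frac{\hat\beta_0 - \hat\beta_1}{\sqrt{MSE\left(\frac{1}{n} + \frac{(\bar{x} + 1)^2}{S_{xx}}\right)}} \]

\[ T \sim t(df = n - 2) \]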


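F-test

Full model:

\[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2) \]

ML-Estimators:

\[ \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{x}, \quad \hat\beta_1 = \frac{S_{xy}}{S_{xx}} \]

\[ SSE_F = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 = SSE, \quad df_F = n - 2 \]

Reduced model: under $H_0: \beta_0 = \beta_1 = \beta$,

\[ Y_i = \beta + \beta x_i + \epsilon_i = \beta(1 + x_i) + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2) \]

ML-Estimator:

\[ \hat\beta = \frac{\sum_{i=1}^{n} (x_i + 1) Y_i}{\sum_{i=1}^{n} (x_i + 1)^2} \]

\[ SSE_R = \sum_{i=1}^{n} \left(Y_i - \hat\beta(1 + x_i)\right)^2, \quad df_R = n - 1 \]

F-statistic:

\[ F = \frac{(SSE_R - SSE_F)/1}{MSE}, \quad MSE = \frac{SSE_F}{n - 2} \]

\[ F \sim F(df_1 = 1, \ df_2 = n - 2) \]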
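The T-test and the F-test for $H_0: \beta_0 = \beta_1$ are equivalent, with $T^2 = F$ for a single linear constraint. The sketch below illustrates this on hypothetical simulated data. It fits the reduced model $Y_i = \beta(1 + x_i) + \epsilon_i$ as a no-intercept regression on `x + 1`; the data, seed, and variable names are illustrative assumptions.

```r
# Hypothetical simulated data (not the exam's dataset)
set.seed(3)
n <- 50
x <- runif(n, 0, 10)
Y <- 1 + 0.7 * x + rnorm(n, sd = 1.5)

full    <- lm(Y ~ x)                # Y = beta0 + beta1 * x + error
reduced <- lm(Y ~ I(x + 1) - 1)     # Y = beta * (1 + x) + error

sse.f <- sum(residuals(full)^2)
sse.r <- sum(residuals(reduced)^2)
mse   <- sse.f / (n - 2)
Sxx   <- sum((x - mean(x))^2)

# T-statistic for beta0 - beta1, using the Part C variance with MSE for sigma^2
b <- coef(full)
t.calc <- unname(b[1] - b[2]) / sqrt(mse * (1 / n + (mean(x) + 1)^2 / Sxx))

# General linear F-statistic: (SSE_R - SSE_F) / (df_R - df_F), divided by MSE
f.calc <- ((sse.r - sse.f) / 1) / mse

# Two-sided t p-value and upper-tail F p-value
p.t <- 2 * (1 - pt(abs(t.calc), n - 2))
p.f <- 1 - pf(f.calc, 1, n - 2)

print(c(t.squared = t.calc^2, f = f.calc, p.t = p.t, p.f = p.f))
```

The squared T-statistic and the F-statistic should match, and so should the two p-values.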


Part E (15 pts)


Consider the following toy dataset displayed in the scatter plot below. Let the predictor
variable be assigned as x, the response as Y and assign n as the sample size. Note that there
are n = 100 cases in this dataset.

Using the R code and output displayed on pages 6, 7 and 8, test whether the intercept
statistically differs from the slope, i.e., test the null/alternative pair:

$H_0: \beta_0 = \beta_1$ versus $H_A: \beta_0 \neq \beta_1$.

To receive full credit, compute both the T-statistic and F-statistic for testing the above
null/alternative pair. Also compute the correct p-value and state the statistical conclusion.
Write the solution on the top of page 7.
Note:
1-pt(t.calc,98)=0.1185218 1-pf(f.calc,1,98)=0.2370437
1-pt(t.calc,99)=0.1185074 1-pf(f.calc,1,99)=0.2370147


T-stat:

T = 1.18899

P-value = 2*(1 - pt(1.18899, 98)) = 2 * 0.11852 = 0.2370436

F-stat:

\[ F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F} = \frac{(242.76 - 239.31)/(99 - 98)}{2.44} = 1.4139 \]

P-value = 1 - pf(1.4139, 1, 98) = 0.2370437

Fail to reject $H_0$ because 0.237 > $\alpha$. Conclude that the intercept does not
statistically differ from the slope.

Also note: 1.18899^2 = 1.4139

R code and Output:

> # Means and S.xx
> mean(x)
[1] 5.0423      [annotated: $\bar{x}$]
> mean(Y)
[1] 4.9012

> sum((x-mean(x))^2)
[1] 834.4460    [annotated: $S_{xx}$]


> # Model 1 with Summary and ANOVA #-----------------------------------------------

> summary(lm(Y~x))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1708     0.3144   3.725 0.000327 ***   [annotated: $\hat\beta_0$]
x             0.7398     0.0541  13.676  < 2e-16 ***   [annotated: $\hat\beta_1$]

Residual standard error: 1.563 on 98 degrees of freedom
Multiple R-squared: 0.6562, Adjusted R-squared: 0.6527
F-statistic: 187 on 1 and 98 DF, p-value: < 2.2e-16

> anova(lm(Y~x))

          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 456.71  456.71  187.03 < 2.2e-16 ***
Residuals 98 239.31    2.44                         [annotated: SSE_F = SSE = 239.31, MSE = 2.44]

> # Model 2 with ANOVA #-----------------------------------------------

> anova(lm(Y~x-1))

          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 2825.00 2825.00  1023.8 < 2.2e-16 ***
Residuals 99  273.18    2.76

> # Model 3 with ANOVA #-----------------------------------------------

> x.new <- x+1
> anova(lm(Y~x.new))

          Df Sum Sq Mean Sq F value    Pr(>F)
x.new      1 456.71  456.71  187.03 < 2.2e-16 ***
Residuals 98 239.31    2.44

> # Model 4 with ANOVA #-----------------------------------------------

> x.new <- x+1
> anova(lm(Y~x.new-1))

          Df  Sum Sq Mean Sq F value    Pr(>F)
x.new      1 2855.42 2855.42  1164.5 < 2.2e-16 ***
Residuals 99  242.76    2.45                         [annotated: SSE_R = 242.76]
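The Part E statistics can be approximately reproduced from the rounded values printed in the R output above. The sketch below is only an illustration: the variable names are assumptions, the reduced model is taken to be Model 4 (`lm(Y~x.new-1)`), and because the inputs are rounded, the results land close to, but not exactly on, the key's 1.18899 and 1.4139.

```r
# Rounded values copied from the printed R output
n    <- 100
xbar <- 5.0423
Sxx  <- 834.4460
b0   <- 1.1708      # intercept estimate (Model 1)
b1   <- 0.7398      # slope estimate (Model 1)
sse.f <- 239.31; df.f <- 98    # full model (Model 1)
sse.r <- 242.76; df.r <- 99    # reduced model (Model 4, assumed)

mse <- sse.f / df.f

# T-statistic for H0: beta0 = beta1
t.calc <- (b0 - b1) / sqrt(mse * (1 / n + (xbar + 1)^2 / Sxx))

# General linear F-statistic
f.calc <- ((sse.r - sse.f) / (df.r - df.f)) / mse

# Two-sided t p-value and upper-tail F p-value
p.t <- 2 * (1 - pt(t.calc, df.f))
p.f <- 1 - pf(f.calc, 1, df.f)

print(c(t = t.calc, f = f.calc, p.t = p.t, p.f = p.f))
```

Both p-values should come out well above 0.05, in line with the key's decision to fail to reject $H_0$.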


Problem 2 [35 pts]


Consider the following study examining the effects of different amounts of THC, the major
ingredient in marijuana, injected directly in the brain. The response variable (Y) is
locomotor activity. In this approach, the researchers run an ANCOVA model (or multiple linear
regression model) on the post-injection scores, partialling out pre-injection differences. Such
a procedure would adjust for the fact that much of the variability in post-injection activity
could be accounted for by the variability in pre-injection activity. Note that variables D1
through D4 are indicator variables representing the different dosage levels and xi is the
continuous variable pre-injection activity.

   control       0.1 micro g (D1)  0.5 micro g (D2)  1 micro g (D3)    2 micro g (D4)

   X      Y        X      Y          X      Y          X      Y          X      Y
   Pre    Post     Pre    Post       Pre    Post       Pre    Post       Pre    Post

   4.34   1.30     1.55   0.93       7.18   5.10       6.94   2.29       4.00   1.44
   3.50   0.94    10.56   4.44       8.33   4.16       6.10   4.75       4.10   1.11
   ...    ...      ...    ...        ...    ...        ...    ...        ...    ...
   ...    ...      ...    ...        ...    ...        7.35   2.35       ...    ...
   ...    ...      ...    ...        6.30   4.84                         ...    ...
   1.90   0.93     9.58   4.22                                           5.54   2.93

   n1 = 10         n2 = 10           n3 = 9            n4 = 8            n5 = 10

The statistical model used in our setting is:


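\[ Y_i = \beta_0 + \beta_1 D_{i1} + \beta_2 D_{i2} + \beta_3 D_{i3} + \beta_4 D_{i4} + \beta_5 x_i + \epsilon_i, \]

\[ i = 1, \dots, 47, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \]

where

\[ D_{i1} = \begin{cases} 1 & \text{if 0.1 micro grams} \\ 0 & \text{otherwise} \end{cases}, \quad D_{i2} = \begin{cases} 1 & \text{if 0.5 micro grams} \\ 0 & \text{otherwise} \end{cases}, \]

\[ D_{i3} = \begin{cases} 1 & \text{if 1 micro gram} \\ 0 & \text{otherwise} \end{cases}, \quad D_{i4} = \begin{cases} 1 & \text{if 2 micro grams} \\ 0 & \text{otherwise} \end{cases}, \]

and $x_i$ is the respondent’s pre-injection locomotor activity.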


Part A (25 pts)


Run a single hypothesis testing procedure to see if THC dosage levels statistically influence
post-locomotor activity, after controlling for pre-locomotor activity. To receive full credit,
state the correct null/alternative pair, compute the test statistic and identify the correct
P-value. To complete this exercise, use the R code & output displayed on Page 11. Assume
a significance level of $\alpha = 0.05$.

$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$

$H_A$: At least one $\beta_j \neq 0$ ($j = 1, 2, 3, 4$).

Full model:

\[ Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \beta_5 X + \epsilon, \quad df_F = n - 6 = 41 \]

Reduced model:

\[ Y = \beta_0 + \beta_5 X + \epsilon, \quad df_R = n - 2 = 45 \]

\[ F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F} = \frac{(29.34931 - 20.12544)/(45 - 41)}{20.12544/41} = 4.698 \]

P-value = 1 - pf(4.698, 4, 41) = 0.00326 < 0.05

Reject $H_0$ at 5% significance. Conclude that THC dosage levels statistically
influence post-locomotor activity, after controlling for pre-locomotor activity.


R code and Output:


# Full model SSE and Summary #-----------------------------------------------

> full.model <- lm(Y~D1+D2+D3+D4+X,data=THC.Data)


> sum(residuals(full.model)^2)
[1] 20.12544    [annotated: SSE_F = SSE]
> summary(full.model)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.37384    0.27689  -1.350   0.1844
D1           0.61834    0.32855   1.882   0.0669 .
D2           1.45653    0.35656   4.085   0.0002 ***
D3           0.66599    0.34231   1.946   0.0586 .
D4           0.21998    0.31456   0.699   0.4883
X            0.43466    0.04918   8.838 4.84e-11 ***
---
Residual standard error: 0.7006 on 41 degrees of freedom
Multiple R-squared: 0.8042, Adjusted R-squared: 0.7803
F-statistic: 33.67 on 5 and 41 DF, p-value: 1.704e-13

> # Reduced models SSE values #-----------------------------------------------

> reduced.1 <- lm(Y~X,data=THC.Data)


> sum(residuals(reduced.1)^2)
[1] 29.34931    [annotated: SSE_R]
> reduced.2 <- lm(Y~D1+D2+D3+D4,data=THC.Data)
> sum(residuals(reduced.2)^2)

[1] 58.4661

> # P-values #-----------------------------------------------

> 1-pf(f.calc,1,41)
[1] 0.03606229

> 1-pf(f.calc,3,41)
[1] 0.006554796

> 1-pf(f.calc,4,41)
[1] 0.003256544    [annotated: p-value]
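The partial F-test for Problem 2, Part A can be reproduced from the two SSE values printed above. This is a minimal sketch with illustrative variable names; it assumes `reduced.1` (the model with only `X`) is the appropriate reduced model, which is the one that drops all four dosage indicators.

```r
# SSE values copied from the printed R output
sse.f <- 20.12544    # full model: Y ~ D1 + D2 + D3 + D4 + X
sse.r <- 29.34931    # reduced model: Y ~ X
n     <- 47

df.f <- n - 6        # six coefficients in the full model
df.r <- n - 2        # two coefficients in the reduced model

# General linear F-statistic for H0: beta1 = beta2 = beta3 = beta4 = 0
f.calc <- ((sse.r - sse.f) / (df.r - df.f)) / (sse.f / df.f)

# Upper-tail p-value with (4, 41) degrees of freedom
p.value <- 1 - pf(f.calc, df.r - df.f, df.f)

print(c(f = f.calc, p.value = p.value))
```

The numerator degrees of freedom equal the number of dosage indicators removed (4), which is why the p-value with `1-pf(f.calc,4,41)` is the relevant one among the three printed.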


Part B (10 pts)


Run a Bonferroni procedure to test which THC dosage levels statistically influence post-
locomotor activity, after controlling for pre-locomotor activity, i.e., simultaneously test the
null hypotheses $H_0: \beta_1 = 0$, $H_0: \beta_2 = 0$, $H_0: \beta_3 = 0$, $H_0: \beta_4 = 0$. To receive full
credit, circle the correct R output and briefly identify which THC dosage levels statistically
influence post-locomotor activity. Assume a 5% family-wise error rate (95% family-wise confidence).

Number of simultaneous tests: 4
R code and Output:

> confint(model.1,level=1-.05/2)
                1.25 %   98.75 %
(Intercept) -1.0180992 0.2704152
D1          -0.1461084 1.3827809
D2           0.6269190 2.2861442
D3          -0.1304642 1.4624432
D4          -0.5119099 0.9518764
X            0.3202275 0.5490896

> confint(model.1,level=1-.05/4)        [correct output: level = 1 - .05/4]
               0.625 %  99.375 %
(Intercept) -1.0972852 0.3496011
D1          -0.2400666 1.4767392
D2           0.5249510 2.3881122
D3          -0.2283567 1.5603357
D4          -0.6018672 1.0418337
X            0.3061627 0.5631544

> confint(model.1,level=1-.05/8)
               0.312 %  99.688 %
(Intercept) -1.1720667 0.4243826
D1          -0.3287988 1.5654713
D2           0.4286546 2.4844086
D3          -0.3208042 1.6527832
D4          -0.6868210 1.1267874
X            0.2928803 0.5764369

> confint(model.1,level=1-.05/10)
                0.25 %   99.75 %
(Intercept) -1.1953778 0.4476938
D1          -0.3564587 1.5931312
D2           0.3986367 2.5144265
D3          -0.3496223 1.6816013
D4          -0.7133031 1.1532696
X            0.2887398 0.5805773

Conclusion: D2 (0.5 micro g) is the only significant THC dosage level. At level 1 - .05/4,
its interval (0.5249510, 2.3881122) is the only one among D1 through D4 that excludes 0.
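The Bonferroni decision can be sketched in a few lines: with 4 simultaneous tests at a 5% family-wise error rate, each interval is computed at level 1 - .05/4, and a dosage level is flagged when its interval excludes 0. The interval endpoints below are copied from the `confint(model.1,level=1-.05/4)` output; the variable names are illustrative assumptions.

```r
alpha <- 0.05
g     <- 4                    # number of simultaneous tests (D1 to D4)
level <- 1 - alpha / g        # per-interval confidence level

# Endpoints copied from the printed output at level = 1 - .05/4
ci <- rbind(
  D1 = c(-0.2400666, 1.4767392),
  D2 = c( 0.5249510, 2.3881122),
  D3 = c(-0.2283567, 1.5603357),
  D4 = c(-0.6018672, 1.0418337)
)
colnames(ci) <- c("lower", "upper")

# A coefficient is significant when its interval does not contain 0
excludes.zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
significant <- names(which(excludes.zero))

print(level)
print(significant)
```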
