Econometrics Assignment 3
Reza Brianca
October 31, 2016
Part 1
Use the data in meap00_01.RData to answer this question. The original source of this data set is the Michigan Department of Education (www.michigan.gov/mde).
a. Estimate the model
$$math4 = \beta_0 + \beta_1\,lunch + \beta_2 \log(enroll) + \beta_3 \log(exppp) + u$$
by OLS and obtain the usual standard errors and the robust standard errors. How do they generally compare?
The OLS regression with the usual standard errors gave the following result:
The OLS regression with robust standard errors gave the following result:
b. Apply the White test for heteroskedasticity. What is the value of the F test? What do you conclude?
In this analysis, H0 is that the model is homoskedastic. The White test can be applied by modifying the Breusch-Pagan test so that the auxiliary regression also includes the squares and cross products of all variables in the model.
The White test for heteroskedasticity gave F = 229.78 with a very low p-value. Therefore, we could reject H0, meaning that the model was not homoskedastic.
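The F statistic of such a test comes from the R-squared of the auxiliary regression of the squared residuals on the test regressors. The sketch below shows only that arithmetic. The function name and the inputs are hypothetical and are not the values behind F = 229.78. With three regressors, the full White regression has nine terms (three levels, three squares, three cross products).

```python
def white_f_statistic(r_squared: float, n_obs: int, n_terms: int) -> float:
    """F statistic for the joint significance of the auxiliary regression.

    r_squared: R-squared from regressing squared residuals on the test terms
    n_obs:     number of observations
    n_terms:   number of regressors in the auxiliary regression (no intercept)
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    numerator = r_squared / n_terms
    denominator = (1.0 - r_squared) / (n_obs - n_terms - 1)
    return numerator / denominator


# Hypothetical inputs, chosen only to illustrate the formula.
f_stat = white_f_statistic(r_squared=0.25, n_obs=104, n_terms=2)
print(round(f_stat, 4))
```

A large F (small p-value) leads to rejecting the null of homoskedasticity, as in the conclusion above.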
c. Obtain $\hat g_i$ as the fitted values from the regression of $\log(\hat u_i^2)$ on $\widehat{math4}_i$, $\widehat{math4}_i^2$, where $\widehat{math4}_i$ are the OLS fitted values and the $\hat u_i$ are the OLS residuals. Let $\hat h_i = \exp(\hat g_i)$. Use the $\hat h_i$ to obtain WLS estimates. Are there big differences with the OLS coefficients?
From part (a), the OLS regression gave the following result:
The WLS regression gave the following result:
The WLS and OLS estimates differed in several ways, although every coefficient kept its sign and all of the variables remained statistically significant. The coefficient values changed as follows:
1. The intercept decreased from 91.932 to 50.478
2. The magnitude of the lenroll coefficient fell from 5.399 to 2.647, implying that school enrollment has a smaller effect on the share of students with satisfactory math scores
3. The lexppp coefficient increased from 3.525 to 6.474, implying that expenditure per student has a larger effect on the share of students with satisfactory math scores
4. The coefficient for lunch changed very little (by less than 0.001)
5. lexppp became highly statistically significant, compared with the OLS model
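The procedure in part (c) can be sketched as follows. This is a minimal pure-Python illustration on made-up data, not the code used for the results above. The helper names (`solve`, `wls`, `predict`, `fgls`) and the toy data are assumptions introduced here.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def wls(rows, y, weights=None):
    """Weighted least squares; an intercept column is added to `rows`."""
    w = [1.0] * len(y) if weights is None else weights
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtwx = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, X)) for b in range(k)]
            for a in range(k)]
    xtwy = [sum(wi * xi[a] * yi for wi, xi, yi in zip(w, X, y)) for a in range(k)]
    return solve(xtwx, xtwy)


def predict(beta, rows):
    """Fitted values for coefficients `beta` (intercept first)."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]


def fgls(rows, y):
    """OLS, then log(u^2) on fitted and fitted^2, then WLS with weights 1/h."""
    beta_ols = wls(rows, y)
    fitted = predict(beta_ols, rows)
    log_u2 = [math.log((yi - fi) ** 2) for yi, fi in zip(y, fitted)]
    aux_rows = [(f, f * f) for f in fitted]
    g_hat = predict(wls(aux_rows, log_u2), aux_rows)
    h_hat = [math.exp(g) for g in g_hat]
    beta_wls = wls(rows, y, [1.0 / h for h in h_hat])
    return beta_ols, beta_wls, h_hat


# Made-up data: the error spread grows with x (heteroskedasticity).
rows = [(float(x),) for x in range(1, 13)]
y = [2.0 + 3.0 * x + (-1) ** i * (0.5 + 0.2 * x) for i, (x,) in enumerate(rows)]

beta_ols, beta_wls, h_hat = fgls(rows, y)
print("OLS:", beta_ols)
print("WLS:", beta_wls)
```

Because the weights are 1 / exp(g), they are always positive, which is the reason for modelling the log of the squared residuals rather than the squared residuals themselves.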
d. Obtain the standard errors for WLS that allow misspecification of the variance function. Do these differ much
from the usual WLS standard errors?
The regular WLS regression result was as follows:
And the robust WLS regression result was as follows:
e. For estimating the effect of spending on math4, does OLS or WLS appear to be more precise?
Answer: WLS appeared to be more precise than OLS. Under OLS, lexppp was significant only at the 9.31% level, whereas under WLS the level fell to 0.012%. The same pattern appeared when comparing the robust models: lexppp was significant at the 13.44% level under OLS and at the 0.036% level under WLS.
Part 2
Use the data in nbasal.RData to answer this question. A regression equation is needed to study the factors that influence salaries of NBA players. We use log(wage) as the dependent variable. Potential factors that will affect a player's wage include experience (exper, coll), games' participation (games, avgmin), position (forward, center, guard), performance (points, rebounds, assists), prestige (draft, allstar), and demographic factors (black, children, marr). For this question, it is fine if you only consider these variables in the level form.
a. Identify if there is any multicollinearity problem
Of the 22 variables provided in nbasal.RData, 8 contributed to multicollinearity, namely wage, exper, age, minutes, guard, avgmin, agesq and marr * black.
After removing the missing values, a correlation test showed several highly correlated pairs between these 8 variables and the remaining 14 variables. The top 3 were:
1. Variable age with exper (r(238) = 0.94, p-value < .001)
2. Variable points with avgmin (r(238) = 0.87, p-value < .001)
3. Variable points with minutes (r(238) = 0.82, p-value < .001)
Therefore, these variables were likely to cause a multicollinearity problem, and we could keep only one variable from each pair in our model.
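As a rough illustration of the pairwise check above, the sketch below computes a Pearson correlation, its t statistic, and the variance inflation factor implied by a single pairwise correlation. The function names are hypothetical. Reading r(238) as 238 degrees of freedom (n = 240) is an assumption, and `pairwise_vif` only matches the usual VIF in a model with exactly two regressors.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def pairwise_vif(r):
    """Variance inflation factor when the only other regressor has correlation r."""
    return 1.0 / (1.0 - r * r)


# Using the reported correlation between age and exper (assuming n = 240).
t_age_exper = correlation_t_statistic(0.94, 240)
vif_age_exper = pairwise_vif(0.94)
print(round(t_age_exper, 1), round(vif_age_exper, 2))
```

A VIF well above the common rule-of-thumb threshold of 5 to 10 is one way to justify dropping one variable of a pair.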
b. Find the model(s) with the lowest AIC by using forward and backward stepwise selections
The forward selection model with the lowest AIC gave the following result:
allstar had the weakest significance level (p-value < .1).
The backward selection model with the lowest AIC gave the following result:
c. Investigate the four residual plots of model(s) from (b). Are residual plots satisfactory? Comment
The results for forward stepwise selection were as follows:
[Figure: four diagnostic plots for the forward selection model: Residuals vs Fitted Values, Scale Location, QQ Plot, Leverage vs Residuals]
In the forward selection, the Residuals vs Fitted plot showed that the observations furthest from zero were 24, 29, and 166. The normal QQ plot flagged the same observations (24, 29, and 166) as outliers, as did the Scale-Location plot. In the Residuals vs Leverage plot, observations 103, 104, and 166 were the outliers.
The results for backward stepwise selection were as follows:
[Figure: four diagnostic plots for the backward selection model: Residuals vs Fitted Values, Scale Location, QQ Plot, Leverage vs Residuals]
In the backward selection, the Residuals vs Fitted plot showed that the observations furthest from zero were 24, 29, and 166. The normal QQ plot flagged the same observations (24, 29, and 166), as did the Scale-Location plot. In the Residuals vs Leverage plot, observations 103, 104, and 166 were the outliers.
Based on the diagrams for these 2 selections, we can conclude that the residual plots for both selections were consistent. Therefore, we could identify the outliers in our sample as observations 24, 29, 103, 104, and 166. We can also say that the models were satisfactory based on the following:
1. The residuals did not form any distinctive pattern in the Residuals vs Fitted plot, suggesting that the model left no systematic (for example non-linear) relationship unexplained
2. The residuals mostly followed the straight line in the normal QQ plot, meaning the residuals were approximately normally distributed
3. The residuals were spread mostly evenly along the range of fitted values in the Scale-Location plot, meaning we can assume that the model has equal variance (homoscedasticity)
4. Although there were several outliers in the observations, none of them was influential, as they all lay inside the Cook's distance contours in the Residuals vs Leverage plot
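The influence check in point 4 rests on Cook's distance, which combines the size of a residual with the leverage of the observation. The sketch below shows the standard formula written in terms of the standardized residual. The function name and the example inputs are hypothetical and are not taken from the fitted models above.

```python
def cooks_distance(std_resid: float, leverage: float, n_params: int) -> float:
    """Cook's distance from a standardized residual and a leverage value.

    std_resid: internally studentized (standardized) residual
    leverage:  diagonal element of the hat matrix, strictly between 0 and 1
    n_params:  number of estimated coefficients, including the intercept
    """
    if not 0.0 < leverage < 1.0:
        raise ValueError("leverage must lie strictly between 0 and 1")
    return (std_resid ** 2 / n_params) * leverage / (1.0 - leverage)


# Hypothetical observation: a large residual but low leverage.
d = cooks_distance(std_resid=-2.5, leverage=0.05, n_params=7)
print(round(d, 4))
```

R's residuals vs leverage plot draws contours at Cook's distances of 0.5 and 1 by default, so an observation like this hypothetical one would sit well inside them.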
d. Repeat (b) without avgmin. Comment on the differences
The forward selection model using lowest AIC would give us following result:
$$lwage = \underset{(0.119)}{6.064} + \underset{(0.01)}{0.059}\,points + \underset{(0.01)}{0.065}\,exper - \underset{(0.002)}{0.01}\,draft - \underset{(0.14)}{0.325}\,allstar + \underset{(0.016)}{0.042}\,rebounds + \underset{(0.021)}{0.036}\,assists$$
The backward selection model using lowest AIC would give us following result:
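$$lwage = \underset{(0.162)}{5.811} + \underset{(0.011)}{0.064}\,exper + \underset{(0.097)}{0.249}\,forward + \underset{(0.125)}{0.272}\,center + \underset{(0.009)}{0.069}\,points + \underset{(0.025)}{0.058}\,assists - \underset{(0.002)}{0.01}\,draft - \underset{(0.142)}{0.372}\,allstar + \underset{(0.095)}{0.161}\,black$$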
Without the variable avgmin, the two models differed as follows:
1. Forward selection used 6 variables, namely points, exper, draft, allstar, rebounds and assists, while backward selection used 8 variables, namely exper, forward, center, points, assists, draft, allstar and black
2. The intercept in the forward selection model (6.064) was higher than in the backward selection model (5.811)
3. The variables exper, draft and points appeared in both models and remained highly significant (p-value < .001)
4. Variable allstar was more strongly significant in backward selection (p-value < .01) than in forward selection (p-value < .05)
5. Variable assists was also more strongly significant in backward selection (p-value < .05) than in forward selection (p-value < .1)
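Forward and backward selection can stop at different models because each is a greedy search that changes one variable at a time. The sketch below illustrates this with a made-up AIC table. The function names, the variables a, b, c and every AIC value are hypothetical, and the search is a simplified version of what a stepwise routine does.

```python
def forward_select(candidates, aic_of):
    """Add the variable that lowers AIC the most until no addition helps."""
    selected, remaining = [], list(candidates)
    best = aic_of(frozenset())
    while remaining:
        trial_aic, trial_var = min(
            (aic_of(frozenset(selected + [v])), v) for v in remaining
        )
        if trial_aic >= best:
            break
        best = trial_aic
        selected.append(trial_var)
        remaining.remove(trial_var)
    return selected, best


def backward_eliminate(candidates, aic_of):
    """Drop the variable whose removal lowers AIC the most until none helps."""
    selected = list(candidates)
    best = aic_of(frozenset(selected))
    while selected:
        trial_aic, trial_var = min(
            (aic_of(frozenset(selected) - {v}), v) for v in selected
        )
        if trial_aic >= best:
            break
        best = trial_aic
        selected.remove(trial_var)
    return selected, best


# Made-up AIC values for every subset of three hypothetical variables.
TOY_AIC = {
    frozenset(): 10.0,
    frozenset("a"): 5.0, frozenset("b"): 7.0, frozenset("c"): 8.0,
    frozenset("ab"): 6.0, frozenset("ac"): 6.0, frozenset("bc"): 4.0,
    frozenset("abc"): 4.5,
}

forward_result = forward_select(["a", "b", "c"], TOY_AIC.__getitem__)
backward_result = backward_eliminate(["a", "b", "c"], TOY_AIC.__getitem__)
print(forward_result, backward_result)
```

In this toy table, the forward search never reaches the pair with the lowest AIC because it commits to the best single variable first, while the backward search reaches it by removing one variable from the full model.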