Practice Midterm 2, Fall 2011
Professor Brainerd
Part I: Multiple Choice
1. Under imperfect multicollinearity
   a. the OLS estimator cannot be computed.
   b. two or more of the regressors are highly correlated.
   c. the OLS estimator is biased even in samples of n > 100.
   d. the error terms are highly, but not perfectly, correlated.
   e. none of the above.
2. Imagine you regressed earnings of individuals on a constant, a binary variable (Male) which takes on the value 1 for males and is 0 otherwise, and another binary variable (Female) which takes on the value 1 for females and is 0 otherwise. Because females typically earn less than males, you would expect
   a. the coefficient for Male to have a positive sign, and for Female a negative sign.
   b. both coefficients to be the same distance from the constant, one above and the other below.
   c. none of the OLS estimators to exist because there is perfect multicollinearity.
   d. this to yield a difference in means statistic.
   e. none of the above.
3. The adjusted $R^2$, or $\bar{R}^2$, is given by
   a. $1 - \frac{n-2}{n-k-1}\,\frac{SSR}{TSS}$.
   b. $1 - \frac{n-1}{n-k-1}\,\frac{ESS}{TSS}$.
   c. $1 - \frac{n-1}{n-k-1}\,\frac{SSR}{TSS}$.
   d. $\frac{ESS}{TSS}$.
   e. none of the above.
4. If you reject a joint null hypothesis using the F-test in a multiple hypothesis setting, then
   a. a series of t-tests may or may not give you the same conclusion.
   b. the regression is always significant.
   c. all of the hypotheses are always simultaneously rejected.
   d. the F-statistic must be negative.
   e. none of the above.
5. The interpretation of the slope coefficient in the model $Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i$ is as follows:
   a. a 1% change in X is associated with a $\beta_1$% change in Y.
   b. a 1% change in X is associated with a change in Y of $0.01\,\beta_1$.
   c. a change in X by one unit is associated with a $100\,\beta_1$% change in Y.
   d. a change in X by one unit is associated with a $\beta_1$ change in Y.
6. To decide whether $Y_i = \beta_0 + \beta_1 X + u_i$ or $\ln(Y_i) = \beta_0 + \beta_1 X + u_i$ fits the data better, you cannot consult the regression $R^2$ because
   a. ln(Y) may be negative for 0 < Y < 1.
   b. the TSS are not measured in the same units between the two models.
   c. the slope no longer indicates the effect of a unit change of X on Y in the log-linear model.
   d. the regression $R^2$ can be greater than one in the second model.
7. The components of internal validity are
   a. a large sample, and BLUE property of the estimator.
   b. a regression $R^2$ above 0.75 and serially uncorrelated errors.
   c. unbiasedness and consistency of the estimator, and desired significance level of hypothesis testing.
   d. nonstochastic explanatory variables, and prediction intervals close to the sample mean.
   e. (a) and (c)
8. Simultaneous causality
   a. means you must run a second regression of X on Y.
   b. leads to correlation between the regressor and the error term.
   c. means that a third variable affects both Y and X.
   d. cannot be established since regression analysis only detects correlation between variables.
9. In the fixed effects regression model, using (n - 1) binary variables for the entities, the coefficient of the binary variable indicates
   a. the level of the fixed effect of the ith entity.
   b. either 0 or 1.
   c. the difference in fixed effects between the ith and the first entity.
   d. the response in the dependent variable to a percentage change in the binary variable.
   e. (a) and (b)
10. Consider the special panel case where T = 2. If some of the omitted variables, which you hope to capture in the changes analysis, in fact change over time, then the estimator on the included change regressor
   a. will be unbiased only when allowing for heteroskedasticity-robust standard errors.
   b. may still be unbiased.
   c. will only be unbiased in large samples.
   d. will always be unbiased.
   e. none of the above.
Part II: True/False/Uncertain
1. To test whether or not the population regression function is linear rather than a polynomial of order r, one should use the test of (r-1) restrictions using the F-statistic.
2. If the true model is $Y = \beta_1 + \beta_2 X_1 + \beta_3 X_2 + u$ but you estimate $Y = \beta_1 + \beta_2 X_1 + u$, your estimate of $\beta_2$ will always be biased.
3. If the estimated coefficient on X is significantly different from zero at a 5% significance level, then X has a causal impact on Y.
4. Suppose Var(ui) is positively correlated with X in the population. In this case, the OLS estimators are biased. (Unless otherwise stated, assume all relevant assumptions hold.)
5. In a simple regression of test scores on hours studied, we estimate a slope coefficient of 4.52. However, we have omitted health status, a variable that describes how healthy the student is this week. True/False/Uncertain: The true $\beta_1$ is smaller than 4.52.
6. $\ln(Y) = \beta_1 + \beta_2 \ln(X) + u$ is linear in both the parameters and the variables.
7. Suppose we have the following model: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i$. The effect on Y of a change in X will depend on the level of X.
8. Suppose you are working with a cross-sectional sample of 50 states. For each state you have information on the beer tax rate and the number of drunk driving arrests in the past year. T/F/U: If you are trying to estimate the causal impact of beer taxes on drunk driving, you should include state fixed effects in your model.
Part III: Short answer problems
1. Earnings functions attempt to find the determinants of earnings, using both continuous and binary variables. One of the central questions analyzed in this relationship is the returns to education.
   a. Collecting data from 253 individuals, you estimate the following relationship

$$\ln(Earn) = \underset{(0.14)}{0.54} + \underset{(0.011)}{0.083}\,Educ, \quad R^2 = 0.20,\ SER = 0.445$$

      where ln(Earn) is the log of average hourly earnings and Educ is years of education. What is the effect on earnings of an additional year of schooling? If you had a strong belief that years of high school education were different from college education, how would you modify the equation? What if your theory suggested that there was a diploma effect?
   b. You read in the literature that there should also be returns to on-the-job training. To approximate on-the-job training, researchers often use a potential experience variable, which is defined as $Exper = Age - Educ - 6$. You incorporate the experience variable into your original regression

$$\ln(Earn) = \underset{(0.16)}{-0.01} + \underset{(0.012)}{0.101}\,Educ + \underset{(0.006)}{0.033}\,Exper - \underset{(0.0001)}{0.0005}\,Exper^2, \quad R^2 = 0.34,\ SER = 0.405$$

      What is the effect of an additional year of experience for a person who is 40 years old and had 12 years of education? What about for a person who is 60 years old with the same education background?
   c. Test for the statistical significance of each of the coefficients of the added variables. Why has the coefficient on education changed so little?
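For part b, the effect of one more year of experience in a quadratic specification can be approximated by differentiating the fitted equation with respect to Exper. The sketch below is only an illustration, assuming the definition reads Exper = Age - Educ - 6 and that the coefficient on the squared term is negative; the function names are hypothetical and not part of the exam.

```python
# Coefficients taken from the second earnings regression in the problem.
B_EXPER = 0.033      # coefficient on Exper
B_EXPER2 = -0.0005   # coefficient on Exper^2 (sign assumed negative)


def potential_experience(age, educ):
    """Potential experience, assuming Exper = Age - Educ - 6."""
    return age - educ - 6


def marginal_effect_of_experience(exper):
    """Derivative of ln(Earn) with respect to Exper at a given level."""
    return B_EXPER + 2 * B_EXPER2 * exper


for age in (40, 60):
    exper = potential_experience(age, educ=12)
    effect = marginal_effect_of_experience(exper)
    # Multiply by 100 to read the log-point change as an approximate percent.
    print(f"age {age}: Exper = {exper}, approx. {100 * effect:.1f}% per extra year")
```

The derivative is an approximation to the exact change from Exper to Exper + 1, which would use the difference of the squared terms instead.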
2. Consider the following regression output for an unrestricted and a restricted model.

Unrestricted model:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/31/06   Time: 17:35
Sample: 1 420
Included observations: 420

Variable              Coefficient   Std. Error   t-Statistic   Prob.

R-squared                           Mean dependent var       654.16
Adjusted R-squared                  S.D. dependent var        19.05
S.E. of regression                  Akaike info criterion      7.16
Sum squared resid                   Schwarz criterion          7.22
Log likelihood                      F-statistic              324.94
Durbin-Watson stat                  Prob(F-statistic)          0.00

Restricted model:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/31/06   Time: 17:37
Sample: 1 420
Included observations: 420

Variable              Coefficient   Std. Error   t-Statistic   Prob.
C                          593.48
STR                         -0.39
EL_PCT                      -0.43
LOG(AVGINC)                 28.36

R-squared                    0.71   Mean dependent var       654.16
Adjusted R-squared           0.71   S.D. dependent var        19.05
S.E. of regression          10.26   Akaike info criterion      7.50
Sum squared resid         43792.4   Schwarz criterion          7.54
Log likelihood            -1571.8   F-statistic              342.98
Durbin-Watson stat           1.30   Prob(F-statistic)          0.00

Calculate the homoskedasticity-only F-statistic and determine whether the null hypothesis can be rejected at the 5% significance level.
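The homoskedasticity-only F-statistic compares the sums of squared residuals of the two models: F = [(SSR_restricted - SSR_unrestricted) / q] / [SSR_unrestricted / (n - k_unrestricted - 1)]. The sketch below shows only the mechanics. The restricted SSR and n are read from the output above, but the unrestricted SSR, the number of restrictions q, and the number of unrestricted regressors are HYPOTHETICAL placeholders, because those entries are not legible in this copy.

```python
def homoskedasticity_only_f(ssr_restricted, ssr_unrestricted, q, n, k_unrestricted):
    """F = [(SSR_r - SSR_u) / q] / [SSR_u / (n - k_u - 1)]."""
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / (n - k_unrestricted - 1)
    return numerator / denominator


# Read from the restricted-model output in the problem.
SSR_RESTRICTED = 43792.4
N_OBS = 420

# HYPOTHETICAL placeholders, for illustration only: replace them with the
# values from the unrestricted-model output.
ssr_unrestricted = 30000.0
q = 2
k_unrestricted = 5

f_stat = homoskedasticity_only_f(
    SSR_RESTRICTED, ssr_unrestricted, q, N_OBS, k_unrestricted
)
print(f"F = {f_stat:.2f} (illustrative inputs only)")
```

The result is then compared with the 5% critical value of the F distribution with q numerator degrees of freedom; for q = 2 in large samples that value is 3.00.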
3. Using data from 93 countries, a researcher estimates the following model (heteroskedasticity-robust t-statistics in parentheses):
$$
\begin{aligned}
GRAFT_i = {} & \underset{(-8.177)}{-4.702} + \underset{(3.270)}{2.456}\,FEMPAR_i + \underset{(5.311)}{0.478}\,\ln GNP_i + \underset{(0.086)}{0.003}\,EDUC_i - \underset{(-1.767)}{0.281}\,CATHOLIC_i \\
& - \underset{(-0.792)}{0.152}\,MUSLIM_i + \underset{(3.672)}{0.481}\,BRITCOL_i + \underset{(1.950)}{0.312}\,NOTCOL_i + \underset{(0.705)}{0.141}\,ETHNIC_i + \underset{(2.788)}{0.092}\,FREEDOMS_i
\end{aligned}
$$

adj. $R^2$ = 0.75, n = 93

where
   GRAFT_i = index of the amount of perceived corruption in country i, with higher values meaning less corruption
   FEMPAR_i = proportion of the Parliament that is female in country i
   ln GNP_i = logarithm of GNP in country i
   EDUC_i = average years of schooling in country i
   CATHOLIC_i = proportion of population Catholic in country i
   MUSLIM_i = proportion of population Muslim in country i
   BRITCOL_i = 1 if former British colony
   NOTCOL_i = 1 if never colonized
   ETHNIC_i = proportion of population in largest ethnic group in country i
   FREEDOMS_i = index of political freedoms in country i, with higher values meaning more political freedoms

   a. Briefly explain what heteroskedasticity is, and why the researcher used heteroskedasticity-robust standard errors in his estimation.
   b. Which of the coefficients are significantly different from 0 at the 5% significance level (two-tailed test)?
   c. Why might the researcher have chosen to leave variables in the regression that are not statistically significant?
   d. Briefly explain how you could test whether the effect of proportion of the parliament that is female on the corruption measure differs for European countries.
   e. Is multicollinearity a problem for this regression? Why or why not?
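For part b, a coefficient is significant at the 5% level in a two-tailed test when the absolute value of its t-statistic exceeds the critical value. The sketch below applies that rule to the reported t-statistics, assuming the large-sample normal critical value of 1.96; the dictionary keys and function name are illustrative labels only.

```python
# Heteroskedasticity-robust t-statistics as reported in the problem.
T_STATS = {
    "constant": -8.177,
    "FEMPAR": 3.270,
    "lnGNP": 5.311,
    "EDUC": 0.086,
    "CATHOLIC": -1.767,
    "MUSLIM": -0.792,
    "BRITCOL": 3.672,
    "NOTCOL": 1.950,
    "ETHNIC": 0.705,
    "FREEDOMS": 2.788,
}

# Assumed large-sample two-sided 5% critical value (standard normal).
CRITICAL_VALUE = 1.96


def significant_at_5_percent(t_stats, critical_value=CRITICAL_VALUE):
    """Return the names whose |t| exceeds the critical value."""
    return [name for name, t in t_stats.items() if abs(t) > critical_value]


print(significant_at_5_percent(T_STATS))
```

NOTCOL, with t = 1.950, falls just short of 1.96. Using a Student t critical value for this sample size would give a threshold slightly above 1.96, so the same set of coefficients would be flagged.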
testscr is the average test score of 5th graders in the district
comp_stu is the average number of computers per student in the district

Regression 1:

------------------------------------------------------------------------------
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    comp_stu |   79.40485   13.81145     5.75   0.000
       _cons |   643.3633              309.28   0.000     639.2743    647.4523
------------------------------------------------------------------------------

Regression 2:
. reg testscr comp_stu avginc

      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  2,   417) =  231.05
       Model |  79955.8549     2  39977.9274           Prob > F      =  0.0000
    Residual |  72153.7387   417  173.030549           R-squared     =  0.5256
-------------+------------------------------           Adj R-squared =  0.5234
       Total |  152109.594   419  363.030056           Root MSE      =  13.154

------------------------------------------------------------------------------
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    comp_stu |   40.22145   10.08643     3.99   0.000     20.39487    60.04803
      avginc |   1.808115   .0906701    19.94   0.000     1.629887    1.986342
       _cons |   620.9952   1.865068   332.96   0.000     617.3291    624.6613
------------------------------------------------------------------------------

   a. How does the interpretation of comp_stu change when we add avginc to the regression?
   b. Which regression is better? Why?
   c. The coefficient on comp_stu went down significantly with the inclusion of avginc in the model. What does this suggest about the correlation between comp_stu and avginc? Explain.

NEW VARIABLE:
. gen avginc2=avginc*avginc
. reg testscr comp_stu avginc avginc2

      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  3,   416) =
       Model |  88100.6286     3  29366.8762           Prob > F      =  0.0000
    Residual |   64008.965   416  153.867704           R-squared     =  0.5792
-------------+------------------------------           Adj R-squared =
       Total |  152109.594   419  363.030056           Root MSE      =  12.404

------------------------------------------------------------------------------
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    comp_stu |   45.50465   9.539196     4.77   0.000     26.75362    64.25569
      avginc |   3.874927   .2966648    13.06   0.000     3.291778    4.458076
     avginc2 |  -.0445311   .0061206    -7.28   0.000    -.0565623   -.0324998
       _cons |   601.3871    3.21818   186.87   0.000     595.0612     607.713
------------------------------------------------------------------------------

   d. What is the interpretation of the coefficient on avginc2?
   e. Would you accept or reject the null hypothesis that there is a linear relationship between avginc and testscr? Explain.
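With a squared term in the regression, the effect of avginc on testscr is no longer a single number: it is the derivative of the fitted equation, which changes with the level of avginc. The sketch below evaluates that derivative using the coefficients reported above. The income levels 10, 20 and 40 are hypothetical evaluation points (the units of avginc are not stated here), and the function names are illustrative.

```python
# Coefficients and t-statistic from the regression with avginc2.
B_AVGINC = 3.874927
B_AVGINC2 = -0.0445311
T_AVGINC2 = -7.28


def marginal_effect_of_income(avginc):
    """Derivative of predicted testscr with respect to avginc."""
    return B_AVGINC + 2 * B_AVGINC2 * avginc


def turning_point():
    """Level of avginc at which the estimated marginal effect is zero."""
    return -B_AVGINC / (2 * B_AVGINC2)


# Hypothetical evaluation points, in whatever units avginc is measured.
for level in (10, 20, 40):
    print(f"avginc = {level}: marginal effect = {marginal_effect_of_income(level):.3f}")

print(f"marginal effect reaches zero near avginc = {turning_point():.1f}")

# Linearity corresponds to a zero coefficient on avginc2; 1.96 is the
# assumed large-sample two-sided 5% critical value.
print("reject linearity at 5%:", abs(T_AVGINC2) > 1.96)
```

The negative coefficient on avginc2 is what makes the marginal effect shrink as avginc rises.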
NEW VARIABLE:
. gen comp_inc=comp_stu*avginc
. reg testscr comp_stu avginc comp_inc

      Source |       SS       df       MS              Number of obs =     420
-------------+------------------------------           F(  3,   416) =  163.94
       Model |  82406.4226     3  27468.8075           Prob > F      =  0.0000
    Residual |   69703.171   416    167.5557           R-squared     =  0.5418
-------------+------------------------------           Adj R-squared =  0.5385
       Total |  152109.594   419  363.030056           Root MSE      =  12.944

------------------------------------------------------------------------------
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    comp_stu |   109.9386   20.75689     5.30   0.000     69.13712    150.7401
      avginc |    2.59762   .2248998    11.55   0.000     2.155539    3.039702
    comp_inc |  -4.558676   1.192024    -3.82   0.000    -6.901818   -2.215535
       _cons |    609.333   3.559197   171.20   0.000     602.3367    616.3292
------------------------------------------------------------------------------

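In the regression with the interaction term comp_inc, the estimated effect of comp_stu on testscr depends on avginc: it equals the coefficient on comp_stu plus the coefficient on comp_inc times avginc. The sketch below evaluates this using the reported coefficients. The income levels are hypothetical evaluation points and the function names are illustrative.

```python
# Coefficients from the regression with the interaction term comp_inc.
B_COMP_STU = 109.9386
B_COMP_INC = -4.558676


def effect_of_comp_stu(avginc):
    """Estimated change in testscr per unit of comp_stu at a given avginc."""
    return B_COMP_STU + B_COMP_INC * avginc


def income_where_effect_is_zero():
    """Level of avginc at which the estimated effect of comp_stu is zero."""
    return -B_COMP_STU / B_COMP_INC


# Hypothetical evaluation points, in whatever units avginc is measured.
for level in (10, 20):
    print(f"avginc = {level}: effect of comp_stu = {effect_of_comp_stu(level):.2f}")

print(f"effect reaches zero near avginc = {income_where_effect_is_zero():.1f}")
```

The negative interaction coefficient means the estimated payoff to computers per student is smaller in higher-income districts.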
5. Suppose you were interested in estimating the impact of HIV on school enrollment in South Africa. Consider the following model:
Where is a dummy variable indicating whether or not individual i is enrolled, is a dummy variable indicating whether the individual had contracted HIV, and is an error term. You decide to estimate the model using OLS, and estimate the following coefficients: . (When the dependent variable is binary, the coefficient gives the change in probability that Y=1 for a unit change in X).
Can you interpret these estimates as causal? Provide TWO reasons why these estimates may not be internally valid. Give a brief explanation of the potential biases and please be specific in your explanation.
6. A study, published in 1993, used U.S. state panel data to investigate the relationship between minimum wages and employment of teenagers. The sample period was 1977 to 1989 for all 50 states. The author estimated a model of the following type:
where E is the employment to population ratio of teenagers, M is the nominal minimum wage, W is average hourly earnings in manufacturing, i indexes state and t indexes time. In addition, other explanatory variables, such as the adult unemployment rate, the teenage population share, and the teenage enrollment rate in school, were included. Note that U.S. states are allowed to set their own minimum wage, as long as it is at least as high as the federal minimum wage.
   a. Briefly discuss the advantage of using panel data in this situation rather than pure cross section or time series data.
   b. The author decided to use eight regional dummy variables instead of the 49 state dummy variables. What is the implicit assumption made by the author? Can you test for its validity? If so, how?
   c. The results, using time and region fixed effects only, were as follows:
$$= \underset{(0.036)}{-0.182}\,\ln(M_{it}/W_{it}) + \ldots; \quad R^2 = 0.727$$
      Interpret the coefficient on $\ln(M_{it}/W_{it})$.
   d. State minimum wages often do not exceed the federal minimum wage. As a result, the author decided to use the federal minimum wage in his specification above rather than the state minimum wage. How does the original equation change? Rewrite the new equation. How would this change your interpretation of $\beta_1$? Explain.