Handout 4 Multiple Regression


ECON 41

Outline #4: Multiple regression

4.1. Multiple regression:


• Definition: Regression with more than one regressor (X variable)
• Why?
o There are almost always multiple causes of outcomes
o If we don’t control for omitted variables (sometimes called confounders), our estimates may suffer
from omitted variable bias.
o Adding more regressors can lead to better predictions and a less-biased estimate of the
coefficient of interest.

4.2. The problem of omitted variable bias (OVB)


• Omitted variable bias occurs when an omitted variable (not included as a regressor)
(1) is correlated (positively or negatively) with an included regressor (X), and
(2) is a determinant of Y.
Notice this is a two-part definition
• Omitted variable bias violates the least squares assumption that E(ui | Xi) = 0
• An example could be a regression of test scores on class size, omitting median household income.

[Path diagram: household (HH) income affects both class size and test scores]

• Omitted variable bias does NOT go away with a larger sample: the slope estimate is not consistent (see the simulation sketch after this list).
• The direction of the bias depends on the correlation of u and X.
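A minimal R simulation sketch of this point (all variable names and numbers are hypothetical): test scores depend on class size and household income, income is negatively correlated with class size, and omitting income biases the class-size slope even in a very large sample.

```r
# Hypothetical data-generating process: richer districts have smaller classes and higher scores.
set.seed(1)
n <- 100000                                    # large n: the bias still does not vanish
income     <- rnorm(n, mean = 60, sd = 15)     # median HH income (in $1,000s)
class_size <- 30 - 0.2 * income + rnorm(n, sd = 3)
score      <- 650 - 1.0 * class_size + 0.5 * income + rnorm(n, sd = 10)

coef(lm(score ~ class_size))            # omits income: slope is biased (too negative)
coef(lm(score ~ class_size + income))   # controls for income: slope is close to the true -1.0
```

Here the short regression's slope is more negative than the true −1.0 because the omitted variable raises scores and is negatively correlated with class size, which illustrates how the sign of the bias follows the correlation between u and X.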

4.3. Addressing omitted variable bias using multiple regression (multivariable regression)
• Basically, add the omitted variable(s) to the regression, which then controls for their effect on Y.
• Regression model with two regressors (two explanatory variables): Yi = β0 + β1X1i + β2X2i + ui
• The interpretation of each slope is now the partial effect, holding the other variable(s) constant:
β1 = ΔY/ΔX1 holding X2 constant = ∂Y/∂X1
• Can add many variables. Obviously you must have more observations than variables! But this is no problem
in the era of big data: regressions can have thousands of explanatory/control/independent variables
(covariates).
• The multiple regression coefficients can be estimated using OLS: again, choose the intercept and slope
parameters (βs) to minimize the sum of squared deviations between Y and predicted Y.
• If one of the explanatory variables is a dummy variable, interpret the estimated coefficient as the
difference, on average, in the outcome between the 1 category and the 0 category, holding constant the
level of the other covariates (see the sketch after this list).
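A minimal R sketch (hypothetical data and variable names) of a regression with two regressors, one of them a dummy, estimated by OLS with lm():

```r
# Hypothetical data: wage regressed on years of education and a female dummy (0/1).
set.seed(2)
n <- 500
educ   <- sample(10:20, n, replace = TRUE)
female <- rbinom(n, 1, 0.5)
wage   <- 5 + 2.5 * educ - 4 * female + rnorm(n, sd = 5)

fit <- lm(wage ~ educ + female)
summary(fit)
# Slope on educ:   effect of one more year of education, holding female constant.
# Slope on female: average difference in wage between the 1 and 0 categories, holding educ constant.
```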

4.4. Least squares assumptions for multiple regression (first three are basically the same as before)
• Error term has conditional mean zero: E(ui | X1i, X2i, … Xki) = 0
• Observations are i.i.d. (random sampling)
• Large outliers are unlikely
• No perfect multicollinearity

4.5. Measures of fit for multiple regression


• Standard error of the regression (SER) estimates the standard deviation of the error term. (slightly different
formula; divide SSR by (n-k-1))
• R2 is the fraction of the sample variation of Y explained by the regressors, ESS/TSS. But R2 is not a good
measure for comparing regressions with different numbers of regressors.
• When you add a regressor, R2 never decreases and almost always increases, at least a little.
o Hence adding even a meaningless regressor can increase R2 (see the sketch after this list)
• To compare the fit between regressions with different numbers of regressors, we often use the adjusted R2,
which essentially eliminates the advantage of the regression with more regressors:
adjusted R² = 1 − [(n − 1)/(n − k − 1)] × (SSR/TSS) = 1 − s²_û / s²_Y
where k is the number of regressors (X variables)
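A minimal R sketch (hypothetical data) contrasting R2 and adjusted R2 when an irrelevant regressor is added:

```r
# Hypothetical data: y depends on x only; 'noise' is an irrelevant regressor.
set.seed(3)
n <- 200
x     <- rnorm(n)
noise <- rnorm(n)
y     <- 1 + 2 * x + rnorm(n)

fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + noise)

c(summary(fit1)$r.squared,     summary(fit2)$r.squared)      # R^2 rises, if only a little
c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared)  # adjusted R^2 penalizes the extra regressor
```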
4.6. Multicollinearity
• Perfect multicollinearity: One of the regressors is a perfect (exact) linear function of the other regressors
o In this case it is impossible to compute the OLS estimates
o The dummy variable “trap”
o How does R handle perfect multicollinearity? Try it to see (a sketch follows this list).
• Imperfect multicollinearity: two or more regressors are highly correlated
o Causes large standard errors on coefficients
o Not a logical flaw, but it can be hard to distinguish the independent effects of two variables that
are highly correlated.
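A minimal R sketch (hypothetical data) of the dummy variable trap; by default lm() handles perfect multicollinearity by dropping a redundant regressor and reporting NA for its coefficient:

```r
# Hypothetical data illustrating the dummy variable trap.
set.seed(4)
n <- 100
male   <- rbinom(n, 1, 0.5)
female <- 1 - male                     # perfectly collinear with the intercept and 'male'
wage   <- 20 + 3 * male + rnorm(n, sd = 4)

coef(lm(wage ~ male + female))         # R drops 'female' and reports NA for its coefficient
coef(lm(wage ~ male))                  # the usual fix: leave out one dummy category
```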

4.7. Judging the magnitude of regression coefficients: effect size


• Statistical significance is important, in the sense that we want to know if there is any effect that is
distinguishable from zero.
• But… is the effect economically significant? I.e., is it big enough to matter? This may be MORE
important!
• There is no standard way of answering this: it depends on context and research question.
• Some ways to think about magnitude:
o Based on the units, does the coefficient seem large? Example: effect of college degree on earnings
o Is there previous research on the subject that establishes a benchmark for comparison?
o Is there some other “natural” comparison? Examples:
▪ Compare estimated return to an additional year of education with rate of return on an
alternative investment.
▪ Benefit-cost calculation: For example, estimate cost in dollars of increasing test score via
hiring teachers.
o Within the range of variation in the data, do “typical” changes in X account for a substantial
amount of variation in Y?
• One approach is to calculate the effect on Y, measured in standard deviations of Y, of a one standard
deviation change in X. Then coefficients on variables measured in completely different units can be
compared. One could ask, “What would be the effect on the outcome of a change that would be typical or
reasonable for that variable, i.e., a one standard deviation change?” (See the sketch below.)
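A minimal R sketch (hypothetical data, shown with a single regressor for brevity) of this standard-deviation comparison:

```r
# Hypothetical variables x and y measured in arbitrary units.
set.seed(5)
n <- 300
x <- rnorm(n, mean = 50, sd = 8)
y <- 10 + 0.6 * x + rnorm(n, sd = 12)

b1 <- coef(lm(y ~ x))["x"]
b1 * sd(x) / sd(y)                 # effect, in SDs of Y, of a one-SD change in X

# Equivalent: regress standardized y on standardized x.
y_std <- as.numeric(scale(y))
x_std <- as.numeric(scale(x))
coef(lm(y_std ~ x_std))["x_std"]
```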
