
EC220/221 Introduction to Econometrics

Canh Thien Dang


[email protected]

Michael Gmeiner
[email protected]

Lecture 8: Bivariate Regression


Michaelmas Term
14 October 2021
Outline
Previous lectures
• Omitted variable bias and how it affects regression estimates.
• Selection bias and how removing OVB reduces selection bias.
• In a case study, we found that attending a private school has seemingly no effect on earnings once selectivity is controlled for.

This lecture
• Application: Does class size affect student outcomes?
• Using regression as a data descriptive tool.
• The Conditional Expectation Function (CEF).
• Mechanics of regression (some mathematical derivations).
• Interpreting regression results.

Class-Size Effects: Small is Good, Big is Bad?

Do smaller classes result in better outcomes for students?


Class-Size Effects: Small is Good, Big is Bad?
• Do smaller classes result in better outcomes for students?
• Smaller classes allow teaching to needs of specific students and increase participation.
• Bigger classes foster self-learning skills.

• What if we compare scores of students in countries with large and small average class sizes?
• Suppose we obtain data on class size in several countries, and data on scores for a standardised test (e.g., the PISA test, the Programme for International Student Assessment).
• Countries with larger (or smaller) average class size might have other educational norms that harm (or
improve) test score averages.

𝐸[𝑌𝑖 | 𝐷𝑖 = 1] − 𝐸[𝑌𝑖 | 𝐷𝑖 = 0] = 𝐸[𝑌1𝑖 − 𝑌0𝑖 | 𝐷𝑖 = 1] + ( 𝐸[𝑌0𝑖 | 𝐷𝑖 = 1] − 𝐸[𝑌0𝑖 | 𝐷𝑖 = 0] )

Observed difference in average score = Average treatment effect on the treated + Selection bias
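A purely hypothetical illustration of this decomposition (the numbers are invented for this example, not taken from PISA data): suppose countries with large average classes score 20 points lower on average (observed difference = −20), but their other educational norms would have produced scores 15 points lower even with small classes (selection bias = −15). Then the average treatment effect on the treated is −20 − (−15) = −5, and the raw comparison overstates the harm of larger classes.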

Class-Size Effects: A Study in California
• Data: Primary school students in California school districts (n = 420 districts) in 1999.

• Variables:
• The outcome, 𝑌, is 5th grade test scores (Stanford-9
achievement test). Specifically, we observe the district
average for the sum of math and reading scores.

• Class Size - Student-teacher ratio (STR) = number of students in the district divided by the number of full-time equivalent teachers.

Possible Confounder
• Do smaller classes result in better outcomes for students?

• First, why do some districts have small classes and others large classes?
• Primarily due to income of local families (local taxes fund schools).
• Bias will result due to residential sorting of high-income families whose children may have
different test scores anyway.

[Causal diagram: Treatment (Class Size) → Outcome of Interest (Test Scores), with the causal effect in question; Confounder (District Incomes) affects both.]
Scatterplot

Stata command:

twoway scatter yvariable xvariable

If you want to learn about aesthetics of graph making, type “help twoway” into the Stata command line.
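As a concrete illustration (a minimal sketch, assuming the California data are already loaded with the variable names testscr and str, the names that appear in the regression output later in this lecture):

* Sketch: scatterplot of district test scores against the student-teacher ratio
* (assumes variables testscr and str are in memory)
twoway scatter testscr str, ytitle("Test score") xtitle("Student-teacher ratio")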

Some Observations About the Data
• Data are all primary school districts in California in 1999.
• We observe the whole population of interest. There are no
sampling issues like in lectures 4 and 5 when we talked about
standard errors of sample estimates.

• In the raw data we observe a relationship between class size and test scores.

• The outcome has high variance, even for the same class size.
• This means there are a lot of other factors causing different test
scores across districts.
• There is often a lot of randomness in relationships we study in
economics.

The Conditional Expectation Function
• We want to know the value of the test score “on average” across districts for a
given class size.

• Let 𝑌𝑖 be the test score in school district 𝑖, and 𝑋𝑖 the student-teacher ratio:
𝐸(𝑌𝑖 |𝑋𝑖 )
is the conditional expectation function (CEF);
𝐸(𝑌𝑖 |𝑋𝑖 = 𝑥)
is the value of the CEF for a particular value of 𝑋𝑖 , say 𝑥 = 18.

• The CEF is a commonly-used summary of the bivariate relationship.

The Conditional Expectation Function - Example

Class Size   Test Score
    18          700
    18          750
    19          750
    19          700
    19          650
    20          600
    21          580

• 𝐸[𝑌 | 𝑋 = 18] = 725
• 𝐸[𝑌 | 𝑋 = 19] = 700
• 𝐸[𝑌 | 𝑋 = 20] = 600
• 𝐸[𝑌 | 𝑋 = 21] = 580
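A minimal Stata sketch that reproduces these conditional means from the toy data above (the variable names class_size and test_score are made up for this illustration):

clear
input class_size test_score
18 700
18 750
19 750
19 700
19 650
20 600
21 580
end
bysort class_size: egen cef = mean(test_score)
list class_size test_score cef, sepby(class_size)   // cef = 725, 700, 600, 580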

The Conditional Expectation Function

Data are in “bins” for class size in categories: [14,15), [15,16), …, [25,26).
The average 𝑌 is calculated for the observations in each bin, and that average is plotted on the vertical axis against the lower bound of the bin on the horizontal axis.
(e.g., at 𝑋 = 14 the plotted point is the average score for classes with student-teacher ratio in [14,15).)
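A sketch of how a binned CEF plot like this could be constructed in Stata, assuming testscr and str are in memory (the binning step and the variable names bin_lower and mean_score are specific to this sketch):

gen bin_lower = floor(str)                      // lower bound of the [k, k+1) bin
bysort bin_lower: egen mean_score = mean(testscr)
twoway scatter mean_score bin_lower, ytitle("Average test score") xtitle("Student-teacher ratio (bin lower bound)")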

The Data and the CEF

This is nice, but what we want is a single number, 𝛽, for which we can say “the effect on test scores of increasing the student-teacher ratio by 1 is 𝛽”.

Simplifying the CEF
• Goal: summarize the relationship of 𝑋 and 𝑌 by creating a straight line that “best”
approximates the CEF.
• Choose 𝑎, 𝑏 such that 𝑎 + 𝑏𝑥 is “really close” to the CEF.
• How do we choose “the best” line that approximates the CEF?

[Figure: the CEF points with a candidate straight line 𝑎 + 𝑏𝑥 drawn through them; the vertical intercept is 𝒂 and the slope is 𝒃.]
Least Squares Minimization
The “best” line minimises the expectation of the squared distance from the line to the CEF:

𝐸[ (𝐸[𝑌𝑖 | 𝑋𝑖] − 𝑎 − 𝑏𝑋𝑖)² ]

• The 𝑎 and 𝑏 that minimize the above give the “best linear predictor” (BLP) of 𝑌 given 𝑋.

• The BLP of the CEF and the BLP of 𝑌 happen to be the same, i.e., the same 𝑎 and 𝑏 minimize both
𝐸[ (𝐸[𝑌𝑖 | 𝑋𝑖] − 𝑎 − 𝑏𝑋𝑖)² ]  and  𝐸[ (𝑌𝑖 − 𝑎 − 𝑏𝑋𝑖)² ]

In practice, we don’t observe 𝐸[𝑌𝑖 | 𝑋𝑖], but we do observe 𝑌𝑖. The line that approximates the CEF is therefore obtained by solving the second minimization problem, the one in terms of 𝑌𝑖. This is the OLS regression line.

The proof of equivalence is beyond the scope of the course.

Data, the CEF, and the Regression Line

twoway (scatter yvar xvar) (lfit yvar xvar)
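With the California variable names (testscr and str), one concrete version of this command would be:

twoway (scatter testscr str) (lfit testscr str)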

Residuals
min over 𝑎, 𝑏:  𝐸[(𝑌𝑖 − 𝑎 − 𝑏𝑋𝑖)²]

• For any, not necessarily minimizing, 𝑎′ and 𝑏′, the residual is the difference between the point and the line.
• If 𝑋𝑖 = 24, 𝑌𝑖 = 679, 𝑎′ = 760, and 𝑏′ = −5,

𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 = 𝑌𝑖 − 𝑎′ − 𝑏′𝑋𝑖 = 679 − 760 − (24)(−5) = 39

Residuals are negative if a point is below the line.

• If 𝑋𝑖 = 20, 𝑌𝑖 = 610, 𝑎′ = 760, and 𝑏′ = −5,

𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 = 𝑌𝑖 − 𝑎′ − 𝑏′𝑋𝑖 = 610 − 760 − (20)(−5) = −50

We choose 𝑎 and 𝑏 to minimize the expected squared residual.
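These two residuals can be checked with Stata’s display calculator:

display 679 - 760 - 24*(-5)    //  39
display 610 - 760 - 20*(-5)    // -50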


Mechanics
The regression line is the line 𝛼 + 𝛽𝑋 where 𝛼 and 𝛽 minimize 𝐸[(𝑌𝑖 − 𝑎 − 𝑏𝑋𝑖)²].
To minimise this function, we take derivatives and set them equal to 0.
Multiplying by 𝑛 does not change the minimization.

Residual sum of squares: 𝑅𝑆𝑆 = 𝑛𝐸[(𝑌𝑖 − 𝑎 − 𝑏𝑋𝑖)²]

𝜕𝑅𝑆𝑆/𝜕𝑎 = −2𝑛𝐸[𝑌𝑖 − 𝛼 − 𝛽𝑋𝑖] = 0          𝜕𝑅𝑆𝑆/𝜕𝑏 = −2𝑛𝐸[(𝑌𝑖 − 𝛼 − 𝛽𝑋𝑖)𝑋𝑖] = 0

Dividing by −2𝑛, the first-order conditions become:

𝐸[𝑌𝑖 − 𝛼 − 𝛽𝑋𝑖] = 0          𝐸[(𝑌𝑖 − 𝛼 − 𝛽𝑋𝑖)𝑋𝑖] = 0

After taking the derivatives, we use 𝛼 and 𝛽, rather than 𝑎 and 𝑏, to denote the population parameters that solve the minimisation (which is at the point where the derivatives are 0).
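The two first-order conditions say that the residuals average zero and are uncorrelated with 𝑋. A sketch of a numerical check in Stata, assuming testscr and str are in memory (e_hat is a name chosen for this sketch):

quietly regress testscr str
predict e_hat, residuals        // e_hat = testscr - alpha - beta*str
summarize e_hat                 // mean is (numerically) zero
correlate e_hat str             // correlation is (numerically) zero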
Review?

𝑌𝑖 and 𝑋𝑖 are random variables. 𝐶𝑜𝑣(𝑌𝑖 , 𝑋𝑖 ) is


A. 𝐸(𝑌𝑖 𝑋𝑖 )

B. 𝐸(𝑌𝑖 )𝐸(𝑋𝑖 )

C. 𝐸(𝑌𝑖 𝑋𝑖 ) − 𝐸(𝑌𝑖 )𝐸(𝑋𝑖 )

D. Undefined

Also, 𝑉𝑎𝑟(𝑋𝑖) = 𝐸(𝑋𝑖²) − [𝐸(𝑋𝑖)]²

Solving for Estimates
Necessary conditions of the minimisation problem are:
(1) 𝐸[𝑌𝑖 − 𝛼 − 𝛽𝑋𝑖] = 0          (2) 𝐸[(𝑌𝑖 − 𝛼 − 𝛽𝑋𝑖)𝑋𝑖] = 0

The first condition gives us:
𝐸[𝑌𝑖] − 𝛼 − 𝛽𝐸[𝑋𝑖] = 0  ⟹  𝛼 = 𝑬[𝒀𝒊] − 𝛽𝑬[𝑿𝒊]

Using the second condition:
𝐸[(𝑌𝑖 − 𝛼 − 𝛽𝑋𝑖)𝑋𝑖] = 0  ⟹  𝐸[𝑌𝑖𝑋𝑖 − 𝛼𝑋𝑖 − 𝛽𝑋𝑖²] = 0  ⟹  𝐸[𝑌𝑖𝑋𝑖] − 𝛼𝐸[𝑋𝑖] − 𝛽𝐸[𝑋𝑖²] = 0
𝐸[𝑌𝑖𝑋𝑖] − 𝛽𝐸[𝑋𝑖²] = 𝛼𝐸[𝑋𝑖]
𝐸[𝑌𝑖𝑋𝑖] − 𝛽𝐸[𝑋𝑖²] = (𝐸[𝑌𝑖] − 𝛽𝐸[𝑋𝑖])𝐸[𝑋𝑖]
𝐸[𝑌𝑖𝑋𝑖] − 𝛽𝐸[𝑋𝑖²] = 𝐸[𝑌𝑖]𝐸[𝑋𝑖] − 𝛽(𝐸[𝑋𝑖])²
𝐸[𝑌𝑖𝑋𝑖] − 𝐸[𝑌𝑖]𝐸[𝑋𝑖] = 𝛽[𝐸[𝑋𝑖²] − (𝐸[𝑋𝑖])²]
𝐶𝑜𝑣(𝑌𝑖, 𝑋𝑖) = 𝛽 𝑉𝑎𝑟(𝑋𝑖)

𝛽 = 𝑪𝒐𝒗(𝒀𝒊, 𝑿𝒊) / 𝑽𝒂𝒓(𝑿𝒊)
Solution
The regression line is the line 𝛼 + 𝛽𝑋 where 𝛼 and 𝛽 minimize 𝐸[(𝑌𝑖 − 𝑎 − 𝑏𝑋𝑖)²].
The solution is given by:

𝛽 = 𝐶𝑜𝑣(𝑌𝑖, 𝑋𝑖) / 𝑉𝑎𝑟(𝑋𝑖)
𝛼 = 𝐸[𝑌𝑖] − 𝛽𝐸[𝑋𝑖]

More formal derivations are discussed in Wooldridge Section 2.2 and in LT.
This estimator is called Ordinary Least Squares or OLS.
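A sketch that checks the Cov/Var formula against Stata’s regress command, assuming the California data are in memory as testscr and str (the scalar names slope, ybar, and intercept are chosen for this sketch):

quietly correlate testscr str, covariance
matrix C = r(C)                        // C[1,2] = Cov(testscr, str), C[2,2] = Var(str)
scalar slope = C[1,2] / C[2,2]
quietly summarize testscr
scalar ybar = r(mean)
quietly summarize str
scalar intercept = ybar - slope*r(mean)
display "beta = " slope "   alpha = " intercept
regress testscr str                    // the reported coefficients should match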

Why Squared and not Absolute Value?
• This is not examinable.
• Squaring forces the line to fit outliers more closely. An outlier will be far from any estimated line, and squaring penalises such large deviations much more heavily than the absolute value does. Thus, the line that minimises the sum of squared residuals does not leave residuals at outliers as large.

• It is easier to solve for the solution because the absolute value function is not
everywhere-differentiable.

• The least squares estimator has nice properties that we will continue to learn
throughout studying metrics.
• Nevertheless, there is an estimator that minimizes the sum of absolute residuals, called least absolute deviations (LAD). It is beyond the scope of EC220.

Interpretation – 𝛼
𝑌𝑖 = α + 𝛽𝑋𝑖 + 𝑒𝑖

In the California class size data,
𝛽 = 𝐶𝑜𝑣(𝑋, 𝑌) / 𝑉𝑎𝑟(𝑋) = −8.15932 / 3.579 = −2.28
and α = 𝐸[𝑌] − (−2.28)𝐸[𝑋] = 654.16 − (−2.28)(19.64) = 698.9.

Ŷ𝑖 = 698.9 − 2.28𝑋𝑖

• 𝛼 = 698.9: the vertical-axis intercept.

• 𝛼 is “the expected 𝑌 if 𝑋 = 0”, but often 𝑋 = 0 is logically impossible or never occurs. In such a case, 𝛼 primarily serves the role of defining the line of best fit, but the numerical value is not of practical interest.

Interpretation – 𝛽 and 𝑒
𝑌𝑖 = α + 𝛽𝑋𝑖 + 𝑒𝑖
In the California class size data, the regression line is:
Ŷ𝑖 = 698.9 − 2.28𝑋𝑖
• 𝛼 = 698.9: the vertical-axis intercept.
• 𝛽 = −2.28: the slope coefficient of the line.
• The change in 𝑌 when 𝑋 increases by 1.

• 𝑒𝑖: the error, the effect of all factors other than 𝑋 on 𝑌.
• The estimate for the error is the residual. In this setting we have population data, so the residual is the true error.
• If 𝑋 = 20 and 𝑌 = 620, then
𝑒𝑖 = 𝑌𝑖 − Ŷ𝑖 = 620 − (698.9 − 2.28 × 20) = −33.3

Stata Output
𝑌𝑖 = α + 𝛽𝑋𝑖 + 𝑒𝑖
Ŷ𝑖 = 698.9 − 2.28𝑋𝑖

. regress testscr str

Source | SS df MS Number of obs = 420
-------------+------------------------------ F( 1, 418) = 22.58
Model | 7794.11004 1 7794.11004 Prob > F = 0.0000
Residual | 144315.484 418 345.252353 R-squared = 0.0512
-------------+------------------------------ Adj R-squared = 0.0490
Total | 152109.594 419 363.030056 Root MSE = 18.581

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -2.279808 .4798256 -4.75 0.000 -3.22298 -1.336637
_cons | 698.933 9.467491 73.82 0.000 680.3231 717.5428
------------------------------------------------------------------------------

Since the dataset is also our population of interest, we can take these as the population parameters, so we do not need to pay attention to the standard errors reported here. We will learn how to use standard errors when estimating with samples next week.
Overview
The estimation equation (regression model)
𝑌𝑖 = α + 𝛽𝑋𝑖 + 𝑒𝑖
• 𝑌𝑖 is the “dependent variable” or “outcome”
• 𝑋𝑖 is the “independent variable”, a “regressor”, or a “covariate”
• 𝛽 = 𝐶𝑜𝑣(𝑋, 𝑌) / 𝑉𝑎𝑟(𝑋) is the slope coefficient (often just “coefficient”)

• α = 𝐸[𝑌] − 𝛽𝐸[𝑋] is the “constant” or “intercept”


• 𝑒𝑖 is the “error”, the effect of all factors other than 𝑋 on 𝑌.
• The estimate of the error is the residual, which coincides with the true error because we have population data.

• Analogously, because we have population data, estimates of 𝛼 and 𝛽 coincide with the population parameters.
• If we had a sample, rather than a population of data, we would use α̂ and β̂ to denote the estimates, and ê to denote the residual. “Hats” denote estimates that may differ from population values.

Data Points and the Regression Equation
𝑌𝑖 = α + 𝛽𝑋𝑖 + 𝑒𝑖

The regression partitions each data point into two pieces:

(1) The predicted (fitted) value of the outcome (after we fit the parameters):
Ŷ𝑖 = α + 𝛽𝑋𝑖

This represents the 𝑌 we would expect given 𝑋 if 𝑒 = 0.

(2) The residual after fitting:
𝑒𝑖 = 𝑌𝑖 − Ŷ𝑖 = 𝑌𝑖 − α − 𝛽𝑋𝑖
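In Stata, both pieces can be recovered after the regression with predict (a sketch, assuming testscr and str are in memory; y_hat and u_hat are names chosen for this sketch):

quietly regress testscr str
predict y_hat, xb              // fitted values: 698.9 - 2.28*str
predict u_hat, residuals       // residuals: testscr - y_hat
list testscr str y_hat u_hat in 1/5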

Predicted Values and Residuals

𝑌𝑖 = 𝛼 + 𝛽𝑋𝑖 + 𝑒𝑖

Ŷ𝑖 = 𝛼 + 𝛽𝑋𝑖 is the point on the regression line for 𝑋𝑖.
If 𝑋𝑖 = 9, 𝛼 = 4, and 𝛽 = 0.2, then Ŷ𝑖 = 𝛼 + 𝛽𝑋𝑖 = 4 + 0.2(9) = 5.8.

The residual, 𝑒𝑖, is the difference between the point and the line: 𝑒𝑖 = 𝑌𝑖 − Ŷ𝑖.
If the individual had 𝑌𝑖 = 3, then 𝑒𝑖 = 3 − 5.8 = −2.8.
Your Turn
If 𝑋𝑖 = 21, what is the predicted value Ŷ𝑖 given the following regression?
𝑌𝑖 = 698.9 − 2.28𝑋𝑖 + 𝑒𝑖
A. 651.02
B. 735.5
C. 698.9
D. 696.62

Then if the true 𝑌𝑖 = 650, what is ê𝑖?

Predicted Values are Defined by the Line

[Figure: the data point at 𝑋𝑖 = 21 and the regression line; the predicted value on the line is Ŷ𝑖 = 651.02.]
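The arithmetic behind this answer, and the residual asked for on the previous slide, can be checked with display:

display 698.9 - 2.28*21             // predicted value: 651.02
display 650 - (698.9 - 2.28*21)     // residual: -1.02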

Bivariate Regressions: Summary
• Bivariate relationships can be graphed in a scatterplot.
• This provides useful but sometimes overwhelming information.

• The conditional expectation function summarises the relationship between the dependent variable and the regressor without imposing a functional form.

• Linear regression is the best linear approximation to the data and to the CEF. It
neatly summarises the relationship between two variables in the regression slope
coefficient.

