
Assignment 2: ECON-UA 266 - Intro to Econometrics

Sahar Parsa

Fall 2024

The second assignment is due on Friday, September 20th, 2024. It covers the material related to the population
regression model and the OLS estimator. For the Data questions and any other questions that rely on using
R, report the output of your analysis in a “report style” that is pleasing to read, and add the code you used to
generate your results. Do not hand in the raw data, the raw output from R, or any intermediary output
unless stated otherwise. You are encouraged to discuss the problems with others, but you must write
up your own results. Do not copy someone else’s answer.
IMPORTANT DISCLAIMER: The homework is not graded. The points are only to give you information
about the weight assigned to each question.

Question 1 [5 points]
Show that $\bar{\hat{Y}} = \bar{Y}$.
Solution :
From class, we know we can write the predicted/fitted value as:

$$\hat{Y} = \alpha_{OLS} + \beta_{OLS} X$$

where we understand $\alpha_{OLS}$ and $\beta_{OLS}$ to have been obtained from the FOC we derived in class, and where
$\hat{Y}$ is the predicted value taken as a random variable. Another way to interpret an OLS regression is to write
each $Y_i$ as its fitted value plus its residual. For each $i$, write:

$$Y_i = \hat{Y}_i + \hat{e}_i$$

To show that $\bar{\hat{Y}} = \bar{Y}$, we sum each side of the equation and divide by $N$:

$$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} (\hat{Y}_i + \hat{e}_i)$$

$$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} \hat{Y}_i + \frac{1}{N}\sum_{i=1}^{N} \hat{e}_i$$

$$\bar{Y} = \bar{\hat{Y}} + \bar{\hat{e}}$$

where we know that $\bar{\hat{e}} = 0$ from the algebraic properties of the OLS estimators. Recall that one of these properties
is that the sum, and therefore the sample average, of the OLS residuals is zero. In other words:

$$\sum_{i=1}^{N} \hat{e}_i = 0$$

With this property in mind, the second term on the right side of the above equation becomes zero and we
simplify the expression as:

$$\bar{Y} = \bar{\hat{Y}}$$
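As a quick numerical check of this result, the R sketch below fits an OLS regression on simulated data and compares the sample mean of the fitted values with the sample mean of Y. The data-generating process (the coefficients 2 and 0.5, the sample size, and the seed) is an arbitrary assumption chosen for illustration and is not part of the assignment.

```r
# Simulated data; all numbers here are illustrative assumptions
set.seed(266)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

# OLS regression of y on x with an intercept
fit <- lm(y ~ x)

# Sample averages of the fitted values, of y, and of the residuals
mean_fitted <- mean(fitted(fit))
mean_y <- mean(y)
mean_resid <- mean(resid(fit))

print(c(mean_fitted = mean_fitted, mean_y = mean_y, mean_resid = mean_resid))
```

Up to floating-point error, the first two numbers should coincide and the third should be zero. Note that this relies on the regression including an intercept.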

Question 2 [15 points]
Let X and Y have joint pdf: $P_{X,Y}(x, y) = \frac{x+y}{15}$, where $x = 1, 2, 3$ and $y = 0, 1$
a. [5 points] Find the Covariance and correlation of X and Y (write the formula and then find the
covariance and coefficient of correlation)
Solution :
Let’s first derive the joint pdf:

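$$\begin{aligned}
P_{X,Y}(X = 1, Y = 0) &= \frac{1+0}{15} = \frac{1}{15} \\
P_{X,Y}(X = 2, Y = 0) &= \frac{2+0}{15} = \frac{2}{15} \\
P_{X,Y}(X = 3, Y = 0) &= \frac{3+0}{15} = \frac{3}{15} \\
P_{X,Y}(X = 1, Y = 1) &= \frac{1+1}{15} = \frac{2}{15} \\
P_{X,Y}(X = 2, Y = 1) &= \frac{2+1}{15} = \frac{3}{15} \\
P_{X,Y}(X = 3, Y = 1) &= \frac{3+1}{15} = \frac{4}{15}
\end{aligned}$$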
Additionally, let’s derive the marginal pdf:

$$P_X(X = 1) = \frac{3}{15}, \quad P_X(X = 2) = \frac{5}{15}, \quad P_X(X = 3) = \frac{7}{15}$$

$$P_Y(Y = 0) = \frac{6}{15}, \quad P_Y(Y = 1) = \frac{9}{15}$$

Now solving for Cov(X, Y ):

$$Cov(X, Y) = E[(Y - E[Y])(X - E[X])] = E[XY] - E[X]E[Y]$$


Note that the random variable XY can take the following values: xy = 0, 1, 2, 3. We get this by multiplying
each possible value for X by each possible value for Y.

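$$E[XY] = (0)\frac{6}{15} + (1)\frac{2}{15} + (2)\frac{3}{15} + (3)\frac{4}{15} = \frac{2}{15} + \frac{6}{15} + \frac{12}{15} = \frac{20}{15}$$

$$E[X] = (1)\frac{3}{15} + (2)\frac{5}{15} + (3)\frac{7}{15} = \frac{3}{15} + \frac{10}{15} + \frac{21}{15} = \frac{34}{15}$$

$$E[Y] = (0)\frac{6}{15} + (1)\frac{9}{15} = 0 + \frac{9}{15} = \frac{9}{15}$$

$$Cov(X, Y) = E[XY] - E[X]E[Y] = \frac{20}{15} - \left(\frac{34}{15}\right)\left(\frac{9}{15}\right) = \frac{300}{225} - \frac{306}{225} = -\frac{6}{225} \approx -0.027$$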
Now solving for Corr(X, Y ):

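$$Corr(X, Y) = \rho_{X,Y} = \frac{Cov(X, Y)}{sd(X)\,sd(Y)} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

$$\sigma_X = \sqrt{E[X^2] - E[X]^2} \quad \text{and} \quad \sigma_Y = \sqrt{E[Y^2] - E[Y]^2}$$

$$\sigma_X = \sqrt{\left((1^2)\frac{3}{15} + (2^2)\frac{5}{15} + (3^2)\frac{7}{15}\right) - \left(\frac{34}{15}\right)^2} = \sqrt{\left(\frac{3}{15} + \frac{20}{15} + \frac{63}{15}\right) - \frac{1{,}156}{225}} = \sqrt{\frac{1{,}290}{225} - \frac{1{,}156}{225}} \approx 0.772$$

$$\sigma_Y = \sqrt{\left((0^2)\frac{6}{15} + (1^2)\frac{9}{15}\right) - \left(\frac{9}{15}\right)^2} = \sqrt{\left(0 + \frac{9}{15}\right) - \frac{81}{225}} = \sqrt{\frac{135}{225} - \frac{81}{225}} \approx 0.490$$

$$Corr(X, Y) = \rho_{X,Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \approx \frac{-0.027}{(0.772)(0.490)} \approx -0.071$$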
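The hand calculation above can be reproduced in R. The sketch below builds the joint pmf as a matrix and computes the marginals, the covariance, and the correlation from it; the variable names are illustrative choices and not part of the assignment.

```r
# Support of X and Y, and the joint pmf P(x, y) = (x + y) / 15
x_vals <- 1:3
y_vals <- 0:1
joint <- outer(x_vals, y_vals, function(x, y) (x + y) / 15)
dimnames(joint) <- list(x = x_vals, y = y_vals)

# Marginal pmfs: sum the joint pmf over the other variable
p_x <- rowSums(joint)
p_y <- colSums(joint)

# Moments
e_x <- sum(x_vals * p_x)
e_y <- sum(y_vals * p_y)
e_xy <- sum(outer(x_vals, y_vals) * joint)

# Covariance, standard deviations, and correlation
cov_xy <- e_xy - e_x * e_y
sd_x <- sqrt(sum(x_vals^2 * p_x) - e_x^2)
sd_y <- sqrt(sum(y_vals^2 * p_y) - e_y^2)
corr_xy <- cov_xy / (sd_x * sd_y)

print(c(cov = cov_xy, sd_x = sd_x, sd_y = sd_y, corr = corr_xy))
```

Because the code does not round intermediate results, the correlation it reports (about -0.0705) differs slightly from the -0.071 obtained by hand from the rounded inputs.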
b. [5 points] Find E[Y |X] (again write the formula first)
Solution :

For discrete random variables, the conditional expectation is written generally as
$E[Y|X = x] = \sum_{t \in T} t\,P(Y = t|X = x)$, where $T$ is the support of $Y$ (i.e. all the possible values that $Y$ can take). Additionally,
remember that $P(Y = t|X = x) = \frac{P(X = x, Y = t)}{P(X = x)}$.

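$$\begin{aligned}
E[Y|X = 1] &= (0)\left(\frac{1}{3}\right) + (1)\left(\frac{2}{3}\right) = \frac{2}{3} \\
E[Y|X = 2] &= (0)\left(\frac{2}{5}\right) + (1)\left(\frac{3}{5}\right) = \frac{3}{5} \\
E[Y|X = 3] &= (0)\left(\frac{3}{7}\right) + (1)\left(\frac{4}{7}\right) = \frac{4}{7}
\end{aligned}$$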
Note that E[Y |X] is a random variable, as X is a random variable.

2/3
 if X = 1 (X=1 with probability 3/15)
E[Y |X] = 3/5 if X = 2 (X=2 with probability 5/15) (1)

4/7 if X = 3 (X=3 with probability 7/15)

We can write this as: 


2/3
 with probability 3/15)
E[Y |X] = 3/5 with probability 5/15) (2)

4/7 with probability 7/15)

c. [5 points] Calculate directly E[E[Y |X]] and hence show that it is equal to E[Y ]. This is known as the
law of iterated expectation.
Solution :

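$$E[E[Y|X]] = \left(\frac{2}{3}\right)\left(\frac{3}{15}\right) + \left(\frac{3}{5}\right)\left(\frac{5}{15}\right) + \left(\frac{4}{7}\right)\left(\frac{7}{15}\right) = \frac{6}{45} + \frac{15}{75} + \frac{28}{105} = 0.6 = E[Y]$$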
Thus the law of iterated expectation holds.
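The conditional expectations and the law of iterated expectation can also be checked numerically. The following R sketch rebuilds the joint pmf, derives P(Y = y | X = x) by dividing each row by P(X = x), and then averages E[Y | X = x] over the distribution of X. The variable names are illustrative assumptions.

```r
# Joint pmf P(x, y) = (x + y) / 15 on x = 1, 2, 3 and y = 0, 1
x_vals <- 1:3
y_vals <- 0:1
joint <- outer(x_vals, y_vals, function(x, y) (x + y) / 15)

# Marginal pmf of X
p_x <- rowSums(joint)

# Conditional pmf P(Y = y | X = x): row i of the joint pmf divided by P(X = x_i)
cond_y_given_x <- joint / p_x

# E[Y | X = x] for each value of x
e_y_given_x <- as.vector(cond_y_given_x %*% y_vals)

# Law of iterated expectation: average E[Y | X = x] over the pmf of X
lie <- sum(e_y_given_x * p_x)

# Direct calculation of E[Y] from the marginal pmf of Y
e_y <- sum(y_vals * colSums(joint))

print(e_y_given_x)
print(c(lie = lie, e_y = e_y))
```

The first printed vector should match 2/3, 3/5 and 4/7, and the two entries of the second should both equal 0.6.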

Question 3 [10 points]


In class, we introduced two different concepts to study the relationship between X and Y . The first object
was the Conditional Expectation Function (CEF), and the second object was the univariate linear regression
model (LRM). Although the CEF is not always linear, when it is linear, the LRM is the CEF. One
special case where the CEF is linear is when X takes one of two values as follows:
Consider E[Y |X] where X is a dummy variable that equals one with probability p and is zero otherwise.
Prove that the CEF and the regression of Y on X are the same in this case. Do this by showing that for
Bernoulli X:

$$\alpha = E[Y] - \beta E[X] = E[Y|X = 0]$$

$$\beta = Cov(X, Y)/Var(X) = E[Y|X = 1] - E[Y|X = 0]$$

Solution :
First, consider the formula for the slope:

$$\beta = Cov(X, Y)/Var(X)$$

Remember that

$$Cov(X, Y) = E[XY] - E[X]E[Y]$$

where $E[X] = Pr(X = 1) = p$. Applying the law of iterated expectation, we can rewrite

$$E[XY] = E[E[Y|X]X] = E[Y|X = 1]Pr(X = 1) \times 1 + E[Y|X = 0]Pr(X = 0) \times 0 = E[Y|X = 1]p$$

and

$$E[Y] = E[E[Y|X]] = E[Y|X = 1]Pr(X = 1) + E[Y|X = 0]Pr(X = 0) = E[Y|X = 1]p + E[Y|X = 0](1 - p).$$

Hence, we can rewrite

$$Cov(X, Y) = E[Y|X = 1]p - (E[Y|X = 1]p + E[Y|X = 0](1 - p))p = (E[Y|X = 1] - E[Y|X = 0])(1 - p)p$$

On the other hand, the denominator is the variance of a Bernoulli random variable, given by:

$$Var(X) = (1 - p)p$$

It follows that:

$$\beta = E[Y|X = 1] - E[Y|X = 0]$$

The slope is the difference between the conditional expectation of $Y$ given $X = 1$ and the conditional expectation of $Y$ given $X = 0$.
For the intercept:

$$\alpha = E[Y] - \beta E[X] = E[Y|X = 1]p + E[Y|X = 0](1 - p) - (E[Y|X = 1] - E[Y|X = 0])p = E[Y|X = 0]$$

where we used the fact that E[X] = p.
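The sample analogue of this result is easy to see in R: when the regressor is a dummy, the OLS intercept equals the sample mean of Y in the X = 0 group and the OLS slope equals the difference in group means. The sketch below uses simulated data; the values of p, the group means, and the seed are arbitrary assumptions for illustration.

```r
# Simulated data with a Bernoulli regressor; all numbers are illustrative
set.seed(266)
n <- 1000
p <- 0.3
x <- rbinom(n, size = 1, prob = p)
y <- 1 + 2 * x + rnorm(n)

# OLS regression of y on the dummy x
fit <- lm(y ~ x)
alpha_hat <- unname(coef(fit)["(Intercept)"])
beta_hat <- unname(coef(fit)["x"])

# Sample conditional means of y in each group
mean_y0 <- mean(y[x == 0])
mean_y1 <- mean(y[x == 1])

print(c(alpha_hat = alpha_hat, mean_y0 = mean_y0))
print(c(beta_hat = beta_hat, diff_in_means = mean_y1 - mean_y0))
```

Each printed pair should agree up to floating-point error, mirroring the population identities derived above.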

Question 4 (Wooldridge Chapter 2 question 5) [15 points]


In the linear consumption function

$$\widehat{cons} = \hat{\alpha} + \hat{\beta}\,inc$$

the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, while the
average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\alpha}/inc + \hat{\beta}$.
Using observations for 100 families on annual income and consumption (both measured in dollars), the
following equation is obtained:

$$\widehat{cons} = 124.84 + 0.853\,inc$$
a. [5 points] Interpret the intercept in this equation, and comment on its sign and magnitude.
Solution :
The positive intercept indicates that if a given family had zero annual income, its predicted consumption
would be $124.84, which of course cannot literally be true. You can see that for low levels of income, this
linear function would not describe the relationship between income and consumption very well, which is why
we will eventually have to use other types of functions to describe such relationships.

b. [5 points] What is the predicted consumption when family income is $30,000?
Solution :

$$\widehat{cons} = 124.84 + 0.853(\$30{,}000) = \$25{,}714.84$$
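The same prediction can be computed in R. The sketch below wraps the estimated equation in a small helper function; the name predict_cons is a hypothetical choice for illustration and does not come from the assignment. It also evaluates the APC at the same income level.

```r
# Estimated coefficients from the consumption function
alpha_hat <- 124.84
beta_hat <- 0.853

# Hypothetical helper: predicted consumption at a given income level
predict_cons <- function(inc) alpha_hat + beta_hat * inc

# Predicted consumption and average propensity to consume at $30,000
cons_30k <- predict_cons(30000)
apc_30k <- cons_30k / 30000

print(c(cons_30k = cons_30k, apc_30k = apc_30k))
```

At this income level the APC is only slightly above the MPC of 0.853, because the term 124.84/inc is small when income is large.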
c. [5 points] With inc on the x-axis, draw a graph of the estimated MPC and APC
Solution :

$$MPC = \hat{\beta} = 0.853$$

$$APC = \widehat{cons}/inc = \$124.84/inc + 0.853$$

Note that the APC is not constant, it is always larger than the MPC, and it gets closer to the MPC as
income increases.
# If needed, run install.packages("ggplot2") first
library(ggplot2)
# Selecting a range of (arbitrary) income levels to plot
inc <- 25:1000
# Generating and naming the APC curve
apc <- 124.84/inc+0.853
# Generating and naming the MPC line
mpc <- 0.853
# Creating a data frame for the three objects of interest: inc, apc and mpc
inc_data <- data.frame(inc,apc,mpc)
# Plotting the two series (apc and mpc with income on x-axis and consumption on y-axis)
ggplot(data = inc_data, aes(x=inc)) +
geom_line(aes(y = apc, colour = "APC")) + geom_line(aes(y = mpc, colour = "MPC")) +
labs(title ="Estimating MPC & APC", x ="Income", y ="Consumption") +
scale_colour_manual("", values = c("APC" = "red", "MPC" = "blue"))

[Figure: "Estimating MPC & APC". The APC and MPC series plotted against Income (x-axis, 0 to 1000); the y-axis is labeled Consumption.]
Question 5 [15 points]
A college bookseller makes calls at the offices of professors and forms the impression that professors are more
likely to be away from their offices on Friday than any other working day. A review of the records of calls,
one-fifth of which are on Fridays, indicates that for 16% of Friday calls, the professor is away from the office,
while this occurs for only 12% of calls on every other working day. Define the random variables as follows: X
is equal to one if the call is made on Friday and zero if the call is made on Monday to Thursday and Y is
equal to one if the professor is away from the office and zero if the professor is in the office.
a. [5 points] Find the joint probability function for X and Y .
Solution :
Let’s first establish the marginal pdf of X, P r(X = x). Note that the probabilities are derived simply from
the fact that there are 5 working days in a week. X takes a value of 1 when the call is made on a Friday (in
other words, 1 possible working day of the week). On the other hand, X takes a value of 0 when the call is
made on Monday-Thursday (in other words, the other 4 possible working days of the week).

$$Pr(\text{A call is made on a Friday}) = Pr(X = 1) = \frac{1}{5} = 0.2$$

$$Pr(\text{A call is made on a day that isn't a Friday}) = Pr(X = 0) = \frac{4}{5} = 0.8$$
We can use the following formula to calculate the joint probability distribution:

P (X = x, Y = y) = P (Y = y|X = x)P (X = x)
where

P (The professor is away from the office and Friday) = 0.16 ∗ 0.2 = 0.032
P (The professor is in the office and Friday) = 0.84 ∗ 0.2 = 0.168
P (The professor is away from the office and Not Friday) = 0.12 ∗ 0.8 = 0.096
P (The professor is in the office and Not Friday) = 0.88 ∗ 0.8 = 0.704

Note that the joint probabilities sum to one, which is an easy way to check that our calculations are correct.
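The joint probability function can be tabulated in R directly from the formula P(X = x, Y = y) = P(Y = y | X = x)P(X = x). The sketch below is one way to lay it out; the row and column labels are illustrative names, not notation from the assignment.

```r
# Marginal pmf of X (day of the call) and P(Y = 1 | X = x) (professor away)
p_x <- c(not_friday = 4 / 5, friday = 1 / 5)
p_away_given_x <- c(not_friday = 0.12, friday = 0.16)

# Joint pmf: rows index X, columns index Y
joint <- cbind(
  in_office = (1 - p_away_given_x) * p_x,
  away = p_away_given_x * p_x
)

print(joint)

# Sanity check: the joint probabilities should sum to one
total <- sum(joint)
print(total)
```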
b. [5 points] Find the conditional probability function for Y given X = 1 and X = 0.
Solution :
These are given in the question:

$$P(\text{The professor is in the office} \mid \text{Friday}) = 0.84$$

and

$$P(\text{The professor is absent} \mid \text{Friday}) = 0.16$$

$$P(\text{The professor is in the office} \mid \text{Not Friday}) = 0.88$$

and

$$P(\text{The professor is absent} \mid \text{Not Friday}) = 0.12$$
c. [5 points] Find E[Y |X]

Solution :
For discrete random variables, $E[Y|X = x] = \sum_{t \in T} t\,P(Y = t|X = x)$, where $T$ is the support of $Y$ (i.e. all the possible values
that $Y$ can take). Because we have a Bernoulli distribution, we know that $E[Y|X = \text{Friday}] =
P(\text{The professor is away} \mid \text{Friday}) = 0.16$. It takes this value with probability 1/5. $E[Y|X = \text{Not Friday}] =
P(\text{The professor is away} \mid \text{Not Friday}) = 0.12$. It takes this value with probability 4/5.
Note that E[Y |X] is a random variable, as X is a random variable.
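Since E[Y|X] is a random variable that takes the value 0.16 with probability 1/5 and 0.12 with probability 4/5, its own expectation can be computed in a couple of lines of R. This is an illustrative sketch going one step beyond what the question asks; the vector names are assumptions.

```r
# Distribution of X and the value of E[Y | X = x] for each x
p_x <- c(not_friday = 4 / 5, friday = 1 / 5)
e_y_given_x <- c(not_friday = 0.12, friday = 0.16)

# Law of iterated expectation: E[Y] = E[E[Y | X]]
e_y <- sum(e_y_given_x * p_x)

# Direct calculation: P(Y = 1) is the sum of the two "away" joint probabilities
p_away <- 0.12 * 0.8 + 0.16 * 0.2

print(c(e_y = e_y, p_away = p_away))
```

Both numbers equal 0.128, the unconditional probability that the professor is away.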

Data Question 1 [20 points]


Download data from the 2010 Census at a geographic level of the state or lower [the TA will tell you how
to access the dataset]. Choose data to generate two variables that will make up a SLRM but one variable
CANNOT be median household income. Your data set must have AT LEAST 30 observations.
1. [5 points] Describe your data; include the period of analysis, the number of observations, the location,
and the geographic level of the data.
Solution :
library(foreign)
library(ggplot2)
library(stargazer)

##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
mydata <- read.dta('morg10.dta')
newdata = mydata[which(mydata$stfips=='AL'),]
cols_keep <- c('grade92', 'ownchild')
subset_data = newdata[cols_keep]
newvals_grade92<-c('31'=0,'32'=3,'33'=6,'34'=8,'35'=9,
'36'=10,'37'=11,'38'=12,'39'=12,'40'=14,'41'=14,
'42'=14,'43'=16,'44'=17,'45'=20,'46'=22)
subset_data['yrs_ed']=newvals_grade92[as.character(subset_data$grade92)]
subset_data$grade92 <- NULL
final_data <- na.omit(subset_data)
final_data <- final_data[!is.infinite(rowSums(final_data)),]

My data cover monthly census interview dates (‘intmonth’) from January-December 2010. There are 1,720
observations in my final data set (including observations only from Alabama, excluding NA’s). The geographic
level of my final data set is the state (Alabama).
2. [5 points] Describe your variables. This will include the definitions of these variables and the summary
statistics (mean and standard deviation). DO NOT include the R output as part of your homework;
rather write a sentence that indicates the value of these statistics.
Solution :
As mentioned in recitation, you can find the description and value ranges of the relevant variables in this
document: https://data.nber.org/morg/docs/cpsx.pdf:
(1)yrs_ed : This variable takes the value of the number of years of education of the survey respondent. It
ranges from 0-22, with 0 representing the respondent completed less than 1st grade and 22 representing the
respondent completed a doctorate degree. The mean of this variable is 13.6 and the standard deviation is 2.5.

(2)ownchild : This variable takes the value of the number of own children less than 18 in the respondent’s
primary family. The mean of this variable is 0.5 and the standard deviation is 0.9.
3. [5 points] Write down the population LRM that is based on these two variables. Explain why this
is an economically interesting relationship (i.e. what economic theory/reasoning indicates that the
independent variable, X, causes the dependent variable, Y ?). What is the predicted sign of the slope
coefficient in your regression?
Solution :
The general form of a population LRM is: $Y_i = \alpha + \beta X_i + \varepsilon_i$
Using my choice of variables, the population LRM is: $ownchild_i = \alpha + \beta \times yrs\_ed_i + \varepsilon_i$
I am choosing to write the population LRM in terms of the number of years of education and number of
own children. You could have chosen any two other variables besides the log of weekly earnings and years of
education. This is an economically interesting relationship, as one might imagine there is a trade off between
obtaining more education and having and raising children. I predict that there will be a negative slope
coefficient in my regression, meaning that the more education a respondent has, the fewer children under 18
the respondent’s primary family has.
4. [5 points] Run a regression using these two variables and interpret the slope parameter estimate from
this regression. Include the regression table [hint: you can use stargazer or summary] as part of your
homework.
Solution :
# Run after following the data import instructions from "Census2010_Commands.R" file uploaded
# to NYU Classes (which we also went through in recitation).
# We use the "lm" command in R to fit our population LRM. The dependent, or response, variable
# is listed first, followed by "~" and then one or more independent variables:
model_with_intercept <- lm(ownchild ~ yrs_ed, data=final_data)
summary(model_with_intercept)

##
## Call:
## lm(formula = ownchild ~ yrs_ed, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7295 -0.4640 -0.3976 -0.1984 9.6024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0007563 0.0717663 -0.011 0.992
## yrs_ed 0.0331946 0.0054303 6.113 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8886 on 3567 degrees of freedom
## Multiple R-squared: 0.01037, Adjusted R-squared: 0.01009
## F-statistic: 37.37 on 1 and 3567 DF, p-value: 1.084e-09
In this model, I obtained a slope coefficient of ≈ 0.033. I can interpret this as follows: each additional
year of education is associated with, on average, 0.033 additional children under the age of
18 in the Alabama respondent’s primary family. From the summary table, I can see this result is statistically
significant at the 0.001 level, but the sign of the slope coefficient is the opposite of what I had predicted.
Of course, this is a simplistic model, and it is likely that years of education only partially explain the number
of children respondents in Alabama have; other factors (e.g. income, marital status) likely matter as well.
Additionally, since “ownchild” only captures the number of children under 18, it is possible that respondents
have older children who do not show up in our regression as a result.

Data Question 2 [15 points]


Download data from 2016 CPS (the TA will help you find the data during recitation – look at the March
CPS) which contains observations on weekly earnings, sex, race, age and education for respondents aged
25-64.
a. [5 points] Plot the weekly earnings of individuals against the number of years of education.
Solution :
library(foreign)
library(ggplot2) # needed for the ggplot() calls below
mydata <- read.dta('morg16.dta')
newdata <- mydata[which(mydata$intmonth=='March'),]
newdata2 <- newdata[which(newdata$age>=25 & newdata$age<=64),]
cols_keep <- c('earnwke', 'sex','age', 'race', 'grade92')
subset_data = newdata2[cols_keep]
newvals_grade92 <- c('31'=0,'32'=3,'33'=6,'34'=8,'35'=9,
'36'=10,'37'=11,'38'=12,'39'=12,'40'=14,'41'=14,
'42'=14,'43'=16,'44'=17,'45'=20,'46'=22)
subset_data['yrs_ed']=newvals_grade92[as.character(subset_data$grade92)]
subset_data$grade92 <- NULL
subset_data['log_earnings']=log(subset_data$earnwke)
final_data <- na.omit(subset_data)
final_data <- final_data[!is.infinite(rowSums(final_data)),]
ggplot(final_data, aes(yrs_ed, earnwke)) +
geom_point(na.rm = TRUE) + labs(title = "Years of Educ. & Weekly Earnings",
x = "Years of Education", y = "Weekly Earnings")

[Figure: "Years of Educ. & Weekly Earnings". Scatter plot of Weekly Earnings (y-axis, 0 to 3000) against Years of Education (x-axis, 0 to 20).]
b. [5 points] Take the logarithm of the weekly earnings and plot the new variable against the number of
years of education. Do you see a difference in the relationship between the two variables? Does it make
sense to take the logarithm if we are interested in a linear model?
Solution :
final_data['log_earnings']=log(final_data$earnwke)
ggplot(final_data, aes(yrs_ed, log_earnings)) +
geom_point(na.rm = TRUE) + labs(title = "Years of Educ. & Log Weekly Earnings",
x = "Years of Education", y = "Log of Weekly Earnings")

[Figure: "Years of Educ. & Log Weekly Earnings". Scatter plot of Log of Weekly Earnings (y-axis, about −5 to 5 and above) against Years of Education (x-axis, 0 to 20).]
The first reason we may want to use the log of weekly earnings is to improve model fit. In this case, our
residuals aren’t normally distributed, as earnings data are bounded below at zero and often exhibit positive skew.
One way to account for this is to modify the initial model to reflect possible non-linearity in the dependent
variable (weekly earnings) and skewness in the distribution of disturbances for each education level. Taking
the logarithm of a skewed variable, such as weekly earnings, can improve the fit by making the variable more
“normally” distributed.
In our case, the plot of weekly earnings against years of education in part (a) showed that within each year of
education, earnings are not symmetrically distributed. In fact, they were positively skewed such that for
the same education level, a few people have very high earnings but most are lower than the average. Log
transforming weekly earnings results in a plot in part (b) where earnings are generally more symmetrically
distributed within each education level.
The second reason we may want to use the log of weekly earnings is theoretical. There is a theoretical rationale
that education has a multiplicative effect on earnings, in which case the model as is would be nonlinear. By
taking the logarithm of weekly earnings, we transform the multiplicative model into a linear one, as follows.
Recall that $e^{A+B} = e^{A} e^{B}$:

$$Y = e^{\beta_0 + \beta_1 X} = e^{\beta_0} e^{\beta_1 X}$$

Take the log of both sides and recall that $\log(AB) = \log(A) + \log(B)$:

$$\log(Y) = \log(e^{\beta_0} e^{\beta_1 X}) = \log(e^{\beta_0}) + \log(e^{\beta_1 X})$$

Finally, recall that $\log(e^{A}) = A$:

$$\log(Y) = \beta_0 + \beta_1 X$$

The final reason we may want to use the log of weekly earnings is for interpretation convenience. Here, since
we have just taken the log of the dependent variable Y and not X, a one unit increase in X is associated with
approximately a β ∗ 100% increase/decrease in Y (the approximation is good when β is small). This is compared
to the case where we do not take a log transformation of either variable, and a one unit increase in X is
associated with a β unit increase/decrease in Y.
In sum, yes it is useful to use the log of weekly earnings since we are interested in a linear model.
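To illustrate the log-linear argument, the R sketch below simulates earnings from a multiplicative model and regresses log earnings on years of education. It does not use the CPS data: the coefficients (5 and 0.08), the noise level, the sample size, and the seed are all assumptions made up for this example. It also compares the approximate percentage interpretation, 100·β, with the exact one, 100·(exp(β) − 1).

```r
# Simulated multiplicative earnings model; all parameter values are illustrative
set.seed(266)
n <- 5000
beta0 <- 5
beta1 <- 0.08
yrs_ed <- sample(8:20, n, replace = TRUE)
u <- rnorm(n, sd = 0.5)
earnings <- exp(beta0 + beta1 * yrs_ed + u)

# Taking logs makes the model linear in yrs_ed, so OLS can estimate beta1
fit_log <- lm(log(earnings) ~ yrs_ed)
beta1_hat <- unname(coef(fit_log)["yrs_ed"])

# Approximate and exact percentage change in earnings per extra year of education
approx_pct <- 100 * beta1_hat
exact_pct <- 100 * (exp(beta1_hat) - 1)

print(c(beta1_hat = beta1_hat, approx_pct = approx_pct, exact_pct = exact_pct))
```

The estimated slope should land close to the assumed value of 0.08, and the exact percentage change is slightly larger than the approximation, with the gap growing as β moves away from zero.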
c. [5 points] Get the summary statistics of your data.
Solution :
# Run after following the data import instructions from "CPSMar2016_Commands.R" file uploaded
# to NYU Classes (which we also went through in recitation).
stargazer(final_data, type = 'text')

##
## =====================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------
## earnwke 11,025 992.340 679.023 0.010 2,884.610
## sex 11,025 1.494 0.500 1 2
## age 11,025 43.488 11.107 25 64
## race 11,025 1.407 1.234 1 21
## yrs_ed 11,025 14.251 2.684 0 22
## log_earnings 11,025 6.654 0.771 -4.605 7.967
## -----------------------------------------------------
