Assignment 2 Solutions
Sahar Parsa
Fall 2024
The second assignment is due on Friday, September 20th, 2024. It covers the material related to the population regression model and the OLS estimator. For the Data questions and any other questions that rely on using R, report the output of your analysis in a “report style” that is pleasant to read, and include the code you used to generate your results. Do not hand in the raw data, the raw output from R, or any intermediary output unless stated otherwise. [1] You are encouraged to discuss the problems with others, but [2] you must write up your own results. Do not copy someone else’s answer.
IMPORTANT DISCLAIMER: The homework is not graded. The points are only to give you information about the weight assigned to each question.
Question 1 [5 points]
Show that $\bar{\hat{Y}} = \bar{Y}$.
Solution :
From class, we know we can write the predicted/fitted value as:
$$\hat{Y} = \alpha_{OLS} + \beta_{OLS} X$$
where we understand $\alpha_{OLS}$ and $\beta_{OLS}$ to have been obtained from the FOCs we derived in class, and where $\hat{Y}$ is the predicted value taken as a random variable. Another way to interpret an OLS regression is to write each $Y_i$ as its fitted value plus its residual. For each $i$, write:
$$Y_i = \hat{Y}_i + \hat{e}_i$$
To show that $\bar{\hat{Y}} = \bar{Y}$, we sum each side of the equation and divide by $N$:
$$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{Y}_i + \hat{e}_i\right)$$
$$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} \hat{Y}_i + \frac{1}{N}\sum_{i=1}^{N} \hat{e}_i$$
$$\bar{Y} = \bar{\hat{Y}} + \bar{\hat{e}}$$
Recall from the algebraic properties of the OLS estimators that the sum, and therefore the sample average, of the OLS residuals is zero, i.e. $\bar{\hat{e}} = 0$. With this property in mind, the second term on the right-hand side of the equation above vanishes and the expression simplifies to:
$$\bar{Y} = \bar{\hat{Y}}$$
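As an optional numerical sanity check (not required by the question), a minimal R sketch on simulated data confirms that the sample mean of the fitted values equals the sample mean of Y; all object names and parameters below are arbitrary.
# Optional check: mean of the fitted values equals the mean of Y
set.seed(123)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)   # arbitrary data-generating process
fit <- lm(y ~ x)
mean(fitted(fit))                # sample mean of the fitted values
mean(y)                          # sample mean of Y; the two numbers coincide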
Question 2 [15 points]
Let $X$ and $Y$ have joint pdf $P_{X,Y}(x, y) = \frac{x+y}{15}$, where $x = 1, 2, 3$ and $y = 0, 1$.
a. [5 points] Find the covariance and correlation of X and Y (write the formula and then find the covariance and coefficient of correlation).
Solution :
Let’s first derive the joint pdf:
$$P_{X,Y}(X=1, Y=0) = \tfrac{1+0}{15} = \tfrac{1}{15} \qquad P_{X,Y}(X=2, Y=0) = \tfrac{2+0}{15} = \tfrac{2}{15} \qquad P_{X,Y}(X=3, Y=0) = \tfrac{3+0}{15} = \tfrac{3}{15}$$
$$P_{X,Y}(X=1, Y=1) = \tfrac{1+1}{15} = \tfrac{2}{15} \qquad P_{X,Y}(X=2, Y=1) = \tfrac{2+1}{15} = \tfrac{3}{15} \qquad P_{X,Y}(X=3, Y=1) = \tfrac{3+1}{15} = \tfrac{4}{15}$$
Additionally, let’s derive the marginal pdfs:
$$P_X(X=1) = \tfrac{3}{15} \quad \text{and} \quad P_X(X=2) = \tfrac{5}{15} \quad \text{and} \quad P_X(X=3) = \tfrac{7}{15}$$
$$P_Y(Y=0) = \tfrac{6}{15} \quad \text{and} \quad P_Y(Y=1) = \tfrac{9}{15}$$
$$E[XY] = (0)\tfrac{6}{15} + (1)\tfrac{2}{15} + (2)\tfrac{3}{15} + (3)\tfrac{4}{15} = \tfrac{2}{15} + \tfrac{6}{15} + \tfrac{12}{15} = \tfrac{20}{15}$$
$$E[X] = (1)\tfrac{3}{15} + (2)\tfrac{5}{15} + (3)\tfrac{7}{15} = \tfrac{3}{15} + \tfrac{10}{15} + \tfrac{21}{15} = \tfrac{34}{15}$$
$$E[Y] = (0)\tfrac{6}{15} + (1)\tfrac{9}{15} = \tfrac{9}{15}$$
$$Cov(X, Y) = E[XY] - E[X]E[Y] = \tfrac{20}{15} - \left(\tfrac{34}{15}\right)\left(\tfrac{9}{15}\right) = \tfrac{300}{225} - \tfrac{306}{225} = -\tfrac{6}{225} \approx -0.027$$
Now solving for $Corr(X, Y)$:
$$Corr(X, Y) = \rho_{X,Y} = \frac{Cov(X, Y)}{sd(X)\,sd(Y)} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$
$$\sigma_X = \sqrt{E[X^2] - E[X]^2} \quad \text{and} \quad \sigma_Y = \sqrt{E[Y^2] - E[Y]^2}$$
$$\sigma_X = \sqrt{(1^2)\tfrac{3}{15} + (2^2)\tfrac{5}{15} + (3^2)\tfrac{7}{15} - \left(\tfrac{34}{15}\right)^2} = \sqrt{\left(\tfrac{3}{15} + \tfrac{20}{15} + \tfrac{63}{15}\right) - \tfrac{1{,}156}{225}} = \sqrt{\tfrac{1{,}290}{225} - \tfrac{1{,}156}{225}} \approx 0.772$$
$$\sigma_Y = \sqrt{(0^2)\tfrac{6}{15} + (1^2)\tfrac{9}{15} - \left(\tfrac{9}{15}\right)^2} = \sqrt{\left(0 + \tfrac{9}{15}\right) - \tfrac{81}{225}} = \sqrt{\tfrac{135}{225} - \tfrac{81}{225}} \approx 0.490$$
$$Corr(X, Y) = \rho_{X,Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \approx \frac{-0.027}{(0.772)(0.490)} \approx -0.071$$
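As an optional check (not part of the required answer), a short R sketch computes the same moments directly from the joint pmf; all object names are arbitrary.
# Optional check: Cov(X,Y) and Corr(X,Y) from the joint pmf (x+y)/15
grid <- expand.grid(x = 1:3, y = 0:1)
p <- (grid$x + grid$y) / 15                 # joint probabilities, sum to 1
EX  <- sum(grid$x * p)
EY  <- sum(grid$y * p)
EXY <- sum(grid$x * grid$y * p)
cov_xy <- EXY - EX * EY                     # about -0.027
sd_x <- sqrt(sum(grid$x^2 * p) - EX^2)
sd_y <- sqrt(sum(grid$y^2 * p) - EY^2)
cov_xy / (sd_x * sd_y)                      # about -0.071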
b. [5 points] Find E[Y |X] (again write the formula first)
Solution :
For discrete random variables, the conditional expectation is written generally as $E[Y|X=x] = \sum_{t \in T} t\, P(Y=t|X=x)$, where $T$ is the support of $Y$ (i.e. all the possible values that $Y$ can take). Additionally, remember that $P(Y=t|X=x) = \frac{P(X=x, Y=t)}{P(X=x)}$.
$$E[Y|X=1] = (0)\left(\tfrac{1}{3}\right) + (1)\left(\tfrac{2}{3}\right) = \tfrac{2}{3}$$
$$E[Y|X=2] = (0)\left(\tfrac{2}{5}\right) + (1)\left(\tfrac{3}{5}\right) = \tfrac{3}{5}$$
$$E[Y|X=3] = (0)\left(\tfrac{3}{7}\right) + (1)\left(\tfrac{4}{7}\right) = \tfrac{4}{7}$$
Note that $E[Y|X]$ is a random variable, as $X$ is a random variable:
$$E[Y|X] = \begin{cases} 2/3 & \text{if } X=1 \ (X=1 \text{ with probability } 3/15)\\ 3/5 & \text{if } X=2 \ (X=2 \text{ with probability } 5/15)\\ 4/7 & \text{if } X=3 \ (X=3 \text{ with probability } 7/15)\end{cases} \tag{1}$$
c. [5 points] Calculate directly E[E[Y |X]] and hence show that it is equal to E[Y ]. This is known as the
law of iterated expectation.
Solution :
$$E[E[Y|X]] = \left(\tfrac{2}{3}\right)\left(\tfrac{3}{15}\right) + \left(\tfrac{3}{5}\right)\left(\tfrac{5}{15}\right) + \left(\tfrac{4}{7}\right)\left(\tfrac{7}{15}\right) = \tfrac{6}{45} + \tfrac{15}{75} + \tfrac{28}{105} = 0.6 = E[Y]$$
Thus the law of iterated expectation holds.
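A short, optional R check (using the same joint pmf as above; names are arbitrary) verifies the law of iterated expectations numerically.
# Optional check of the law of iterated expectations, E[E[Y|X]] = E[Y]
px <- c(3, 5, 7) / 15                 # marginal pmf of X at x = 1, 2, 3
ey_given_x <- c(2/3, 3/5, 4/7)        # E[Y | X = x] from part (b)
sum(ey_given_x * px)                  # 0.6, which equals E[Y] = 9/15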
Question 3
Show that when $X$ is a binary (Bernoulli) random variable, the population regression slope satisfies
$$\beta = \frac{Cov(X, Y)}{Var(X)} = E[Y|X=1] - E[Y|X=0]$$
Solution :
First, consider the formula for the slope. Remember that
$$Cov(X, Y) = E[XY] - E[X]E[Y]$$
where $E[X] = Pr(X=1) = p$. Applying the law of iterated expectations, we can rewrite
$$E[XY] = E\big[E[Y|X]X\big] = E[Y|X=1]\,Pr(X=1)\times 1 + E[Y|X=0]\,Pr(X=0)\times 0 = E[Y|X=1]\,p$$
and we can rewrite
$$E[Y] = E\big[E[Y|X]\big] = E[Y|X=1]\,Pr(X=1) + E[Y|X=0]\,Pr(X=0) = E[Y|X=1]\,p + E[Y|X=0]\,(1-p).$$
Hence, we can rewrite the numerator as
$$Cov(X, Y) = E[Y|X=1]\,p - p\big(E[Y|X=1]\,p + E[Y|X=0]\,(1-p)\big) = p(1-p)\big(E[Y|X=1] - E[Y|X=0]\big).$$
On the other hand, the denominator is the variance of a Bernoulli random variable, given by:
$$Var(X) = p(1-p)$$
It follows that:
$$\beta = E[Y|X=1] - E[Y|X=0]$$
The slope is the difference in the conditional expectation of $Y$.
For the intercept:
$$\alpha = E[Y] - \beta E[X] = E[Y|X=1]\,p + E[Y|X=0]\,(1-p) - \big(E[Y|X=1] - E[Y|X=0]\big)p = E[Y|X=0]$$
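As an optional illustration (not required by the question), a small R simulation with made-up parameters shows that the OLS slope on a binary regressor matches the difference in conditional sample means.
# Optional simulation: with a binary X, the OLS slope equals the
# difference in the sample means of Y across the two groups
set.seed(42)
n <- 10000
x <- rbinom(n, size = 1, prob = 0.3)     # Bernoulli regressor, p = 0.3 (arbitrary)
y <- 1 + 2 * x + rnorm(n)                # arbitrary data-generating process
coef(lm(y ~ x))["x"]                     # OLS slope
mean(y[x == 1]) - mean(y[x == 0])        # difference in conditional means; same number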
Question 4 [15 points]
Consider the fitted simple regression of consumption on income,
$$\widehat{cons} = \hat{\alpha} + \hat{\beta}\, inc$$
where the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, while the average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\alpha}/inc + \hat{\beta}$.
Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:
$$\widehat{cons} = 124.84 + 0.853\, inc$$
a. [5 points] Interpret the intercept in this equation, and comment on its sign and magnitude.
Solution :
The positive intercept indicates that if a given family had zero annual income, its predicted consumption
would be $124.84, which of course cannot literally be true. You can see that for low levels of income, this
linear function would not describe the relationship between income and consumption very well, which is why
we will eventually have to use other types of functions to describe such relationships.
b. [5 points] What is the predicted consumption when family income is $30, 000?
Solution :
$$\widehat{cons} = 124.84 + 0.853 \times \$30{,}000 = \$25{,}714.84$$
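A one-line R check of this arithmetic, using the estimated coefficients reported above:
# Predicted consumption at inc = 30000
124.84 + 0.853 * 30000   # 25714.84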
c. [5 points] With inc on the x-axis, draw a graph of the estimated MPC and APC
Solution :
$$MPC = \hat{\beta} = 0.853$$
$$APC = \widehat{cons}/inc = \$124.84/inc + 0.853$$
Note that the APC is not constant: it is always larger than the MPC, and it gets closer to the MPC as income increases.
# If needed, run install.packages("ggplot2") first
library(ggplot2)
# Selecting a range of (arbitrary) income levels to plot
inc <- 25:1000
# Generating and naming the APC curve
apc <- 124.84/inc+0.853
# Generating and naming the MPC line
mpc <- 0.853
# Creating a data frame for the three objects of interest: inc, apc and mpc
inc_data <- data.frame(inc,apc,mpc)
# Plotting the two series (apc and mpc with income on x-axis and consumption on y-axis)
ggplot(data = inc_data, aes(x=inc)) +
geom_line(aes(y = apc, colour = "APC")) + geom_line(aes(y = mpc, colour = "MPC")) +
labs(title ="Estimating MPC & APC", x ="Income", y ="Consumption") +
scale_colour_manual("", values = c("APC" = "red", "MPC" = "blue"))
[Figure: the estimated APC curve and the constant MPC line plotted against income.]
Question 5 [15 points]
A college bookseller makes calls at the offices of professors and forms the impression that professors are more
likely to be away from their offices on Friday than any other working day. A review of the records of calls,
one-fifth of which are on Fridays, indicates that for 16% of Friday calls, the professor is away from the office,
while this occurs for only 12% of calls on every other working day. Define the random variables as follows: X
is equal to one if the call is made on Friday and zero if the call is made on Monday to Thursday and Y is
equal to one if the professor is away from the office and zero if the professor is in the office.
a. [5 points] Find the joint probability function for X and Y .
Solution :
Let’s first establish the marginal pdf of X, P r(X = x). Note that the probabilities are derived simply from
the fact that there are 5 working days in a week. X takes a value of 1 when the call is made on a Friday (in
other words, 1 possible working day of the week). On the other hand, X takes a value of 0 when the call is
made on Monday-Thursday (in other words, the other 4 possible working days of the week).
$$Pr(\text{A call is made on a Friday}) = Pr(X = 1) = \frac{1}{5} = 0.2$$
$$Pr(\text{A call is made on a day that isn't a Friday}) = Pr(X = 0) = \frac{4}{5} = 0.8$$
We can use the following formula to calculate the joint probability distribution:
P (X = x, Y = y) = P (Y = y|X = x)P (X = x)
where
P (The professor is away from the office and Friday) = 0.16 ∗ 0.2 = 0.032
P (The professor is in the office and Friday) = 0.84 ∗ 0.2 = 0.168
P (The professor is away from the office and Not Friday) = 0.12 ∗ 0.8 = 0.096
P (The professor is in the office and Not Friday) = 0.88 ∗ 0.8 = 0.704
Note that the joint probabilities sum to one, which is an easy way to check that our calculations are correct.
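An optional R sketch (object names arbitrary) tabulates the joint probabilities and confirms they sum to one.
# Optional check: joint probabilities P(X = x, Y = y) and their total
p_friday <- 0.2
joint <- matrix(c(0.88 * (1 - p_friday), 0.12 * (1 - p_friday),   # X = 0 (not Friday)
                  0.84 * p_friday,       0.16 * p_friday),        # X = 1 (Friday)
                nrow = 2, byrow = TRUE,
                dimnames = list(X = c("Not Friday", "Friday"),
                                Y = c("In office", "Away")))
joint        # 0.704, 0.096, 0.168, 0.032
sum(joint)   # equals 1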
b. [5 points] Find the conditional probability function for Y given X = 1 and X = 0.
Solution :
These are given in the question:
$$P(Y = 1|X = 1) = P(\text{The professor is absent}\,|\,\text{Friday}) = 0.16$$
and
$$P(Y = 1|X = 0) = P(\text{The professor is absent}\,|\,\text{Not Friday}) = 0.12$$
with the complements $P(Y = 0|X = 1) = 0.84$ and $P(Y = 0|X = 0) = 0.88$.
c. [5 points] Find E[Y |X].
Solution :
For discrete random variables, $E[Y|X=x] = \sum_{t \in T} t\, P(Y=t|X=x)$, where $T$ is the support of $Y$ (i.e. all the possible values that $Y$ can take). Because $Y$ given $X$ is a Bernoulli random variable, we know that $E[Y|X=\text{Friday}] = P(\text{The professor is away}|\text{Friday}) = 0.16$. It takes this value with probability $1/5$. Likewise, $E[Y|X=\text{Not Friday}] = P(\text{The professor is away}|\text{Not Friday}) = 0.12$. It takes this value with probability $4/5$.
Note that $E[Y|X]$ is a random variable, as $X$ is a random variable.
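As an optional follow-up (not asked for), the law of iterated expectations from Question 2 gives the unconditional probability that a professor is away; a one-line R check:
# E[Y] = E[E[Y|X]]: overall probability that the professor is away
0.16 * (1/5) + 0.12 * (4/5)   # 0.128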
# Load the packages used below (run install.packages(...) first if needed)
library(foreign)     # provides read.dta() for Stata .dta files
library(stargazer)   # used later for summary-statistics and regression tables
mydata <- read.dta('morg10.dta')
# Keep only the Alabama observations
newdata <- mydata[which(mydata$stfips == 'AL'), ]
# Keep only the variables of interest
cols_keep <- c('grade92', 'ownchild')
subset_data <- newdata[cols_keep]
# Recode the grade92 education codes into years of education
newvals_grade92 <- c('31' = 0, '32' = 3, '33' = 6, '34' = 8, '35' = 9,
                     '36' = 10, '37' = 11, '38' = 12, '39' = 12, '40' = 14,
                     '41' = 14, '42' = 14, '43' = 16, '44' = 17, '45' = 20, '46' = 22)
subset_data['yrs_ed'] <- newvals_grade92[as.character(subset_data$grade92)]
subset_data$grade92 <- NULL
# Drop observations with missing or infinite values
final_data <- na.omit(subset_data)
final_data <- final_data[!is.infinite(rowSums(final_data)), ]
My data cover monthly census interview dates (‘intmonth’) from January-December 2010. There are 1,720
observations in my final data set (including observations only from Alabama, excluding NA’s). The geographic
level of my final data set is the state (Alabama).
2. [5 points] Describe your variables. This will include the definitions of these variables and the summary statistics (mean and standard deviation). DO NOT include the R output as part of your homework; rather, write a sentence that indicates the value of these statistics.
Solution :
As mentioned in recitation, you can find the description and value ranges of the relevant variables in this
document: https://data.nber.org/morg/docs/cpsx.pdf:
(1) yrs_ed: This variable records the number of years of education of the survey respondent. It ranges from 0 to 22, with 0 representing that the respondent completed less than 1st grade and 22 representing that the respondent completed a doctorate degree. The mean of this variable is 13.6 and the standard deviation is 2.5.
(2) ownchild: This variable records the number of the respondent's own children under 18 in the respondent's primary family. The mean of this variable is 0.5 and the standard deviation is 0.9.
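These values can be obtained with the usual base R commands (a minimal sketch, assuming the cleaned data frame final_data built above):
# Means and standard deviations reported above
mean(final_data$yrs_ed);   sd(final_data$yrs_ed)     # about 13.6 and 2.5
mean(final_data$ownchild); sd(final_data$ownchild)   # about 0.5 and 0.9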
3. [5 points] Write down the population LRM that is based on these two variables. Explain why this
is an economically interesting relationship (i.e. what economic theory/reasoning indicates that the
independent variable, X, causes the dependent variable, Y ?). What is the predicted sign of the slope
coefficient in your regression?
Solution :
The general form of a population LRM is: $Y_i = \alpha + \beta X_i + \varepsilon_i$
Using my choice of variables, the population LRM is: $ownchild_i = \alpha + \beta \times yrs\_ed_i + \varepsilon_i$
I am choosing to write the population LRM in terms of the number of years of education and number of
own children. You could have chosen any two other variables besides the log of weekly earnings and years of
education. This is an economically interesting relationship, as one might imagine there is a trade-off between obtaining more education and having and raising children. I predict that the slope coefficient in my regression will be negative, meaning that the more education a respondent has, the fewer children under 18 the respondent's primary family has.
4. [5 points] Run a regression using these two variables and interpret the slope parameter estimate from
this regression. Include the regression table [hint: you can use stargazer or summary] as part of your
homework.
Solution :
# Run after following the data import instructions from "Census2010_Commands.R" file uploaded
# to NYU Classes (which we also went through in recitation).
# We use the "lm" command in R to fit our population LRM. The dependent, or response, variable
# is listed first, followed by "~" and then one or more independent variables:
model_with_intercept <- lm(ownchild ~ yrs_ed, data=final_data)
summary(model_with_intercept)
##
## Call:
## lm(formula = ownchild ~ yrs_ed, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7295 -0.4640 -0.3976 -0.1984 9.6024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0007563 0.0717663 -0.011 0.992
## yrs_ed 0.0331946 0.0054303 6.113 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8886 on 3567 degrees of freedom
## Multiple R-squared: 0.01037, Adjusted R-squared: 0.01009
## F-statistic: 37.37 on 1 and 3567 DF, p-value: 1.084e-09
In this model, I obtained a slope coefficient of ≈ 0.033. I can interpret this as follows: For every additional
year of education the Alabama respondent obtains, he or she has 0.033 additional children under the age of
18. From the summary table, I can see this result is statistically significant at the 0.001 level, but the sign
of the slope coefficient is the opposite of what I had predicted. Of course, this is a simplistic model, and it is likely that years of education only partially explains the number of children respondents in Alabama have (other factors, such as income and marital status, also matter). Additionally, since “ownchild” only captures the number of children under 18, it is possible that respondents have older children who simply do not show up in our regression as a result.
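As the hint in the question suggests, stargazer can be used instead of summary() to produce the regression table; a minimal sketch using the model object fitted above:
# Alternative regression table with stargazer (text output for a plain report)
stargazer(model_with_intercept, type = 'text')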
[Figure: scatterplot of weekly earnings (y-axis, in dollars) against years of education (x-axis).]
b. [5 points] Take the logarithm of the weekly earnings and plot the new variable against the number of years of education. Do you see a difference in the relationship between the two variables? Does it make sense to take the logarithm if we are interested in a linear model?
Solution :
# Create the log of weekly earnings
final_data['log_earnings'] <- log(final_data$earnwke)
[Figure: scatterplot of the log of weekly earnings (y-axis) against years of education (x-axis).]
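The plots in parts (a) and (b) can be produced with ggplot2; a minimal sketch for the part (b) plot, assuming the earnwke and yrs_ed variables defined above (for part (a), put earnwke on the y-axis instead of log_earnings):
# Scatterplot of log weekly earnings against years of education
library(ggplot2)
ggplot(data = final_data, aes(x = yrs_ed, y = log_earnings)) +
  geom_point(alpha = 0.2) +
  labs(x = "Years of Education", y = "Log of Weekly Earnings")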
The first reason we may want to use the log of weekly earnings is to improve model fit. In this case, our
residuals aren’t normally distributed, as earnings data are truncated at zero and often exhibit positive skew.
One way to account for this is to modify the initial model to reflect possible non-linearity in the dependent
variable (weekly earnings) and skewness in the distribution of disturbances for each education level. Taking
the logarithm of a skewed variable, such as weekly earnings, can improve the fit by making the variable more
“normally” distributed.
In our case, the plot of weekly earnings against years of education in part (a) showed that within each year of
education, earnings are not symmetrically distributed. In fact, they were positively skewed such that for
the same education level, a few people have very high earnings but most are lower than the average. Log
transforming weekly earnings results in a plot in part (b) where earnings are generally more symmetrically
distributed within each education level.
The second reason we may want to use the log of weekly earnings is theoretical. There is a theoretical rationale that education has a multiplicative effect on earnings, in which case the model as written would be nonlinear. By taking the logarithm of weekly earnings, we transform the multiplicative model into a linear one, as follows.
Recall that $e^{A+B} = e^A e^B$:
$$Y = e^{\beta_0 + \beta_1 X} = e^{\beta_0} e^{\beta_1 X}$$
Take the log of both sides and recall that $\log(AB) = \log(A) + \log(B)$:
$$\log(Y) = \log(e^{\beta_0}) + \log(e^{\beta_1 X})$$
Finally, recall that $\log(e^A) = A$:
$$\log(Y) = \beta_0 + \beta_1 X$$
The final reason we may want to use the log of weekly earnings is convenience of interpretation. Here, since we have taken the log of the dependent variable Y but not of X, a one-unit increase in X leads to (approximately) a $\beta \times 100\%$ change in Y. This is compared to the case where we do not take a log transformation of either variable, in which a one-unit increase in X leads to a change of $\beta$ units in Y.
In sum, yes it is useful to use the log of weekly earnings since we are interested in a linear model.
c. [5 points] Get the summary statistics of your data.
Solution :
# Run after following the data import instructions from "CPSMar2016_Commands.R" file uploaded
# to NYU Classes (which we also went through in recitation).
stargazer(final_data, type = 'text')
##
## =====================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------
## earnwke 11,025 992.340 679.023 0.010 2,884.610
## sex 11,025 1.494 0.500 1 2
## age 11,025 43.488 11.107 25 64
## race 11,025 1.407 1.234 1 21
## yrs_ed 11,025 14.251 2.684 0 22
## log_earnings 11,025 6.654 0.771 -4.605 7.967
## -----------------------------------------------------