Lecture 4
Amine Hadji
Leiden University
March 1, 2022
Outline
• Correlation
• R²
Scatter plot
Regression line in sample
Relationship between variables
ŷ = b0 + b1 x,
• ŷ : predicted (estimated) y
• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for x = 0)
• b1 : slope of the straight line in the sample (i.e. how much ŷ changes for a one-unit
increase in x). Its sign determines whether the line is increasing or decreasing.
Residual error
Usually, the predicted value ŷ ≠ y, the observed value:
Residual / Prediction error: y − ŷ .
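As a small illustration of the two definitions above, the sketch below computes predicted values and residuals. The coefficients and the four data points are hypothetical and chosen only so the arithmetic is easy to follow; they are not the data from the lecture.

```python
# Hypothetical coefficients and toy data, for illustration only.
b0, b1 = 1.0, 2.0


def predict(x):
    """Return the predicted value y_hat = b0 + b1 * x."""
    return b0 + b1 * x


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.5, 2.5, 5.5, 6.5]

# Residual (prediction error) for each observation: e_i = y_i - y_hat_i
residuals = [y - predict(x) for x, y in zip(xs, ys)]
print(residuals)
```

A positive residual means the observation lies above the line, a negative one means it lies below.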
Least squares estimation
The residuals for the handspan data:
The intercept b0 and slope b1 are chosen to minimize the sum of the squared residuals
$$SSE = e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.$$
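The minimization can be sketched in a few lines. The slides do not give a solution formula, so the code below assumes the standard closed-form least squares solution, b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1 x̄. The function names and the data are hypothetical.

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares(xs, ys)
print(b0, b1, sse(xs, ys, b0, b1))
```

Any other choice of intercept or slope gives a larger SSE on the same data, which is what "least squares" refers to.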
Formula:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
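The formula translates directly into code: standardize each xi and yi, multiply the pairs, and average with n − 1 in the denominator. This is a minimal sketch on hypothetical data, assuming sx and sy are the sample standard deviations (with n − 1 in the denominator).

```python
import math


def sample_sd(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sample_sd(xs)
    s_y = sample_sd(ys)
    products = (
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    )
    return sum(products) / (n - 1)


# Toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(xs, ys)
print(r, r ** 2)
```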
Interpretation:
• r is close to −1 or 1 implies
• r²:
Squared correlation
Interpretation:
• r close to −1 or 1 ⇒ r² close to 1
If we did not know anything about the xi, the standard deviation would be:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}.$$
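For reference, this is the ordinary sample standard deviation of the yi. The sketch below computes it from the formula on hypothetical data and cross-checks the result against `statistics.stdev` from the Python standard library.

```python
import math
import statistics


def sd_ignoring_x(ys):
    """Standard deviation of y around its mean, using no information on x."""
    n = len(ys)
    y_bar = sum(ys) / n
    return math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))


# Toy data, for illustration only.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

s = sd_ignoring_x(ys)
print(s, statistics.stdev(ys))
```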
Statistical Significance - Linear Relationship
Regression line of the population is: y = b0 + b1 x.
Question: Is the slope zero, i.e. is there any relationship between the variables?
Hypothesis testing:
$$H_0: b_1 = 0 \qquad H_1: b_1 \neq 0.$$
Test statistic:
$$t = \frac{\text{sample statistic} - \text{null value}}{\text{standard error}} = \frac{b_1 - 0}{se(b_1)}.$$
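The test statistic can be computed by hand once se(b1) is known. The slides do not spell out se(b1), so the sketch below assumes the usual simple-regression formula, se(b1) = sqrt(SSE / (n − 2)) / sqrt(Σ(xi − x̄)²). The function name and the data are hypothetical.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b1, se_b1, t) for testing H0: slope = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Least squares estimates of slope and intercept.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual standard deviation, with n - 2 degrees of freedom (assumed).
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))

    # Standard error of the slope (assumed standard formula).
    se_b1 = s_e / math.sqrt(sxx)

    # t = (sample statistic - null value) / standard error
    t = (b1 - 0) / se_b1
    return b1, se_b1, t


# Toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b1, se_b1, t = slope_t_statistic(xs, ys)
print(b1, se_b1, t)
```

Under H0, and under the usual regression assumptions, this t is compared against a t-distribution with n − 2 degrees of freedom. A large |t| is evidence against a zero slope.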
• b0 : intercept of the straight line in the sample (i.e. the value of ŷ for all xj = 0)
Omitted variables
Problem:
the effect of omitted relevant variables can be picked up by other explanatory
variables.
Examples:
• work experience & education ⇒ wage
Multivariate regression - Assumptions
• No outliers
The coefficients b0, ..., bp−1 are chosen to minimize the sum of the squared residuals
$$R^2 = 1 - \frac{SSE}{SSTO},$$
$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = R^2 - (1 - R^2)\frac{p}{n-p-1}.$$
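Both quantities can be checked numerically. The sketch below assumes SSTO = Σ(yi − ȳ)², the total sum of squares, and that p counts the explanatory variables excluding the intercept, as the n − p − 1 denominator suggests. The observed and fitted values are hypothetical.

```python
def r_squared(ys, y_hats):
    """R^2 = 1 - SSE / SSTO for observed ys and fitted y_hats."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ssto = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / ssto


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2; p is assumed to exclude the intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Toy observed values and fitted values from a one-predictor model.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hats = [2.8, 3.4, 4.0, 4.6, 5.2]

n, p = len(ys), 1
r2 = r_squared(ys, y_hats)
r2_adj = adjusted_r_squared(r2, n, p)

# Second form of the formula: R^2 - (1 - R^2) * p / (n - p - 1)
r2_adj_alt = r2 - (1 - r2) * p / (n - p - 1)
print(r2, r2_adj, r2_adj_alt)
```

The two forms of the adjusted R² agree, and the second one makes the penalty visible: it grows with p and shrinks as n increases.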
• difficult to interpret