Notes2
$\frac{dE(y|x)}{dx} = \beta_2$, which means that the expected value of y increases by $\beta_2$ units when x increases by one unit
o Does it matter which variable is on the left-hand side?
At one level, no:
$x = -\frac{\beta_1}{\beta_2} + \frac{1}{\beta_2}y - \frac{1}{\beta_2}e$, so
$x = \gamma_1 + \gamma_2 y + v$, where $\gamma_1 = -\frac{\beta_1}{\beta_2}$, $\gamma_2 = \frac{1}{\beta_2}$, $v = -\frac{1}{\beta_2}e$.
(In estimated regressions, however, the two directions are not simple rearrangements of one another; see the sketch at the end of this item.)
For purposes of most estimators, yes:
We shall see that a critically important assumption is that the
error term is independent of the “regressors” or exogenous
variables.
Are the errors shocks to y for given x or shocks to x for given y?
o It might not seem like there is much difference, but the
assumption is crucial to valid estimation.
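A minimal sketch (simulated data with hypothetical parameter values $\beta_1 = 1$, $\beta_2 = 2$, Gaussian noise) illustrating the point above: the estimated slope from regressing x on y is not the reciprocal of the slope from regressing y on x; their product is the squared sample correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=500)
e = rng.normal(0.0, 3.0, size=500)
y = 1 + 2 * x + e                         # hypothetical DGP: beta1=1, beta2=2

def ols_slope(regressand, regressor):
    """Least-squares slope from regressing `regressand` on `regressor`."""
    xd = regressor - regressor.mean()
    yd = regressand - regressand.mean()
    return (xd @ yd) / (xd @ xd)

b_yx = ols_slope(y, x)                    # slope from regressing y on x
b_xy = ols_slope(x, y)                    # slope from regressing x on y

# If the two regressions were mere rearrangements, b_xy would equal 1 / b_yx.
# In fact b_yx * b_xy equals the squared sample correlation, which is below 1 with noise.
print(b_yx, 1 / b_xy, b_yx * b_xy, np.corrcoef(x, y)[0, 1] ** 2)
```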
Exogeneity: x is exogenous with respect to y if shocks to y do not affect x,
i.e., y does not cause x.
Where do the data come from? Sample and “population”
o We observe a sample of observations on y and x.
o Depending on context these samples may be
Drawn from a larger population, such as census data or surveys
Generated by a specific “data-generating process” (DGP) as in time-
series observations
o We usually would like to assume that the observations in our sample are
statistically independent, or at least uncorrelated: $\mathrm{cov}(y_i, y_j) = 0,\ i \neq j$.
o We will assume initially (for a few weeks) that the values of x are chosen as in an
experiment: they are not random.
We will add random regressors soon and discover that they don’t change
things much as long as x is independent of e.
Goals of regression
o True regression line: actual relationship in population or DGP
True $\beta_1$ and $\beta_2$, and $f(e|x)$
Sample of observations comes from drawing random realizations of e
from f (e|x) and plotting points appropriately above and below the true
regression line.
o We want to find an estimated regression line that comes as close to the true
regression line as possible, based on the observed sample of y and x pairs:
Estimate values of parameters $\beta_1$ and $\beta_2$
Estimate properties of probability distribution of error term e
Make inferences about the above estimates
Use the estimates to make conditional forecasts of y
Determine the statistical reliability of these forecasts
Estimators
o A rule (formula) for calculating an estimate of a parameter ($\beta_1$, $\beta_2$, or $\sigma^2$) based on
the sample values y, x
o Estimators are often denoted by ^ over the variable being estimated: an
estimator of $\beta_2$ might be denoted $\hat\beta_2$
How might we estimate the coefficients of the simple regression model?
o Three strategies:
Method of least-squares
Method of moments
Method of maximum likelihood
o All three strategies with the SR assumptions lead to the same estimator rule: the
ordinary least-squares regression estimator: $(b_1, b_2, s^2)$
Method of least squares
o Estimation strategy: Make sum of squared y-deviations (“residuals”) of observed
values from the estimated regression line as small as possible.
o Given coefficient estimates $b_1, b_2$, residuals are defined as $\hat e_i = y_i - b_1 - b_2 x_i$
Or $\hat e_i = y_i - \hat y_i$, with $\hat y_i = b_1 + b_2 x_i$
o Why not minimize the sum of the residuals?
We don’t want the sum of residuals to be a large negative number: we could
make the sum of residuals arbitrarily negative by making all residuals infinitely negative.
Many alternative lines make the sum of residuals zero (which is
desirable) because positives and negatives cancel out.
o Why use square rather than absolute value to deal with cancellation of positives
and negatives?
Square function is continuously differentiable; absolute value function is
not.
Least-squares estimation is much easier than least-absolute-
deviation estimation.
Prominence of Gaussian (normal) distribution in nature and statistical
theory focuses us on variance, which is expectation of square.
Least-absolute-deviation estimation is occasionally done (special case of
quantile regression), but not common.
Least-absolute-deviation regression gives less importance to large outliers
than least squares because squaring puts heavy weight on residuals with
large absolute value. Least squares tends to draw the regression line toward these
points to eliminate large squared residuals, as the sketch below illustrates.
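A small sketch of this contrast, assuming a hypothetical DGP with one large outlier added by hand. The least-absolute-deviation line is found by numerically minimizing the sum of absolute residuals with scipy, which is not how production quantile-regression routines work, but it is enough to show the difference in sensitivity.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 1 + 2 * x + rng.normal(0.0, 1.0, size=50)    # hypothetical beta1=1, beta2=2
y[-1] += 40                                      # one large outlier in y

# Least squares: closed form
xd, yd = x - x.mean(), y - y.mean()
b2_ols = (xd @ yd) / (xd @ xd)
b1_ols = y.mean() - b2_ols * x.mean()

# Least absolute deviations: minimize the sum of absolute residuals numerically
def sum_abs_dev(b):
    return np.abs(y - b[0] - b[1] * x).sum()

b1_lad, b2_lad = minimize(sum_abs_dev, x0=[b1_ols, b2_ols], method="Nelder-Mead").x

# The outlier pulls the least-squares line much more than the LAD line
print("OLS:", b1_ols, b2_ols)
print("LAD:", b1_lad, b2_lad)
```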
o Least-squares criterion function:
$$S(b_1, b_2) = \sum_{i=1}^{N} \hat e_i^2 = \sum_{i=1}^{N} (y_i - b_1 - b_2 x_i)^2$$
o The first-order condition with respect to $b_1$:
$$\frac{\partial S}{\partial b_1} = -2\sum_{i=1}^{N} (y_i - b_1 - b_2 x_i) = 0$$
$$\sum_{i=1}^{N} y_i - N b_1 - b_2 \sum_{i=1}^{N} x_i = 0$$
$$\bar y - b_1 - b_2 \bar x = 0$$
$$b_1 = \bar y - b_2 \bar x.$$
Note that the $b_1$ condition assures that the regression line passes through
the point $(\bar x, \bar y)$.
o The first-order condition with respect to $b_2$ is $\frac{\partial S}{\partial b_2} = -2\sum_{i=1}^{N} x_i (y_i - b_1 - b_2 x_i) = 0$, or $\sum y_i x_i - b_1 \sum x_i - b_2 \sum x_i^2 = 0$. Substituting the $b_1$ condition into this second condition:
$$\sum y_i x_i - (\bar y - b_2 \bar x) N \bar x - b_2 \sum x_i^2 = 0$$
$$\sum y_i x_i - N \bar y \bar x - b_2 \left( \sum x_i^2 - N \bar x^2 \right) = 0$$
$$b_2 = \frac{\sum y_i x_i - N \bar y \bar x}{\sum x_i^2 - N \bar x^2} = \frac{\sum (y_i - \bar y)(x_i - \bar x)}{\sum (x_i - \bar x)^2} = \frac{\hat\sigma_{XY}}{\hat\sigma_X^2}.$$
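A minimal sketch implementing these two formulas on simulated data (hypothetical $\beta_1 = 1$, $\beta_2 = 2$) and cross-checking the result against numpy's built-in least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=200)
y = 1 + 2 * x + rng.normal(0.0, 1.5, size=200)   # hypothetical beta1=1, beta2=2

# b2 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b2 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# b1 = ybar - b2 * xbar, so the fitted line passes through (xbar, ybar)
b1 = y.mean() - b2 * x.mean()

# Cross-check against numpy's least-squares polynomial fit
b2_np, b1_np = np.polyfit(x, y, deg=1)   # for deg=1 polyfit returns [slope, intercept]
print((b1, b2), (b1_np, b2_np))
```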
Method of moments
o Assume $E(e_i) = 0,\ \forall i$. This means that the population/DGP mean of the
error term is zero.
Corresponding to this assumption about the population mean of e
is the sample mean condition $\frac{1}{N}\sum \hat e_i = 0$. Thus we set the
sample mean to the value we have assumed for the population
mean.
o Assume $\mathrm{cov}(x, e) = 0$, which is equivalent to $E\left[(x_i - E(x))\, e_i\right] = 0$.
Corresponding to this assumption about the population
covariance between the regressor and the error term is the sample
covariance condition: $\frac{1}{N}\sum (x_i - \bar x)\hat e_i = 0$. Again, we set the
sample moment to the zero value that we have assumed for the
population moment.
o Plugging the expression for the residual into the sample moment expressions
above:
$$\frac{1}{N}\sum_{i=1}^{N} (y_i - b_1 - b_2 x_i) = 0,$$
$$b_1 = \bar y - b_2 \bar x.$$
This is the same as the intercept estimate equation for the least-squares
estimator above.
$$\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)(y_i - b_1 - b_2 x_i) = 0,$$
$$\sum (x_i - \bar x)(y_i - \bar y + b_2 \bar x - b_2 x_i) = 0,$$
$$\sum (x_i - \bar x)(y_i - \bar y) - b_2 \sum (x_i - \bar x)(x_i - \bar x) = 0,$$
$$b_2 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}.$$
This is the same as the least-squares slope estimator above, so the method of moments and least squares give identical estimates, as the sketch below verifies numerically.
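A quick numerical check, using simulated data with hypothetical parameter values, that the two sample moment conditions hold exactly (up to floating-point rounding) at the method-of-moments/OLS estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3.0, 3.0, size=300)
y = 0.5 + 1.5 * x + rng.normal(0.0, 1.0, size=300)   # hypothetical beta1=0.5, beta2=1.5

# Method-of-moments (= OLS) estimates
b2 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b1 = y.mean() - b2 * x.mean()
ehat = y - b1 - b2 * x

# Sample analogues of E(e) = 0 and cov(x, e) = 0 hold exactly, up to rounding
print(ehat.mean())                        # ~ 0
print(((x - x.mean()) * ehat).mean())     # ~ 0
```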
Method of maximum likelihood
o If the errors are Gaussian, each observation has a joint density function $f_i(x_i, y_i | \beta_1, \beta_2)$.
This function measures the probability density of any particular
combination of y and x values, which can be loosely thought of as how
probable that outcome is, given the parameter values.
For a given set of parameters, some observations of y and x are less likely
than others. For example, if $\beta_1 = 0$ and $\beta_2 < 0$, then observations with
y > 0 when x > 0 are less likely than observations with y < 0.
o The idea of maximum-likelihood estimation is to choose a set of parameters that
makes the likelihood of observing the sample that we actually have as high as
possible.
o The likelihood function is just the joint density function turned on its head:
$$L_i(\beta_1, \beta_2 | x_i, y_i) = f_i(x_i, y_i | \beta_1, \beta_2).$$
o If the observations are independent random draws from identical probability
distributions (they are IID), then the overall sample density (likelihood) function
is the product of the density (likelihood) function of the individual observations:
$$f(x_1, y_1, x_2, y_2, \ldots, x_N, y_N | \beta_1, \beta_2) = \prod_{i=1}^{N} f_i(x_i, y_i | \beta_1, \beta_2)$$
$$L(\beta_1, \beta_2 | x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \prod_{i=1}^{N} L_i(\beta_1, \beta_2 | x_i, y_i).$$
o For the Gaussian regression model, the density of a single observation is $f_i = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y_i - \beta_1 - \beta_2 x_i)^2}{2\sigma^2} \right]$.
Because of the exponential function, Gaussian likelihood functions are
usually manipulated in logs.
Note that because the log function is monotonic, maximizing the
log-likelihood function is equivalent to maximizing the likelihood
function itself.
For an individual observation: $\ln L_i = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \beta_1 - \beta_2 x_i)^2$
Aggregating over the sample:
$$\ln L(\beta_1, \beta_2 | x_i, y_i) = \sum_{i=1}^{N} \ln L_i(\beta_1, \beta_2 | x_i, y_i)$$
$$= \sum_{i=1}^{N} \left[ -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \beta_1 - \beta_2 x_i)^2 \right]$$
$$= -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - \beta_1 - \beta_2 x_i)^2.$$
o The only part of this expression that depends on $\beta$ or on the sample is the final
summation. Because of the negative sign, maximizing the likelihood function
(with respect to $\beta$) is equivalent to minimizing the summation.
But this summation is just the sum of squared residuals that we
minimized in OLS.
o Thus, OLS is MLE if the distribution of e conditional on x is Gaussian with
mean zero and constant variance $\sigma^2$, and if the observations are IID.
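A sketch of this equivalence under a hypothetical Gaussian DGP: the Gaussian log-likelihood derived above is maximized numerically (by minimizing its negative with scipy's Nelder-Mead routine), and the resulting $\beta$ estimates match the closed-form OLS estimates; the $\sigma^2$ estimate differs slightly because MLE divides by N rather than N − 2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=250)
y = 2 + 3 * x + rng.normal(0.0, 1.5, size=250)   # hypothetical beta1=2, beta2=3, sigma=1.5
N = len(y)

def neg_loglik(params):
    """Negative of the Gaussian log-likelihood derived above."""
    b1, b2, log_sigma2 = params          # sigma^2 parameterized in logs to keep it positive
    sigma2 = np.exp(log_sigma2)
    resid = y - b1 - b2 * x
    return 0.5 * N * np.log(2 * np.pi * sigma2) + (resid ** 2).sum() / (2 * sigma2)

start = [y.mean(), 0.0, np.log(y.var())]         # crude but serviceable starting values
mle_b1, mle_b2, mle_logs2 = minimize(neg_loglik, x0=start, method="Nelder-Mead").x

# Closed-form OLS for comparison: slope and intercept should agree up to optimizer tolerance
b2_ols = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b1_ols = y.mean() - b2_ols * x.mean()
print((mle_b1, mle_b2), (b1_ols, b2_ols))
print(np.exp(mle_logs2))   # MLE of sigma^2 divides by N rather than N - 2
```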
Evaluating alternative estimators (not important for comparing them here, since all three
strategies give the same estimator, but is that estimator any good?)
o Desirable criteria
Unbiasedness: the estimator is on average equal to the true value: $E(\hat\beta) = \beta$
Small variance: $\mathrm{var}(\hat\beta) = E\left[\left(\hat\beta - E(\hat\beta)\right)^2\right]$
Small RMSE can balance variance with bias:
$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$, with $\mathrm{MSE} = E\left[(\hat\beta - \beta)^2\right]$
o We will talk about BLUE estimators as minimum variance within the class of
linear unbiased estimators.
We can write the OLS slope estimator as
$$b_2 = \frac{\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar y)(x_i - \bar x)}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2}$$
$$= \frac{\frac{1}{N}\sum_{i=1}^{N} (\beta_1 + \beta_2 x_i + e_i - \bar y)(x_i - \bar x)}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2}$$
$$= \frac{\frac{1}{N}\sum_{i=1}^{N} (\beta_1 + \beta_2 x_i + e_i - \beta_1 - \beta_2 \bar x)(x_i - \bar x)}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2}$$
$$= \frac{\frac{1}{N}\sum_{i=1}^{N} \left[\beta_2 (x_i - \bar x) + e_i\right](x_i - \bar x)}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2}$$
$$= \beta_2 + \frac{\frac{1}{N}\sum_{i=1}^{N} e_i (x_i - \bar x)}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2}.$$
The third step uses the property $\bar y = \beta_1 + \beta_2 \bar x$, since the expected value of e is zero.
For now, we are assuming that x is non-random, as in a controlled experiment.
o If x is fixed, then the only part of the formula above that is random is e.
o The formula shows that the slope estimate is linear in e.
o This means that if e is Gaussian, then the slope estimate will also be Gaussian.
Even if e is not Gaussian, the slope estimate will converge to a Gaussian
distribution as long as some modest assumptions about its distribution
are satisfied.
o Because all the x variables are non-random, they can come outside when we take
expectations, so
$$E(b_2) = \beta_2 + E\left[ \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x) e_i}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2} \right] = \beta_2 + \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x) E(e_i)}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2} = \beta_2.$$
o What about the variance of b2?
We will do the details of the analytical work in matrix form because it’s
easier
$$\mathrm{var}(b_2) = E\left[(b_2 - \beta_2)^2\right] = E\left[ \left( \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x) e_i}{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar x)^2} \right)^2 \right] = \frac{\sigma^2}{\sum_{i=1}^{N} (x_i - \bar x)^2}.$$
HGL equations 2.14 and 2.16 provide formulas for the variance of $b_1$ and the
covariance between the coefficients:
$$\mathrm{var}(b_1) = \sigma^2 \frac{\sum_{i=1}^{N} x_i^2}{N \sum_{i=1}^{N} (x_i - \bar x)^2}$$
$$\mathrm{cov}(b_1, b_2) = \sigma^2 \frac{-\bar x}{\sum_{i=1}^{N} (x_i - \bar x)^2} < 0 \text{ (when } \bar x > 0\text{)}.$$
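A Monte Carlo sketch of these results, holding x fixed across replications (the "experimental" assumption above) and using hypothetical values $\beta_1 = 1$, $\beta_2 = 2$, $\sigma = 3$: the simulated mean of $b_2$ and the simulated variances and covariance should line up with the analytical formulas.

```python
import numpy as np

rng = np.random.default_rng(5)
N, reps = 50, 20000
beta1, beta2, sigma = 1.0, 2.0, 3.0            # hypothetical true values
x = rng.uniform(0.0, 10.0, size=N)             # x fixed across replications ("experimental" x)
xd = x - x.mean()
Sxx = (xd ** 2).sum()

b1_draws, b2_draws = np.empty(reps), np.empty(reps)
for r in range(reps):
    e = rng.normal(0.0, sigma, size=N)         # only the errors are random
    y = beta1 + beta2 * x + e
    b2_draws[r] = (xd * (y - y.mean())).sum() / Sxx
    b1_draws[r] = y.mean() - b2_draws[r] * x.mean()

# Simulated moments versus the analytical formulas
print(b2_draws.mean(), beta2)                                          # E(b2) = beta2
print(b2_draws.var(), sigma ** 2 / Sxx)                                # var(b2)
print(b1_draws.var(), sigma ** 2 * (x ** 2).sum() / (N * Sxx))         # var(b1), HGL 2.14
print(np.cov(b1_draws, b2_draws)[0, 1], -sigma ** 2 * x.mean() / Sxx)  # cov(b1, b2), HGL 2.16
```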
OLS in matrix notation
Summing squares of the elements of a column vector in matrix notation is just the inner
product: $\sum_{i=1}^{N} \hat e_i^2 = \hat e'\hat e$, where prime denotes matrix transpose. Thus we want to minimize $\hat e'\hat e = (y - Xb)'(y - Xb)$.
o $(y - Xb)'(y - Xb) = (y' - b'X')(y - Xb) = y'y - 2b'X'y + b'X'Xb.$
o Differentiating with respect to the coefficient vector and setting to zero yields
$-2X'y + 2X'Xb = 0$, or $X'Xb = X'y$.
o Pre-multiplying by the inverse of $X'X$ yields the OLS coefficient formula:
$b = (X'X)^{-1}X'y$. (This is one of the few formulas that you need to memorize.)
Note the symmetry between the matrix formula and the scalar formula. $X'y$ is the sum of the cross
product of the two variables and $X'X$ is the sum of squares of the regressor. The former is
in the numerator (and not inverted) and the latter is in the denominator (and inverted).
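A minimal numpy sketch of the matrix formula on simulated data (hypothetical DGP), checked against the scalar formulas; np.linalg.solve is used instead of forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(2.0, 1.0, size=100)
y = 1 + 2 * x + rng.normal(0.0, 1.0, size=100)   # hypothetical DGP

# Regressor matrix: a column of ones (the intercept) and the column x
X = np.column_stack([np.ones_like(x), x])

# b = (X'X)^{-1} X'y; np.linalg.solve applies the inverse without forming it explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar formulas for comparison
b2 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b1 = y.mean() - b2 * x.mean()
print(b, (b1, b2))
```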
In matrix notation, we can express our estimator in terms of e as
$$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + e) = \beta + (X'X)^{-1}X'e.$$
$$\mathrm{cov}(z) = E\left[(z - Ez)(z - Ez)'\right] =
\begin{pmatrix}
E(z_1 - Ez_1)^2 & E\left[(z_1 - Ez_1)(z_2 - Ez_2)\right] & \cdots & E\left[(z_1 - Ez_1)(z_M - Ez_M)\right] \\
E\left[(z_1 - Ez_1)(z_2 - Ez_2)\right] & E(z_2 - Ez_2)^2 & \cdots & E\left[(z_2 - Ez_2)(z_M - Ez_M)\right] \\
\vdots & \vdots & \ddots & \vdots \\
E\left[(z_1 - Ez_1)(z_M - Ez_M)\right] & E\left[(z_2 - Ez_2)(z_M - Ez_M)\right] & \cdots & E(z_M - Ez_M)^2
\end{pmatrix}.$$
In our regression model, if e is IID with mean zero and variance $\sigma^2$, then
$E(e) = 0$ and $\mathrm{cov}(e) = E(ee') = \sigma^2 I_N$, with $I_N$ being the order-N identity
matrix.
We can then compute the covariance matrix of the (unbiased) estimator
as
$$\mathrm{cov}(b) = E\left[(b - \beta)(b - \beta)'\right]$$
$$= E\left[ (X'X)^{-1}X'e \left( (X'X)^{-1}X'e \right)' \right]$$
$$= E\left[ (X'X)^{-1}X'ee'X(X'X)^{-1} \right]$$
$$= (X'X)^{-1}X'E(ee')X(X'X)^{-1}$$
$$= \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}.$$
The error variance is estimated with a degrees-of-freedom correction:
$$s^2 = \frac{1}{N-2}\sum_{i=1}^{N} \hat e_i^2,$$
so the estimated variance of the slope replaces $\sigma^2$ with $s^2$:
$$\widehat{\mathrm{var}}(b_2) = \frac{s^2}{\sum_{i=1}^{N} (x_i - \bar x)^2} \quad \text{estimates} \quad \mathrm{var}(b_2) = \frac{\sigma^2}{\sum_{i=1}^{N} (x_i - \bar x)^2}.$$
$$b \sim N\left( \beta, \sigma^2 (X'X)^{-1} \right).$$
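A sketch, under a hypothetical simulated DGP, of the estimated covariance matrix $s^2(X'X)^{-1}$, with $s^2$ computed using the N − 2 degrees-of-freedom correction and standard errors taken as the square roots of its diagonal.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 120
x = rng.uniform(0.0, 5.0, size=N)
y = 1 + 2 * x + rng.normal(0.0, 2.0, size=N)     # hypothetical DGP

X = np.column_stack([np.ones(N), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ b

# s^2 = (sum of squared residuals) / (N - 2): two degrees of freedom used up by b1 and b2
s2 = (ehat @ ehat) / (N - 2)

# Estimated covariance matrix of b and the standard errors of b1, b2
cov_b = s2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_b))
print(b, s2, se)
```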
Asymptotic properties of OLS bivariate regression estimator
(Based on S&W, Chapter 17.)
Convergence in probability
o $S_N \xrightarrow{p} \mu$ if and only if $\lim_{N \to \infty} \Pr\left( |S_N - \mu| > \delta \right) = 0$ for any $\delta > 0$. Thus, for any
small value of $\delta$, we can make the probability that $S_N$ is further from $\mu$ than $\delta$
arbitrarily small by choosing N large enough.
o If $S_N \xrightarrow{p} \mu$, then we can write $\mathrm{plim}\, S_N = \mu$.
o This means that the entire probability distribution of $S_N$ converges on the value $\mu$
as N gets large.
o Estimators that converge in probability to the true parameter value are called
consistent estimators.
Convergence in distribution
o If the sequence of random variables $\{S_N\}$ has cumulative probability distributions
$F_1, F_2, \ldots, F_N, \ldots$, then $S_N \xrightarrow{d} S$ if and only if $\lim_{N \to \infty} F_N(t) = F(t)$ for all t at
which F is continuous.
o If a sequence of random variables converges in distribution to the normal
distribution, it is called asymptotically normal.
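A simulation sketch of both ideas, with hypothetical parameter values and deliberately non-Gaussian (skewed) errors: the spread of $b_2$ shrinks as N grows (consistency), and the standardized slope behaves approximately like a standard normal (asymptotic normality).

```python
import numpy as np

rng = np.random.default_rng(8)
beta1, beta2 = 1.0, 2.0                     # hypothetical true values

def simulate_b2(N, reps=5000):
    """Return `reps` OLS slope estimates from samples of size N with skewed, mean-zero errors."""
    b2 = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0.0, 10.0, size=N)
        e = rng.exponential(2.0, size=N) - 2.0      # non-Gaussian errors with mean zero
        y = beta1 + beta2 * x + e
        xd = x - x.mean()
        b2[r] = (xd * (y - y.mean())).sum() / (xd ** 2).sum()
    return b2

# Consistency: the distribution of b2 collapses on beta2 as N grows
for N in (25, 100, 400, 1600):
    draws = simulate_b2(N)
    print(N, draws.mean(), draws.std())

# Asymptotic normality: the standardized slope is roughly N(0, 1) despite the skewed errors
z = simulate_b2(400) - beta2
z = z / z.std()
print(np.mean(np.abs(z) < 1.96))            # should be close to 0.95
```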
Properties of probability limits and convergence in distribution
o Probability limits are very forgiving: Slutsky’s Theorem states that
plim (SN + RN) = plim SN + plim RN
plim (SNRN) = plim SN · plim RN
plim (SN / RN) = plim SN / plim RN, provided plim RN ≠ 0
o The continuous-mapping theorem gives us
For continuous functions g, plim g(SN) = g(plim SN)
And if $S_N \xrightarrow{d} S$, then $g(S_N) \xrightarrow{d} g(S)$.
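A small numerical illustration of the continuous-mapping idea, with a hypothetical mean $\mu = 1.5$ and $g = \exp$ chosen only for concreteness: as N grows, the sample mean settles near $\mu$, so g of the sample mean settles near $g(\mu)$.

```python
import numpy as np

rng = np.random.default_rng(9)
mu = 1.5                                    # hypothetical population mean

for N in (10, 1000, 100000):
    s_n = rng.normal(mu, 2.0, size=N).mean()      # S_N = sample mean, plim S_N = mu
    # Continuous mapping: for continuous g (here exp), g(S_N) settles near g(mu)
    print(N, s_n, np.exp(s_n), np.exp(mu))
```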