

Section 2 Simple Regression

What regression does


 Relationship between variables
o Often in economics we believe that there is a (perhaps causal) relationship
between two variables.
o Usually more than two, but that’s deferred to another day.
o We call this the economic model.
 Functional form
o Is the relationship linear?
y = β₁ + β₂x
 This is natural first assumption, unless theory rejects it.
β₂ is the slope, which determines whether the relationship between x and y is positive or negative.
β₁ is the intercept or constant term, which determines where the linear relationship intersects the y axis.
o Is it plausible that this is an exact, “deterministic” relationship?
 No. Data (almost) never fit exactly along line.
 Why?
 Measurement error (incorrect definition or mismeasurement)
 Other variables that affect y
 Relationship is not purely linear
 Relationship may be different for different observations
o So the economic model must be modeled as determining the expected value of y
E(y | x) = β₁ + β₂x : The conditional mean of y given x is β₁ + β₂x
Adding an error term for a "stochastic" relationship gives us the actual value of y: y = β₁ + β₂x + e
 Error term e captures all of the above problems.
 Error term is considered to be a random variable and is not
observed directly.
Variance of e is σ², which is the conditional variance of y given x,
the variance of the conditional distribution of y given x.
 The simplest, but not usually valid, assumption is that the
conditional variance is the same for all observations in our
sample (homoskedasticity)

dE(y | x)/dx = β₂, which means that the expected value of y increases by β₂ units when x increases by one unit
o Does it matter which variable is on the left-hand side?
 At one level, no:
x = (1/β₂)(y − β₁ − e), so
x = γ₁ + γ₂y + v, where γ₁ = −β₁/β₂, γ₂ = 1/β₂, v = −(1/β₂)e.
 For purposes of most estimators, yes:
 We shall see that a critically important assumption is that the
error term is independent of the “regressors” or exogenous
variables.
 Are the errors shocks to y for given x or shocks to x for given y?
o It might not seem like there is much difference, but the
assumption is crucial to valid estimation.
 Exogeneity: x is exogenous with respect to y if shocks to y do not affect x,
i.e., y does not cause x.
 Where do the data come from? Sample and “population”
o We observe a sample of observations on y and x.
o Depending on context these samples may be
 Drawn from a larger population, such as census data or surveys
 Generated by a specific “data-generating process” (DGP) as in time-
series observations
o We usually would like to assume that the observations in our sample are
 
statistically independent, or at least uncorrelated: cov(yᵢ, yⱼ) = 0 for i ≠ j.
o We will assume initially (for a few weeks) that the values of x are chosen as in an
experiment: they are not random.
 We will add random regressors soon and discover that they don’t change
things much as long as x is independent of e.
 Goals of regression
o True regression line: actual relationship in population or DGP
True β's and f(e | x)
 Sample of observations comes from drawing random realizations of e
from f (e|x) and plotting points appropriately above and below the true
regression line.
o We want to find an estimated regression line that comes as close to the true
regression line as possible, based on the observed sample of y and x pairs:
Estimate values of parameters β₁ and β₂

 Estimate properties of probability distribution of error term e
 Make inferences about the above estimates
 Use the estimates to make conditional forecasts of y
 Determine the statistical reliability of these forecasts

Summarizing assumptions of simple regression model


 Assumption #0: (Implicit and unstated) The model as specified applies to all units in the
population and therefore all units in the sample.
o All units in the population under consideration have the same form of the
relationship, the same coefficients, and error terms with the same properties.
o If the United States and Mali are in the population, do they really have the same
parameters?
o This assumption underlies everything we do in econometrics, and thus it must
always be considered very carefully in choosing a specification and a sample, and
in deciding for what population the results carry implications.
SR1: y = β₁ + β₂x + e
SR2: E(e) = 0, so E(y) = β₁ + β₂x
o Note that if x is random, we make these conditional expectations:
E(e | x) = 0 and E(y | x) = β₁ + β₂x
SR3: var(e) = σ² = var(y)
o If x is random, this becomes var(e | x) = σ² = var(y | x)
o We should (and will) consider the more general case in which variance varies
across observations: heteroskedasticity
    
SR4: cov(eᵢ, eⱼ) = cov(yᵢ, yⱼ) = 0 for i ≠ j
o This, too, can be relaxed: autocorrelation
 SR5: x is non-random and takes on at least two values
o We will allow random x later and see that E(e | x) = 0 implies that e must be uncorrelated with x.
SR6: (optional) e ~ N(0, σ²)
o This is convenient, but not critical since the law of large numbers assures that for
a wide variety of distributions of e, our estimators converge to normal as the
sample gets large
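To make these assumptions concrete, here is a minimal simulation sketch (not part of the notes; the parameter values β₁ = 1, β₂ = 0.5, σ = 2 are hypothetical) of a data-generating process that satisfies SR1–SR6:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameter values, for illustration only
beta1, beta2, sigma = 1.0, 0.5, 2.0

N = 50
x = np.linspace(0, 10, N)            # non-random x taking at least two values (SR5)
e = rng.normal(0.0, sigma, size=N)   # E(e) = 0, var(e) = sigma^2, normal (SR2, SR3, SR6)
y = beta1 + beta2 * x + e            # SR1: linear model with an additive error

print(y[:5])
```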

Strategies for obtaining regression estimators


 What is an estimator?

o A rule (formula) for calculating an estimate of a parameter (β₁, β₂, or σ²) based on the sample values y, x
o Estimators are often denoted by ^ over the variable being estimated: An estimator of β₂ might be denoted β̂₂
How might we estimate the β coefficients of the simple regression model?
o Three strategies:
 Method of least-squares
 Method of moments
 Method of maximum likelihood
o All three strategies with the SR assumptions lead to the same estimator rule: the
ordinary least-squares regression estimator: (b1, b2, s2)
 Method of least squares
o Estimation strategy: Make sum of squared y-deviations (“residuals”) of observed
values from the estimated regression line as small as possible.
o Given coefficient estimates b₁, b₂, residuals are defined as êᵢ = yᵢ − b₁ − b₂xᵢ
Or êᵢ = yᵢ − ŷᵢ, with ŷᵢ = b₁ + b₂xᵢ
o Why not minimize the sum of the residuals?
 We don’t want sum of residuals to be large negative number: Minimize
sum of residuals by having all residuals infinitely negative.
 Many alternative lines that make sum of residuals zero (which is
desirable) because positives and negatives cancel out.
o Why use square rather than absolute value to deal with cancellation of positives
and negatives?
 Square function is continuously differentiable; absolute value function is
not.
 Least-squares estimation is much easier than least-absolute-
deviation estimation.
 Prominence of Gaussian (normal) distribution in nature and statistical
theory focuses us on variance, which is expectation of square.
 Least-absolute-deviation estimation is occasionally done (special case of
quantile regression), but not common.
Least-absolute-deviation regression gives less importance to large outliers than least-squares because squaring gives large emphasis to residuals with large absolute value. Least squares therefore tends to draw the regression line toward these points to eliminate large squared residuals, as the sketch below illustrates.
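A minimal sketch of this contrast (simulated data with one fabricated outlier; LAD is obtained here by numerically minimizing the sum of absolute residuals rather than by a dedicated quantile-regression routine):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 40
x = np.linspace(0, 10, N)
y = 1.0 + 0.5 * x + rng.normal(0, 1, N)   # hypothetical DGP
y[-1] += 25.0                             # inject one large outlier

# OLS closed form
b2_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b1_ols = y.mean() - b2_ols * x.mean()

# Least-absolute-deviation (LAD): minimize the sum of |residuals| numerically
lad = minimize(lambda b: np.sum(np.abs(y - b[0] - b[1] * x)),
               x0=[b1_ols, b2_ols], method="Nelder-Mead")

print("OLS slope:", b2_ols, "LAD slope:", lad.x[1])   # OLS slope is pulled more by the outlier
```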
o Least-squares criterion function: S = Σᵢ êᵢ² = Σᵢ (yᵢ − b₁ − b₂xᵢ)², where all sums run over i = 1, …, N.
The least-squares estimator (b₁, b₂) is the solution to the minimization of S over b₁, b₂. Since S is a continuously differentiable function of the estimated parameters, we can differentiate and set the partial derivatives equal to zero to get the least-squares normal equations:
∂S/∂b₂ = −2 Σᵢ (yᵢ − b₁ − b₂xᵢ)xᵢ = 0,
⟹ Σᵢ yᵢxᵢ − b₁ Σᵢ xᵢ − b₂ Σᵢ xᵢ² = 0.
∂S/∂b₁ = −2 Σᵢ (yᵢ − b₁ − b₂xᵢ) = 0,
⟹ Σᵢ yᵢ − Nb₁ − b₂ Σᵢ xᵢ = 0,
⟹ ȳ − b₁ − b₂x̄ = 0,
⟹ b₁ = ȳ − b₂x̄.
Note that the b₁ condition assures that the regression line passes through the point (x̄, ȳ).
Substituting the second condition into the first divided by N:
Σᵢ yᵢxᵢ − (ȳ − b₂x̄)Nx̄ − b₂ Σᵢ xᵢ² = 0
⟹ (Σᵢ yᵢxᵢ − Nȳx̄) − b₂ (Σᵢ xᵢ² − Nx̄²) = 0
⟹ b₂ = (Σᵢ yᵢxᵢ − Nȳx̄) / (Σᵢ xᵢ² − Nx̄²) = Σᵢ (yᵢ − ȳ)(xᵢ − x̄) / Σᵢ (xᵢ − x̄)² = σ̂_XY / σ̂_X².
The b₂ estimator is the sample covariance of x and y divided by the sample variance of x.
 What happens if x is constant across all observations in our sample?
 Denominator is zero and we can’t calculate b2.
 This is our first encounter with the problem of collinearity: if x is
a constant then x is a linear combination of the “other
regressor”—the constant one that is multiplied by b1.
 Collinearity (or multicollinearity) will be more of a problem in
multiple regression. If it is extreme (or perfect), it means that we
can’t calculate the slope estimates.
o The above equations are the “ordinary least-squares” (OLS) coefficient
estimators.
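A minimal numpy sketch of these estimator formulas (the data-generating values are hypothetical and used only for illustration; np.polyfit serves only as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = np.linspace(0, 10, N)
y = 2.0 - 0.3 * x + rng.normal(0, 1.5, N)   # hypothetical true line

# OLS estimators from the normal equations
b2 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b1 = y.mean() - b2 * x.mean()

# The fitted line passes through (x-bar, y-bar)
assert np.isclose(b1 + b2 * x.mean(), y.mean())

print(b2, b1)
print(np.polyfit(x, y, 1))   # returns [slope, intercept]; should match
```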
 Method of moments
o Another general strategy for obtaining estimators is to set estimates of selected
population moments equal to their sample counterparts. This is called the
method of moments.
o In order to employ the method of moments, we have to make some specific
assumptions about the population/DGP moments.

Assume E(eᵢ) = 0 for all i. This means that the population/DGP mean of the error term is zero.
Corresponding to this assumption about the population mean of e is the sample mean condition (1/N) Σᵢ êᵢ = 0. Thus we set the sample mean to the value we have assumed for the population mean.
Assume cov(x, e) = 0, which is equivalent to E[(xᵢ − E(x))eᵢ] = 0.
Corresponding to this assumption about the population covariance between the regressor and the error term is the sample covariance condition (1/N) Σᵢ (xᵢ − x̄)êᵢ = 0. Again, we set the sample moment to the zero value that we have assumed for the population moment.
o Plugging the expression for the residual into the sample moment expressions
above:
(1/N) Σᵢ (yᵢ − b₁ − b₂xᵢ) = 0,
⟹ b₁ = ȳ − b₂x̄.
This is the same as the intercept estimate equation for the least-squares estimator above.
(1/N) Σᵢ (xᵢ − x̄)(yᵢ − b₁ − b₂xᵢ) = 0,
⟹ Σᵢ (xᵢ − x̄)(yᵢ − ȳ + b₂x̄ − b₂xᵢ) = 0,
⟹ Σᵢ (xᵢ − x̄)(yᵢ − ȳ) − b₂ Σᵢ (xᵢ − x̄)(xᵢ − x̄) = 0,
⟹ b₂ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)².
 This is exactly the same equation as for the OLS estimator.


o Thus, if we assume that E(eᵢ) = 0 for all i and cov(x, e) = 0 in the population, then
the OLS estimator can be derived by the method of moments as well.
o (Note that both of these moment conditions follow from the extended
assumption SR2 that E(e|x) = 0.)
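A quick numerical check (on hypothetical simulated data) that the OLS residuals satisfy the two sample moment conditions exactly, up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-5, 5, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 2, 200)   # hypothetical DGP

b2 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b1 = y.mean() - b2 * x.mean()
ehat = y - b1 - b2 * x

print(np.mean(ehat))                   # sample mean of residuals: ~ 0
print(np.mean((x - x.mean()) * ehat))  # sample covariance of x and residuals: ~ 0
```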
 Method of maximum likelihood
o Consider the joint probability density function of yᵢ and xᵢ, fᵢ(yᵢ, xᵢ | β₁, β₂). The function is written as conditional on the coefficients β to make explicit that the joint distribution of y and x is affected by the parameters.

 This function measures the probability density of any particular
combination of y and x values, which can be loosely thought of as how
probable that outcome is, given the parameter values.
 For a given set of parameters, some observations of y and x are less likely
than others. For example, if β₁ = 0 and β₂ < 0, then it is less likely that we
would see observations where y > 0 when x > 0, than observations with
y < 0.
o The idea of maximum-likelihood estimation is to choose a set of parameters that
makes the likelihood of observing the sample that we actually have as high as
possible.
o The likelihood function is just the joint density function turned on its head:
Lᵢ(β₁, β₂ | xᵢ, yᵢ) = fᵢ(xᵢ, yᵢ | β₁, β₂).
o If the observations are independent random draws from identical probability
distributions (they are IID), then the overall sample density (likelihood) function
is the product of the density (likelihood) function of the individual observations:
f(x₁, y₁, x₂, y₂, …, x_N, y_N | β₁, β₂) = Πᵢ fᵢ(xᵢ, yᵢ | β₁, β₂)
⟹ L(β₁, β₂ | x₁, y₁, x₂, y₂, …, x_N, y_N) = Πᵢ Lᵢ(β₁, β₂ | xᵢ, yᵢ),
where the products run over i = 1, …, N.

o If the probability distribution of e conditional on x is Gaussian (normal) with mean zero and variance σ²:
fᵢ(xᵢ, yᵢ | β₁, β₂) = Lᵢ(β₁, β₂ | xᵢ, yᵢ) = [1/√(2πσ²)] exp[−(1/(2σ²))(yᵢ − β₁ − β₂xᵢ)²]
 Because of the exponential function, Gaussian likelihood functions are
usually manipulated in logs.
 Note that because the log function is monotonic, maximizing the
log-likelihood function is equivalent to maximizing the likelihood
function itself.
For an individual observation: ln Lᵢ = −(1/2) ln(2πσ²) − (1/(2σ²))(yᵢ − β₁ − β₂xᵢ)²
Aggregating over the sample:
ln Πᵢ Lᵢ(β₁, β₂ | xᵢ, yᵢ) = Σᵢ ln Lᵢ(β₁, β₂ | xᵢ, yᵢ)
= Σᵢ [−(1/2) ln(2πσ²) − (1/(2σ²))(yᵢ − β₁ − β₂xᵢ)²]
= −(N/2) ln(2πσ²) − (1/(2σ²)) Σᵢ (yᵢ − β₁ − β₂xᵢ)².

o The only part of this expression that depends on β or on the sample is the final summation. Because of the negative sign, maximizing the likelihood function (with respect to β) is equivalent to minimizing the summation.
 But this summation is just the sum of squared residuals that we
minimized in OLS.
o Thus, OLS is MLE if the distribution of e conditional on x is Gaussian with
mean zero and constant variance σ², and if the observations are IID.
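A sketch of this equivalence (hypothetical simulated data; the Gaussian log-likelihood is maximized numerically with scipy rather than analytically):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
N = 200
x = np.linspace(0, 10, N)
y = 1.0 + 0.5 * x + rng.normal(0, 2, N)   # hypothetical Gaussian DGP

def neg_loglike(theta):
    b1, b2, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    resid = y - b1 - b2 * x
    # Negative of the Gaussian log-likelihood summed over observations
    return 0.5 * N * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

mle = minimize(neg_loglike, x0=[0.0, 0.0, 0.0], method="BFGS")

# Closed-form OLS for comparison
b2_ols = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b1_ols = y.mean() - b2_ols * x.mean()
print(mle.x[:2], (b1_ols, b2_ols))   # coefficient estimates coincide (up to optimizer tolerance)
```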
 Evaluating alternative estimators (not important for comparison here since all three are
same, but are they any good?)
o Desirable criteria
Unbiasedness: estimator is on average equal to the true value: E(β̂) = β
Small variance: estimator is usually close to its expected value: var(β̂) = E[(β̂ − E(β̂))²]
Small RMSE can balance variance with bias: RMSE = √MSE, where MSE = E[(β̂ − β)²]
o We will talk about BLUE estimators as minimum variance within the class of
unbiased estimators.

Sampling distribution of OLS estimators


 b1 and b2 are random variables: they are functions of the random variables y and e.
o We can think of the probability distribution of b as occurring over repeated
random samples from the underlying population or DGP.
 In many (most) cases, we cannot derive the distribution of an estimator theoretically, but
must rely on Monte Carlo simulation to estimate it. (See below)
o Because OLS estimator (under our assumptions) is linear, we can derive its
distribution.

We can write the OLS slope estimator as
b₂ = [(1/N) Σᵢ (yᵢ − ȳ)(xᵢ − x̄)] / [(1/N) Σᵢ (xᵢ − x̄)²]
= [(1/N) Σᵢ (β₁ + β₂xᵢ + eᵢ − ȳ)(xᵢ − x̄)] / [(1/N) Σᵢ (xᵢ − x̄)²]
= [(1/N) Σᵢ (β₁ + β₂xᵢ + eᵢ − β₁ − β₂x̄)(xᵢ − x̄)] / [(1/N) Σᵢ (xᵢ − x̄)²]
= [(1/N) Σᵢ (β₂(xᵢ − x̄) + eᵢ)(xᵢ − x̄)] / [(1/N) Σᵢ (xᵢ − x̄)²]
= β₂ + [(1/N) Σᵢ eᵢ(xᵢ − x̄)] / [(1/N) Σᵢ (xᵢ − x̄)²].
The third step uses the property ȳ = β₁ + β₂x̄, since the expected value of e is zero.
 For now, we are assuming that x is non-random, as in a controlled experiment.
o If x is fixed, then the only part of the formula above that is random is e.
o The formula shows that the slope estimate is linear in e.
o This means that if e is Gaussian, then the slope estimate will also be Gaussian.
 Even if e is not Gaussian, the slope estimate will converge to a Gaussian
distribution as long as some modest assumptions about its distribution
are satisfied.
o Because all the x variables are non-random, they can come outside when we take expectations, so
E(b₂) = β₂ + E{[(1/N) Σᵢ (xᵢ − x̄)eᵢ] / [(1/N) Σᵢ (xᵢ − x̄)²]} = β₂ + [(1/N) Σᵢ (xᵢ − x̄)E(eᵢ)] / [(1/N) Σᵢ (xᵢ − x̄)²] = β₂.
o What about the variance of b2?
 We will do the details of the analytical work in matrix form because it’s
easier

var(b₂) = E[(b₂ − β₂)²]
= E{[(1/N) Σᵢ (xᵢ − x̄)eᵢ]² / [(1/N) Σᵢ (xᵢ − x̄)²]²}
= σ² / Σᵢ (xᵢ − x̄)².
HGL equations 2.14 and 2.16 provide formulas for the variance of b₁ and the covariance between the coefficients:
var(b₁) = σ² Σᵢ xᵢ² / [N Σᵢ (xᵢ − x̄)²]
cov(b₁, b₂) = −σ² x̄ / Σᵢ (xᵢ − x̄)²
Note that the covariance between the slope and intercept estimators is negative (when x̄ > 0): overestimating one will tend to cause us to underestimate the other
 What determines the variance of b?
Smaller variance of error → more precise estimators
Larger number of observations → more precise estimators
More dispersion of observations around mean → more precise estimators
 What do we know about the overall probability distribution of b?
 If assumption SR6 is satisfied and e is normal, then b is also
normal because it is a linear function of the e variables and linear
functions of normally distributed variables are also normally
distributed.
 If assumption SR6 is not satisfied, then b converges to a normal
distribution as N →∞ provided some weak conditions on the
distribution of e are satisfied.
 These expressions are the true variance/covariance of the estimated coefficient
vector. However, because we do not know 2, it is not of practical use to us. We
need an estimator for 2 in order to calculate a standard error of the coefficients:
an estimate of their standard deviation.
The required estimate in the classical case is s² = [1/(N − 2)] Σᵢ êᵢ².
 We divide by N – 2 because this is the number of “degrees of
freedom” in our regression.
Degrees of freedom are a very important issue in econometrics. The term refers to how many data points are available in excess of the minimum number required to estimate the model.
 In this case, it takes minimally two points to define a line, so the
smallest possible number of observations for which we can fit a
bivariate regression is 2. Any observations beyond 2 make it
(generally) impossible to fit a line perfectly through all observations.
Thus, N – 2 is the number of degrees of freedom in the sample.
 We always divide sums of squared residuals by the number of degrees
of freedom in order to get unbiased variance estimates.
o For example, in calculating the sample variance, we use s² = [1/(N − 1)] Σᵢ (zᵢ − z̄)² because there are N − 1 degrees of freedom left after using one to calculate the mean.
o Here, we have two coefficients to estimate, not just one, so we divide by N − 2.
 The standard error of each coefficient is the square root of the
corresponding diagonal element of that estimated covariance matrix.
Note that the HGL text uses an alternative formula based on σ̂² = (1/N) Σᵢ êᵢ².
o This estimator for σ² is biased because there are only N − 2 degrees of freedom in the N residuals; 2 are used up in estimating the 2 β parameters.
o In large samples they are equivalent.
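The Monte Carlo simulation mentioned earlier can be used to check these results numerically. A sketch (hypothetical parameter values; x held fixed across replications, as under SR5):

```python
import numpy as np

rng = np.random.default_rng(4)
beta1, beta2, sigma = 1.0, 0.5, 2.0     # hypothetical true parameters
N, reps = 50, 5000
x = np.linspace(0, 10, N)               # fixed regressors across replications

b2_draws = np.empty(reps)
for r in range(reps):
    e = rng.normal(0, sigma, N)
    y = beta1 + beta2 * x + e
    b2_draws[r] = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

print(b2_draws.mean())                            # close to beta2 (unbiasedness)
print(b2_draws.var())                             # close to the theoretical variance below
print(sigma ** 2 / np.sum((x - x.mean()) ** 2))   # sigma^2 / sum of (x_i - x-bar)^2
```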

How good is the OLS estimator?


 Is OLS the best estimator? Under what conditions?
 Under “classical” regression assumptions SR1–SR5 (but not necessarily SR6) the Gauss-
Markov Theorem shows that the OLS estimator is BLUE.
o Any other estimator that is unbiased and linear in e has higher variance than b.
o Note that (5, 0) is an estimator with zero variance, but it is biased in the general
case.
 Violation of any of the SR1–SR5 assumptions usually means that there is a better
estimator.
Least-squares regression model in matrix notation
(From Griffiths, Hill, and Judge, Section 5.4)
We can write the ith observation of the bivariate linear regression model as yᵢ = β₁ + β₂xᵢ + eᵢ.
Arranging the N observations vertically gives us N such equations:
y₁ = β₁ + β₂x₁ + e₁,
y₂ = β₁ + β₂x₂ + e₂,
⋮
y_N = β₁ + β₂x_N + e_N.
 This is a system of linear equations that can be conveniently rewritten in matrix form.
There is no real need for the matrix representation with only one regressor because the
equations are simple, but when we add regressors the matrix notation is more useful.
o Let y be an N × 1 column vector: y = [y₁, y₂, …, y_N]′.
o Let X be an N × 2 matrix whose ith row is (1, xᵢ): a column of ones followed by the column of x values.
o β is a 2 × 1 column vector of coefficients: β = [β₁, β₂]′.
o And e is an N × 1 column vector of the error terms: e = [e₁, e₂, …, e_N]′.
o Then y = Xβ + e expresses the system of N equations very compactly.
o (Write out matrices and show how multiplication works for a single observation.)
In matrix notation, ê = y − Xb is the vector of residuals.

Summing the squares of the elements of a column vector in matrix notation is just the inner product: Σᵢ êᵢ² = ê′ê, where the prime denotes the matrix transpose. Thus we want to minimize this expression for least squares.

o ê′ê = (y − Xb)′(y − Xb) = (y′ − b′X′)(y − Xb) = y′y − 2b′X′y + b′X′Xb.
o Differentiating with respect to the coefficient vector and setting to zero yields −2X′y + 2X′Xb = 0, or X′Xb = X′y.
o Pre-multiplying by the inverse of X′X yields the OLS coefficient formula: b = (X′X)⁻¹X′y. (This is one of the few formulas that you need to memorize.)
Note the symmetry between the matrix formula and the scalar formula. X′y is the sum of the cross product of the two variables and X′X is the sum of squares of the regressors. The former is in the numerator (and not inverted) and the latter is in the denominator (and inverted).
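A matrix-form sketch of the estimator on simulated data (hypothetical parameter values; np.linalg.solve is applied to the normal equations X′Xb = X′y, which is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100
x = np.linspace(0, 10, N)
y = 1.0 + 0.5 * x + rng.normal(0, 2, N)   # hypothetical DGP

X = np.column_stack([np.ones(N), x])      # N x 2 regressor matrix [1, x]
b = np.linalg.solve(X.T @ X, X.T @ y)     # solves (X'X) b = X'y, i.e. b = (X'X)^{-1} X'y
print(b)                                  # [b1, b2]
```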
In matrix notation, we can express our estimator in terms of e as
b = (X′X)⁻¹X′y
= (X′X)⁻¹X′(Xβ + e)
= (X′X)⁻¹X′Xβ + (X′X)⁻¹X′e
= β + (X′X)⁻¹X′e.
o When x is non-stochastic, the covariance matrix of the coefficient estimator is also easy to compute under the OLS assumptions.
Covariance matrices: The covariance of a vector random variable is a matrix with variances on the diagonal and covariances on the off-diagonals. For an M × 1 vector random variable z, the covariance matrix is the expectation of the following outer product:
cov(z) = E[(z − Ez)(z − Ez)′]
= [ E(z₁ − Ez₁)²               E[(z₁ − Ez₁)(z₂ − Ez₂)]    ⋯  E[(z₁ − Ez₁)(z_M − Ez_M)]
    E[(z₁ − Ez₁)(z₂ − Ez₂)]    E(z₂ − Ez₂)²               ⋯  E[(z₂ − Ez₂)(z_M − Ez_M)]
    ⋮                           ⋮                           ⋱  ⋮
    E[(z₁ − Ez₁)(z_M − Ez_M)]  E[(z₂ − Ez₂)(z_M − Ez_M)]  ⋯  E(z_M − Ez_M)² ].
In our regression model, if e is IID with mean zero and variance σ², then Ee = 0 and cov(e) = E(ee′) = σ²I_N, with I_N being the order-N identity matrix.

We can then compute the covariance matrix of the (unbiased) estimator as
cov(b) = E[(b − β)(b − β)′]
= E{[(X′X)⁻¹X′e][(X′X)⁻¹X′e]′}
= E[(X′X)⁻¹X′ee′X(X′X)⁻¹]
= (X′X)⁻¹X′E(ee′)X(X′X)⁻¹
= σ²(X′X)⁻¹X′X(X′X)⁻¹ = σ²(X′X)⁻¹.
What happens to var(bᵢ) as N gets large? The summations in X′X have additional terms, so they get larger. This means that the inverse matrix gets "smaller" and the variance decreases: more observations imply more accurate estimators.
Note that the variance also increases as the variance of the error term goes up. A more imprecise fit implies less precise coefficient estimates.
Our estimated covariance matrix of the coefficients is then s²(X′X)⁻¹.
The (2, 2) element of this matrix is
s² [1 / Σᵢ (xᵢ − x̄)²] = [(1/(N − 2)) Σᵢ êᵢ²] / Σᵢ (xᵢ − x̄)².
 This is the formula we calculated in class for the scalar system.


Thus, to summarize, when the classical assumptions hold and e is normally distributed, b ~ N(β, σ²(X′X)⁻¹).
Asymptotic properties of OLS bivariate regression estimator
(Based on S&W, Chapter 17.)

 Convergence in probability (probability limits)


o Assume that S1, S2, …, SN, … is a sequence of random variables.
 In practice, they are going to be estimators based on 1, 2, …, N
observations.

o SN →p μ if and only if lim(N→∞) Pr(|SN − μ| > ε) = 0 for any ε > 0. Thus, for any small value of ε, we can make the probability that SN is further from μ than ε arbitrarily small by choosing N large enough.
o If SN →p μ, then we can write plim SN = μ.
o This means that the entire probability distribution of SN converges on the value μ as N gets large.
o Estimators that converge in probability to the true parameter value are called
consistent estimators.
 Convergence in distribution
o If the sequence of random variables {SN} has cumulative probability distributions F1, F2, …, FN, …, then SN →d S if and only if lim(N→∞) FN(t) = F(t) for all t at which F is continuous.
o If a sequence of random variables converges in distribution to the normal
distribution, it is called asymptotically normal.
 Properties of probability limits and convergence in distribution
o Probability limits are very forgiving: Slutsky’s Theorem states that
 plim (SN + RN) = plim SN + plim RN
 plim (SNRN) = plim SN · plim RN
plim (SN / RN) = plim SN / plim RN (provided plim RN ≠ 0)
o The continuous-mapping theorem gives us
For continuous functions g, plim g(SN) = g(plim SN)
And if SN →d S, then g(SN) →d g(S).
o Further, we can combine probability limits and convergence in distribution to get
If plim aN = a and SN →d S, then
aN SN →d aS
aN + SN →d a + S
SN / aN →d S / a
These results are very useful because they mean that asymptotically we can treat any consistent estimator as a constant equal to the true value.
 Central limit theorems
o There are several versions with slightly different conditions.
o Basic result: If {SN} is a sequence of estimators of μ, then for a wide variety of underlying distributions, √N (SN − μ) →d N(0, σ²), where σ² is the variance of the underlying statistic.
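A tiny simulation of this basic result, using a skewed exponential distribution (chosen arbitrarily for illustration) as the underlying population:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, N, reps = 1.0, 200, 10000            # exponential(1) has mean 1 and variance 1

# sqrt(N) * (sample mean - mu) across repeated samples
z = [np.sqrt(N) * (rng.exponential(mu, N).mean() - mu) for _ in range(reps)]
print(np.mean(z), np.var(z))             # roughly 0 and 1, as N(0, sigma^2) predicts
```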
 Applying asymptotic theory to the OLS model
o Under more general conditions than the ones that we have typically assumed (including, specifically, the finite-kurtosis assumption, but not the homoskedasticity assumption or the assumption of fixed regressors), the OLS estimator satisfies the conditions for consistency and asymptotic normality.
o √N (b₂ − β₂) →d N(0, var[(xᵢ − E(xᵢ))eᵢ] / [var(xᵢ)]²). This is the general case with heteroskedasticity.
With homoskedasticity, the variance reduces to the usual formula:
√N (b₂ − β₂) →d N(0, σ² / var(xᵢ)).
o plim σ̂²_b₂ = σ²_b₂, as proven in Section 17.3.
o t = (b₂ − β₂) / s.e.(b₂) →d N(0, 1).
 Choice for t statistic:
o If homoskedastic, normal error term, then exact distribution is tN–2.
o If heteroskedastic or non-normal error (with finite 4th moment), then exact
distribution is unknown, but asymptotic distribution is normal
o Which is more reasonable for any given application?
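A sketch comparing the two variance estimates on a deliberately heteroskedastic simulated sample (hypothetical DGP); the robust formula used here is the White/HC0 form of the heteroskedasticity-consistent variance discussed above:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 500
x = np.linspace(1, 10, N)
e = rng.normal(0, 0.5 * x)                 # heteroskedastic errors: sd grows with x
y = 1.0 + 0.5 * x + e                      # hypothetical DGP

X = np.column_stack([np.ones(N), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
ehat = y - X @ b

XtX_inv = np.linalg.inv(X.T @ X)
se_classical = np.sqrt(np.diag(ehat @ ehat / (N - 2) * XtX_inv))

# White/HC0 covariance: (X'X)^{-1} X' diag(ehat^2) X (X'X)^{-1}
meat = X.T @ (X * ehat[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_classical, se_robust)             # classical SEs are misleading when errors are heteroskedastic
```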

Linearity and nonlinearity


The OLS estimator is a linear estimator because b is linear in e (which is because y is linear in β), not because y is linear in x.
 OLS can easily handle nonlinear relationships between y and x.
o ln y = β₁ + β₂x
o y = β₁ + β₂x²
o etc.
 Dummy (indicator) variables take the value zero or one.
o Example: MALE = 1 if male and 0 if female.
o yᵢ = β₁ + β₂MALEᵢ + eᵢ
For females, E(y | MALE) = β₁
For males, E(y | MALE) = β₁ + β₂
Thus, β₂ is the difference in the expected value of y between males and females.
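A sketch confirming the dummy-variable interpretation on simulated data (hypothetical group means): the intercept estimate equals the female sample mean and the slope estimate equals the difference in sample means.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 1000
male = rng.integers(0, 2, N)                   # MALE dummy: 0 or 1
y = 10.0 + 2.5 * male + rng.normal(0, 3, N)    # hypothetical beta1 = 10, beta2 = 2.5

X = np.column_stack([np.ones(N), male])
b1, b2 = np.linalg.solve(X.T @ X, X.T @ y)

print(b1, y[male == 0].mean())        # b1 equals the female sample mean
print(b1 + b2, y[male == 1].mean())   # b1 + b2 equals the male sample mean
```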
