
Regression Model Assumptions

Asad Dossani

Fin 625: Quantitative Methods in Finance

1 / 68
Regression Model Assumptions

E(ut) = 0
Var(ut) = σ² < ∞
Cov(ui, uj) = 0 for i ≠ j
Cov(ut, xt) = 0
ut ∼ N(0, σ²)

2 / 68
Statistical Distributions of Diagnostic Tests

Lagrange Multiplier (LM) tests follow a χ² distribution with degrees of
freedom equal to the number of restrictions placed on the model, denoted
by m. The Wald test follows an F distribution with (m, T − k) degrees of
freedom. Asymptotically, the two are equivalent, though their results may
differ in small samples.

F(m, T − k) → χ²(m)/m as T → ∞
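As a quick numerical illustration (a sketch assuming SciPy is available; the
values of m and k are hypothetical), m times the F critical value approaches
the χ² critical value as T grows:

```python
# Illustrative check that m * F(m, T - k) critical values approach the
# chi-squared(m) critical value as T grows. m and k are hypothetical.
from scipy import stats

m, k = 5, 3
chi2_crit = stats.chi2.ppf(0.95, m)

for T in (30, 100, 1000, 100000):
    f_crit = stats.f.ppf(0.95, m, T - k)
    print(f"T = {T:>6}: m * F crit = {m * f_crit:.3f}   chi2 crit = {chi2_crit:.3f}")
```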

3 / 68
Assumption 1: E(ut ) = 0

If a constant term is included in the regression, this assumption is never
violated. If a constant is not included, the estimates of the slope
coefficients could be biased. In addition, R² is not a meaningful statistic,
since ȳ is not necessarily equal to the mean of the fitted values ŷ.

4 / 68
Inclusion of a Constant Term

5 / 68
Assumption 2: Var(ut ) = σ 2 < ∞

Homoskedasticity is the assumption that the variance of the error term is
constant. If the errors do not have a constant variance, they are said to
be heteroskedastic.

6 / 68
Heteroskedasticity

7 / 68
Heteroskedasticity

The OLS estimators in the presence of heteroskedasticity are still unbiased
and consistent, but are no longer BLUE (best linear unbiased estimators).
The standard errors of the OLS estimator are no longer correct. If the
variance of the errors is positively related to the square of an explanatory
variable, the OLS standard errors for the slope coefficient will be too low,
and vice versa.

8 / 68
White’s (1980) Test for Heteroskedasticity

We first estimate the regression model using OLS. Next, we collect the
residuals ût and regress the squared residuals on a constant, the original
explanatory variables, the squares of the explanatory variables, and their
cross products. This is called an auxiliary regression.

yt = β1 + β2 x2t + β3 x3t + ut
ût² = α1 + α2 x2t + α3 x3t + α4 x2t² + α5 x3t² + α6 x2t x3t + vt

9 / 68
White’s (1980) Test for Heteroskedasticity

If one or more of the coefficients in the model is statistically significant,
the R² will be relatively high, and vice versa. Under the null hypothesis of
homoskedasticity, TR² ∼ χ²(m), where T is the sample size and m is the number
of regressors in the auxiliary regression, excluding the constant. If the
test statistic is greater than the critical value, we reject the null
hypothesis of homoskedasticity, and vice versa.

10 / 68
White’s (1980) Test for Heteroskedasticity

Suppose the auxiliary regression R² = 0.05 and T = 120. Perform the test at
the 5% level of significance.

test statistic = TR²
test statistic = (120)(0.05)
test statistic = 6.0
χ²_0.05(5) = 11.07

Since the test statistic is less than the critical value, we do not reject
the null hypothesis.
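A minimal Python sketch of the full procedure on simulated data (the data
generating process and variable names are illustrative, not part of the
worked example above):

```python
# Sketch of White's test: OLS residuals are squared and regressed on levels,
# squares, and cross products; T * R^2 is compared with a chi-squared critical
# value. The simulated heteroskedastic data are purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T = 120
x2, x3 = rng.normal(size=T), rng.normal(size=T)
u = rng.normal(size=T) * (1 + np.abs(x2))        # variance depends on x2
y = 1.0 + 0.5 * x2 - 0.3 * x3 + u

def ols_resid(y, X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ beta

uhat = ols_resid(y, np.column_stack([np.ones(T), x2, x3]))

# Auxiliary regression of the squared residuals
Z = np.column_stack([np.ones(T), x2, x3, x2**2, x3**2, x2 * x3])
vhat = ols_resid(uhat**2, Z)
r2_aux = 1 - vhat @ vhat / np.sum((uhat**2 - np.mean(uhat**2)) ** 2)

stat = T * r2_aux                                # T * R^2
crit = stats.chi2.ppf(0.95, Z.shape[1] - 1)      # m = 5, excluding the constant
print(f"TR^2 = {stat:.2f}, critical value = {crit:.2f}")
```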

11 / 68
Solutions to Heteroskedasticity

One solution is to estimate the regression using OLS and correct the
standard errors. White's (1980) heteroskedasticity-consistent (robust)
standard errors are given by:

Var(β̂) = (X′X)⁻¹ (X′ΣX) (X′X)⁻¹

Σ = diag(û1², û2², . . . , ûT²)
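A short sketch of the sandwich computed directly with NumPy (the data are
simulated; in statsmodels the same correction corresponds to the
cov_type="HC0" option):

```python
# White's heteroskedasticity-consistent covariance, computed directly:
# Var(beta_hat) = (X'X)^-1 (X' diag(uhat^2) X) (X'X)^-1.  Simulated data.
import numpy as np

rng = np.random.default_rng(1)
T = 200
x = rng.normal(size=T)
u = rng.normal(size=T) * (1 + x**2)          # error variance rises with x^2
y = 2.0 + 1.0 * x + u
X = np.column_stack([np.ones(T), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
uhat = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * uhat[:, None] ** 2)        # X' diag(uhat^2) X
cov_robust = XtX_inv @ meat @ XtX_inv        # the sandwich estimator
print("robust standard errors:", np.sqrt(np.diag(cov_robust)))
```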

12 / 68
Generalized Least Squares

Another solution is to transform the data so that the errors are no longer
heteroskedastic. This is known as generalized least squares (GLS). Suppose
the variance is related to a known variable zt:

yt = β1 + β2 x2t + β3 x3t + ut
Var(ut ) = σ 2 zt2

13 / 68
Generalized Least Squares

We can divide the regression equation by zt and estimate the transformed
regression by OLS. The errors of the transformed data are homoskedastic.
GLS is essentially OLS applied to transformed data.

yt/zt = β1 (1/zt) + β2 (x2t/zt) + β3 (x3t/zt) + ut/zt

Var(ut/zt) = σ² zt²/zt² = σ²

In practice, we may not know the functional form of heteroskedasticity. We
can then use a feasible version of GLS, where we estimate the functional
form of heteroskedasticity.
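A minimal sketch of the transformation when zt is known (simulated data;
every column, including the constant, is divided by zt before running OLS):

```python
# GLS by transformation: when Var(u_t) = sigma^2 * z_t^2 with z_t known,
# divide y_t and every regressor (including the constant) by z_t and run OLS.
import numpy as np

rng = np.random.default_rng(2)
T = 300
x2, x3 = rng.normal(size=T), rng.normal(size=T)
z = np.exp(rng.normal(size=T))               # known positive scale variable
u = z * rng.normal(size=T)                   # Var(u_t) proportional to z_t^2
y = 1.0 + 0.5 * x2 - 0.2 * x3 + u

X = np.column_stack([np.ones(T), x2, x3])
beta_gls = np.linalg.lstsq(X / z[:, None], y / z, rcond=None)[0]
print("GLS estimates:", beta_gls)
```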

14 / 68
Assumption 3: Cov(ui, uj) = 0 for i ≠ j

If the errors are correlated with one another, they are said to be
autocorrelated, or serially correlated. Since the population disturbances
cannot be observed, tests for autocorrelation are conducted on the
residuals ût.

15 / 68
Lagged Values and First Differences

The lagged value of a variable is the value that the variable took
during the previous period. The value of yt lagged one period is
yt−1 . The value of yt lagged p periods is yt−p .

The first difference of yt is given by ∆yt :

∆yt = yt − yt−1

16 / 68
Graphical Tests for Autocorrelation

To test for autocorrelation, we investigate whether there is any
relationship between ût and its previous values ût−1, ût−2, . . . . To start,
we can consider possible relationships between ût and ût−1. We can plot ût
against ût−1, and plot ût over time.

17 / 68
Positive Autocorrelation

18 / 68
Positive Autocorrelation

19 / 68
Negative Autocorrelation

20 / 68
Negative Autocorrelation

21 / 68
No Autocorrelation

22 / 68
No Autocorrelation

23 / 68
Durbin-Watson (1951) Test

The Durbin-Watson (DW) test is a test for first order autocorrelation, i.e.
the relationship between an error term and its previous value. We can
motivate the test using the following regression:

ut = ρ ut−1 + vt
vt ∼ N(0, σv²)
H0 : ρ = 0
H1 : ρ ≠ 0

24 / 68
Durbin-Watson (1951) Test

In practice, we don’t need to run the regression since the test statistic
can be calculated using quantities available after the regression has
been run.

DW = Σ_{t=2}^{T} (ût − ût−1)² / Σ_{t=2}^{T} ût²

DW ≈ 2(1 − ρ̂)
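A minimal sketch of the statistic computed from residuals (the AR(1)
residual series is simulated; statsmodels also ships a durbin_watson
helper for the same quantity):

```python
# Durbin-Watson statistic from residuals, compared with 2 * (1 - rho_hat).
# The AR(1) residual series below is simulated for illustration.
import numpy as np

def durbin_watson(uhat):
    return np.sum(np.diff(uhat) ** 2) / np.sum(uhat[1:] ** 2)

rng = np.random.default_rng(3)
u = np.zeros(500)
for t in range(1, 500):
    u[t] = 0.6 * u[t - 1] + rng.normal()     # positively autocorrelated

rho_hat = np.sum(u[1:] * u[:-1]) / np.sum(u[1:] ** 2)
print(durbin_watson(u), 2 * (1 - rho_hat))   # both well below 2
```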

25 / 68
Durbin-Watson (1951) Test

DW = [ Σ_{t=2}^{T} ût² + Σ_{t=2}^{T} ût−1² − 2 Σ_{t=2}^{T} ût ût−1 ] / Σ_{t=2}^{T} ût²

   ≈ [ 2 Σ_{t=2}^{T} ût² − 2 Σ_{t=2}^{T} ût ût−1 ] / Σ_{t=2}^{T} ût²   as T → ∞

   ≈ 2 ( 1 − Σ_{t=2}^{T} ût ût−1 / Σ_{t=2}^{T} ût² )

   ≈ 2(1 − ρ̂)

26 / 68
Durbin-Watson (1951) Test

−1 ≤ ρ̂ ≤ 1 → 0 ≤ DW ≤ 4
ρ̂ = 0 → DW = 2 (no autocorrelation)
ρ̂ = 1 → DW = 0 (perfect positive autocorrelation)
ρ̂ = −1 → DW = 4 (perfect negative autocorrelation)

For the DW test to be valid, the regression must have a constant, no lags
of the dependent variable, and the regressors must be non-stochastic.

27 / 68
Durbin-Watson (1951) Test

The DW test does not follow a standard distribution. DW has two critical
values, dU (upper) and dL (lower). There is an intermediate region where
the null hypothesis can neither be rejected nor not rejected.

28 / 68
Breusch-Godfrey Test for Autocorrelation

The Breusch-Godfrey test is a joint test for autocorrelation that examines
the relationship between ût and several of its lagged values at the same
time.

ut = ρ1 ut−1 + ρ2 ut−2 + · · · + ρr ut−r + vt
vt ∼ N(0, σv²)
H0 : ρ1 = 0 and ρ2 = 0 and . . . and ρr = 0
H1 : ρ1 ≠ 0 or ρ2 ≠ 0 or . . . or ρr ≠ 0

29 / 68
Breusch-Godfrey Test for Autocorrelation

First, we estimate a linear regression using OLS and obtain the residuals
ût. Second, we regress ût on all regressors and r lagged values of the
residuals. This is the auxiliary regression. Suppose the regressors are a
constant, x2t, x3t, and x4t.

ût = γ1 + γ2 x2t + γ3 x3t + γ4 x4t + ρ1 ût−1 + ρ2 ût−2 + · · · + ρr ût−r + vt

30 / 68
Breusch-Godfrey Test for Autocorrelation

We obtain the R² from the auxiliary regression. Let T denote the number of
observations. The test statistic is given by:

(T − r)R² ∼ χ²(r)

If the test statistic exceeds the critical value, we reject the null
hypothesis of no autocorrelation, and vice versa.

31 / 68
Breusch-Godfrey Test for Autocorrelation

Suppose the auxiliary regression R² = 0.10, T = 120, and the number of lags
is r = 3. Perform the test at the 5% level of significance.

test statistic = (T − r)R²
test statistic = (120 − 3)(0.10)
test statistic = 11.7
χ²_0.05(3) = 7.81

Since the test statistic is greater than the critical value, we reject
the null hypothesis.
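A sketch of the auxiliary regression behind this statistic (uhat and X are
assumed to come from a prior OLS fit; setting the pre-sample lagged
residuals to zero is one common convention):

```python
# Breusch-Godfrey sketch: regress residuals on the original regressors plus
# r lagged residuals, then compare (T - r) * R^2 with a chi-squared(r) value.
# uhat and X are assumed to come from an earlier OLS fit; pre-sample lagged
# residuals are set to zero.
import numpy as np
from scipy import stats

def breusch_godfrey(uhat, X, r, level=0.95):
    T = len(uhat)
    lags = np.column_stack(
        [np.concatenate([np.zeros(l), uhat[:-l]]) for l in range(1, r + 1)]
    )
    Z = np.column_stack([X, lags])
    gamma = np.linalg.lstsq(Z, uhat, rcond=None)[0]
    resid = uhat - Z @ gamma
    r2 = 1 - resid @ resid / np.sum((uhat - uhat.mean()) ** 2)
    return (T - r) * r2, stats.chi2.ppf(level, r)

# Usage: stat, crit = breusch_godfrey(uhat, X, r=3)
```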

32 / 68
Cochrane-Orcutt Procedure

If the form of autocorrelation is known, one approach is to use a GLS
procedure, known as the Cochrane-Orcutt procedure. Suppose we regress yt
on a constant, x2t, and x3t. We assume a particular functional form for
the structure of the autocorrelation:

yt = β1 + β2 x2t + β3 x3t + ut
ut = ρut−1 + vt

33 / 68
Cochrane-Orcutt Procedure

Suppose we lag the regression equation, multiply it by ρ, and subtract it
from the original regression equation.

yt = β1 + β2 x2t + β3 x3t + ut
ρ yt−1 = ρ β1 + ρ β2 x2t−1 + ρ β3 x3t−1 + ρ ut−1
yt − ρ yt−1 = β1 (1 − ρ) + β2 (x2t − ρ x2t−1) + β3 (x3t − ρ x3t−1) + (ut − ρ ut−1)

34 / 68
Cochrane-Orcutt Procedure

yt − ρ yt−1 = β1 (1 − ρ) + β2 (x2t − ρ x2t−1) + β3 (x3t − ρ x3t−1) + (ut − ρ ut−1)

yt∗ = yt − ρ yt−1
β1∗ = β1 (1 − ρ)
x2t∗ = x2t − ρ x2t−1
x3t∗ = x3t − ρ x3t−1
vt = ut − ρ ut−1

yt∗ = β1∗ + β2 x2t∗ + β3 x3t∗ + vt

We can now estimate the transformed regression using OLS because vt is not
autocorrelated.
35 / 68
Cochrane-Orcutt Procedure

In practice, we do not know ρ, so we have to estimate it. This is called
feasible GLS. We first estimate the original regression using OLS and
obtain the fitted residuals ût. We then run the following regression:

ût = ρ ût−1 + vt

Using the estimated ρ̂, we run the GLS regression. We can additionally
obtain better estimates by going through the process multiple times. After
running the first GLS regression, we again correct it for autocorrelation
to obtain a new estimate of ρ̂. This procedure is repeated until the change
in ρ̂ from one iteration to the next is sufficiently small.
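A sketch of the iterative loop (X is assumed to contain a constant column;
the tolerance and iteration cap are arbitrary choices):

```python
# Iterative Cochrane-Orcutt (feasible GLS) sketch: estimate rho from the
# residuals, quasi-difference the data, re-estimate, and repeat until rho
# stabilizes. X is assumed to include a constant column.
import numpy as np

def cochrane_orcutt(y, X, tol=1e-6, max_iter=100):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rho = 0.0
    for _ in range(max_iter):
        uhat = y - X @ beta
        rho_new = np.sum(uhat[1:] * uhat[:-1]) / np.sum(uhat[:-1] ** 2)
        y_star = y[1:] - rho_new * y[:-1]        # quasi-differenced data
        X_star = X[1:] - rho_new * X[:-1]        # constant column becomes 1 - rho
        beta = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return beta, rho
```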

36 / 68
Cochrane-Orcutt Procedure
The Cochrane-Orcutt procedure requires a specific assumption regarding the
form of autocorrelation. Suppose we move ρ yt−1 to the right hand side of
the regression equation.

yt = β1 (1 − ρ) + β2 (x2t − ρ x2t−1) + β3 (x3t − ρ x3t−1) + ρ yt−1 + vt
yt = β1 (1 − ρ) + β2 x2t − ρ β2 x2t−1 + β3 x3t − ρ β3 x3t−1 + ρ yt−1 + vt
yt = γ1 + γ2 x2t + γ3 x2t−1 + γ4 x3t + γ5 x3t−1 + γ6 yt−1 + vt

We could estimate an equation containing the same variables using OLS. The
Cochrane-Orcutt procedure is a restricted version of the OLS regression,
with γ2 γ6 = −γ3 and γ4 γ6 = −γ5.
37 / 68
Cochrane-Orcutt Procedure

These are known as common factor restrictions and should be tested before
the Cochrane-Orcutt or a similar procedure is applied. In practice, these
restrictions may be invalid, and a dynamic model should be used instead. A
dynamic model means we simply estimate the unrestricted model using OLS.

38 / 68
Newey-West Variance Covariance Estimator

An alternative approach to dealing with autocorrelation is to estimate the
model using OLS and correct the standard errors. The Newey-West variance
covariance estimator is consistent in the presence of both autocorrelation
and heteroskedasticity. It requires specification of a truncation lag
length L to determine the number of lagged residuals used to evaluate the
autocorrelation.

Var(β̂) = (1/T) Σ_{t=1}^{T} ut² xt xt′ + (1/T) Σ_{l=1}^{L} wl Σ_{t=l+1}^{T} ut ut−l (xt xt−l′ + xt−l xt′)

wl = 1 − l/(L + 1)
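In practice the correction is usually requested from the estimation library
rather than coded by hand; a sketch with statsmodels (simulated data, and
the truncation lag is an arbitrary choice):

```python
# Newey-West (HAC) standard errors via statsmodels; the data are simulated
# with AR(1) errors and the truncation lag L = 5 is an arbitrary choice.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
T = 250
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.normal()     # autocorrelated errors
y = 1.0 + 0.8 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(res.bse)                               # Newey-West standard errors
```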

39 / 68
Dynamic Models

Suppose the current value of yt depends on the current and previous values
of xt. This is known as a distributed lag model (first equation). If yt
also depends on previous values of yt, it is known as an autoregressive
distributed lag (ARDL) model (second equation).

yt = α + β0 xt + β1 xt−1 + ut
yt = α + β0 xt + β1 xt−1 + γ1 yt−1 + ut

40 / 68
Dynamic Models

Including lags in a regression can often eliminate autocorrelation. Lags
can capture the dynamic structure of the dependent variable. For example,
a change in the explanatory variable may not affect the dependent variable
immediately, but instead with a lag over several time periods. A general
form of an ARDL(p, q) model with p lags of xt and q lags of yt is as
follows:

yt = α + β0 xt + β1 xt−1 + · · · + βp xt−p + γ1 yt−1 + · · · + γq yt−q + ut

41 / 68
Dynamic Models

How do we interpret the coefficients from a dynamic model? Consider an
ARDL(1,1):

yt = α + β0 xt + β1 xt−1 + γ1 yt−1 + ut

In this model, β0 captures the immediate, or short run, effect of xt on yt.

42 / 68
Dynamic Models

We can also consider a long run equilibrium relationship between x and y.
This is equivalent to asking the effect of a permanent, or long run, change
in x on y. Suppose we set yt = yt−1 = y, xt = xt−1 = x, and ut = E(ut) = 0.

yt = α + β0 xt + β1 xt−1 + γ1 yt−1 + ut
y = α + β0 x + β1 x + γ1 y
(1 − γ1) y = α + (β0 + β1) x
y = α/(1 − γ1) + [(β0 + β1)/(1 − γ1)] x
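For example, with hypothetical estimates β0 = 0.4, β1 = 0.2 and γ1 = 0.5,
a permanent one-unit increase in x changes y by 1.2 in the long run:

```python
# Long run effect in an ARDL(1,1) with hypothetical coefficient estimates.
beta0, beta1, gamma1 = 0.4, 0.2, 0.5
print((beta0 + beta1) / (1 - gamma1))   # 1.2
```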

43 / 68
Lagged Dependent Variables

Adding lagged dependent variables violates the assumption that the
explanatory variables are non-stochastic. The OLS estimator is no longer
unbiased, but it is consistent, meaning that the bias disappears as the
sample size gets large.

If autocorrelation remains in a model with lagged dependent variables, OLS
is no longer consistent. This can occur if not enough lags are included in
the model. It can also occur if relevant variables are omitted from the
model, and these variables are themselves autocorrelated.

44 / 68
Lagged Dependent Variables

yt = α + β0 xt + γ1 yt−1 + ut
ut = ρut−1 + vt
yt−1 = α + β0 xt−1 + γ1 yt−2 + ut−1
yt = α + β0 xt + γ1 yt−1 + ρut−1 + vt

If ut is autocorrelated, and yt depends on yt−1, then because yt−1 is
correlated with ut−1, yt−1 is also correlated with ut.

Then E(X′u) = 0 is not satisfied, and OLS is not consistent.

45 / 68
Assumption 4: The xt are Non-Stochastic
The OLS estimator is consistent and unbiased in the presence of
stochastic regressors, provided the regressors are not correlated with
the error term.

y = Xβ + u
β̂ = (X′X)⁻¹ X′y
β̂ = (X′X)⁻¹ X′(Xβ + u)
β̂ = (X′X)⁻¹ X′X β + (X′X)⁻¹ X′u
β̂ = β + (X′X)⁻¹ X′u
E(β̂) = E(β) + E[(X′X)⁻¹ X′u]
E(β̂) = β + E[(X′X)⁻¹ X′] E(u)
E(β̂) = β

46 / 68
Endogeneity

If one or more of the explanatory variables is contemporaneously correlated
with the error term, OLS is not consistent. This is known as endogeneity.

Suppose xt and ut are positively correlated. When ut is high, yt is also
high. If xt is positively correlated with ut, xt is also high. OLS will
incorrectly attribute the high value of yt to a high value of xt, when in
reality yt is high because ut is high. This will lead to inconsistent and
biased parameter estimates and a fitted line that appears to capture the
data better than it does in reality.

47 / 68
Assumption 5: ut ∼ N(0, σ 2 )

The disturbances are assumed to be normally distributed. This is required
to conduct single or joint hypothesis tests about the parameters. We can
test the normality of the residuals using the Bera-Jarque test.

A normally distributed random variable is characterized by its first two
moments, the mean and the variance. The third and fourth standardized
moments are skewness and kurtosis, respectively. A normally distributed
random variable has a skewness of zero and a kurtosis of three, or an
excess kurtosis of zero.

48 / 68
Assumption 5: ut ∼ N(0, σ 2 )
Let u denote the residuals and σ² denote their variance. Let b1 and b2
denote skewness and kurtosis, respectively. T is the sample size. The
Bera-Jarque test statistic W is given by:

W = T [ b1²/6 + (b2 − 3)²/24 ]
b1 = E(u³)/σ³
b2 = E(u⁴)/σ⁴

We can compute the test statistic using the OLS residuals û. Under the
null hypothesis of normality, the statistic W ∼ χ²(2). If the test
statistic exceeds the critical value, we reject the null hypothesis of
normally distributed errors.
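A minimal sketch of the computation (the fat-tailed sample stands in for a
residual series; scipy.stats.jarque_bera offers a packaged version):

```python
# Bera-Jarque statistic from a residual series, compared with the 5%
# chi-squared(2) critical value. The fat-tailed sample is simulated.
import numpy as np
from scipy import stats

def bera_jarque(uhat):
    T = len(uhat)
    u = uhat - uhat.mean()
    sigma2 = np.mean(u ** 2)
    b1 = np.mean(u ** 3) / sigma2 ** 1.5     # skewness
    b2 = np.mean(u ** 4) / sigma2 ** 2       # kurtosis
    return T * (b1 ** 2 / 6 + (b2 - 3) ** 2 / 24)

rng = np.random.default_rng(5)
W = bera_jarque(rng.standard_t(df=3, size=500))   # fat-tailed "residuals"
print(W, stats.chi2.ppf(0.95, 2))                 # W far exceeds 5.99
```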
49 / 68
Dealing with Non-Normality and Outliers

For a sufficiently large sample, non-normality is inconsequential because
of the central limit theorem. Sometimes, a log transform of the data can
help to make the distribution of the residuals closer to a normal.

Occasionally, one or two extreme observations can cause a rejection of the
normality assumption. These are known as outliers, e.g. stock market
crashes or financial crises. Outliers can have a serious effect on OLS
coefficient estimates, particularly in small samples. We can remove
outliers or include dummy variables for those observations, if doing so is
theoretically justified.

50 / 68
Outliers

51 / 68
Multicollinearity

If there is no correlation between explanatory variables, they are called
orthogonal to one another. In this case, adding or removing a variable
from the regression equation has no effect on the coefficient estimates of
the other explanatory variables.

In practice, correlation between explanatory variables is nonzero. When
explanatory variables are highly correlated with each other, this is known
as multicollinearity.

52 / 68
Multicollinearity

Perfect multicollinearity occurs when there is an exact linear relationship
between two or more explanatory variables. In this case, the regression
cannot be estimated until we remove the perfectly collinear variables.

Near multicollinearity occurs when two or more variables are highly, but
not perfectly, correlated with each other. A high correlation means that
the correlation coefficient is close to one or close to minus one.

53 / 68
Multicollinearity

The variance inflation factor (VIF ) estimates the extent to which the
variance of a parameter estimate increases because the explanatory
variables are correlated. Suppose Ri2 is the R 2 from a regression of
explanatory variable i on a constant plus all the other explanatory
variables in the regression. The VIFi is given by:
VIFi = 1/(1 − Ri²)
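A sketch of the computation via the auxiliary regression (X is assumed to
hold the explanatory variables without the constant; statsmodels provides
variance_inflation_factor for the same idea):

```python
# VIF for regressor i: regress x_i on a constant plus the other explanatory
# variables, take R_i^2, and return 1 / (1 - R_i^2). X excludes the constant.
import numpy as np

def vif(X, i):
    T = X.shape[0]
    others = np.column_stack([np.ones(T), np.delete(X, i, axis=1)])
    coef = np.linalg.lstsq(others, X[:, i], rcond=None)[0]
    resid = X[:, i] - others @ coef
    r2 = 1 - resid @ resid / np.sum((X[:, i] - X[:, i].mean()) ** 2)
    return 1 / (1 - r2)
```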

54 / 68
Multicollinearity

In the presence of multicollinearity, R² will be high but individual
coefficients will have high standard errors and may be statistically
insignificant.

Intuitively, a regression coefficient is the impact of an explanatory
variable on the dependent variable, holding the other explanatory variables
constant. If two explanatory variables are highly correlated with each
other, it is difficult to precisely estimate the impact of one while
holding the other constant.

55 / 68
Multicollinearity

OLS estimates are still BLUE in the presence of multicollinearity, so one
option is to ignore it. Other methods of dealing with multicollinearity
include principal component analysis and ridge regression. We can also
drop one of the collinear variables if doing so can be theoretically
justified.

56 / 68
Adopting the Wrong Functional Form

If the relationship between xt and yt is not linear, one possibility is to
use a non-linear model. These often require complex estimation techniques.
An alternative is to write the model so that it is linear in the
parameters, and then estimate the model by OLS. Consider a quadratic
regression:

yt = β1 + β2 xt + β3 xt² + ut
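The model is non-linear in xt but linear in the parameters, so OLS applies
once xt² is added as an extra column; a sketch on simulated data:

```python
# Quadratic regression estimated by OLS: add x_t^2 as an extra column.
# The data generating process below is purely illustrative.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x - 0.8 * x ** 2 + rng.normal(size=200)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # constant, x, x^2
print(np.linalg.lstsq(X, y, rcond=None)[0])         # estimates of beta1..beta3
```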

57 / 68
Quadratic Regression: yt = β1 + β2 xt + β3 xt² + ut

58 / 68
Logarithmic Transformation

Another approach is to transform the data into logarithms. Consider the
exponential growth model and its log transformation:

yt = β1 xt^β2 ut
ln yt = ln β1 + β2 ln xt + ln ut

59 / 68
Logarithmic Transformation
Transforming the variables into logarithms changes the interpreta-
tion of the coefficients. Each of the four possibilities has the follow-
ing interpretation:

yt = β1 + β2 xt + ut
ln yt = β1 + β2 xt + ut
yt = β1 + β2 ln xt + ut
ln yt = β1 + β2 ln xt + ut

1. A 1 unit increase in x causes a β2 unit increase in y.
2. A 1 unit increase in x causes a β2 × 100% increase in y.
3. A 1% increase in x causes a β2 /100 unit increase in y.
4. A 1% increase in x causes a β2 % increase in y.

60 / 68
Omission of an Important Variable

Suppose the true data generating process is given by the first equa-
tion, but we estimate the second equation, so that x3t is omitted.

yt = β1 + β2 x2t + β3 x3t + ut
yt = β1 + β2 x2t + ut

The estimated coefficient on x2t is biased and inconsistent, unless x2t and
x3t are uncorrelated.
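An illustrative simulation of the bias (the data generating process and the
correlation structure between the regressors are hypothetical):

```python
# Omitted variable bias: x3 is left out, and because x2 and x3 are correlated
# the estimated coefficient on x2 absorbs part of x3's effect (true value 0.5).
import numpy as np

rng = np.random.default_rng(7)
T = 5000
x3 = rng.normal(size=T)
x2 = 0.7 * x3 + rng.normal(size=T)            # x2 correlated with x3
y = 1.0 + 0.5 * x2 + 1.0 * x3 + rng.normal(size=T)

X_short = np.column_stack([np.ones(T), x2])   # x3 omitted
print(np.linalg.lstsq(X_short, y, rcond=None)[0][1])   # well above 0.5
```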

61 / 68
Inclusion of an Irrelevant Variable

Suppose the true data generating process is given by the first equa-
tion, but we estimate the second equation, so that x3t is included
but irrelevant.

yt = β1 + β2 x2t + ut
yt = β1 + β2 x2t + β3 x3t + ut

If x3t is irrelevant, the estimated coefficient on x3t will be close to
zero. The estimated coefficient on x2t is unbiased and consistent, but it
has a higher standard error. Thus, it is less efficient. The loss of
efficiency depends positively on the absolute value of the correlation
between x2t and x3t.

62 / 68
Parameter Stability

Suppose we estimate the following regression:

yt = β1 + β2 xt + ut

The regression implicitly assumes that the coefficients β1 and β2 are
constant over the sample period. We can test this assumption using a
parameter stability test called the Chow test.

63 / 68
Chow Test

Suppose we split the sample into two sub-periods. We first estimate the
regression over the whole sample, and then separately for each subsample.
Our objective is to test whether the coefficients are stable across the
subsamples, using an F-test.

The restricted regression is the one over the whole sample, while the
unrestricted regressions are the subsample regressions. Intuitively,
the restricted regression imposes the restriction that the coefficients
are constant over the full sample, while the two unrestricted regres-
sions allow the coefficients to vary between the subsamples.

64 / 68
Chow Test

Let RSS denote the residual sum of squares for the whole sample
regression, and RSS1 and RSS2 the residual sum of squares of each
subsample, respectively. Let k denote the number of regressors in
each regression (including the constant), and T denote the sample
size of the whole sample regression. Under the null hypothesis of
parameter stability, the test statistic has an F (k, T − 2k) distribu-
tion.

test statistic = [RSS − (RSS1 + RSS2)] / (RSS1 + RSS2) × (T − 2k)/k ∼ F(k, T − 2k)

If the test statistic is greater than the critical value, we reject the
null hypothesis.

65 / 68
Chow Test
Suppose we want to test for parameter stability with a Chow test.
The regression is: yt = β1 + β2 x2t + ut . The RSS for the full sample
regression is 120, and the RSS for the two subsamples are 45 and
55, respectively. T = 120. Perform the test at the 5% level of
significance.

test statistic = [RSS − (RSS1 + RSS2)] / (RSS1 + RSS2) × (T − 2k)/k
test statistic = [120 − (45 + 55)] / (45 + 55) × [120 − 2(2)]/2
test statistic = 11.6
F0.05(2, 116) = 3.07

Since the test statistic is greater than the critical value, we reject
the null hypothesis.
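The same arithmetic in a few lines (SciPy supplies the F critical value):

```python
# Chow test arithmetic for the numbers above, with the F critical value
# taken from scipy.
from scipy import stats

RSS, RSS1, RSS2, T, k = 120, 45, 55, 120, 2
stat = (RSS - (RSS1 + RSS2)) / (RSS1 + RSS2) * (T - 2 * k) / k
print(stat, stats.f.ppf(0.95, k, T - 2 * k))   # 11.6 vs about 3.07
```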

66 / 68
Measurement Error in Explanatory Variables
Suppose we don’t observe the true value of xt , and instead it is
measured with error. Let x̃t denote the observed noisy value of xt
and vt denote an iid measurement error.

yt = β1 + β2 xt + ut
x̃t = xt + vt
vt ∼ N(0, σv2 )
yt = β1 + β2 (x̃t − vt ) + ut
yt = β1 + β2 x̃t + (ut − β2 vt )

This leads to a correlation between the regressor and the composite


error term. The OLS estimate is biased and inconsistent.

67 / 68
Measurement Error in Explanatory Variables

Measurement error biases the coefficient β2 towards zero. The bias worsens
as σv² increases.

β̂2 = Cov(yt, x̃t) / Var(x̃t)
β̂2 = Cov(yt, xt + vt) / Var(xt + vt)
β̂2 = Cov(yt, xt) / (Var(xt) + σv²)
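An illustrative simulation of the attenuation (the true slope is 2; the
measurement-error standard deviations are arbitrary choices):

```python
# Attenuation bias: as the measurement-error variance grows, the OLS slope
# estimate on the noisy regressor shrinks towards zero (true slope is 2).
import numpy as np

rng = np.random.default_rng(8)
T = 20000
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(size=T)

for sigma_v in (0.0, 0.5, 1.0, 2.0):
    x_tilde = x + sigma_v * rng.normal(size=T)      # noisy measurement of x
    X = np.column_stack([np.ones(T), x_tilde])
    slope = np.linalg.lstsq(X, y, rcond=None)[0][1]
    print(f"sigma_v = {sigma_v}: slope estimate = {slope:.2f}")
```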

68 / 68
