Reading 11: Correlation and Simple Regression: Calculate and Interpret The Following
• Calculator notes: Sx is sample standard deviation for X var inputs, r is correlation coefficient
• Spurious Correlation: variables that are highly correlated (or appear to exhibit a causal linear relationship) purely by chance; the relationship lacks any economic (or other rational) explanation
• Non-linear relationships: correlation doesn’t capture these, and it does so increasingly poorly the further the relationship is from linear
• If two populations are normally distributed (cfavow), can use a 2-tailed t-test (t-stat follows, n − 2 df, r is the sample correlation):
    t = r·√(n − 2) / √(1 − r²)
• For an x% significance level (e.g. 0.05), find the matching critical value in the t-table for that significance level and n − 2 df
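A minimal Python sketch of this significance test, using hypothetical paired data (numpy and scipy assumed available):

```python
import numpy as np
from scipy import stats

# hypothetical paired sample
x = np.array([1.2, 2.3, 3.1, 4.8, 5.0, 6.7, 7.1, 8.4])
y = np.array([2.0, 2.9, 3.5, 5.1, 4.8, 7.0, 6.9, 8.8])

n = len(x)
r = np.corrcoef(x, y)[0, 1]                        # sample correlation coefficient

t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)    # t-stat with n - 2 df
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)       # two-tailed, 5% significance level

print(t_stat, t_crit, abs(t_stat) > t_crit)        # True => reject H0: correlation = 0
```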
• Independent variable: used to explain variation in dependent variable (AKA explanatory, exogenous, pre-
dicting)
• KEY ASSUMPTIONS:
Linear relationship exists between response and explanatory variable
Explanatory variable uncorrelated with residuals
Residuals normally distributed with zero mean E[εi ] = 0 and constant variance E[ε2i ] = σε2
Residuals are independent ⇒ E[εi εj ] = 0 for j ≠ i
Yi = b0 + b1 Xi + εi
• Note: fitted or ”estimated” model (least squares or OLS estimates), and predicted values, denoted with hats.
These minimize the SSE
Ŷi = b̂0 + b̂1 Xi
• Slope coefficient b̂1 : change in Y for a 1-unit change in X
    b̂1 = Covxy / σx²
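A quick numpy sketch (hypothetical data) showing the slope and intercept estimates computed directly from this formula:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical explanatory data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical response data

b1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # Cov(x, y) / Var(x)
b0_hat = y.mean() - b1_hat * x.mean()                     # line passes through (x-bar, y-bar)
print(b0_hat, b1_hat)
```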
• According to CFA, you must use hypothesis tests to determine statistical significance of slope coefficient (i.e.
importance of explanatory variable in predicting the response)
(cfavow) this is not appropriate for fits with multiple explanatory variates.
• Confidence interval for slope estimates:
b̂1 ± (tc Sb̂1 ) where tc is the t-distribution critical value for the x% significance level (2-tailed, so x/2 in each tail) with n − 2 df
Prediction
• Predicted value, based on forecast of explanatory variable Xp
Ŷ = b̂0 + b̂1 Xp
Standard error of forecast Sf : where SEE is the standard error of the residuals, X = Xp , and Sx² is the variance of the explanatory variable
    Sf = SEE·√(1 + 1/n + (X − X̄)² / ((n − 1)Sx²))
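A sketch of the forecast standard error and the resulting prediction interval on hypothetical data (SEE computed with n − 2 df since k = 1):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
see = np.sqrt(np.sum(resid**2) / (n - 2))          # standard error of the residuals

x_p = 6.0                                          # forecast of the explanatory variable
s_f = see * np.sqrt(1 + 1/n + (x_p - x.mean())**2 / ((n - 1) * np.var(x, ddof=1)))

y_p = b0 + b1 * x_p                                # point forecast
t_c = stats.t.ppf(0.975, df=n - 2)                 # 95% two-tailed critical value
print(y_p - t_c * s_f, y_p + t_c * s_f)            # 95% prediction interval
```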
Analysis of Variance (ANOVA)
• NOTE: As before, bar indicates a mean, hat indicates an estimated/predicted value
• Regression sum of squares (RSS): measures variation in response explained by explanatory variable
RSS = Σ_{i=1}^{n} (Ŷi − Ȳ)²
• Sum of squared errors (SSE): measures unexplained variation in response (AKA sum of squared residuals)
SSE = Σ_{i=1}^{n} (Yi − Ŷi )²
ANOVA Table
Source of Variation    | Degrees of Freedom (df)           | Sum of Squares | Mean Sum of Squares
Regression (explained) | k (# of slope parameters)         | RSS            | MSR = RSS / k
Error (unexplained)    | n − k − 1 (n = # of observations) | SSE            | MSE = SSE / (n − k − 1)
Total                  | n − 1                             | SST            |
F-Statistic
• F-test: assesses how well a set of explanatory variables explains the variation in the response
• Note for simple regression models (only one explanatory variable), F = (tb̂1 )²
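A sketch of the ANOVA decomposition and F-stat for a simple regression on hypothetical data; F here should match the square of the slope t-stat:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
y = np.array([1.8, 4.1, 5.9, 8.2, 9.7, 12.3])
n, k = len(y), 1                                # one slope parameter

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y_hat - y.mean())**2)    # explained variation
sse = np.sum((y - y_hat)**2)           # unexplained variation
sst = np.sum((y - y.mean())**2)        # total variation = RSS + SSE

msr, mse = rss / k, sse / (n - k - 1)
f_stat = msr / mse
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)   # one-tailed, 5% level
print(f_stat, f_crit, f_stat > f_crit)
```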
Limitations of Regression Analysis
• Parameter Instability: linear relationships can change over time, and model estimated from one period may
not be relevant to another
• Regression model assumptions may not hold, potentially invalidating the results. Data may exhibit:
Heteroskedasticity: non-constant variance of error terms
Autocorrelation: error terms not independent
• Other market participants may already be acting on the model results, even if they are valid
Reading 12: Multiple Regression
Multiple Linear Regression Basics
• Multiple regression model with k explanatory variables (extends the ”simple” one-variable model), i = 1, ..., n
Yi = b0 + b1 X1i + ... + bk Xki + εi where εi are i.i.d. N (0, σ)
• Note: fitted or ”estimated” model (least squares or OLS estimates), and predicted values, denoted with hats.
Ŷi = b̂0 + b̂1 X1i + ... + b̂k Xki
These minimize the SSE: Σ_{i=1}^{n} ε̂i²
tc is the t-distribution critical value for the x% significance level (2-tailed, x/2 in each tail), with n − k − 1 df (as before)
• Know how to compute predicted values of response (easy)
Analysis of Variance (ANOVA)
• NOTE: As before, bar indicates a mean, hat indicates an estimated/predicted value
• Regression sum of squares (RSS): measures variation in response explained by explanatory variable
RSS = Σ_{i=1}^{n} (Ŷi − Ȳ)²
• Sum of squared errors (SSE): measures unexplained variation in response (AKA sum of squared residuals)
SSE = Σ_{i=1}^{n} (Yi − Ŷi )²
• Computing SEE
SEE = √MSE = √( SSE / (n − k − 1) )
F-Statistic
• F-test: assesses how well a set of explanatory variables explains the variation in the response
R2 vs Adjusted R2
• Coefficient of Determination, R-squared: proportion of variation in response collectively explained by all
explanatory variables.
Computation:
R² = 1 − SSE/SST = (SST − SSE)/SST = RSS/SST
• Issue with R2 , Overestimating the Regression: R2 not always reliable, since it typically increases even if
contributions of new variables not statistically significant
Can end up with a massive model whose high R² reflects the sheer number of explanatory variables, rather than how well they explain the response variable
• Solution: use adjusted R2 , as it rewards parsimony and the use of only statistically significant variables
(computation follows)
Ra² = 1 − (1 − R²)·(n − 1)/(n − k − 1)
Ra² may increase or decrease (depending on high or low statistical significance of the added variable) as R² increases
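A sketch comparing R² and adjusted R² on hypothetical data (statsmodels used only for the OLS fit; the adjusted R² line is the formula above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)                   # hypothetical data: y depends mostly on x1
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 1.0 + 2.0 * x1 + 0.1 * x2 + rng.normal(size=50)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

n, k = len(y), 2
r2 = fit.rsquared
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # should match fit.rsquared_adj
print(r2, r2_adj, fit.rsquared_adj)
```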
ANOVA Table
Source of Variation    | Degrees of Freedom (df)           | Sum of Squares | Mean Sum of Squares
Regression (explained) | k (# of slope parameters)         | RSS            | MSR = RSS / k
Error (unexplained)    | n − k − 1 (n = # of observations) | SSE            | MSE = SSE / (n − k − 1)
Total                  | n − 1                             | SST            |
• Can only use n − 1 indicator variables to distinguish n categories, else linear independence assumption of
regression is violated
• Hypothesis tests for indicators: determine if indicator variable category significantly different from reference
category
still uses the same setup: H0 : bj = 0 vs. Ha : bj ≠ 0, j ≥ 1
Limitations of Regression Analysis and How to Detect and Correct for Them
Heteroskedasticity
• Heteroskedasticity: non-constant variance of error terms
Unconditional: when heteroskedasticity doesn’t systematically change with the value of the explanatory
variables.
(cfavow) Violates constant variance assumption, but usually causes no major problems with regression
Conditional: when heteroskedasticity is related to the level of the explanatory variables (does create
problems)
• Detecting Heteroskedasticity
Plot residuals against explanatory variables, see if patterns emerge
Breusch-Pagan (BP) Chi-square test: uses regression of squared residuals on explanatory variables (test
statistic computation follows)
Idea is that conditional heteroskedasticity is present when the explanatory variables explain a large
proportion of the variation in the squared residuals (i.e. the R2 for this second regression is high)
BP = n·R²resid
where n is # of observations and R²resid is the R² from the regression of squared residuals on the explanatory variables
One-tailed test, reject the null if BP-stat greater than Chi-squared distn value with k df (k is # of
explanatory variables) at relevant significance level
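A sketch of the BP test on hypothetical data where the error variance grows with x (statsmodels also ships het_breuschpagan, but the second regression is shown explicitly here):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=100)                # hypothetical explanatory variable
y = 2 + 3 * x + rng.normal(scale=x)             # error std dev grows with x (conditional het.)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

aux = sm.OLS(resid**2, X).fit()                 # regress squared residuals on explanatory vars
bp = len(y) * aux.rsquared                      # BP = n * R^2 of this second regression

k = 1                                           # number of explanatory variables
chi2_crit = stats.chi2.ppf(0.95, df=k)          # one-tailed, 5% level
print(bp, chi2_crit, bp > chi2_crit)            # True => conditional heteroskedasticity
```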
• Correcting heteroskedasticity
Use Robust standard errors when computing t-stats if heteroskedasticity is present in data (AKA White-
corrected or Heteroskedasticity-consistent standard errors)
Autocorrelation
• Autocorrelation (AKA Serial Correlation): residual terms are correlated (not independent)
positive (negative) autocorrelation: when positive residual in one period increases probability of observing
positive (negative) residual in next period
• Detecting Autocorrelation
Plot residuals over time: clustering and cycling can indicate positive autocorrelation, while jaggedness
over the x-axis can indicate negative autocorrelation
Durbin-Watson (DW) statistic: Test null hypothesis H0 : no serial correlation of residuals (computation
follows)
DW = Σ_{t=2}^{T} (ε̂t − ε̂t−1 )² / Σ_{t=1}^{T} ε̂t²   where ε̂t is the residual for period t
DW ≈ 2(1 − r) where r is the correlation between residuals of one period and the previous period
DW-stat upper (du ) and lower (dl ) critical value from table: based on sample size, # df, significance
level
Decision Rules:
If DW < dl , reject null, residuals exhibit positive autocorrelation
If dl < DW < du , test is inconclusive
If du < DW < 4 − du , do not reject null, no autocorrelation
If 4 − du < DW < 4 − dl , test is inconclusive
If 4 − dl < DW < 4, reject null, residuals exhibit negative autocorrelation
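A sketch computing the DW statistic on a hypothetical positively autocorrelated residual series, alongside the 2(1 − r) approximation (the dl /du critical values still come from a table):

```python
import numpy as np

rng = np.random.default_rng(2)
e = np.zeros(100)
for t in range(1, 100):                     # hypothetical AR(1)-style residuals
    e[t] = 0.6 * e[t - 1] + rng.normal()

dw = np.sum(np.diff(e)**2) / np.sum(e**2)   # sum_{t=2..T}(e_t - e_{t-1})^2 / sum_t e_t^2
r = np.corrcoef(e[1:], e[:-1])[0, 1]
print(dw, 2 * (1 - r))                      # DW well below 2 => positive autocorrelation
```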
• Correcting Autocorrelation
Adjust coefficient standard errors: using the Hansen method. Note that this corrects for autocorrelation
and conditional heteroskedasticity. Employ these adjusted standard errors (AKA serial correlation consistent
or Hansen-White standard errors) in hypothesis tests, if autocorrelation is a problem (even in conjunction
with conditional heteroskedasticity).
Improve model specification: explicitly incorporate the time-series nature of the data, e.g. via a seasonal term.
Multicollinearity
• Multicollinearity: when multiple explanatory variables or linear combos thereof are highly correlated.
• Detecting Multicollinearity
Indicative Situation: t-test results suggest individual explanatory variates are not statistically significant, but the F-test suggests they are significant collectively and R² is high.
Rule of thumb: sample correlation above 0.7 for any two explanatory variables signals multicollinearity
is a potential problem
• Correcting Multicollinearity
omit one or more of the correlated independent variables (can be identified using stepwise regression, or
other methods)
Distortion Effects: model misspecifications lead to biased or inconsistent regression coefficients, resulting in unreliable hypothesis test results and inaccurate predictions
• Probit (based on normal distn) and Logit (based on logistic distn) models: application includes estimates of
probability of an event occurring, coefficients estimated via maximum likelihood
• Discriminant models: use a linear function to generate a score or ranking for an observation (e.g. use financial statement ratios as explanatory variables to categorically rate a company as bankrupt or not).
Reading 13: Time Series Analysis
Time Series Basics and Basic Trend Models
• Time Series: set of observations for variable over successive time periods
• Linear Trend: form follows (b1 is slope coeff and t is time period, other notation as before)
yt = b0 + b1 (t) + εt
• Linear (log-linear) trend model works best when variable increases by constant amount (rate)
• Limitations: trend models work poorly when residuals exhibit autocorrelation (detect with DW stat, as
before)
• AR(p) Model
xt = b0 + b1 xt−1 + ... + bp xt−p + εt
One-period-ahead forecast (AR(1) case): x̂t+1 = b̂0 + b̂1 xt
(cfavow) use ”chain rule of forecasting” to obtain two period ahead forecast from the one period ahead
forecast
x̂t+2 = b̂0 + b̂1 x̂t+1 = b̂0 + b̂1 (b̂0 + b̂1 xt )
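A tiny sketch of the chain rule of forecasting with hypothetical AR(1) estimates:

```python
# hypothetical AR(1) estimates and latest observation
b0_hat, b1_hat, x_t = 0.5, 0.8, 10.0

x_t1 = b0_hat + b1_hat * x_t     # one-period-ahead forecast: 8.5
x_t2 = b0_hat + b1_hat * x_t1    # two-period-ahead: feed the forecast back in -> 7.3
print(x_t1, x_t2)
```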
Assessing Fit of AR Models (Autocorrelation test)
• Properly specified AR model should not have significant autocorrelation in its residuals (AR model not useful
if residuals have significant autocorrelation for particular series).
• Testing AR model specification (steps):
1. Estimate AR model using OLS (linear regression), starting with AR(1) model (1st order).
2. Compute autocorrelation function (i.e. of all lags) of the residuals
3. Test whether autocorrelations statistically significant (t-test procedure follows):
Compute the t-stat for autocorrelation of lag k (shown below). Critical value has T − 2 df (two-tailed),
where T is # of observations (NOTE: in cfavow, you CANNOT assume additional prior data available. i.e.
if using AR(1) model with n data points, you only have n − 1 observations).
t = ρt,t−k / (1/√T)   where ρt,t−k is the autocorrelation at lag k, and the denominator 1/√T is its standard error
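A sketch of the residual autocorrelation t-test on hypothetical residuals (a properly specified model should show no significant lags):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
resid = rng.normal(size=60)            # hypothetical residuals from a fitted AR model
T = len(resid)
t_crit = stats.t.ppf(0.975, df=T - 2)  # two-tailed, 5% level

for k in range(1, 5):
    rho_k = np.corrcoef(resid[k:], resid[:-k])[0, 1]   # autocorrelation at lag k
    t_stat = rho_k / (1 / np.sqrt(T))                  # standard error is 1/sqrt(T)
    print(k, round(t_stat, 2), abs(t_stat) > t_crit)
```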
• Parameter (regression coeff) Instability: when estimated parameters change among different periods due to
varying financial and economic conditions
Models estimated on shorter sample periods usually more stable than those estimated with longer ones
(increased chance underlying process has changed)
Trade-off between statistical reliability of longer sample periods and parameter stability of shorter periods
Random Walks vs. Covariance Stationary Processes
• Random walk (simple): defined by these properties
xt = b0 + b1 xt−1 + εt where b0 = 0, b1 = 1 (a random walk with drift has b0 ≠ 0, b1 = 1)
• Note that random walks of either type are NOT Covariance Stationary. Having b1 = 1 implies an infinite mean-reverting level: x̂ = b0 /(1 − b1 ) = ∞
AKA Series has Unit Root (b1 = 1)
Use modified t-test to see whether (b1 − 1) is statistically different from zero. If no, b1 = 1 and series
has unit root.
• First Differencing: Transform time series to changes in value of response variable rather than value itself,
(cfavow) makes series covariance stationary (typically, not always true in reality). Procedure follows for
AR(1) Model:
If original time series x has unit root, then xt − xt−1 = εt . Define yt as changes
yt = xt − xt−1 = εt
yt = b0 + b1 yt−1 + εt where b0 = b1 = 0
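A sketch of first differencing a hypothetical random walk; the differenced series is just the shocks and is covariance stationary:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.cumsum(rng.normal(size=200))   # hypothetical random walk: x_t = x_{t-1} + e_t

y = np.diff(x)                        # first differences: y_t = x_t - x_{t-1}
print(np.var(x), np.var(y))           # levels wander; differences have stable variance
```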
Detecting and Correcting for Seasonality
• Seasonality: time series pattern that repeats from year to year.
• If present, then model is misspecified and unreliable for forecasting, unless the AR model explicitly includes this effect (extra term)
• Detecting Seasonality: use autocorrelation t-test (presented earlier) to see if statistically significant at later
lags
• Correcting Seasonality: include response variable(s) of lags with statistically significant autocorrelation as
explanatory variables in the model
• ARCH(1) Model (estimated by ”regressing” the squared residuals from the fitted time-series model, ε̂t², on their 1st lag)
ε̂t² = a0 + a1 ε̂t−1² + µt where µt is the ARCH residual/error term
• The time series is ARCH(1) if the parameter a1 is statistically significant (different from zero)
• Generalized least squares must be used to develop predictive model for time series model with ARCH residuals
(otherwise, standard errors of model parameters incorrect)
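A sketch of the ARCH(1) test on simulated hypothetical residuals: regress squared residuals on their first lag and check whether a1 is significant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
e = np.zeros(200)
for t in range(1, 200):                          # simulate ARCH(1)-style residuals
    sigma2 = 0.2 + 0.7 * e[t - 1]**2             # variance depends on last squared shock
    e[t] = rng.normal(scale=np.sqrt(sigma2))

eps2 = e**2
fit = sm.OLS(eps2[1:], sm.add_constant(eps2[:-1])).fit()   # eps_t^2 on eps_{t-1}^2

a1, p_val = fit.params[1], fit.pvalues[1]
print(a1, p_val, p_val < 0.05)                   # significant a1 => ARCH(1) present
```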
• When regressing one time series on another, assess both series for unit roots using separate Dickey-Fuller (DF) tests, with possible results listed below:
If both series are covariance stationary, the above regression is valid.
If only one of the response series and explanatory series is covariance stationary (the other has a unit root), the regression is not valid.
If neither series is covariance stationary, the validity of the regression depends on whether the two time
series are Cointegrated.
• Cointegration: when two time series follow same trend and relationship is not expected to change (cfavow:
economically linked to same macro variables).
if the two time series are cointegrated, the residuals of the above regression are covariance stationary and the t-tests are reliable (not true if they are not cointegrated)
Testing for Cointegration: Run the following regression, test residuals for unit root by applying DF test,
compare with critical t-values computed with Engle Granger method (called DF-EG test). Don’t need to
know computation for exam.
yt = b0 + b1 xt + εt
If the DF-EG test rejects the null of a unit root, the residuals/error terms generated by the two time series are covariance stationary and the two series are cointegrated (thus we can model their relationship with regression).
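A sketch of an Engle-Granger style cointegration check using statsmodels' coint (which runs the DF test on the residuals of one series regressed on the other), on two hypothetical series sharing a common stochastic trend:

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(6)
common = np.cumsum(rng.normal(size=300))                    # shared unit-root trend
x = common + rng.normal(scale=0.5, size=300)                # hypothetical series 1
y = 2.0 + 1.5 * common + rng.normal(scale=0.5, size=300)    # hypothetical series 2

t_stat, p_value, crit_values = coint(y, x)                  # Engle-Granger two-step test
print(t_stat, p_value, p_value < 0.05)                      # small p-value => cointegrated
```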
2. For analysis of individual variable, plot values over time and look for non-stationarity indicators (i.e. non-
constant mean or variance, seasonality, or structural change)
4. Run trend analysis, compute residuals, use DW test to test for autocorrelation
Must use another model if autocorrelation detected (otherwise, model can be used)
5. If data has autocorrelation, examine the data for stationarity. If not stationary, transform the data for use in an AR model as follows:
If data has linear trend, 1st difference the data
If data has exponential trend, 1st difference the natural log of the data
If data has structural shift, run two separate models
6. After differencing or separation, if series is stationary, run AR(1) model and test for autocorrelation and
seasonality
Add lagged variables into model (for seasonality, if present) until any significant autocorrelation has been
removed/modeled (model can be used when no autocorrelation present)
7. Test for ARCH, by regressing squared residuals on lagged squared residuals and test if coefficients statistically
significant
If parameters significantly different from zero, correct using generalized least squares (otherwise model
can be used)
8. If you have multiple statistically reliable models, compare their out-of-sample RMSE to determine which is
better at forecasting