Reading 11: Correlation and Simple Regression: Calculate and Interpret The Following
• Calculator notes: Sx is sample standard deviation for X var inputs, r is correlation coefficient
• Spurious Correlation: variables that are highly correlated (or appear to exhibit a causal linear relationship) purely by chance; the relationship lacks any economic (or other rational) explanation
• Non-linear relationships: correlation doesn’t capture these, and it does so increasingly poorly the further the relationship is from linear
• If two populations are normally distributed (cfavow), can use a 2-tailed t-test (t-stat follows, n − 2 df, r is the sample correlation):
    t = r·√(n − 2) / √(1 − r²)
• For an x% significance level (e.g. 0.05), find the matching critical value in the t-table for that significance level and n − 2 df
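A minimal Python sketch of this significance test, using hypothetical paired data (numpy and scipy assumed available):

```python
import numpy as np
from scipy import stats

# hypothetical paired sample
x = np.array([1.2, 2.3, 3.1, 4.8, 5.0, 6.7, 7.1, 8.4])
y = np.array([2.0, 2.9, 3.5, 5.1, 4.8, 7.0, 6.9, 8.8])

n = len(x)
r = np.corrcoef(x, y)[0, 1]                        # sample correlation coefficient

t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)    # t-stat with n - 2 df
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)       # two-tailed, 5% significance level

print(t_stat, t_crit, abs(t_stat) > t_crit)        # True => reject H0: correlation = 0
```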
• Independent variable: used to explain variation in dependent variable (AKA explanatory, exogenous, pre-
dicting)
• KEY ASSUMPTIONS:
Linear relationship exists between response and explanatory variable
Explanatory variable uncorrelated with residuals
Residuals normally distributed with zero mean E[εi ] = 0 and constant variance E[ε2i ] = σε2
Residuals are independent ⇒ E[εi εj ] = 0 for j ≠ i
Yi = b0 + b1 Xi + εi
• Note: fitted or ”estimated” model (least squares or OLS estimates), and predicted values, denoted with hats.
These minimize the SSE
Ŷi = b̂0 + b̂1 Xi
• Slope coefficient b̂1 : change in Y for a 1-unit change in X
    b̂1 = Covxy / σx²
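A quick numpy sketch (hypothetical data) showing the slope and intercept estimates computed directly from this formula:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical explanatory data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical response data

b1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # Cov(x, y) / Var(x)
b0_hat = y.mean() - b1_hat * x.mean()                     # line passes through (x-bar, y-bar)
print(b0_hat, b1_hat)
```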
• According to CFA, you must use hypothesis tests to determine statistical significance of slope coefficient (i.e.
importance of explanatory variable in predicting the response)
(cfavow) this is not appropriate for fits with multiple explanatory variates.
• Confidence interval for slope estimates:
b̂1 ± (tc Sb̂1 ) where tc is the t-distribution critical value for the x% significance level (2-tailed, so x/2 in each tail) with n − 2 df
Prediction
• Predicted value, based on forecast of explanatory variable Xp
Ŷ = b̂0 + b̂1 Xp
Standard error of forecast Sf : where SEE is the standard error of the residuals, X = Xp , and Sx² is the variance of the explanatory variable
    Sf = SEE·√(1 + 1/n + (X − X̄)² / ((n − 1)Sx²))
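A sketch of the forecast standard error and the resulting prediction interval on hypothetical data (SEE computed with n − 2 df since k = 1):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
see = np.sqrt(np.sum(resid**2) / (n - 2))          # standard error of the residuals

x_p = 6.0                                          # forecast of the explanatory variable
s_f = see * np.sqrt(1 + 1/n + (x_p - x.mean())**2 / ((n - 1) * np.var(x, ddof=1)))

y_p = b0 + b1 * x_p                                # point forecast
t_c = stats.t.ppf(0.975, df=n - 2)                 # 95% two-tailed critical value
print(y_p - t_c * s_f, y_p + t_c * s_f)            # 95% prediction interval
```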
Analysis of Variance (ANOVA)
• NOTE: As before, bar indicates a mean, hat indicates an estimated/predicted value
• Regression sum of squares (RSS): measures variation in response explained by explanatory variable
RSS = Σ_{i=1}^{n} (Ŷi − Ȳ)²
• Sum of squared errors (SSE): measures unexplained variation in response (AKA sum of squared residuals)
SSE = Σ_{i=1}^{n} (Yi − Ŷi )²
ANOVA Table
Source of Variation    | Degrees of Freedom (df)           | Sum of Squares | Mean Sum of Squares
Regression (explained) | k (# of slope parameters)         | RSS            | MSR = RSS / k
Error (unexplained)    | n − k − 1 (n = # of observations) | SSE            | MSE = SSE / (n − k − 1)
Total                  | n − 1                             | SST            |
F-Statistic
• F-test: assesses how well a set of explanatory variables explains the variation in the response
• Note for simple regression models (only one explanatory variable), F = (tb̂1 )²
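A sketch of the ANOVA decomposition and F-stat for a simple regression on hypothetical data; F here should match the square of the slope t-stat:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
y = np.array([1.8, 4.1, 5.9, 8.2, 9.7, 12.3])
n, k = len(y), 1                                # one slope parameter

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y_hat - y.mean())**2)    # explained variation
sse = np.sum((y - y_hat)**2)           # unexplained variation
sst = np.sum((y - y.mean())**2)        # total variation = RSS + SSE

msr, mse = rss / k, sse / (n - k - 1)
f_stat = msr / mse
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)   # one-tailed, 5% level
print(f_stat, f_crit, f_stat > f_crit)
```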
Limitations of Regression Analysis
• Parameter Instability: linear relationships can change over time, and model estimated from one period may
not be relevant to another
• Regression model assumptions may not hold, potentially invalidating the results. Data may exhibit:
Heteroskedasticity: non-constant variance of error terms
Autocorrelation: error terms not independent
• Other market participants may already be acting on the model results, even if they are valid
Reading 12: Multiple Regression
Multiple Linear Regression Basics
• Multiple regression model with k explanatory variables (extends the ”simple” one-variable model), i = 1, ..., n
Yi = b0 + b1 X1i + ... + bk Xki + εi where εi are i.i.d. N (0, σ)
• Note: fitted or ”estimated” model (least squares or OLS estimates), and predicted values, denoted with hats.
Ŷi = b̂0 + b̂1 X1i + ... + b̂k Xki
These minimize the SSE: Σ_{i=1}^{n} ε̂i²
tc is the t-distribution critical value for the x% significance level (2-tailed, x/2 in each tail), with n − k − 1 df (as before)
• Know how to compute predicted values of response (easy)
Analysis of Variance (ANOVA)
• NOTE: As before, bar indicates a mean, hat indicates an estimated/predicted value
• Regression sum of squares (RSS): measures variation in response explained by explanatory variable
RSS = Σ_{i=1}^{n} (Ŷi − Ȳ)²
• Sum of squared errors (SSE): measures unexplained variation in response (AKA sum of squared residuals)
SSE = Σ_{i=1}^{n} (Yi − Ŷi )²
• Computing SEE
SEE = √MSE = √( SSE / (n − k − 1) )
F-Statistic
• F-test: assesses how well a set of explanatory variables explains the variation in the response
R2 vs Adjusted R2
• Coefficient of Determination, R-squared: proportion of variation in response collectively explained by all
explanatory variables.
Computation:
R² = 1 − SSE/SST = (SST − SSE)/SST = RSS/SST
• Issue with R2 , Overestimating the Regression: R2 not always reliable, since it typically increases even if
contributions of new variables not statistically significant
Can end up with a massive model whose high R² reflects the sheer number of explanatory variables, rather than how well they explain the response variable
• Solution: use adjusted R2 , as it rewards parsimony and the use of only statistically significant variables
(computation follows)
Ra² = 1 − (1 − R²)·(n − 1)/(n − k − 1)
Ra² may increase or decrease (depending on high or low statistical significance of the added variable) as R² increases
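A sketch comparing R² and adjusted R² on hypothetical data (statsmodels used only for the OLS fit; the adjusted R² line is the formula above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)                   # hypothetical data: y depends mostly on x1
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 1.0 + 2.0 * x1 + 0.1 * x2 + rng.normal(size=50)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

n, k = len(y), 2
r2 = fit.rsquared
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # should match fit.rsquared_adj
print(r2, r2_adj, fit.rsquared_adj)
```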
ANOVA Table
Source of Variation    | Degrees of Freedom (df)           | Sum of Squares | Mean Sum of Squares
Regression (explained) | k (# of slope parameters)         | RSS            | MSR = RSS / k
Error (unexplained)    | n − k − 1 (n = # of observations) | SSE            | MSE = SSE / (n − k − 1)
Total                  | n − 1                             | SST            |
• Can only use n − 1 indicator variables to distinguish n categories, else linear independence assumption of
regression is violated
• Hypothesis tests for indicators: determine if indicator variable category significantly different from reference
category
still uses the same setup: H0 : bj = 0 vs. Ha : bj ≠ 0, j ≥ 1
Limitations of Regression Analysis and How to Detect and Correct for Them
Heteroskedasticity
• Heteroskedasticity: non-constant variance of error terms
Unconditional: when heteroskedasticity doesn’t systematically change with the value of the explanatory
variables.
(cfavow) Violates constant variance assumption, but usually causes no major problems with regression
Conditional: when heteroskedasticity is related to the level of the explanatory variables (does create
problems)
• Detecting Heteroskedasticity
Plot residuals against explanatory variables, see if patterns emerge
Breusch-Pagan (BP) Chi-square test: uses regression of squared residuals on explanatory variables (test
statistic computation follows)
Idea is that conditional heteroskedasticity is present when the explanatory variables explain a large
proportion of the variation in the squared residuals (i.e. the R2 for this second regression is high)
BP = n·R²resid
where n is # of observations and R²resid is the R² from the regression of squared residuals on the explanatory variables
One-tailed test, reject the null if BP-stat greater than Chi-squared distn value with k df (k is # of
explanatory variables) at relevant significance level
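A sketch of the BP test on hypothetical data where the error variance grows with x (statsmodels also ships het_breuschpagan, but the second regression is shown explicitly here):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=100)                # hypothetical explanatory variable
y = 2 + 3 * x + rng.normal(scale=x)             # error std dev grows with x (conditional het.)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

aux = sm.OLS(resid**2, X).fit()                 # regress squared residuals on explanatory vars
bp = len(y) * aux.rsquared                      # BP = n * R^2 of this second regression

k = 1                                           # number of explanatory variables
chi2_crit = stats.chi2.ppf(0.95, df=k)          # one-tailed, 5% level
print(bp, chi2_crit, bp > chi2_crit)            # True => conditional heteroskedasticity
```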
• Correcting heteroskedasticity
Use Robust standard errors when computing t-stats if heteroskedasticity is present in data (AKA White-
corrected or Heteroskedasticity-consistent standard errors)
Autocorrelation
• Autocorrelation (AKA Serial Correlation): residual terms are correlated (not independent)
positive (negative) autocorrelation: when positive residual in one period increases probability of observing
positive (negative) residual in next period
• Detecting Autocorrelation
Plot residuals over time: clustering and cycling can indicate positive autocorrelation, while jaggedness
over the x-axis can indicate negative autocorrelation
Durbin-Watson (DW) statistic: Test null hypothesis H0 : no serial correlation of residuals (computation
follows)
DW = Σ_{t=2}^{T} (ε̂t − ε̂t−1 )² / Σ_{t=1}^{T} ε̂t²   where ε̂t is the residual for period t
DW ≈ 2(1 − r) where r is the correlation between residuals of one period and the previous period
DW-stat upper (du ) and lower (dl ) critical value from table: based on sample size, # df, significance
level
Decision Rules:
If DW < dl , reject null, residuals exhibit positive autocorrelation
If dl < DW < du , test is inconclusive
If du < DW < 4 − du , do not reject null, no autocorrelation
If 4 − du < DW < 4 − dl , test is inconclusive
If 4 − dl < DW < 4, reject null, residuals exhibit negative autocorrelation
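A sketch computing the DW statistic on a hypothetical positively autocorrelated residual series, alongside the 2(1 − r) approximation (the dl /du critical values still come from a table):

```python
import numpy as np

rng = np.random.default_rng(2)
e = np.zeros(100)
for t in range(1, 100):                     # hypothetical AR(1)-style residuals
    e[t] = 0.6 * e[t - 1] + rng.normal()

dw = np.sum(np.diff(e)**2) / np.sum(e**2)   # sum_{t=2..T}(e_t - e_{t-1})^2 / sum_t e_t^2
r = np.corrcoef(e[1:], e[:-1])[0, 1]
print(dw, 2 * (1 - r))                      # DW well below 2 => positive autocorrelation
```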
• Correcting Autocorrelation
Adjust coefficient standard errors: using the Hansen method. Note that this corrects for autocorrelation
and conditional heteroskedasticity. Employ these adjusted standard errors (AKA serial correlation consistent
or Hansen-White standard errors) in hypothesis tests, if autocorrelation is a problem (even in conjunction
with conditional heteroskedasticity).
Improve model specification: explicitly incorporate the time-series nature of the data, e.g. via a seasonal term.
Multicollinearity
• Multicollinearity: when multiple explanatory variables or linear combos thereof are highly correlated.
• Detecting Multicollinearity
Indicative Situation: t-test results suggest individual explanatory variates are not statistically significant, but the F-test suggests they are significant collectively and R² is high.
Rule of thumb: sample correlation above 0.7 for any two explanatory variables signals multicollinearity
is a potential problem
• Correcting Multicollinearity
omit one or more of the correlated independent variables (can be identified using stepwise regression, or
other methods)
Distortion Effects: model misspecifications lead to biased or inconsistent regression coefficients, resulting in unreliable hypothesis test results and inaccurate predictions
• Probit (based on normal distn) and Logit (based on logistic distn) models: application includes estimates of
probability of an event occurring, coefficients estimated via maximum likelihood
• Discriminant models: use a linear function to generate a score or ranking for an observation (e.g. use financial statement ratios as explanatory variables to categorically rate a company as bankrupt or not).
Reading 13: Time Series Analysis
Time Series Basics and Basic Trend Models
• Time Series: set of observations for variable over successive time periods
• Linear Trend: form follows (b1 is slope coeff and t is time period, other notation as before)
yt = b0 + b1 (t) + εt
• Linear (log-linear) trend model works best when variable increases by constant amount (rate)
• Limitations: trend models work poorly when residuals exhibit autocorrelation (detect with DW stat, as
before)
• AR(p) Model
xt = b0 + b1 xt−1 + ... + bp xt−p + εt
One-period-ahead forecast (AR(1) case): x̂t+1 = b̂0 + b̂1 xt
(cfavow) use ”chain rule of forecasting” to obtain two period ahead forecast from the one period ahead
forecast
x̂t+2 = b̂0 + b̂1 x̂t+1 = b̂0 + b̂1 (b̂0 + b̂1 xt )
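A tiny sketch of the chain rule of forecasting with hypothetical AR(1) estimates:

```python
# hypothetical AR(1) estimates and latest observation
b0_hat, b1_hat, x_t = 0.5, 0.8, 10.0

x_t1 = b0_hat + b1_hat * x_t     # one-period-ahead forecast: 8.5
x_t2 = b0_hat + b1_hat * x_t1    # two-period-ahead: feed the forecast back in -> 7.3
print(x_t1, x_t2)
```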
Assessing Fit of AR Models (Autocorrelation test)
• Properly specified AR model should not have significant autocorrelation in its residuals (AR model not useful
if residuals have significant autocorrelation for particular series).
• Testing AR model specification (steps):
1. Estimate AR model using OLS (linear regression), starting with AR(1) model (1st order).
2. Compute autocorrelation function (i.e. of all lags) of the residuals
3. Test whether autocorrelations statistically significant (t-test procedure follows):
Compute the t-stat for autocorrelation of lag k (shown below). Critical value has T − 2 df (two-tailed),
where T is # of observations (NOTE: in cfavow, you CANNOT assume additional prior data available. i.e.
if using AR(1) model with n data points, you only have n − 1 observations).
t = ρt,t−k / (1/√T)   where ρt,t−k is the autocorrelation at lag k, and the denominator 1/√T is its standard error
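A sketch of the residual autocorrelation t-test on hypothetical residuals (a properly specified model should show no significant lags):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
resid = rng.normal(size=60)            # hypothetical residuals from a fitted AR model
T = len(resid)
t_crit = stats.t.ppf(0.975, df=T - 2)  # two-tailed, 5% level

for k in range(1, 5):
    rho_k = np.corrcoef(resid[k:], resid[:-k])[0, 1]   # autocorrelation at lag k
    t_stat = rho_k / (1 / np.sqrt(T))                  # standard error is 1/sqrt(T)
    print(k, round(t_stat, 2), abs(t_stat) > t_crit)
```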
• Parameter (regression coeff) Instability: when estimated parameters change among different periods due to
varying financial and economic conditions
Models estimated on shorter sample periods usually more stable than those estimated with longer ones
(increased chance underlying process has changed)
Trade-off between statistical reliability of longer sample periods and parameter stability of shorter periods
Random Walks vs. Covariance Stationary Processes
• Random walk (simple): defined by these properties
xt = b0 + b1 xt−1 + εt where b0 = 0, b1 = 1 (a random walk with drift has b0 ≠ 0, b1 = 1)
• Note that random walks of either type are NOT Covariance Stationary. Having b1 = 1 implies an infinite mean-reverting level: x̂ = b0 /(1 − b1 ) = ∞
AKA Series has Unit Root (b1 = 1)
Use modified t-test to see whether (b1 − 1) is statistically different from zero. If no, b1 = 1 and series
has unit root.
• First Differencing: Transform time series to changes in value of response variable rather than value itself,
(cfavow) makes series covariance stationary (typically, not always true in reality). Procedure follows for
AR(1) Model:
If original time series x has unit root, then xt − xt−1 = εt . Define yt as changes
yt = xt − xt−1 = εt
yt = b0 + b1 yt−1 + εt where b0 = b1 = 0
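A sketch of first differencing a hypothetical random walk; the differenced series is just the shocks and is covariance stationary:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.cumsum(rng.normal(size=200))   # hypothetical random walk: x_t = x_{t-1} + e_t

y = np.diff(x)                        # first differences: y_t = x_t - x_{t-1}
print(np.var(x), np.var(y))           # levels wander; differences have stable variance
```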
Detecting and Correcting for Seasonality
• Seasonality: time series pattern that repeats from year to year.
• If present, then model is misspecified and unreliable for forecasting, unless the AR model explicitly includes this effect (extra term)
• Detecting Seasonality: use autocorrelation t-test (presented earlier) to see if statistically significant at later
lags
• Correcting Seasonality: include response variable(s) of lags with statistically significant autocorrelation as
explanatory variables in the model
• ARCH(1) Model (estimated by ”regressing” the squared residuals from the fitted time-series model, ε̂t², on their 1st lag)
ε̂t² = a0 + a1 ε̂t−1² + µt where µt is the ARCH residual/error term
• The time series is ARCH(1) if the parameter a1 is statistically significant (different from zero)
• Generalized least squares must be used to develop predictive model for time series model with ARCH residuals
(otherwise, standard errors of model parameters incorrect)
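A sketch of the ARCH(1) test on simulated hypothetical residuals: regress squared residuals on their first lag and check whether a1 is significant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
e = np.zeros(200)
for t in range(1, 200):                          # simulate ARCH(1)-style residuals
    sigma2 = 0.2 + 0.7 * e[t - 1]**2             # variance depends on last squared shock
    e[t] = rng.normal(scale=np.sqrt(sigma2))

eps2 = e**2
fit = sm.OLS(eps2[1:], sm.add_constant(eps2[:-1])).fit()   # eps_t^2 on eps_{t-1}^2

a1, p_val = fit.params[1], fit.pvalues[1]
print(a1, p_val, p_val < 0.05)                   # significant a1 => ARCH(1) present
```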
• When regressing one time series on another, assess both series for unit roots using separate Dickey-Fuller (DF) tests, with possible results listed below:
If both series are covariance stationary, the above regression is valid.
If only one of the response series and explanatory series is covariance stationary (the other has a unit root), the regression is not valid.
If neither series is covariance stationary, the validity of the regression depends on whether the two time
series are Cointegrated.
• Cointegration: when two time series follow same trend and relationship is not expected to change (cfavow:
economically linked to same macro variables).
if the two time series are cointegrated, the residuals of the above regression are covariance stationary and the t-tests are reliable (not true if they are not cointegrated)
Testing for Cointegration: Run the following regression, test residuals for unit root by applying DF test,
compare with critical t-values computed with Engle Granger method (called DF-EG test). Don’t need to
know computation for exam.
yt = b0 + b1 xt + εt
If the DF-EG test rejects the null of a unit root, the residuals/error terms generated by the two time series are covariance stationary and the two series are cointegrated (thus we can model their relationship with regression).
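A sketch of an Engle-Granger style cointegration check using statsmodels' coint (which runs the DF test on the residuals of one series regressed on the other), on two hypothetical series sharing a common stochastic trend:

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(6)
common = np.cumsum(rng.normal(size=300))                    # shared unit-root trend
x = common + rng.normal(scale=0.5, size=300)                # hypothetical series 1
y = 2.0 + 1.5 * common + rng.normal(scale=0.5, size=300)    # hypothetical series 2

t_stat, p_value, crit_values = coint(y, x)                  # Engle-Granger two-step test
print(t_stat, p_value, p_value < 0.05)                      # small p-value => cointegrated
```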
2. For analysis of individual variable, plot values over time and look for non-stationarity indicators (i.e. non-
constant mean or variance, seasonality, or structural change)
4. Run trend analysis, compute residuals, use DW test to test for autocorrelation
Must use another model if autocorrelation detected (otherwise, model can be used)
5. If data has autocorrelation, examine the data for stationarity. If not stationary, transform the data for use in an AR model as follows:
If data has linear trend, 1st difference the data
If data has exponential trend, 1st difference the natural log of the data
If data has structural shift, run two separate models
6. After differencing or separation, if series is stationary, run AR(1) model and test for autocorrelation and
seasonality
Add lagged variables into model (for seasonality, if present) until any significant autocorrelation has been
removed/modeled (model can be used when no autocorrelation present)
7. Test for ARCH, by regressing squared residuals on lagged squared residuals and test if coefficients statistically
significant
If parameters significantly different from zero, correct using generalized least squares (otherwise model
can be used)
8. If you have multiple statistically reliable models, compare their out-of-sample RMSE to determine which is
better at forecasting