Multivariable Regression Summary
Much like in the case of the univariate regression with one independent variable, the multiple
regression model has a number of required assumptions:
(MR.1): Linear Model. The Data Generating Process (DGP), or in other words, the population, is described by a linear (in terms of the coefficients) model:
Y = Xβ + ε (MR.1)
(MR.2): Strict Exogeneity:
E(ε|X) = 0 (MR.2)
(MR.3): Conditional Homoskedasticity:
Var(εi|X) = σ² (MR.3)
(MR.4): Conditionally Uncorrelated Errors:
Cov(εi, εj|X) = 0, i ≠ j (MR.4)
This assumption implies that all error pairs are uncorrelated. For cross-sectional data, this assumption implies that there is no spatial correlation between errors.
(MR.5): There exists no exact linear relationship between the explanatory variables. This means that:
rank(X) = k + 1 (MR.5)
(MR.6): The errors (conditionally on X) are normally distributed:
ε|X ∼ N(0, σ²I) (MR.6)
Multiple Regression: Summary of R functions and Theory
Examining the variables
Note: the code shown is from the example task in chapter 4.11. The code in these slides is only a summary and is therefore not complete.
#From: https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/pairs.html
# Diagonal panel: histogram of each variable
panel.hist <- function(x, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE, breaks = 30)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y / max(y)
  rect(breaks[-nB], 0, breaks[-1], y, ...)
}
# Upper panel: print the absolute pairwise correlation
panel.abs_cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8 / strwidth(txt)
  text(0.5, 0.5, txt, cex = 2)
}
# Scatter plot matrix: histograms on the diagonal, smoothed scatter plots below,
# absolute correlations above
pairs(dt4_train,
      diag.panel = panel.hist,
      lower.panel = panel.smooth,
      upper.panel = panel.abs_cor,
      col = "dodgerblue4",
      pch = 21,
      bg = adjustcolor("dodgerblue3", alpha = 0.2))
The correlation matrix can be visualized with:
# Panel function: draw the heat map cells and print the correlation values on top
myPanel <- function(x, y, z, ...){
  lattice::panel.levelplot(x, y, z, ...)
  my_text <- ifelse(!is.na(z), paste0(round(z, 4)), "")
  lattice::panel.text(x, y, my_text)
}
# Keep only the lower triangle of the correlation matrix
mask <- cor(dt4_train)
mask[upper.tri(mask, diag = TRUE)] <- NA
# Heat map of the pairwise correlations
lattice::levelplot(mask,
                   panel = myPanel,
                   col.regions = viridisLite::viridis(100),
                   main = 'Correlation of numerical variables')
Model estimation
The model Yi = β0 + β1 X1,i + β2 X2,i + β3 X1,i² + β4 (X1,i × X2,i) + εi can be estimated using lm():
my_model <- lm(y ~ 1 + x1 + x2 + I(x1^2) + x1*x2, data = dt4_train)
Note that
- y ~ 1 + x1 + x2 + I(x1^2) + x1*x2 is the same as y ~ 1 + x1 + x2 + I(x1^2) + x1:x2
- y ~ 1 + x1 + x2 + I(x1^2) + x1*x2 is the same as y ~ 1 + x1*x2 + I(x1^2)
- y ~ 1 + I(x1^2) + x1:x2 is Yi = β0 + β1 X1,i² + β2 (X1,i × X2,i) + εi
Furthermore:
- When specifying polynomial and interaction terms it is important to think about their interpretation;
- In general, the coefficient signs should make economic sense, so you should have an initial assumption about their signs. Then, from the estimated model, you should determine whether the coefficients are significant, and if they are, whether their signs make sense (sometimes both a positive and a negative sign can have an economic interpretation);
- As per the lecture note example, you should be able to write down the estimated regression model (see the snippet below).
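One way to read off the estimated coefficients, their significance and their signs is sketched below (this assumes the my_model object estimated above):
# Coefficient estimates, standard errors, t-statistics and p-values, rounded to 4 digits
round(coef(summary(my_model)), 4)
# Full summary, including R-squared and the F-statistic
summary(my_model)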
Linear Restrictions
H0: βeduc = βexper, and βeduc² = βexper²
H1: βeduc ≠ βexper
car::linearHypothesis(mdl_4_fit, c("educ-exper=0"))
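Since H0 above contains two restrictions, both can be passed to linearHypothesis() for a joint test. A minimal sketch, assuming the model terms are named educ, exper, I(educ^2) and I(exper^2) (adjust to the actual coefficient names in mdl_4_fit):
# Joint test of both restrictions in H0 (term names are illustrative)
car::linearHypothesis(mdl_4_fit, c("educ = exper", "I(educ^2) = I(exper^2)"))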
Heteroskedasticity Tests
H0: the residuals are homoskedastic
H1: the residuals are heteroskedastic
# Breusch–Pagan Test
print(lmtest::bptest(my_model))
# Goldfeld–Quandt Test
lmtest::gqtest(my_model, alternative = "two.sided")
# White Test (a Breusch–Pagan test with squares and interactions as regressors)
lmtest::bptest(my_model, ~ x1*x2 + I(x1^2) + I(x2^2), data = dt4_train)
In general, if the p-value < 0.05, we reject the null hypothesis and conclude that the residuals are heteroskedastic.
Autocorrelation Tests
H0 : the errors are serially uncorrelated
H1 : the errors are autocorrelated (the exact order of the autocorrelation depends on the test carried out)
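As a sketch (assuming the my_model object from above), two commonly used tests from the lmtest package are the Durbin–Watson and Breusch–Godfrey tests:
# Durbin–Watson test: first-order autocorrelation of the residuals
lmtest::dwtest(my_model)
# Breusch–Godfrey test: autocorrelation up to the specified order (here, order 2)
lmtest::bgtest(my_model, order = 2)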
Normality Tests
H0 : residuals follow a normal distribution
H1 : residuals do not follow a normal distribution
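As a sketch (again using the my_model object from above), common choices are the Shapiro–Wilk test from base R and the Jarque–Bera test from the tseries package, both applied to the residuals:
# Shapiro–Wilk test on the OLS residuals
shapiro.test(my_model$residuals)
# Jarque–Bera test on the OLS residuals (requires the tseries package)
tseries::jarque.bera.test(my_model$residuals)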
WLS
Alternatively, we can correct the estimates themselves for heteroskedasticity by using WLS with a
generic weight function:
# Estimate the variance function: regress log squared OLS residuals on the regressors
log_resid_sq_ols <- lm.fit(y = log(my_model$residuals^2), x = model.matrix(my_model))
h_est <- exp(log_resid_sq_ols$fitted.values)
# Re-estimate the model with weights inversely proportional to the estimated error variances
my_model_wls <- lm(formula = formula(my_model), data = dt4_train, weights = 1 / h_est)
Note that the R²adj calculated for WLS is not comparable to the OLS R²adj.
HAC
If the residuals are autocorrelated (and possibly, but not necessarily, also heteroskedastic), we can correct the standard errors via:
round(lmtest::coeftest(my_model,
                       sandwich::NeweyWest(my_model, lag = 2))[, ], 4)
Model Specification Tests
Rainbow Test for Linearity:
lmtest::raintest(formula(my_model), order.by = ~ x1, data = dt4_train)
lmtest::raintest(formula(my_model), order.by = ~ x2, data = dt4_train)
...
The data need to be ordered (via order.by). If the p-value < 0.05, we reject the null hypothesis and conclude that the model fit is not adequate.