Level 2 Quants Notes


A multiple linear regression model is used when the dependent variable (the Y variable) is continuous (not discrete) and there is more than one explanatory variable. If the dependent variable is qualitative and discrete, for example a binary variable taking the values 0 or 1, the model should instead be estimated as a logistic regression.

Assumptions of Multiple Regression -

1. Linearity: The relationship between the dependent variable and the independent variables is
linear.
2. Homoskedasticity: The variance of the regression residuals (error terms) is constant for all
observations. If the error variance is not constant, the model suffers from heteroskedasticity,
which violates this assumption.
3. Independence of errors: The observations are independent of one another. This implies the
regression residuals are uncorrelated across observations (the error terms should be
uncorrelated with each other and with the independent variables). Correlated error terms lead
to serial correlation.
4. Normality: The regression residuals are normally distributed.
5. Independence of independent variables:
a. The independent variables are not random (there should be a sound logical or economic
rationale for including each independent variable in the model).
b. There is no exact linear relation between two or more of the independent variables or
combinations of the independent variables.

BIC is the preferred criterion when evaluating which model provides the best fit to the data, while AIC is preferred when the model will be used for forecasting. For both criteria, lower values are better.
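As a minimal sketch (hypothetical data and variable names), statsmodels reports both criteria directly on a fitted OLS model, so two candidate models can be compared side by side:

```python
# Hypothetical illustration: compare two candidate models by AIC/BIC.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 96
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.8 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Model A: two regressors; Model B: adds an irrelevant third regressor.
model_a = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
model_b = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()

# Lower AIC/BIC is better: AIC leans toward forecasting, BIC toward best fit.
print("Model A  AIC:", round(model_a.aic, 2), " BIC:", round(model_a.bic, 2))
print("Model B  AIC:", round(model_b.aic, 2), " BIC:", round(model_b.bic, 2))
```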

The BP test statistic is calculated as nR², where n is the number of observations and R² comes from the auxiliary regression used for the BP test (the squared residuals regressed on the independent variables). Let's take R² as 0.06511, the critical value as 11.07, and n as 96. Then the BP test statistic = 96 × 0.06511 = 6.251. This is less than the critical value of 11.07, so we cannot reject the null hypothesis of no heteroskedasticity. Thus, there is no evidence of heteroskedasticity, and we conclude the model does not suffer from conditional heteroskedasticity.

If the BP test statistic is greater than the critical value, we reject the null hypothesis, which states that there is no heteroskedasticity.

Intuitively, if the R² from the BP regression is very small, the test statistic nR² tends to fall below the critical value, which indicates no evidence of heteroskedasticity; a statistic above the critical value indicates heteroskedasticity.
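A small sketch of that arithmetic (the 11.07 critical value corresponds to a 5% chi-square test with 5 degrees of freedom, which is an assumption about the example's BP regression):

```python
# BP test statistic = n * R^2 from the auxiliary regression of squared residuals.
from scipy.stats import chi2

n, r_squared = 96, 0.06511
bp_stat = n * r_squared                 # 96 * 0.06511 = 6.251

# 11.07 is the 5% chi-square critical value with 5 degrees of freedom
# (df = number of independent variables in the BP regression, assumed here).
critical_value = chi2.ppf(0.95, df=5)   # ~11.07

print(f"BP statistic = {bp_stat:.3f}, critical value = {critical_value:.2f}")
print("Reject H0 (no heteroskedasticity)?", bp_stat > critical_value)
```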

Serial Correlation - The Breusch–Godfrey (BG) test is used to detect serial correlation. If the test statistic is greater than the critical value, that is evidence of serial correlation: we reject the null hypothesis, which states there is no serial correlation.
One way to think about the BG test: when the residuals are correlated across observations, the test statistic becomes large, exceeds the critical value, and indicates the presence of serial correlation, so we reject the null.
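As a sketch, statsmodels has a ready-made BG test that can be run on a fitted OLS result (hypothetical data; the choice of 2 lags is an assumption):

```python
# Hypothetical sketch: Breusch-Godfrey test for serial correlation of residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(1)
n = 96
x = rng.normal(size=n)
y = 2.0 + 0.7 * x + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()

# Test the residuals for serial correlation up to (an assumed) 2 lags.
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(res, nlags=2)
print(f"BG LM statistic = {lm_stat:.3f}, p-value = {lm_pvalue:.3f}")
# A p-value below 0.05 (statistic above the critical value) rejects
# H0 of no serial correlation.
```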

Multicollinearity - To gauge the size of a multicollinearity problem, the analyst can compute a variance inflation factor (VIF) for each independent variable in the regression model; the VIF for variable j is 1/(1 − R²_j), where R²_j comes from regressing X_j on the other independent variables.
A VIF above 10 indicates serious multicollinearity issues requiring correction, while a VIF above 5 warrants further investigation of the given variable. In the example, X3 and X4 each have VIFs above 10, so serious multicollinearity exists for these two variables. The VIFs for X1 and X2 are both well below 5, so multicollinearity does not appear to be an issue for those variables.
Possible solutions for addressing the multicollinearity issues –

1. excluding one or more of the regression variables.

2. using a different proxy for one of the variables.

3. increasing the sample size.

A classic symptom of multicollinearity is a high R² and a significant F-test for the overall regression while the individual t-tests on the slope coefficients are insignificant (a VIF sketch follows below).
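A sketch of how the VIFs could be computed with statsmodels (hypothetical data; x3 is built to be nearly collinear with x1 so the VIF flag actually triggers):

```python
# Hypothetical sketch: compute a VIF for each independent variable.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 96
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1 on purpose

X = sm.add_constant(np.column_stack([x1, x2, x3]))

# Column 0 is the constant; report VIFs for the actual regressors only.
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, "VIF =", round(variance_inflation_factor(X, i), 2))
# Rule of thumb: VIF > 5 warrants investigation, VIF > 10 is a serious problem.
```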

Leverage – relates to the X variables: an observation has high leverage when its values of the independent variables are extreme.

Outlier – relates to the Y variable: an observation whose value of the dependent variable is extreme. High-leverage points and outliers can both be influential observations (observations whose removal would materially change the regression results).

The rule of thumb for the leverage measure is that if it exceeds 3 × (k + 1)/n, where k is the number of independent variables and n is the number of observations, the observation is potentially influential. Since n = 96 and k = 2, 3 × (2 + 1)/96 = 0.09375. Observations whose leverage exceeds this value are potentially influential observations.

The rule of thumb for Cook's D is that an observation is potentially influential if Di > 2 × √(k/n).
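A sketch of both diagnostics on a fitted model (hypothetical data): leverage values are the hat-matrix diagonal, and Cook's distances come from the same influence object.

```python
# Hypothetical sketch: flag potentially influential observations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, k = 96, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

influence = sm.OLS(y, X).fit().get_influence()

leverage = influence.hat_matrix_diag       # h_ii values
cooks_d = influence.cooks_distance[0]      # Cook's D_i values

leverage_cutoff = 3 * (k + 1) / n          # 3 * 3 / 96 = 0.09375
cooks_cutoff = 2 * np.sqrt(k / n)

print("High-leverage observations:", np.where(leverage > leverage_cutoff)[0])
print("Influential by Cook's D:   ", np.where(cooks_d > cooks_cutoff)[0])
```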

Interpretation of dummy variables – for n categories, use n − 1 dummy variables. Also remember that when all the dummy variables are zero, the intercept coefficient represents the omitted base category (December in the monthly example; a sketch follows below).
Interpretation of the overall regression – a very low R² means the model explains little of the variation. The F-statistic tests whether all the slope coefficients are jointly equal to zero.
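A minimal sketch of the n − 1 dummy convention (hypothetical monthly data; with the December category dropped, its effect is absorbed into the intercept when the dummies are used in a regression):

```python
# Hypothetical sketch: month dummies with December as the omitted base category.
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Feb", "Dec", "Jan", "Dec", "Feb"],
    "ret":   [0.02, 0.01, 0.03, 0.015, 0.025, 0.005],
})

# drop_first=True keeps n-1 dummies; the dropped category ("Dec", which sorts
# first alphabetically here) becomes the base captured by the intercept.
dummies = pd.get_dummies(df["month"], drop_first=True)
print(dummies)
```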

If the calculated t-statistic is well below the critical value, the corresponding variable is statistically insignificant.
Multiple Linear Regression with Qualitative Dependent Variables

Qualitative dependent variables (categorical dependent variables) are outcome variables describing
data that fit into categories. For example, to predict whether a company will go bankrupt or not, we
need a qualitative dependent variable (bankrupt or not bankrupt) and company financial
performance data (e.g., return on equity, debt-to-equity ratio, or debt rating) as independent
variables. This qualitative dependent variable is binary, but a dependent variable that falls into more
than two categories is also possible.

An ordinary regression model is not appropriate for situations that require a qualitative dependent
variable, because the forecasted values of y using the model can be less than 0 or greater than 1,
which are illogical values for probability.

Instead, we transform the probability values of the dependent variable into odds: p / (1 – p). For
example, if probability = 0.80, then odds = 0.80 / 0.20 or 4 to 1.

For example, if the probability of a company going bankrupt is 0.75, P/(1 – P) is 0.75/(1 − 0.75) = 3.
So, the odds of bankruptcy are 3 to 1, implying the probability of bankruptcy is three times as large
as the probability of the company not going bankrupt. The natural logarithm (ln) of the odds of an
event happening is the log odds, which is also known as the logit function.
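The probability-to-odds and log-odds arithmetic as a tiny sketch:

```python
import math

p = 0.75
odds = p / (1 - p)           # 0.75 / 0.25 = 3, i.e. odds of 3 to 1
log_odds = math.log(odds)    # the logit, ln(3) ~= 1.0986

# Going back: probability = odds / (1 + odds)
p_back = odds / (1 + odds)   # 0.75
print(odds, round(log_odds, 4), p_back)
```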

Logistic regression (logit) uses the logistic transformation of the event probability (P) into the log
odds, ln [P/(1 − P)], as the dependent variable.

An independent variable’s slope coefficient in a logistic regression model is the change in the log
odds that the event happens per unit change in the independent variable, holding all other
independent variables constant.

The likelihood ratio (LR) test is a method to assess the fit of logistic regression models and is based
on the log-likelihood metric that describes the fit to the data. The LR test statistic is

LR = −2 (Log likelihood restricted model − Log likelihood unrestricted model).
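A sketch of the LR test with statsmodels (hypothetical data): fit an unrestricted and a restricted logit model, form the statistic from their log-likelihoods (.llf), and compare it with a chi-square value whose degrees of freedom equal the number of restrictions.

```python
# Hypothetical sketch: likelihood ratio test for a logistic regression.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 500
x1, x2 = rng.normal(size=(2, n))
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x1 + 0.8 * x2)))
y = rng.binomial(1, p)

unrestricted = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
restricted = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)   # drops x2

lr = -2 * (restricted.llf - unrestricted.llf)
p_value = chi2.sf(lr, df=1)   # one restriction => 1 degree of freedom
print(f"LR = {lr:.3f}, p-value = {p_value:.4f}")
```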

Because the logistic transformation of the event probability is the dependent variable in logistic regression, the interpretation of the regression coefficients is not as intuitive as in a regression with a continuous dependent variable: as noted above, the slope coefficient is the change in the log odds of the event per unit change in the independent variable, holding the other independent variables constant.
Logistic regression is interpreted very differently from ordinary linear regression: it tells us the change in the log odds for a one-unit change in any factor. For example, if asked for the probability associated with a one-unit increase in net assets, take the coefficient on net assets, exponentiate it (the calculator's 2nd [e^x] function) to convert the log odds into odds, and then convert the odds into a probability. The intercept and the other terms are excluded from this particular calculation (a sketch follows below).
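As a sketch of those steps (the 0.65 coefficient on net assets is purely hypothetical):

```python
import math

# Hypothetical slope coefficient on "net assets" from a logit model.
b_net_assets = 0.65

odds = math.exp(b_net_assets)     # e^b: change in odds per unit change
probability = odds / (1 + odds)   # convert the odds into a probability

print(round(odds, 4), round(probability, 4))
```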
