3 Linear Regression


Some Important Questions


When we perform multiple linear regression, we usually are
interested in answering a few important questions:

1. Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?


1. Is There a Relationship Between the Response and Predictors?

• H0: β1 = β2 = · · · = βp = 0
• Ha: at least one slope βj ≠ 0

• This hypothesis test is performed by computing the F-statistic in the ANOVA (ANalysis Of VAriance) table.
• The ANOVA table has many pieces of information. What we care about is the F-ratio and the corresponding p-value.

• How large does the F-statistic need to be before we can reject H0 and conclude that there is a relationship? It turns out that the answer depends on the values of n and p.
• For the advertising data, the p-value associated with the F-statistic is essentially zero, so we have extremely strong evidence that at least one of the media is associated with increased sales.
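
As a concrete illustration, here is a minimal sketch of computing the overall F-statistic from the residual and total sums of squares, F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]. The data are simulated and all names are illustrative, not taken from the advertising example:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 3                           # sample size and number of predictors
X = rng.normal(size=(n, p))
y = 5 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

# Fit by least squares with an intercept column
Xd = np.column_stack([np.ones(n), X])
beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta

rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares

# F = [(TSS - RSS) / p] / [RSS / (n - p - 1)]
F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)   # upper-tail probability under H0
print(F, p_value)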


Why do we need the F-statistic?

• Given the individual p-values for each variable, why do we need to look at the overall F-statistic? It seems that if any one of the individual p-values is very small, then at least one predictor must be related to the response.
• However, this logic is flawed, especially when the number of predictors p is large.
• Even in the absence of any true association between the predictors and the response, we expect approximately 5% of the individual p-values to be small (below 0.05) purely by chance!
• The F-statistic does not suffer from this problem because it adjusts for the number of predictors.
• The approach of using an F-statistic to test for any association between the predictors and the response works when p is relatively small, and certainly small compared to n.
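
The 5%-by-chance phenomenon is easy to demonstrate. The following sketch (simulated data, arbitrary seed) regresses a response on each of 100 completely unrelated predictors, one at a time, and counts the small p-values:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 100                 # many predictors, none truly related to y

X = rng.normal(size=(n, p))
y = rng.normal(size=n)          # response generated independently of X

# Individual t-tests: simple regression of y on each predictor alone
p_values = []
for j in range(p):
    slope, intercept, r, p_val, se = stats.linregress(X[:, j], y)
    p_values.append(p_val)

# Roughly 5% of the individual p-values fall below 0.05 by chance alone
print(np.mean(np.array(p_values) < 0.05))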


2. Deciding on Important Variables

• It is more often the case that the response is only related to a subset of the predictors.
• The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection.

• Best subset selection: fit a separate least squares regression model for each possible combination of the p predictors (a sketch follows below).
• Unfortunately, this means a total of 2^p models that contain subsets of p variables.
• Only feasible when p is small.

• We need an automated and efficient approach to choose a smaller set of models to consider.
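
A sketch of best subset selection as just described; the function name and data layout are assumptions for illustration:

from itertools import combinations
import numpy as np

def best_subset(X, y):
    """Exhaustive best subset selection scored by RSS.
    For each subset size k, keeps the combination of predictors
    with the lowest residual sum of squares. Cost grows as 2^p."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            Xs = np.column_stack([np.ones(n), X[:, list(subset)]])
            beta, _, _, _ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = float(np.sum((y - Xs @ beta) ** 2))
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)
    return best   # {size: (best subset of that size, its RSS)}

Note that RSS always decreases as a subset grows, so subsets of different sizes must be compared with a criterion such as adjusted R², AIC/BIC, or cross-validated error rather than raw RSS.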


Variable Selection
Forward stepwise selection:
• Begin with the null model:
• model that contains an intercept but no predictors.

• Fit p simple linear regressions and add to the null model the
variable that results in the lowest RSS.

• Add to that model the variable that results in the lowest RSS for
the new two-variable model.

• Continue until some stopping rule is satisfied.
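
A minimal sketch of forward stepwise selection, assuming the predictors sit in a NumPy matrix X and using RSS as the per-step criterion as described above; a real stopping rule would use adjusted R², AIC/BIC, or cross-validation:

import numpy as np

def forward_stepwise(X, y, max_vars=None):
    """Greedy forward selection: start from the null model and at each
    step add the predictor whose inclusion gives the lowest RSS."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    while remaining and (max_vars is None or len(selected) < max_vars):
        best_j, best_rss = None, np.inf
        for j in remaining:
            Xs = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, _, _, _ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = float(np.sum((y - Xs @ beta) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_rss))   # one model per step
    return path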


Variable Selection (2)


Backward stepwise selection:
• Start with all variables in the model.

• Remove the variable with the largest p-value
• that is, the least statistically significant variable.

• The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.

• The procedure continues until a stopping rule is reached
• e.g., stop when all remaining variables have a p-value below some threshold.
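
A sketch of this procedure with statsmodels, which conveniently exposes per-coefficient p-values; the 0.05 threshold is an illustrative choice:

import numpy as np
import statsmodels.api as sm

def backward_stepwise(X, y, threshold=0.05):
    """Start with all predictors; repeatedly refit and drop the predictor
    with the largest p-value until all remaining ones fall below threshold."""
    cols = list(range(X.shape[1]))
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = fit.pvalues[1:]        # skip the intercept's p-value
        worst = int(np.argmax(pvals))
        if pvals[worst] < threshold:
            break                      # every remaining variable is significant
        cols.pop(worst)                # remove the least significant variable
    return cols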


Variable Selection (3)


Hybrid approaches = mixed selection:
• Variables are added to the model sequentially, in analogy to forward selection.

• However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit
• the p-values for variables can become larger as new predictors are added to the model (a p-value rises above a certain threshold).

• Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.
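
A sketch combining the two ideas; the add/drop thresholds are illustrative assumptions, not values from the slides:

import numpy as np
import statsmodels.api as sm

def mixed_selection(X, y, add_thresh=0.05, drop_thresh=0.10):
    """Forward steps that add the most significant new predictor, each
    followed by a backward check that drops any predictor whose p-value
    has risen above drop_thresh as the model grew."""
    cols, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Forward step: find the candidate with the smallest p-value
        best_j, best_p = None, 1.0
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, cols + [j]])).fit()
            if fit.pvalues[-1] < best_p:
                best_j, best_p = j, fit.pvalues[-1]
        if best_p >= add_thresh:
            break                      # nothing left worth adding
        cols.append(best_j)
        remaining.remove(best_j)
        # Backward check: drop the worst variable if it became insignificant
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = fit.pvalues[1:]
        worst = int(np.argmax(pvals))
        if pvals[worst] > drop_thresh:
            remaining.append(cols.pop(worst))
    return cols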


3. Model Fit

• In MLR, R² = Cor(Y, Ŷ)²
• the square of the correlation between the response and the fitted values; in fact, one property of the fitted linear model is that it maximizes this correlation among all possible linear models.

• The model that uses only TV and radio to predict sales has an R² value of 0.89719; the model with only TV as a predictor had an R² of 0.61.
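
A quick numerical check of the identity R² = Cor(Y, Ŷ)² on simulated data (any OLS fit would do):

import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))
y = 3 + X @ np.array([1.5, -2.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta

# R^2 from sums of squares ...
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

# ... equals the squared correlation between y and the fitted values
r2_cor = np.corrcoef(y, y_hat)[0, 1] ** 2
print(r2, r2_cor)   # the two values agree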


Observations
• Adding radio to the model leads to a substantial improvement in R² (TV + radio gives better prediction: higher R² and smaller RSE).
• There is only a small increase in R² if we include newspaper advertising in the model that already contains TV and radio advertising.
• Previously observed: the coefficient for newspaper advertising is not significant (large p-value).
• This is additional evidence that newspaper can be dropped from the model.

• R² will always increase when more variables are added to the model, even if those variables are only weakly associated with the response.

• The inclusion of variables that do not provide a real improvement in the fit to the training samples will likely lead to poor results on independent test samples, due to overfitting.
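
A small simulation illustrating this point (the sizes, seed, and number of noise predictors are arbitrary): training R² rises as useless predictors are added, while test error typically worsens.

import numpy as np

rng = np.random.default_rng(3)
n_train, n_test = 60, 1000
X_tr = rng.normal(size=(n_train, 20))   # only the first 2 columns matter
X_te = rng.normal(size=(n_test, 20))
f = lambda X: 2 * X[:, 0] - X[:, 1]
y_tr = f(X_tr) + rng.normal(size=n_train)
y_te = f(X_te) + rng.normal(size=n_test)

for k in (2, 10, 20):                   # add more and more noise predictors
    A_tr = np.column_stack([np.ones(n_train), X_tr[:, :k]])
    A_te = np.column_stack([np.ones(n_test), X_te[:, :k]])
    beta, _, _, _ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    r2 = 1 - np.sum((y_tr - A_tr @ beta) ** 2) / np.sum((y_tr - y_tr.mean()) ** 2)
    mse = np.mean((y_te - A_te @ beta) ** 2)
    print(k, round(r2, 3), round(mse, 3))  # training R^2 rises; test MSE typically worsens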


4. Predictions
• Once we have fit the multiple regression model, it is
straightforward to predict the response Y on the basis of a set
of values for the predictors.

• However, there are three sorts of uncertainty associated with this prediction:
1. The inaccuracy in the coefficient estimates is related to the reducible error.
2. Model bias (the assumption of linearity).
3. Irreducible error: even if we knew the true values of the coefficients, the response could not be predicted perfectly because of the random error ε in the model.
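
In practice these uncertainties are summarized by confidence intervals (for the average response, reflecting reducible error) and wider prediction intervals (which also include the irreducible error ε). A sketch with statsmodels on simulated data; all names are illustrative:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
x_new = sm.add_constant(np.array([[0.5, -0.3], [1.0, 2.0]]), has_constant='add')

pred = model.get_prediction(x_new)
# Confidence interval: uncertainty in the average response (reducible error)
print(pred.conf_int())
# Prediction interval: additionally includes the irreducible error
print(pred.conf_int(obs=True))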


Other Issues/Considerations

• Qualitative Predictors

• Interaction Terms

• Non-linear effects

• Multicollinearity

• Model Selection


Qualitative Predictors
• Investigate differences in credit card balance between males and
females, ignoring the other variables for the moment.

• How do you stick "men" and "women" (a qualitative predictor, i.e., a factor) into a regression equation?

• Code them as indicator variables (dummy variables).


• For example, create a dummy variable x that equals 1 if a person is female and 0 if male. Using this variable as a predictor in the regression equation results in the model

balance = β0 + β1·x + ε

• Now β0 can be interpreted as the average credit card balance among males, β0 + β1 as the average credit card balance among females, and β1 as the average difference in credit card balance between females and males.
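
A sketch of this coding with a statsmodels formula. The miniature data frame is made up for illustration, and the Treatment(reference='Male') argument makes males the baseline level so the coefficients match the interpretation above:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical miniature version of the Credit data set
df = pd.DataFrame({
    "balance": [331, 903, 580, 964, 412, 1100],
    "gender":  ["Male", "Female", "Male", "Female", "Male", "Female"],
})

# C(...) turns the factor into a 0/1 dummy variable automatically
fit = smf.ols("balance ~ C(gender, Treatment(reference='Male'))", data=df).fit()
print(fit.params)
# Intercept             -> beta0, the average balance among males
# ...[T.Female] term    -> beta1, the female-minus-male average difference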


Alternative coding schemes


• Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable that takes the value +1 for females and −1 for males.

• Now β0 can be interpreted as the overall average credit card balance (ignoring the gender effect), and β1 is the amount that females are above the average and males are below the average.
• What are the estimates of β0 and β1 under this coding?


It is important to note that the final predictions for the credit balances of males and females will be identical regardless of the coding scheme used. The only difference is in the way that the coefficients are interpreted.
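
A quick check of this fact with toy numbers: the two codings produce different coefficients but identical fitted values.

import numpy as np

balance = np.array([331.0, 903.0, 580.0, 964.0, 412.0, 1100.0])
is_female = np.array([0, 1, 0, 1, 0, 1])

# Coding 1: x = 1 for females, 0 for males
X1 = np.column_stack([np.ones(6), is_female])
b1, _, _, _ = np.linalg.lstsq(X1, balance, rcond=None)

# Coding 2: x = +1 for females, -1 for males
X2 = np.column_stack([np.ones(6), 2 * is_female - 1])
b2, _, _, _ = np.linalg.lstsq(X2, balance, rcond=None)

# Different coefficients, identical predictions
print(b1, b2)
print(np.allclose(X1 @ b1, X2 @ b2))   # True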


Extensions of the Linear Model


The standard linear regression model makes several highly restrictive assumptions that are often violated in practice. Two of the most important assumptions state that the relationship between the predictors and the response is additive and linear.

1. The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors.

2. The linear assumption states that the change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj.


Interaction terms in advertising

Sales = β0 + β1×TV + β2×Radio + β3×(TV×Radio)

Grouping the terms that multiply TV makes the interaction explicit:

Sales = β0 + (β1 + β3×Radio)×TV + β2×Radio

• Spending $1 extra on TV increases average sales by 0.0191 + 0.0011×Radio.

Grouping instead around Radio:

Sales = β0 + (β2 + β3×TV)×Radio + β1×TV

• Spending $1 extra on Radio increases average sales by 0.0289 + 0.0011×TV.
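
A sketch of fitting this interaction model with a statsmodels formula. The file name and column capitalization are assumptions; in the published Advertising data set the column names may differ:

import pandas as pd
import statsmodels.formula.api as smf

ads = pd.read_csv("Advertising.csv")     # hypothetical path to the data

# TV * Radio expands to TV + Radio + TV:Radio (main effects + interaction)
fit = smf.ols("Sales ~ TV * Radio", data=ads).fit()
print(fit.summary())

# The slope of TV at a given Radio level is beta_TV + beta_interaction * Radio
b = fit.params
print(b["TV"] + b["TV:Radio"] * 10)      # effect of $1 of TV when Radio = 10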


Observations
• We can interpret β3 as the increase in the effectiveness of TV advertising for a one-unit increase in radio advertising (or vice versa).

• The p-value for the interaction term, TV×radio, is extremely low, indicating that there is strong evidence for Ha: β3 ≠ 0.
• In other words, it is clear that the true relationship is not additive.

• The R² for this model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.

• The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.


(Multi) Collinearity
• Situation in which two or more predictor variables are closely
related to one another.

• Collinearity can pose problems in the regression context
• since it can be difficult to separate out the individual effects of collinear variables on the response.

• Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix
• it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation
• this situation is called multicollinearity.
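
Multicollinearity is usually diagnosed with the variance inflation factor (VIF), obtained by regressing each predictor on all of the others: VIF_j = 1/(1 − R²_j). A sketch follows; the rule of thumb that a VIF above 5 or 10 signals a problem is conventional, not from these slides.

import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, _, _, _ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)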


Potential Fit Problems

There are a number of possible problems that one may encounter when fitting the linear regression model:
1. Non-linearity of the data
2. Dependence of the error terms
3. Non-constant variance of error terms
4. Outliers
5. High leverage points
6. Collinearity

See Section 3.3.3 for more details.
