Lecture 13
INTRODUCTION TO PROBABILITY AND STATISTICS
FOURTEENTH EDITION
Chapter 13
Multiple Regression Analysis
INTRODUCTION
We extend the concept of simple linear regression as we
investigate a response y which is affected by several
independent variables, x1, x2, x3,…, xk.
Our objective is to use the information provided by the
xi to predict the value of y.
EXAMPLE
Let y be a student’s college achievement,
measured by his/her GPA. This might be a
function of several variables:
x1 = rank in high school class
x2 = high school’s overall rating
x3 = high school GPA
x4 = SAT scores
We want to predict y using knowledge of
x1, x2, x3 and x4.
EXAMPLE
Let y be the monthly sales revenue for a
company. This might be a function of
several variables:
x1 = advertising expenditure
x2 = time of year
x3 = state of economy
x4 = size of inventory
We want to predict y using knowledge of
x1, x2, x3 and x4.
SOME QUESTIONS
How well does the model fit?
How strong is the relationship between y and the
predictor variables?
Have any assumptions been violated?
How good are the estimates and predictions?
THE METHOD OF LEAST SQUARES
The least squares estimates b0, b1, …, bk are chosen to minimize
SSE = Σ(y − ŷ)² = Σ(y − b0 − b1x1 − … − bkxk)²
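As an aside, here is a minimal Python sketch of this minimization, using a small set of hypothetical observations with k = 2 predictors (the numbers are placeholders, not data from the slides):

```python
import numpy as np

# Hypothetical observations: response y and k = 2 predictors x1, x2.
y = np.array([3.2, 4.1, 5.8, 7.0, 8.9])
X = np.array([[1.0, 2.0],
              [1.5, 1.0],
              [2.0, 3.0],
              [2.5, 2.0],
              [3.0, 4.0]])

# Prepend the intercept column and solve for b = (b0, b1, b2) minimizing SSE.
X_design = np.column_stack([np.ones(len(y)), X])
b, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

sse = np.sum((y - X_design @ b) ** 2)   # SSE = sum of squared residuals at the minimum
print(b, sse)
```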
EXAMPLE
A computer database in a small community contains
the listed selling price y (in thousands of dollars),
the amount of living area x1 (in hundreds of square
feet), and the number of floors x2, bedrooms x3, and
bathrooms x4, for n = 15 randomly selected
residences currently on the market.
Fit a first-order model to the data using the method of least squares.

Property   y      x1   x2   x3   x4
1          69.0    6    1    2    1
2          118.5  10    1    2    2
3          116.5  10    1    3    2
…          …       …    …    …    …
15         209.9  21    2    4    3
EXAMPLE
The first-order model is
E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4,
fit using Minitab with the values of y and the four independent variables entered into five columns of the Minitab worksheet.
Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths
Analysis of Variance
Source          DF  SS       MS      F      P
Regression       4  15913.0  3978.3  84.80  0.000
Residual Error  10    469.1    46.9
Total           14  16382.2

Source   DF  Seq SS
SqFeet    1  14829.3
NumFlrs   1      0.9
Bdrms     1    166.4
Baths     1    916.5

Sequential sums of squares: the conditional contribution of each independent variable to SSR, given the variables already entered into the model.
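For comparison, here is a sketch of fitting the same kind of first-order model in Python with statsmodels; the file name homes.csv is a hypothetical placeholder for the 15 observations, and anova_lm with typ=1 reports sequential (Type I) sums of squares like Minitab's Seq SS column:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical file holding the 15 observations, with the column names used above.
homes = pd.read_csv("homes.csv")

# First-order model with the four independent variables.
fit = smf.ols("ListPrice ~ SqFeet + NumFlrs + Bdrms + Baths", data=homes).fit()

print(fit.summary())          # coefficients, t tests, R-Sq, overall F test
print(anova_lm(fit, typ=1))   # Type I (sequential) sums of squares, like Minitab's Seq SS
```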
TESTING THE USEFULNESS
OF THE MODEL
• The first question to ask is whether the
regression model is of any use in predicting y.
• If it is not, then the value of y does not change,
regardless of the value of the independent
variables, x1, x2 ,…, xk. This implies that the
partial regression coefficients, β1, β2, …, βk, are
all zero.
H0: β1 = β2 = … = βk = 0  versus
Ha: at least one βi is not zero
THE F TEST
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
[ANOVA partition of the total variation: Total SS = SSR + SSE]
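As a quick check of this test with the real estate ANOVA table shown earlier (k = 4 predictors, n = 15 observations), a small Python sketch:

```python
from scipy import stats

# Values from the real estate ANOVA table: k = 4 predictors, n = 15 homes.
msr, mse = 3978.3, 46.9
k, n = 4, 15

f_stat = msr / mse                               # F = MSR / MSE
p_value = stats.f.sf(f_stat, k, n - k - 1)       # upper-tail area, df = (k, n - k - 1)
print(round(f_stat, 2), p_value)                 # about 84.8, p-value essentially 0
```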
TESTING THE PARTIAL
REGRESSION COEFFICIENTS
• Is a particular independent variable useful
in the model, in the presence of all the
other independent variables? The test
statistic is a function of bi, our best estimate
of βi.
H0: βi = 0  versus  Ha: βi ≠ 0
Test statistic: t = (bi − 0) / SE(bi),
which has a t distribution with error df = n − k − 1.
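A minimal sketch of this t test in Python; the coefficient and standard error below are hypothetical placeholders, not values from the printout:

```python
from scipy import stats

b_i, se_bi = 6.27, 0.80    # hypothetical coefficient and standard error (placeholders)
n, k = 15, 4               # sample size and number of predictors

t_stat = (b_i - 0) / se_bi                           # tests H0: beta_i = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - k - 1)     # two-tailed p-value, error df = n - k - 1
print(round(t_stat, 2), round(p_value, 4))
```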
THE REAL ESTATE
PROBLEM
Is the overall model useful in predicting list
price? How much of the overall variation in
the response is explained by the
regression model?
S = 6.84930   R-Sq = 97.1%   R-Sq(adj) = 96.0%

Analysis of Variance
Source          DF  SS       MS      F      P
Regression       4  15913.0  3978.3  84.80  0.000
Residual Error  10    469.1    46.9
Total           14  16382.2

Source   DF  Seq SS
SqFeet    1  14829.3
NumFlrs   1      0.9
Bdrms     1    166.4
Baths     1    916.5

F = MSR/MSE = 84.80 with p-value = .000 is highly significant. The model is very useful in predicting the list price of homes. R² = .971 indicates that 97.1% of the overall variation is explained by the regression model.
MEASURING THE STRENGTH
OF THE RELATIONSHIP
• Since Total SS = SSR + SSE, R2 measures
✓ the proportion of the total variation in the
responses that can be explained by using the
independent variables in the model.
✓ the percent reduction in the total variation
achieved by using the regression equation rather than
just using the sample mean y-bar to estimate y.
R² = SSR / Total SS   and   F = MSR / MSE = (R²/k) / [(1 − R²)/(n − k − 1)]
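Plugging the real estate values into the right-hand expression reproduces the F statistic in the printout, up to the rounding of R-Sq; a short check:

```python
# Values from the real estate printout: R-Sq = 97.1%, k = 4, n = 15.
r_sq, k, n = 0.971, 4, 15

f_from_r2 = (r_sq / k) / ((1 - r_sq) / (n - k - 1))
print(round(f_from_r2, 1))   # about 83.7, matching F = 84.80 up to the rounding of R-Sq
```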
THE REAL ESTATE PROBLEM
In the presence of the other three independent
variables, is the number of bedrooms significant in
predicting the list price of homes? Test using α = .05.
Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths
[Normal probability plot of the residuals: Percent vs. Residual]
RESIDUALS VERSUS FITS
✓ If the equal variance assumption is valid,
the plot should appear as a random
scatter around the zero center line.
✓ If not, you will see a pattern in the
residuals.
[Residuals versus the fitted values (response is ListPrice): Residual vs. Fitted Value]
ESTIMATION AND
PREDICTION
• Once you have
✓ determined that the regression line is
useful
✓ used the diagnostic plots to check for
violation of the regression assumptions.
• You are ready to use the regression line to
✓ Estimate the average value of y for
a given value of x
✓ Predict a particular value of y for a
given value of x.
ESTIMATION AND
PREDICTION
• Enter the appropriate values of x1, x2, …, xk
in Minitab. Minitab calculates
ŷ = b0 + b1x1 + b2x2 + … + bkxk
• and both the confidence interval and the
prediction interval.
• Particular values of y are more difficult to
predict, requiring a wider range of values
in the prediction interval.
THE REAL ESTATE PROBLEM
Estimate the average list price for a home with 1000
square feet of living space, one floor, 3 bedrooms and
two baths with a 95% confidence interval.
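Minitab produces this estimate and interval once the x values are entered. A comparable sketch in Python with statsmodels, assuming the 15 observations are stored in the hypothetical homes.csv file with the column names used earlier:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file with the 15 listed residences and the columns used earlier.
homes = pd.read_csv("homes.csv")
fit = smf.ols("ListPrice ~ SqFeet + NumFlrs + Bdrms + Baths", data=homes).fit()

# New home: 1000 sq ft = 10 (hundreds of square feet), 1 floor, 3 bedrooms, 2 baths.
new_home = pd.DataFrame({"SqFeet": [10], "NumFlrs": [1], "Bdrms": [3], "Baths": [2]})

pred = fit.get_prediction(new_home).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])  # 95% confidence interval for E(y)
print(pred[["obs_ci_lower", "obs_ci_upper"]])            # 95% prediction interval for y
```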
EXAMPLE
Sales, y: 2.5  2.6  2.7  5.0  5.3  9.1  14.8  17.5  23.0  28.0
[Scatterplot of y vs x]
Since there is only one independent variable, you could fit a linear, quadratic, or cubic polynomial model. Which would you pick?
TWO POSSIBLE CHOICES
A straight-line model: y = β0 + β1x + ε
A quadratic model: y = β0 + β1x + β2x² + ε
Here is the Minitab printout for the straight line:
Regression Analysis: y versus x
The regression equation is
y = - 6.47 + 4.34 x

Predictor  Coef    SE Coef  T      P
Constant   -6.465  2.795    -2.31  0.049
x          4.3355  0.6274    6.91  0.000

S = 3.72495   R-Sq = 85.6%   R-Sq(adj) = 83.9%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression       1  662.46  662.46  47.74  0.000
Residual Error   8  111.00   13.88
Total            9  773.46

The overall F test is highly significant, as is the t-test of the slope. R² = .856 suggests a good fit. Let's check the residual plots…
EXAMPLE
[Residuals versus the fitted values for the straight-line fit (response is y)]
There is a strong pattern of a "curve" left over in the residual plot.

Fitting the quadratic model instead gives:

Analysis of Variance
Source          DF  SS      MS      F       P
Regression       2  751.98  375.99  122.49  0.000
Residual Error   7   21.49    3.07
Total            9  773.47

[Residuals versus the fitted values for the quadratic fit (response is y)]
The quadratic model is better. There are no patterns in the residual plot, indicating that this is the correct model for the data.
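A minimal sketch of how a straight-line and a quadratic fit can be compared in Python; the data below are synthetic placeholders (the x values for the sales example are not listed above), so only the workflow carries over:

```python
import numpy as np

# Synthetic data with a deliberate curve, standing in for the sales example.
rng = np.random.default_rng(1)
x = np.linspace(1, 7, 10)
y = 0.5 * x**2 + 0.3 * x + rng.normal(0, 0.5, size=x.size)

for degree in (1, 2):                        # straight line, then quadratic
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    print(degree, np.round(coeffs, 3), round(float(np.sum(residuals**2)), 3))
# The quadratic fit shows a much smaller SSE, and its residuals show no leftover pattern.
```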
USING QUALITATIVE
VARIABLES
• Multiple regression requires that the
response y be a quantitative variable.
• Independent variables can be either
quantitative or qualitative.
• Qualitative variables involving k
categories are entered into the model by
using k-1 dummy variables.
• Example: To enter gender as a variable, use
x = 1 if male; 0 if female (see the sketch below).
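A sketch of how this dummy coding can be generated with pandas; the column name gender and its categories are illustrative assumptions:

```python
import pandas as pd

# Illustrative column; any qualitative variable with k categories works the same way.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# k categories become k - 1 dummy (0/1) columns; the dropped category is the baseline.
dummies = pd.get_dummies(df["gender"], drop_first=True, dtype=int)
print(dummies)   # single column 'male': 1 if male, 0 if female
```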
EXAMPLE
Data was collected on 6 male and 6 female
assistant professors. The researchers recorded their
salaries (y) along with years of experience (x1). The
professor’s gender enters into the model as a
dummy variable: x2 = 1 if male; 0 if not.
Professor Salary, y Experience, x1 Gender, x2 Interaction, x1x2
1 $50,710 1 1 1
2 49,510 1 0 0
… … … … …
11 55,590 5 1 5
12 53,200 5 0 0
EXAMPLE
We want to predict a professor’s salary based on
years of experience and gender. We think that
there may be a difference in salary depending on
whether you are male or female.
The model we choose includes experience (x1),
gender (x2), and an interaction term (x1x2) to
allow salaries for males and females to behave
differently:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
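A sketch of how a model with an interaction term could be fit with statsmodels; the file name salaries.csv and the column names are assumptions, and the formula experience * gender expands to the two main effects plus their interaction:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file with columns salary, experience, gender (gender coded 0/1 as above).
salaries = pd.read_csv("salaries.csv")

# experience * gender expands to experience + gender + experience:gender,
# i.e. both main effects plus the interaction term x1x2.
fit = smf.ols("salary ~ experience * gender", data=salaries).fit()
print(fit.summary())
```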
MINITAB OUTPUT
We use Minitab to fit the model.

Regression Analysis: y versus x1, x2, x1x2
The regression equation is
y = 48593 + 969 x1 + 867 x2 + 260 x1x2

Predictor  Coef     SE Coef  T       P
Constant   48593.0  207.9    233.68  0.000
x1         969.00   63.67     15.22  0.000
x2         866.7    305.3      2.84  0.022
x1x2       260.13   87.06      2.99  0.017

S = 201.344   R-Sq = 99.2%   R-Sq(adj) = 98.9%

Analysis of Variance
Source          DF  SS        MS        F       P
Regression       3  42108777  14036259  346.24  0.000
Residual Error   8    324315     40539
Total           11  42433092

Is the overall model useful in predicting y? The overall F test is F = 346.24 with p-value = .000. The value of R² = .992 indicates that the model fits very well.

What is the regression equation for males? For females?
For males, x2 = 1: y = 49459.7 + 1229.13x1
For females, x2 = 0: y = 48593.0 + 969.0x1
Two different straight-line models.

Is there a difference in the relationship between salary and years of experience, depending on the gender of the professor?
Yes. The individual t-test for the interaction term is t = 2.99 with p-value = .017. This indicates a significant interaction between gender and years of experience.
EXAMPLE
Have any of the regression assumptions been violated, or have we fit the wrong model?
[Residuals versus the fitted values and normal probability plot of the residuals (response is y)]
It does not appear from the diagnostic plots that any of the regression assumptions have been violated.
KEY CONCEPTS
MSE = SSE / (n − k − 1)
IV. Testing, Estimation, and Prediction
1. A test for the significance of the regression,
H0: β1 = β2 = … = βk = 0, can be implemented using
the analysis of variance F test:
F = MSR / MSE
KEY CONCEPTS
2. The strength of the relationship between x
and y can be measured using
R² = SSR / Total SS
which gets closer to 1 as the relationship
gets stronger.
3. Use residual plots to check for
nonnormality, inequality of variances, and
an incorrectly fit model.
4. Significance tests for the partial regression
coefficients can be performed using the
Student's t test with error df = n − k − 1:
t = (bi − βi) / SE(bi)
KEY CONCEPTS
5. Confidence intervals can be generated by
computer to estimate the average value of
y, E(y), for given values of x1, x2, …, xk.
Computer-generated prediction intervals
can be used to predict a particular
observation y for given values of x1, x2, …, xk.
For given x1, x2, …, xk, prediction intervals
are always wider than confidence intervals.
KEY CONCEPTS
V. Model Building
1. The number of terms in a regression model cannot
exceed the number of observations in the data set and
should be considerably less!
2. To account for a curvilinear effect in a quantitative
variable, use a second-order polynomial model. For a
cubic effect, use a third-order polynomial model.
3. To add a qualitative variable with k categories, use (k
− 1) dummy or indicator variables.
4. There may be interactions between two qualitative
variables or between a quantitative and a qualitative
variable. Interaction terms are entered as βxixj.
5. Compare models using R²(adj).