Regression Analysis and Multiple Regression: Session 7
Multiple Regression
Session 7
Simple Linear Regression Model
• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• Using the Computer
• Summary and Review of Terms
7-1 Using Statistics
This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]
Model Building
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship. A statistical model separates the systematic component of a relationship from the random component.

  Data = Statistical model (systematic component) + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE). In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
7-2 The Simple Linear
Regression Model
The population simple linear regression model:

  Y = β0 + β1X + ε

where β0 + β1X is the nonrandom (systematic) component and ε is the random component. The expected value of Y for a given Xi is

  E[Yi] = β0 + β1Xi

where β0 is the intercept and β1 is the slope. Actual observed values of Y differ from the expected value by an unexplained or random error:

  Yi = E[Yi] + εi = β0 + β1Xi + εi
Assumptions of the Simple
Linear Regression Model
• The relationship between X and Y is a straight-line relationship:
  E[Y] = β0 + β1X
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εi.
• The errors εi are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations. That is: εi ~ N(0, σ²).
[Figure: Identical normal distributions of errors, all centered on the regression line]
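A minimal Python sketch (not part of the original slides) that simulates data satisfying these assumptions; the intercept, slope, and error standard deviation below are arbitrary illustrative values:

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 5.0, 2.0, 1.5            # assumed illustrative parameters
x = np.linspace(0, 10, 50)                     # X values treated as fixed, not random
eps = rng.normal(0.0, sigma, size=x.size)      # errors ~ N(0, sigma^2), uncorrelated
y = beta0 + beta1 * x + eps                    # Y = beta0 + beta1*X + epsilon
print(y[:5])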
7-3 Estimation: The Method of
Least Squares
Estimation of a simple linear regression relationship involves finding
estimated or predicted values of the intercept and slope of the linear
regression line.
The estimated regression equation:

  Y = b0 + b1X + e

The fitted regression line:

  Ŷ = b0 + b1X

where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
Fitting a Regression Line
[Figure: Data points and the three errors from the least squares regression line]

  Ŷ = b0 + b1X    the fitted regression line
  Ŷi              the predicted value of Y for Xi
  Error:  ei = Yi - Ŷi
Least Squares Regression
The sum of squared errors in regression is:

  SSE = Σ(i=1 to n) ei² = Σ(i=1 to n) (yi - ŷi)²

The least squares regression line is the one that minimizes the SSE with respect to the estimates b0 and b1. The normal equations:

  Σ yi   = n·b0 + b1 Σ xi          (least squares b0)
  Σ xiyi = b0 Σ xi + b1 Σ xi²      (least squares b1)
Sums of Squares, Cross Products,
and Least Squares Estimators
Sums of Squares and Cross Products:

  SSx  = Σ(x - x̄)²        = Σx² - (Σx)²/n
  SSy  = Σ(y - ȳ)²        = Σy² - (Σy)²/n
  SSxy = Σ(x - x̄)(y - ȳ)  = Σxy - (Σx)(Σy)/n

Least squares regression estimators:

  b1 = SSxy / SSx
  b0 = ȳ - b1x̄
Example 7-1
Miles traveled and dollars charged for 25 credit-card accounts:

  Miles  Dollars    Miles  Dollars    Miles  Dollars
  1211   1802       2468   3694       4267   5738
  1345   2405       2699   3371       4498   6420
  1422   2005       2806   3998       4533   6059
  1687   2511       3082   3555       4804   6426
  1849   2332       3209   4692       5090   6321
  2026   2305       3466   4244       5233   7026
  2133   3016       3643   5298       5439   6964
  2253   3385       3852   4801
  2400   3090       4033   5147

Column totals (n = 25): Σx = 79,448   Σy = 106,605   Σx² = 293,426,944   Σxy = 390,185,024

  SSx  = Σx² - (Σx)²/n    = 293,426,944 - 79,448²/25          = 40,947,552
  SSxy = Σxy - (Σx)(Σy)/n = 390,185,024 - (79,448)(106,605)/25 = 51,402,848

  b1 = SSxy / SSx = 51,402,848 / 40,947,552 = 1.255333776 ≈ 1.26
  b0 = ȳ - b1x̄ = 106,605/25 - (1.255333776)(79,448/25) = 274.85
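As a check on the hand calculation, here is a minimal Python/NumPy sketch (not part of the original slides) that recomputes b0 and b1 from the 25 (Miles, Dollars) pairs above:

import numpy as np

miles = np.array([1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
                  2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
                  4533, 4804, 5090, 5233, 5439], dtype=float)
dollars = np.array([1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
                    3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
                    6059, 6426, 6321, 7026, 6964], dtype=float)

n = miles.size
ss_x  = np.sum(miles ** 2) - miles.sum() ** 2 / n                    # SS_x
ss_xy = np.sum(miles * dollars) - miles.sum() * dollars.sum() / n    # SS_xy
b1 = ss_xy / ss_x                          # slope estimate, about 1.255
b0 = dollars.mean() - b1 * miles.mean()    # intercept estimate, about 274.85
print(b0, b1)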
Example 7-1: Using the
Computer
MTB > Regress 'Dollars' 1 'Miles';
SUBC> Constant.

Regression Analysis

The regression equation is
Dollars = 275 + 1.26 Miles

SOURCE       DF        SS        MS       F      p
Regression    1  64527736  64527736  637.47  0.000
Error        23   2328161    101224
Total        24  66855896

[Figure: Regression of Dollars Charged against Miles]
Example 7-1: Using Computer-
Excel
The following output was created by selecting the REGRESSION option from the Excel DATA ANALYSIS ToolPak.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.98243393
R Square 0.965176428
Adjusted R Square 0.963662359
Standard Error 318.1578225
Observations 25
ANOVA
df SS MS F Significance F
Regression 1 64527736.8 64527736.8 637.4721586 2.85084E-18
Residual 23 2328161.201 101224.4
Total 24 66855898
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 274.8496867 170.3368437 1.61356569 0.120259309 -77.51844165 627.217815 -77.51844165 627.217815
MILES 1.255333776 0.049719712 25.248211 2.85084E-18 1.152480856 1.358186696 1.152480856 1.358186696
Example 7-1: Using Computer-
Excel
Residual analysis: the plot shows the absence of a relationship between the residuals and the X-values (miles).
[Figure: Residuals plotted against Miles]
Total Variance and Error
Variance
[Figure: Two scatterplots contrasting the total variation of Y around ȳ with the error variation around the regression line]
Confidence Interval for the Regression Slope

A (1 - α)100% confidence interval for β1:

  b1 ± t(α/2, n-2) s(b1)

Example 7-1 (95%), with b1 = 1.25533:

  1.25533 ± t(0.025, 25-2)(0.04972) = 1.25533 ± 2.069(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]

Since 0 is not in the interval, 0 is not a possible value of the regression slope at the 95% level.
[Figure: Lower and upper 95% bounds on the slope; height = slope, length = 1]
7-5 Correlation
The correlation between two random variables, X and Y, is a measure of
the degree of linear association between the two variables.
The population correlation, denoted by ρ, can take on any value from -1 to 1:

  ρ = -1       indicates a perfect negative linear relationship
  -1 < ρ < 0   indicates a negative linear relationship
  ρ = 0        indicates no linear relationship
  0 < ρ < 1    indicates a positive linear relationship
  ρ = 1        indicates a perfect positive linear relationship

[Figure: Scatterplots illustrating different values of ρ]
Covariance and Correlation
The covariance of two random variables X and Y:
  Cov(X, Y) = E[(X - μX)(Y - μY)]

where μX and μY are the population means of X and Y respectively. The population correlation is the covariance scaled by the two standard deviations:

  ρ = Cov(X, Y) / (σX σY)
Excel regression output for a second example, a regression of an International series on a United States series (n = 10):

  Adjusted R Square   0.98266567
  Standard Error      0.279761372
  Observations        10

  ANOVA        df   SS            MS            F             Significance F
  Regression    1   40.0098686    40.0098686    511.2009204   1.55085E-08
  Residual      8   0.626131402   0.078266425
  Total         9   40.636

               Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
  Intercept    -8.762524695    0.594092798      -14.74942084   4.39075E-07   -10.13250603   -7.39254336
  US            1.423636087    0.062965575       22.60975277   1.55085E-08     1.278437117    1.568835058

[Figures: Scatterplot of International against United States, and residual output plot]
Hypothesis Tests for the
Correlation Coefficient
Example 7-1:

  H0: ρ = 0   (No linear relationship)
  H1: ρ ≠ 0   (Some linear relationship)

Test statistic:

  t(n-2) = r / sqrt((1 - r²)/(n - 2))
         = 0.9824 / sqrt((1 - 0.9651)/(25 - 2)) = 0.9824 / 0.0389 = 25.25

  t(0.005, 23) = 2.807 < 25.25, so H0 is rejected at the 1% level.
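A small Python sketch of the same test, using r and n from Example 7-1; SciPy is assumed to be available for the t distribution:

import math
from scipy import stats

r, n = 0.9824, 25                               # values from Example 7-1
t = r / math.sqrt((1 - r ** 2) / (n - 2))       # t = r / sqrt((1-r^2)/(n-2)), about 25.2
p_value = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value
critical = stats.t.ppf(1 - 0.005, df=n - 2)     # t(0.005, 23) = 2.807
print(t, p_value, critical)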
Hypothesis Tests about the
Regression Relationship
[Figure: Three cases in which no linear relationship exists - constant Y, unsystematic variation, and a nonlinear relationship]

A hypothesis test for the existence of a linear relationship between X and Y:

  H0: β1 = 0
  H1: β1 ≠ 0

Test statistic for the existence of a linear relationship between X and Y:

  t(n-2) = b1 / s(b1)

where b1 is the least-squares estimate of the regression slope and s(b1) is the standard error of b1. When the null hypothesis is true, the statistic has a t distribution with n - 2 degrees of freedom.
Hypothesis Tests for the
Regression Slope
Example 7-1:

  H0: β1 = 0,  H1: β1 ≠ 0
  t(n-2) = b1 / s(b1) = 1.25533 / 0.04972 = 25.25

Example 10-3:

  H0: β1 = 1,  H1: β1 ≠ 1
  t(n-2) = (b1 - 1) / s(b1) = (1.24 - 1) / 0.21 = 1.14
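The two slope tests can be sketched in Python as follows (illustrative only; the second test's p-value is omitted because its sample size is not given on the slide):

from scipy import stats

# Example 7-1: test H0: beta1 = 0 against H1: beta1 != 0
b1, s_b1, n = 1.25533, 0.04972, 25
t0 = (b1 - 0) / s_b1                        # about 25.25
p0 = 2 * stats.t.sf(abs(t0), df=n - 2)      # essentially zero: reject H0

# Example 10-3: test H0: beta1 = 1 with b1 = 1.24, s(b1) = 0.21
t1 = (1.24 - 1) / 0.21                      # about 1.14, not significant at common levels
print(t0, p0, t1)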
  (y - ȳ) = (y - ŷ) + (ŷ - ȳ)
  Total deviation = Unexplained deviation (error) + Explained deviation (regression)

Squaring and summing over all observations:

  Σ(y - ȳ)² = Σ(y - ŷ)² + Σ(ŷ - ȳ)²
  SST = SSE + SSR

  r² = SSR/SST = 1 - SSE/SST    (percentage of total variation explained by the regression)

[Figure: Total, unexplained, and explained deviations for a single point]
The Coefficient of
Determination
[Figure: Three scatterplots illustrating r² = 0 (SST = SSE), r² = 0.50 (SSE and SSR each about half of SST), and r² = 0.90 (SSR most of SST)]

Example 7-1:

  r² = SSR/SST = 64,527,736.8 / 66,855,898 = 0.96518

[Figure: Scatterplot of Dollars against Miles with the fitted regression line]
7-8 Analysis of Variance and an F
Test of the Regression Model
ANOVA table for Example 7-1:

  Source of    Sum of          Degrees of
  Variation    Squares         Freedom      Mean Square     F Ratio   p Value
  Regression   64,527,736.8     1           64,527,736.8    637.47    0.000
  Error         2,328,161.2    23              101,224.4
  Total        66,855,898.0    24
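A short sketch that rebuilds this ANOVA table from the sums of squares (SciPy is assumed available for the F distribution):

from scipy import stats

ssr, sse, n, k = 64527736.8, 2328161.2, 25, 1      # values from the Example 7-1 table
sst = ssr + sse
msr = ssr / k
mse = sse / (n - (k + 1))
f = msr / mse                                      # about 637.5
p = stats.f.sf(f, k, n - (k + 1))                  # p-value, essentially 0
r_sq = ssr / sst                                   # about 0.965
print(f, p, r_sq)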
7-9 Residual Analysis and Checking
for Model Inadequacies
[Figure: Typical residual plots - residuals plotted against x or ŷ and against time, with a zero reference line, used to check the model assumptions]
Prediction Interval for a Value of Y

A (1 - α)100% prediction interval for Y:

  ŷ ± t(α/2, n-2) s sqrt(1 + 1/n + (x - x̄)²/SSX)

Example 7-1 (X = 4000):

  274.85 + (1.2553)(4000) ± (2.069)(318.16) sqrt(1 + 1/25 + (4000 - 3177.92)²/40,947,557.84)
  = 5296.05 ± 676.62 = [4619.43, 5972.67]
Prediction Interval for the
Average Value of Y
A (1 - α)100% prediction interval for E[Y|X]:

  ŷ ± t(α/2, n-2) s sqrt(1/n + (x - x̄)²/SSX)

Example 7-1 (X = 4000):

  274.85 + (1.2553)(4000) ± (2.069)(318.16) sqrt(1/25 + (4000 - 3177.92)²/40,947,557.84)
  = 5296.05 ± 156.48 = [5139.57, 5452.53]
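Both intervals at X = 4000 can be reproduced with a short Python sketch, using s = 318.16 from the Excel output and the other quantities shown above:

import math
from scipy import stats

b0, b1 = 274.85, 1.2553
s, n = 318.16, 25                    # standard error of estimate and sample size
x_bar, ss_x = 3177.92, 40947557.84
x_new = 4000.0

y_hat = b0 + b1 * x_new                             # about 5296
t_crit = stats.t.ppf(1 - 0.025, df=n - 2)           # about 2.069
leverage = 1 / n + (x_new - x_bar) ** 2 / ss_x

half_pred = t_crit * s * math.sqrt(1 + leverage)    # prediction interval half-width, about 677
half_mean = t_crit * s * math.sqrt(leverage)        # interval for E[Y|X] half-width, about 156
print(y_hat - half_pred, y_hat + half_pred)
print(y_hat - half_mean, y_hat + half_mean)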
Using the Computer
MTB > regress 'Dollars' 1 'Miles' tres in C3 fits in C4;
SUBC> predict 4000;
SUBC> residuals in C5.
Regression Analysis
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 64527736 64527736 637.47 0.000
Error 23 2328161 101224
Total 24 66855896
MTB > PLOT 'Resids' * 'Fits'
MTB > PLOT 'Resids' * 'Miles'

[Figures: Residuals plotted against the fitted values and against Miles]
Plotting on the Computer (2)
[Figures: Histogram of the standardized residuals (StRes) and scatterplot of Dollars against Miles]
11 Multiple Regression (1)
• Using Statistics.
• The k-Variable Multiple Regression Model.
• The F Test of a Multiple Regression Model.
• How Good is the Regression.
• Tests of the Significance of Individual Regression
Parameters.
• Testing the Validity of the Regression Model.
• Using the Multiple Regression Model for
Prediction.
11 Multiple Regression (2)
• Qualitative Independent Variables.
• Polynomial Regression.
• Nonlinear Models and Transformations.
• Multicollinearity.
• Residual Autocorrelation and the Durbin-Watson Test.
• Partial F Tests and Variable Selection Methods.
• Using the Computer.
• The Matrix Approach to Multiple Regression Analysis.
• Summary and Review of Terms.
7-11 Using Statistics
[Figure: A line in two dimensions and a plane in three dimensions]

Any two points (A and B), or an intercept and slope (β0 and β1), define a line on a two-dimensional surface. Any three points (A, B, and C), or an intercept and coefficients of x1 and x2 (β0, β1, and β2), define a plane in a three-dimensional surface.
7-12 The k-Variable Multiple
Regression Model
The population regression model of a dependent variable, Y, on a set of k independent variables, X1, X2, . . . , Xk is given by:

  Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε

where β0 is the Y-intercept of the regression surface and each βi, i = 1, 2, ..., k, is the slope of the regression surface - sometimes called the response surface - with respect to Xi.

Model assumptions:
1. ε ~ N(0, σ²), independent of other errors.
2. The variables Xi are uncorrelated with the error term.

[Figure: Regression plane y = β0 + β1x1 + β2x2 for k = 2]
Simple and Multiple Least-
Squares Regression
[Figure: A fitted line ŷ = b0 + b1x in simple regression and a fitted plane ŷ = b0 + b1x1 + b2x2 in multiple regression]

Normal equations for a regression with two independent variables:

  Σy   = n·b0   + b1 Σx1   + b2 Σx2
  Σx1y = b0 Σx1 + b1 Σx1²  + b2 Σx1x2
  Σx2y = b0 Σx2 + b1 Σx1x2 + b2 Σx2²
Example 7-3
  Y    X1   X2   X1X2   X1²   X2²   X1Y    X2Y
  72   12    5    60    144    25    864    360
  76   11    8    88    121    64    836    608
  78   15    6    90    225    36   1170    468
  70   10    5    50    100    25    700    350
  68   11    3    33    121     9    748    204
  80   16    9   144    256    81   1280    720
  82   14   12   168    196   144   1148    984
  65    8    4    32     64    16    520    260
  62    8    3    24     64     9    496    186
  90   18   10   180    324   100   1620    900
  ---  ---  ---  ----   ----  ---   ----   ----
  743  123   65   869   1615  509   9382   5040

Normal equations:

  743  = 10b0 + 123b1 + 65b2
  9382 = 123b0 + 1615b1 + 869b2
  5040 = 65b0 + 869b1 + 509b2

Solution:

  b0 = 47.164942,  b1 = 1.5990404,  b2 = 1.1487479

Estimated regression equation:

  Ŷ = 47.164942 + 1.5990404 X1 + 1.1487479 X2
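The normal equations can be solved directly with NumPy; a minimal sketch:

import numpy as np

# Normal equations from Example 7-3 written as A b = c
A = np.array([[10.0,  123.0,  65.0],
              [123.0, 1615.0, 869.0],
              [65.0,  869.0,  509.0]])
c = np.array([743.0, 9382.0, 5040.0])

b = np.linalg.solve(A, c)    # [b0, b1, b2] approximately [47.16, 1.599, 1.149]
print(b)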
Example 7-3: Using the
Computer
SUMMARY OUTPUT
ANOVA
df SS MS F Significance F
Regression 2 630.5381466 315.2690733 86.33503537 1.16729E-05
Residual 7 25.56185335 3.651693336
Total 9 656.1
[Figure: Total, regression, and error deviations around the fitted regression plane]

  Total deviation:       Y - Ȳ
  Regression deviation:  Ŷ - Ȳ
  Error deviation:       Y - Ŷ

  Total Deviation = Regression Deviation + Error Deviation
  SST = SSR + SSE
7-13 The F Test of a Multiple
Regression Model
A statistical test for the existence of a linear relationship between Y and any or all of the independent variables X1, X2, ..., Xk:

  H0: β1 = β2 = ... = βk = 0
  H1: Not all the βi (i = 1, 2, ..., k) are 0

ANOVA table for Example 7-3:

  SOURCE       DF     SS      MS      F      p
  Regression    2   630.54  315.27  86.34  0.000
  Error         7    25.56    3.65
  Total         9   656.10

The test statistic, F = 86.34, is greater than the critical point of F(2, 7) for any common level of significance (p-value ≈ 0), so the null hypothesis is rejected, and we might conclude that the dependent variable is related to one or more of the independent variables.

[Figure: F distribution with 2 and 7 degrees of freedom; F(0.01) = 9.55, and the test statistic 86.34 falls far in the rejection region]
7-14 How Good is the Regression
The mean square error is an unbiased estimator of the variance of the population errors, σ²:

  MSE = SSE / (n - (k + 1)) = Σ(y - ŷ)² / (n - (k + 1))

Standard error of estimate:

  s = sqrt(MSE)

The multiple coefficient of determination, R², measures the proportion of the variation in the dependent variable that is explained by the combination of the independent variables in the multiple regression model:

  R² = SSR/SST = 1 - SSE/SST

[Figure: Errors y - ŷ around the fitted regression plane]
Decomposition of the Sum of
Squares and the Adjusted
Coefficient of Determination
  SST = SSR + SSE

  R² = SSR/SST = 1 - SSE/SST

The adjusted multiple coefficient of determination, R̄², is the coefficient of determination with the SSE and SST divided by their respective degrees of freedom:

  R̄² = 1 - [SSE/(n - (k + 1))] / [SST/(n - 1)]

Example 7-3:  s = 1.911   R-sq = 96.1%   R-sq(adj) = 95.0%
Measures of Performance in
Multiple Regression and the
ANOVA Table
Source of    Sum of    Degrees of
Variation    Squares   Freedom          Mean Square               F Ratio

Regression   SSR       k                MSR = SSR/k               F = MSR/MSE
Error        SSE       n - (k + 1)      MSE = SSE/(n - (k + 1))
                       = n - k - 1
Total        SST       n - 1            MST = SST/(n - 1)

Relationships among the measures:

  R² = SSR/SST = 1 - SSE/SST

  F = [R² / (1 - R²)] · [(n - (k + 1)) / k]

  R̄² = 1 - [SSE/(n - (k + 1))] / [SST/(n - 1)] = 1 - MSE/MST
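A small sketch verifying these relationships with the Example 7-3 ANOVA values:

ssr, sse, n, k = 630.54, 25.56, 10, 2       # ANOVA values for Example 7-3
sst = ssr + sse

r_sq = ssr / sst                                           # about 0.961
adj_r_sq = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))     # about 0.950
f = (r_sq / (1 - r_sq)) * ((n - (k + 1)) / k)              # about 86.3, same as MSR/MSE
print(r_sq, adj_r_sq, f)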
7-15 Tests of the Significance of
Individual Regression Parameters
Hypothesis tests about individual regression slope parameters:

  (1) H0: β1 = 0    H1: β1 ≠ 0
  (2) H0: β2 = 0    H1: β2 ≠ 0
   .
   .
  (k) H0: βk = 0    H1: βk ≠ 0

Test statistic for test i:

  t(n - (k + 1)) = (bi - 0) / s(bi)
Regression Results for
Individual Parameters
              Coefficient   Standard
  Variable    Estimate      Error       t-Statistic
  Constant      53.12        5.43         9.783 *
  X1             2.03        0.22         9.227 *
  X2             5.60        1.30         4.308 *
  X3            10.35        6.88         1.504
  X4             3.45        2.70         1.259
  X5            -4.25        0.38       -11.184 *

  n = 150    t(0.025) ≈ 1.96    (* indicates |t| > 1.96, a significant parameter)
Example 7-3: Using the Computer
MTB > regress 'Y' on 2 predictors 'X1' 'X2'
Regression Analysis
Analysis of Variance
SOURCE       DF     SS      MS      F      p
Regression    2   630.54  315.27  86.34  0.000
Error         7    25.56    3.65
Total         9   656.10
SOURCE DF SEQ SS
X1 1 578.82
X2 1 51.72
Using the Computer: Example 7-4
MTB > READ 'a:\data\c11_t6.dat' C1-C5
MTB > NAME c1 'EXPORTS' c2 'M1' c3 'LEND' c4 'PRICE' C5 'EXCHANGE'
MTB > REGRESS 'EXPORTS' on 4 predictors 'M1' 'LEND' 'PRICE' 'EXCHANGE'
Regression Analysis
Analysis of Variance
SOURCE DF SS MS F p
Regression 4 32.9463 8.2366 73.06 0.000
Error 62 6.9898 0.1127
Total 66 39.9361
Example 7-5: Three Predictors
MTB > REGRESS 'EXPORTS' on 3 predictors 'LEND' 'PRICE' 'EXCHANGE'
Regression Analysis
Analysis of Variance
SOURCE DF SS MS F p
Regression 3 29.1919 9.7306 57.06 0.000
Error 63 10.7442 0.1705
Total 66 39.9361
Example 7-5: Two Predictors
MTB > REGRESS 'EXPORTS' on 2 predictors 'M1' 'PRICE'
Regression Analysis
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 32.940 16.470 150.67 0.000
Error 64 6.996 0.109
Total 66 39.936
7-16 Investigating the Validity
of the Regression Model:
Residual Plots
[Figures: Residuals from the exports model plotted against M1, PRICE, TIME, and Y-HAT]
Histogram of the standardized residuals, which should behave approximately like N(0, 1) observations:

  Midpoint   Count
    -3.0       1   *
    -2.5       1   *
    -2.0       3   ***
    -1.5       1   *
    -1.0       5   *****
    -0.5      13   *************
     0.0      19   *******************
     0.5      12   ************
     1.0       6   ******
     1.5       3   ***
     2.0       2   **
     2.5       0
     3.0       1   *
Investigating the Validity of the
Regression: Outliers and
Influential Observations
[Figures: Left panel (Outliers) - regression lines with and without an outlier, showing how a single outlier pulls the fitted line; right panel (Influential Observations) - a point with a large value of xi determines the fitted line even though there is no relationship in the remaining cluster of points]
Outliers and Influential
Observations: Example 7-6
Unusual Observations

  Obs.   M1    EXPORTS    Fit      Stdev.Fit   Residual   St.Resid
   1    5.10    2.6000    2.6420    0.1288     -0.0420    -0.14 X
   2    4.90    2.6000    2.6438    0.1234     -0.0438    -0.14 X
  25    6.20    5.5000    4.5949    0.0676      0.9051     2.80R
  26    6.30    3.7000    4.6311    0.0651     -0.9311    -2.87R
  50    8.30    4.3000    5.1317    0.0648     -0.8317    -2.57R
  67    8.20    5.6000    4.9474    0.0668      0.6526     2.02R
[Figure: Fitted response surface plotted over Advertising (8.00 to 18.00) and Promotions]
Prediction in Multiple
Regression
A (1 - α)100% prediction interval for a value of Y given values of Xi:

  ŷ ± t(α/2, n-(k+1)) sqrt(s²(ŷ) + MSE)
Qualitative Independent Variables

[Figure: Two parallel regression lines against X1, with intercept b0 for X2 = 0 and intercept b0 + b2 for X2 = 1]

A regression with one quantitative variable (X1) and one qualitative variable (X2):

  ŷ = b0 + b1x1 + b2x2

A multiple regression with two quantitative variables (X1 and X2) and one qualitative variable (X3):

  ŷ = b0 + b1x1 + b2x2 + b3x3
Picturing Qualitative Variables in
Regression: Three Categories and
Two Dummy Variables
[Figure: Three parallel regression lines against X1 - intercept b0 for X2 = 0 and X3 = 0, b0 + b2 for X2 = 1 and X3 = 0, and b0 + b3 for X2 = 0 and X3 = 1]

A qualitative variable with r levels or categories is represented with (r - 1) 0/1 (dummy) variables.

A regression with one quantitative variable (X1) and two dummy variables (X2 and X3):

  ŷ = b0 + b1x1 + b2x2 + b3x3

  Category    X2   X3
  Adventure    0    0
  Drama        0    1
  Romance      1    0
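A minimal sketch of the dummy-variable coding for the three-category example above; the category labels come from the table, but the particular sequence of observations is made up for illustration:

import numpy as np

categories = ["Adventure", "Drama", "Romance", "Drama", "Adventure"]   # hypothetical labels

# r = 3 categories -> r - 1 = 2 dummy variables, with Adventure as the base level
x2 = np.array([1.0 if c == "Romance" else 0.0 for c in categories])    # Romance dummy
x3 = np.array([1.0 if c == "Drama"   else 0.0 for c in categories])    # Drama dummy
print(np.column_stack([x2, x3]))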
Using Qualitative Variables in
Regression: Example 7-6
[Figure: Two regression lines with different intercepts (b0 and b0 + b2) and different slopes; the slope is b1 when X2 = 0 and b1 + b3 when X2 = 1]

A regression with a cross-product (interaction) term between a quantitative variable X1 and a dummy variable X2:

  ŷ = b0 + b1x1 + b2x2 + b3x1x2
7-19 Polynomial Regression
One-variable polynomial regression model:
Y=0+1 X + 2X2 + 3X3 +. . . + mXm +
where m is the degree of the polynomial - the highest power of X appearing in the
equation. The degree of the polynomial is the order of the model.
Y Y
y b b X
y b b X
0 1
0 1
y b b X b X
0 1 2
2
(b 0) y b b X b X b X
0 1 2
2
3
3
X1 X1
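A minimal Python sketch of a second-order (m = 2) polynomial fit; the advertising and sales figures below are made-up illustrative values, not data from the slides:

import numpy as np

# Hypothetical advertising (X) and sales (Y) figures, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.8, 7.9, 9.5, 11.2, 11.9, 12.3, 12.4])

# Second-order polynomial model: Y = b0 + b1*X + b2*X^2
b2, b1, b0 = np.polyfit(x, y, deg=2)       # polyfit returns the highest power first
y_hat = b0 + b1 * x + b2 * x ** 2          # fitted values
print(b0, b1, b2)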
Polynomial Regression:
Example 7-7
MTB > regress 'sales' 2 'advert' 'advsqr'

Regression Analysis

Analysis of Variance
SOURCE       DF     SS      MS       F      p
Regression    2   630.26  315.13  208.99  0.000
Error        18    27.14    1.51
Total        20   657.40

[Figure: SALES plotted against advertising with the fitted curve]
Polynomial Regression: Other
Variables and Cross-Product
Terms
Cross-product (interaction) terms and powers of more than one variable may also be included in the model, for example:

  Y = β0 + β1X1 + β2X2 + β3X1X2 + β4X1² + β5X2² + ε

Analysis of Variance
SOURCE       DF     SS      MS       F      p
Regression    1   4.2722  4.2722  337.56  0.000
Error        19   0.2405  0.0127
Total        20   4.5126
Transformations:
Exponential Model
The exponential model:

  Y = β0 e^(β1X)
Regression Analysis
The regression equation is
SALES = 3.67 + 6.78 LOGADV
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 642.62 642.62 826.24 0.000
Error 19 14.78 0.78
Total 20 657.40
Plots of Transformed Variables
[Figures: Simple regression of Sales on Advertising; regression of Sales on Log(Advertising); plot of LOGSALE against LOGADV with fitted line Y = 1.70082 + 0.553136X, R-squared = 0.947; residuals (RESIDS) plotted against Y-HAT]
Variance Stabilizing
Transformations
• Square root transformation: Y → sqrt(Y)
  Useful when the variance of the regression errors is approximately proportional to the conditional mean of Y.
• Logarithmic transformation: Y → log(Y)
  Useful when the variance of the regression errors is approximately proportional to the square of the conditional mean of Y.
• Reciprocal transformation: Y → 1/Y
  Useful when the variance of the regression errors is approximately proportional to the fourth power of the conditional mean of Y.
Regression with Dependent
Indicator Variables
The logistic function:

  E(Y|X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))

[Figure: The logistic function, an S-shaped curve of y against x rising from 0 toward 1]
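A small sketch of the logistic function; the values of β0 and β1 are arbitrary illustrative choices:

import numpy as np

def logistic(x, beta0=-3.0, beta1=1.5):
    """E(Y|X) = exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x)); illustrative betas."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1.0 + np.exp(z))

x = np.linspace(-2, 6, 9)
print(logistic(x))     # values rise from near 0 toward 1 as x increases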
7-21 Multicollinearity

[Figure: Venn-style diagrams of the information carried by x1 and x2]

Orthogonal X variables provide information from independent sources: no multicollinearity. Perfectly collinear X variables provide identical information content: no regression is possible. With some degree of collinearity, problems with the regression depend on the degree of collinearity. A high degree of negative collinearity also causes problems with regression.
Effects of Multicollinearity
• Variances of regression coefficients are inflated.
• Magnitudes of regression coefficients may differ from what is expected.
• Signs of regression coefficients may not be as
expected.
• Adding or removing variables produces large
changes in coefficients.
• Removing a data point may cause large changes in
coefficient estimates or signs.
• In some cases, the F ratio may be significant while
the t ratios are not.
Detecting the Existence of
Multicollinearity: Correlation Matrix of
Independent Variables and Variance
Inflation Factors
MTB > CORRELATION 'm1' 'lend' 'price' 'exchange'
Correlations (Pearson)
M1 LEND PRICE
LEND -0.112
PRICE 0.447 0.745
EXCHANGE -0.410 -0.279 -0.420
Regression Analysis
The regression equation is
EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE
[Figure: Variance inflation factor plotted against Rh², rising sharply as Rh² approaches 1]
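Variance inflation factors can also be computed directly, by regressing each independent variable on the others and using VIF = 1/(1 - Rh²); the sketch below uses made-up, deliberately collinear data rather than the exports data:

import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the remaining columns, VIF = 1/(1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(n), others])        # add an intercept column
        b, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ b
        r_sq = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r_sq))
    return out

rng = np.random.default_rng(1)                  # hypothetical, correlated predictors
x1 = rng.normal(size=50)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=50)       # collinear with x1 by construction
x3 = rng.normal(size=50)
print(vif(np.column_stack([x1, x2, x3])))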
Solutions to the
Multicollinearity Problem
• Drop a collinear variable from the
regression.
• Change the sampling plan to include elements outside the multicollinearity range.
• Transformations of variables.
• Ridge regression.
7-22 Residual Autocorrelation
and the Durbin-Watson Test
An autocorrelation is a correlation of the values of a variable
with values of the same variable lagged one or more periods
back. Consequences of autocorrelation include inaccurate
estimates of variances and inaccurate predictions.
Lagged residuals:

   i     ei     ei-1   ei-2   ei-3   ei-4
   1    1.0      *      *      *      *
   2    0.0     1.0     *      *      *
   3   -1.0     0.0    1.0     *      *
   4    2.0    -1.0    0.0    1.0     *
   5    3.0     2.0   -1.0    0.0    1.0
   6   -2.0     3.0    2.0   -1.0    0.0
   7    1.0    -2.0    3.0    2.0   -1.0
   8    1.5     1.0   -2.0    3.0    2.0
   9    1.0     1.5    1.0   -2.0    3.0
  10   -2.5     1.0    1.5    1.0   -2.0

The Durbin-Watson test (first-order autocorrelation):

  H0: ρ1 = 0
  H1: ρ1 ≠ 0

The Durbin-Watson test statistic:

  d = Σ(i=2 to n) (ei - ei-1)² / Σ(i=1 to n) ei²
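A minimal sketch of the Durbin-Watson statistic, applied to the ten residuals in the table above:

import numpy as np

e = np.array([1.0, 0.0, -1.0, 2.0, 3.0, -2.0, 1.0, 1.5, 1.0, -2.5])   # residuals from the table

d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # d = sum (e_i - e_{i-1})^2 / sum e_i^2
print(d)    # values near 2 suggest no first-order autocorrelation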
Critical Points of the Durbin-Watson
Statistic: =0.05, n= Sample Size, k =
Number of Independent Variables
[Figure: Number line of the Durbin-Watson statistic from 0 to 4, with the critical points dL, dU, 4-dU, and 4-dL marking the test's decision regions]
Partial F Tests

A partial F test checks whether a subset of the slope coefficients is zero, for example:

  H0: β3 = β4 = 0
  H1: β3 and β4 are not both 0

Partial F statistic:

  F(r, n-(k+1)) = [(SSER - SSEF) / r] / MSEF

where SSER is the sum of squared errors of the reduced model, SSEF is the sum of squared errors of the full model, MSEF is the mean square error of the full model [MSEF = SSEF/(n-(k+1))], and r is the number of variables dropped from the full model.
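A small sketch of the partial F computation; as an illustration it assumes the full model of Example 7-4 (four predictors) is being compared with the reduced model of Example 7-5 that keeps only M1 and PRICE, so that r = 2 variables (LEND and EXCHANGE) are dropped:

from scipy import stats

sse_full, mse_full, df_full = 6.9898, 0.1127, 62     # full model, Example 7-4
sse_reduced = 6.996                                  # reduced model (M1 and PRICE), Example 7-5
r = 2                                                # variables dropped: LEND and EXCHANGE

f = ((sse_reduced - sse_full) / r) / mse_full        # partial F, about 0.03
p = stats.f.sf(f, r, df_full)                        # large p-value: do not reject H0
print(f, p)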
Variable Selection Methods
• All possible regressions
Run regressions with all possible combinations of independent
variables and select best model.
• Stepwise procedures
Forward selection
Add one variable at a time to the model, on the basis of its F
statistic.
Backward elimination
Remove one variable at a time, on the basis of its F statistic.
Stepwise regression
Adds variables to the model and subtracts variables from the
model, on the basis of the F statistic.
Stepwise Regression
[Flowchart: Stepwise regression. Compute the F statistic for each variable not in the model and add variables on that basis; check whether any variable in the model has a p-value greater than Pout and, if so, remove it; repeat until no variable can be added or removed.]
Stepwise Regression: Using
the Computer
MTB > STEPWISE 'EXPORTS' PREDICTORS 'M1' 'LEND' 'PRICE' 'EXCHANGE'
Stepwise Regression
Step 1 2
Constant 0.9348 -3.4230
M1 0.520 0.361
T-Ratio 9.89 9.21
PRICE 0.0370
T-Ratio 9.05
S 0.495 0.331
R-Sq 60.08 82.48
Using the Computer: MINITAB
MTB > REGRESS 'EXPORTS' 4 'M1' 'LEND' 'PRICE' 'EXCHANGE';
SUBC> vif;
SUBC> dw.
Regression Analysis
The regression equation is
EXPORTS = - 4.02 + 0.368 M1 + 0.0047 LEND + 0.0365 PRICE + 0.27 EXCHANGE
Analysis of Variance
SOURCE DF SS MS F p
Regression 4 32.9463 8.2366 73.06 0.000
Error 62 6.9898 0.1127
Total 66 39.9361
SAS output for the same model:

Model: MODEL1
Dependent Variable: EXPORTS
[SAS Analysis of Variance table]
Variance
Variable DF Inflation
INTERCEP 1 0.00000000
M1 1 3.20719533
LEND 1 5.35391367
PRICE 1 6.28873181
EXCHANGE 1 1.38570639
Durbin-Watson D 2.583
(For Number of Obs.) 67
1st Order Autocorrelation -0.321
The Matrix Approach to
Regression Analysis (1)
The population regression model in matrix form:

  y = Xβ + ε

where

  y = [y1, y2, y3, ..., yn]'          (n x 1 vector of observations)

  X = | 1  x11  x12  x13 ... x1k |
      | 1  x21  x22  x23 ... x2k |
      | 1  x31  x32  x33 ... x3k |
      | .   .    .    .       .  |
      | 1  xn1  xn2  xn3 ... xnk |    (n x (k+1) design matrix)

  β = [β0, β1, ..., βk]'              ((k+1) x 1 vector of parameters)
  ε = [ε1, ε2, ..., εn]'              (n x 1 vector of errors)

The estimated regression model:

  Y = Xb + e
The Matrix Approach to
Regression Analysis (2)
The normal equations:

  X'Xb = X'Y

Estimators:

  b = (X'X)⁻¹X'Y

Predicted values:

  Ŷ = Xb = X(X'X)⁻¹X'Y = HY

Variance-covariance matrix of b:

  V(b) = σ²(X'X)⁻¹    estimated by    s²(b) = MSE (X'X)⁻¹
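A minimal NumPy sketch of these matrix formulas, applied to the Example 7-3 data:

import numpy as np

# Example 7-3 data: columns are Y, X1, X2
data = np.array([[72, 12,  5], [76, 11,  8], [78, 15,  6], [70, 10,  5], [68, 11,  3],
                 [80, 16,  9], [82, 14, 12], [65,  8,  4], [62,  8,  3], [90, 18, 10]],
                dtype=float)
y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1], data[:, 2]])   # add a column of 1s

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y            # b = (X'X)^(-1) X'y, approximately [47.16, 1.599, 1.149]
y_hat = X @ b                    # fitted values, y_hat = Xb
e = y - y_hat
mse = e @ e / (len(y) - X.shape[1])    # SSE / (n - (k+1)), about 3.65
s2_b = mse * XtX_inv                   # estimated covariance matrix of b
print(b)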