CHAPTER 12
Bivariate Regression
Chapter Contents
12.1 Visual Displays and Correlation Analysis
12.2 Bivariate Regression
12.3 Regression Terminology
12.4 Ordinary Least Squares Formulas
12.5 Tests for Significance
12.6 Analysis of Variance: Overall Fit
12.7 Confidence and Prediction Intervals for Y
12.8 Violations of Assumptions
12.9 Unusual Observations
Up to this point, our study of the discipline of statistical analysis has primarily focused on learning how to describe and make inferences about single variables. It is now time to learn how to describe and summarize relationships between variables. Businesses of all types can be quite complex. Understanding how different variables in our business processes are related to each other helps us predict and, hopefully, improve our business performance. Examples of quantitative variables that might be related to each other include: spending on advertising and sales revenue, produce delivery time and percentage of spoiled produce, diesel fuel prices and unleaded gas prices, preventive maintenance spending and manufacturing productivity rates. It may be that with some of these pairs there is one variable that we would like to be able to predict, such as sales revenue, percentage of spoiled produce, and productivity rates. But first we must learn how to visualize, describe, and quantify the relationships between variables such as these.
Visual Displays
Analysis of bivariate data (i.e., two variables) typically begins with a scatter plot that displays each observed data pair (xi, yi) as a dot on an X-Y grid. This diagram provides a visual indication of the strength of the relationship or association between the two variables. This simple display requires no assumptions or computation. A scatter plot is typically the precursor to more complex analytical techniques. Figure 12.1 shows a scatter plot comparing the price per gallon of diesel fuel to the price per gallon of regular unleaded gasoline. We look at scatter plots to get an initial idea of the relationship between two variables. Is there an evident pattern to the data? Is the pattern linear or nonlinear? Are there data points that are not part of the overall pattern? We would characterize the fuel price relationship as linear (although not perfectly linear) and positive (as diesel prices increase, so do regular unleaded prices). We see one pair of values set slightly apart from the rest, above and to the right. This happens to be the state of Hawaii.
FIGURE 12.1 Fuel prices (FuelPrices)
[Scatter plot: State Fuel Prices, Regular Unleaded Price/Gallon ($) versus Diesel Price/Gallon ($)]
Source: AAA Fuel Gauge Report, May 20, 2005, www.fuelgaugereport.com.
Correlation Coefficient
A visual display is a good first step in analysis, but we would also like to quantify the strength of the association between two variables. Therefore, accompanying the scatter plot is the sample correlation coefficient. This statistic measures the degree of linearity in the relationship between X and Y and is denoted r. Its range is −1 ≤ r ≤ +1. When r is near 0 there is little or no linear relationship between X and Y. An r-value near +1 indicates a strong positive relationship, while an r-value near −1 indicates a strong negative relationship.
(12.1)    r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
To simplify the notation here and elsewhere in this chapter, we define three terms called sums of squares:
(12.2)    SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2    SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2    SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
Using this notation, the formula for the sample correlation coefficient can be written

(12.3)    r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}    (sample correlation coefficient)
Excel Tip
To calculate a sample correlation coefficient, use Excel's function =CORREL(array1,array2), where array1 is the range for X and array2 is the range for Y. Data may be in rows or columns. Arrays must be the same length.
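For readers working outside of Excel, the same calculation is easy to script. Below is a minimal Python sketch (the data values are hypothetical, chosen only for illustration) that applies formulas 12.2 and 12.3 and checks the result against NumPy's built-in function.

```python
import numpy as np

# Hypothetical paired observations (any X-Y data will do)
x = np.array([1.90, 2.10, 2.30, 2.50, 2.70])
y = np.array([1.95, 2.20, 2.35, 2.60, 2.75])

# Sums of squares (formula 12.2)
ss_xx = np.sum((x - x.mean()) ** 2)
ss_yy = np.sum((y - y.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))

# Sample correlation coefficient (formula 12.3)
r = ss_xy / np.sqrt(ss_xx * ss_yy)
print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))  # should match the formula result
```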
The correlation coefficient for the variables shown in Figure 12.1 is r = 0.89, which is not surprising. We would expect to see a strong linear positive relationship between state diesel fuel prices and regular unleaded gasoline prices. Figures 12.2 through 12.7 show additional prototype scatter plots. We see that a correlation of .500 implies a great deal of random variation, and even a correlation of .900 is far from perfect linearity.
FIGURE 12.2 [prototype scatter plot] r = +.900
FIGURE 12.3 [prototype scatter plot] r = +.500
FIGURE 12.4 [prototype scatter plot] r = −.500
FIGURE 12.5 [prototype scatter plot] r = −.900
FIGURE 12.6 No correlation (random) [prototype scatter plot] r = .000
FIGURE 12.7 Nonlinear relationship [prototype scatter plot] r = .200
When judging whether a sample correlation is significant, the sample size must be taken into consideration. There are two ways to test a correlation coefficient for significance. To test the hypothesis H0: ρ = 0, the test statistic is

(12.4)    t = r \sqrt{\frac{n-2}{1-r^2}}    (test for zero correlation)
We compare this t test statistic with a critical value of t for a one-tailed or two-tailed test from Appendix D using ν = n − 2 degrees of freedom and any desired α. After calculating the t statistic, we can find its p-value by using Excel's function =TDIST(t,deg_freedom,tails). MINITAB directly calculates the p-value for a two-tailed test without displaying the t statistic. An equivalent approach is to calculate a critical value for the correlation coefficient. First, look up the critical value of t from Appendix D with ν = n − 2 degrees of freedom for either a one-tailed or two-tailed test, with whatever α you wish. Then, the critical value of the correlation coefficient is

(12.5)    r_{critical} = \frac{t}{\sqrt{t^2 + n - 2}}
An advantage of this method is that you get a benchmark for the correlation coefficient. Its disadvantage is that there is no p-value, and it is inflexible if you change your mind about α. MegaStat uses this method, giving two-tail critical values for α = .05 and α = .01.
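As a supplement to the Excel and MegaStat procedures described above, here is a short Python sketch of both approaches (the function name and sample values are ours, not the text's): it computes the t statistic from formula 12.4, the two-tail p-value, and the critical r from formula 12.5.

```python
import numpy as np
from scipy import stats

def test_zero_correlation(r, n, alpha=0.05):
    """Two-tailed test of H0: rho = 0 (formulas 12.4 and 12.5)."""
    df = n - 2
    t_stat = r * np.sqrt(df / (1 - r ** 2))      # formula 12.4
    p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-tail p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-tail critical t
    r_crit = t_crit / np.sqrt(t_crit ** 2 + df)  # formula 12.5: benchmark for r
    return t_stat, p_value, r_crit

# Hypothetical sample: r = .50 with n = 25 observations
t_stat, p_value, r_crit = test_zero_correlation(0.50, 25)
print(round(t_stat, 3), round(p_value, 4), round(r_crit, 3))
```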
EXAMPLE
MBA Applicants (MBA)
In its admission decision process, a university's MBA program examines an applicant's cumulative undergraduate GPA, as well as the applicant's GPA in the last 60 credits taken. They also examine scores on the GMAT (Graduate Management Aptitude Test), which has both verbal and quantitative components. Figure 12.8 shows two scatter plots with sample
correlation coefficients for 30 MBA applicants randomly chosen from 1,961 MBA applicant records at a public university in the Midwest. Is the correlation (r = .8296) between cumulative and last 60 credit GPA statistically significant? Is the correlation (r = .4356) between verbal and quantitative GMAT scores statistically significant?
FIGURE 12.8 Scatter plots for 30 MBA applicants (MBA)
[Two scatter plots for 30 randomly chosen MBA applicants: Last 60 Credit GPA versus Cumulative GPA (r = .8296), and Raw Quant GMAT Score versus Raw Verbal GMAT Score (r = .4356)]
Step 1: State the Hypotheses. We will use a two-tailed test for significance at α = .05. The hypotheses are

H0: ρ = 0
H1: ρ ≠ 0

Step 2: Calculate the Critical Value. For a two-tailed test using ν = n − 2 = 30 − 2 = 28 degrees of freedom, Appendix D gives t.05 = 2.048. The critical value of r is

r_{.05} = \frac{t_{.05}}{\sqrt{t_{.05}^2 + n - 2}} = \frac{2.048}{\sqrt{(2.048)^2 + 30 - 2}} = .3610
Step 3: Make the Decision. Both sample correlation coefficients (r = .8296 and r = .4356) exceed the critical value, so we reject the hypothesis of zero correlation in both cases. However, in the case of verbal and quantitative GMAT scores, the rejection is not very compelling. If we were using the t statistic method, we would calculate two test statistics. For GPA,

t = r \sqrt{\frac{n-2}{1-r^2}} = .8296 \sqrt{\frac{30-2}{1-(.8296)^2}} = 7.862    (reject ρ = 0 since t = 7.862 > t.05 = 2.048)

and for GMAT score,

t = r \sqrt{\frac{n-2}{1-r^2}} = .4356 \sqrt{\frac{30-2}{1-(.4356)^2}} = 2.561    (reject ρ = 0 since t = 2.561 > t.05 = 2.048)
This method has the advantage that a p-value can then be calculated by using Excel's function =TDIST(t,deg_freedom,tails). For example, for the two-tailed p-value for GPA, =TDIST(7.862,28,2) = .0000 (reject ρ = 0 since p < .05), and for the two-tailed p-value for GMAT score, =TDIST(2.561,28,2) = .0161 (reject ρ = 0 since p < .05).
TABLE 12.1 Quick 5 Percent Critical Value for Correlation Coefficients
[table values not recoverable from the extraction]
Using Excel
A correlation matrix can be created by using Excel's Tools > Data Analysis > Correlation, as illustrated in Figure 12.9. This correlation matrix is for our sample of 30 MBA students.

FIGURE 12.9 Excel's correlation matrix (MBA)
Tip
In large samples, small correlations may be significant, even though the scatter plot shows little evidence of linearity. Thus, a significant correlation may lack practical importance.
EXAMPLE (LS)
Eight cross-sectional variables were selected from the LearningStats state database (50 states):

Burglary: Burglary rate per 100,000 population
Age65%: Percent of population aged 65 and over
Income: Personal income per capita in current dollars
Unem: Unemployment rate, civilian labor force
SATQ: Average SAT quantitative test score
Cancer: Death rate per 100,000 population due to cancer
Unmar: Percent of total births by unmarried women
Urban%: Percent of population living in urban areas
For n = 50 states we have ν = n − 2 = 50 − 2 = 48 degrees of freedom. From Appendix D the two-tail critical values for Student's t are t.05 = 2.011 and t.01 = 2.682, so critical values for r are as follows:

For α = .05:    r_{.05} = \frac{t_{.05}}{\sqrt{t_{.05}^2 + n - 2}} = \frac{2.011}{\sqrt{(2.011)^2 + 50 - 2}} = .279

For α = .01:    r_{.01} = \frac{t_{.01}}{\sqrt{t_{.01}^2 + n - 2}} = \frac{2.682}{\sqrt{(2.682)^2 + 50 - 2}} = .361
Figure 12.10 shows a correlation matrix for these eight cross-sectional variables. The critical values are shown and significant correlations are highlighted. Four are significant at α = .01 and seven more at α = .05. In a two-tailed test, the sign of the correlation is of no interest, but the sign does reveal the direction of the association. For example, there is a strong positive correlation between Cancer and Age65%, and between Urban% and Income. This says that states with older populations have higher cancer rates and that states with a greater degree of urbanization tend to have higher incomes. The negative correlation between Burglary and Income says that states with higher incomes tend to have fewer burglaries. Although no cause-and-effect is posited, such correlations naturally invite speculation about causation.
FIGURE 12.10 MegaStat's correlation matrix for state data (States)
[Correlation matrix for the eight state variables (n = 50); two-tail critical values: .279 at α = .05 and .361 at α = .01; the matrix entries are not fully recoverable from the extraction]
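A correlation matrix with MegaStat-style significance flags can be reproduced in a few lines of Python. The sketch below uses randomly generated placeholder data (the real state data set is not reproduced here); the critical value it prints for n = 50 matches the .279 computed above.

```python
import numpy as np
from scipy import stats

def corr_matrix_with_flags(data, alpha=0.05):
    """Correlation matrix plus a two-tail significance flag for each pair,
    using the critical value from formula 12.5."""
    n = data.shape[0]
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    r_crit = t_crit / np.sqrt(t_crit ** 2 + n - 2)
    r = np.corrcoef(data, rowvar=False)
    return r, r_crit, np.abs(r) > r_crit

# Placeholder data: 50 rows (like the 50 states), 3 columns (variables)
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))
r, r_crit, significant = corr_matrix_with_flags(data)
print(round(r_crit, 3))  # .279 for n = 50, alpha = .05
```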
EXAMPLE
Eight time-series variables were selected from the LearningStats database of annual macroeconomic data (42 years):

GDP: Gross domestic product (billions)
C: Personal consumption expenditures (billions)
I: Gross private domestic investment (billions)
G: Government expenditures and investment (billions)
U: Unemployment rate, civilian labor force (percent)
R-Prime: Prime rate (percent)
R-10Yr: Ten-year Treasury rate (percent)
DJIA: Dow-Jones Industrial Average
For n = 42 years we have ν = n − 2 = 42 − 2 = 40 degrees of freedom. From Appendix D the two-tail critical values for Student's t are t.05 = 2.021 and t.01 = 2.704, so critical values for r are as follows:

For α = .05:    r_{.05} = \frac{t_{.05}}{\sqrt{t_{.05}^2 + n - 2}} = \frac{2.021}{\sqrt{(2.021)^2 + 42 - 2}} = .304

For α = .01:    r_{.01} = \frac{t_{.01}}{\sqrt{t_{.01}^2 + n - 2}} = \frac{2.704}{\sqrt{(2.704)^2 + 42 - 2}} = .393
Figure 12.11 shows the MegaStat correlation matrix for these eight variables. There are 13 significant correlations at α = .01, some of them extremely high. In time-series data, high correlations are common due to time trends and definition (e.g., C, I, and G are components of GDP, so they are highly correlated with GDP).
FIGURE 12.11 MegaStat's correlation matrix for time-series data (Economy)
[Correlation matrix for GDP, C, I, G, U, R-Prime, R-10Yr, and DJIA (n = 42); two-tail critical values: .304 at α = .05 and .393 at α = .01; the matrix entries are not fully recoverable from the extraction]
SECTION EXERCISES
12.1 For each sample, do a test for zero correlation. (a) Use Appendix D to find the critical value of t. (b) State the hypotheses about ρ. (c) Perform the t test and report your decision. (d) Find the critical value of r and use it to perform the same hypothesis test.
a. r = +.45, n = 20, α = .05, two-tailed test
b. r = .35, n = 30, α = .10, two-tailed test
c. r = +.60, n = 7, α = .05, one-tailed test
d. r = .30, n = 61, α = .01, one-tailed test

Instructions for Exercises 12.2 and 12.3: (a) Make an Excel scatter plot. What does it suggest about the population correlation between X and Y? (b) Make an Excel worksheet to calculate SS_xx, SS_yy, and SS_xy. Use these sums to calculate the sample correlation coefficient. Check your work by using Excel's function =CORREL(array1,array2). (c) Use Appendix D to find t.05 for a two-tailed test for zero correlation. (d) Calculate the t test statistic. Can you reject ρ = 0? (e) Use Excel's function =TDIST(t,deg_freedom,tails) to calculate the two-tail p-value.
12.2 Part-Time Weekly Earnings ($) by College Students (WeekPay)
Hours Worked (X): 10, 15, 20, 20, 35
[Weekly Pay (Y) values were lost in extraction]
12.3 Telephone Hold Time (min.) for Concert Tickets (CallWait)
Operators (X):  4    5    6    7    8
Wait Time (Y):  385  335  383  344  288
Instructions for Exercises 12.4–12.6: (a) Make a scatter plot of the data. What does it suggest about the correlation between X and Y? (b) Use Excel, MegaStat, or MINITAB to calculate the correlation coefficient. (c) Use Excel or Appendix D to find t.05 for a two-tailed test. (d) Calculate the t test statistic. (e) Calculate the critical value of r. (f) Can you reject ρ = 0?
12.4 Moviegoer Spending on Snacks (Movies)
Age (X):    30    50    34    12    37    33    36    26    18    46
Spent (Y):  2.85  6.50  1.50  6.35  6.20  6.75  3.60  6.10  8.35  4.35
12.5 Portfolio Returns on Selected Mutual Funds (Portfolio)
Last Year (X): 11.9, 19.5, 11.2, 14.1, 14.2, 5.2, 20.7, 11.3, 1.1, 3.9, 12.9, 12.4, 12.5, 2.7, 8.8, 7.2, 5.9
This Year (Y): 15.4, 26.7, 18.2, 16.7, 13.2, 16.4, 21.1, 12.0, 12.1, 7.4, 11.5, 23.0, 12.7, 15.1, 18.7, 9.9, 18.9
12.6 Number of Orders and Shipping Cost ($) (ShipCost)
Orders (X):    1,068  1,026  767    885    1,156  1,146  892    938    769    677    1,174  1,009
Ship Cost (Y): 4,489  5,611  3,290  4,113  4,883  5,425  4,414  5,506  3,346  3,673  6,542  5,088

12.7 Average Annual Returns for 12 Home Construction Companies (Construction)
(a) Use Excel, MegaStat, or MINITAB to calculate a matrix of correlation coefficients. (b) Calculate the critical value of r. (c) Highlight the correlation coefficients that lead you to reject ρ = 0 in a two-tailed test. (d) What conclusions can you draw about rates of return?

Company Name       1-Year  3-Year  5-Year  10-Year
Beazer Homes USA   50.3    26.1    50.1    28.9
Centex             23.4    33.3    40.8    28.6
D.R. Horton        41.4    42.4    52.9    35.8
Hovnanian Ent      13.8    67.0    73.1    33.8
KB Home            46.1    38.8    35.3    24.9
Lennar             19.4    39.3    50.9    36.0
M.D.C. Holdings    48.7    41.6    53.2    39.7
NVR                65.1    55.7    74.4    63.9
Pulte Homes        36.8    42.4    42.1    27.9
Ryland Group       30.5    46.9    59.0    33.3
Standard Pacific   33.0    39.5    44.2    27.8
Toll Brothers      72.6    46.2    49.1    29.9

Source: The Wall Street Journal, February 28, 2005. Note: Data are intended for educational purposes only.
Mini Case 12.1: Alumni Giving
Private universities (and, increasingly, public ones) rely heavily on alumni donations. Do highly selective universities have more loyal alumni? Figure 12.12 shows a scatter plot of freshman acceptance rates against percent of alumni who donate at 115 nationally ranked U.S. universities (those that offer a wide range of undergraduate, master's, and doctoral degrees). The correlation coefficient, calculated in Excel by using Tools > Data Analysis > Correlation, is r = −.6248. This negative correlation suggests that more competitive universities (lower acceptance rate) have more loyal alumni (higher percentage contributing annually). But is the correlation statistically significant?
FIGURE 12.12 Acceptance Rates and Alumni Giving Rates (n = 115 universities)
[Scatter plot of % Alumni Giving versus % Acceptance Rate, r = −.6248]
Since we have a prior hypothesis of an inverse relationship between X and Y, we choose a left-tailed test:

H0: ρ ≥ 0
H1: ρ < 0

With ν = n − 2 = 115 − 2 = 113 degrees of freedom, for α = .05, we use Excel's two-tailed function =TINV(0.10,113) to obtain the one-tail critical value t.05 = 1.65845. Since we are doing a left-tailed test, the critical value is −t.05 = −1.65845. The t test statistic is

t = r \sqrt{\frac{n-2}{1-r^2}} = (-.6248) \sqrt{\frac{115-2}{1-(-.6248)^2}} = -8.506
Since the test statistic t = −8.506 is less than the critical value −t.05 = −1.65845, we conclude that the true correlation is negative. We can use Excel's function =TDIST(8.506,113,1) to obtain p = .0000. Alternatively, we could calculate the critical value of the correlation coefficient:
r_{.05} = -\frac{t_{.05}}{\sqrt{t_{.05}^2 + n - 2}} = -\frac{1.65845}{\sqrt{(1.65845)^2 + 115 - 2}} = -.1542
Since the sample correlation r = −.6248 is less than the critical value r.05 = −.1542, we conclude that the true correlation is negative. We can choose either the t test method or the correlation critical value method, depending on which calculation seems easier.
See U.S. News & World Report, August 30, 2004, pp. 94–96.
Autocorrelation
Autocorrelation is a special type of correlation analysis useful in business for time-series data. The autocorrelation coefficient at lag k is the simple correlation between y_t and y_{t−k}, where k
is any lag. Below is an autocorrelation plot up to k = 20 for the daily closing price of common stock of Sunoco, Inc. (an oil company). Sunoco's autocorrelations are significant for short lags (up to k = 3) but diminish rapidly for longer lags. In other words, today's stock price closely resembles yesterday's, but the correlation weakens as we look farther into the past. Similar patterns are often found in other financial data. You will hear more about autocorrelation later in this chapter.
[Autocorrelation function for Sunoco stock price (Sunoco data set): autocorrelation, from −1.0 to +1.0, plotted against lag in days, k = 1 to 20]
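An autocorrelation coefficient at lag k is just an ordinary correlation between the series and a lagged copy of itself. Here is a brief Python sketch, with a simulated random-walk price series standing in for the Sunoco data (which is not reproduced here):

```python
import numpy as np

def autocorrelation(y, k):
    """Autocorrelation at lag k: the simple correlation of y_t with y_{t-k}."""
    return np.corrcoef(y[k:], y[:-k])[0, 1]

# Simulated daily closing prices (a random walk, like many stock series)
rng = np.random.default_rng(1)
price = 40 + np.cumsum(rng.normal(0, 0.5, size=250))

acf = [autocorrelation(price, k) for k in range(1, 21)]
print([round(a, 2) for a in acf[:5]])  # short lags show the strongest correlation
```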
Model Form
The hypothesized bivariate relationship may be linear, quadratic, or whatever you want. The examples in Figure 12.13 illustrate situations in which it might be necessary to consider nonlinear model forms. For now we will mainly focus on the simple linear (straight-line) model. However, we will examine nonlinear relationships later in the chapter.
FIGURE 12.13 Possible model forms
[Three scatter plots of Salary ($ thousands) versus Years on the Job for 25 grads, fitted with linear, logarithmic, and S-curve models]
When we propose a regression model, we have a causal mechanism in mind, but cause-and-effect is not proven by a simple regression. We should not read too much into a fitted equation.
The expected cost of dinner for two couples would be $95.06, that is, Cost = 15.22 + 19.96(4) = 95.06. If 100 units per hour are produced, the expected defect rate is 7.7 defects per million, that is, Defects = 3.2 + 0.045(100) = 7.7.
SECTION EXERCISES
12.8 (a) Interpret the slope of the fitted regression Sales = 842 − 37.5 Price. (b) If Price = 20, what is the prediction for Sales? (c) Would the intercept be meaningful if this regression represents DVD sales at Blockbuster?
12.9 (a) Interpret the slope of the fitted regression HomePrice = 125,000 + 150 SquareFeet. (b) What is the prediction for HomePrice if SquareFeet = 2,000? (c) Would the intercept be meaningful if this regression applies to home sales in a certain subdivision?
Roman letters denote the fitted coefficients b0 (the estimated intercept) and b1 (the estimated slope). For a given value x_i, the fitted value (or estimated value) of the dependent variable is \hat{y}_i. (You can read this as "y-hat.") The difference between the observed value y_i and the fitted value \hat{y}_i is the residual, denoted e_i. A residual is always calculated as the observed value minus the estimated value:

(12.9)    e_i = y_i - \hat{y}_i    (residual)

The residuals may be used to estimate σ, the standard deviation of the errors.
FIGURE 12.14 Estimated slope
[Scatter plot illustrating an eyeball estimate of the slope: Δy/Δx = 50/5 = 10]
A quick way to fit a regression line is to have Excel add the line onto a scatter plot, using the following steps:
Step 1: Highlight the data columns.
Step 2: Click on the Chart Wizard and choose XY (Scatter) to create a graph.
Step 3: Click on the scatter plot points to select the data.
Step 4: Right-click and choose Add Trendline.
Step 5: Choose Options and check Display equation on chart.
The menus are shown in Figure 12.15. (The R-squared statistic is actually the correlation coefficient squared. It tells us what proportion of the variation in Y is explained by X. We will more fully define R² in section 12.4.) Excel will choose the regression coefficients so as to produce a good fit. In this case, Excel's fitted regression \hat{y}_i = 13 + 9.857x_i is close to our eyeball regression equation.
FIGURE 12.15 Excel's trendline menus
[Scatter plot with Excel's fitted trendline y = 13 + 9.8571x displayed on the chart]
TABLE 12.2 Piper Cheyenne Fuel Usage
[flight time and fuel usage values not recoverable from the extraction]
Source: Flying 130, no. 4 (April 2003), p. 99.
FIGURE 12.16 Fitted regression
[Piper Cheyenne fuel usage: Fuel Usage (pounds) versus Flight Time (hours), with fitted line y = 23.285 + 54.039x]
Slope Interpretation The fitted regression is \hat{y} = 23.295 + 54.039x. The slope (b1 = 54.039) says that for each additional hour of flight, the Piper Cheyenne consumed about 54 pounds of fuel (1 gallon ≈ 6 pounds). This estimated slope is a statistic, since a different sample might yield a different estimate of the slope. Bear in mind also that the sample size is very small. Intercept Interpretation The intercept (b0 = 23.295) suggests that even if the plane is not flying (X = 0) some fuel would be consumed. However, the intercept has little meaning in this case, not only because zero flight hours makes no logical sense, but also because extrapolating to X = 0 is beyond the range of the observed data.
Regression Caveats
• The fit of the regression does not depend on the sign of its slope. The sign of the fitted slope merely tells whether X has a positive or negative association with Y.
• View the intercept with skepticism unless X = 0 is logically possible and was actually observed in the data set.
• Regression does not demonstrate cause-and-effect between X and Y. A good fit only shows that X and Y vary together. Both could be affected by another variable or by the way the data are defined.
SECTION EXERCISES
12.10 The regression equation NetIncome = 2,277 + .0307 Revenue was fitted from a sample of 100 leading world companies (variables are in millions of dollars). (a) Interpret the slope. (b) Is the intercept meaningful? Explain. (c) Make a prediction of NetIncome when Revenue = 1,000. (Data are from www.forbes.com and Forbes 172, no. 2 [July 21, 2003], pp. 108–110.) (Global100)
12.11 The regression equation HomePrice = 51.3 + 2.61 Income was fitted from a sample of 34 cities in the eastern United States. Both variables are in thousands of dollars. HomePrice is the median selling price of homes in the city, and Income is median family income for the city. (a) Interpret the slope. (b) Is the intercept meaningful? Explain. (c) Make a prediction of HomePrice when Income = 50 and also when Income = 100. (Data are from Money Magazine 32, no. 1 [January 2004], pp. 102–103.) (HomePrice)
12.12 The regression equation Credits = 15.4 − .07 Work was fitted from a sample of 21 statistics students. Credits is the number of college credits taken and Work is the number of hours worked per week at an outside job. (a) Interpret the slope. (b) Is the intercept meaningful? Explain. (c) Make a prediction of Credits when Work = 0 and when Work = 40. What do these predictions tell you? (Credits)
12.13 Below are fitted regressions for Y = asking price of a used vehicle and X = the age of the vehicle. The observed range of X was 1 to 8 years. The sample consisted of all vehicles listed for sale in a
particular week in 2005. (a) Interpret the slope of each fitted regression. (b) Interpret the intercept of each fitted regression. Does the intercept have meaning? (c) Predict the price of a 5-year-old Chevy Blazer. (d) Predict the price of a 5-year-old Chevy Silverado. (Data are from AutoFocus 4, Issue 38 [Sept. 17–23, 2004] and are for educational purposes only.) (CarPrices)
Chevy Blazer: Price = 16,189 − 1,050 Age (n = 21 vehicles, observed X range was 1 to 8 years).
Chevy Silverado: Price = 22,951 − 1,339 Age (n = 24 vehicles, observed X range was 1 to 10 years).
12.14 These data are for a sample of 10 college students who work at weekend jobs in restaurants. (a) Fit an eyeball regression equation to this scatter plot of Y = tips earned last weekend and X = hours worked. (b) Interpret the slope. (c) Interpret the intercept. Would the intercept have meaning in this example?
[Scatter plot for Exercise 12.14: Tips earned ($) versus Hours Worked]
12.15 These data are for a sample of 10 different vendors in a large airport. (a) Fit an eyeball regression equation to this scatter plot of Y = bottles of Evian water sold and X = price of the water. (b) Interpret the slope. (c) Interpret the intercept. Would the intercept have meaning in this example?
[Scatter plot for Exercise 12.15: Bottles sold versus Price ($)]
The residuals around the fitted line always sum to zero:

(12.10)    \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0

Therefore, to work with an expression that has a nonzero sum, we square the residuals, just as we squared the deviations from the mean when we developed the equation for variance back in
chapter 4. The fitted coefficients b0 and b1 are chosen so that the fitted linear model \hat{y}_i = b_0 + b_1 x_i has the smallest possible sum of squared residuals (SSE):

(12.11)    SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2    (sum to be minimized)
This is an optimization problem that can be solved for b0 and b1 by using Excel's Solver Add-In. However, we can also use calculus (see derivation in LearningStats Unit 12) to solve for b0 and b1:
(12.12)    b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

(12.13)    b_0 = \bar{y} - b_1 \bar{x}
If we use the notation for sums of squares (see formula 12.2), then the OLS formula for the slope can be written

(12.14)    b_1 = \frac{SS_{xy}}{SS_{xx}}    (OLS estimator for slope)
These formulas require only a few spreadsheet operations to find the means, deviations around the means, and their products and sums. They are built into Excel and many calculators. The OLS formulas give unbiased and consistent estimates* of β0 and β1. The OLS regression line always passes through the point (x̄, ȳ).
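The OLS formulas are easy to verify directly. The Python sketch below applies formulas 12.13 and 12.14 to the exam score data shown in Tables 12.3 and 12.4 and reproduces the slope and intercept reported in the text.

```python
import numpy as np

# Exam score data from Table 12.4: hours studied (x) and exam score (y)
x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])

ss_xy = np.sum((x - x.mean()) * (y - y.mean()))  # 519.50
ss_xx = np.sum((x - x.mean()) ** 2)              # 264.50

b1 = ss_xy / ss_xx             # formula 12.14: slope = 1.9641
b0 = y.mean() - b1 * x.mean()  # formula 12.13: intercept = 49.477
print(round(b1, 4), round(b0, 3))
```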
TABLE 12.3 Study Time and Exam Scores (ExamScores)
[study hours and exam scores for 10 students; the values appear in the worksheet in Table 12.4]
*Recall from Chapter 9 that an unbiased estimator's expected value is the true parameter and that a consistent estimator approaches ever closer to the true parameter as the sample size increases.
TABLE 12.4 Worksheet for Slope and Intercept Calculations (ExamScores)

Student    x_i        y_i        x_i − x̄   y_i − ȳ   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²
Tom        1          53         −9.5      −17.1     162.45               90.25
Mary       5          74         −5.5      3.9       −21.45               30.25
Sarah      7          59         −3.5      −11.1     38.85                12.25
Oscar      8          43         −2.5      −27.1     67.75                6.25
Cullyn     10         56         −0.5      −14.1     7.05                 0.25
Jaime      11         84         0.5       13.9      6.95                 0.25
Theresa    14         96         3.5       25.9      90.65                12.25
Knut       15         69         4.5       −1.1      −4.95                20.25
Jin-Mae    15         84         4.5       13.9      62.55                20.25
Courtney   19         83         8.5       12.9      109.65               72.25
Sum        105        701        0         0         SS_xy = 519.50       SS_xx = 264.50
Mean       x̄ = 10.5  ȳ = 70.1
FIGURE 12.17 Scatter plot with fitted line \hat{y} = 49.477 + 1.9641x and residuals shown as vertical line segments
Interpretation The fitted regression Score = 49.477 + 1.9641 Study says that, on average, each additional hour of study yields a little less than 2 additional exam points (the slope). A student who did not study (Study = 0) would expect a score of about 49 (the intercept). In this example, the intercept is meaningful because zero study time not only is possible (though hopefully uncommon) but also was almost within the range of observed data. Excel's R² is fairly low, indicating that only about 39 percent of the variation in exam scores from the mean is explained by study time. The remaining 61 percent of unexplained variation in exam scores reflects other factors (e.g., previous night's sleep, class attendance, test anxiety). We can use the fitted regression equation \hat{y}_i = 49.477 + 1.9641x_i to find each student's expected exam score. Each prediction is a conditional mean, given the student's study hours. For example:

Oscar, 8 hours:     \hat{y}_i = 49.48 + 1.964(8) = 65.19 (65 to nearest integer)
Theresa, 14 hours:  \hat{y}_i = 49.48 + 1.964(14) = 76.98 (77 to nearest integer)
Courtney, 19 hours: \hat{y}_i = 49.48 + 1.964(19) = 86.79 (87 to nearest integer)
Oscar's actual exam score was only 43, so he did worse than his predicted score of 65. Theresa scored 96, far above her predicted score of 77. Courtney, who studied the longest (19 hours), scored 83, fairly close to her predicted score of 87. These examples show that study time is not a perfect predictor of exam scores.
Assessing Fit
The total variation in Y around its mean (denoted SST) is what we seek to explain:

(12.15)    SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
How much of the total variation in our dependent variable Y can be explained by our regression? The explained variation in Y (denoted SSR) is the sum of the squared differences
between the conditional mean \hat{y}_i (conditioned on a given value x_i) and the unconditional mean \bar{y} (the same for all x_i):

(12.16)    SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
The unexplained variation in Y (denoted SSE) is the sum of squared residuals, sometimes referred to as the error sum of squares:*

(12.17)    SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
If the fit is good, SSE will be relatively small compared to SST. If each observed data value y_i is exactly the same as its estimate \hat{y}_i (i.e., a perfect fit), then SSE will be zero. There is no upper limit on SSE. Table 12.5 shows the calculation of SSE for the exam scores.
TABLE 12.5 Calculations of Sums of Squares (ExamScores)

Student    Hours x_i  Score y_i  \hat{y}_i = 49.477 + 1.9641x_i  Residual y_i − \hat{y}_i  (y_i − \hat{y}_i)²  (\hat{y}_i − ȳ)²  (y_i − ȳ)²
Tom        1          53         51.441                          1.559                     2.43                348.15            292.41
Mary       5          74         59.298                          14.702                    216.15              116.68            15.21
Sarah      7          59         63.226                          −4.226                    17.86               47.25             123.21
Oscar      8          43         65.190                          −22.190                   492.40              24.11             734.41
Cullyn     10         56         69.118                          −13.118                   172.08              0.96              198.81
Jaime      11         84         71.082                          12.918                    166.87              0.96              193.21
Theresa    14         96         76.974                          19.026                    361.99              47.25             670.81
Knut       15         69         78.939                          −9.939                    98.78               78.13             1.21
Jin-Mae    15         84         78.939                          5.061                     25.61               78.13             193.21
Courtney   19         83         86.795                          −3.795                    14.40               278.72            166.41
                                                                                           SSE = 1,568.57      SSR = 1,020.34    SST = 2,588.90
Coefficient of Determination
Since the magnitude of SSE is dependent on sample size and on the units of measurement (e.g., dollars, kilograms, ounces), we need a unit-free benchmark. The coefficient of determination, or R², is a measure of relative fit based on a comparison of SSR and SST. Excel calculates this statistic automatically. It may be calculated in either of two ways:

(12.18)    R^2 = 1 - \frac{SSE}{SST}    or    R^2 = \frac{SSR}{SST}

The range of the coefficient of determination is 0 ≤ R² ≤ 1. The highest possible R² is 1 because, if the regression gives a perfect fit, then SSE = 0:

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{0}{SST} = 1 - 0 = 1    if SSE = 0 (perfect fit)

The lowest possible R² is 0 because, if knowing the value of X does not help predict the value of Y, then SSE = SST:

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{SST}{SST} = 1 - 1 = 0    if SSE = SST (worst fit)
*But bear in mind that the residual e_i (observable) is not the same as the true error ε_i (unobservable).
For the exam scores, the coefficient of determination is

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1{,}568.57}{2{,}588.90} = 1 - 0.6059 = .3941
Because a coefficient of determination always lies in the range 0 ≤ R² ≤ 1, it is often expressed as a percent of variation explained. Since the exam score regression yields R² = .3941, we could say that X (hours of study) explains 39.41 percent of the variation in Y (exam scores). On the other hand, 60.59 percent of the variation in exam scores is not explained by study time. The unexplained variation reflects factors not included in our model (e.g., reading skills, hours of sleep, hours of work at a job, physical health, etc.) or just plain random variation. Although the word explained does not necessarily imply causation, in this case we have a priori reason to believe that causation exists, that is, that increased study time improves exam scores.
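A quick Python check of this calculation, continuing the exam score example (fitted values computed from the slope and intercept found earlier):

```python
import numpy as np

# Exam score data and fitted values
x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])
y_hat = 49.477 + 1.9641 * x

sse = np.sum((y - y_hat) ** 2)         # unexplained variation, about 1,568.57
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation, about 1,020.34
sst = np.sum((y - y.mean()) ** 2)      # total variation, 2,588.90

r_squared = 1 - sse / sst              # formula 12.18, about .3941
print(round(r_squared, 4), round(ssr / sst, 4))  # the two forms nearly agree;
# any tiny gap reflects the rounded slope and intercept used for y_hat
```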
Tip
In a bivariate regression, R² is the square of the correlation coefficient r. Thus, if r = .50 then R² = .25. For this reason, MegaStat (and some textbooks) denotes the coefficient of determination as r² instead of R². In this textbook, the uppercase notation R² is used to indicate the difference in their definitions. It is tempting to think that a low R² indicates that the model is not useful. Yet in some applications (e.g., predicting crude oil future prices), even a slight improvement in predictive power can translate into millions of dollars.
SECTION EXERCISES
Instructions for Exercises 12.16 and 12.17: (a) Make an Excel worksheet to calculate SS_xx, SS_yy, and SS_xy (the same worksheet you used in Exercises 12.2 and 12.3). (b) Use the formulas to calculate the slope and intercept. (c) Use your estimated slope and intercept to make a worksheet to calculate SSE, SSR, and SST. (d) Use these sums to calculate the R². (e) To check your answers, make an Excel scatter plot of X and Y, select the data points, right-click, select Add Trendline, select the Options tab, and choose Display equation on chart and Display R-squared value on chart.

12.16 Part-Time Weekly Earnings by College Students (WeekPay) [data as given in Exercise 12.2]

12.17 Seconds of Telephone Hold Time for Concert Tickets (CallWait)
Operators On Duty (X): 4, 5, 6, 7, 8 [wait times as given in Exercise 12.3]
Instructions for Exercises 12.18–12.20: (a) Use Excel to make a scatter plot of the data. (b) Select the data points, right-click, select Add Trendline, select the Options tab, and choose Display equation on chart and Display R-squared value on chart. (c) Interpret the fitted slope. (d) Is the intercept meaningful? Explain. (e) Interpret the R².
12.18 Portfolio Returns (%) on Selected Mutual Funds (Portfolio)
Last Year (X): 11.9, 19.5, 11.2, 14.1, 14.2, 5.2, 20.7, 11.3, 1.1, 3.9, 12.9, 12.4, 12.5, 2.7, 8.8, 7.2, 5.9
This Year (Y): 15.4, 26.7, 18.2, 16.7, 13.2, 16.4, 21.1, 12.0, 12.1, 7.4, 11.5, 23.0, 12.7, 15.1, 18.7, 9.9, 18.9
12.19 Number of Orders and Shipping Cost ($) (ShipCost)
Orders (X):    1,068  1,026  767    885    1,156  1,146  892    938    769    677    1,174  1,009
Ship Cost (Y): 4,489  5,611  3,290  4,113  4,883  5,425  4,414  5,506  3,346  3,673  6,542  5,088
12.20 Moviegoer Spending on Snacks ($) (Movies)
Spent (Y): 2.85, 6.50, 1.50, 6.35, 6.20, 6.75, 3.60, 6.10, 8.35, 4.35 [ages as given in Exercise 12.4]
If the fitted model's predictions are perfect (SSE = 0), the standard error s_yx will be zero. In general, a smaller value of s_yx indicates a better fit. For the exam scores, we can use SSE from Table 12.5 to find s_yx:

s_{yx} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{1{,}568.57}{10-2}} = 14.002
The standard error s_yx is an estimate of σ (the standard deviation of the unobservable errors). Because it measures overall fit, the standard error s_yx serves somewhat the same function as the coefficient of determination. However, unlike R², the magnitude of s_yx depends on the units of measurement of the dependent variable (e.g., dollars, kilograms, ounces) and on the data magnitude. For this reason, R² is often the preferred measure of overall fit because its scale is always 0 to 1. The main use of the standard error s_yx is to construct confidence intervals.
The standard errors of the slope and intercept are

(12.20)    s_{b_1} = \frac{s_{yx}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = \frac{s_{yx}}{\sqrt{SS_{xx}}}

(12.21)    s_{b_0} = s_{yx} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = s_{yx} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}}
For the exam score data, plugging in the sums from Table 12.4, we get

s_{b_1} = \frac{s_{yx}}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{14.002}{\sqrt{264.50}} = 0.8610

s_{b_0} = s_{yx} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}} = 14.002 \sqrt{\frac{1}{10} + \frac{(10.5)^2}{264.50}} = 10.066
These standard errors are used to construct confidence intervals for the true slope and intercept, using Student's t with ν = n − 2 degrees of freedom and any desired confidence level. Some software packages (e.g., Excel and MegaStat) provide confidence intervals automatically, while others do not (e.g., MINITAB).

(12.22)    b_1 - t_{n-2}\, s_{b_1} \le \beta_1 \le b_1 + t_{n-2}\, s_{b_1}    (CI for true slope)

(12.23)    b_0 - t_{n-2}\, s_{b_0} \le \beta_0 \le b_0 + t_{n-2}\, s_{b_0}    (CI for true intercept)
For the exam scores, degrees of freedom are n − 2 = 10 − 2 = 8, so from Appendix D we get t_{n−2} = 2.306 for 95 percent confidence. The 95 percent confidence intervals for the coefficients are
Slope:      1.9641 − (2.306)(0.86101) ≤ β1 ≤ 1.9641 + (2.306)(0.86101), or −0.0213 ≤ β1 ≤ 3.9495
Intercept:  49.477 − (2.306)(10.066) ≤ β0 ≤ 49.477 + (2.306)(10.066), or 26.26 ≤ β0 ≤ 72.69

These confidence intervals are fairly wide. The width of any confidence interval can be reduced by obtaining a larger sample, partly because the t-value would shrink (toward the normal z-value) but mainly because the standard errors shrink as n increases. For the exam scores, the confidence interval for the slope includes zero, suggesting that the true slope could be zero.
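The standard errors and confidence intervals above can be reproduced with a few lines of Python (a sketch of formulas 12.20 through 12.23 applied to the exam data):

```python
import numpy as np
from scipy import stats

# Exam score data and OLS fit
x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])
b0, b1, n = 49.477, 1.9641, 10

resid = y - (b0 + b1 * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))          # about 14.002
ss_xx = np.sum((x - x.mean()) ** 2)

s_b1 = s_yx / np.sqrt(ss_xx)                          # formula 12.20, about 0.8610
s_b0 = s_yx * np.sqrt(1 / n + x.mean() ** 2 / ss_xx)  # formula 12.21, about 10.066

t = stats.t.ppf(0.975, n - 2)                         # 2.306 for 95% confidence
print(b1 - t * s_b1, b1 + t * s_b1)                   # about -0.021 to 3.950
print(b0 - t * s_b0, b0 + t * s_b0)                   # about 26.26 to 72.69
```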
Hypothesis Tests
Is the true slope different from zero? This is an important question because if β1 = 0, then X cannot influence Y and the regression model collapses to a constant β0 plus a random error term:

Initial model:    y_i = β0 + β1 x_i + ε_i
If β1 = 0:        y_i = β0 + (0)x_i + ε_i
Then:             y_i = β0 + ε_i
We could also test for a zero intercept. The hypotheses to be tested are

Test for zero slope: H0: β1 = 0 versus H1: β1 ≠ 0
Test for zero intercept: H0: β0 = 0 versus H1: β0 ≠ 0

For either coefficient, we use a t test with ν = n − 2 degrees of freedom. The test statistics are

(12.24)    t = \frac{b_1 - 0}{s_{b_1}}    (slope)

(12.25)    t = \frac{b_0 - 0}{s_{b_0}}    (intercept)
Usually we are interested in testing whether the parameter is equal to zero as shown here, but you may substitute another value in place of 0 if you wish. The critical value t_{n−2} is obtained from Appendix D, and the p-value can be found from Excel's function =TDIST(t,deg_freedom,tails), where tails is 1 (one-tailed test) or 2 (two-tailed test). Often, the researcher uses a two-tailed test as the starting point, because rejection in a two-tailed test always implies rejection in a one-tailed test (but not vice versa).
ExamScores
For the exam scores, we would anticipate a positive slope (i.e., more study hours should improve exam scores) so we will use a right-tailed test:
Hypotheses: H0: β1 ≤ 0, H1: β1 > 0
Test statistic: t = (b1 − 0)/s_b1 = (1.9641 − 0)/0.86095 = 2.281
Critical value: t.05 = 1.860
Decision: Reject H0 (i.e., the slope is positive)
We can reject the hypothesis of a zero slope in a right-tailed test. (We would be unable to do so in a two-tailed test because the critical value of our t statistic would be 2.306.) Once we
have the test statistic for the slope or intercept, we can find the p-value by using Excel's function =TDIST(t,deg_freedom,tails). The p-value method is preferred by researchers, because it obviates the need for prior specification of α.
Parameter: Slope    Excel function: =TDIST(2.281,8,1)    p-value: .025995 (right-tailed test)
These calculations are normally done by computer (we have demonstrated the calculations only to illustrate the formulas). The Excel menu to accomplish these tasks is shown in Figure 12.18. The resulting output, shown in Figure 12.19, can be used to verify our calculations. Excel always does two-tailed tests, so you must halve the p-value if you need a one-tailed test. You may specify the confidence level, but Excel's default is 95 percent confidence.
FIGURE 12.18 Excel's regression menu
FIGURE 12.19 Excel's regression results for exam scores

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.627790986
R Square             0.394121523
Adjusted R Square    0.318386713
Standard Error       14.00249438
Observations         10
Tip
Avoid checking the Constant is Zero box in Excel's menu. This would force the intercept through the origin, changing the model drastically. Leave this option to the experts.
ExamScores
Figure 12.20 shows MegaStat's menu, and Figure 12.21 shows MegaStat's regression output for this data. The output format is similar to Excel's, except that MegaStat highlights coefficients that differ significantly from zero at α = .05 in a two-tailed test.
FIGURE 12.20 MegaStat's regression menu

FIGURE 12.21 MegaStat's regression results for exam scores
Regression Analysis
r² = 0.394    r = 0.628    Std. Error = 14.002    n = 10    k = 1    Dep. Var. = Exam Score

Regression output                                                  confidence interval
variables     coefficients  std. error  t (df = 8)  p-value  95% lower  95% upper
Intercept     49.4771       10.0665     4.915       .0012    26.2638    72.6904
Study Hours   1.9641        0.8610      2.281       .0520    −0.0213    3.9495
ExamScores
Figure 12.22 shows MINITAB's regression menus, and Figure 12.23 shows MINITAB's regression output for this data. MINITAB gives you the same general output as Excel, but with strongly rounded results.*
FIGURE 12.22 MINITAB's regression menus
*You may have noticed that both Excel and MINITAB calculated something called adjusted R-Square. For a bivariate regression, this statistic is of little interest, but in the next chapter it becomes important.
FIGURE 12.23 MINITAB's regression results for exam scores

The regression equation is Score = 49.5 + 1.96 Hours

Predictor   Coef     SE Coef   T      P
Constant    49.48    10.07     4.92   0.001
Hours       1.9641   0.8610    2.28   0.052

S = 14.00    R-Sq = 39.4%    R-Sq(adj) = 31.8%
EXAMPLE
Time-series data generally yield better fit than cross-sectional data, as we can illustrate by using a sample of the same size as the exam scores. In the United States, taxes are collected at a variety of levels: local, state, and federal. During the prosperous 1990s, personal income rose dramatically, but so did taxes, as indicated in Table 12.6.
TABLE 12.6 U.S. Income and Taxes, 1991–2000

Year   Personal Income ($ billions)   Personal Taxes ($ billions)
1991   5,085.4                        610.5
1992   5,390.4                        635.8
1993   5,610.0                        674.6
1994   5,888.0                        722.6
1995   6,200.9                        778.3
1996   6,547.4                        869.7
1997   6,937.0                        968.8
1998   7,426.0                        1,070.4
1999   7,777.3                        1,159.2
2000   8,319.2                        1,288.2
We will assume a linear relationship:

Taxes = β0 + β1 Income + ε_i

Since taxes do not depend solely on income, the random error term will reflect all other factors that influence taxes as well as possible measurement error.
FIGURE 12.24 Aggregate U.S. Tax Function, 1991–2000
[Scatter plot of Personal Taxes ($ billions) versus Personal Income ($ billions), with fitted line y = −538.21 + .2172x and R² = .9922]
Based on the scatter plot and Excel's fitted linear regression, displayed in Figure 12.24, the linear model seems justified. The very high R² says that Income explains over 99 percent of the variation in Taxes. Such a good fit is not surprising, since the federal government and most states (and some cities) rely on income taxes. However, many aggregate financial variables are correlated due to inflation and general economic growth. Although causation can be assumed between Income and Taxes in our model, some of the excellent fit is due to time trends (a common problem in time-series data).
Taxes
For a more detailed look, we examine MegaStat's regression output for this data, shown in Figure 12.25. On average, each extra $100 of income yielded an extra $21.72 in taxes (b1 = .2172). Both coefficients are nonzero in MegaStat's two-tailed test, as indicated by the tiny p-values (highlighting indicates significance at α = .01). For all practical purposes, the p-values are zero, which indicates that this sample result did not arise by chance (rarely would you see such small p-values in cross-sectional data, but they are not unusual in time-series data).
FIGURE 12.25 MegaStat's regression results for tax data

Regression output                                                    confidence interval
Variables   Coefficients  Std. Error  t (df = 8)  p-value    95% lower   95% upper
Intercept   −538.207      45.033      −11.951     2.21E-06   −642.0530   −434.3620
Income      0.2172        0.00683     31.830      1.03E-09   0.2015      0.2330
Taxes
Because the 95 percent confidence interval for the slope does not include zero, we should reject the hypothesis that the slope is zero in a two-tailed test at α = .05. A confidence interval thus provides an easy-to-explain two-tailed test of significance. However, we customarily rely on the computed t statistics for a formal test of significance, as illustrated below. In this case, we are doing a right-tailed test. We do not bother to test the intercept since it has no meaning in this problem.

Hypotheses: H0: β1 ≤ 0, H1: β1 > 0
Test statistic: t = (b1 − 0)/s_b1 = .2172/.00683 = 31.83 (far beyond any reasonable critical value, so we reject H0)
Tip
The test for zero slope always yields a t statistic that is identical to the test for zero correlation coefficient. Therefore, it is not necessary to do both tests. Since regression output always includes a t test for the slope, that is the test we usually use.
SECTION EXERCISES
12.21 A regression was performed using data on 32 NFL teams in 2003. The variables were Y = current value of team (millions of dollars) and X = total debt held by the team owners (millions of dollars). (a) Write the fitted regression equation. (b) Construct a 95 percent confidence interval for the slope. (c) Perform a right-tailed t test for zero slope at α = .05. State the hypotheses clearly. (d) Use Excel to find the p-value for the t statistic for the slope. (Data are from Forbes 172, no. 5, pp. 82–83.) (NFL) [regression output not shown]
12.22 A regression was performed using data on 16 randomly selected charities in 2003. The variables were Y = expenses (millions of dollars) and X = revenue (millions of dollars). (a) Write the fitted regression equation. (b) Construct a 95 percent confidence interval for the slope. (c) Perform a right-tailed t test for zero slope at α = .05. State the hypotheses clearly. (d) Use Excel to find the p-value for the t statistic for the slope. (Data are from Forbes 172, no. 12, p. 248, and www.forbes.com.) (Charities) [regression output not shown]
Decomposition of Variance
A regression seeks to explain variation in the dependent variable around its mean. A simple way to see this is to express the deviation of y_i from its mean \bar{y} as the sum of the deviation of y_i from the regression estimate \hat{y}_i plus the deviation of the regression estimate \hat{y}_i from the mean \bar{y}:

(12.26)    y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})

It can be shown that this same decomposition also holds for the sums of squares:

(12.27)    \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2    (sums of squares)
This decomposition of variance may be written as

SST (total variation around the mean) = SSE (unexplained or error variation) + SSR (variation explained by the regression)
The F statistic reflects both the sample size and the ratio of SSR to SSE. For a given sample size, a larger F statistic indicates a better fit (larger SSR relative to SSE), while F close to zero indicates a poor fit (small SSR relative to SSE). The F statistic must be compared with a critical value F_{1,n−2} from Appendix F for whatever level of significance is desired, and we can find the p-value by using Excel's function =FDIST(F,1,n-2). Software packages provide the p-value automatically.
EXAMPLE
Figure 12.26 shows MegaStat's ANOVA table for the exam scores. The F statistic is

F = \frac{MSR}{MSE} = \frac{1{,}020.3412}{196.0698} = 5.20

From Appendix F the critical value of F_{1,8} at the 5 percent level of significance would be 5.32, so the exam score regression is not quite significant at α = .05. The p-value of .052 says a sample such as ours would be expected about 52 times in 1,000 samples if X and Y were unrelated. In other words, if we reject the hypothesis of no relationship between X and Y, we face a Type I error risk of 5.2 percent. This p-value might be called marginally significant.
FIGURE 12.26 MegaStat's ANOVA table for exam data

ANOVA table
Source       SS          df   MS          F      p-value
Regression   1,020.3412  1    1,020.3412  5.20   .0520
Residual     1,568.5588  8    196.0698
Total        2,588.9000  9
From the ANOVA table, we can calculate the standard error from the mean square for the residuals:

s_{yx} = \sqrt{MSE} = \sqrt{196.0698} = 14.002    (standard error for exam scores)
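The same ANOVA arithmetic in a short Python sketch (exam data as before; scipy supplies the F distribution for the p-value):

```python
import numpy as np
from scipy import stats

# Exam score data and fitted values
x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])
y_hat = 49.477 + 1.9641 * x
n = len(x)

ssr = np.sum((y_hat - y.mean()) ** 2)   # about 1,020.34
sse = np.sum((y - y_hat) ** 2)          # about 1,568.57

msr = ssr / 1                           # one predictor, so df = 1
mse = sse / (n - 2)                     # about 196.07
f_stat = msr / mse                      # about 5.20
p_value = stats.f.sf(f_stat, 1, n - 2)  # about .052
print(round(f_stat, 2), round(p_value, 4))
```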
Tip
In a bivariate regression, the F test always yields the same p-value as a two-tailed t test for zero slope, which in turn always gives the same p-value as a two-tailed test for zero correlation. The relationship between the test statistics is F = t².
SECTION EXERCISES
12.23 Below is a regression using X = home price (000), Y = annual taxes (000), n = 20 homes. (a) Write the fitted regression equation. (b) Write the formula for each t statistic and verify the t statistics shown below. (c) State the degrees of freedom for the t tests and find the two-tail critical value for t by using Appendix D. (d) Use Excel's function =TDIST(t,deg_freedom,tails) to verify the
p-value shown for each t statistic (slope, intercept). (e) Verify that F = t² for the slope. (f) In your own words, describe the fit of this regression.

r² = 0.452    Std. Error = 0.454    n = 12

ANOVA table
Source       df   MS       F      p-value
Regression   1    1.6941   8.23   .0167
Residual     10   0.2058
Total        11

Regression output                                              confidence interval
variables   coefficients  std. error  t (df = 10)  p-value  95% lower  95% upper
Intercept   1.8064        0.6116      2.954        .0144    0.4438     3.1691
Slope       0.0039        0.0014      2.869        .0167    0.0009     0.0070
12.24 Below is a regression using X = average price, Y = units sold, n = 20 stores. (a) Write the fitted regression equation. (b) Write the formula for each t statistic and verify the t statistics shown below. (c) State the degrees of freedom for the t tests and find the two-tail critical value for t by using Appendix D. (d) Use Excel's function =TDIST(t,deg_freedom,tails) to verify the p-value shown for each t statistic (slope, intercept). (e) Verify that F = t² for the slope. (f) In your own words, describe the fit of this regression.

r² = 0.200    Std. Error = 26.128    n = 20

ANOVA table
Source       df   MS         F      p-value
Regression   1    3,080.89   4.51   .0478
Residual     18   682.68
Total        19

Regression output                                               confidence interval
variables   coefficients  std. error  t (df = 18)  p-value  95% lower   95% upper
Intercept   614.9300      51.2343     12.002       .0000    507.2908    722.5692
Slope       −109.1120     51.3623     −2.124       .0478    −217.0202   −1.2038
Instructions for Exercises 12.25–12.27: (a) Use Excel's Tools > Data Analysis > Regression (or MegaStat or MINITAB) to obtain regression estimates. (b) Interpret the 95 percent confidence interval for the slope. Does it contain zero? (c) Interpret the t test for the slope and its p-value. (d) Interpret the F statistic. (e) Verify that the p-value for F is the same as for the slope's t statistic, and show that t² = F. (f) Describe the fit of the regression.

12.25 Portfolio Returns (%) on Selected Mutual Funds (n = 17 funds) (Portfolio)
Last Year (X): 11.9, 19.5, 11.2, 14.1, 14.2, 5.2, 20.7, 11.3, 1.1, 3.9, 12.9, 12.4, 12.5, 2.7, 8.8, 7.2, 5.9
This Year (Y): 15.4, 26.7, 18.2, 16.7, 13.2, 16.4, 21.1, 12.0, 12.1, 7.4, 11.5, 23.0, 12.7, 15.1, 18.7, 9.9, 18.9
12.26 Number of Orders and Shipping Cost (n = 12 orders) (ShipCost)
Orders (X):        1,068  1,026  767    885    1,156  1,146  892    938    769    677    1,174  1,009
Ship Cost ($) (Y): 4,489  5,611  3,290  4,113  4,883  5,425  4,414  5,506  3,346  3,673  6,542  5,088
12.27 Moviegoer Spending on Snacks (n = 10 purchases) (Movies)
Age (X):      30    50    34    12    37    33    36    26    18    46
$ Spent (Y):  2.85  6.50  1.50  6.35  6.20  6.75  3.60  6.10  8.35  4.35
Mini Case 12.2: Airplane Cockpit Noise (Cockpit)
Career airline pilots face the risk of progressive hearing loss, due to the noisy cockpits of most jet aircraft. Much of the noise comes not from engines but from air roar, which increases at high speeds. To assess this workplace hazard, a pilot measured cockpit noise at randomly selected points during flight by using a handheld meter. Noise level (in decibels) was measured in seven different aircraft at the first officer's left ear position. For reference, 60 dB is a normal conversation, 75 is a typical vacuum cleaner, 85 is city traffic, 90 is a typical hair dryer, and 110 is a chain saw. Table 12.7 shows 61 observations on cockpit noise (decibels) and airspeed (knots indicated air speed, KIAS) for a Boeing 727, an older type of aircraft lacking design improvements in newer planes.
TABLE 12.7 Cockpit Noise Level and Airspeed for B-727 (n = 61) (Cockpit)

Speed Noise | Speed Noise  | Speed Noise | Speed Noise  | Speed Noise | Speed Noise
250   83    | 380   93     | 340   90    | 330   91     | 350   90    | 272   84.5
340   89    | 380   91     | 340   91    | 360   94     | 380   92    | 310   88
320   88    | 390   94     | 380   96    | 370   94.5   | 310   88    | 350   90
330   89    | 400   95     | 385   96    | 380   95     | 295   87    | 370   91
346   92    | 400   96     | 420   97    | 395   96     | 280   86    | 405   93
260   85    | 405   97     | 230   82    | 365   91     | 320   88    | 250   82
280   84    | 320   89     | 340   91    | 320   88     | 330   90    |
395   92    | 310   88.5   | 250   86    | 250   85     | 320   88    |
380   92    | 250   82     | 320   89    | 250   82     | 340   89    |
400   93    | 280   87     | 340   90    | 320   88     | 350   90    |
335   91    | 320   89     | 320   90    | 305   88     | 270   84    |
The scatter plot in Figure 12.27 suggests that a linear model provides a reasonable description of the data. The fitted regression shows that each additional knot of airspeed increases the noise level by 0.0765 dB. Thus, a 100-knot increase in airspeed would add about 7.65 dB of noise. The intercept of 64.229 suggests that if the plane were not flying (KIAS = 0) the noise level would be only slightly greater than a normal conversation.
FIGURE 12.27 Scatter plot of cockpit noise: Cockpit Noise in B-727 (n = 61)
[Noise Level (decibels) versus airspeed (KIAS)]
Data courtesy of Capt. R. E. Hartl (ret.) of Delta Airlines.
The regression results in Figure 12.28 show that the fit is very good (R² = .895) and that the regression is highly significant (F = 501.16, p < .001). Both the slope and intercept have p-values below .001, indicating that the true parameters are nonzero. Thus, the regression is significant, as well as having practical value.
FIGURE 12.28 Regression results of cockpit noise

Regression Analysis
r² = 0.895    r = 0.946    Std. Error = 1.292    n = 61    k = 1    Dep. Var. = Noise

ANOVA table
Source       SS         df   MS         F        p-value
Regression   836.9817   1    836.9817   501.16   1.60E-30
Residual     98.5347    59   1.6701
Total        935.5164   60

Regression output                                                 confidence interval
variables   coefficients  std. error  t (df = 59)  p-value   95% lower  95% upper
Intercept   64.2294       1.1489      55.907       8.29E-53  61.9306    66.5283
Speed       0.0765        0.0034      22.387       1.60E-30  0.0697     0.0834
The confidence interval for the conditional mean of Y and the prediction interval for an individual Y are

(12.29)    \hat{y}_i \pm t_{n-2}\, s_{yx} \sqrt{\frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}    (confidence interval for mean of Y)

(12.30)    \hat{y}_i \pm t_{n-2}\, s_{yx} \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}    (prediction interval for individual Y)
Interval width varies with the value of x_i, being narrowest when x_i is near its mean (note that when x_i = x̄ the last term under the square root disappears completely). For some data sets, the degree of narrowing near x̄ is almost indiscernible, while for other data sets it is quite pronounced. These calculations are usually done by computer (see Figure 12.29). Both MegaStat and MINITAB, for example, will let you type in the x_i values and will give both confidence and prediction intervals only for that x_i value, but you must make your own graphs.
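A Python sketch of formulas 12.29 and 12.30 (the function name is ours; exam data as before), which reproduces the kind of intervals MegaStat reports for a typed-in x value:

```python
import numpy as np
from scipy import stats

def intervals_for_y(x, y, b0, b1, x_new, conf=0.95):
    """Confidence interval for the mean of Y and prediction interval for an
    individual Y at x_new (formulas 12.29 and 12.30)."""
    n = len(x)
    s_yx = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    ss_xx = np.sum((x - x.mean()) ** 2)
    t = stats.t.ppf(0.5 + conf / 2, n - 2)
    y_hat = b0 + b1 * x_new
    h = 1 / n + (x_new - x.mean()) ** 2 / ss_xx      # grows as x_new leaves the mean
    ci = (y_hat - t * s_yx * np.sqrt(h), y_hat + t * s_yx * np.sqrt(h))
    pi = (y_hat - t * s_yx * np.sqrt(1 + h), y_hat + t * s_yx * np.sqrt(1 + h))
    return ci, pi

x = np.array([1, 5, 7, 8, 10, 11, 14, 15, 15, 19])
y = np.array([53, 74, 59, 43, 56, 84, 96, 69, 84, 83])
print(intervals_for_y(x, y, 49.477, 1.9641, 10))  # intervals at 10 study hours
```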
FIGURE 12.29 MegaStat's confidence and prediction intervals
FIGURE 12.30 Confidence and prediction intervals for exam scores
[Exam Score versus Study Hours, showing estimated Y with 95% CI and 95% PI bands]
FIGURE 12.31 Confidence and prediction intervals for taxes
[Taxes ($ billions) versus Income ($ billions), showing estimated Y with 95% CI and 95% PI bands]
Confidence and prediction intervals for exam scores are wide and clearly curved, while for taxes they are narrow and almost straight. We would expect this from the scatter plots (R² = .3941 for exams, R² = .9922 for taxes). The prediction bands for exam scores even extend above 100 points (presumably the upper limit for an exam score). While the prediction bands for taxes appear narrow, they represent billions of dollars (the narrowest tax prediction interval has a range of about $107 billion). This shows that a very high R² does not guarantee precise predictions.
These quick rules lead to constant width intervals and are not conservative (i.e., the resulting intervals will be somewhat too narrow). They work best for large samples and when X is near its mean. They are questionable when X is near either extreme of its range. Yet they often are close enough to convey a general idea of the accuracy of your predictions. Their purpose is just to give a quick answer without getting lost in unwieldy formulas.
Non-Normal Errors
Non-normality of errors is usually considered a mild violation, since the regression parameter estimates b0 and b1 and their variances remain unbiased and consistent. The main ill consequence is that confidence intervals for the parameters may be untrustworthy, because the normality assumption is used to justify using Student's t to construct confidence intervals. However, if the sample size is large (say, n > 30), the confidence intervals should be OK. An exception would be if outliers exist, posing a serious problem that cannot be cured by large sample size.
Histogram of Residuals
A simple way to check for non-normality is to make a histogram of the residuals (Cockpit). You can use either plain residuals or standardized residuals. A standardized residual is obtained by dividing each residual by its standard error. Histogram shapes will be the same, but standardized
residuals offer the advantage of a predictable scale (between −3 and +3 unless there are outliers). A simple eyeball test can usually reveal outliers or serious asymmetry. Figure 12.32 shows a standardized residual histogram for Mini Case 12.2. There are no outliers and the histogram is roughly symmetric, albeit possibly platykurtic (i.e., flatter than normal).
FIGURE 12.32 Histogram of the residuals (response is noise)
[Histogram of standardized residuals, roughly symmetric between −2.0 and +2.0]
FIGURE 12.33 Normal probability plot of the residuals (response is noise)
Tip
Non-normality is not considered a major violation, so don't worry too much about it unless you have major outliers.
Notice that the residuals always have a mean of zero. Although many patterns of nonconstant variance might exist, the fan-out pattern (increasing residual variance) is most common:

[Residual plot sketches: No Pattern, Fan-Out Pattern, Funnel-In Pattern]
Residual plots provide a fairly sensitive eyeball test for heteroscedasticity. The residual plot is therefore considered an important tool in the statistician's diagnostic kit. The hypotheses are
H0: Errors have constant variance (homoscedastic)
H1: Errors have nonconstant variance (heteroscedastic)
Figure 12.34 shows a residual plot for Mini Case 12.2 (cockpit noise). In the residual plot, we see residuals of the same general magnitude as we look from left to right. A random pattern like this is consistent with the hypothesis of homoscedasticity (constant variance), although some observers might see a hint of a fan-out pattern.
FIGURE 12.34  Residuals Versus Air Speed (response is noise): standardized residuals plotted against air speed
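Producing such a plot yourself takes only a few lines. This sketch (hypothetical data again) plots standardized residuals against X and relies on the eyeball test described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19], dtype=float)
y = np.array([53, 61, 58, 70, 72, 78, 75, 84, 89, 93], dtype=float)

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (len(x) - 2))

plt.scatter(x, resid / s_yx)
plt.axhline(0, linewidth=1)          # residuals average zero by construction
plt.xlabel("X")
plt.ylabel("Standardized Residual")  # a fan-out from left to right suggests
plt.show()                           # heteroscedasticity; a level band does not
```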
Tip
Although it can widen the confidence intervals for the coefficients, heteroscedasticity does not bias the estimates. At this stage of your training, it is sufficient just to recognize its existence.
Autocorrelated Errors
Autocorrelation is a pattern of nonindependent errors, mainly found in time-series data.* In a time-series regression, each residual et should be independent of its predecessors et−1, et−2, . . . . Violations of this assumption can show up in different ways. In the simple model of first-order autocorrelation, et is correlated with et−1. The OLS estimators b0 and b1 are still unbiased and consistent, but their estimated variances are biased in a way that typically leads to confidence intervals that are too narrow and t statistics that are too large. Thus, the model's fit may be overstated.
*Cross-sectional data may exhibit autocorrelation, but typically it is an artifact of the order of data entry.
Durbin-Watson Test
The most widely used test for autocorrelation is the Durbin-Watson test. The hypotheses are
H0: Errors are nonautocorrelated
H1: Errors are autocorrelated
The Durbin-Watson test statistic for autocorrelation is

DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \qquad (12.33)
When there is no autocorrelation, the DW statistic will be near 2, though its range is from 0 to 4. For a formal hypothesis test, a special table is required. For now, we simply note that
in general:
DW < 2 suggests positive autocorrelation (common).
DW ≈ 2 suggests no autocorrelation (ideal).
DW > 2 suggests negative autocorrelation (rare).
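Formula 12.33 is easy to compute directly from the residuals. A small sketch (the residuals here are made up to show runs of the same sign):

```python
import numpy as np

def durbin_watson(e):
    """DW = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# made-up residuals with long runs of one sign (positive autocorrelation)
e = [1.2, 0.8, 0.5, 0.9, -0.3, -0.7, -1.1, -0.6, 0.2, 0.9]
print(round(durbin_watson(e), 2))   # well below 2: positive autocorrelation
```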
Tip
Although it can widen the confidence intervals for the coefficients, autocorrelation does not bias the estimates. At this stage of your training, it is sufficient just to recognize when you have autocorrelation.
Mini Case 12.3
Money and Inflation  Money
Does inflation mainly reflect changes in the money supply? Table 12.8 shows data for a time-series regression of the U.S. inflation rate (as measured by the change in the Consumer Price Index, or CPI) against the growth of the monetary base (as measured by the change in M1 one year earlier). The regression covers the period 1960–2000, or 41 years.
TABLE 12.8  Percent Change in CPI and Percent Change in M1 in Prior Year

Year  M1    CPI     Year  M1    CPI     Year  M1    CPI
1960  0.5   0.7     1974  4.3   6.9     1988  4.9   4.6
1961  3.2   1.3     1975  4.7   4.9     1989  0.8   6.1
1962  1.8   1.6     1976  6.7   6.7     1990  4.0   3.1
1963  3.7   1.0     1977  8.0   9.0     1991  8.7   2.9
1964  4.6   1.9     1978  8.0   13.3    1992  14.3  2.7
1965  4.7   3.5     1979  6.9   12.5    1993  10.3  2.7
1966  2.5   3.0     1980  7.0   8.9     1994  1.8   2.5
1967  6.6   4.7     1981  6.9   3.8     1995  2.1   3.3
1968  7.7   6.2     1982  8.7   3.8     1996  4.1   1.7
1969  3.3   5.6     1983  9.8   3.9     1997  0.7   1.6
1970  5.1   3.3     1984  5.8   3.8     1998  2.2   2.7
1971  6.5   3.4     1985  12.3  1.1     1999  2.5   3.4
1972  9.2   8.7     1986  16.9  4.4     2000  3.3   1.6
1973  5.5   12.3    1987  3.5   4.4
*Chapter 14 discusses the use of time-series data for forecasting. Autocorrelation will be revisited and you will learn that some forecasting models actually try to take advantage of the dependency among error terms.
The fitted model is CPIt = 3.4368 + 0.1993 M1t−1 (R² = .075, F = 3.17). The overall fit is not very strong. The probability plot in Figure 12.35 suggests possible non-normality. The plot of residuals in Figure 12.36 shows no strong evidence of heteroscedasticity. In the residuals plotted against time (Figure 12.37) we would expect 20 or 21 centerline crossings, but we see only 8, indicating positive autocorrelation (i.e., there are runs of the same sign). The DW statistic is DW = .58, also indicating strong positive autocorrelation.
FIGURE 12.35  Residual test for non-normality (normal probability plot of the residuals)
FIGURE 12.36  Residual test for heteroscedasticity (residuals plotted against change in M1)
FIGURE 12.37  Residual test for autocorrelation (residuals plotted by observation, with gridlines at one standard error)
Unusual Observations
In a regression, we look for observations that are unusual. An observation could be unusual because its Y-value is poorly predicted by the regression model (unusual residual) or because its unusual X-value greatly affects the regression line (high leverage). Tests for unusual residuals and high leverage are important diagnostic tools in evaluating the fitted regression.
FIGURE 12.38  Excel's exam score regression with residuals  ExamScores
FIGURE 12.39  MINITAB's regression with residuals  ExamScores
FIGURE 12.40  MegaStat's regression with residuals  ExamScores
Leverage
A high-leverage observation is one whose X-value lies far from the mean of X, so that it exerts a disproportionate pull on the fitted line. For example, in Figure 12.41 one employee worked far more hours than the others, who worked between 12 and 42 hours. This individual will have a big effect on the slope estimate, because he is so far above the mean of X. Yet this highly leveraged data point is not an outlier (i.e., the fitted regression line comes very close to the data point, so its residual will be small).
FIGURE 12.41
600 500 Weeks Pay ($) 400 300 200 100 0 0 10 20 30 40 Hours Worked 50 60 70 High leverage data point
(12.34)
As a rule of thumb, a leverage statistic that exceeds 3/n is unusual (note that if xi = x the leverage statistic h i is 1/n so the rule of thumb is just three times this value).
EXAMPLE
We see from Figure 12.42 that two data points (Tom and Courtney) are likely to have high leverage because Tom studied for only 1 hour (far below the mean) while Courtney studied for 19 hours (far above the mean). Using the information in Table 12.4 (p. 507) we can calculate their leverages:

h_{Tom} = \frac{1}{10} + \frac{(1 - 10.5)^2}{264.50} = .441 \qquad \text{(Tom's leverage)}

h_{Courtney} = \frac{1}{10} + \frac{(19 - 10.5)^2}{264.50} = .373 \qquad \text{(Courtney's leverage)}

FIGURE 12.42  Exam score plotted against hours of study, with Tom (x = 1) and Courtney (x = 19) at the extremes
By the quick rule, both exceed 3/n = 3/10 = .300, so these two observations are influential. Yet despite their high leverages, the regression fits Tom's and Courtney's actual exam scores well, so their residuals are not unusual. This illustrates that high leverage and unusual residuals are two different concepts.
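Formula 12.34 and the 3/n quick rule are easy to apply in code. In this sketch the study-hours values are hypothetical except for Tom (x = 1) and Courtney (x = 19); the middle eight values were chosen so that x̄ = 10.5 and Σ(xi − x̄)² = 264.50, matching the text, so the printed leverages reproduce the calculations above.

```python
import numpy as np

def leverage(x):
    """h_i = 1/n + (x_i - xbar)^2 / sum((x - xbar)^2)   (formula 12.34)"""
    x = np.asarray(x, dtype=float)
    return 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# hours of study: Tom = 1 and Courtney = 19 as in the text; the middle eight
# values are hypothetical, picked to give xbar = 10.5 and Sxx = 264.50
hours = [1, 5, 7, 8, 10, 11, 14, 15, 15, 19]

h = leverage(hours)
print(h.round(3))            # h[0] = .441 (Tom), h[-1] = .373 (Courtney)
print(h > 3 / len(hours))    # quick rule: leverage above 3/n is unusual
```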
FIGURE 12.43  MegaStat's residual table  Taxes

Year   Actual Y   Predicted Y   Residual   Leverage   Studentized Residual   Studentized Deleted Residual
1991   610.50     566.55        43.95      0.296      2.371                  4.066
1992   635.80     632.81        2.99       0.221      0.153                  0.144
1993   674.60     680.52        −5.92      0.179      −0.296                 −0.278
1994   722.60     740.91        −18.31     0.138      −0.893                 −0.880
1995   778.30     808.89        −30.59     0.110      −1.467                 −1.606
1996   869.70     884.16        −14.46     0.100      −0.690                 −0.666
1997   968.80     968.80        0.00       0.117      0.000                  0.000
1998   1,070.40   1,075.03      −4.63      0.179      −0.231                 −0.217
1999   1,159.20   1,151.35      7.85       0.251      0.411                  0.388
2000   1,288.20   1,269.07      19.13      0.410      1.127                  1.149
Mini Case 12.4
Body Fat  BodyFat
Is waistline a good predictor of body fat? Table 12.9 shows a random sample of 50 men's body fat percentages and abdomen girths (centimeters). Figure 12.44 suggests that a linear regression is appropriate, and the MegaStat output in Figure 12.45 shows that the regression is highly significant (F = 97.68, t = 9.883, p = .0000).
TABLE 12.9  Abdomen Measurement and Body Fat (n = 50 men)

Girth  Fat%    Girth  Fat%    Girth  Fat%    Girth  Fat%
99.1   19.0    78.0   7.3     93.0   18.1    95.0   21.6
76.0   8.4     83.2   13.4    76.0   13.7    86.0   8.8
83.1   9.2     85.6   22.3    106.1  28.1    90.6   19.5
88.5   21.8    90.3   20.2    109.3  23.0    105.5  31.0
118.0  33.6    104.5  16.8    104.3  30.8    79.4   10.4
104.3  31.7    95.6   18.4    100.5  16.5    126.2  33.1
79.5   6.4     103.1  27.7    77.9   7.4     98.0   20.2
108.8  24.6    89.9   17.4    101.6  18.2    95.5   21.9
81.9   4.1     104.0  26.4    99.7   25.1    73.7   11.2
76.6   12.8    95.3   11.3    96.7   16.1    86.4   10.9
88.7   12.3    105.0  27.1    95.8   30.2    122.1  45.1
90.9   8.5     83.5   17.2    104.8  25.4
89.0   26.0    86.7   10.7    92.4   25.9

Data are from a larger sample of 252 men in Roger W. Johnson, Journal of Statistics Education 4, No. 1 (1996).
*MegaStat uses a more sophisticated test for leverage than the 3/n quick rule, but the conclusions generally agree.
FIGURE 12.44  Body fat scatter plot: Girth and Body Fat (n = 50 men), body fat (percent) plotted against abdomen girth (cm)

FIGURE 12.45  MegaStat regression output for body fat

Regression Analysis
r² = 0.671    r = 0.819    Std. Error = 5.086    n = 50    k = 1    Dep. Var. = Fat%

ANOVA table
Source       SS           df   MS           F       p-value
Regression   2,527.1190   1    2,527.1190   97.68   3.71E-13
Residual     1,241.8162   48   25.8712
Total        3,768.9352   49

Regression output                                              confidence interval
variables   coefficients   std. error   t (df = 48)   p-value    95% lower   95% upper
Intercept   −36.2397       5.6690       −6.393        6.28E-08   −47.6379    −24.8415
Girth       0.5905         0.0597       9.883         3.71E-13   0.4704      0.7107
MegaStat's table of residuals, shown in Figure 12.46, highlights four unusual observations. Observations 5, 45, and 50 have high leverage values (exceeding 3/n = 3/50 = .06) because their abdomen measurements (italicized and boldfaced in Table 12.9) are far from the mean. Observation 37 has a large studentized deleted residual (actual body fat of 30.20 percent is much greater than the predicted 20.33 percent). Well-behaved observations are omitted because they are not unusual according to any of the diagnostic criteria (leverage, studentized residual, or studentized deleted residual).
FIGURE 12.46  Unusual body fat residuals (observations 5, 37, 45, and 50)
Outliers
We have mentioned outliers under the discussion of non-normal residuals. However, outliers are the source of many other woes, including loss of fit. What causes outliers?

An outlier may be an error in recording the data. If so, the observation should be deleted. But how can you tell? Impossible or bizarre data values are prima facie reasons to discard a data value. For example, in a sample of body fat data, one adult man's weight was reported as 205 pounds and his height as 29.5 inches (probably a typographical error that should have been 69.5 inches). It is reasonable to discard the observation on the grounds that it represents a population different from that of the other men.

An outlier may also be an observation that has been influenced by an unspecified lurking variable that should have been controlled but wasn't. If so, we should try to identify the lurking variable and formulate a multiple regression model that includes the lurking variable(s) as predictors. For example, if you regress Y = tuition paid against X = credits taken, you should at least include a binary variable for university type, Z = 0, 1 (public, private).
Model Misspecification
If a relevant predictor has been omitted, then the model is misspecified. Instead of bivariate regression, you should use multiple regression. Such situations are so common that this is almost a warning against relying on bivariate regression, since we usually can think of more than one explanatory variable. Even our tax function, which gave an excellent fit, might be improved if we added more predictors. As you will see in the next chapter, multiple regression is computationally easy because the computer does all the work. In fact, most computer packages just call it "regression" regardless of the number of predictors.
Ill-Conditioned Data
Variables in a regression should be of the same general order of magnitude, and most people intuitively take steps to make sure this is the case (well-conditioned data). Unusually large or small data values (called ill-conditioned) can cause loss of regression accuracy or create awkward estimates with exponential notation. Consider the data in Table 12.10 for 30 randomly selected large companies (only a few of the 30 are shown in the table). The table shows two ways of displaying the same data, but with the decimal point changed. Figures 12.47 and 12.48
TABLE 12.10  Net Income and Revenue for Selected Global 100 Companies  Global30
Source: www.forbes.com and Forbes 172, no. 2 (July 21, 2003), pp. 108–110.
Companies shown include Allstate, American Intl Group, Barclays, . . . , Volkswagen Group, Wachovia, and Walt Disney.
FIGURE 12.47  Ill-conditioned data (both axes in thousands of dollars): y = 3.4441x + 4E+07, R² = .0524
FIGURE 12.48
160,000 Millions of Dollars 140,000 120,000 100,000 80,000 60,000 40,000 20,000 0 0 2,000 4,000 6,000 8,000 10,000 12,000 Millions of Dollars y 3.4441x 36.578 R 2 .0524
Well-conditioned data
show Excel scatter plots with regression lines. Their appearance is the same, but the first graph has disastrously crowded axis labels. The graphs have the same slope and R², but the first regression has an unintelligible intercept (4E+07). Awkwardly small numbers may also require adjustment. For example, the number of automobile thefts per capita in the United States in 1990 was 0.004207, a statistic that is easier to work with if reported per 100,000 population as 420.7. Worst of all is to mix very large data with very small data. For example, in 1999 the per capita income in New York was $27,546 and the number of active physicians per capita was 0.00395. To avoid mixing magnitudes, we can redefine the variables as per capita income in thousands of dollars (27.546) and the number of active physicians per 10,000 population (39.5).
Tip
Adjust the magnitude of your data before running the regression.
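The effect of rescaling is easy to verify: dividing both variables by the same constant leaves the slope unchanged but shrinks the intercept into a readable magnitude. A sketch with hypothetical net income and revenue figures:

```python
import numpy as np

# hypothetical net income and revenue, recorded in thousands of dollars
net_income = np.array([1_100_000, 5_700_000, 900_000, 6_500_000, 2_100_000], float)
revenue = np.array([25_800_000, 81_300_000, 17_900_000, 98_700_000, 33_400_000], float)

slope1, intercept1 = np.polyfit(net_income, revenue, 1)              # thousands
slope2, intercept2 = np.polyfit(net_income / 1e3, revenue / 1e3, 1)  # millions

print(slope1, slope2)          # same slope (both axes rescaled alike)
print(intercept1, intercept2)  # intercept shrinks by the factor of 1,000
```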
Spurious Correlation
Prisoners
In a spurious correlation, two variables appear related because of the way they are defined. For example, consider the hypothesis that a state's spending on education is a linear function of its prison population. Such a hypothesis seems absurd, and we would expect the regression to be insignificant. But if the variables are defined as totals without adjusting for population, we will observe significant correlation. This phenomenon is called the size effect or the problem of totals. Table 12.11 shows selected data, first with the variables as totals and then adjusted for population.
TABLE 12.11  State Spending on Education and State and Federal Prisoners  Prisoners

Using Totals                                            Using Per Capita Data
Total Population   K–12 Spending   No. of Prisoners     K–12 Spending      Prisoners per
(millions)         ($ billions)    (thousands)          per Capita ($)     1,000 Pop.
4.447              4.52            24.66                1,016              5.54
0.627              1.33            3.95                 2,129              6.30
. . .              . . .           . . .                . . .              . . .
5.364              8.48            20.42                1,580              3.81
0.494              0.76            1.71                 1,543              3.47

Source: Statistical Abstract of the United States, 2001.
Figure 12.49 shows that, contrary to expectation, the regression on totals gives a very strong fit to the data. Yet Figure 12.50 shows that if we divide by population and adjust the decimals, the fit is nonexistent and the slope is indistinguishable from zero. The spurious correlation arose merely because both variables reflect the size of a state's population. For example, New York and California lie far to the upper right on the first scatter plot because they are populous states, while smaller states like South Dakota and Delaware are near the origin.
FIGURE 12.49  Spurious model using totals: Education Spending and Prison Population (number of prisoners, thousands, plotted against total K–12 spending)
FIGURE 12.50
Better model: Per capita data
10 9 8 7 6 5 4 3 2 1 0 0 Prisoners Per 1,000 y Education Spending and Prison Population .00005x 4.0313 R 2 .0000
2,500
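The size effect is easy to reproduce by simulation. In this sketch (synthetic data, not the Table 12.11 values) the per capita rates are generated independently, yet the totals correlate strongly because both are proportional to population:

```python
import numpy as np

rng = np.random.default_rng(42)
pop = rng.uniform(0.5, 35.0, size=50)           # population, millions
spend_rate = rng.normal(1500, 300, size=50)     # K-12 $ per capita (independent)
prison_rate = rng.normal(4.5, 1.0, size=50)     # prisoners per 1,000 (independent)

spend_total = pop * spend_rate / 1000           # $ billions
prison_total = pop * prison_rate                # thousands

print(np.corrcoef(spend_total, prison_total)[0, 1])  # near +1: spurious
print(np.corrcoef(spend_rate, prison_rate)[0, 1])    # near 0: no real relation
```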
Variable Transforms
MPG
Sometimes a relationship cannot be modeled well by a linear regression. For example, Figure 12.51 shows fuel efficiency (city MPG) and engine size (horsepower) for a sample of 93 vehicles with a nonlinear model form fitted by Excel. This is one of several nonlinear forms offered by Excel (there are also logarithmic and exponential functions). Figure 12.52 shows an alternative: a linear regression after taking logarithms of each variable. These logarithms are in base 10, but any base will do (scientists prefer base e). This is an example of a variable transform to improve fit. An advantage of the log transformation is that it reduces heteroscedasticity and improves the normality of the residuals, especially when dealing with totals (the size problem mentioned earlier). But log transforms will not work if any data values are zero or negative.
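A sketch of the log transform, using a handful of hypothetical horsepower/MPG pairs rather than the 93-vehicle data set:

```python
import numpy as np
from scipy import stats

hp = np.array([55, 90, 110, 140, 170, 200, 255, 300], dtype=float)
mpg = np.array([44, 35, 31, 28, 25, 23, 21, 19], dtype=float)

fit = stats.linregress(np.log10(hp), np.log10(mpg))
print(fit.slope, fit.intercept, fit.rvalue ** 2)   # negative slope, as in Fig. 12.52

# back-transform the fitted log model to predict MPG for a 180-hp engine
print(10 ** (fit.intercept + fit.slope * np.log10(180)))
```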
FIGURE 12.51  Nonlinear regression: Fuel Economy and Engine Size (n = 93), city miles per gallon plotted against horsepower. Source: Robin H. Lock, Journal of Statistics Education 1, no. 1 (1993).
FIGURE 12.52
Fuel Economy and Engine Size (n 1.7 1.6 Log mpg 1.5 1.4 1.3 1.2 1.1 1.0 1.5 1.7 1.9 2.1 2.3 Log Horsepower 2.5 2.7 y .4923x 2.3868 R 2 .6181 93)
For the tax data examined earlier, Figures 12.53 and 12.54 show that a quadratic model would give a slightly better fit than a linear model, as measured by R². But would a government fiscal analyst get better predictions of aggregate taxes for the next year by using a quadratic model? Excel makes it easy to fit all sorts of regression models, but fit is only one criterion for evaluating a regression model. There really is no logical basis for imagining that income and taxes are related by a polynomial model (a better case might be made for an exponential model). Since nonlinear models may be hard to justify or explain to others, the principle of Occam's Razor (choosing the simplest explanation that fits the facts) favors linear regression unless there are other compelling factors.
FIGURE 12.53  Linear model: Linear Aggregate U.S. Tax Function, 1991–2000, personal taxes ($ billions) plotted against personal income ($ billions); y = .2172x − 538.21, R² = .9922
FIGURE 12.54  Quadratic model: Quadratic Aggregate U.S. Tax Function, 1991–2000, personal taxes ($ billions) plotted against personal income ($ billions); y = .00002x² − .01913x + 231.51571, R² = .99748
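Fitting and comparing the two models takes only a few lines. In the sketch below the income values are hypothetical, while the tax values are rounded from Figure 12.43; note that adding a polynomial term can never lower R² on the same data, which is one reason fit alone is a weak criterion:

```python
import numpy as np

income = np.array([5000, 5400, 5800, 6300, 6700, 7100, 7500, 7900, 8200, 8400], float)
taxes = np.array([610, 636, 675, 723, 778, 870, 969, 1070, 1159, 1288], float)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

linear = np.polyval(np.polyfit(income, taxes, 1), income)
quadratic = np.polyval(np.polyfit(income, taxes, 2), income)

print(r_squared(taxes, linear), r_squared(taxes, quadratic))  # quadratic at least as high
```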
Regression by Splines
Savings
One simple experiment you can try with time-series data is to fit a regression by subperiods. A good example is the interesting problem of the declining rate of personal saving in the United States. In your economics class you heard references to the marginal propensity to save, presumably referring to β₁ in an equation something like Saving = β₀ + β₁ Income. Yet during the 1990s, increasing personal income was associated with decreased personal saving. If you look at Figure 12.55 you will see that there is no way to fit a straight line to the entire data set. However, you might divide the data into three or four subperiods, each of which could be modeled by a linear function. This is called regression by splines. By comparing regression slopes for each subperiod, you will obtain clues about what has happened to the marginal propensity to save. Of course, any economist will tell you it is not that simple and that a much more complex model is needed to explain saving.
FIGURE 12.55  Aggregate U.S. saving and income: Aggregate U.S. Saving Function, 1959–2000, personal saving ($ billions) plotted against personal income ($ billions)  Saving
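A crude version of this experiment: fit separate OLS lines to subperiods and compare their slopes. The series below is stylized (generated with a built-in kink), since the actual Saving data set is not reproduced here, and the breakpoints are arbitrary:

```python
import numpy as np
from scipy import stats

years = np.arange(1959, 2001)
income = np.linspace(400, 8000, len(years))              # $ billions, stylized
saving = np.where(income < 4000,                         # kink at income = 4,000
                  0.08 * income,                         # early: higher propensity
                  0.08 * 4000 + 0.01 * (income - 4000))  # later: much flatter

for lo, hi in [(1959, 1974), (1975, 1989), (1990, 2000)]:
    m = (years >= lo) & (years <= hi)
    fit = stats.linregress(income[m], saving[m])
    print(f"{lo}-{hi}: slope = {fit.slope:.3f}")         # marginal propensity to save
```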
Mini Case 12.5
CEO Compensation  CEO
Do highly compensated executives lead their corporations to outperform other rms? Statistics student Greg Burks examined 1-year total shareholder returns in 2001 less the S&P 500 average (i.e., percentage points above or below the S&P average) as a function of total CEO compensation in 2001 for the top 200 U.S. corporations. A scatter plot is shown in
Figure 12.56. There appears to be little relationship, but a dozen or so hugely compensated CEOs (e.g., those earning over $50 million) stretch the X-axis scale while many others are clustered near the origin. A log transformation of X (using base 10) is shown in Figure 12.57. Neither fitted regression is significant. That is, there is little relationship between CEO compensation and stockholder returns (if anything, the slope is negative). However, the transformed data give a clearer picture. An advantage of the log transformation is that it improves the scatter of the residuals, producing a more homoscedastic distribution. In short, a log transformation is an excellent option to consider with skewed data.
FIGURE 12.56  2001 CEO Compensation (n = 200): return over S&P 500 (percent) plotted against total compensation ($ thousands); y = −.0003x + 19.393, R² = .0419
FIGURE 12.57  2001 CEO Compensation (n = 200): return over S&P 500 (percent) plotted against the logarithm of total compensation; y = −17.027x + 82.737, R² = .0595
Chapter Summary
The sample correlation coefficient r measures linear association between X and Y, with values near 0 indicating a lack of linearity while values near −1 (negative correlation) or +1 (positive correlation) suggest linearity. The t test is used to test hypotheses about the population correlation ρ. In bivariate regression there is an assumed linear relationship between the independent variable X (the predictor) and the dependent variable Y (the response). The slope (β₁) and intercept (β₀) are unknown parameters that are estimated from a sample. Residuals are the differences between observed and fitted Y-values. The ordinary least squares (OLS) method yields regression coefficients for the slope (b₁) and intercept (b₀) that minimize the sum of squared residuals. The coefficient of determination (R²) measures the overall fit of the regression, with R² near 1 signifying a good fit and R² near 0 indicating a poor fit. The F statistic in the ANOVA table is used to test for significant overall regression, while the t statistics (and their p-values) are used to test hypotheses about the slope and intercept. The standard error of the regression is used to create confidence intervals or prediction intervals for Y. Regression assumes that the errors are normally distributed, independent random variables with constant variance σ². Residual tests identify possible violations of these assumptions (non-normality, autocorrelation, heteroscedasticity). Data values with high leverage (unusual X-values) have strong influence on the regression, and unusual standardized residuals indicate cases where the regression gives a poor fit. Ill-conditioned data may lead to spurious correlation or other problems. Data transforms may help, but they also change the model specification.
Key Terms
autocorrelation, 500; autocorrelation coefficient, 500; bivariate data, 489; bivariate regression, 500; coefficient of determination, R², 508; confidence interval, 522; Durbin-Watson test, 528; error sum of squares, 508; fitted model, 502; fitted regression, 501; heteroscedastic, 526; homoscedastic, 526; ill-conditioned data, 536; intercept, 502; leverage, 532; log transformation, 538; non-normality, 524; ordinary least squares (OLS), 505; population correlation coefficient, ρ, 490; prediction interval, 522; regression by splines, 540; residual, 502; sample correlation coefficient, r, 490; scatter plot, 489; slope, 502; spurious correlation, 537; standard error, 511; standardized residuals, 524; studentized residuals, 531; sums of squares, 490; t statistic, 492; variable transform, 538; well-conditioned data, 536
Commonly Used Formulas in Bivariate Regression

Sample correlation coefficient:  r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Test statistic for zero correlation:  t = r\sqrt{\frac{n-2}{1-r^2}}

True regression line:  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

Fitted regression line:  \hat{y}_i = b_0 + b_1 x_i

OLS slope estimate:  b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

OLS intercept estimate:  b_0 = \bar{y} - b_1\bar{x}

Sum of squared residuals:  SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2

Coefficient of determination:  R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

Standard error of the estimate:  s_{yx} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{SSE}{n-2}}

Standard error of the slope:  s_{b_1} = \frac{s_{yx}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

t test for zero slope:  t = \frac{b_1 - 0}{s_{b_1}}

Confidence interval for true slope:  b_1 - t_{n-2}\,s_{b_1} \le \beta_1 \le b_1 + t_{n-2}\,s_{b_1}

Confidence interval for conditional mean of Y:  \hat{y}_i \pm t_{n-2}\, s_{yx}\sqrt{\frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

Prediction interval for individual Y:  \hat{y}_i \pm t_{n-2}\, s_{yx}\sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
Chapter Review
Note: Exercises marked * are based on optional material.
1. (a) How does correlation analysis differ from regression analysis? (b) What does a correlation coefficient reveal? (c) State the quick rule for a significant correlation and explain its limitations. (d) What sums are needed to calculate a correlation coefficient? (e) What are the two ways of testing a correlation coefficient for significance?
2. (a) What is a bivariate regression model? Must it be linear? (b) State three caveats about regression. (c) What does the random error component in a regression model represent? (d) What is the difference between a regression residual and the true random error?
3. (a) Explain how you fit a regression to an Excel scatter plot. (b) What are the limitations of Excel's scatter plot fitted regression?
4. (a) Explain the logic of the ordinary least squares (OLS) method. (b) How are the least squares formulas for the slope and intercept derived? (c) What sums are needed to calculate the least squares estimates?
5. (a) Why can't we use the sum of the residuals to assess fit? (b) What sums are needed to calculate R²? (c) Name an advantage of using the R² statistic instead of the standard error s_yx to measure fit. (d) Why do we need the standard error s_yx?
6. (a) Explain why a confidence interval for the slope or intercept would be equivalent to a two-tailed hypothesis test. (b) Why is it especially important to test for a zero slope?
7. (a) What does the F statistic show? (b) What is its range? (c) What is the relationship between the F test and the t tests for the slope and correlation coefficient?
8. (a) For a given X, explain the distinction between a confidence interval for the conditional mean of Y and a prediction interval for an individual Y-value. (b) Why is the individual prediction interval wider? (c) Why are these intervals narrowest when X is near its mean? (d) When can quick rules for these intervals give acceptable results, and when not?
9. (a) What is a residual? (b) What is a standardized residual, and why is it useful? (c) Name two alternative ways to identify unusual residuals.
10. (a) When does a data point have high leverage (refer to the scatter plot)? (b) Name one test for unusual leverage.
11. (a) Name three assumptions about the random error term in the regression model. (b) Why are the residuals important in testing these assumptions?
12. (a) What are the consequences of non-normal errors? (b) Explain two tests for non-normality. (c) What can we do about non-normal residuals?
13. (a) What is heteroscedasticity? Identify its two common forms. (b) What are its consequences? (c) How do we test for it? (d) What can we do about it?
14. (a) What is autocorrelation? Identify two main forms of it. (b) What are its consequences? (c) Name two ways to test for it. (d) What can we do about it?
*15. (a) Why might there be outliers in the residuals? (b) What actions could be taken?
*16. (a) What is ill-conditioned data? How can it be avoided? (b) What is spurious correlation? How can it be avoided?
*17. (a) What is a log transform? (b) What are its advantages and disadvantages?
*18. (a) What is regression by splines? (b) Why might we do it?
CHAPTER EXERCISES
Instructions: Choose one or more of the data sets A–I below, or as assigned by your instructor. Choose the dependent variable (the response variable to be explained) and the independent variable (the predictor or explanatory variable) as you judge appropriate. Use a spreadsheet or a statistical package (e.g., MegaStat or MINITAB) to obtain the bivariate regression and required graphs. Write your answers to exercises 12.28 through 12.43 (or those assigned by your instructor) in a concise report, labeling your answers to each question. Insert tables and graphs in your report as appropriate. You may work with a partner if your instructor allows it.
12.28 Are the variables cross-sectional data or time-series data?
12.29 How do you imagine the data were collected?
12.30 Is the sample size sufficient to yield a good estimate? If not, do you think more data could easily be obtained, given the nature of the problem?
12.31 State your a priori hypothesis about the sign of the slope. Is it reasonable to suppose a cause-and-effect relationship?
12.32 Make a scatter plot of Y against X. Discuss what it tells you.
12.33 Use Excel's Add Trendline feature to fit a linear regression to the scatter plot. Is a linear model credible?
12.34 Interpret the slope. Does the intercept have meaning, given the range of the data?
12.35 Use Excel, MegaStat, or MINITAB to fit the regression model, including residuals and standardized residuals.
12.36 (a) Does the 95 percent confidence interval for the slope include zero? If so, what does it mean? If not, what does it mean? (b) Do a two-tailed t test for zero slope at α = .05. State the hypotheses, degrees of freedom, and critical value for your test. (c) Interpret the p-value for the slope. (d) Which approach do you prefer, the t test or the p-value? Why? (e) Did the sample support your hypothesis about the sign of the slope?
12.37 (a) Based on the R² and ANOVA table for your model, how would you assess the fit? (b) Interpret the p-value for the F statistic. (c) Would you say that your model's fit is good enough to be of practical value?
12.38 Study the table of residuals. Identify as outliers any standardized residuals that exceed 3 and as unusual any that exceed 2. Can you suggest any reasons for these unusual residuals?
*12.39 (a) Make a histogram (or normal probability plot) of the residuals and discuss its appearance. (b) Do you see evidence that your regression may violate the assumption of normal errors?
*12.40 Inspect the residual plot to check for heteroscedasticity and report your conclusions.
*12.41 Is an autocorrelation test appropriate for your data? If so, perform one or more tests of the residuals (eyeball inspection of residual plot against observation order, runs test, and/or Durbin-Watson test).
*12.42 Use MegaStat or MINITAB to generate 95 percent confidence and prediction intervals for various X-values.
*12.43 Use MegaStat or MINITAB to identify observations with high leverage.
DATA SET A
HomePrice1

City                  Home ($000)
Alexandria, VA        290.000
Bernards Twp., NJ     205.000
Brentwood, TN         410.000
Bridgewater, NJ       379.975
Cary, NC              135.000
Centreville, VA       358.500
Chantilly, VA         342.500
Chesapeake, VA        341.000
Collierville, TN      287.450
Columbia, MD          214.500
Coral Springs, FL     330.875
Dranesville, VA       444.500
Dunwoody, GA          240.000
Ellicott City, MD     226.450
Franconia, VA         278.250
Gaithersburg, MD      290.000
Hoover, AL            230.000
Source: Money 32, no. 1 (January 2004), pp. 102103. Note: Data are for educational purposes only.
DATA SET B
CarFirms

Company                  Revenue   Employees
BMW                      13.8      64.1
DaimlerChrysler          16.1      31.9
Dana                     27.5      26.7
Denso                    51.5      131.3
Fiat                     37.5      156.5
Ford Motor               41.4      138.3
Fuji Heavy Industries    28.6      189.5
General Motors           11.4      13.9
Honda Motor              99.7      183.9
Isuzu Motors             11.9      78.0
Johnson Controls         76.3      297.9
Lear                     26.8      70.3
Source: Project by statistics students Paul Ruskin, Kristy Bielewski, and Linda Stengel.
DATA SET C
Patient 1 2 3 4 5 6 7 8
Source: Records of a hospital outpatient cognitive retraining clinic. Notes: ELOS was estimated using a 42-item assessment instrument combined with expert judgment by teams. Patients had suffered head trauma, stroke, or other medical conditions affecting cognitive function.
DATA SET D
Manufacturer/Model
Airplanes
Cruise Speed 140 235 191 132 115 170 175 156 188 128 107 148 129 191 147 213 186 148 180 186 100 176 151 98 163 143 TotalHP 125 350 310 125 180 210 244 200 280 160 125 300 180 500 235 350 300 300 440 440 150 300 260 81 250 180
Manufacturer/Model Diamond C1 Eclipse Extra Extra 400 Lancair Columbia 300 Liberty XL-2 Maule Comet Mooney 231 Mooney Eagle M205 Mooney M20C Mooney Ovation 2 M20R OMF Aircraft Symphony Piper 125 Tri Pacer Piper 6X Piper Archer III Piper Aztec F Piper Dakota Piper Malibu Mirage Piper Saratoga II TC Piper Saratoga SP Piper Seneca III Piper Seneca V Piper Super Cub Piper Turbo Lance Rockwell Commander 114 Sky Arrow 650 TC Socata TB20 Trinidad Tiger AG-5B
AMD CH 2000 Beech Baron 58 Beech Baron 58P Beech Baron D55 Beech Bonanza B36 TC Beech Duchess Beech Sierra Bellanca Super Viking Cessna 152 Cessna 170B Cessna 172 R Skyhawk Cessna 172 RG Cutlass Cessna 182Q Skylane Cessna 310 R Cessna 337G Skymaster II Cessna 414A Cessna 421B Cessna Cardinal Cessna P210 Cessna T210K Cessna T303 Crusader Cessna Turbo Skylane RG Cessna Turbo Skylane T182T Cessna Turbo Stationair TU206 Cessna U206H Cirrus SR20
Source: New and used airplane reports in Flying, various issues. Note: Data are intended for educational purposes only. Cruise speed is in knots (nautical miles per hour).
DATA SET E
Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Age (yr) 2 15 22 27 38 2 0 28 12 17 13 1 1 4 1 9
Source: Randomly chosen circulated nickels were weighed by statistics student Dorothy Duffy as an independent project. Nickels were weighed on a Mettler PE 360 Delta Range scale, accurate to 0.001 gram. The coin's age is the difference between the current year and the mint year.
DATA SET F
Year   %ChgCPI
1961   1.3
1962   1.6
1963   1.0
1964   1.9
1965   3.5
1966   3.0
1967   4.7
1968   6.2
1969   5.6
1970   3.3
1971   3.4
1972   8.7
1973   12.3
1974   6.9
1975   4.9
1976   6.7
1977   9.0
1978   13.3
1979   12.5
1980   8.9
Source: Economic Report of the President, 2002. Note: %ChgCPI is the percent change in the Consumer Price Index and %ChgM3 is the percent change in the M3 component of the money supply lagged 1 year.
DATA SET G
Vehicle
MPG
City MPG 17 18 13 18 19 17 20 20 20 15 16 25 28 24 21 17 23 26 19 34 20 Weight 3,640 3,390 5,000 3,925 3,355 4,195 3,340 3,285 3,345 4,270 4,315 3,095 2,805 2,855 3,575 3,590 2,570 2,985 4,120 3,045 3,690
Acura CL Acura TSX BMW 3-Series Buick Century Buick Rendezvous Cadillac Seville Chevrolet Corvette Chevrolet Silverado 1500 Chevrolet TrailBlazer Chrysler Pacifica Dodge Caravan Dodge Ram 1500 Ford Expedition Ford Focus GMC Envoy Honda Accord Honda Odyssey Hyundai Elantra Infiniti FX Isuzu Ascender Jaguar XJ8 Kia Rio
Land Rover Freelander Lexus IS300 Lincoln Aviator Mazda MPV Mazda6 Mercedes-Benz S-Class Mercury Sable Mitsubishi Galant Nissan 350Z Nissan Pathfinder Nissan Xterra Pontiac Grand Am Pontiac Vibe Saturn Ion Subaru Baja Suzuki Vitara/XL-7 Toyota Celica Toyota Matrix Toyota Sienna Volkswagen Jetta Volvo C70
Source: 2003 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057, from Consumer Reports New Car Buying Guide 2003–2004. Used with permission. Notes: Sampling methodology was to select the vehicle on every fifth page starting at page 40. Data are intended for purposes of statistical education and should not be viewed as a guide to vehicle performance. Vehicle weights are in pounds.
DATA SET H
Pasta

Product                               Fat Calories per Gram
Barilla Roasted Garlic & Onion        0.20
Barilla Tomato & Basil                0.12
Classico Tomato & Basil               0.08
Del Monte Mushroom                    0.04
Five Bros. Tomato & Basil             0.12
Healthy Choice Traditional            0.00
Master Choice Chunky Garden Veg.      0.08
Meijer All Natural Meatless           0.08
Newman's Own Traditional              0.12
Paul Newman Venetian                  0.12
Prego Fresh Mushrooms                 0.38
Prego Hearty Meat-Pepperoni           0.33
Prego Hearty Meat-Hamburger           0.29
Prego Traditional                     0.33
Prego Roasted Red Pepper & Garlic     0.25
Ragu Old World Style w/meat           0.25
Ragu Roasted Red Pepper & Onion       0.20
Ragu Roasted Garlic                   0.19
Ragu Traditional                      0.20
Sutter Home Tomato & Garlic           0.16
Source: This data set was created by statistics students Donna Bennett, Nicole Cook, Latrice Haywood, and Robert Malcolm using nutrition information on the labels of sauces chosen from supermarket shelves. Note: Data are intended for educational purposes only and should not be viewed as a nutrition guide.
DATA SET I
Month 1 2 3 4 5 6 7 8 9 10 11 12
12.44 Researchers found a correlation coefficient of r = .50 on personality measures for identical twins. A reporter interpreted this to mean that the environment orchestrated one-half of their personality differences. Do you agree with this interpretation? Discuss. (See Science News 140 [December 7, 1991], p. 377.)
12.45 A study of the role of spreadsheets in planning in 55 small firms defined Y = satisfaction with sales growth and X = executive commitment to planning. Analysis yielded an overall correlation of r = .3043. Do a two-tailed test for zero correlation at α = .025.
12.46 In a study of stock prices from 1970 to 1994, the correlation between Nasdaq closing prices on successive days (i.e., with a 1-day lag) was r = .13 with a t statistic of 5.47. Interpret this result. (See David Nawrocki, "The Problems with Monte Carlo Simulation," Journal of Financial Planning 14, no. 11 [November 2001], p. 96.)
12.47 Regression analysis of free throws by 29 NBA teams during the 2002–2003 season revealed the fitted regression Ŷ = 55.2 + .73X (R² = .874, s_yx = 53.2), where Y = total free throws made and X = total free throws attempted. The observed range of X was from 1,620 (New York Knicks) to 2,382 (Golden State Warriors). (a) Find the expected number of free throws made for a team that shoots 2,000 free throws. (b) Do you think that the intercept is meaningful? Hint: Make a scatter plot and let Excel fit the line. (c) Use the quick rule to make a 95 percent prediction interval for Y when X = 2,000.  FreeThrows
12.48 In the following regression, X = weekly pay, Y = income tax withheld, and n = 35 McDonald's employees. (a) Write the fitted regression equation. (b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the critical value at α = .05. (c) What is your conclusion about the slope? (d) Interpret the 95 percent confidence limits for the slope. (e) Verify that F = t² for the slope. (f) In your own words, describe the fit of this regression.
r² = 0.202    Std. Error = 6.816    n = 35

ANOVA table
Source       df   MS         F      p-value
Regression   1    387.6959   8.35   .0068
Residual     33   46.4564
Total        34

Regression output                                              confidence interval
variables   coefficients   std. error   t (df = 33)   p-value   95% lower   95% upper
Intercept   30.7963        6.4078       4.806         .0000     17.7595     43.8331
Slope       0.0343         0.0119       2.889         .0068     0.0101      0.0584
12.49 In the following regression, X = monthly maintenance spending (dollars), Y = monthly machine downtime (hours), and n = 15 copy machines. (a) Write the fitted regression equation. (b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the critical value at α = .05. (c) What is your conclusion about the slope? (d) Interpret the 95 percent confidence limits for the slope. (e) Verify that F = t² for the slope. (f) In your own words, describe the fit of this regression.
r² = 0.370    Std. Error = 286.793    n = 15

ANOVA table
Source       df   MS          F      p-value
Regression   1    628,298.2   7.64   .0161
Residual     13   82,250.1
Total        14

Regression output                                              confidence interval
variables   coefficients   std. error   t (df = 13)   p-value   95% lower   95% upper
Intercept   1,743.57       288.82       6.037         .0000     1,119.61    2,367.53
Slope       −1.2163        0.4401       −2.764        .0161     −2.1671     −0.2656
12.50 In the following regression, X = total assets ($ billions), Y = total revenue ($ billions), and n = 64 large banks. (a) Write the fitted regression equation. (b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the critical value at α = .05. (c) What is your conclusion about the slope? (d) Interpret the 95 percent confidence limits for the slope. (e) Verify that F = t² for the slope. (f) In your own words, describe the fit of this regression.
r² = 0.519    Std. Error = 6.977    n = 64

ANOVA table
Source       df   MS           F       p-value
Regression   1    3,260.0981   66.97   1.90E-11
Residual     62   48.6828
Total        63

Regression output                                              confidence interval
variables   coefficients   std. error   t (df = 62)   p-value    95% lower   95% upper
Intercept   6.5763         1.9254       3.416         .0011      2.7275      10.4252
X1          0.0452         0.0055       8.183         1.90E-11   0.0342      0.0563
12.51 Do stock prices of competing companies move together? Below are daily closing prices of two computer services firms (IBM = International Business Machines Corporation, EDS = Electronic Data Systems Corporation). (a) Calculate the sample correlation coefficient (e.g., using Excel or MegaStat). (b) At α = .01, can you conclude that the true correlation coefficient is greater than zero? (c) Make a scatter plot of the data. What does it say? (Data are from Center for Research and Security Prices, University of Chicago.)  StockPrices
Daily Closing Prices ($) of Two Stocks in September and October 2004

Date      IBM     EDS      Date       IBM     EDS
9/1/04    84.22   19.31    10/1/04    86.72   20.00
9/2/04    84.57   19.63    10/4/04    87.16   20.36
9/3/04    84.39   19.19    10/5/04    87.32   20.38
9/7/04    84.97   19.35    10/6/04    88.04   20.49
9/8/04    85.86   19.47    10/7/04    87.42   20.43
9/9/04    86.44   19.51    10/8/04    86.71   20.02
9/10/04   86.76   20.10    10/11/04   86.63   20.24
9/13/04   86.49   19.81    10/12/04   86.00   20.14
9/14/04   86.72   19.79    10/13/04   84.98   19.47
9/15/04   86.37   19.83    10/14/04   84.78   19.30
9/16/04   86.12   20.10    10/15/04   84.85   19.54
9/17/04   85.74   19.90    10/18/04   85.92   19.43
9/20/04   85.70   19.82    10/19/04   89.37   19.26
9/21/04   85.72   20.16    10/20/04   88.82   19.17
9/22/04   84.31   19.89    10/21/04   88.10   19.63
9/23/04   83.88   19.70    10/22/04   87.39   19.75
9/24/04   84.43   19.22    10/25/04   88.43   20.03
9/27/04   84.16   19.16    10/26/04   89.00   20.99
9/28/04   84.48   19.30    10/27/04   90.00   21.26
9/29/04   84.98   19.10    10/28/04   89.50   21.41
9/30/04   85.74   19.39    10/29/04   89.75   21.27
12.52 Below are average gestation days and longevity data for 22 animals. (a) Make a scatter plot. (b) Find the correlation coefficient and interpret it. (c) Test the correlation coefficient for significance, clearly stating the degrees of freedom. (Data are from The World Almanac and Book of Facts, 2005, p. 180. Used with permission.)  Gestation
12.53 Below are fertility rates (average children born per woman) in 15 EU nations for 2 years. (a) Make a scatter plot. (b) Find the correlation coefficient and interpret it. (c) Test the correlation coefficient for significance, clearly stating the degrees of freedom. (Data are from the World Health Organization.)  Fertility
12.54 Consider the following prices and accuracy ratings for 27 stereo speakers. (a) Make a scatter plot of accuracy rating as a function of price. (b) Calculate the correlation coefficient. At α = .05, does the correlation differ from zero? (c) In your own words, describe the scatter plot. (Data are from Consumer Reports 68, no. 11 [November 2003], p. 31. Data are intended for statistical education and not as a guide to speaker performance.)  Speakers
12.55 Choose one of these three data sets. (a) Make a scatter plot. (b) Let Excel estimate the regression line, with fitted equation and R². (c) Describe the fit of the regression. (d) Write the fitted regression equation and interpret the slope. (e) Do you think that the estimated intercept is meaningful? Explain.
Commercial Real Estate (X = assessed value, $000; Y = floor space, sq. ft.)  Assessed

Assessed   Size
1,796      4,790
1,544      4,720
2,094      5,940
1,968      5,720
1,567      3,660
1,878      5,000
949        2,990
910        2,610
1,774      5,650
1,187      3,570
1,113      2,930
671        1,280
1,678      4,880
710        1,620
678        1,820
Salaries
Poway Big Homes, Ltd. (X = home size, sq. ft.; Y = selling price, $000)  HomePrice2

SqFt    Price
3,570   861
3,410   740
2,690   563
3,260   698
3,130   624
3,460   737
3,340   806
3,240   809
2,660   639
3,160   778
3,340   809
2,780   621
2,790   687
3,740   840
3,260   789
3,310   760
2,930   729
3,020   720
2,320   575
3,130   785
12.56 Bivariate regression was employed to establish the effects of childhood exposure to lead. The effective sample size was about 122 subjects. The independent variable was the level of dentin lead (parts per million). Below are regressions using various dependent variables. (a) Calculate the t statistic for each slope. (b) From the p-values, which slopes differ from zero at α = .01? (c) Do you feel that cause and effect can be assumed? Hint: Do a Web search for information about the effects of childhood lead exposure. (Data are from H. L. Needleman et al., The New England Journal of Medicine 322, no. 2 [January 1990], p. 86.)

Dependent Variable          Estimated Slope   Std Error   p-value
Highest grade achieved      0.027             0.009       .008
Reading grade equivalent    0.070             0.018       .000
Class standing              0.006             0.003       .048
Absence from school         4.8               1.7         .006
Grammatical reasoning       0.159             0.062       .012
Vocabulary                  0.124             0.032       .000
Hand-eye coordination       0.041             0.018       .020
Reaction time               11.8              6.66        .080
Minor antisocial behavior   0.639             0.36        .082
12.57 Below are recent financial ratios for a random sample of 20 integrated health care systems. Operating Margin is total revenue minus total expenses divided by total revenue plus net operating profits. Equity Financing is fund balance divided by total assets. (a) Make a scatter plot of Y = operating margin and X = equity financing (both variables are in percent). (b) Use Excel to fit the regression, with fitted equation and R². (c) In your own words, describe the fit. (Data are from Hospitals & Health Networks 71, no. 6 [March 20, 1997], pp. 48–49. Copyright 1997 by Health Forum, Inc. Used with permission. Data are intended for statistical education and not as a guide to financial performance.)  HealthCare
12.58 Consider the following data on 20 chemical reactions, with Y = chromatographic retention time (seconds) and X = molecular weight (gm/mole). (a) Make a scatter plot. (b) Use Excel to fit the regression, with fitted equation and R². (c) In your own words, describe the fit. (Data provided by John Seeley of Oakland University.)  Chemicals
12.59 A common belief among faculty is that teaching ratings are lower in large classes. Below are MINITAB results from a regression using Y = mean student evaluation of the professor and X = class size for 364 business school classes taught during the 2002–2003 academic year. Ratings are on a scale of 1 (lowest) to 5 (highest). (a) What do these regression results tell you about the relationship between class size and faculty ratings? (b) Is a bivariate model adequate? If not, suggest additional predictors to be considered.
Predictor   Coef       SE Coef    T       P
Constant    4.18378    0.07226    57.90   0.000
Enroll      0.000578   0.002014   0.29    0.774

S = 0.5688   R-Sq = 0.0%   R-Sq(adj) = 0.0%
12.60 Below are revenue and profit (both in $ billions) for nine large entertainment companies. (a) Make a scatter plot of profit as a function of revenue. (b) Use Excel to fit the regression, with fitted equation and R². (c) In your own words, describe the fit. (Data are from Fortune 149, no. 7 [April 5, 2005], p. F-50.)  Entertainment
12.61 Below are fitted regressions based on used vehicle ads. Observed ranges of X are shown. The assumed regression model is AskingPrice = f(VehicleAge). (a) Interpret the slopes. (b) Are the intercepts meaningful? Explain. (c) Assess the fit of each model. (d) Is a bivariate model adequate to explain vehicle prices? If not, what other predictors might be considered? (Data are from Detroit's AutoFocus 4, Issue 38 [September 17–23, 2004]. Data are for educational purposes only and should not be viewed as a guide to vehicle prices.)
Vehicle             n    Min Age   Max Age
Ford Explorer       31   2         6
Ford F-150 Pickup   43   1         37
Ford Mustang        33   1         10
Ford Taurus         32   1         14
12.62 Below are results of a regression of Y = average stock returns (in percent) as a function of X = average price/earnings ratios for the period 1949–1997 (49 years). Separate regressions were done for various holding periods (sample sizes are therefore variable). (a) Summarize what the regression results tell you. (b) Would you anticipate autocorrelation in this type of data? Explain. (Data are from Ruben Trevino and Fiona Robertson, "P/E Ratios and Stock Market Returns," Journal of Financial Planning 15, no. 2 [February 2002], p. 78.)
Holding Period   Intercept   Slope   t      R²      p
1-Year           28.10       0.92    1.86   .0688   .0686
2-Year           26.11       0.86    2.57   .1252   .0136
5-Year           20.67       0.57    2.99   .1720   .0046
8-Year           24.73       0.94    6.93   .5459   .0000
10-Year          24.51       0.95    8.43   .6516   .0000
12.63 Adult height is somewhat predictable from the average height of both parents. For females, a commonly used equation is YourHeight = ParentHeight − 2.5, while for males the equation is YourHeight = ParentHeight + 2.5. (a) Test these equations on yourself (or on somebody else). (b) How well did the equations predict your height? (c) How do you suppose these equations were derived?
LS
LearningStats Unit 12 covers correlation and simple bivariate regression. It includes demonstrations of the least squares method, regression formulas, effects of model form and range of X, confidence and prediction intervals, violations of assumptions, and examples of student projects. Your instructor may assign specific modules, or you may decide to check them out because the topic sounds interesting.
Topic                                 LearningStats Modules
Correlation                           Overview of Correlation; Correlation Analysis
Regression                            Overview of Simple Regression; Using Excel for Regression
Ordinary least squares estimators     Least Squares Method Demonstration; Doing Regression Calculations; Effect of Model Form; Effect of X Range
Confidence and prediction intervals   Confidence and Prediction Intervals; Calculations for Confidence Intervals; Superimposing Many Fitted Regressions
Violations of assumptions             Non-Normal Errors; Heteroscedastic Errors; Autocorrelated Errors; Cochrane-Orcutt Transform
Formulas                              Derivation of OLS Estimators; Formulas for OLS Estimates; Formulas for Significance Tests
Student presentations                 Birth Rates; Life Expectancy and Literacy; Effects of Urbanization
Tables                                Appendix D (Student's t); Appendix F (F Distribution)
VS
Visual Statistics modules 14, 15, 16, and 18 (included on your CD) are designed with the following objectives:

Module 14
Become familiar with ways to display bivariate data. Understand measures of association in bivariate data. Be able to interpret bivariate regression statistics and assess their significance.

Module 15
Understand OLS terminology. Understand how sample size, standard error, and range of X affect estimation accuracy. Understand confidence intervals for E(y|x) and prediction intervals for y|x.

Module 16
Learn the regression assumptions required to ensure desirable properties for OLS estimators. Learn to recognize violations of the regression assumptions. Be able to identify the effects of assumption violations.

Module 18
Know the common variable transformations and their purposes. Learn the effects of variable transformations on the fitted regression and statistics of fit. Understand polynomial models.

The worktext chapter (included on the CD in .PDF format) contains a list of concepts covered, objectives of the module, an overview of concepts, illustration of concepts, orientation to module features, learning exercises (basic, intermediate, advanced), learning projects (individual, team), a self-evaluation quiz, a glossary of terms, and solutions to the self-evaluation quiz.