

Thus, the intercept, $\log a_2$, equals $-0.300$, and therefore, by taking the antilogarithm, $a_2 = 10^{-0.3} = 0.5$. The slope is $b_2 = 1.75$. Consequently, the power equation is

$$y = 0.5x^{1.75}$$

This curve, as plotted in Fig. 17.10a, indicates a good fit.

17.1.6 General Comments on Linear Regression


Before proceeding to curvilinear and multiple linear regression, we must emphasize the introductory nature of the foregoing material on linear regression. We have focused on the simple derivation and practical use of equations to fit data. You should be aware that there are theoretical aspects of regression that are of practical importance but are beyond the scope of this book. For example, some statistical assumptions that are inherent in the linear least-squares procedures are
1. Each x has a fixed value; it is not random and is known without error.
2. The y values are independent random variables and all have the same variance.
3. The y values for a given x must be normally distributed.
Such assumptions are relevant to the proper derivation and use of regression. For example, the first assumption means that (1) the x values must be error-free and (2) the regression of y versus x is not the same as x versus y (try Prob. 17.4 at the end of the chapter). You are urged to consult other references such as Draper and Smith (1981) to appreciate aspects and nuances of regression that are beyond the scope of this book.

17.2 POLYNOMIAL REGRESSION


In Sec. 17.1, a procedure was developed to derive the equation of a straight line using the least-squares criterion. Some engineering data, although exhibiting a marked pattern such as seen in Fig. 17.8, are poorly represented by a straight line. For these cases, a curve would be better suited to fit the data. As discussed in the previous section, one method to accomplish this objective is to use transformations. Another alternative is to fit polynomials to the data using polynomial regression.
The least-squares procedure can be readily extended to fit the data to a higher-order polynomial. For example, suppose that we fit a second-order polynomial or quadratic:

$$y = a_0 + a_1 x + a_2 x^2 + e$$
For this case the sum of the squares of the residuals is [compare with Eq. (17.3)]
$$S_r = \sum_{i=1}^{n} \left( y_i - a_0 - a_1 x_i - a_2 x_i^2 \right)^2 \qquad (17.18)$$

Following the procedure of the previous section, we take the derivative of Eq. (17.18) with respect to each of the unknown coefficients of the polynomial, as in

$$\frac{\partial S_r}{\partial a_0} = -2 \sum \left( y_i - a_0 - a_1 x_i - a_2 x_i^2 \right)$$


$$\frac{\partial S_r}{\partial a_1} = -2 \sum x_i \left( y_i - a_0 - a_1 x_i - a_2 x_i^2 \right)$$

$$\frac{\partial S_r}{\partial a_2} = -2 \sum x_i^2 \left( y_i - a_0 - a_1 x_i - a_2 x_i^2 \right)$$
These equations can be set equal to zero and rearranged to develop the following set of
normal equations:
$$(n)\,a_0 + \left( \sum x_i \right) a_1 + \left( \sum x_i^2 \right) a_2 = \sum y_i$$

$$\left( \sum x_i \right) a_0 + \left( \sum x_i^2 \right) a_1 + \left( \sum x_i^3 \right) a_2 = \sum x_i y_i \qquad (17.19)$$

$$\left( \sum x_i^2 \right) a_0 + \left( \sum x_i^3 \right) a_1 + \left( \sum x_i^4 \right) a_2 = \sum x_i^2 y_i$$

where all summations are from i = 1 through n. Note that the above three equations are linear and have three unknowns: $a_0$, $a_1$, and $a_2$. The coefficients of the unknowns can be calculated directly from the observed data.
For this case, we see that the problem of determining a least-squares second-order
polynomial is equivalent to solving a system of three simultaneous linear equations.
Techniques to solve such equations were discussed in Part Three.
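The normal equations translate directly into a small linear-algebra computation. The following is a minimal Python sketch (the function name fit_quadratic and the use of NumPy's linalg.solve are illustrative choices, not part of the text) that assembles the summations of Eq. (17.19) and solves the resulting 3 x 3 system:

```python
import numpy as np

def fit_quadratic(x, y):
    """Fit y = a0 + a1*x + a2*x^2 by forming and solving the
    normal equations of Eq. (17.19)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Coefficient matrix built from the required power sums.
    A = np.array([
        [n,            x.sum(),      (x**2).sum()],
        [x.sum(),      (x**2).sum(), (x**3).sum()],
        [(x**2).sum(), (x**3).sum(), (x**4).sum()],
    ])
    # Right-hand-side vector of Eq. (17.19).
    b = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])
    # Solve the 3 x 3 linear system; numerically this plays the role of
    # the Gauss elimination discussed in Part Three.
    return np.linalg.solve(A, b)   # returns [a0, a1, a2]
```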
The two-dimensional case can be easily extended to an mth-order polynomial as

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m + e$$

The foregoing analysis can be easily extended to this more general case. Thus, we can recognize that determining the coefficients of an mth-order polynomial is equivalent to solving a system of m + 1 simultaneous linear equations. For this case, the standard error is formulated as

$$s_{y/x} = \sqrt{\frac{S_r}{n - (m + 1)}} \qquad (17.20)$$

This quantity is divided by n − (m + 1) because (m + 1) data-derived coefficients, $a_0, a_1, \ldots, a_m$, were used to compute $S_r$; thus, we have lost m + 1 degrees of freedom. In addition to the standard error, a coefficient of determination can also be computed for polynomial regression with Eq. (17.10).
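Assuming, as in Sec. 17.1, that Eq. (17.10) defines the coefficient of determination as $r^2 = (S_t - S_r)/S_t$, both error measures can be evaluated with a short routine. The sketch below is illustrative only (the helper name polynomial_error_measures is hypothetical):

```python
import numpy as np

def polynomial_error_measures(x, y, coeffs):
    """Given coefficients [a0, a1, ..., am] of an mth-order least-squares
    polynomial, return the standard error s_y/x of Eq. (17.20) and the
    coefficient of determination r^2 of Eq. (17.10)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(coeffs) - 1
    # Residual sum of squares Sr about the regression polynomial.
    y_hat = sum(a * x**j for j, a in enumerate(coeffs))
    Sr = np.sum((y - y_hat) ** 2)
    # Total sum of squares St about the mean of y.
    St = np.sum((y - y.mean()) ** 2)
    s_yx = np.sqrt(Sr / (n - (m + 1)))   # Eq. (17.20)
    r2 = (St - Sr) / St                  # Eq. (17.10)
    return s_yx, r2
```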

EXAMPLE 17.5 Polynomial Regression


Problem Statement. Fit a second-order polynomial to the data in the first two columns of Table 17.4.

Solution. From the given data,

$m = 2$          $\sum x_i = 15$        $\sum x_i^4 = 979$
$n = 6$          $\sum y_i = 152.6$     $\sum x_i y_i = 585.6$
$\bar{x} = 2.5$      $\sum x_i^2 = 55$      $\sum x_i^2 y_i = 2488.8$
$\bar{y} = 25.433$   $\sum x_i^3 = 225$


TABLE 17.4 Computations for an error analysis of the quadratic least-squares fit.

xi   yi      (yi − ȳ)²   (yi − a0 − a1xi − a2xi²)²
0    2.1     544.44      0.14332
1    7.7     314.47      1.00286
2    13.6    140.03      1.08158
3    27.2    3.12        0.80491
4    40.9    239.22      0.61951
5    61.1    1272.11     0.09439
Σ    152.6   2513.39     3.74657

[Plot of the data points and the least-squares parabola.]

FIGURE 17.11
Fit of a second-order polynomial.

Therefore, the simultaneous linear equations are

$$\begin{bmatrix} 6 & 15 & 55 \\ 15 & 55 & 225 \\ 55 & 225 & 979 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} 152.6 \\ 585.6 \\ 2488.8 \end{Bmatrix}$$

Solving these equations through a technique such as Gauss elimination gives $a_0 = 2.47857$, $a_1 = 2.35929$, and $a_2 = 1.86071$. Therefore, the least-squares quadratic equation for this case is

$$y = 2.47857 + 2.35929x + 1.86071x^2$$
The standard error of the estimate based on the regression polynomial is [Eq. (17.20)]

$$s_{y/x} = \sqrt{\frac{3.74657}{6 - 3}} = 1.12$$


The coefficient of determination is

$$r^2 = \frac{2513.39 - 3.74657}{2513.39} = 0.99851$$

and the correlation coefficient is r = 0.99925.
These results indicate that 99.851 percent of the original uncertainty has been explained by the model. This result supports the conclusion that the quadratic equation represents an excellent fit, as is also evident from Fig. 17.11.
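For reference, feeding the first two columns of Table 17.4 into the illustrative helpers sketched earlier in this section reproduces the values reported above:

```python
# Data from the first two columns of Table 17.4.
x = [0, 1, 2, 3, 4, 5]
y = [2.1, 7.7, 13.6, 27.2, 40.9, 61.1]

a = fit_quadratic(x, y)
# a is approximately [2.47857, 2.35929, 1.86071], as in the text.

s_yx, r2 = polynomial_error_measures(x, y, a)
# s_yx is approximately 1.12 and r2 approximately 0.99851, as in the text.
```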



PROBLEMS
Determine (a) the mean, (b) the standard deviation, (c) the variance, (d) the coefficient of variation, and (e) the 90% confidence interval for the mean. (f) Construct a histogram. Use a range from 28 to 34 with increments of 0.4. (g) Assuming that the distribution is normal and that your estimate of the standard deviation is valid, compute the range (that is, the lower and the upper values) that encompasses 68% of the readings. Determine whether this is a valid estimate for the data in this problem.

17.3 Use least-squares regression to fit a straight line to

x  0  2  4  6  9  11  12  15  17  19
y  5  6  7  6  9  8   7   10  12  12

Along with the slope and intercept, compute the standard error of the estimate and the correlation coefficient. Plot the data and the regression line. Then repeat the problem, but regress x versus y, that is, switch the variables. Interpret your results.

17.4 Use least-squares regression to fit a straight line to

x  6   7   11  15  17  21  23  29  29  37  39
y  29  21  29  14  21  15  7   7   13  0   3

Along with the slope and the intercept, compute the standard error of the estimate and the correlation coefficient. Plot the data and the regression line. If someone made an additional measurement of x = 10, y = 10, would you suspect, based on a visual assessment and the standard error, that the measurement was valid or faulty? Justify your conclusion.

17.5 Using the same approach as was employed to derive Eqs. (17.15) and (17.16), derive the least-squares fit of the following model:

$$y = a_1 x + e$$

That is, determine the slope that results in the least-squares fit for a straight line with a zero intercept. Fit the following data with this model and display the result graphically:

x  2  4  6  7  10  11  14  17  20
y  1  2  5  2  8   7   6   9   12

17.6 Use least-squares regression to fit a straight line to

x  1  2    3  4  5  6  7  8   9
y  1  1.5  2  3  4  5  8  10  13

(a) Along with the slope and intercept, compute the standard error of the estimate and the correlation coefficient. Plot the data and the straight line. Assess the fit.
(b) Recompute (a), but use polynomial regression to fit a parabola to the data. Compare the results with those of (a).

17.7 Fit the following data with (a) a saturation-growth-rate model, (b) a power equation, and (c) a parabola. In each case, plot the data and the equation.

x  0.75  2     3  4    6    8    8.5
y  1.2   1.95  2  2.4  2.4  2.7  2.6

17.8 Fit the following data with the power model ($y = ax^b$). Use the resulting power equation to predict y at x = 9:

x  2.5  3.5  5    6    7.5  10   12.5  15   17.5  20
y  13   11   8.5  8.2  7    6.2  5.2   4.8  4.6   4.3

17.9 Fit an exponential model to

x  0.4  0.8  1.2   1.6   2     2.3
y  800  975  1500  1950  2900  3600

Plot the data and the equation on both standard and semi-logarithmic graph paper.

17.10 Rather than using the base-e exponential model (Eq. 17.22), a common alternative is to use a base-10 model,

$$y = a_5 10^{b_5 x}$$

When used for curve fitting, this equation yields identical results to the base-e version, but the value of the exponent parameter ($b_5$) will differ from that estimated with Eq. 17.22 ($b_1$). Use the base-10 version to solve Prob. 17.9. In addition, develop a formulation to relate $b_1$ to $b_5$.

17.11 Beyond the examples in Fig. 17.10, there are other models that can be linearized using transformations. For example,

$$y = a_4 x e^{b_4 x}$$


Linearize this model and use it to estimate $a_4$ and $b_4$ based on the following data. Develop a plot of your fit along with the data.

x  0.1   0.2   0.4   0.6   0.9   1.3   1.5   1.7   1.8
y  0.75  1.25  1.45  1.25  0.85  0.55  0.35  0.28  0.18

17.12 An investigator has reported the data tabulated below for an experiment to determine the growth rate of bacteria k (per d), as a function of oxygen concentration c (mg/L). It is known that such data can be modeled by the following equation:

$$k = \frac{k_{max} c^2}{c_s + c^2}$$

where $c_s$ and $k_{max}$ are parameters. Use a transformation to linearize this equation. Then use linear regression to estimate $c_s$ and $k_{max}$ and predict the growth rate at c = 2 mg/L.

c  0.5  0.8  1.5  2.5  4
k  1.1  2.4  5.3  7.6  8.9

17.13 An investigator has reported the data tabulated below. It is known that such data can be modeled by the following equation:

$$x = e^{(y - b)/a}$$

where a and b are parameters. Use a transformation to linearize this equation and then employ linear regression to determine a and b. Based on your analysis predict y at x = 2.6.

x  1    2  3    4    5
y  0.5  2  2.9  3.5  4

17.14 It is known that the data tabulated below can be modeled by the following equation:

$$y = \left( \frac{a + \sqrt{x}}{b \sqrt{x}} \right)^2$$

Use a transformation to linearize this equation and then employ linear regression to determine the parameters a and b. Based on your analysis predict y at x = 1.6.

x  0.5   1    2    3    4
y  10.4  5.8  3.3  2.4  2

17.15 The following data are provided

x  1    2    3    4    5
y  2.2  2.8  3.6  4.5  5.5

You want to use least-squares regression to fit these data with the following model,

$$y = a + bx + \frac{c}{x}$$

Determine the coefficients by setting up and solving Eq. (17.25).

17.16 Given these data

x  5   10  15  20  25  30  35  40  45  50
y  17  24  31  33  37  37  40  40  42  41

use least-squares regression to fit (a) a straight line, (b) a power equation, (c) a saturation-growth-rate equation, and (d) a parabola. Plot the data along with all the curves. Is any one of the curves superior? If so, justify.

17.17 Fit a cubic equation to the following data:

x  3    4    5    7    8    9    11   12
y  1.6  3.6  4.4  3.4  2.2  2.8  3.8  4.6

Along with the coefficients, determine $r^2$ and $s_{y/x}$.

17.18 Use multiple linear regression to fit

x1  0     1     1     2     2     3     3     4     4
x2  0     1     2     1     2     1     2     1     2
y   15.1  17.9  12.7  25.6  20.5  35.1  29.7  45.4  40.2

Compute the coefficients, the standard error of the estimate, and the correlation coefficient.

17.19 Use multiple linear regression to fit

x1  0   0   1   2   0   1   2   2  1
x2  0   2   2   4   4   6   6   2  1
y   14  21  11  12  23  23  14  6  11

Compute the coefficients, the standard error of the estimate, and the correlation coefficient.

17.20 Use nonlinear regression to fit a parabola to the following data:

x  0.2  0.5  0.8   1.2   1.7   2     2.3
y  500  700  1000  1200  2200  2650  3750

17.21 Use nonlinear regression to fit a saturation-growth-rate equation to the data in Prob. 17.16.

17.22 Recompute the regression fits from Probs. (a) 17.3 and (b) 17.17, using the matrix approach. Estimate the standard errors and develop 90% confidence intervals for the coefficients.

17.23 Develop, debug, and test a program in either a high-level language or macro language of your choice to implement linear regression. Among other things: (a) include statements to document the code, and (b) determine the standard error and the coefficient of determination.

17.24 A material is tested for cyclic fatigue failure whereby a stress, in MPa, is applied to the material and the number of cycles needed to cause failure is measured. The results are in the table below. When a log-log plot of stress versus cycles is generated, the


data trend shows a linear relationship. Use least-squares regression to determine a best-fit equation for these data.

N, cycles    1     10    100  1000  10,000  100,000  1,000,000
Stress, MPa  1100  1000  925  800   625     550      420

17.25 The following data show the relationship between the viscosity of SAE 70 oil and temperature. After taking the log of the data, use linear regression to find the equation of the line that best fits the data and the $r^2$ value.

Temperature, °C       26.67  93.33  148.89  315.56
Viscosity, μ, N·s/m²  1.35   0.085  0.012   0.00075

17.26 The data below represent the bacterial growth in a liquid culture over a number of days.

Day           0   4   8   12   16   20
Amount × 10⁶  67  84  98  125  149  185

Find a best-fit equation to the data trend. Try several possibilities: linear, parabolic, and exponential. Use the software package of your choice to find the best equation to predict the amount of bacteria after 40 days.

17.27 The concentration of E. coli bacteria in a swimming area is monitored after a storm:

t (hr)          4     8     12    16   20   24
c (CFU/100 mL)  1600  1320  1000  890  650  560

The time is measured in hours following the end of the storm and the unit CFU is a "colony forming unit." Use these data to estimate (a) the concentration at the end of the storm (t = 0) and (b) the time at which the concentration will reach 200 CFU/100 mL. Note that your choice of model should be consistent with the fact that negative concentrations are impossible and that the bacteria concentration always decreases with time.

17.28 An object is suspended in a wind tunnel and the force measured for various levels of wind velocity. The results are tabulated below.

v, m/s  10  20  30   40   50   60    70   80
F, N    25  70  380  550  610  1220  830  1450

Use least-squares regression to fit these data with (a) a straight line, (b) a power equation based on log transformations, and (c) a power model based on nonlinear regression. Display the results graphically.

17.29 Fit a power model to the data from Prob. 17.28, but use natural logarithms to perform the transformations.

17.30 Derive the least-squares fit of the following model:

$$y = a_1 x + a_2 x^2 + e$$

That is, determine the coefficients that result in the least-squares fit for a second-order polynomial with a zero intercept. Test the approach by using it to fit the data from Prob. 17.28.

17.31 In Prob. 17.11 we used transformations to linearize and fit the following model:

$$y = a_4 x e^{b_4 x}$$

Use nonlinear regression to estimate $a_4$ and $b_4$ based on the following data. Develop a plot of your fit along with the data.

x  0.1   0.2   0.4   0.6   0.9   1.3   1.5   1.7   1.8
y  0.75  1.25  1.45  1.25  0.85  0.55  0.35  0.28  0.18
