Nonlinear Regression
Andreas Ruckstuhl
IDP Institut für Datenanalyse und Prozessdesign
ZHAW Zürcher Hochschule für Angewandte Wissenschaften
October 2010
Contents
1. The Nonlinear Regression Model
2. Methodology for Parameter Estimation
3. Approximate Tests and Confidence Intervals
4. More Precise Tests and Confidence Intervals
5. Profile t-Plot and Profile Traces
6. Parameter Transformations
7. Forecasts and Calibration
8. Closing Comments
A. Gauss-Newton Method
The author thanks Werner Stahel for his valuable comments.
E-Mail Address: [email protected]; Internet: http://www.idp.zhaw.ch
1. The Nonlinear Regression Model
Goals
The nonlinear regression model block in the Weiterbildungslehrgang (WBL) in angewandter Statistik at the ETH Zurich should
1. introduce problems that are relevant to the fitting of nonlinear regression functions,
2. present graphical representations for assessing the quality of approximate confidence intervals, and
3. introduce some parts of the statistics software R that can help with solving concrete problems.
b The Linear Regression Model. In (multiple) linear regression, functions h are considered that are linear in the parameters $\theta_j$,
$$h\langle x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)};\, \theta_1, \theta_2, \dots, \theta_p\rangle = \theta_1 \tilde{x}_i^{(1)} + \theta_2 \tilde{x}_i^{(2)} + \dots + \theta_p \tilde{x}_i^{(p)}\,,$$
where the $\tilde{x}^{(j)}$ can be arbitrary functions of the original explanatory variables $x^{(j)}$. (Here the parameters are usually denoted as $\beta_j$ instead of $\theta_j$.)
Example d Puromycin. The speed with which an enzymatic reaction occurs depends on the concentration of a substrate. According to the information from Bates and Watts (1988), it was examined how the treatment of the enzyme with an additional substance called Puromycin influences this reaction speed. The initial speed of the reaction is chosen as the variable of interest; it is measured via radioactivity. (The unit of the variable of interest is counts/min²; the number of registrations on a Geiger counter per time period measures the quantity of the substance present, and the reaction speed is proportional to the change per time unit.)
Figure 1.d: Puromycin example. (a) Data (reaction velocity versus substrate concentration) and (b) typical shape of the regression function.
The relationship of the variable of interest with the substrate concentration x (in ppm)
is described via the Michaelis-Menten function
$$h\langle x; \theta\rangle = \frac{\theta_1\, x}{\theta_2 + x}\,.$$
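The following is a minimal sketch, not part of the original text, of how such a model can be fitted in R with nls. It assumes R's built-in Puromycin data set (which contains the Bates and Watts data with columns conc, rate and state); the parameter names T1 and T2 are illustrative stand-ins for $\theta_1$ and $\theta_2$.

```r
## Minimal sketch: fit the Michaelis-Menten function h<x; theta> = T1*x/(T2 + x)
## to the treated part of R's built-in Puromycin data (columns conc, rate, state).
data(Puromycin)
treated <- subset(Puromycin, state == "treated")

fit <- nls(rate ~ T1 * conc / (T2 + conc),
           data  = treated,
           start = c(T1 = 200, T2 = 0.1))   # rough starting values
summary(fit)   # estimates, approximate standard errors, t values
```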
Example e Oxygen Consumption. To determine the biochemical oxygen consumption, river water samples were enriched with dissolved organic nutrients, with inorganic materials, and with dissolved oxygen, and were bottled in different bottles (Marske, 1967, see Bates and Watts (1988)). Each bottle was then inoculated with a mixed culture of microorganisms and sealed in a climate chamber with constant temperature. The bottles were periodically opened and their dissolved oxygen content was analyzed. From this the biochemical oxygen consumption [mg/l] was calculated. The model used to connect the cumulative biochemical oxygen consumption $Y$ with the incubation time $x$ is based on exponential decay, which leads to
$$h\langle x; \theta\rangle = \theta_1\bigl(1 - e^{-\theta_2 x}\bigr)\,.$$
Figure 1.e shows the data and the regression function to be applied.
Figure 1.e: Oxygen consumption example. (a) Data and (b) typical shape of the regression
function.
Figure 1.f: Membrane separation technology example. (a) Data ($y$ = chemical shift versus $x$ = pH) and (b) a typical shape of the regression function.
where the unknown parameters are $\theta_1$, $\theta_2$, and $\theta_3 > 0$ and $\theta_4 < 0$. Solving for $y$ leads to the model
$$Y_i = h\langle x_i; \theta\rangle + E_i = \frac{\theta_1 + \theta_2\, 10^{\theta_3 + \theta_4 x_i}}{1 + 10^{\theta_3 + \theta_4 x_i}} + E_i\,.$$
The regression function $h\langle x_i; \theta\rangle$ for a reasonably chosen $\theta$ is shown in Figure 1.f next to the data.
Since useful regression functions are often derived from the theory of the application
area in question, a general overview of nonlinear regression functions is of limited
benefit. A compilation of functions from publications can be found in Appendix 7 of
Bates and Watts (1988).
Here are some more linearizable functions (also see Daniel and Wood, 1980):
$h\langle x; \theta\rangle = 1/(\theta_1 + \theta_2 \exp\langle -x\rangle)$   ⟷   $1/h\langle x; \theta\rangle = \theta_1 + \theta_2 \exp\langle -x\rangle$
$h\langle x; \theta\rangle = \theta_1\, x/(\theta_2 + x)$   ⟷   $1/h\langle x; \theta\rangle = 1/\theta_1 + (\theta_2/\theta_1)\, x^{-1}$
$h\langle x; \theta\rangle = \theta_1\, x^{\theta_2}$   ⟷   $\ln\langle h\langle x; \theta\rangle\rangle = \ln\langle\theta_1\rangle + \theta_2 \ln\langle x\rangle$
$h\langle x; \theta\rangle = \theta_1 \exp\langle \theta_2\, g\langle x\rangle\rangle$   ⟷   $\ln\langle h\langle x; \theta\rangle\rangle = \ln\langle\theta_1\rangle + \theta_2\, g\langle x\rangle$
$h\langle x; \theta\rangle = \exp\langle -\theta_1\, x^{(1)} \exp\langle -\theta_2/x^{(2)}\rangle\rangle$   ⟷   $\ln\langle -\ln\langle h\langle x; \theta\rangle\rangle\rangle = \ln\langle\theta_1\rangle + \ln\langle x^{(1)}\rangle - \theta_2/x^{(2)}$
$h\langle x; \theta\rangle = \theta_1\, (x^{(1)})^{\theta_2} (x^{(2)})^{\theta_3}$   ⟷   $\ln\langle h\langle x; \theta\rangle\rangle = \ln\langle\theta_1\rangle + \theta_2 \ln\langle x^{(1)}\rangle + \theta_3 \ln\langle x^{(2)}\rangle\,.$
i The Statistically Complete Model. A linear regression with the linearized regression function in the referred-to example is based on the model
$$\ln\langle Y_i\rangle = \beta_0 + \beta_1 \tilde{x}_i + E_i\,,$$
where the random errors $E_i$ all have the same normal distribution. If we transform this model back, we get
$$Y_i = \theta_1\, x^{\theta_2}\cdot \widetilde{E}_i$$
with $\widetilde{E}_i = \exp\langle E_i\rangle$. The errors $\widetilde{E}_i$, $i = 1, \dots, n$, now contribute multiplicatively and are lognormally distributed! The assumptions about the random deviations are thus now drastically different than for a model that is based directly on $h$,
$$Y_i = \theta_1\, x^{\theta_2} + E_i$$
with random deviations $E_i$ that, as usual, contribute additively and have a specific normal distribution.
j * Note: In linear regression it has been shown that the variance can be stabilized with certain transformations (e.g., $\log\langle\cdot\rangle$, $\sqrt{\cdot}$). If this is not possible, in certain circumstances one can also perform a weighted linear regression. The process is analogous in nonlinear regression.
2. Methodology for Parameter Estimation
c Solution Approach for the Minimization Problem. The main idea of the usual algorithm for minimizing the sum of squared deviations (see 2.a) goes as follows: If a preliminary best value $\theta^{(\ell)}$ exists, we approximate the model surface with the plane that touches the surface at the point $\eta\langle\theta^{(\ell)}\rangle = h\langle x;\, \theta^{(\ell)}\rangle$. Now we seek the point in this plane that lies closest to $Y$. This amounts to the estimation in a linear regression problem. This new point lies on the plane, but not on the surface that corresponds to the nonlinear problem. However, it determines a parameter vector $\theta^{(\ell+1)}$, and with this we go into the next round of iteration.
$$\eta_i\langle\theta\rangle \approx \eta_i\langle\theta^*\rangle + A_i^{(1)}\langle\theta^*\rangle\,(\theta_1 - \theta_1^*) + \dots + A_i^{(p)}\langle\theta^*\rangle\,(\theta_p - \theta_p^*)$$
$$\widetilde{Y} = A\langle\theta^*\rangle\,\beta + E$$
* For convenience, algorithms determine the derivative matrix A numerically. In more complex problems the numerical approximation can be insufficient and cause convergence problems. It is then advantageous if expressions for the partial derivatives can be arrived at analytically. With these, the derivative matrix can be calculated reliably and the procedure is more likely to converge (see also Chapter 6).
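If analytic derivatives are wanted, R can construct them symbolically with deriv(); the following is a small sketch (not from the original text) for the Michaelis-Menten function of the Puromycin example, with illustrative parameter names T1 and T2.

```r
## Sketch: derive the partial derivatives of h<x; theta> = T1*x/(T2 + x)
## symbolically. The returned function evaluates h and attaches the derivatives
## as a "gradient" attribute, which nls() uses instead of numerical differencing.
mm <- deriv(~ T1 * conc / (T2 + conc), c("T1", "T2"),
            function(conc, T1, T2) {})

fit.grad <- nls(rate ~ mm(conc, T1, T2),
                data  = Puromycin, subset = state == "treated",
                start = c(T1 = 200, T2 = 0.1))
```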
g Initial Value from Prior Knowledge. As already noted in the introduction, nonlinear models are often based on theoretical considerations from the application area in question. Already existing prior knowledge from similar experiments can be used to get an initial value. To be sure that the chosen start value fits, it is advisable to graphically represent the regression function $h\langle x; \theta\rangle$ for various possible starting values $\theta = \theta^{(0)}$ together with the data (e.g., as in Figure 2.h, right).
h Start Values via Linearizable Regression Functions. Often, because of the distribution of the error, one is forced to remain with the nonlinear form in models with
linearizable regression functions. However, the linearized model can deliver starting
values.
In the Puromycin example the regression function is linearizable: the reciprocal values of the two variables fulfill
$$\tilde{y} = \frac{1}{y} = \frac{1}{h\langle x; \theta\rangle} = \frac{1}{\theta_1} + \frac{\theta_2}{\theta_1}\cdot\frac{1}{x} = \beta_0 + \beta_1 \tilde{x}\,.$$
The least squares solution for this modified problem is $\widehat{\beta} = [\widehat{\beta}_0, \widehat{\beta}_1]^T = (0.00511, 0.000247)^T$ (Figure 2.h (a)). This gives the initial values
$$\theta_1^{(0)} = 1/\widehat{\beta}_0 = 196\,, \qquad \theta_2^{(0)} = \widehat{\beta}_1/\widehat{\beta}_0 = 0.048\,.$$
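A minimal R sketch of this linearized fit, not part of the original text, reusing the Puromycin subset 'treated' from the earlier sketch; names and starting values are illustrative.

```r
## Sketch: starting values from the linearized (reciprocal) relationship
## 1/y = beta0 + beta1 * (1/x), fitted by ordinary least squares.
lin <- lm(1/rate ~ I(1/conc), data = treated)
b   <- coef(lin)

start <- c(T1 = unname(1/b[1]),        # theta1^(0) = 1/beta0
           T2 = unname(b[2]/b[1]))     # theta2^(0) = beta1/beta0
start                                   # roughly (196, 0.048), cf. the text
fit <- nls(rate ~ T1 * conc / (T2 + conc), data = treated, start = start)
```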
Figure 2.h: Puromycin example. (a) Regression line for the linearized problem (1/velocity versus 1/concentration). (b) Data and regression function $h\langle x; \theta\rangle$ for starting values.
i Initial Values via Geometric Meaning of the Parameter. It is often helpful to consider the geometrical features of the regression function.
In the Puromycin example we can thus arrive at an initial value in another, instructive way: $\theta_1$ is the $y$ value for $x = \infty$. Since the regression function is monotonically increasing, we can use the maximal $y_i$-value or a visually determined asymptotic value $\theta_1^{(0)} = 207$ as initial value for $\theta_1$. The parameter $\theta_2$ is the $x$-value at which $y$ reaches half of the asymptotic value $\theta_1$. This gives $\theta_2^{(0)} = 0.06$.
The initial values thus result from the geometrical meaning of the parameters and a coarse determination of the corresponding aspects of a curve fitted by eye.
Figure 2.j: Membrane separation technology example. (a) Regression line used for determining the initial values for $\theta_3$ and $\theta_4$. (b) Regression function $h\langle x; \theta\rangle$ for the initial value $\theta = \theta^{(0)}$ and for the least squares estimate $\theta = \widehat{\theta}$.
3. Approximate Tests and Confidence Intervals
the set of all these values. For an individual parameter $\theta_j$ the confidence region is the confidence interval.
The results that now follow are based on the fact that the estimator $\widehat{\theta}$ is asymptotically multivariate normally distributed. For an individual parameter that leads to a z-test and the corresponding confidence interval; for several parameters the corresponding chi-square test applies and gives elliptical confidence regions.
b The asymptotic properties of the estimator can be derived from the linear approximation. The problem of nonlinear regression is indeed approximately equal to the linear regression problem mentioned in 2.d,
$$\widetilde{Y} = A\langle\theta^*\rangle\,\beta + E\,,$$
if the parameter vector $\theta^*$ that is used for the linearization lies near the solution. If the estimation procedure has converged (i.e. $\theta^* = \widehat{\theta}$), then $\widehat{\beta} = 0$; otherwise this would not be the solution. The standard errors of the coefficients $\widehat{\beta}$ and, more generally, the covariance matrix of $\widehat{\beta}$ then correspond approximately to the corresponding values for $\widehat{\theta}$.
* A bit more precisely: The standard errors characterize the uncertainties that are generated by the random fluctuations in the data. The available data have led to the estimation value $\widehat{\theta}$. If the data were somewhat different, then $\widehat{\theta}$ would still be approximately correct, thus we accept that it is good enough for the linearization. The estimation of $\beta$ for the new data set would then lie as far from the estimated value for the available data as corresponds to the distribution of the parameter in the linearized problem.
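In R these quantities can be read directly off a fitted nls object; a tiny sketch (not from the original text), where 'fit' denotes a fit such as the Puromycin one from the earlier sketches:

```r
## Sketch: quantities of the linear approximation at the solution.
vcov(fit)               # approximate covariance matrix of the estimator
sqrt(diag(vcov(fit)))   # approximate standard errors of the parameters
summary(fit)$sigma      # estimated standard deviation of the errors E_i
```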
Example d Membrane Separation Technology. A computer output for the membrane separation example is shown in Table 3.d. The estimates of the parameters are in the column Value, followed by the estimated approximate standard errors and the test statistics (t value), which are approximately $t_{n-p}$ distributed. In the last row the estimated standard deviation $\widehat{\sigma}$ of the random error $E_i$ is given.
From this output the confidence intervals for the parameters can be determined as in linear regression: the approximate 95% confidence interval for the parameter $\theta_1$ is
$$163.706 \pm q^{t_{35}}_{0.975}\cdot 0.1262 = 163.706 \pm 0.256\,.$$
Example e Puromycin. To check the influence of treating an enzyme with Puromycin on the postulated relationship (1.d), a general model for the data with and without the treatment can be formulated as follows:
$$Y_i = \frac{(\theta_1 + \theta_3 z_i)\, x_i}{\theta_2 + \theta_4 z_i + x_i} + E_i\,,$$
where $z$ is the indicator variable for the treatment ($z_i = 1$ if treated, $z_i = 0$ otherwise).
Table 3.e shows that the parameter $\theta_4$ is not significantly different from 0 at the 5% level, since the p value of 0.167 is larger than the level (5%). However, the treatment has a clear influence, which is expressed through $\theta_3$; the 95% confidence interval covers the region $52.398 \pm 9.5513\cdot 2.09 = [32.4, 72.4]$ (the value 2.09 corresponds to the 0.975 quantile of the $t_{19}$ distribution).
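A sketch, not from the original text, of how this general model might be fitted in R with the full built-in Puromycin data; the indicator z is derived from the state factor, and parameter names and starting values are illustrative.

```r
## Sketch: Michaelis-Menten model with treatment effect on both parameters,
## Y = (T1 + T3*z) * x / (T2 + T4*z + x) + E, with z = 1 for treated samples.
Puromycin$z <- as.numeric(Puromycin$state == "treated")

fit.general <- nls(rate ~ (T1 + T3 * z) * conc / (T2 + T4 * z + conc),
                   data  = Puromycin,
                   start = c(T1 = 160, T2 = 0.05, T3 = 40, T4 = 0))
summary(fit.general)   # approximate t tests for the parameters (cf. Table 3.e)
```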
f Confidence Intervals for Function Values. Besides the parameters, the function value $h\langle x_0, \theta\rangle$ for a given $x_0$ is of interest. In linear regression the function value $h\langle x_0, \beta\rangle = x_0^T\beta =: \eta_0$ is estimated by $\widehat{\eta}_0 = x_0^T\widehat{\beta}$, and the estimated $(1-\alpha)$ confidence interval for it is
$$\widehat{\eta}_0 \pm q^{t_{n-p}}_{1-\alpha/2}\cdot \mathrm{se}\langle\widehat{\eta}_0\rangle \quad\text{with}\quad \mathrm{se}\langle\widehat{\eta}_0\rangle = \widehat{\sigma}\sqrt{x_o^T (X^T X)^{-1} x_o}\,.$$
In nonlinear regression the analogous linear approximation applies,
$$\widehat{\eta}_0 = \eta_0\langle\widehat{\theta}\rangle \approx \eta_0\langle\theta\rangle + a_o^T (\widehat{\theta} - \theta) \quad\text{with}\quad a_o = \frac{\partial h\langle x_o, \theta\rangle}{\partial\theta}\,.$$
(If $x_0$ is equal to an observed $x_i$, then $a_o$ equals the corresponding row of the matrix $A$ from 2.d.) The confidence interval for the function value $\eta_0\langle\theta\rangle := h\langle x_0, \theta\rangle$ is then approximately
$$\eta_0\langle\widehat{\theta}\rangle \pm q^{t_{n-p}}_{1-\alpha/2}\cdot \mathrm{se}\bigl\langle\eta_0\langle\widehat{\theta}\rangle\bigr\rangle \quad\text{with}\quad \mathrm{se}\bigl\langle\eta_0\langle\widehat{\theta}\rangle\bigr\rangle = \widehat{\sigma}\,\sqrt{\widehat{a}_o^T\bigl(\widehat{A}^T\widehat{A}\bigr)^{-1}\widehat{a}_o}\,.$$
In this formula, again, the unknown values are replaced by their estimates.
h Prediction Interval. The considered confidence band indicates where the ideal function values $h\langle x\rangle$, and thus the expected values of $Y$ for given $x$, lie. The question of the region in which future observations $Y_0$ for given $x_0$ will lie is not answered by this.
However, this is often more interesting than the question of the ideal function value;
for example, we would like to know in which region the measured value of oxygen
consumption would lie for an incubation time of 6 days.
Such a statement is a prediction about a random variable and is different in principle
from a confidence interval, which says something about a parameter, which is a fixed
but unknown number. Corresponding to the question posed, we call the region we
are now seeking a prediction interval or prognosis interval. More about this in
Chapter 7.
4. More Precise Tests and Confidence Intervals
ters in the nonlinear regression model can appropriately describe the data (see the Puromycin example).
b F-Test for Model Comparison. To test a null hypothesis $\theta = \theta^*$ for the whole parameter vector, or $\theta_j = \theta_j^*$ for an individual component, we can use an F-test for model comparison as in linear regression. Here we compare the sum of squares $S\langle\theta^*\rangle$ that arises under the null hypothesis with the sum of squares $S\langle\widehat{\theta}\rangle$. (For $n \to \infty$ the F-test is the same as the so-called likelihood ratio test, and the sum of squares is, up to a constant, equal to the log-likelihood.)
Now we consider the null hypothesis $\theta = \theta^*$ for the whole parameter vector. The test statistic is
$$T = \frac{n-p}{p}\,\frac{S\langle\theta^*\rangle - S\langle\widehat{\theta}\rangle}{S\langle\widehat{\theta}\rangle}\ \overset{a}{\sim}\ F_{p,\,n-p}\,.$$
From this we get a confidence region
$$\Bigl\{\theta \;\Big|\; S\langle\theta\rangle \le S\langle\widehat{\theta}\rangle\Bigl(1 + \tfrac{p}{n-p}\, q\Bigr)\Bigr\}\,,$$
where $q = q^{F_{p,\,n-p}}_{1-\alpha}$ is the $(1-\alpha)$ quantile of the F distribution with $p$ and $n-p$ degrees of freedom.
In linear regression we get exactly the same confidence region if we use the (multivariate) normal distribution of the estimator $\widehat{\beta}$. In the nonlinear case the results are different. The region that is based on the F-test is not based on the linear approximation in 2.d and is thus (much) more exact.
c Exact Confidence Regions for p = 2. If $p = 2$, we can find the exact confidence region by calculating $S\langle\theta\rangle$ on a grid of $\theta$ values and determining the borders of the region through interpolation, as is familiar for contour plots. In Figure 4.c the contours are given together with the elliptical regions that result from the linear approximation, for the Puromycin example (left) and the oxygen consumption example (right).
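A sketch, not from the original text, of such a grid calculation in R for the Puromycin example; it reuses 'fit' and 'treated' from the earlier sketches, and the grid limits are illustrative.

```r
## Sketch: F-test based 95% confidence region for (theta1, theta2),
## obtained by evaluating S(theta) on a grid and drawing one contour.
S <- function(T1, T2)
  sum((treated$rate - T1 * treated$conc / (T2 + treated$conc))^2)

T1.grid <- seq(190, 240, length.out = 100)
T2.grid <- seq(0.04, 0.10, length.out = 100)
SS <- outer(T1.grid, T2.grid, Vectorize(S))

n <- nrow(treated); p <- 2
bound <- S(coef(fit)["T1"], coef(fit)["T2"]) *
         (1 + p / (n - p) * qf(0.95, p, n - p))
contour(T1.grid, T2.grid, SS, levels = bound,
        xlab = "theta1", ylab = "theta2")
```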
For p > 2 contour plots do not exist. In the next chapter we will be introduced to
graphical tools that also work for higher dimensions. They depend on the following
concepts.
e t Test via F Test. In linear regression and in the previous chapter we have calculated
tests and confidence intervals from a test value that follows a t-distribution (t-test for
the coefficients). Is this another test?
It turns out that the test statistic of the t-test in linear regression turns into the test statistic of the F-test if we square it; the two tests are thus equivalent. In nonlinear regression, the F-test is not equivalent to the t-test discussed in the last chapter (3.d). However, we can transform the F-test into a t-test that is more precise than that of the last chapter:
From the test statistic of the F-test we take the square root and attach the sign of $\widehat{\theta}_k - \theta_k^*$,
$$T_k\langle\theta_k^*\rangle := \mathrm{sign}\bigl\langle\widehat{\theta}_k - \theta_k^*\bigr\rangle\, \frac{\sqrt{\widetilde{S}_k\langle\theta_k^*\rangle - S\langle\widehat{\theta}\rangle}}{\widehat{\sigma}}\,.$$
($\mathrm{sign}\langle a\rangle$ denotes the sign of $a$, and $\widehat{\sigma}^2 = S\langle\widehat{\theta}\rangle/(n-p)$.) This test statistic is (approximately) $t_{n-p}$ distributed.
In the linear regression model, $T_k$ is, as mentioned, equal to the test statistic of the usual t-test,
$$T_k\langle\theta_k^*\rangle = \frac{\widehat{\theta}_k - \theta_k^*}{\mathrm{se}\langle\widehat{\theta}_k\rangle}\,.$$
f Confidence Intervals for Function Values via F-Test. With this technique we can also determine confidence intervals for a function value at a point $x_o$. For this we reparameterize the original problem so that a parameter, say $\theta_1$, represents the function value $h\langle x_o\rangle$, and proceed as in 4.d.
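A sketch, not from the original text, of such a reparameterization for the Michaelis-Menten model of the Puromycin example. The name phi1 is an illustrative label for the new parameter representing the function value; 'treated' comes from the earlier sketches, and MASS provides the profile-based confint method for nls fits.

```r
## Sketch: phi1 := h<x0; theta> = theta1*x0/(theta2 + x0), hence
## theta1 = phi1 * (theta2 + x0)/x0; a CI for phi1 is a CI for h<x0; theta>.
library(MASS)   # confint method for nls objects (profile t-function)
x0 <- 0.4
fit.rep <- nls(rate ~ phi1 * (T2 + x0) / x0 * conc / (T2 + conc),
               data  = treated,
               start = c(phi1 = 180, T2 = 0.06))
confint(fit.rep, "phi1")   # profile-based interval for the function value at x0
```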
5. Profile t-Plot and Profile Traces
$$\delta_k\langle\theta_k\rangle := \frac{\theta_k - \widehat{\theta}_k}{\mathrm{se}\langle\widehat{\theta}_k\rangle}$$
on the horizontal axis because of the linear approximation. The comparison line is
then the diagonal, so the line with slope 1 and intercept 0.
The more strongly the profile t-function is curved, the stronger is the nonlinearity in a neighborhood of $\theta_k$. Therefore, this representation shows how good the linear approximation is in a neighborhood of $\widehat{\theta}_k$. (The neighborhood that is statistically important is approximately determined by $|\delta_k\langle\theta_k\rangle| \le 2.5$.) In Figure 5.a it is apparent that in the Puromycin example the nonlinearity is minimal, but in the oxygen consumption example it is large.
From the illustration we can read off the confidence intervals according to 4.e. For convenience, the probabilities $P\langle T_k \le t\rangle$ according to the t-distribution are marked on the right vertical axis. In the oxygen consumption example, this gives a confidence interval without an upper bound!
Example b Membrane Separation Technology. As 5.a shows, from the profile t-plot we can graphically read off the corresponding confidence intervals that are based on the profile t-function. The R function confint(...) numerically calculates the desired confidence interval on the basis of the profile t-function. Table 5.b shows the corresponding R output for the membrane separation example. In this case, no large differences from the classical calculation method are apparent.
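Since the membrane separation data are not reproduced in this document, the following sketch (not from the original text) shows the same calls for the Puromycin fit from the earlier sketches.

```r
## Sketch: profile-t based confidence intervals and profile t-plots for an nls fit.
library(MASS)            # provides the confint method for nls objects
confint(fit, level = 0.95)
plot(profile(fit))       # profile t-plots, cf. Figure 5.a
```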
Table 5.b: Membrane separation technology example: R output for the confidence intervals that
are based on the profile t-function.
c Likelihood Profile Traces. The likelihood profile traces are another useful tool. Here the estimated parameters $\widetilde{\theta}_j$, $j \ne k$, at fixed $\theta_k$ (see 4.d) are considered as functions $\widetilde{\theta}_j^{(k)}\langle\theta_k\rangle$ of these values.
The graphical representation of these functions would fill a whole matrix of diagrams, but without diagonals. It is worthwhile to combine the opposite diagrams of this matrix: over the representation of $\widetilde{\theta}_j^{(k)}\langle\theta_k\rangle$ we superimpose $\widetilde{\theta}_k^{(j)}\langle\theta_j\rangle$ in mirrored form, so that the axes have the same meaning for both functions.
In Figure 5.c each of these diagrams is shown for our two examples. Additionally, contours of confidence regions for $[\theta_1, \theta_2]$ are shown. We see that the profile traces cross the contours at the points of contact of the horizontal and vertical tangents.
The representation shows not only the nonlinearities, but also holds useful clues for how the parameters influence each other. To understand this, we now consider the case of a linear regression function. The profile traces in the individual diagrams then consist of two lines that cross at the point $[\widehat{\theta}_1, \widehat{\theta}_2]$. If we standardize the parameter by using $\delta_k\langle\theta_k\rangle$ from 5.a, one can show that the slope of the trace $\widetilde{\theta}_j^{(k)}\langle\theta_k\rangle$ is equal to the correlation coefficient $c_{kj}$ of the estimated coefficients $\widehat{\theta}_j$ and $\widehat{\theta}_k$. The reverse
Figure 5.a: Profile t-plot for the first parameter in each of the Puromycin and oxygen consumption examples. The dashed lines show the applied linear approximation and the dotted line the construction of the 99% confidence interval with the help of $T_1\langle\theta_1\rangle$.
Example d Membrane Separation Technology. All profile t-plots and profile traces can be assembled into a triangle matrix of diagrams, as Figure 5.d shows for the membrane separation technology example.
Most profile traces are strongly curved, i.e. the regression function tends to a strong nonlinearity around the estimated parameter values. Even though the profile traces for $\theta_3$ and $\theta_4$ are lines, a further problem is apparent: the profile traces lie on top of each other! This means that the parameters $\theta_3$ and $\theta_4$ are extremely strongly collinear. Parameter $\theta_2$ is also collinear with $\theta_3$ and $\theta_4$, although more weakly.
e * Good Approximation of Two-Dimensional Likelihood Contours. The profile traces can be used to construct very accurate approximations for two-dimensional projections of the likelihood contours (see Bates and Watts, 1988). Their calculation is computationally less difficult than for the corresponding exact likelihood contours.
6. Parameter Transformations
a Parameter transformations are primarily used to improve the linear approximation
and therefore improve the convergence behavior and the quality of the confidence
interval.
It is here expressly noted that parameter transformations, unlike transformations of the variable of interest (see 1.h), do not change the statistical part of the model. They are thus not helpful if the assumptions about the distribution of the random deviations are violated. It is the quality of the linear approximation and of the statistical statements based on it that are changed.
Figure 5.d: Profile t-plots and profile traces for the membrane separation technology example. The + in the profile t-plot denotes the least squares solution.
Often, the transformed parameters are very difficult to interpret for the application. The important questions often concern individual parameters, namely the original parameters. Despite this, we can work with transformations: we derive more accurate confidence regions for the transformed parameters and back transform these to get results for the original variables.
The parameter $\theta$ should lie in the interval $(a, b)$. With the logistic transformation $\theta = a + (b-a)/(1 + \exp\langle-\phi\rangle)$, $\theta$ can, for arbitrary $\phi \in \mathbb{R}$, only take values in $(a, b)$.
In the model
$$h\langle x, \theta\rangle = \theta_1 \exp\langle-\theta_2 x\rangle + \theta_3 \exp\langle-\theta_4 x\rangle$$
with $\theta_2, \theta_4 > 0$, the parameter pairs $(\theta_1, \theta_2)$ and $(\theta_3, \theta_4)$ are interchangeable, i.e. $h\langle x, \theta\rangle$ does not change. This can create uncomfortable optimization problems, because, for one thing, the solution is not unique. The constraint $0 < \theta_2 < \theta_4$ that ensures uniqueness is achieved via the transformation $\theta_2 = \exp\langle\phi_2\rangle$ and $\theta_4 = \exp\langle\phi_2\rangle(1 + \exp\langle\phi_4\rangle)$. The function is now
$$\frac{\partial h}{\partial\theta_1} = \exp\langle-\theta_2 x\rangle\,, \qquad \frac{\partial h}{\partial\theta_2} = -\theta_1\, x \exp\langle-\theta_2 x\rangle\,.$$
If all x values are positive, both vectors
The derivative vectors become approximately orthogonal if for $x_0$ the mean value of the $x_i$ is chosen.
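A minimal sketch, not from the original text, of how such a parameter transformation (here $\theta_2 = \exp\langle\phi_2\rangle$ to enforce positivity) can be written directly into an nls model formula; it reuses the Puromycin objects and MASS from the earlier sketches, and the name lT2 is illustrative.

```r
## Sketch: fit with the transformed parameter lT2 = log(theta2), then
## back-transform the estimate and the profile-based interval.
fit.log <- nls(rate ~ T1 * conc / (exp(lT2) + conc),
               data  = treated,
               start = c(T1 = 200, lT2 = log(0.06)))
exp(coef(fit.log)["lT2"])      # back-transformed estimate of theta2
exp(confint(fit.log, "lT2"))   # back-transformed profile interval for theta2
```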
Table 6.d: Correlation matrices for the membrane separation technology example, for the original parameters (left) and with the transformed parameter θ̃3 (right).

Original parameters:
         θ1       θ2       θ3
  θ2  -0.256
  θ3  -0.434    0.771
  θ4   0.515   -0.708   -0.989

Transformed parameter θ̃3:
         θ1       θ2       θ̃3
  θ2  -0.256
  θ̃3   0.323    0.679
  θ4   0.515   -0.708   -0.312
Example d Membrane Separation Technology. In this example it is apparent from the approximate correlation matrix (Table 6.d, left half) that the parameters $\theta_3$ and $\theta_4$ are strongly correlated. (We have already found this in 5.d from the profile traces.)
If the model is reparameterized to
Example f Membrane Separation Technology. The parameter transformation given in 6.d leads to a satisfactory result as far as the correlation is concerned. However, if we look at the likelihood contours or the profile t-plot and the profile traces, the parametrization is still not satisfactory.
An intensive search for further improvements leads to the following transformations, for which the profile traces turn out satisfactorily (Figure 6.f):
$$\widetilde{\theta}_1 := \frac{\theta_1 + \theta_2\, 10^{\widetilde{\theta}_3}}{10^{\widetilde{\theta}_3} + 1}\,, \qquad \widetilde{\theta}_2 := \log_{10}\Bigl\langle \frac{\theta_1 - \theta_2}{10^{\widetilde{\theta}_3} + 1}\, 10^{\widetilde{\theta}_3}\Bigr\rangle\,,$$
$$\widetilde{\theta}_3 := \theta_3 + \theta_4\, \mathrm{med}\langle x_j\rangle\,, \qquad \widetilde{\theta}_4 := 10^{\theta_4}\,.$$
The model is then
$$Y_i = \widetilde{\theta}_1 + 10^{\widetilde{\theta}_2}\, \frac{\widetilde{\theta}_4^{\,(x_i - \mathrm{med}\langle x_j\rangle)}}{1 + 10^{\widetilde{\theta}_3}\, \widetilde{\theta}_4^{\,(x_i - \mathrm{med}\langle x_j\rangle)}} + E_i\,,$$
and we get the result shown in Table 6.f.
Figure 6.f: Profile t-plot and profile traces for the membrane separation technology example
according to the given transformations.
Table 6.f: Membrane separation technology: R summary of the fit after the parameter
transformation.
the help of the Gaussian error propagation law (see Stahel, 2000, Section 6.10):
$$\widehat{\sigma}^2_{\widehat{\theta}} \approx \Bigl(\frac{\partial \exp\langle\phi\rangle}{\partial\phi}\Big|_{\phi=\widehat{\phi}}\Bigr)^{2}\, \widehat{\sigma}^2_{\widehat{\phi}} = \exp\langle\widehat{\phi}\rangle^2\, \widehat{\sigma}^2_{\widehat{\phi}}$$
or
$$\widehat{\sigma}_{\widehat{\theta}} \approx \exp\langle\widehat{\phi}\rangle\, \widehat{\sigma}_{\widehat{\phi}}\,.$$
The Gaussian error propagation law is based on the linearization of the transformation function $g\langle\cdot\rangle$; concretely, on $\exp\langle\cdot\rangle$. If we have carried out the parameter transformation because the quality of the confidence intervals left a lot to be desired, then unfortunately this linearization negates what has been achieved, and we are back where we started before the transformation.
The correct approach in such cases is to determine the confidence interval as presented
in Chapter 4. If this is impossible for whatever reason, we can fall back on the following
approximation.
This procedure also ensures that, unlike the Gaussian error propagation law, the confidence interval lies entirely in the region that is predetermined for the parameter. It is thus impossible that, for example, in the example $\theta = \exp\langle\phi\rangle$ the confidence interval contains negative values, as the interval from 6.h can.
As already noted, the approach just discussed should only be used if the way via the
F-test from Chapter 4 is not possible. The Gaussian error propagation law should not
be used in this context.
7. Forecasts and Calibration

Forecasts
a Besides the question of which parameters are plausible with respect to the given data,
the question of in what range future observations will lie is often of central interest.
The difference in these two questions was already discussed in 3.h. In this chapter we
want to go into, among other things, the answer to the second question. We assume
that the parameter $\theta$ is estimated from the data with the least squares method. (If you want, you can call this dataset the training dataset.) What can we now, with the help of our model, say about a future observation $Y_o$ at a given point $x_o$?
(The data and the function $h\langle\cdot\rangle$ are graphically represented in Figure 7.b.) We can now ask ourselves which weight values are expected for a concentration of, e.g., $x_0 = 3$.
Figure 7.b: Cress Example. Left: Representation of the data. Right: A typical shape of the
applied regression function.
d Forecast Versus Confidence Intervals. If the sample size $n$ of the training dataset is very large, the estimated variance $\widehat{\sigma}^2_v$ is dominated by the error variance $\widehat{\sigma}^2$. This means that the uncertainty in the forecast is then determined primarily by the observation error. The second summand in the expression for $\widehat{\sigma}^2_v$ reflects the uncertainty that is caused by the estimation of $\theta$.
It is therefore clear that the forecast interval is wider than the confidence interval for the expected value, since the random scatter of the observation must be taken into account. The endpoints of such intervals for $x_o$ values in a chosen region are shown in Figure 7.i, left, bound together as a band.
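Since the cress data are not reproduced in this document, the following sketch (not from the original text) computes an approximate forecast interval for the Puromycin fit from the earlier sketches; x0 is an illustrative new point and the gradient is written out by hand.

```r
## Sketch: approximate 95% forecast (prediction) interval at a new point x0,
## based on the linear approximation around the least squares estimate.
x0    <- 0.4
th    <- coef(fit)
sigma <- summary(fit)$sigma
df    <- df.residual(fit)
h     <- function(x, T1, T2) T1 * x / (T2 + x)

eta0 <- unname(h(x0, th["T1"], th["T2"]))
a0   <- c(x0 / (th["T2"] + x0),                   # d h / d T1
          -th["T1"] * x0 / (th["T2"] + x0)^2)     # d h / d T2
se.eta  <- sqrt(sum(a0 * (vcov(fit) %*% a0)))     # se of the function value
se.pred <- sqrt(sigma^2 + se.eta^2)               # adds the observation error
eta0 + c(-1, 1) * qt(0.975, df) * se.pred         # forecast interval at x0
```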
e * Quality of the Approximation. The determination of the forecast interval in 7.c is based on the same approximation as is used in Chapter 3. The quality of the approximation can again be checked graphically as in Chapter 5.
where $V_0\langle x_o\rangle$ is the lower and $V_1\langle x_o\rangle$ the upper boundary of the prediction interval for $h\langle x_o\rangle$. However, if we want to make a prediction about more than one future observation, then the number of observations in the forecast interval is not binomially distributed with parameter 0.95. The events that the individual future observations fall in the band are, specifically, not independent; they depend on each other through the random borders $V_0$ and $V_1$. If, for example, the estimate $\widehat{\sigma}$ randomly turns out too small, the band remains too narrow for all future observations, and too many observations would lie outside of the band.
Calibration
g The actual goal of the experiment in the cress example is to estimate the concentration of the agrochemical materials from the weight of the cress. Consequently, we would like to use the regression relationship in the wrong direction. Problems of inference arise from this. Despite that, such a procedure is often desired to calibrate a measurement method or to estimate the result of a more expensive measurement method from a cheaper one. The regression curve in this relationship is often called a calibration curve. Another keyword for finding this topic is inverse regression.
We would like to present here a simple method that gives a usable result if simplifying assumptions apply.
Here, $h^{-1}$ denotes the inverse function of $h$. This procedure is, however, only correct if $h\langle\cdot\rangle$ is monotonically increasing or decreasing. This condition is usually fulfilled in calibration problems.
Figure 7.i: Cress example. Left: Confidence band for the estimated regression curve (dashed) and forecast band (solid). Right: Schematic representation of how a calibration interval is determined, at the points $y_0 = 650$ and $y_0 = 350$. The resulting intervals are $[0.4, 1.22]$ and $[1.73, 4.34]$, respectively.
i Accuracy of the Obtained Values. At the end of this calculation, the question arises of how accurate this value really is. The problem appears similar to the prediction problem at the start of this chapter. Unlike the prediction problem, $y_o$ is observed and the corresponding value $x_o$ must be estimated.
The answer can be formulated as follows: we see $x_o$ as a parameter for which we want a confidence interval. Such an interval arises (as always) from a test. We take as null hypothesis $x_H = x_o$! As we have seen in 7.c, $Y_o$ lies with probability 0.95 in the forecast interval
$$\widehat{\eta}_o \pm q^{t_{n-p}}_{1-\alpha/2}\, \sqrt{\widehat{\sigma}^2 + \mathrm{se}\langle\widehat{\eta}_o\rangle^2}\,.$$
Here $\widehat{\eta}_o$ is a compact notation for $h\langle x_o, \widehat{\theta}\rangle$. This interval therefore forms an acceptance interval for the value $Y_o$ (which here plays the role of a test statistic) under the null hypothesis $x_o = x_H$. In Figure 7.i the prediction intervals for all possible $x_o$ in a set interval are graphically represented for the cress example.
j Illustration. Figure 7.i, right, illustrates the thought process for the cress example: measured values $y_o$ are compatible with parameter values $x_o$ in the sense of the test if the point $[x_o, y_o]$ lies between the curves shown. In the figure, we can thus determine without difficulty the set of $x_o$ values that are compatible with the observation $y_o$. They form the shown interval, which can also be described as the set
$$\Bigl\{ x \;:\; |y_o - h\langle x, \widehat{\theta}\rangle| \le q^{t_{n-p}}_{1-\alpha/2}\, \sqrt{\widehat{\sigma}^2 + \mathrm{se}\bigl\langle h\langle x, \widehat{\theta}\rangle\bigr\rangle^2}\, \Bigr\}\,.$$
This interval is now the desired confidence interval (or calibration interval) for $x_o$.
If we have $m$ values to determine $y_o$, we apply the above method to $y_o = \sum_{j=1}^{m} y_{oj}/m$:
$$\Bigl\{ x \;:\; |y_o - h\langle x, \widehat{\theta}\rangle| \le q^{t_{n-p}}_{1-\alpha/2}\, \sqrt{\frac{\widehat{\sigma}^2}{m} + \mathrm{se}\bigl\langle h\langle x, \widehat{\theta}\rangle\bigr\rangle^2}\, \Bigr\}\,.$$
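A sketch, not from the original text, of how the single-observation calibration interval can be found numerically: collect all x whose forecast interval contains the observed y0. It reuses h, th, sigma, df and 'fit' from the forecast sketch above; y0 and the grid are illustrative.

```r
## Sketch: calibration interval for x0 given an observed y0, by inverting
## the 95% forecast band on a grid of x values.
y0     <- 180
x.grid <- seq(0.02, 1.2, length.out = 500)

inside <- sapply(x.grid, function(x) {
  eta <- unname(h(x, th["T1"], th["T2"]))
  a   <- c(x / (th["T2"] + x), -th["T1"] * x / (th["T2"] + x)^2)
  se  <- sqrt(sigma^2 + sum(a * (vcov(fit) %*% a)))
  abs(y0 - eta) <= qt(0.975, df) * se
})
range(x.grid[inside])   # approximate calibration interval for x0
```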
k In this chapter, one of many possibilities for determining the calibration interval was presented. The diversity of methods exists because their determination is based on some strongly simplifying assumptions.
8. Closing Comments
a Reason for the Difficulty in the Oxygen Consumption Example. Why do we have so much difficulty with the oxygen consumption example? If we consider Figure 1.e and remind ourselves that the parameter $\theta_1$ represents the expected oxygen consumption for infinite incubation time, it is clear that $\theta_1$ is difficult to estimate, because the horizontal asymptote through the data is badly determined. If we had more observations with longer incubation times, we could avoid the difficulties with the quality of the confidence intervals of $\theta$.
b Bootstrap. For some time the bootstrap has also been used for determining confidence, prediction, and calibration intervals. Regarding this, see, e.g., Huet, Bouvier, Gruet and Jolivet (1996). In this book the case of non-constant variance (heteroskedastic models) is also discussed. It is also worth taking a look at the book by Carroll and Ruppert (1988).
c Correlated Errors. In this document it is always assumed that the errors $E_i$ are independent. Like linear regression analysis, nonlinear regression can also be extended by bringing in correlated errors and random effects (software hints are given in 8.d).
d Statistics Programs. Today most statistics packages contain a procedure that can
calculate asymptotic confidence intervals for the parameters. In principle it is then
possible to calculate t-profiles and profile traces, because they are also based on the
fitting of nonlinear models, although with one fewer parameter.
In both implementations of the statistics language S, S-Plus and R, the function nls is available, which is based on the work of Bates and Watts (1988). The library nlme contains S functions that can fit nonlinear regression models with correlated errors (gnls) and random effects (nlme). These implementations are based on the book Mixed-Effects Models in S and S-PLUS by Pinheiro and Bates (2000).
e Literature Notes. This document is based mainly on the book by Bates and Watts (1988). A mathematical discussion of the statistical and numerical methods in nonlinear regression can be found in Seber and Wild (1989). The book by Ratkowsky (1989) enumerates many possible nonlinear functions $h\langle\cdot\rangle$ that occur primarily in biological applications.
A short introduction to this theme in relationship with the statistics program R can be found in Venables and Ripley (2002) and in Ritz and Streibig (2008).
A. Gauss-Newton Method
a The Optimization Problem. To determine the least squares estimator we must minimize the sum of squares
$$S\langle\theta\rangle := \sum_{i=1}^{n} \bigl(y_i - \eta_i\langle\theta\rangle\bigr)^2\,.$$
Unfortunately, this cannot be carried out explicitly as in linear regression. With local linear approximations of the regression function, however, the difficulties can be overcome.
b Solution Procedure. The procedure is set out in four steps. We consider the (j+1)-th iteration in the procedure. To simplify, we assume that the regression function $h\langle\cdot\rangle$ has only one unknown parameter. The solution from the previous iteration is denoted by $\theta^{(j)}$; $\theta^{(0)}$ is then the notation for the initial value.
1. Approximation: The regression function $h\langle x_i, \theta\rangle = \eta_i\langle\theta\rangle$ with the one-dimensional parameter $\theta$ at the point $\theta^{(j)}$ is approximated by a line:
$$\eta_i\langle\theta\rangle \approx \eta_i\langle\theta^{(j)}\rangle + \frac{\partial\eta_i\langle\theta\rangle}{\partial\theta}\Big|_{\theta^{(j)}}\, (\theta - \theta^{(j)}) = \eta_i\langle\theta^{(j)}\rangle + a_i\langle\theta^{(j)}\rangle\,(\theta - \theta^{(j)})\,,$$
where
$$a_i^{(k)}\langle\theta\rangle := \frac{\partial h\langle x_i, \theta\rangle}{\partial\theta_k}\,, \qquad k = 1, \dots, p\,.$$
With vectors and matrices the above equation can be written as
Here the $(n \times p)$ derivative matrix $A^{(j)}$ consists, in the $j$-th iteration, of the elements $\{a_{ik} = a_i^{(k)}\langle\theta\rangle\}$ at the point $\theta = \theta^{(j)}$.
2. Local Linear Model: We now assume that the approximation in the first step holds exactly for the true model. With this we get for the residuals
$$r_i^{(j+1)} = y_i - \bigl\{\eta_i\langle\theta^{(j)}\rangle + a_i\langle\theta^{(j)}\rangle\,(\theta - \theta^{(j)})\bigr\} = \widetilde{y}_i^{(j+1)} - a_i\langle\theta^{(j)}\rangle\,\beta^{(j+1)}$$
with $\widetilde{y}_i^{(j+1)} := y_i - \eta_i\langle\theta^{(j)}\rangle$ and $\beta^{(j+1)} := \theta - \theta^{(j)}$.
point
$$\eta\langle\theta^{(j+1)}\rangle \quad\text{with}\quad \theta^{(j+1)} = \theta^{(j)} + \widehat{\beta}^{(j)}$$
lies nearer to $y$ than $\eta\langle\theta^{(j)}\rangle$.
4. Iteration: Now with $\theta^{(j+1)} = \theta^{(j)} + \widehat{\beta}^{(j)}$ we return to step 1 and repeat steps 1, 2, and 3 until this procedure converges. The converged solution minimizes $S\langle\theta\rangle$ and thus corresponds to the desired estimation value $\widehat{\theta}$.
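The following is a deliberately bare-bones R sketch, not from the original text, of this iteration for the Michaelis-Menten model of the Puromycin example ('treated' from the earlier sketches); real implementations such as nls add step-size control and convergence tests.

```r
## Sketch: plain Gauss-Newton iteration following steps 1-4 above.
y <- treated$rate
x <- treated$conc
theta <- c(T1 = 205, T2 = 0.08)                       # initial value theta^(0)

for (j in 1:10) {
  eta  <- theta["T1"] * x / (theta["T2"] + x)         # step 1: eta at theta^(j)
  A    <- cbind(x / (theta["T2"] + x),                # derivative matrix A^(j)
                -theta["T1"] * x / (theta["T2"] + x)^2)
  beta <- qr.solve(A, y - eta)                        # steps 2-3: local LS step
  theta <- theta + beta                               # step 4: theta^(j+1)
}
theta   # close to the least squares estimate from nls()
```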
References
Bates, D. M. and Watts, D. G. (1988). Nonlinear Regression Analysis & Its Applications, John Wiley & Sons.
Carroll, R. and Ruppert, D. (1988). Transformation and Weighting in Regression,
Wiley, New York.
Daniel, C. and Wood, F. S. (1980). Fitting Equations to Data, John Wiley & Sons,
New York.
Huet, S., Bouvier, A., Gruet, M.-A. and Jolivet, E. (1996). Statistical Tools for Nonlinear Regression: A Practical Guide with S-Plus Examples, Springer-Verlag, New York.
Pinheiro, J. C. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS,
Statistics and Computing, Springer.
Rapold-Nydegger, I. (1994). Untersuchungen zum Diffusionsverhalten von Anionen in
carboxylierten Cellulosemembranen, PhD thesis, ETH Zurich.
Ratkowsky, D. A. (1985). A statistically suitable general formulation for modelling catalytic chemical reactions, Chemical Engineering Science 40(9): 1623-1628.
Ratkowsky, D. A. (1989). Handbook of Nonlinear Regression Models, Marcel Dekker,
New York.
Ritz, C. and Streibig, J. C. (2008). Nonlinear Regression with R, Use R!, Springer
Verlag.
Seber, G. and Wild, C. (1989). Nonlinear Regression, Wiley, New York.
Stahel, W. (2000). Statistische Datenanalyse: Eine Einführung für Naturwissenschaftler, 3. Auflage, Vieweg, Braunschweig/Wiesbaden.
Venables, W. N. and Ripley, B. (2002). Modern Applied Statistics with S, Statistics
and Computing, fourth edn, Springer-Verlag, New York.