Sio223a Chap7 PDF
Sio223a Chap7 PDF
Sio223a Chap7 PDF
Chapter 7
Least squares is a time-honored estimation procedure, that was developed independently by Gauss
(1795), Legendre (1805) and Adrain (1808) and published in the first decade of the nineteenth
century. It is perhaps the most widely used technique in geophysical data analysis. Unlike
maximum likelihood, which can be applied to any problem for which we know the general form
of the joint pdf, in least squares the parameters to be estimated must arise in expressions for the
means of the observations. When the parameters appear linearly in these expressions then the
least squares estimation problem can be solved in closed form, and it is relatively straightforward
to derive the statistical properties for the resulting parameter estimates.
One very simple example which we will treat in some detail in order to illustrate the more general
problem is that of fitting a straight line to a collection of pairs of observations (xi , yi ) where
i = 1, 2, . . . , n. We suppose that a reasonable model is of the form
y = β0 + β1 x, (1)
and we need a mechanism for determining β0 and β1 . This is of course just a special case of many
more general problems including fitting a polynomial of order p, for which one would need to find
p + 1 coefficients. The most commonly used method for finding a model is that of least squares
estimation. It is supposed that x is an independent (or predictor) variable which is known exactly,
while y is a dependent (or response) variable. The least squares (LS) estimates for β0 and β1 are
those for which the predicted values of the curve minimize the sum of the squared deviations from
the observations. That is the problem is to find the values of β0 , β1 that minimize the residual sum
of squares
n
X
S(β0 , β1 ) = (yi − β0 − β1 xi )2 (2)
i=1
Note that this involves the minimization of vertical deviations from the line (not the perpendicular
distance) and is thus not symmetric in y and x. In other words if x is treated as the dependent
variable instead of y one might well expect a different result.
To find the minimizing values of βi in (2) we just solve the equations resulting from setting
∂S ∂S
= 0, = 0, (3)
∂β0 ∂β1
namely
X X
yi = nβ̂0 + β̂1 xi
i i
X X X (4)
xi yi = β̂0 xi + β̂1 x2i
i i i
2008
c D.C. Agnew/ C. Constable
7-2 Least Squares Estimation Version 1.3
Solving for the β̂i yields the least squares parameter estimates:
P 2P P P
xi i yi − xi xi yi
β̂0 =
n x2i − ( xi )2
P P
P P P (5)
n xi yi − xi yi
β̂1 =
n x2i − ( xi )2
P P
P
where the ’s are implicitly taken to be from i = 1 to n in each case. Having generated these
estimates, it is natural to wonder how much faith we should have in β̂0 and β̂1 , and whether the fit
to the data is reasonable. Perhaps a different functional form would provide a more appropriate fit
to the observations, for example, involving a series of independent variables, so that
y ≈ β0 + β1 x1 + β2 x2 + β3 x3 (6)
or decay curves
f (t) = Ae−αt + Be−βt , (7)
or periodic functions
f (t) = Acosω1 t + Bsinω1 t + Ccosω2 t + Dsinω2 t. (8)
In equations (7) and (8) the functions f (t) are linear in A, B, C and D, but nonlinear in the other
parameters α, β, ω1 , and ω2 . When the function to be fit is linear in the parameters, then the partial
derivatives of S with respect to them yield equations that can be solved in closed form. Typically
non-linear least squares problems do not provide a solution in closed form and one must resort to
an iterative procedure. However, it is sometimes possible to transform the nonlinear function to
be fitted into a linear form. For example, the Arrhenius equation models the rate of a chemical
reaction as a function of temperature via a 2-parameter model with an unknown constant frequency
factor C and activation energy EA , so that
EA
log α(T ) = log C − (10)
kT
We return to the simplest of LS fitting problems, namely fitting a straight line to paired observations
(xi , yi ), so that we can consider the statistical properties of LS estimates, assess the goodness of
fit in the resulting model, and understand how regression is related to correlation.
To make progress on these fronts we need to adopt some kind of statistical model for the noise
associated with the measurements. In the standard statistical model (SSM) we suppose that y is
a linear function of x plus some random noise,
yi = β0 + β1 xi + ei i = 1, . . . , n. (11)
2008
c D.C. Agnew/ C. Constable
7-3 Least Squares Estimation Version 1.3
In (11) the values of xi are taken to be fixed, while the ei are independent random variables with
E(ei ) = 0 and V ar(ei ) = σ 2 , but for the time being we make no further assumption about the exact
distribution underlying the ei .
Under the SSM it is straightforward to show that the LS estimate for a straight line is unbiased:
that is E[βj ] = βj . To do this for β0 we make use of the fact that E[yi ] = β0 + β1 xi , and take the
expected value of β0 in equation (5). This yields:
P 2P P P
xi i E[yi ] − xi xi E[yi ]
E[β̂0 ] =
n x2i − ( xi )2
P P
P 2
xi (nβ0 + β1 i xi ) − xi (β0 xi + β1 i x2i )
P P P P
(12)
= P 2 P 2
n xi − ( xi )
= β0
A similar proof establishes that E[β̂1 ] = β1 . Note that this proof only uses E[ei ] = 0 and the fact
that the errors are additive: we did not need them to be iid.
Under the SSM, V ar[yi ] = σ 2 and Cov[yi , yj ] = 0 for i 6= j. Making use of this it is possible (see
Rice p. 513) to calculate the variances for β̂i as
σ 2 i x2i
P
V ar[β̂0 ] = P 2 P
n xi − ( xi )2
nσ 2
V ar[β̂1 ] = (13)
n x2i − ( xi )2
P P
−σ 2 i xi
P
Cov[β̂0 , β̂1 ] = P 2 P
n xi − ( xi )2
To show this we make use of the fact that equation (5) can be rewritten in the form:
P P
(x i − x̄)(y i − ȳ) i (xi − x̄)(yi )
β̂1 = i P 2
= P 2
i (xi − x̄) i (xi − x̄)
Then
σ2
V ar[β̂1 ] = P 2
i (xi − x̄)
Similarly for the other expressions in (13).
We see from (13) that the variances of the slope and intercept depend on xi and σ 2 . The xi are
known, so we just need a means of finding σ 2 . In the SSM, σ 2 = E[yi − β0 − β1 xi ]. So we can
estimate σ 2 from the average squared deviations of data about the fitted line:
X
RSS = (yi − β̂0 − β̂1 xi )2 (14)
i
When the ei are independent normally distributed random variables then β̂0 , β̂1 will be too, since
they are just linear combinations of independent normal RV’s. More generally if the ei are
independent and satisfy some not too demanding assumptions, then a version of the Central Limit
Theorem will apply, and for large n, β̂0 and β̂1 are approximately normal RV’s.
An immediate and important consequence of this is that we can invoke either exact or approximate
confidence intervals and hypothesis tests based on the β̂i being normally distributed. It can be
shown that
β̂i − βi
∼ tn−2 (16)
sβ̂i
and we can use the t-distribution to establish confidence intervals and for hypothesis testing.
Perhaps the commonest application of hypothesis testing is in determining whether the βi are
significantly different from zero. If not there may be a case for excluding them from the model.
The most basic thing to do in assessing the fit is to use the residuals from the model, in this case:
They should be plotted as a function of x, which allows one to see systematic misfit or departures
from the SSM. These may indicate the need for a more complex model or transformation of
variables.
When the variance of errors is a constant independent of x then the errors are said to be ho-
moscedastic, when the opposite is true they are heteroscedastic. Rice provides some good
examples for this in Chapter 14 - see Figs 14.11 and 14.12. When the variance varies with x it
is sometimes possible to find a transformation to correct the problem. For example, instead of
√ √
y = βx one could try y = γ x. Then β̂ = γ̂ 2 , . . .
A common scenario one might wish to test is whether the intercept is zero. This can be done
by calculating both slope and intercept, and finding sγ̂0 . Then one could use the t−test on the
hypothesis H0 : γ0 = 0 with
γ̂0
t=
sγ̂0
Another strategy in assessing fit is to look at the sample distribution of residuals, compared to a
normal probability plot. Q-Q plots of the residuals can provide a visual means of assessing things
like gross departures from normality or identifying outliers. LS estimates are not robust against
outliers, which can have a large effect on the estimated coefficients, their standard errors and s.
This is especially true if outliers correspond to extreme values of x.
2008
c D.C. Agnew/ C. Constable
7-5 Least Squares Estimation Version 1.3
As must by now be obvious there is a close relationship between correlation and fitting straight
lines. We define
1X
Sxx = (xi − x̄)2
n i
1X
Syy = (yi − ȳ)2 (18)
n i
1X
Sxy = (xi − x̄)(yi − ȳ)
n i
The correlation coefficient r expressing the degree of linear correlation between x and y is
Sxy
r=p , (19)
Sxx Syy
Sxy
β̂1 = , (20)
Sxx
implying that s
Sxx
r = β̂1
Syy
Clearly zero correlation occurs only when the slope is zero. We can also see that if we calculate
standardized residuals
xi − x̄ yi − ȳ
ui = √ vi = p , (21)
Sxx Syy
then Suu = Svv = 1 and Suv = r. In this standardized system β0 = v̄ − rū = 0, since u and v are
centered on the mean values of the original variables, and the predicted values are just
(22) clearly describes the concept of regression toward mediocrity, originally espoused by Galton.
When offsprings heights are paired with those of their parents, there is a tendency for larger
(or smaller) than average parents to have offspring whose sizes are closer to the mean of the
distribution.
In more complex least squares problems there are substantial advantages to adopting an approach
that exploits linear algebra. The notation is more compact and can provide theoretical insights as
well as computational advantages of a purely practical nature. Programming packages like Matlab
are an excellent example of this.
2008
c D.C. Agnew/ C. Constable
7-6 Least Squares Estimation Version 1.3
In least squares estimation the parameters to be estimated must arise in expressions for the means
of the observations. That is for an observation y we can write:
E(y) = θ1 x1 + θ2 x2 + . . . + θp xp (23)
Now suppose that we measure y a total of n times yielding yi , i = 1, . . . , n each time using a
different set of values for x1 , . . . , xp , denoted by xi1 , xi2 , . . . , xi,p for the ith experiment; we
assume all these values for x are known. Then our expression becomes
How do we estimate the unknown parameters θ1 , θ2 , . . . , θp using the observed values {yi } and the
known values {xi1 }, . . . , {xip }, i = 1, . . . , n? The principle of least squares chooses as estimates
of θ1 , θ2 , . . . , θp those values θ̂1 , θ̂2 , . . . , θ̂p which minimize
n
X 2
S(θ1 , θ2 , . . . , θp ) = {yi − θ1 xi1 − θ2 xi2 − . . . − θp xip } (26)
i=1
We can find the solution to this problem as we did before for the straight line fitting problem, by
∂S
simply setting ∂θ i
= 0 yielding the following:
n
X n
X n
X n
X
θ̂1 xi1 xik + θ̂2 xi2 xik + . . . θ̂p xip xik = yi xik k = 1, . . . , p (27)
i=1 i=1 i=1 i=1
But the index notation is a bit messy and we now want to push forward with developing the matrix
approach, rewriting (27) in terms of vectors and matrices. We start again from (25),
~ = X~θ +~
Y (28)
~ ∈ IRn , ~θ ∈ IRp . X is an n × p matrix (not a random variable) and is known as the design
Y
matrix; its rows are the x1 , x2 , . . . , xp ’s for each measurement of y:
2008
c D.C. Agnew/ C. Constable
7-7 Least Squares Estimation Version 1.3
We see that S = ~r.~r or the Euclidean length of ~r. In other words the least-squares solution is the
one with the smallest misfit to the measurements, as given by
n
X
~rT ~r = ri ri = ri2 = ~r.~r = k~rk22 (31)
i=1
Equivalently,
~ − X~θ)T (Y ~ − X~θ) = ~0
∇~θ (Y
(33)
~ Y ~ T X~θ − ~θT X T Y
~ −Y ~ + ~θT X T X~θ = ~0
T
∇~θ Y
~ T X~θ = ~θT X T Y
Since Y ~ )T ~θ this becomes
~ = (X T Y
whence
~ + 2X T X~θ = 0
−2X T Y (35)
which is to say
~
θ̂ = (X T X)−1 X T Y (36)
provided (X T X)−1 exists. These are the normal equations, the formal solution to the least-
squares problem. As we will see later, constructing (X T X) can sometimes introduce considerable
roundoff error so in practice it may not be desirable to solve the equations in this particular form.
We discuss some alternative formulations later. Here we note that in order for a unique solution to
exist for the normal equations, we require (X T X) to be non-singular: this will be the case if and
only if the rank of X equals p.
The matrix notation is readily understood if we use as an example the straight line fitting from an
earlier section. In this context (29) produces
y1 1 x
1
y2 β
1 x 2
Y~ = . ~θ = 0
X = ..
. (37)
.. β1 . .
.
yn 1 xn
We can form the normal equations as in (36) by
1 x1
1 1 ... 1 1
x2
XT X = (38)
x1 x2 . . . xn ... ..
.
1 xn
2008
c D.C. Agnew/ C. Constable
7-8 Least Squares Estimation Version 1.3
yielding Pn
T
X X= Pnn
i=1
Pn 2 xi
(39)
i=1 xi i=1 xi
1 P 2 P
i xi − xi
T −1 i
(X X) = P 2 P (40)
n i xi − ( i xi ) − i xi n
P 2
along with P
y
T
X Y = P i i (41)
i xi yi
β0
h i
β̂ = ~
= (X T X)−1 X T Y
β1
1 h P 2 P ih P
x − i xi P i yi
i
= P 2 i i
P
n i xi − ( i xi )2 − i xi n (42)
i xi yi
P
P 2
1
P P P
( y i )( x ) −( x i )( P i xi yi )
h i
= P 2 i P i i Pi
n( i xi yi ) −( i xi )( i yi )
P
n i xi − ( i xi )2
If the errors in the original measurements are uncorrelated, i.e., Cov(i , j ) = 0 ∀ i 6= j and they
all have the same variance, σ 2 , then we write the data covariance matrix as C = σ 2 I. I is an n
by n identity matrix. When this property holds for the data errors, each θ̂k is an unbiased estimate
of θk
E(θ̂k ) = θk for all k = 1, . . . , p (43)
Also the variance-covariance matrix for θ̂ is Cθ̂θ̂ = σ 2 (X T X)−1 , so that
Cθ̂θ̂ is a p by p matrix, and (44) is just a generalization of (13). Observe that even though the
uncertainties in the original measurements are uncorrelated the parameter estimates derived from
them are in general correlated. Under this model an unbiased estimate for σ 2 is
~ − Ŷ ||2 RSS
||Y
σ̂ 2 = s2 = = (45)
n−p n−p
Also note that the normal equations provide estimators θ̂ that are linear combinations of the
observations. The Gauss-Markov theorem states that the least squares estimators are the best
2008
c D.C. Agnew/ C. Constable
7-9 Least Squares Estimation Version 1.3
linear unbiased estimates (BLUE) for the parameters θi . “Best” in this context means most
efficient or minimum variance. Suppose the LS estimate for θi is given by
θ̂i = a1 y1 + a2 y2 + . . . + an yn (46)
a linear combination of the observations. Then any other linear unbiased estimate for θi , for
example,
θi∗ = b1 y1 + b2 y2 + . . . + bn yn (47)
will always have
V ar(θi∗ ) ≥ V ar(θ̂i ) (48)
with equality if ai = bi , i = 1, . . . , n
Unbiased estimators do not always have the smallest possible variance. It can be shown that
minimum mean square error (MSE) estimators look like
~
θ̂M M SE = (αI + X T X)−1 X T Y (49)
Suppose that we now further restrict the ei to be independent and normally distributed. Since
the θ̂k are linear combinations of individual normally distributed random variables they are also
normal with mean θk and variance σ 2 Σkk where we now have Σ = (X T X)−1 . The standard error
in θ̂k can be estimated as p
sθ̂k = s Σkk (50)
Using this result we can once again construct confidence intervals and hypothesis tests. These
will be exact under the case of normal data errors and approximate otherwise (courtesy a version
of the Central Limit Theorem) since the θ̂k are a linear combination of the RV’s i .
2008
c D.C. Agnew/ C. Constable
7-10 Least Squares Estimation Version 1.3
One criterion often used as a rule of thumb in evaluating models is the squared multiple corre-
lation coefficient or coefficient of determination
Sy2 − Sˆ2
R2 = (54)
Sy2
This provides a measure of the variance reduction achieved by the model: in other words it
indicates how much of the variance in the original data is explained by the model.
Suppose that we now assume that the i are drawn from a Gaussian population. Then recalling the
method of maximum likelihood described in Chapter 5 we can write the log-likelihood function
for θ1 , . . . , θp as
n p
1 n/2 1 X X
θj Xij )2
l(θ1 , . . . , θp ) = ln 2
− 2
(y i −
2πσ 2σ i=1 j=1
n 1 ~ (55)
= − ln(2πσ 2 ) − 2 kY − X~θk2
2 2σ
n 1
= − ln(2πσ 2 ) − 2 k~rk2
2 2σ
Thus maximizing the likelihood function in this case will lead to identical results to minimizing
the sum of the squares of the residuals.
Note that this is a special property of Gaussian errors, and recall the results of Section 5.5.5. If we
had instead
P exponentially
P distributed uncertainties, i.e., j ∼ exp{−|xj |/x0 }, then l is maximized
when i |yi − j θj Xij | is minimum. This is the one-norm of the vector of residuals, and θ is
no longer a fixed linear combination of the yi at its minimum.
What do we do when the i are covariant, with general covariance matrix C 6= σ 2 I? Then the
Gauss-Markov Theorem is no longer valid, and equivalence between LS and ML will not hold even
under the assumption of normality. However, going back to our earlier description of covariance
matrices (section 4.6), we remember that we can transform to a system where the observations are
uncorrelated. We form new pseudo-data
~ 0 = LY
Y ~ (56)
~ 0 ] = σ 2 I. We proceed as before with the new variables.
so that V ar[Y
Note that if the covariance matrix for the i is C, the likelihood function may be written as
1 1
~ − X~θ)T C −1 (Y
~ − X~θ)
l(θ1 , . . . , θp ) = − log (2π)n det(C) − (Y
(57)
2 2
2008
c D.C. Agnew/ C. Constable
7-11 Least Squares Estimation Version 1.3
Numerical solution of the normal equations needs to be performed with some care as the matrix
(XC −1 X T )−1 may be very poorly conditioned. In fact it is sometimes not a good idea even to form
the normal equations since the additions made in forming (XC −1 X T ) can produce undesirable
round-off error. X and X T X have the same null space and therefore the same rank. Thus the
normal equations have a unique solution if and only if the columns of X are linearly independent.
The QR algorithm is a very numerically stable method for solution of a least squares problem * If
X is an n × p matrix and is of rank p then we can write
Xn×p = Qn×p Rp×p
where QT Q = I, that is the columns of Q are orthogonal, and R is upper triangular, that is
rij = 0 for all i > j. If we substitute X = QR in the normal equations it is straightforward to
show that the least squares estimate can be expressed as β̂ = R−1 QT Y or equivalently
Rβ̂ = QT Y
X T X = RT R,
with R upper triangular again. Then RT v = X T Y , again solved by back substitution for v and β̂
is recovered from Rβ = v.
Earlier in this chapter we discussed how in some circumstances non-linear models can be exactly
transformed into linear ones which can be solved directly. One example of this is given below:
* See C. L. Lawson and R. J. Hanson (1974) Solving Least Squares Problems (Prentice-Hall).
2008
c D.C. Agnew/ C. Constable
7-12 Least Squares Estimation Version 1.3
The Arrhenius relationship for thermally activated semiconduction in minerals is σ(t) = σ0 e−A/kt
where σ(t) is the electrical conductivity at temperature t, k is Boltzmann’s constant and A is the
activation energy. This has been used to model the electrical conductivity data for the mineral
olivine as a function of temperature. Olivine is a major constituent of Earth’s mantle, and it
is of some interest to understand the relationship between conductivity and the temperature and
other physical properties of Earth’s interior. For the conductivity example we can solve for the
activation energy A and σ0 simply by working in the log domain, and the transformed model is
shown in Figure 7-1 for conductivity data derived from a sample of Jackson County dunite.
Figure 7 -1: Temperature - conductivity data for Jackson County dunite. The blue symbols appear
to follow the Arrhenius relationship reasonably well, but the red parts will require additional non-
linear terms.
However, in some circumstances such a transformation may be either not possible or undesirable
- see the example in Figure 7-1. If there is no closed form solution then one must resort to
solution by non-linear least squares, with a model that is non-linear in p unknown parameters with
n observations. The basic strategy is to make some initial guess for a solution, approximate the
model by a linear one and refine the parameter estimates by successive iterations.
The linear form of equation (28) is replaced by a nonlinear model function, so that
~ = f (~θ) + ~r
Y (61)
with ~θ again a vector of parameters, and the nonlinear prediction function f replacing the n × p
matrix X. Again we want to miminize
S = ~r.~r
which occurs when the gradient of S with respect to ~θ = 0, but in the nonlinear system the
derivatives ∇θ~r will be functions of both the independent variables and the parameters. We
choose an initial value for the parameters θ̂0 and successively update the approximate estimates:
θ̂k+1 = θ̂k + ∆θ̂, k = 0, 1, 2, . . . (62)
2008
c D.C. Agnew/ C. Constable
7-13 Least Squares Estimation Version 1.3
Here the index k represents the iteration number, and at each iteration the vector of increments to
the parameters is ∆θ̂. At each iteration the model f is linearized using a 1st order Taylor expansion
the matrix J is known as the Jacobian and is a function of constants, the independent variables,
and the parameters ~θ and varies from one iteration to the next. In terms of this linearized model,
we can write
∆Y~ =Y ~ − f k (θ̂) = ~rk (64)
Substituting the iterative update into the residual sum of squares S we recover the normal equations
with the Jacobian in place of X:
~,
(J T J)∆θ = J T ∆Y (65)
forming the basis for the Gauss-Newton algorithm for solution of NLLS. As in the linear case these
can be solved using QR, Cholesky or singular value decomposition for stability, but it’s important to
note that the algorithm will only converge when the objective function is approximately quadratic
in the parameters. Various techniques have been devised for dealing with divergence during
the iterative process. See Bob Parker’s notes on Optimization from the SIO239 math class (or
Numerical Recipes, or Practical Optimization, by Gill, Murray and Wright) for details of shift
cutting, steepest descent methods, and the Levenberg-Marquadt algorithm.
2008
c D.C. Agnew/ C. Constable