Kelley - Iterative Methods For Optimization-SIAM (1999) PDF
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.
Contents
Preface
3 Global Convergence
3.1 The Method of Steepest Descent
3.2 Line Search Methods and the Armijo Rule
3.2.1 Stepsize Control with Polynomial Models
3.2.2 Slow Convergence of Steepest Descent
3.2.3 Damped Gauss–Newton Iteration
3.2.4 Nonlinear Conjugate Gradient Methods
3.3 Trust Region Methods
3.3.1 Changing the Trust Region and the Step
3.3.2 Global Convergence of Trust Region Algorithms
3.3.3 A Unidirectional Trust Region Algorithm
3.3.4 The Exact Solution of the Trust Region Problem
3.3.5 The Levenberg–Marquardt Parameter
3.3.6 Superlinear Convergence: The Dogleg
3.3.7 A Trust Region Method for Newton–CG
3.4 Examples
3.4.1 Parameter Identification
3.4.2 Discrete Control Problem
3.5 Exercises on Global Convergence
Bibliography
Index
Preface
This book on unconstrained and bound constrained optimization can be used as a tutorial for
self-study or a reference by those who solve such problems in their work. It can also serve as a
textbook in an introductory optimization course.
As in my earlier book [154] on linear and nonlinear equations, we treat a small number of
methods in depth, giving a less detailed description of only a few (for example, the nonlinear
conjugate gradient method and the DIRECT algorithm). We aim for clarity and brevity rather
than complete generality and confine our scope to algorithms that are easy to implement (by the
reader!) and understand.
One consequence of this approach is that the algorithms in this book are often special cases
of more general ones in the literature. For example, in Chapter 3, we provide details only
for trust region globalizations of Newton’s method for unconstrained problems and line search
globalizations of the BFGS quasi-Newton method for unconstrained and bound constrained
problems. We refer the reader to the literature for more general results. Our intention is that
both our algorithms and proofs, being special cases, are more concise and simple than others in
the literature and illustrate the central issues more clearly than a fully general formulation.
Part II of this book covers some algorithms for noisy or global optimization or both. There
are many interesting algorithms in this class, and this book is limited to those deterministic
algorithms that can be implemented in a more-or-less straightforward way. We do not, for
example, cover simulated annealing, genetic algorithms, response surface methods, or random
search procedures.
The reader of this book should be familiar with the material in an elementary graduate level
course in numerical analysis, in particular direct and iterative methods for the solution of linear
equations and linear least squares problems. The material in texts such as [127] and [264] is
sufficient.
A suite of MATLAB∗ codes has been written to accompany this book. These codes were
used to generate the computational examples in the book, but the algorithms do not depend
on the MATLAB environment and the reader can easily implement the algorithms in another
language, either directly from the algorithmic descriptions or by translating the MATLAB code.
The MATLAB environment is an excellent choice for experimentation, doing the exercises, and
small-to-medium-scale production work. Large-scale work on high-performance computers is
best done in another language. The reader should also be aware that there is a large amount of
high-quality software available for optimization. The book [195], for example, provides pointers
to several useful packages.
Parts of this book are based upon work supported by the National Science Foundation over
several years, most recently under National Science Foundation grants DMS-9321938, DMS-
9700569, and DMS-9714811, and by allocations of computing resources from the North Carolina
Supercomputing Center. Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the author and do not necessarily reflect the views of the National Science Foundation or of the North Carolina Supercomputing Center.
∗ MATLAB is a registered trademark of The MathWorks, Inc., 24 Prime Park Way, Natick, MA 01760, USA.
The list of students and colleagues who have helped me with this project, directly, through
collaborations/discussions on issues that I treat in the manuscript, by providing pointers to the
literature, or as a source of inspiration, is long. I am particularly indebted to Tom Banks, Jim
Banoczi, John Betts, David Bortz, Steve Campbell, Tony Choi, Andy Conn, Douglas Cooper, Joe
David, John Dennis, Owen Eslinger, Jörg Gablonsky, Paul Gilmore, Matthias Heinkenschloß,
Laura Helfrich, Lea Jenkins, Vickie Kearn, Carl and Betty Kelley, Debbie Lockhart, Casey Miller,
Jorge Moré, Mary Rose Muccie, John Nelder, Chung-Wei Ng, Deborah Poulson, Ekkehard
Sachs, Dave Shanno, Joseph Skudlarek, Dan Sorensen, John Strikwerda, Mike Tocci, Jon Tolle,
Virginia Torczon, Floria Tosca, Hien Tran, Margaret Wright, Steve Wright, and Kevin Yoemans.
C. T. Kelley
Raleigh, North Carolina
All computations reported in this book were done in MATLAB (version 5.2 on various SUN
SPARCstations and on an Apple Macintosh Powerbook 2400). The suite of MATLAB codes that
we used for the examples is available by anonymous ftp from ftp.math.ncsu.edu in the directory
FTP/kelley/optimization/matlab
and from
http://www.siam.org/books/fr18/
Part I
Chapter 1
Basic Concepts
or to say that we seek to solve the problem min f . The understanding is that (1.1) means that we
seek a local minimizer. We will refer to f as the objective function and to f (x∗ ) as the minimum
or minimum value. If a local minimizer x∗ exists, we say a minimum is attained at x∗ .
We say that problem (1.2) is unconstrained because we impose no conditions on the independent variables x and assume that f is defined for all x.
The local minimization problem is different from (and much easier than) the global minimization problem in which a global minimizer, a point x∗ such that
(1.3) f (x∗ ) ≤ f (x) for all x,
is sought.
The constrained optimization problem is to minimize a function f over a set U ⊂ RN . A
local minimizer, therefore, is an x∗ ∈ U such that
(1.4) f (x∗ ) ≤ f (x) for all x ∈ U near x∗ ,
or say that we seek to solve the problem minU f . A global minimizer is a point x∗ ∈ U such
that
(1.6) f (x∗ ) ≤ f (x) for all x ∈ U .
We consider only the simplest constrained problems in this book (Chapter 5 and §7.4) and refer
the reader to [104], [117], [195], and [66] for deeper discussions of constrained optimization
and pointers to software.
Having posed an optimization problem one can proceed in the classical way and use methods
that require smoothness of f . That is the approach we take in this first part of the book. These
methods can fail if the objective function has discontinuities or irregularities. Such nonsmooth
effects are common and can be caused, for example, by truncation error in internal calculations
for f , noise in internal probabilistic modeling in f , table lookup within f , or use of experimental
data in f . We address a class of methods for dealing with such problems in Part II.
1.2 Notation
In this book, following the convention in [154], vectors are to be understood as column vectors.
The vector x∗ will denote a solution, x a potential solution, and {xk }k≥0 the sequence of iterates.
We will refer to x0 as the initial iterate. x0 is sometimes timidly called the initial guess. We will
denote the ith component of a vector x by (x)i (note the parentheses) and the ith component
of xk by (xk )i . We will rarely need to refer to individual components of vectors. We will let
∂f /∂xi denote the partial derivative of f with respect to (x)i . As is standard [154], e = x − x∗
will denote the error, en = xn − x∗ the error in the nth iterate, and B(r) the ball of radius r
about x∗
B(r) = {x | ‖e‖ < r}.
For x ∈ RN we let ∇f (x) ∈ RN denote the gradient of f ,
∇f (x) = (∂f /∂x1 , . . . , ∂f /∂xN ),
when it exists.
We let ∇2 f denote the Hessian of f ,
(∇2 f )ij = ∂ 2 f /∂xi ∂xj ,
when it exists. Note that ∇2 f is the Jacobian of ∇f . However, ∇2 f has more structure than
a Jacobian for a general nonlinear function. If f is twice continuously differentiable, then the
Hessian is symmetric ((∇2 f )ij = (∇2 f )ji ) by equality of mixed partial derivatives [229].
In this book we will consistently use the Euclidean norm
‖x‖ = ( Σ_{i=1}^{N} (x)i² )^{1/2} .
When we refer to a matrix norm we will mean the matrix norm induced by the Euclidean norm
‖A‖ = max_{x≠0} ‖Ax‖ / ‖x‖ .
We will use two forms of the fundamental theorem of calculus, one for the function–gradient
pair and one for the gradient–Hessian.
Theorem 1.2.1. Let f be twice continuously differentiable in a neighborhood of a line
segment between points x∗ , x = x∗ + e ∈ RN ; then
f (x) = f (x∗ ) + ∫_0^1 ∇f (x∗ + te)T e dt
and
∇f (x) = ∇f (x∗ ) + ∫_0^1 ∇2 f (x∗ + te)e dt.
A direct consequence (see Exercise 1.7.1) of Theorem 1.2.1 is the following form of Taylor’s
theorem we will use throughout this book.
Theorem 1.2.2. Let f be twice continuously differentiable in a neighborhood of a point
x∗ ∈ RN . Then for e ∈ RN and ‖e‖ sufficiently small
f (x∗ + e) = f (x∗ ) + ∇f (x∗ )T e + (1/2) eT ∇2 f (x∗ )e + o(‖e‖²).
Applying this expansion with e = tu for a fixed u ∈ RN and t small gives
f (x∗ + tu) = f (x∗ ) + t∇f (x∗ )T u + (t²/2) uT ∇2 f (x∗ )u + o(t²).
Since x∗ is a local minimizer we must have for t sufficiently small 0 ≤ f (x∗ + tu) − f (x∗ ) and
hence
(1.8) ∇f (x∗ )T u + (t/2) uT ∇2 f (x∗ )u + o(t) ≥ 0
for all t sufficiently small and all u ∈ RN . So if we set t = 0 and u = −∇f (x∗ ) we obtain
‖∇f (x∗ )‖² = 0.
Since ∇f (x∗ ) = 0, the same expansion gives
f (x∗ + tu) = f (x∗ ) + t∇f (x∗ )T u + (t²/2) uT ∇2 f (x∗ )u + o(t²)
= f (x∗ ) + (t²/2) uT ∇2 f (x∗ )u + o(t²).
f (x∗ + tu) − f (x∗ ) ≥ (λ/2) ‖tu‖² + o(t²) > 0
(1.9) f (x) = −xT b + (1/2) xT Hx.
Quadratic functions form the basis for most of the algorithms in Part I, which approximate an
objective function f by a quadratic model and minimize that model. In this section we discuss
some elementary issues in quadratic optimization.
Clearly,
∇f (x) = −b + Hx and ∇2 f (x) = H.
(1.11) Hx∗ = b.
In particular, if H is spd (and hence nonsingular), the unique global minimizer is the solution of
the linear system (1.11).
If H is a dense matrix and N is not too large, it is reasonable to solve (1.11) by computing
the Cholesky factorization [249], [127] of H
H = LLT ,
where L is a nonsingular lower triangular matrix with positive diagonal, and then solving (1.11)
by two triangular solves. If H is indefinite the Cholesky factorization will not exist and the
standard implementation [127], [249], [264] will fail because the computation of the diagonal
of L will require a real square root of a negative number or a division by zero.
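For readers who want to experiment, a minimal MATLAB sketch of this approach is given below. It is illustrative only (it is not part of the codes that accompany the book), and it assumes that H and b already hold the data of (1.9)–(1.11).

    % Solve H x = b for spd H by a Cholesky factorization and two triangular solves.
    R = chol(H);        % H = R'*R with R upper triangular; chol fails if H is not spd
    x = R \ (R' \ b);   % forward solve with R', then back solve with R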
If N is very large, H is sparse, or a matrix representation of H is not available, a more
efficient approach is the conjugate gradient iteration [154], [141]. This iteration requires only
matrix–vector products, a feature which we will use in a direct way in §§2.5 and 3.3.7. Our
formulation of the algorithm uses x as both an input and output variable. On input x contains
x0 , the initial iterate, and on output the approximate solution. We terminate the iteration if the
relative residual is sufficiently small, i.e., ‖b − Hx‖ ≤ ε ‖b‖ .
Algorithm 1.5.1. cg(x, b, H, ε, kmax)
1. r = b − Hx, ρ0 = ‖r‖² , k = 1.
2. Do While √ρk−1 > ε ‖b‖ and k < kmax
(a) if k = 1 then p = r
else
β = ρk−1 /ρk−2 and p = r + βp
(b) w = Hp
(c) α = ρk−1 /pT w
(d) x = x + αp
(e) r = r − αw
(f) ρk = ‖r‖²
(g) k = k + 1
Note that if H is not spd, the denominator in α = ρk−1 /pT w may vanish, resulting in
breakdown of the iteration.
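A minimal MATLAB transcription of the iteration above might look like the following sketch. The function name cg_sketch and its calling sequence are illustrative; this is not the cg code from the book's suite.

    function [x, k] = cg_sketch(x, b, H, epsilon, kmax)
    % Illustrative conjugate gradient iteration for H x = b with H spd.
    r = b - H*x;
    rho = norm(r)^2;                         % rho_0
    k = 1;
    while sqrt(rho) > epsilon*norm(b) & k < kmax
        if k == 1
            p = r;
        else
            beta = rho/rho_old;
            p = r + beta*p;
        end
        w = H*p;
        alpha = rho/(p'*w);                  % may break down if H is not spd
        x = x + alpha*p;
        r = r - alpha*w;
        rho_old = rho;
        rho = norm(r)^2;                     % rho_k
        k = k + 1;
    end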
The conjugate gradient iteration minimizes f over an increasing sequence of nested subspaces
of RN [127], [154]. We have that
for k ≥ 1.
While in principle the iteration must converge after N iterations and conjugate gradient can
be regarded as a direct solver, N is, in practice, far too many iterations for the large problems to
which conjugate gradient is applied. As an iterative method, the performance of the conjugate
gradient algorithm depends both on b and on the spectrum of H (see [154] and the references
cited therein). A general convergence estimate [68], [60], which will suffice for the discussion
here, is
(1.12) ‖xk − x∗ ‖H ≤ 2 ‖x0 − x∗ ‖H [ (√κ(H) − 1) / (√κ(H) + 1) ]^k .
In (1.12), the H-norm of a vector is defined by
‖u‖H² = uT Hu
and
κ(H) = ‖H‖ ‖H −1 ‖ .
For spd H,
κ(H) = λl λs−1 ,
where λl and λs are the largest and smallest eigenvalues of H. Geometrically, κ(H) is large if
the ellipsoidal level surfaces of f are very far from spherical.
The conjugate gradient iteration will perform well if κ(H) is near 1 and may perform very
poorly if κ(H) is large. The performance can be improved by preconditioning, which transforms
(1.11) into one with a coefficient matrix having eigenvalues near 1. Suppose that M is spd and
a sufficiently good approximation to H −1 so that
κ(M 1/2 HM 1/2 )
is much smaller than κ(H). In that case, (1.12) would indicate that far fewer conjugate gradient iterations might be needed to solve
(1.13) (M 1/2 HM 1/2 )z = M 1/2 b
than to solve (1.11). Moreover, the solution x∗ of (1.11) could be recovered from the solution z ∗ of (1.13) by
(1.14) x = M 1/2 z.
In practice, the square root of the preconditioning matrix M need not be computed. The algo-
rithm, using the same conventions that we used for cg, is
Algorithm 1.5.2. pcg(x, b, H, M, ε, kmax)
1. r = b − Hx, ρ0 = ‖r‖² , k = 1
2. Do While √ρk−1 > ε ‖b‖ and k < kmax
(a) z = M r
(b) τk−1 = z T r
Note that only products of M with vectors in RN are needed and that a matrix representation
of M need not be stored. We refer the reader to [11], [15], [127], and [154] for more discussion
of preconditioners and their construction.
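Because the listing of Algorithm pcg above is cut off, the following MATLAB sketch of a standard preconditioned CG loop may be helpful. It follows the same conventions (M is applied as an operator to residuals) and is an illustration, not the book's pcg code.

    function [x, k] = pcg_sketch(x, b, H, M, epsilon, kmax)
    % Illustrative preconditioned conjugate gradient iteration; M approximates inv(H).
    r = b - H*x;
    rho = norm(r)^2;
    k = 1;
    while sqrt(rho) > epsilon*norm(b) & k < kmax
        z = M*r;                             % application of the preconditioner
        tau = z'*r;
        if k == 1
            p = z;
        else
            beta = tau/tau_old;
            p = z + beta*p;
        end
        w = H*p;
        alpha = tau/(p'*w);
        x = x + alpha*p;
        r = r - alpha*w;
        tau_old = tau;
        rho = norm(r)^2;
        k = k + 1;
    end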
1.6 Examples
It will be useful to have some example problems to solve as we develop the algorithms. The
examples here are included to encourage the reader to experiment with the algorithms and play
with the MATLAB codes. The codes for the problems themselves are included with the set of
MATLAB codes. The author of this book does not encourage the reader to regard the examples
as anything more than examples. In particular, they are not real-world problems, and should not
be used as an exhaustive test suite for a code. While there are documented collections of test
problems (for example, [10] and [26]), the reader should always evaluate and compare algorithms
in the context of his/her own problems.
Some of the problems are directly related to applications. When that is the case we will cite
some of the relevant literature. Other examples are included because they are small, simple, and
illustrate important effects that can be hidden by the complexity of more serious problems.
be found in [151]. The function space setting for the particular control problems of interest in this
section can be found in [170], [158], and [159], as can a discussion of more general problems.
The infinite-dimensional problem is
(1.15) min f,
u
where
(1.16) f (u) = ∫_0^T L(y(t), u(t), t) dt,
and we seek an optimal point u ∈ L∞ [0, T ]. u is called the control variable or simply the
control. The function L is given and y, the state variable, satisfies the initial value problem
(with ẏ = dy/dt)
(1.17) ẏ(t) = φ(y(t), u(t), t), y(0) = y0 .
One could view the problem (1.15)–(1.17) as a constrained optimization problem or, as we
do here, think of the evaluation of f as requiring the solution of (1.17) before the integral on the
right side of (1.16) can be evaluated. This means that evaluation of f requires the solution of
(1.17), which is called the state equation.
∇f (u), the gradient of f at u with respect to the L2 inner product, is uniquely determined,
if it exists, by
(1.18) f (u + w) − f (u) − ∫_0^T (∇f (u))(t)w(t) dt = o(‖w‖)
as ‖w‖ → 0, uniformly in w. If ∇f (u) exists then
∫_0^T (∇f (u))(t)w(t) dt = d f (u + ξw)/dξ |_{ξ=0} .
So computing the gradient requires u and y, hence a solution of the state equation, and p, which
requires a solution of (1.20), a final-value problem for the adjoint equation. In the general case,
(1.17) is nonlinear, but (1.20) is a linear problem for p, which should be expected to be easier
to solve. This is the motivation for our claim that a gradient evaluation costs little more than a
function evaluation.
The discrete problems of interest here are constructed by solving (1.17) by numerical in-
tegration. After doing that, one can derive an adjoint variable and compute gradients using a
discrete form of (1.19). However, as is pointed out in [139], the equation for the adjoint variable of the discrete
problem is usually not a discretization of (1.20). For the forward Euler method, however, the
discretization of the adjoint equation is the adjoint equation for the discrete problem and we use
that discretization here for that reason.
The fully discrete problem is minu f , where u ∈ RN and
f (u) = Σ_{j=1}^{N} L((y)j , (u)j , j),
on the interval [0, T ]. We let x = (c, k)T be the vector of unknown parameters and, when the
dependence on the parameters needs to be explicit, we will write u(t : x) instead of u(t) for the
solution of (1.21). If the displacement is sampled at {tj }, j = 1, . . . , M , where tj = (j − 1)T /(M − 1), and the observations for u are {uj }, j = 1, . . . , M , then the objective function is
(1.22) f (x) = (1/2) Σ_{j=1}^{M} |u(tj : x) − uj |² .
We can compute the derivatives of u(t : x) with respect to the parameters by solving the
sensitivity equations. Differentiating (1.21) with respect to c and k and setting w1 = ∂u/∂c and
w2 = ∂u/∂k we obtain
f (x) = xT Hx
1.7.2. Consider the parameter identification problem for x = (c, k, ω, φ)T ∈ R4 associated with
the initial value problem
For what values of x is u differentiable? Derive the sensitivity equations for those values
of x for which u is differentiable.
1.7.3. Solve the system of sensitivity equations from exercise 1.7.2 numerically for c = 10,
k = 1, ω = π, and φ = 0 using the integrator of your choice. What happens if you use a
nonstiff integrator?
1.7.4. Let N = 2, d = (1, 1)T , and let f (x) = xT d + xT x. Compute, by hand, the minimizer
using conjugate gradient iteration.
1.7.5. For the same f as in exercise 1.7.4 solve the constrained optimization problem
min f (x),
x∈U
where U is the circle centered at (0, 0)T of radius 1/3. You can solve this by inspection;
no computer and very little mathematics is needed.
Chapter 2
By a local convergence method we mean one that requires that the initial iterate x0 is close to a
local minimizer x∗ at which the sufficient conditions hold.
• xn → x∗ q-quadratically if xn → x∗ and there is K > 0 such that
‖xn+1 − x∗ ‖ ≤ K ‖xn − x∗ ‖² .
• xn → x∗ with q-order α > 1 if xn → x∗ and there is K > 0 such that
‖xn+1 − x∗ ‖ ≤ K ‖xn − x∗ ‖^α .
• xn → x∗ q-superlinearly if
lim_{n→∞} ‖xn+1 − x∗ ‖ / ‖xn − x∗ ‖ = 0.
• xn → x∗ q-linearly with q-factor σ ∈ (0, 1) if
‖xn+1 − x∗ ‖ ≤ σ ‖xn − x∗ ‖
for n sufficiently large.
We remind the reader that a q-superlinearly convergent sequence is also q-linearly conver-
gent with q-factor σ for any σ > 0. A q-quadratically convergent sequence is q-superlinearly
convergent with q-order of 2.
In some cases the accuracy of the iteration can be improved by means that are external
to the algorithm, say, by evaluation of the objective function and its gradient with increasing
accuracy as the iteration progresses. In such cases, one has no guarantee that the accuracy of
the iteration is monotonically increasing but only that the accuracy of the results is improving at
a rate determined by the improving accuracy in the function–gradient evaluations. The concept
of r-type convergence captures this effect.
Definition 2.1.3. Let {xn } ⊂ RN and x∗ ∈ RN . Then {xn } converges to x∗ r-(quadratically, superlinearly, linearly) if there is a sequence {ξn } ⊂ R converging q-(quadratically, superlinearly, linearly) to 0 such that
‖xn − x∗ ‖ ≤ ξn .
We say that {xn } converges r-superlinearly with r-order α > 1 if ξn → 0 q-superlinearly with
q-order α.
1. ∇2 f is Lipschitz continuous with Lipschitz constant γ, i.e.,
(2.1) ‖∇2 f (x) − ∇2 f (y)‖ ≤ γ ‖x − y‖ .
2. ∇f (x∗ ) = 0.
3. ∇2 f (x∗ ) is positive definite.
We sometimes say that f is twice Lipschitz continuously differentiable with Lipschitz constant
γ to mean that part 1 of the standard assumptions holds.
If the standard assumptions hold then Theorem 1.4.1 implies that x∗ is a local minimizer
of f . Moreover, the standard assumptions for nonlinear equations [154] hold for the system
∇f (x) = 0. This means that all of the local convergence results for nonlinear equations can be
applied to unconstrained optimization problems. In this chapter we will quote those results from
nonlinear equations as they apply to unconstrained optimization. However, these statements
must be understood in the context of optimization. We will use, for example, the fact that the
Hessian (the Jacobian of ∇f ) is positive definite at x∗ in our solution of the linear equation for
the Newton step. We will also use this in our interpretation of the Newton iteration.
Solving Mc (x+ ) = 0 leads to the standard formula for the Newton iteration
(2.2) x+ = xc − F ′ (xc )−1 F (xc ).
One could say that Newton’s method for unconstrained optimization is simply the method
for nonlinear equations applied to ∇f (x) = 0. While this is technically correct if xc is near a
minimizer, it is utterly wrong if xc is near a maximum. A more precise way of expressing the
idea is to say that x+ is a minimizer of the local quadratic model of f about xc .
mc (x) = f (xc ) + ∇f (xc )T (x − xc ) + (1/2)(x − xc )T ∇2 f (xc )(x − xc ).
If ∇2 f (xc ) is positive definite, then the minimizer x+ of mc is the unique solution of ∇mc (x) =
0. Hence,
0 = ∇mc (x+ ) = ∇f (xc ) + ∇2 f (xc )(x+ − xc ).
Therefore,
(2.3) x+ = xc − (∇2 f (xc ))−1 ∇f (xc ),
which is the same as (2.2) with F replaced by ∇f and F ′ by ∇2 f . Of course, x+ is not computed by forming an inverse matrix. Rather, given xc , ∇f (xc ) is computed and the linear equation
(2.4) ∇2 f (xc )s = −∇f (xc )
is solved for the step s = x+ − xc .
(2.8) ‖e+ ‖ ≤ K ‖ec ‖² .
Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 hold. By Theorem 1.2.1
e+ = ec − ∇2 f (xc )−1 ∇f (xc ) = (∇2 f (xc ))−1 ∫_0^1 (∇2 f (xc ) − ∇2 f (x∗ + tec ))ec dt.
‖e+ ‖ ≤ (2 ‖(∇2 f (x∗ ))−1 ‖) γ ‖ec ‖² /2.
converges q-quadratically to x∗ .
Proof. Let δ be small enough so that the conclusions of Theorem 2.3.2 hold. Reduce δ if
needed so that Kδ = η < 1. Then if n ≥ 0 and xn ∈ B(δ), Theorem 2.3.2 implies that
(2.9) ‖en+1 ‖ ≤ K ‖en ‖² ≤ η ‖en ‖ < ‖en ‖
and hence xn+1 ∈ B(ηδ) ⊂ B(δ). Therefore, if xn ∈ B(δ) we may continue the iteration. Since
x0 ∈ B(δ) by assumption, the entire sequence {xn } ⊂ B(δ). (2.9) then implies that xn → x∗
q-quadratically.
Newton’s method, from the local convergence point of view, is exactly the same as that
for nonlinear equations applied to the problem of finding a root of ∇f . We exploit the extra
structure of positive definiteness of ∇2 f with an implementation of Newton’s method based on
the Cholesky factorization [127], [249], [264]
(2.11) ‖∇f (xn )‖ ≤ τr ‖∇f (x0 )‖ ,
Algorithm 2.3.1. newton(x, f, τ )
1. r0 = ‖∇f (x)‖
2. Do while ‖∇f (x)‖ > τr r0 + τa
(a) Compute ∇2 f (x)
(b) Factor ∇2 f (x) = LLT
(c) Solve LLT s = −∇f (x)
(d) x = x + s
(e) Compute ∇f (x).
Algorithm newton, as formulated above, is not completely satisfactory. The value of the
objective function f is never used and step 2b will fail if ∇2 f is not positive definite. This failure,
in fact, could serve as a signal that one is too far from a minimizer for Newton’s method to be
directly applicable. However, if we are near enough (see Exercise 2.7.8) to a local minimizer,
as we assume in this chapter, all will be well and we may apply all the results from nonlinear
equations.
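A bare-bones MATLAB sketch of this local iteration, with the positive definiteness test delegated to chol, might read as follows. Here fgrad and fhess are assumed user-supplied routines for ∇f and ∇2 f ; the sketch is illustrative and is not the newton code from the book's suite.

    function x = newton_sketch(x, fgrad, fhess, taur, taua, maxit)
    % Illustrative local Newton iteration using a Cholesky factorization of the Hessian.
    g = feval(fgrad, x);
    r0 = norm(g);
    for it = 1:maxit
        if norm(g) <= taur*r0 + taua, break; end
        H = feval(fhess, x);
        R = chol(H);               % errors out if the Hessian is not positive definite
        s = -(R \ (R' \ g));       % solve H s = -grad f via two triangular solves
        x = x + s;
        g = feval(fgrad, x);
    end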
and the accuracy in ∇2 f will be much less than that of a Jacobian in the nonlinear equations
case.
If εf is significantly larger than machine roundoff, (2.13) indicates that using numerical Hessians based on a second numerical differentiation of the objective function will not be very accurate. Even in the best possible case, where εf is the same size as machine roundoff, finite difference Hessians will not be very accurate and will be very expensive to compute if the Hessian is dense. If, as on most computers today, machine roundoff is (roughly) 10−16 , (2.13) indicates that a forward difference Hessian will be accurate to roughly four decimal digits.
One can obtain better results with centered differences, but at a cost of twice the number of
function evaluations. A centered difference approximation to ∇f is
Dh f (x) = (fˆ(x + h) − fˆ(x − h)) / (2h)
and the error is O(h² + εf /h), which is minimized if h = O(εf^{1/3}), leading to an error in the gradient of εg = O(εf^{2/3}). Therefore, a central difference Hessian will have an error of
∆ = O((εg )^{2/3}) = O(εf^{4/9}),
which is substantially better. We will find that accurate gradients are much more important than accurate Hessians and one option is to compute gradients with central differences and Hessians with forward differences. If one does that the centered difference gradient error is O(εf^{2/3}) and therefore the forward difference Hessian error will be
∆ = O(√εg ) = O(εf^{1/3}).
More elaborate schemes [22] compute a difference gradient and then reuse the function evalua-
tions in the Hessian computation.
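As a concrete illustration of the difference choices discussed above, the following MATLAB sketch computes a central difference gradient. It is illustrative only; f is an assumed function name or handle, and h would be chosen on the order of εf^{1/3} as described in the text.

    function g = cdgrad_sketch(f, x, h)
    % Illustrative central difference approximation to the gradient of f at x.
    N = length(x);
    g = zeros(N, 1);
    for i = 1:N
        e = zeros(N, 1); e(i) = 1;                       % ith coordinate direction
        g(i) = (feval(f, x + h*e) - feval(f, x - h*e))/(2*h);
    end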
In many optimization problems, however, accurate gradients are available. When that is the
case, numerical differentiation to compute Hessians is, like numerical computation of Jacobians
for nonlinear equations [154], a reasonable idea for many problems and the less expensive
forward differences work well.
Clever implementations of difference computation can exploit sparsity in the Hessian [67],
[59] to evaluate a forward difference approximation with far fewer than N evaluations of ∇f .
In the sparse case it is also possible [22], [23] to reuse the points from a centered difference
approximation to the gradient to create a second-order accurate Hessian.
Unless εg (xn ) → 0 as the iteration progresses, one cannot expect convergence. For this
reason estimates like (2.14) are sometimes called local improvement [88] results. Theorem 2.3.4
is a typical example.
Theorem 2.3.4. Let the standard assumptions hold. Then there are K̄ > 0, δ > 0, and
δ1 > 0 such that if xc ∈ B(δ) and ‖∆(xc )‖ < δ1 then
(2.14) ‖e+ ‖ ≤ K̄(‖ec ‖² + ‖∆(xc )‖ ‖ec ‖ + ‖εg (xc )‖).
Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2
hold. Let
x+^N = xc − (∇2 f (xc ))−1 ∇f (xc )
and note that
x+ = x+^N + ((∇2 f (xc ))−1 − (∇2 f (xc ) + ∆(xc ))−1 )∇f (xc ) − (∇2 f (xc ) + ∆(xc ))−1 εg (xc ).
Hence,
(2.15) ‖e+ ‖ ≤ K ‖ec ‖² + 2 ‖(∇2 f (xc ))−1 − (∇2 f (xc ) + ∆(xc ))−1 ‖ ‖∇2 f (x∗ )‖ ‖ec ‖ + ‖(∇2 f (xc ) + ∆(xc ))−1 ‖ ‖εg (xc )‖ .
If
‖∆(xc )‖ ≤ ‖(∇2 f (x∗ ))−1 ‖−1 /4,
then Lemma 2.3.1 implies that
‖∆(xc )‖ ≤ ‖(∇2 f (xc ))−1 ‖−1 /2
and the Banach Lemma [12], [154] states that ∇2 f (xc ) + ∆(xc ) is nonsingular and
‖(∇2 f (xc ) + ∆(xc ))−1 ‖ ≤ 2 ‖(∇2 f (xc ))−1 ‖ ≤ 4 ‖(∇2 f (x∗ ))−1 ‖ .
Hence,
‖(∇2 f (xc ))−1 − (∇2 f (xc ) + ∆(xc ))−1 ‖ ≤ 8 ‖(∇2 f (x∗ ))−1 ‖² ‖∆(xc )‖ .
Combining these estimates with (2.15) yields
‖e+ ‖ ≤ K ‖ec ‖² + 16 ‖(∇2 f (x∗ ))−1 ‖² ‖∇2 f (x∗ )‖ ‖∆(xc )‖ ‖ec ‖ + 4 ‖(∇2 f (x∗ ))−1 ‖ ‖εg (xc )‖ .
Setting
K̄ = K + 16 ‖(∇2 f (x∗ ))−1 ‖² ‖∇2 f (x∗ )‖ + 4 ‖(∇2 f (x∗ ))−1 ‖
and
(2.16) ‖∆(xc )‖ ≤ γ ‖x0 − xc ‖ ≤ γ(‖e0 ‖ + ‖ec ‖).
Algorithmically the chord iteration differs from the Newton iteration only in that the computation
and factorization of the Hessian is moved outside of the main loop.
Algorithm 2.3.2. chord(x, f, τ )
1. r0 = ‖∇f (x)‖
2. Compute ∇2 f (x)
3. Factor ∇2 f (x) = LLT
4. Do while ‖∇f (x)‖ > τr r0 + τa
(a) Solve LLT s = −∇f (x)
(b) x = x + s
(c) Compute ∇f (x).
(2.17) ‖en+1 ‖ ≤ KC ‖e0 ‖ ‖en ‖ .
Proof. Let δ be small enough so that the conclusions of Theorem 2.3.4 hold. Assume that
en , e0 ∈ B(δ). Combining (2.16) and (2.14) implies
‖en+1 ‖ ≤ K̄(‖en ‖(1 + γ) + γ ‖e0 ‖) ‖en ‖ ≤ K̄(1 + 2γ)δ ‖en ‖ .
then the chord iterates converge q-linearly to x∗ . Q-linear convergence implies that ‖en ‖ < ‖e0 ‖.
The Shamanskii method [233], [154], [211] is a generalization of the chord method that
updates Hessians after every m + 1 nonlinear iterations. Newton’s method corresponds to
m = 1 and the chord method to m = ∞. The convergence result is a direct consequence of
Theorems 2.3.3 and 2.3.5.
Theorem 2.3.6. Let the standard assumptions hold and let m ≥ 1 be given. Then there are
KS > 0 and δ > 0 such that if x0 ∈ B(δ) the Shamanskii iterates converge q-superlinearly to
x∗ with q-order m + 1 and
(2.18) ‖en+1 ‖ ≤ KS ‖en ‖^{m+1} .
As one more application of Theorem 2.3.4, we analyze the effects of a difference approxima-
tion of the Hessian. We follow the notation of [154] where possible. For example, to construct
a Hessian matrix, whose columns are ∇2 f (x)ej , where ej is the unit vector with jth compo-
nent 1 and other components 0, we could approximate the matrix–vector products ∇2 f (x)ej by
forward differences and then symmetrize the resulting matrix. We define
(2.19) ∇2h f (x) = (Ah + ATh )/2,
where Ah is the matrix whose jth column is Dh2 f (x : ej ). Dh2 f (x : w), the difference approxi-
mation of the action of the Hessian ∇2 f (x) on a vector w, is defined to be the quotient
(2.20) Dh2 f (x : w) = 0 if w = 0, and
Dh2 f (x : w) = (∇f (x + hw/‖w‖) − ∇f (x)) / (h/‖w‖) if w ≠ 0.
If we compute ∇f (x) + εg (x) and the gradient errors satisfy an estimate of the form
‖εg (x)‖ ≤ ε̄
for all x, then the computed difference Hessian is ∇h (∇f + εg ) and satisfies
(2.21) ‖∇2 f (x) − ∇h (∇f + εg )(x)‖ = O(h + ε̄/h).
So, as in [154], the choice h ≈ √ε̄ is optimal in the sense that it minimizes the quantity in the O-term in (2.21).
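The quotient in (2.20) is easy to code. The MATLAB sketch below is one way to do it; gradf is an assumed routine for the (possibly noisy) gradient and h is chosen as discussed above. This is an illustration, not the implementation used in the book's codes.

    function hv = hessvec_sketch(gradf, x, w, h)
    % Illustrative forward difference Hessian-vector product D_h^2 f(x : w) from (2.20).
    if norm(w) == 0
        hv = zeros(size(x));
    else
        xw = x + h*w/norm(w);                            % perturb in the direction of w
        hv = (feval(gradf, xw) - feval(gradf, x))*(norm(w)/h);
    end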
The local convergence theorem in this case is [88], [154], [278], as follows.
Theorem 2.3.7. Let the standard assumptions hold. Then there are δ, ε̄, and KD > 0 such that if xc ∈ B(δ), ‖εg (x)‖ ≤ ε̄c for all x ∈ B(δ), and
h ≥ M ‖εg (xc )‖
then
x+ = xc − (∇hc (∇f (xc ) + εg (xc )))−1 (∇f (xc ) + εg (xc ))
satisfies
‖e+ ‖ ≤ KD (ε̄c + (ε̄c + h) ‖ec ‖).
then f (xn ) → −∞ but f (xn+1 ) − f (xn ) = −1/(n + 1) → 0. The reader has been warned.
If the standard assumptions hold, then one may terminate the iteration when the norm of ∇f
is sufficiently small relative to ∇f (x0 ) (see [154]). We will summarize the key points here and
refer the reader to [154] for the details. The idea is that if ∇2 f (x∗ ) is well conditioned, then a
small gradient norm implies a small error norm. Hence, for any gradient-based iterative method,
termination on small gradients is reasonable. In the special case of Newton’s method, the norm
of the step is a very good indicator of the error and if one is willing to incur the added cost of an
extra iteration, a very sharp bound on the error can be obtained, as we will see below.
Lemma 2.3.8. Assume that the standard assumptions hold. Let δ > 0 be small enough so
that the conclusions of Lemma 2.3.1 hold for x ∈ B(δ). Then for all x ∈ B(δ)
(2.22) ‖e‖ / (4 ‖e0 ‖ κ(∇2 f (x∗ ))) ≤ ‖∇f (x)‖ / ‖∇f (x0 )‖ ≤ 4κ(∇2 f (x∗ )) ‖e‖ / ‖e0 ‖ .
The meaning of (2.22) is that, up to a constant multiplier, the norm of the relative gradient
is the same as the norm of the relative error. This partially motivates the termination condition
(2.12).
In the special case of Newton’s method, one can use the steplength as an accurate estimate
of the error because Theorem 2.3.2 implies that
(2.23) ‖ec ‖ = ‖s‖ + O(‖ec ‖²).
Hence, near the solution s and ec are essentially the same size. The cost of using (2.23) is that
all the information needed to compute x+ = xc + s has been computed. If we terminate the
iteration when ‖s‖ is smaller than our desired tolerance and then take x+ as the final result, we have attained more accuracy than we asked for. One possibility is to terminate the iteration when ‖s‖ = O(√τs ) for some τs > 0. This, together with (2.23), will imply that ‖ec ‖ = O(√τs ) and hence, using Theorem 2.3.2, that
(2.24) ‖e+ ‖ = O(‖ec ‖²) = O(τs ).
For a superlinearly convergent method, termination on small steps is equally valid but one
cannot use (2.24). For a superlinearly convergent method we have
(2.25) ‖ec ‖ = ‖s‖ + o(‖ec ‖) and ‖e+ ‖ = o(‖ec ‖).
| ‖s‖ − ‖ec ‖ | / ‖ec ‖ ≤ ‖s + ec ‖ / ‖ec ‖ = ‖e+ ‖ / ‖ec ‖ .
Hence, estimation of errors by steps is worthwhile only if convergence is fast. One can go further
[156] if one has an estimate ρ of the q-factor that satisfies
‖e+ ‖ ≤ ρ ‖ec ‖ .
In that case,
(1 − ρ) ‖ec ‖ ≤ ‖ec ‖ − ‖e+ ‖ ≤ ‖ec − e+ ‖ = ‖s‖ .
Hence
(2.26) ‖e+ ‖ ≤ ρ ‖ec ‖ ≤ [ρ/(1 − ρ)] ‖s‖ .
So, if the q-factor can be estimated from above by ρ and ‖s‖ < (1 − ρ)τs /ρ, then ‖e+ ‖ < τs . This approach is used in ODE and DAE codes [32], [234], [228], [213],
but requires good estimates of the q-factor and we do not advocate it for q-linearly convergent
methods for optimization. The danger is that if the convergence is slow, the approximate q-factor
can be a gross underestimate and cause premature termination of the iteration.
It is not uncommon for evaluations of f and ∇f to be very expensive and optimizations are,
therefore, usually allocated a fixed maximum number of iterations. Some algorithms, such as
the DIRECT algorithm [150], which we discuss in §8.4.2, assign a limit to the number of function
evaluations and terminate the iteration in only this way.
The vector R = (r1 , . . . , rM ) is called the residual. These problems arise in data fitting, for
example. In that case M is the number of observations and N is the number of parameters;
for these problems M > N and we say the problem is overdetermined. If M = N we have a
nonlinear equation and the theory and methods from [154] are applicable. If M < N the problem
is underdetermined. Overdetermined least squares problems arise most often in data fitting
applications like the parameter identification example in §1.6.2. Underdetermined problems are
less common, but are, for example, important in the solution of high-index differential algebraic
equations [48], [50].
The local convergence theory for underdetermined problems has the additional complexity
that the limit of the Gauss–Newton iteration is not uniquely determined by the distance of the
initial iterate to the set of points where R(x∗ ) = 0. In §2.4.3 we describe the difficulties and
state a simple convergence result.
If x∗ is a local minimizer of f and f (x∗ ) = 0, the problem min f is called a zero residual
problem (a remarkable and suspicious event in the data fitting scenario). If f (x∗ ) is small, the
expected result in data fitting if the model (i.e., R) is good, the problem is called a small residual
problem. Otherwise one has a large residual problem.
Nonlinear least squares problems are an intermediate stage between nonlinear equations and
optimization problems and the methods for their solution reflect this. We define the M × N
Jacobian R′ of R by
In the underdetermined case, if R′ (x∗ ) has full row rank, (2.30) implies that R(x∗ ) = 0; this is
not the case for overdetermined problems.
The cost of a gradient is roughly that of a Jacobian evaluation, which, as is the case with
nonlinear equations, is the most one is willing to accept. Computation of the Hessian (an N × N
matrix)
(2.31) ∇2 f (x) = R′ (x)T R′ (x) + Σ_{j=1}^{M} rj (x)T ∇2 rj (x)
requires computation of the M Hessians {∇2 rj } for the second-order term
Σ_{j=1}^{M} rj (x)T ∇2 rj (x),
where the second derivative R′′ is a tensor. The notation is to be interpreted in the following way. For v ∈ RM , R′′ (x)T v is the N × N matrix
R′′ (x)T v = ∇2 (R(x)T v) = Σ_{i=1}^{M} (v)i ∇2 ri (x).
We will use the tensor notation when expanding R about x∗ in some of the analysis to follow.
The standard assumptions for nonlinear least squares problems follow in Assumption 2.4.1.
Assumption 2.4.1. x∗ is a minimizer of ‖R‖² , R is Lipschitz continuously differentiable near x∗ , and R′ (x∗ )T R′ (x∗ ) has maximal rank. The rank assumption may also be stated as
• R′ (x∗ ) is nonsingular (M = N ),
Theorem 2.4.1. Let M > N . Let Assumption 2.4.1 hold. Then there are K > 0 and δ > 0
such that if xc ∈ B(δ) then the error in the Gauss–Newton iteration satisfies
(2.34) ‖e+ ‖ ≤ K(‖ec ‖² + ‖R(x∗ )‖ ‖ec ‖).
Note that
R′ (xc )ec − R(xc ) = R′ (xc )ec − R(x∗ ) + R(x∗ ) − R(xc ).
Now,
‖R′ (xc )ec + R(x∗ ) − R(xc )‖ ≤ γ ‖ec ‖² /2
and, since R′ (x∗ )T R(x∗ ) = 0,
Hence,
‖e+ ‖ ≤ ‖(R′ (xc )T R′ (xc ))−1 ‖ ‖R′ (xc )T ‖ γ ‖ec ‖² /2 + ‖(R′ (xc )T R′ (xc ))−1 ‖ γ ‖R(x∗ )‖ ‖ec ‖
(2.36)
≤ ‖(R′ (xc )T R′ (xc ))−1 ‖ γ ( ‖R′ (xc )T ‖ ‖ec ‖/2 + ‖R(x∗ )‖ ) ‖ec ‖ .
Setting
K = γ max_{x∈B(δ)} ‖(R′ (x)T R′ (x))−1 ‖ (1 + ‖R′ (x)T ‖)/2
‖ec ‖ term on the right side of (2.34) vanishes. For a problem other than a zero residual one, even q-linear convergence
is not guaranteed. In fact, if xc ∈ B(δ) then (2.35) will imply that ‖e+ ‖ ≤ r ‖ec ‖ for some 0 < r < 1 if
(2.37) K(δ + ‖R(x∗ )‖) ≤ r
and therefore the q-factor will be K ‖R(x∗ )‖ . Hence, for small residual problems and accurate
initial data the convergence of Gauss–Newton will be fast. Gauss–Newton may not converge at
all for large residual problems.
Equation (2.36) exposes a more subtle issue when the term γ ‖ec ‖ ‖R(x∗ )‖ is considered.
Using Taylor’s theorem and the necessary conditions (R′ (x∗ )T R(x∗ ) = 0) we have
Recall that
R′′ (x∗ )T R(x∗ ) = ∇2 f (x∗ ) − R′ (x∗ )T R′ (x∗ )
and hence
(2.38) ‖(R′ (x∗ ) − R′ (xc ))T R(x∗ )‖ ≤ ‖∇2 f (x∗ ) − R′ (x∗ )T R′ (x∗ )‖ ‖ec ‖ + O(‖ec ‖²).
In a sense (2.38) says that even for a large residual problem, convergence can be fast if the problem is not very nonlinear (small R′′ ). In the special case of a linear least squares problem (where R′′ = 0) Gauss–Newton becomes simply the solution of the normal equations and converges in one iteration.
So, Gauss–Newton can be expected to work well for overdetermined small residual problems
and good initial iterates. For large residual problems and/or initial data far from the solution,
there is no reason to expect Gauss–Newton to give good results. We address these issues in
§3.2.3.
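For an overdetermined problem with an available Jacobian, a local Gauss–Newton loop takes only a few lines of MATLAB. The sketch below is illustrative: resfun and jacfun are assumed user routines for R and R′, and no globalization or termination test is included.

    function x = gauss_newton_sketch(x, resfun, jacfun, nit)
    % Illustrative local Gauss-Newton iteration for min (1/2)||R(x)||^2 with M > N.
    for it = 1:nit
        R = feval(resfun, x);      % residual vector of length M
        J = feval(jacfun, x);      % M x N Jacobian R'(x)
        s = -(J \ R);              % backslash returns the least squares solution of J s = -R
        x = x + s;
    end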
(2.39) min ‖Ax − b‖² .
If A is M × N with M < N there will not be a unique minimizer but there will be a unique
minimizer with minimum norm. The minimum norm solution can be expressed in terms of the
singular value decomposition [127], [249] of A,
(2.40) A = U ΣV T .
In (2.40), Σ is an N × N diagonal matrix. The diagonal entries of Σ, {σi } are called the singular
values. σi ≥ 0 and σi = 0 if i > M . The columns of the M × N matrix U and the N × N
matrix V are called the left and right singular vectors. U and V have orthonormal columns and
hence the minimum norm solution of (2.39) is
x = A† b,
where A† = V Σ† U T ,
Σ† = diag(σ1† , . . . , σN† ),
and
σi† = σi−1 if σi ≠ 0, σi† = 0 if σi = 0.
A† is called the Moore–Penrose inverse [49], [189], [212]. If A is a square nonsingular matrix,
then A† = A−1 ; if M > N then the definition of A† using the singular value decomposition is
still valid; and, if A has full column rank, A† = (AT A)−1 AT .
Two simple properties of the Moore–Penrose inverse are that A† A is a projection onto the
range of A† and AA† is a projection onto the range of A. This means that
So the minimum norm solution of the local linear model (2.33) of an underdetermined
nonlinear least squares problem can be written as [17], [102]
(2.41) x+ = xc − R′ (xc )† R(xc ).
The challenge in formulating a local convergence result is that there is not a unique optimal point
that attracts the iterates.
In the linear case, where R(x) = Ax − b, one gets
x+ = xc − A† (Axc − b) = (I − A† A)xc + A† b.
Theorem 2.4.2. Let M ≤ N and let Assumption 2.4.1 hold for some x∗ ∈ Z. Then there is
δ > 0 such that if ‖x0 − x∗ ‖ ≤ δ, then the Gauss–Newton iterates
B1 = {x | ‖x − x∗ ‖ ≤ ρ1 }
and the singular values of R′ (x) are bounded away from zero in B1 . We may, reducing ρ1 if necessary, apply the Kantorovich theorem [154], [151], [211] to show that if x ∈ B1 and w ∈ Z is such that
‖x − w‖ = min_{z∈Z} ‖x − z‖ ,
‖w − ξ(x)‖ = O(‖x − w‖²) ≤ ‖x − w‖/2
The method of the proof is to adjust δ so that the Gauss–Newton iterates remain in B1 and
R(xn ) → 0 sufficiently rapidly. We begin by requiring that δ < ρ1 /2.
Let xc ∈ B1 and let e = x − ξ(xc ). Taylor’s theorem, the fundamental theorem of calculus,
and (2.41) imply that
e+ = ec − R′ (xc )† R(xc )
= e0 − R′ (x∗ )† R(x) + O(‖ec ‖²)
(2.44) d(x+ ) ≤ ‖x+ − ξ(xc )‖ ≤ K1 ‖xc − ξ(xc )‖² ≤ K1 d(xc )² .
Now let
ρ2 = min(ρ1 , (2K1 )−1 ).
So if
xc ∈ B2 = {x | ‖x − x∗ ‖ ≤ ρ2 }
then
(2.45) d(x+ ) ≤ d(xc )/2.
Finally, there is K2 such that
‖x+ − x∗ ‖ ≤ ‖xc − x∗ ‖ + ‖x+ − xc ‖ = ‖xc − x∗ ‖ + ‖R′ (xc )† R(xc )‖ ≤ ‖xc − x∗ ‖ + K2 ‖xc − ξ(xc )‖ .
‖x+ − xc ‖ = ‖R′ (xc )† R(xc )‖ ≤ K3 ‖xc − ξ(xc )‖ ≤ K3 d(xc ).
Σ_{l=n}^{n+m} d(xl ) ≤ 2(1 − 2−m ) d(xn )
‖xn − z ∗ ‖ ≤ 2d(xn ),
(2.48) ‖e+ ‖ ≤ KI (‖ec ‖ + ηc ) ‖ec ‖ .
Proof. Let δ be small enough so that the conclusions of Lemma 2.3.1 and Theorem 2.3.2
hold. To prove the first assertion (2.48) note that if
and
(2.49) e+ = ec + s = ec − (∇2 f (xc ))−1 ∇f (xc ) − (∇2 f (xc ))−1 r.
Now, (2.47), (2.7), and (2.6) imply that
‖s + (∇2 f (xc ))−1 ∇f (xc )‖ ≤ ‖(∇2 f (xc ))−1 ‖ ηc ‖∇f (xc )‖
‖e+ ‖ ≤ ‖ec − ∇2 f (xc )−1 ∇f (xc )‖ + 4κ(∇2 f (x∗ ))ηc ‖ec ‖ ≤ K ‖ec ‖² + 4κ(∇2 f (x∗ ))ηc ‖ec ‖ ,
xn+1 = xn + sn ,
where
‖∇2 f (xn )sn + ∇f (xn )‖ ≤ ηn ‖∇f (xn )‖ ,
converges q-linearly to x∗ . Moreover
• if ηn ≤ Kη ‖∇f (xn )‖^p for some Kη > 0 the convergence is q-superlinear with q-order 1 + p.
The similarity between Theorem 2.5.2 and Theorem 2.3.5, the convergence result for the
chord method, should be clear. Rather than require that the approximate Hessian be accurate,
we demand that the linear iteration produce a sufficiently small relative residual. Theorem 2.5.3
is the remarkable statement that any reduction in the relative linear residual will suffice for linear
convergence in a certain norm. This statement implies [154] that
‖∇f (xn )‖ will converge to zero q-linearly, or, equivalently, that xn → x∗ q-linearly with respect to ‖ · ‖∗ , which is defined by
‖x‖∗ = ‖∇2 f (x∗ )x‖ .
Theorem 2.5.3. Let the standard assumptions hold. Then there is δ such that if xc ∈ B(δ),
s satisfies (2.47), x+ = xc + s, and ηc ≤ η < η̄ < 1, then
(2.50) ‖e+ ‖∗ ≤ η̄ ‖ec ‖∗ .
Theorem 2.5.4. Let the standard assumptions hold. Then there is δ such that if x0 ∈ B(δ),
{ηn } ⊂ [0, η] with η < η̄ < 1, then the inexact Newton iteration
xn+1 = xn + sn ,
where
‖∇2 f (xn )sn + ∇f (xn )‖ ≤ ηn ‖∇f (xn )‖
Forward Difference CG
Algorithm fdcg is an implementation of the solution by CG of the equation for the Newton step
(2.4). In this algorithm we take care to identify failure in CG (i.e., detection of a vector p for
which pT Hp ≤ 0). This failure either means that H is singular (pT Hp = 0; see exercise 2.7.3)
or that pT Hp < 0, i.e., p is a direction of negative curvature. The algorithms we will discuss
in §3.3.7 make good use of directions of negative curvature. The initial iterate for forward
difference CG iteration should be the zero vector. In this way the first iterate will give a steepest
descent step, a fact that is very useful. The inputs to Algorithm fdcg are the current point x,
the objective f , the forcing term η, and a limit on the number of iterations kmax. The output is
the inexact Newton direction d. Note that in step 2b Dh2 f (x : p) is used as an approximation to
∇2 f (x)p.
Algorithm 2.5.1. fdcg(x, f, η, kmax, d)
1. r = −∇f (x), ρ0 = ‖r‖² , k = 1, d = 0.
2. Do While √ρk−1 > η ‖∇f (x)‖ and k < kmax
(a) if k = 1 then p = r
else
β = ρk−1 /ρk−2 and p = r + βp
(b) w = Dh2 f (x : p)
If pT w = 0 signal indefiniteness; stop.
If pT w < 0 signal negative curvature; stop.
(c) α = ρk−1 /pT w
(d) d = d + αp
(e) r = r − αw
(f) ρk = ‖r‖²
(g) k = k + 1
Algorithm 2.5.2. fdpcg(x, f, M, η, kmax, d)
1. r = −∇f (x), ρ0 = ‖r‖² , k = 1, d = 0.
2. Do While √ρk−1 > η ‖∇f (x)‖ and k < kmax
(a) z = M r
(b) τk−1 = z T r
(c) if k = 1 then β = 0 and p = z
else
β = τk−1 /τk−2 , p = z + βp
(d) w = Dh2 f (x : p)
If pT w = 0 signal indefiniteness; stop.
If pT w < 0 signal negative curvature; stop.
(e) α = τk−1 /pT w
(f) d = d + αp
(g) r = r − αw
(h) ρk = rT r
(i) k = k + 1
In our formulation of Algorithms fdcg and fdpcg, indefiniteness is a signal that we are
not sufficiently near a minimum for the theory in this section to hold. In §3.3.7 we show how
negative curvature can be exploited when far from the solution.
One view of preconditioning is that it is no more than a rescaling of the independent variables.
Suppose, rather than (1.2), we seek to solve min fˆ,
where fˆ(y) = f (M 1/2 y) and M is spd. If y ∗ is a local minimizer of fˆ, then x∗ = M 1/2 y ∗ is
a local minimizer of f and the two problems are equivalent. Moreover, if x = M 1/2 y and ∇x
and ∇y denote gradients in the x and y coordinates, then
∇y fˆ(y) = M 1/2 ∇x f (x)
and
∇2y fˆ(y) = M 1/2 (∇2x f (x))M 1/2 .
Hence, the scaling matrix plays the role of the square root of the preconditioner for the precon-
ditioned conjugate gradient algorithm.
Newton–CG
The theory guarantees that if x0 is near enough to a local minimizer then ∇2 f (xn ) will be spd
for the entire iteration and xn will converge rapidly to x∗ . Hence, Algorithm newtcg will not
terminate with failure because of an increase in f or an indefinite Hessian. Note that both the
forcing term η and the preconditioner M can change as the iteration progresses.
1. rc = r0 = ‖∇f (x)‖
2. Do while ‖∇f (x)‖ > τr r0 + τa
However, since the map p → Dh2 f (x : p) is not linear in p, the quality of B as an approximation
to ∇2 f (x) may degrade as the linear iteration progresses. Usually this will not cause problems
unless many iterations are needed to satisfy the inexact Newton condition. However, if one does
not see the expected rate of convergence in a Newton–CG iteration, this could be a factor [128].
One partial remedy is to use a centered-difference Hessian–vector product [162], which reduces
the error in B. In exercise 2.7.15 we discuss a more complex and imaginative way to compute
accurate Hessians.
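A centered difference Hessian–vector product costs one extra gradient evaluation per product. One possible MATLAB sketch, with gradf an assumed gradient routine, is given below; it is illustrative and not the book's implementation.

    function hv = cd_hessvec_sketch(gradf, x, w, h)
    % Illustrative centered difference Hessian-vector product; more accurate than the
    % forward difference quotient (2.20) at the cost of two gradient calls per product.
    if norm(w) == 0
        hv = zeros(size(x));
    else
        u = w/norm(w);
        hv = (feval(gradf, x + h*u) - feval(gradf, x - h*u))*(norm(w)/(2*h));
    end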
2.6 Examples
In this section we used the collection of MATLAB codes but disabled the features (see Chapter 3)
that assist in convergence when far from a minimizer. We took care to make certain that the
initial iterates were near enough to the minimizer so that the observations of local convergence
corresponded to the theory. In practical optimization problems, good initial data is usually not
available and the globally convergent methods discussed in Chapter 3 must be used to start the
iteration.
The plots in this section have the characteristics of local convergence in that both the gradient
norms and function values are decreasing. The reader should contrast this with the examples in
Chapter 3.
            Newton                       Gauss–Newton
 n    ‖∇f (xn )‖    f (xn )        ‖∇f (xn )‖    f (xn )
 0    2.33e+01      7.88e-01       2.33e+01      7.88e-01
 1    6.87e+00      9.90e-02       1.77e+00      6.76e-03
 2    4.59e-01      6.58e-04       1.01e-02      4.57e-07
 3    2.96e-03      3.06e-08       9.84e-07      2.28e-14
 4    2.16e-06      4.15e-14
Figure 2.1 is a graphical representation of the convergence history from Table 2.1. We think
that the plots are a more effective way to understand iteration statistics and will present mostly
graphs for the remainder of the book. The concavity of the plots of the gradient norms is the
signature of superlinear convergence.
[Figure 2.1: gradient norm and function value histories against iterations for the Newton and Gauss–Newton iterations.]
We next illustrate the difference between Gauss–Newton and Newton on a nonzero residual
problem. We use the same example as before with the observations randomly perturbed. We
used the MATLAB rand function for this, perturbing the samples of the analytic solution by
.5 × rand(M, 1). The least squares residual is about 3.6 and the plots in Figure 2.2 indicate
that Newton’s method is still converging quadratically, but the rate of Gauss–Newton is linear.
The linear convergence of Gauss–Newton can be seen clearly from the linear semilog plot of the
gradient norms. Even so, the Gauss–Newton iteration was more efficient, in terms of floating
point operations, than Newton’s method. The Gauss–Newton iteration took roughly 1 million
floating point operations while the Newton iteration took 1.4 million.
with Newton–CG and two different choices, η = .1, .0001, of the forcing term. The initial
iterate was u0 = (10, 10, . . . , 10)T and the iteration was terminated when ‖∇f ‖ < 10−8 .
Figure 2.2: Local Optimization for the Parameter ID Problem, Nonzero Residual
In Figure 2.3 one can see that the small forcing term produces an iteration history with the concavity
of superlinear convergence. The limiting q-linear behavior of an iteration with constant η is not
yet visible. The iteration with the larger value of η is in the q-linearly convergent stage, as the
linear plot of ∇f against the iteration counter shows.
The cost of the computation is not reflected by the number of nonlinear iterations. When
η = .0001, the high accuracy of the linear solve is not rewarded. The computation with η = .0001
required 8 nonlinear iterations, a total of 32 CG iterations, roughly 1.25 million floating point
operations, and 41 gradient evaluations. The optimization with η = .1 needed 10 nonlinear
iterations, a total of 13 CG iterations, roughly 820 thousand floating point operations, and 24
gradient evaluations.
Figure 2.3: Newton–CG for the Discrete Control Problem: η = .1, .0001
3. f (x) = x4 .
Use difference steps of h = 10−1 , 10−2 , 10−4 , and 10−8 . Explain your results.
2.7.2. Repeat part (c) of exercise 2.7.1. Experiment with
f (x) = exp(x² ) + 10−4 rand(x) and f (x) = x² + 10−4 rand(x),
where rand denotes the random number generator in your computing environment. Ex-
plain the differences in the results.
2.7.3. Show that if A is symmetric, p ≠ 0, and pT Ap = 0, then A is either singular or indefinite.
2.7.4. Show that if b ∈ RN and the N × N matrix A is symmetric and has a negative eigenvalue,
then the quadratic functional
m(x) = xT Ax + xT b
expect? How would you extend this method to the case N > 1? Look at [30] for some
results on this.
2.7.6. Show that if the standard assumptions hold, h is sufficiently small, and x is sufficiently
near x∗ , the difference Hessian defined by (2.19), ∇2h f (x), is spd.
2.7.7. Write a locally convergent Newton method code based on accurate function and gradient
information and forward difference Hessians using (2.19). Be sure that your code tests for
positivity of the Hessian so that you can avoid convergence to a local maximum. Is the
test for positivity expensive? Apply your code to the parameter ID problem from §1.6.2.
If you use an ODE solver that lets you control the accuracy of the integration, try values
of the accuracy from 10−8 to 10−2 and see how the iteration performs. Be sure that your
difference Hessian reflects the accuracy in the gradient.
2.7.8. Let the standard assumptions hold and let λs > 0 be the smallest eigenvalue of ∇2 f (x∗ ).
Give the best (i.e., largest) bound you can for ρ such that ∇2 f (x) is positive definite for
all x ∈ B(ρ).
2.7.9. Use the definition of A† to prove (2.41).
2.7.10. Fill in the missing details in the proof of Theorem 2.4.2 by showing how the Kantorovich
theorem can be used to prove the existence of ξ(x).
2.7.11. Let f (x) = x² and εf (x) = sin(100x)/10. Using an initial iterate of x0 = 1, try to find
a local minimum of f + εf using Newton’s method with analytic gradients and Hessians.
Repeat the experiment with difference gradients and Hessians (try forward differences
with a step size of h = .2).
2.7.12. Solve the parameter ID problem from §2.6 with the observations perturbed randomly (for
example, you could use the MATLAB rand function for this). Vary the amplitude of the
perturbation and see how the performance of Newton and Gauss–Newton changes.
2.7.13. Derive sensitivity equations for the entries of the Hessian for the parameter ID objective
function. In general, if there are P parameters, how many sensitivity equations would
need to be solved for the gradient? How many for the Hessian?
2.7.14. Solve the discrete control problem from §2.6.2 using Newton–CG with forcing terms that
depend on n. Consider ηn = .5/n, ηn = min(.1, ‖∇f (un )‖), and some of the choices
from [99]. Vary N and the termination criteria and compare the performance with the
constant η choice in §2.6.2.
2.7.15. Let F : RN → RM (where M and N need not be the same) be sufficiently smooth (how
smooth is that?) and be such that F can also be computed for complex arguments. Show
that [181], [245]
Im(F (x + ihu))/h = F ′ (x)u + O(h² ),
where Im denotes imaginary part. What happens if there is error in F ? How can you use
this fact to compute better difference gradients and Hessians?
Chapter 3
Global Convergence
The locally convergent algorithms discussed in Chapter 2 can and do fail when the initial iterate
is not near the root. The reasons for this failure, as we explain below, are that the Newton
direction may fail to be a direction of descent for f and that even when a search direction is a
direction of decrease of f , as −∇f is, the length of the step can be too long. Hence, taking a
Newton (or Gauss–Newton, or inexact Newton) step can lead to an increase in the function and
divergence of the iteration (see exercise 3.5.14 for two dramatic examples of this). The globally
convergent algorithms developed in this chapter partially address this problem by either finding
a local minimum or failing in one of a small number of easily detectable ways.
These are not algorithms for global optimization. When these algorithms are applied to
problems with many local minima, the results of the iteration may depend in complex ways on
the initial iterate.
If we take the simple choice λ = 1, then x+ is not guaranteed to be nearer a solution than xc ,
even if xc is very near a solution that satisfies the standard assumptions. The reason for this is
that, unlike the Newton direction, the steepest descent direction scales with f . The Newton step,
on the other hand, is the same for f as it is for cf for any c ≠ 0 but need not be a direction of
decrease for f .
To make the method of steepest descent succeed, it is important to choose the steplength λ.
One way to do this, which we analyze in §3.2, is to let λ = β m , where β ∈ (0, 1) and m ≥ 0 is
the smallest nonnegative integer such that there is sufficient decrease in f . In the context of the
steepest descent algorithm, this means that
(3.2) f (xc − λ∇f (xc )) − f (xc ) < −αλ ‖∇f (xc )‖² .
This strategy, introduced in [7] and called the Armijo rule, is an example of a line search in
which one searches on a ray from xc in a direction in which f is locally decreasing. In (3.2),
α ∈ (0, 1) is a parameter, which we discuss after we fully specify the algorithm. This strategy
of repeatedly testing for sufficient decrease and reducing the stepsize if the test is failed is called
backtracking for obvious reasons.
is at least as much as a fraction of the predicted reduction in the linear model. The parameter α
is typically set to 10−4 .
The reason we demand sufficient decrease instead of simple decrease (i.e., f (x+ ) < f (xc )
or α = 0) is largely theoretical; a nonzero value of α is required within the proof to insure that
the iteration does not stagnate before convergence.
Algorithm 3.1.1. steep(x, f, kmax)
1. For k = 1, . . . , kmax
(a) Compute f and ∇f ; test for termination.
(b) Find the least integer m ≥ 0 such that (3.2) holds for λ = β m .
(c) x = x + λd.
2. If k = kmax and the termination test is failed, signal failure.
Clearly the steepest descent direction d = −∇f (x) is a descent direction. A line search
algorithm searches for decrease in f in a descent direction, using the Armijo rule for stepsize
control, unless ∇f (x) = 0.
We will consider descent directions based on quadratic models of f of the form
1
m(x) = f (xc ) + ∇f (xc )T (x − xc ) + (x − xc )T Hc (x − xc ),
2
where Hc , which is sometimes called the model Hessian, is spd. We let d = x − xc be such that
m(x) is minimized. Hence,
∇m(x) = ∇f (xc ) + Hc (x − xc ) = 0
and hence
(3.3) d = −Hc−1 ∇f (xc ).
GLOBAL CONVERGENCE 41
The steepest descent direction satisfies (3.3) with Hc = I. However, the Newton direction
d = −∇2 f (x)−1 ∇f (x) may fail to be a descent direction if x is far from a minimizer because
∇2 f may not be spd. Hence, unlike the case for nonlinear equations [154], Newton’s method is
not a generally good global method, even with a line search, and must be modified (see [113],
[117], [231], and [100]) to make sure that the model Hessians are spd.
The algorithm we analyze in this section is an extension of Algorithm steep that allows for
descent directions that satisfy (3.3) for spd H. We modify (3.2) to account for H and the new
descent direction d = −H −1 ∇f (x). The general sufficient decrease condition is
where 0 < βlow ≤ βhigh < 1. The choice β = βlow = βhigh is the simple rule in Algo-
rithm steep. An exact line search, in which λ is the exact minimum of f (xc + λd), is not only
not worth the extra expense but can degrade the performance of the algorithm.
Algorithm 3.2.1. optarm(x, f, kmax)
1. For k = 1, . . . , kmax
(a) Compute f and ∇f ; test for termination.
(b) Construct an spd matrix H and solve (3.3) to obtain a descent direction d.
(c) Beginning with λ = 1, repeatedly reduce λ using any strategy that satisfies (3.5)
until (3.4) holds.
(d) x = x + λd.
2. If k = kmax and the termination test is failed, signal failure.
In the remainder of this section we prove that if the sequence of model Hessians remains
uniformly bounded and positive definite and the sequence of function values {f (xk )} is bounded
from below, then any limit point of the sequence {xk } generated by Algorithm optarm con-
verges to a point x∗ that satisfies the necessary conditions for optimality. We follow that analysis
with a local convergence theory that is much less impressive than that for Newton’s method.
We begin our analysis with a simple estimate that follows directly from the spectral theorem
for spd matrices.
Lemma 3.2.1. Let H be spd with smallest and largest eigenvalues 0 < λs < λl . Then for
all z ∈ RN ,
λ−1 2
l
z
≤ z H
T −1
z ≤ λ−1
s
z
.
2
The first step in the analysis is to use Lemma 3.2.1 to obtain a lower bound for the steplength.
Lemma 3.2.2. Assume that ∇f is Lipschitz continuous with Lipschitz constant L. Let
α ∈ (0, 1), x ∈ RN , and H be an spd matrix. Let λs > 0 be the smallest and λl ≥ λs the
largest eigenvalues of H. Let d be given by (3.3). Assume that ∇f (x) = 0. Then (3.4) holds for
any λ such that
2λs (1 − α)
(3.6) 0<λ≤ .
Lκ(H)
Hence
f (x + λd) = f (x) + λ∇f (x)T d
(3.7) 1
+λ 0
(∇f (x + tλd) − ∇f (x))T d dt.
Therefore,
λ2 L
f (x + λd) = f (x − λH −1 ∇f (x)) ≤ f (x) + λ∇f (x)T d +
d
2 .
2
Positivity of H, Lemma 3.2.1, and κ(H) = λl λ−1
s imply that
d
2 =
H −1 ∇f (x)
2 ≤ λ−2 T
s ∇f (x) ∇f (x)
≤ −λl λ−2 T −1 T
s ∇f (x) d = −κ(H)λs ∇f (x) d.
Hence
f (x + λd) ≤ f (x) + (λ − λ2 Lλ−1 T
s κ(H)/2)∇f (x) d,
(3.8) κ(Hk ) ≤ κ̄
satisfy
2βlow λs (1 − α)
(3.9) λk ≥ λ̄ =
Lκ̄
and at most
2λs (1 − α)
(3.10) m = log / log(βhigh )
Lκ̄
stepsize reductions will be required.
Proof. In the context of Algorithm optarm, Lemma 3.2.2 implies that the line search will
terminate when
2λs (1 − α)
λ≤ ,
Lκ(Hk )
GLOBAL CONVERGENCE 43
if not before. The most that one can overshoot this is by a factor of βlow , which proves (3.9).
The line search will require at most m stepsize reductions, where m is the least nonnegative
integer such that
2λs (1 − α) m
> βhigh .
Lκ(Hk )
This implies (3.10).
The convergence theorem for Algorithm optarm says that if the condition numbers of the
matrices H and the norms of the iterates remain bounded, then every limit point of the iteration
is a stationary point. Boundedness of the sequence of iterates implies that there will be limit
points, but there is no guarantee that there is a unique limit point.
Theorem 3.2.4. Let ∇f be Lipschitz continuous with Lipschitz constant L. Assume that
the matrices Hk are spd and that there are κ̄ and λl such that κ(Hk ) ≤ κ̄, and
Hk
≤ λl for
all k. Then either f (xk ) is unbounded from below or
(3.11) lim ∇f (xk ) = 0
k→∞
and hence any limit point of the sequence of iterates produced by Algorithm optarm is a
stationary point.
In particular, if f (xk ) is bounded from below and xkl → x∗ is any convergent subsequence
of {xk }, then ∇f (x∗ ) = 0.
Proof. By construction, f (xk ) is a decreasing sequence. Therefore, if f (xk ) is bounded
from below, then limk→∞ f (xk ) = f ∗ exists and
(3.12) lim f (xk+1 ) − f (xk ) = 0.
k→∞
≤ −αλ̄λ−1 2
l
∇f (xk )
≤ 0.
Hence, by (3.12)
λl (f (xk ) − f (xk+1 ))
∇f (xk )
2 ≤ →0
αλ̄
as k → ∞. This completes the proof.
The analysis of the Armijo rule is valid for other line search methods [84], [125], [272],
[273]. The key points are that the sufficient decrease condition can be satisfied in finitely many
steps and that the stepsizes are bounded away from zero.
and let λ1 be the minimum of q on the interval [βlow , βhigh ] ⊂ (0, 1). This minimum can be
computed directly since α ∈ (0, 1) and failure of (3.4) imply
−ξ (0)
λt = .
2(ξ(1) − ξ(0) − ξ (0))
So
βlow , λt ≤ βlow ,
(3.13) λ+ = λt , βlow < λt < βhigh ,
βhigh , λt ≥ βhigh .
If our first reduced value of λ does not satisfy (3.4), we base additional reductions on the
data
ξ(0) = f (xc ), ξ (0) = ∇f (xc )T d, ξ(λ− ), ξ(λc ),
where λc < λ− are the most recent values of λ to fail to satisfy (3.4). This is sufficient data to
approximate ξ with a cubic polynomial
With λt in hand, we compute λ+ using (3.13). The application of (3.13) is called safeguarding
and is important for the theory, as one can see from the proof of Theorem 3.2.4. Safeguarding
is also important in practice because, if the cubic model is poor, the unsafeguarded model can
make steplength reductions that are so small that the iteration can stagnate or so large (i.e., near
1) that too many reductions are needed before (3.4) holds.
GLOBAL CONVERGENCE 45
1 T
f (x) = x Ax − bT x + a,
2
where A is spd, b ∈ RN , and a is a scalar. We will look at a very simple example, using
the method of steepest descent with Hk = I (so λl = λs = 1) and show how the speed of
convergence depends on conditioning and scaling.
Lemma 3.2.5. Let f be a convex quadratic and let Hk = I for all k. Then the sequence
{xk } generated by Algorithm optarm converges to the unique minimizer of f .
Proof. In exercise 3.5.4 you are asked to show that f is bounded from below and that
∇f (x) = Ax − b. Hence ∇f (x∗ ) vanishes only at x∗ = A−1 b. Since ∇2 f (x) = A is spd, the
second-order sufficient conditions hold and x∗ is the unique minimizer of f .
Theorem 3.2.4 implies that
and hence xk → x∗ .
Since the steepest descent iteration converges to the unique minimizer of a convex quadratic,
we can investigate the rate of convergence without concern about the initial iterate. We do this
in terms of the A-norm. The problems can be illustrated with the simplest case a = 0 and b = 0.
Proposition 3.2.6. Let f (x) = xT Ax/2 and let {xk } be given by Algorithm optarm with
Hk = I for all k. Then the sequence {xk } satisfies
(3.16)
xk+1
A = (1 − O(κ(A)−2 ))
xk
A .
Proof. The sufficient decrease condition, (3.4), implies that for all k
2β(1 − α)
(3.18) λk ≥ λ̄ = .
λl κ(A)
xk+1
2A −
xk
2A ≤ −2αλ̄λs
xk
2A ,
Az
2 = (Az)T (Az) ≥ λs z T Az = λs
z
2A .
Hence,
xk+1
2A ≤ (1 − 2αλ̄λs )
xk
2A ≤ (1 − 4α(1 − α)βκ(A)−2 )
xk
2A .
where
(3.19) ω < 2(1 − α).
In this case x∗ = 0. We have ∇f (x) = f (x) = ωx and hence for all x ∈ R,
ωx2
f (x − ∇f (x)) − f (x) = ((1 − ω)2 − 1)
2
ω 2 x2
= (ω − 2)
2
ω 2 x2
f (x − λ∇f (x)) − f (x) = (λω − 2) < −αλω 2 x2
2
only if
2(1 − α)
λ< .
ω
So
2(1 − α) 2(1 − α)
β < βm = λ < .
ω ω
If ω is very large, many steplength reductions will be required with each iteration and the line
search will be very inefficient.
These are examples of poor scaling, where a change in f by a multiplicative factor can
dramatically improve the efficiency of the line search or the convergence speed. In fact, if
ω = 1, steepest descent and Newton’s method are the same and only one iteration is required.
The case for a general convex quadratic is similar. Let λl and λs be the largest and smallest
eigenvalues of the spd matrix A. We assume that b = 0 and a = 0 for this example. We let ul
and us be unit eigenvectors corresponding to the eigenvalues λl and λs . If
λs < 2(1 − α)
is small and the initial iterate is in the direction of us , convergence will require a very large
number of iterations (slow). If λl is large and the initial iterate is in the direction of ul , the line
search will be inefficient (many stepsize reductions at each iteration).
GLOBAL CONVERGENCE 47
Newton’s method does not suffer from poor scaling of f and converges rapidly with no need
for a line search when the initial iterate is near the solution. However, when far away from the
solution, the Newton direction may not be a descent direction at all and the line search may
fail. Making the transition from steepest descent, which is a good algorithm when far from
the solution, to Newton’s or some other superlinearly convergent method as the iteration moves
toward the solution, is the central problem in the design of line search algorithms. The scaling
problems discussed above must also be addressed, even when far from the solution.
is
−∇f (x) = −R (x)T R(x).
The steepest descent algorithm could be applied to nonlinear least squares problems with the
good global performance and poor local convergence that we expect.
The Gauss–Newton direction at x
is not defined if R fails to have full column rank. If R does have full column rank, then
(dGS )T ∇f (x) = −(R (x)T R(x))T (R (x)T R (x))−1 R (x)T R(x) < 0,
and the Gauss–Newton direction is a descent direction. The combination of the Armijo rule with
the Gauss–Newton direction is called damped Gauss–Newton iteration.
A problem with the damped Gauss–Newton algorithm is that, in order for Theorem 3.2.4 to
be applicable, the matrices {R (xk )T R (xk )} must not only have full column rank but also must
be uniformly bounded and well conditioned, which are very strong assumptions (but if they are
satisfied, damped Gauss–Newton is a very effective algorithm).
The Levenberg–Marquardt method [172], [183] addresses these issues by adding a regular-
ization parameter ν > 0 to R (xc )T R (xc ) to obtain x+ = xc + s where
where I is the N × N identity matrix. The matrix νc I + R (xc )T R (xc ) is positive definite.
The parameter ν is called the Levenberg–Marquardt parameter.
It is not necessary to compute R (xc )T R (xc ) to compute a Levenberg–Marquardt step [76].
One can also solve the full-rank (M + N ) × N linear least squares problem
2
1 R (xc ) R(xc )
(3.21) min √ s +
2 νc I 0
to compute s (see exercise 3.5.6). Compare this with computing the undamped Gauss–Newton
step by solving (2.33).
If one couples the Levenberg–Marquardt method with the Armijo rule, then Theorem 3.2.4
is applicable far from a minimizer and Theorem 2.4.1 nearby. We ask the reader to provide the
details of the proof in exercise 3.5.7.
is bounded. Then
lim R (xk )T R(xk ) = 0.
k→∞
Moreover, if x∗ is any limit point of {xk } at which R(x∗ ) = 0, Assumption 2.4.1 holds, and
νk → 0, then xk → x∗ q-superlinearly. If, moreover,
νk = O(
R(xk )
)
The update of x
xk+1 = xk + αk pk
can be done with a simple analytic minimization in the quadratic case, but a line search will be
needed in the nonlinear case. The missing pieces, therefore, are the choice of βk , the way the
line search will be done, and convergence theory. Theory is needed, for example, to decide if
pk is a descent direction for all k.
The general form of the algorithm is very simple. The inputs are an initial iterate, which
will be overwritten by the solution, the function to be minimized, and a termination vector
τ = (τr , τa ) of relative and absolute residuals.
Algorithm 3.2.2. nlcg(x, f, τ )
GLOBAL CONVERGENCE 49
1. r0 =
∇f (x)
; k = 0
2. Do while
∇f (x)
> τr r0 + τa
The two most common choices for β, both of which are equivalent to the linear CG formula
in the quadratic case, are the Fletcher–Reeves [106]
∇f (xk )
2
(3.23) βkF R =
∇f (xk−1 )
2
formulas. The Fletcher–Reeves method has been observed to take long sequences of very small
steps and virtually stagnate [112], [207], [208], [226]. The Polak–Ribière formula performs
much better and is more commonly used but has a less satisfactory convergence theory.
The line search has more stringent requirements, at least for proofs of convergence, than
are satisfied by the Armijo method that we advocate for steepest descent. We require that the
steplength parameter satisfies the Wolfe conditions [272], [273]
and
(3.26) ∇f (xk + αk pk )T pk ≥ σβ ∇f (xk )T pk ,
where 0 < σα < σβ < 1. The first of the Wolfe conditions (3.25) is the sufficient decrease
condition, (3.4), that all line search algorithms must satisfy. The second (3.26) is often, but not
always, implied by the Armijo backtracking scheme of alternating a test for sufficient decrease
and reduction of the steplength. One can design a line search method that will, under modest
assumptions, find a steplength satisfying the Wolfe conditions [104], [171], [193].
The convergence result [3] for the Fletcher–Reeves formula requires a bit more. The proof
that pk is descent direction requires the strong Wolfe conditions, which replace (3.26) by
and demand that 0 < σα < σβ < 1/2. The algorithm from [193], for example, will find a point
satisfying the strong Wolfe conditions.
Theorem 3.2.8. Assume that the set
N = {x | f (x) ≤ f (x0 )}
This result has been generalized to allow for any choice of βk such that |βk | ≤ βkF R [112].
A similar result for the Polak–Ribière method, but with more complex conditions on the line
search, has been proved in [134]. This complexity in the line search is probably necessary, as
there are examples where reasonable line searches lead to failure in the Polak–Ribière method,
[222]. One can also prove convergence if βkP R is replaced by max(βkP R , 0) [112].
There is continuing research on these methods and we point to [112], [134], [205], and [202]
as good sources.
We will refer to either the trial step st or the trial solution xt = xc + st as the solution to the
trust region problem.
Having solved the trust region problem, one must decide whether to accept the step and/or to
change the trust region radius. The trust region methods that we will discuss in detail approximate
the solution of the trust region problem with the minimizer of the quadratic model along a
piecewise linear path contained in the trust region. Before discussing these specific methods,
we present a special case of a result from [223] on global convergence.
A prototype trust region algorithm, upon which we base the specific instances that follow, is
Algorithm 3.3.1.
Algorithm 3.3.1. trbasic(x, f )
1. Initialize the trust region radius ∆.
2. Do until termination criteria are satisfied
(a) Approximately solve the trust region problem to obtain xt .
(b) Test both the trial point and the trust region radius and decide whether or not to
accept the step, the trust region radius, or both. At least one of x or ∆ will change
in this phase.
Most trust region algorithms differ only in how step 2a in Algorithm trbasic is done.
There are also different ways to implement step 2b, but these differ only in minor details and the
approach we present next in §3.3.1 is very representative.
GLOBAL CONVERGENCE 51
with the predicted reduction, i.e., the decrease in the quadratic model
pred > 0 for all the trust region algorithms we discuss in this book unless ∇f (xc ) = 0. We
will introduce three control parameters
which are used to determine if the trial step should be rejected (ared/pred < µ0 ) and/or the trust
region radius should be decreased (ared/pred < µlow ), increased (ared/pred > µhigh ), or left
unchanged. Typical values are .25 for µlow and .75 for µhigh . Both µ0 = 10−4 or µ0 = µlow
are used. One can also use the sufficient decrease condition (3.4) to determine if the trial step
should be accepted [84].
We will contract and expand the trust region radius by simply multiplying ∆ by constants
Typical values are ωdown = 1/2 and ωup = 2. There are many other ways to implement a trust
region adjustment algorithm that also give global convergence. For example, the relative error
|pred − ared|/
∇f
can be used [84] rather than the ratio ared/pred. Finally we limit the
number of times the trust region radius can be expanded by requiring
(3.29) ∆ ≤ CT
∇f (xc )
,
for some CT > 1, which may depend on xc . This only serves to eliminate the possibility of
infinite expansion and is used in the proofs. Many of the dogleg methods which we consider
later automatically impose (3.29).
The possibility of expansion is important for efficiency in the case of poor scaling of f .
The convergence theory presented here [162] also uses the expansion phase in the proof of
convergence, but that is not essential. We will present the algorithm to test the trust region in a
manner, somewhat different from much of the literature, that only returns once a new iterate has
been accepted.
Algorithm 3.3.2. trtest(xc , xt , x+ , f, ∆)
1. z = xc
2. Do while z = xc
(a) ared = f (xc ) − f (xt ), st = xt − xc , pred = −∇f (xc )T st − sTt Hc st /2
(b) If ared/pred < µ0 then set z = xc , ∆ = ωdown ∆, and solve the trust region
problem with the new radius to obtain a new trial point. If the trust region radius
was just expanded, set z = xold
t .
(c) If µ0 ≤ ared/pred < µlow , then set z = xt and ∆ = ωdown ∆.
(d) If µlow ≤ ared/pred ≤ µhigh , set z = xt .
The loop in Algorithm trtest serves the same purpose as the loop in a line search algorithm
such as Algorithm steep. One must design the solution to the trust region problem in such a
way that that loop will terminate after finitely many iterations and a general way to do that is the
subject of the next section.
We incorporate Algorithm trtest into a general trust region algorithm paradigm that we
will use for the remainder of this section.
Algorithm 3.3.3. trgen(x, f )
1. Initialize ∆
2. Do forever
(a) Let xc = x. Compute ∇f (xc ) and an approximate Hessian Hc .
(b) Solve the trust region problem to obtain a trial point xt .
(c) Call trtest(xc , xt , x, f, ∆)
The global convergence theorem based on this assumption should be compared with the
similar result on line search methods—Theorem 3.2.4.
Theorem 3.3.1. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {xk }
be generated by Algorithm trgen and let the solutions for the trust region problems satisfy
Assumption 3.3.1. Assume that the matrices {Hk } are bounded. Then either f is unbounded
from below, ∇f (xk ) = 0 for some finite k, or
GLOBAL CONVERGENCE 53
Proof. Assume that ∇f (xk ) = 0 for all k and that f is bounded from below. We will show
that there is MT ∈ (0, 1] such that once an iterate is taken (i.e., the step is accepted and the trust
region radius is no longer a candidate for expansion), then
(3.32)
sk
≥ MT
∇f (xk )
.
Assume (3.32) for the present. Since sk is an acceptable step, Algorithm trtest and part 1 of
Assumption 3.3.1 imply that
aredk ≥ µ0 predk ≥ µ0
∇f (xk )
σ min(
sk
,
∇f (xk )
).
Now since f (xk ) is a decreasing sequence and f is bounded from below, limk→∞ aredk = 0.
Hence (3.33) implies (3.31).
It remains to prove (3.32). To begin note that if
sk
< ∆k then by part 2 of Assumption 3.3.1
sk
≥
∇f (xk )
/M.
(3.34)
sk
= ∆k and
sk
<
∇f (xk )
,
since if (3.34) does not hold then (3.32) holds with MT = min(1, 1/M ).
We will complete the proof by showing that if (3.34) holds and sk is accepted, then
−2
2σ min(1 − µhigh , (1 − µ0 )ωup )
(3.35)
sk
= ∆k ≥
∇f (xk )
.
M +L
This will complete the proof with
−2
2σ min(1 − µhigh , (1 − µ0 )ωup )
MT = min 1, 1/M, .
M +L
Now increase the constant M > 0 in part 1 of Assumption 3.3.1 if needed so that
(3.36)
Hk
≤ M for all k.
We prove (3.35) by showing that if (3.34) holds and (3.35) does not hold for a trial step st ,
then the trust region radius will be expanded and the step corresponding to the larger radius will
be acceptable. Let st be a trial step such that
st
<
∇f (xk )
and
−2
2σ min(1 − µhigh , (1 − µ0 )ωup )
(3.37)
st
= ∆t <
∇f (xk )
.
M +L
We use the Lipschitz continuity of ∇f and (3.36) to obtain
1
aredt = −∇f (xk )T st − (∇f (xk + tst ) − ∇f (xk ))T st dt
0
1
= predt + sTt Hk st /2 − (∇f (xk + tst ) − ∇f (xk ))T st dt
0
≥ predt − (M + L)
st
2 /2.
aredt (M + L)
st
2
≥1−
predt 2predt
(3.38)
(M + L)
st
2
≥1− .
2σ
∇f (xk )
min(
∇f (xk )
,
st
)
Now since
st
<
∇f (xk )
by (3.34) we have
min(
∇f (xk )
,
st
) =
st
and hence
aredk (M + L)
st
s+
t
≤ ωup
st
< ωup
∇f (xk )
.
So
min(
∇f (xk )
,
s+ +
t
) >
st
/ωup .
Hence,
ared+
t (M + L)
s+t
2
≥1−
pred+
t 2σ
∇f (xk )
min(
∇f (xk )
,
s+
t
)
2
(M + L)ωup
s+t
(M + L)ωup
st
≥1− ≥1− ≥ µ0
2
∇f (xk )
σ 2
∇f (xk )
σ
by (3.37). Hence, the expansion will produce an acceptable step. This means that if the final
accepted step satisfies (3.34), it must also satisfy (3.35). This completes the proof.
GLOBAL CONVERGENCE 55
x(λ̂), the minimizer of the quadratic model in the steepest descent direction, subject to the trust
region bounds, is called the Cauchy point. We will denote the Cauchy point by xCP c .
1
CP
Then with x as trial point, one can use Theorem 3.3.1 to derive a global convergence
theorem for the unidirectional trust region.
Theorem 3.3.2. Let ∇f be Lipschitz continuous with Lipschitz constant L. Let {xk } be
generated by Algorithm trgen with xt = xCP and (3.40). Assume that the matrices {Hk } are
bounded. Then either f (xk ) is unbounded from below, ∇f (xk ) = 0 for some finite k, or
lim ∇f (xk ) = 0.
k→∞
∇f (xc )
2 ∇f (xc )
st = − .
∇f (xc )T Hc ∇f (xc )
Hence, if
Hc
≤ M ,
st
≥
∇f (xc )
/M
as asserted.
We leave the proof that xt satisfies part 1 for the reader (exercise 3.5.8).
The assumptions we used are stronger that those in, for example, [104] and [223], where
lim inf
∇f (xk )
= 0
m(s) = g T s + sT As/2.
A vector s is a solution to
(3.41) min m(s)
s≤∆
(A + νI)s = −g
and either ν = 0 or
s
= ∆.
Proof. If
s
< ∆ then ∇m(s) = g + As = 0, and the conclusion follows with ν = 0. To
consider the case where
s
= ∆, let λ1 ≤ λ2 ≤ · · · λN be the eigenvalues of A.
1 In some of the literature, [84], for example, H is assumed to be positive definite and the Cauchy point is taken to
c
be the global minimizer of the quadratic model.
m(s) = g T s + sT As/2
Since
lim s(ν) = 0
ν→∞
and
s(ν)
is a continuous decreasing function of ν ∈ (ν0 , ∞) we see that if
lim
s(ν)
> ∆
ν→ν0
g T s + sT (A + νI)s/2.
Hence, m(s) is minimized by setting s1 equal to the minimum norm solution of (A+ν0 )x = −g
(which exists by orthogonality of g to S0 ) and letting s2 be any element of S0 such that
s2
2 = ∆2 −
s1
2 .
GLOBAL CONVERGENCE 57
We present the algorithm from [190] to illustrate this point. For an inexact formulation, see
[276].
The Levenberg–Marquardt quadratic model of least squares objective
M
1 1
f (x) =
ri (x)
22 = R(x)T R(x)
2 i=1 2
pred = m(xc ) − m(xt ) = −sT R (xc )T R(xc ) − 21 sT (R (xc )T R (xc ) + νc I)s
The algorithm we present below follows the trust region paradigm and decides on accepting
the trial point and on adjustments in the Levenberg–Marquardt parameter by examinaing the
ratio
ared f (xc ) − f (xt )
=
pred m(xc ) − m(xt )
f (xc ) − f (xt )
= −2 .
sT ∇f (xc )
In addition to the trust region parameters 0 < ωdown < 1 < ωup and µ0 ≤ µlow < µhigh
we require a default value ν0 of the Levenberg–Marquardt parameter.
The algorithm for testing the trial point differs from Algorithm trtest in that we decrease
(increase) ν rather that increasing (decreasing) a trust region radius if ared/pred is large (small).
We also attempt to set the Levenberg–Marquardt parameter to zero when possible in order to
recover the Gauss–Newton iteration’s fast convergence for small residual problems.
Algorithm 3.3.4. trtestlm(xc , xt , x+ , f, ν)
1. z = xc
2. Do while z = xc
3. x+ = z.
Moreover, if x is a limit point of {xk } for which R(x∗ ) = 0 and R (x∗ ) has full rank, then
∗
xN −1
c = xc − Hc ∇f (xc ).
If Hc is spd, the Newton point is the global minimizer of the local quadratic model. On the other
hand, if Hc has directions of negative curvature the local quadratic model will not have a finite
minimizer, but the Newton point is still useful. Note that if H = I the Newton point and the
Cauchy point are the same if the Newton point is inside the trust region.
We will restrict our attention to a special class of algorithms that approximate the solution
of the trust region problem by minimizing mc along a piecewise linear path S ⊂ T (∆). These
paths are sometimes called doglegs because of the shapes of the early examples [84], [80], [218],
[217], [220]. In the case where ∇2 f (x) is spd, one may think of the dogleg path as a piecewise
linear approximation to the path with parametric representation
This is the path on which the exact solution of the trust region problem lies.
The next step up from the unidirectional path, the classical dogleg path [220], has as many
as three nodes, xc , xCP
c
∗
, and xN CP ∗
c . Here xc is the global minimizer of the quadratic model in
the steepest descent direction, which will exist if and only if ∇f (xc )T Hc ∇f (xc ) > 0. If xCP
c
∗
exists and
(3.44) (xNc − xc
CP ∗ T
) (xCPc
∗
− xc ) > 0,
GLOBAL CONVERGENCE 59
we will let xN
c be the terminal node. If (3.44) holds, as it always will if Hc is spd, then the path
can be parameterized by the distance from xc and, moreover, mc decreases along the path. If
(3.44) does not hold, we do not use xN c as a node and revert to the unidirectional path in the
steepest descent direction.
Note that (3.44) implies
(3.45) ∇f (xc )T (xN c − xc ) < 0.
We can express the conditions for using the three node path rather than the unidirectional path
very simply. If xCP
c is on the boundary of the trust region then we accept xCP c as the trial point.
CP CP ∗
If xc = xc is in the interior of the trust region, then we test (3.44) to decide what to do.
With this in mind our trial point for the classical dogleg algorithm will be
CP
x if
xc − xCP
c
=∆
CP ∗
or x exists and (3.44) fails,
(3.46) xD (∆) = xN if
xc − xCP N
c
<
xc − xc
≤ ∆
and (3.44) holds,
D
y (∆) otherwise.
• No two points on the path have the same distance from xc ; hence the path may be param-
eterized as x(s), where s =
x(s) − xc
.
This enables us to show that the dogleg approximate solution of the trust region problem sat-
isfies Assumption 3.3.1 and apply Theorem 3.3.1 to conclude global convergence. Superlinear
convergence will follow if Hk is a sufficiently good approximation to ∇2 f (xk ).
Lemma 3.3.5. Let xc , Hc , and ∆ be given. Let Hc be nonsingular,
∇f (xc )
2
sCP ∗ = xCP ∗ − xc = − ∇f (xc ).
∇f (xc )T Hc ∇f (xc )
for any δ ≤
sN
there is a unique point x(δ) on S such that
x(δ) − xc
= δ.
Proof. Clearly the statement of the result holds on the segment of the path from x to xCP ∗ .
To prove the result on the segment from xCP ∗ to xN we must show that
1
φ(λ) =
(1 − λ)sCP ∗ + λsN
2
2
sCP ∗
≥ (sN )T sCP ∗ >
sCP ∗
2
and therefore that
sN
>
sCP ∗
, we have
φ (λ) = (sN − sCP ∗ )T ((1 − λ)sCP ∗ + λsN )
= −(1 − λ)
sCP ∗
2 + (1 − λ)(sN )T sCP ∗ + λ
sN
2 − λ(sN )T sCP ∗
> λ(
sN
2 − (sN )T sCP ∗ ) ≥ λ(
sN
−
sCP ∗
)
sN
> 0.
Hence, φ is an increasing function and the proof is complete.
The next stage is to show that the local quadratic model decreases on the dogleg path S.
Lemma 3.3.6. Let the assumptions of Lemma 3.3.5 hold. Then the local quadratic model
1
mc (x) = f (xc ) + ∇f (xc )T (x − xc ) + (x − xc )T Hc (x − xc )
2
is strictly monotone decreasing on S.
Proof. Since xCPc
∗
is the minimum of the local quadratic model in the steepest descent
direction, we need only show that mc is strictly decreasing on the segment of the path between
xCP
c
∗
and xN . Set
ψ(λ) = mc (xc + (1 − λ)sCP ∗ + λsN )
ψ (λ) = (1 − λ)(λ̂
∇f (xc )
2 − ∇f (xc )T Hc−1 ∇f (xc ))
1−λ
= (xc − xCP
c
∗ T
) (xN
c − xc ) < 0,
λ̂
GLOBAL CONVERGENCE 61
Proof. We need to check that the solutions of the trust region problem satisfy Assump-
tion 3.3.1. Part 2 of the assumption follows from the definition, (3.46), of xD and the bounded-
ness of the approximate Hessians. Let
Hk
≤ M
for all k. If
sk
< ∆, then (3.46) implies that (3.44) must hold and so xt = xN
k is the Newton
point. Hence,
−1
sk
=
xk − xN k
=
Hk ∇f (xk )
≥
∇f (xk )
/M,
∇f (xc )T Hc ∇f (xc )
= ∆c
∇f (xc )
− ∆2c
2
∇f (xc )
2
≥ ∆c
∇f (xc )
=
s
∇f (xc )
.
Hence (3.30) holds with σ = 1.
Now assume that ∇f (xc )T Hc ∇f (xc ) > 0 and
sCP
= ∆c . In this case
∇f (xc )
2 ∆c
T
≥
∇f (xc ) Hc ∇f (xc )
∇f (xc )
and so
λ̂2
pred = λ̂
∇f (xc )
2 − T
2 ∇f (xc ) Hc ∇f (xc )
∇f (xc )T Hc ∇f (xc )
= ∆c
∇f (xc )
− ∆2c
2
∇f (xc )
2
≥ ∆c
∇f (xc )
/2,
which is (3.30) with σ = 1/2.
If (3.44) fails, ∇f (xc )T Hc ∇f (xc ) > 0, and
sCP
< ∆c , then
∇f (xc )
2
λ̂ = ,
∇f (xc )T Hc ∇f (xc )
and
λ̂2
pred = λ̂
∇f (xc )
2 − T
2 ∇f (xc ) Hc ∇f (xc )
∇f (xc )
4 λ̂
∇f (xc )
2
= =
2∇f (xc )T Hc ∇f (xc ) 2
∇f (xc )
= ,
2
which is (3.30) with σ = 1/2.
The final case is if (3.44) holds and xD = xCP . In that case the predicted reduction is more
than Cauchy decrease, i.e., the decrease obtained by taking the Cauchy point, and hence
∇f (xc )
4
pred ≥
2∇f (xc )T Hc ∇f (xc )
∇f (xc )
2
≥ ,
2M
which is (3.30) with σ = 1/(2M ). This completes the proof.
The last part of the proof of this theorem is very important, asserting that any solution of
the trust region problem for which pred is at least a fixed fraction of Cauchy decrease will give
global convergence. We refer the reader to [232] and [104] for a more general and detailed
treatment using this point of view.
Corollary 3.3.8. Any algorithm for solving the trust region problem that satisfies for some
τ >0
pred ≥ τ (mc (xc ) − mc (xCP c ))
ek
< ρ,
Hk
≤ 2
∇2 f (x∗ )
,
Hk−1
≤ 2
(∇2 f (x∗ ))−1
,
and xk is near enough for the assumptions of Theorem 2.3.2 to hold. If Hk is spd, so is Hk−1
and for such k, (3.44) holds. Hence, the dogleg path has the nodes xk , xCP N
k , and xk . Moreover,
if ρ is sufficiently small, then
Hk−1 ∇f (xk )
≤ 2
ek
≤ 2ρ.
We complete the proof by showing that if ρ is sufficiently small, the trust region radius will be
expanded if necessary until the Newton step is in the trust region. Once we do this, the proof is
complete as then the local quadratic convergence of Newton’s method will take over.
GLOBAL CONVERGENCE 63
Now
predk ≥
sk
∇f (xk )
/2
by the proof of Theorem 3.3.7. Using Hk = ∇2 f (xk ) we have
1
T
aredk = −∇f (xk ) stk − (∇f (xk + tstk ) − ∇f (xk ))T stk dt
0
1
= predk + sTtk ∇2 f (xk )stk /2 − (∇f (xk + tstk ) − ∇f (xk ))T stk dt
0
= predk + O(
sk
∇f (xk )
ρ)
and therefore ared/pred = 1−O(ρ). Hence, for ρ sufficiently small, the trust region radius will
be increased, if necessary, until the Newton point is inside the trust region and then a Newton
step will be taken. This completes the proof.
The classical dogleg algorithm is implemented in Algorithm ntrust, which uses the trust
radius adjustment scheme from Algorithm trtest. It is to be understood that trtest is
implemented so that xt is given by (3.46) and hence trtest only samples points on the
piecewise linear search path determined by the Cauchy point, the Newton point, and (3.44).
Algorithm 3.3.6. ntrust(x, f, τ )
1. Compute f (x) and ∇f (x)
2. τ = τa + τr
∇f (x)
3. Do while
∇f (x)
> τ
(a) Compute and factor ∇2 f (x)
(b) Compute the Cauchy and Newton points and test (3.44)
(c) Call trtest(x, xt , x+ , f, ∆)
(d) Compute f (x+ ) and ∇f (x+ ); x = x+
min φ(d),
dC ≤∆
This is a dogleg method in that the approximate solution of the trust region problem lies on a
piecewise linear path with the CG iterations as nodes. As long as CG is performing properly
(i.e., pT w > 0) nodes are added to the path until the path intersects the trust region boundary. If
a direction of indefiniteness is found (pT w ≤ 0), then that direction is followed to the boundary.
In this way a negative curvature direction, if found in the course of the CG iteration, can be
exploited.
The inputs to Algorithm trcg are the current point x, the objective f , the forcing term η,
and the current trust region radius ∆. The output is the approximate solution of the trust region
problem d. This algorithm is not the whole story, as once the trust region problem is solved
approximately, one must use f (xc + d) to compute ared and then make a decision on how the
trust region radius should be changed. Our formulation differs from that in [247] in that the
termination criterion measures relative residuals in the l2 -norm rather than in the C-norm. This
change in the norm has no effect on the analysis in [247], and, therefore, we can apply the results
in §2.5 directly to draw conclusions about local convergence.
Algorithm 3.3.7. trcg(d, x, f, M, η, ∆, kmax)
1. r = −∇f (x), ρ0 =
r
22 , k = 1, d = 0
√
2. Do While ρk−1 > η
∇f (x)
2 and k < kmax
(a) z = M r
(b) τk−1 = z T r
(c) if k = 1 then β = 0 and p = z
else
β = τk−1 /τk−2 , p = z + βp
(d) w = ∇2 f (x)p
If pT w ≤ 0 then
Find τ such that
d + τ p
C = ∆
d = d + τ p; return
(e) α = τk−1 /pT w
(f) r = r − αw
(g) ρk = rT r
(h) dˆ = d + αp
ˆ C > ∆ then
(i) If
d
Algorithm trcg does what we would expect a dogleg algorithm to do in that the piecewise
linear path determined by the iteration moves monotonically away from x (in the
·
C -norm!)
and the quadratic model decreases on that path [247]. Algorithm trcg will, therefore, compute
the same Newton step as Algorithm fdpcg. One might think that it may be difficult to compute
the C-norm if one has, for example, a way to compute the action of M on a vector that does
not require computation of the matrix C. However, at the cost of storing two additional vectors
we can update Cp and Cd as the iteration progresses. So, when p is updated to z + βp then
Cp = r + βCp can be updated at the same time without computing the product of C with p.
Then
p
C = pT Cp. Similarly d = d + τ p implies that Cd = Cd + τ Cp.
Algorithm cgtrust combines the solution of the trust region problem from trcg, the
trust region radius adjustment scheme from trtest, and (indirectly) the locally convergent
algorithm newtcg. The result fits nicely into our paradigm algorithm trgen.
GLOBAL CONVERGENCE 65
1. Initialize ∆, M , η, kmax.
2. Do forever
Moreover, if x∗ is a local minimizer for which the standard assumptions hold and xn → x∗ ,
then
• if ηn → 0 the convergence is q-superlinear, and
• if ηn ≤ Kη
∇f (xn )
p for some Kη > 0 the convergence is q-superlinear with q-order
1 + p.
Finally, there are δ and ∆ such that if
x0 − x∗
≤ δ and ∆0 ≤ ∆ then xn → x∗ .
One can, as we do in the MATLAB code cgtrust, replace the Hessian–vector product
with a difference Hessian. The accuracy of the difference Hessian and the loss of symmetry
present the potential problem that was mentioned in §2.5. Another, very different, approach is
to approximate the exact solution of the trust region subproblem with an iterative method [243].
3.4 Examples
The results we report here used the MATLAB implementations of steepest descent, steep.m,
damped Gauss–Newton, gaussn.m, the dogleg trust region algorithm for Newton’s method,
ntrust.m, and the PCG–dogleg algorithms, cgtrust.m, from the software collection.
Our MATLAB implementation of Algorithm steep guards against extremely poor scaling
and very long steps by setting λ to
at the beginning of the line search. We invite the reader in Exercise 3.5.3 to attempt the control
example with λ0 = 1.
We not only present plots, which are an efficient way to understand convergence rates, but we
also report counts of function, gradient, and Hessian evaluations and the results of the MATLAB
flops command.
Dogleg Dogleg
5 5
10 10
0
10
Function Value
Gradient Norm
0
10
5
10
5
10
10
10
10 15
10 10
0 10 20 30 0 10 20 30
Iterations Iterations
2
10
Function Value
Gradient Norm
0
10
0
10
5
10
2
10
4 10
10 10
0 20 40 60 0 20 40 60
Iterations Iterations
0 0
10 10
Function Value
Gradient Norm
2 5
10 10
4 10
10 10
6 15
10 10
0 2 4 6 0 2 4 6
Iterations Iterations
Levenberg–Marquardt Levenberg–Marquardt
2 5
10 10
0 0
10 10
Function Value
Gradient Norm
2 5
10 10
4 10
10 10
6 15
10 10
0 5 10 15 0 5 10 15
Iterations Iterations
GLOBAL CONVERGENCE 67
Dogleg–CG Dogleg–CG
5 7
10 10
6
10
Function Value
Gradient Norm
0
10
5
10
5
10
4
10
10 3
10 10
0 2 4 6 0 2 4 6
Iterations Iterations
6
10
Function Value
Gradient Norm
0
10
5
10
5
10
4
10
10 3
10 10
0 20 40 60 0 20 40 60
Iterations Iterations
Figure 3.3: Steepest Descent and Dogleg–CG for Discrete Control Problem
This problem can be solved very efficiently with Algorithm cgtrust. In our implementa-
tion we use the same parameters from (3.51). In Figure 3.3 we compare the dogleg–CG iteration
with steepest descent. We terminated both iterations when
∇f
< 10−8 . For the dogleg–CG
code we used η = .01 throughout the entire iteration and an initial trust region radius of
u0
.
The steepest descent computation required 48 gradient evaluations, 95 function evaluations, and
roughly 1 million floating point operations, and dogleg–CG needed 17 gradient evaluations, 21
function evaluations, and roughly 530 thousand floating point operations. Note that the steepest
descent algorithm performed very well in the terminal phase of the iteration. The reason for this
is that, in this example, the Hessian is near the identity.
f (x) =
F (x)
2 /2.
What is ∇f ? When is the Newton step for the nonlinear equation F (x) = 0,
d = −F (x)−1 F (x),
3.5.3. Implement Algorithm steep without the scaling fixup in (3.50). Apply this crippled
algorithm to the control problem example from §3.4.2. What happens and why?
3.5.6. Show that the Levenberg–Marquardt steps computed by (3.20) and (3.21) are the same.
3.5.10. Look at the trust region algorithm for nonlinear equations from [218] or [84]. What are the
costs of that algorithm that are not present in a line search? When might this trust region
approach have advantages for solving nonlinear equations? Could it be implemented
inexactly?
GLOBAL CONVERGENCE 69
3.5.11. The double dogleg method [80], [84] puts a new node on the dogleg path in the Newton
direction, thereby trying more aggressively for superlinear convergence. Implement this
method, perhaps by modifying the MATLAB code ntrust.m, and compare the results
with the examples in §3.4. Prove convergence results like Theorems 3.3.7 and 3.3.9 for
this method.
3.5.12. In [51] a trust region algorithm was proposed that permitted inaccurate gradient computa-
tions, with the relative accuracy being tightened as the iteration progresses. Look at [51]
and try to design a similar algorithm based on the line search paradigm. What problems
do you encounter? How do you solve them?
3.5.13. Suppose one modifies Algorithm trtest by not resolving the trust region problem if
the trial point is rejected, but instead performing a line search from xt , and setting ∆ =
x+ − xc
, where x+ is the accepted point from the line search. Discuss the merits of
this modification and any potential problems. See [209] for the development of this idea.
3.5.14. Write programs for optimization that take full Gauss–Newton or Newton steps (you can
cripple the MATLAB codes gaussn.m and ntrust.m for this). Apply these codes to
the parameter identification problem from §3.4.1. What happens?
3.5.15. Write a nonlinear CG code and apply it to the problems in §3.4. Try at least two ways to
manage the line search. How important are the (strong) Wolfe conditions?
3.5.16. Discuss the impact of using a difference Hessian in Algorithm trcg. How will the global
convergence of Algorithm cgtrust be affected? How about the local convergence?
Consider the accuracy in the evaluation of ∇f in your results.
3.5.17. Without looking at [247] describe in general terms how the proof of Theorem 3.3.1 should
be modified to prove Theorem 3.3.10. Then examine the proof in [247] to see if you left
anything out.
Chapter 4
71
Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.
4.1 Analysis
This section begins with some simple observations on nonsingularity and positivity of the update.
It is very useful for both theory and practice to express (4.4) in terms of the inverse matrices.
The formula we use in this book is Lemma 4.1.1.
−1
Lemma 4.1.1. Let Hc be spd, y T s = 0, and H+ given by (4.4). Then H+ is nonsingular
and
−1 sy T ysT ssT
(4.5) H+ = I− T Hc−1 I − T + T .
y s y s y s
(z T y)2 (z T Hc s)2
z T H+ z = T
+ z T Hc z − T .
y s s Hc s
(z T y)2
z T H+ z > ≥ 0,
yT s
as asserted.
If y T s ≤ 0 the update is considered a failure.
x0 − x∗
≤ δ and
H0 − ∇2 f (x∗ )
≤ δ,
Technical Details
The proof of Theorem 4.1.3 is technical and we subdivide it into several lemmas. Our proof is
a hybrid of ideas from [37], [135], and [154]. Similar to other treatments of this topic [45] we
begin with the observation (see §2.5.2) that one may assume ∇2 f (x∗ ) = I for the convergence
analysis.
Lemma 4.1.4. Let the standard assumptions hold and let
fˆ(y) = f (Ay),
BFGS METHOD 73
where A = (∇2 f (x∗ ))−1/2 . Let xc and Hc be given and let x̂c = A−1 xc and Ĥc = AHc A.
Then the BFGS updates (x+ , H+ ) for f and (x̂+ , Ĥ+ ) for fˆ are related by
In particular, the BFGS sequence for f exists (i.e., Hn is spd for all n) if and only if the BFGS
sequence for fˆ does and the convergence of {xn } is q-superlinear if and only if the convergence
of {x̂n } is.
Proof. The proof is a simple calculation and is left for exercise 4.5.3.
Hence we can, with no loss of generality, assume that ∇2 f (x∗ ) = I, for if this is not so, we
can replace f by fˆ and obtain an equivalent problem for which it is.
Keeping in mind our assumption that ∇2 f (x∗ ) = I, we denote errors in the inverse Hessian
by
E = H −1 − ∇2 f (x∗ )−1 = H −1 − I.
These errors satisfy a simple recursion [37].
Lemma 4.1.5. Let the standard assumptions hold. Let Hc be spd and
x+ = xc − Hc−1 ∇f (xc ).
0 <
xc − x∗
≤ δ0 and
Ec
≤ δ0 ,
where w = s/
s
and for some K∆ > 0
(4.7)
∆
≤ K∆
s
.
Proof. Let δ0 be small enough so that ∇f (xc ) = 0 if xc = x∗ . Theorem 1.2.1 implies that
1
∇f (xc ) = ∇2 f (x∗ + tec )ec dt = ec + ∆1 ec ,
0
Clearly
∆1
≤ γ
ec
/2,
and
s = −Hc−1 ∇f (xc ) = −(I + Ec )(I + ∆1 )ec .
Therefore,
ec
(1 − δ0 )(1 − γδ0 /2) ≤
s
≤
ec
(1 + δ0 )(1 + γδ0 /2)
and hence
(4.8) 0 <
ec
/2 ≤
s
≤ 2
ec
if, say,
(4.9) δ0 ≤ min(1/4, 1/(2γ)).
We will assume that (4.9) holds for the rest of this section.
The standard assumptions, our assumption that ∇2 f (x∗ ) = I, and the fundamental theorem
of calculus imply that
1
y = ∇f (x+ ) − ∇f (xc ) = ∇2 f (xc + ts)s dt
0
(4.10)
1
=s+ (∇2 f (xc + ts) − I)s dt = s + ∆2 s,
0
(4.13)
∆3
≤ C
s
.
Subtracting (∇2 f (x∗ ))−1 = I from (4.5) and using (4.12) gives us
where
∆ = ∆3 Hc−1 (I − wwT + ∆T3 ) + (I − wwT )Hc−1 ∆T3 .
Therefore, if (4.9) holds then 1 + δ0 ≤ 3/2 and
∆
≤ (1 + δ0 )
∆3
(2 +
∆3
) ≤
s
3C(2 + C
s
)/2
≤ 3C
s
(2 + 2Cδ0 )/2.
Reduce δ0 if necessary so that 2Cδ0 ≤ 1 and the proof is complete with K∆ = 9C/2.
Lemma 4.1.5 implies that the approximate Hessians do not drift too far from the exact Hessian
if the initial data are good. This property, called bounded deterioration in [37], will directly
imply local q-linear convergence.
Corollary 4.1.6. Let the assumptions of Lemma 4.1.5 hold and let δ0 be as in the statement
of Lemma 4.1.5. Then
(4.14)
E+
≤
Ec
+ K∆
s
≤
Ec
+ K∆ (
ec
+
e+
).
BFGS METHOD 75
Proof. The leftmost inequality follows from Lemma 4.1.5 and the fact that I − wwT is an
orthogonal projection. The final inequality follows from the triangle inequality.
We are now ready to prove local q-linear convergence. This is of interest in its own right and
is a critical step in the superlinear convergence proof. Note that, unlike the statement and proof
of Theorem 2.3.4, we do not express the estimates in terms of
H − ∇2 f (x∗ )
=
H − I
but
in terms of E = H −1 − I. The two approaches are equivalent, since if
E
≤ δ. < 1/2, then
H −1
< 3/2 and the Banach lemma implies that
H
≤ 2. Hence
Hn − I
/2 ≤
Hn
−1
Hn − I
≤
Hn−1 − I
=
Hn−1 (Hn − I)
≤
Hn−1
Hn − I
≤ 3
Hn − I
/2.
Theorem 4.1.7. Let the standard assumptions hold and let σ ∈ (0, 1). Then there is δ. such
that if
(4.15)
x0 − x∗
≤ δ. and
H0−1 − ∇2 f (x∗ )−1
≤ δ. ,
then the BFGS iterates are defined and converge q-linearly to x∗ with q-factor at most σ.
Proof. For δ̂ sufficiently small and
(4.16)
xc − x∗
≤ δ̂ and
Ec
=
Hc−1 − I
≤ δ̂,
the standard assumptions imply that there is K̄ such that
(4.17)
e+
≤ K̄(
Ec
ec
+
ec
2 )/2 ≤ K̄ δ̂
ec
.
≤
En
+ K∆ (1 + σ)σ n
e0
≤
En
+ K∆ (1 + σ)σ n δ.
n
≤
E0
+ δ. K∆ (1 + σ) σn
j=0
K∆ (1 + σ)
= δ. 1 + .
1−σ
We complete the induction and the proof by invoking (4.18) to conclude that
En+1
≤ δ ∗ .
(4.19) lim = 0,
n→∞
sn
where {sn } is the sequence of steps and {En } is the sequence of errors in the inverse Hessian.
We will only state and prove the special case of the necessary condition that we need and refer
the reader to [82], [81], [84], or [154] for more general proofs.
Theorem 4.1.8. Let the standard assumptions hold; let {Hn } be a sequence of nonsingular
N × N matrices satisfying
(4.20)
Hn
≤ M
for some M > 0. Let x0 ∈ RN be given and let {xn }∞
n=1 be given by
(4.21) lim = 0,
n→∞
sn
(1 − σ)
en
≤
sn
≤ (1 + σ)
en
.
(4.22) lim = 0.
n→∞
en
= Hn−1 en+1 + O(
en
2 +
sn
2 ) = Hn−1 en+1 + O(
en
2 ).
Therefore, (4.22) implies that
En yn
Hn−1 en+1
en+1
= + O(
en
) ≥ M −1 + O(
en
) → 0
en
en
en
BFGS METHOD 77
It is easy to show that (see exercise 4.5.5) for any unit vector v ∈ RN ,
(4.25)
A(I − vv T )
2F ≤
A
2F −
Av
2 and
(I − vv T )A
2F ≤
A
2F .
(4.26)
En+1
2F ≤
En
2F −
En wn
2 + O(
sn
) = (1 − θn2 )
En
2F + O(
sn
),
where wn = sn /
sn
and
En wn
if En = 0,
En
F
θn =
1 if En = 0.
=
E0
2F −
Ek+1
2F + O(1) < ∞.
Hence θn
En
F → 0.
However,
En wn
if En = 0
θn
En
F =
0 if En = 0
En sn
=
En wn
= .
sn
D = {x | f (x) ≤ f (x0 )}
4.2 Implementation
The two implementation issues that we must confront are storage of the data needed to maintain
the updates and a strategy for dealing with the possibility that y T s ≤ 0. We address the
storage question in §4.2.1. For the second issue, when y T s is not sufficiently positive, we restart
the BFGS update with the identity. We present the details of this in §4.2.2. Our globalization
approach, also given in §4.2.2, is the simplest possible, the Armijo rule as described in Chapter 3.
We choose to discuss the Armijo rule in the interest of simplicity of exposition. However,
while the Armijo rule is robust and sufficient for most problems, more complex line search
schemes have been reported to be more efficient [42], and one who seeks to write a general
purpose optimization code should give careful thought to the best way to globalize a quasi-
Newton method. In the case of BFGS, for example, one is always seeking to use a positive definite
quadratic model, even in regions of negative curvature, and in such regions the approximate
Hessian could be reinitialized to the identity more often than necessary.
4.2.1 Storage
For the present we assume that y T s > 0. We will develop a storage-efficient way to compute
the BFGS step using the history of the iteration rather than full matrix storage.
The implementation recommended here is one of many that stores the history of the iteration
and uses that information recursively to compute the action of Hk−1 on a vector. This idea was
suggested in [16], [186], [206], and other implementations may be found in [44] and [201].
All of these implementations store the iteration history in the pairs {sk , yk } and we present a
concrete example in Algorithm bfgsrec. A better, but somewhat less direct, way is based on
the ideas in [91] and [275] and requires that only a single vector be stored for each iteration. We
assume that we can compute the action of H0−1 on a vector efficiently, say, by factoring H0 at the
outset of the iteration or by setting H0 = I. We will use the BFGS formula from Lemma 4.1.1.
One way to maintain the update is to store the history of the iteration in the sequences of
vectors {yk } and {sk } where
If one has done this for k = 0, . . . , n − 1, one can compute the new search direction
dn = −Hn−1 ∇f (xn )
BFGS METHOD 79
Algorithm bfgsrec overwrites a given vector d with Hn−1 d. The storage needed is one
vector for d and 2n vectors for the sequences {sk , yk }n−1
k=0 . A method for computing the product
of H0−1 and a vector must also be provided.
Algorithm 4.2.1. bfgsrec(n, {sk }, {yk }, H0−1 , d)
1. If n = 0, d = H0−1 d; return
2. α = sTn−1 d/yn−1
T
s; d = d − αyn−1
Algorithm bfgsrec has the great advantage, at least in a language that efficiently supports
recursion, of being very simple. More complex, but nonrecursive versions, have been described
in [16], [201], and [44].
The storage cost of two vectors per iteration can be significant, and when available storage is
exhausted one can simply discard the iteration history and restart with H0 . This approach, which
we implement in the remaining algorithms in this section, takes advantage of the fact that if H0
is spd then −H0−1 ∇f (x) will be a descent direction, and hence useful for a line search. Another
approach, called the limited memory BFGS [44], [207], [176], [201], keeps all but the oldest
(s, y) pair and continues with the update. Neither of these approaches for control of storage,
while essential in practice for large problems, has the superlinear convergence properties that
the full-storage algorithm does.
At a cost of a modest amount of complexity in the formulation, we can reduce the storage
cost to one vector for each iteration. The method for doing this in [275] begins with an expansion
of (4.5) as
−1
H+ = Hc−1 + α0 sc sTc + β0 ((Hc−1 yc )sTc + sc (Hc−1 yc )T ),
where
ycT sc + ycT Hc−1 yc −1
α0 = and β0 = T .
(ycT sc )2 yc sc
Now note that
and obtain
−1
(4.27) H+ = Hc−1 + α1 sc sTc + β0 (sc (Hc−1 ∇f (x+ ))T + (Hc−1 ∇f (x+ ))sTc ),
where
α1 = α0 + 2β0 /λc .
Also
−1
d+ = −H+ ∇f (x+ )
sc y T yc sT sc sTc ∇f (x+ )
(4.28) =− I− T c Hc−1 I− Tc ∇f (x+ ) −
yc sc yc sc ycT sc
= Ac sc + Bc Hc−1 ∇f (x+ ),
where
ycT yc sTc sT ∇f (x+ )
(4.29) Ac = T
H −1
c I − T
∇f (x+ ) + c T
yc sc yc sc λc yc sc
and
sTc ∇f (x+ )
(4.30) Bc = −1 + .
ycT sc
At this point we can compute d+ , and therefore λ+ and s+ using only Hc−1 ∇f (xc ). We do not
need H+ at all. We can now form H+ with the new data for the next iterate and will show that
we do not need to store the vectors {yk }.
Since (verify this!) Bc = 0, we have
s+ Ac sc
Hc−1 ∇f (x+ ) = − + .
Bc λ+ Bc
where
β0
(4.32) αc = α1 + 2β0 Ac /Bc and βc = − .
Bc λ+
This leads to the expansion
n
−1
(4.33) Hn+1 = H0−1 + αk sk sTk + βk (sk sTk+1 + sk+1 sTk ).
k=0
Upon reflection the reader will see that this is a complete algorithm. We can use (4.28) and Hn
to compute dn+1 . Then we can compute λn+1 and sn+1 and use them and (4.32) to compute
−1
αn and βn . This new data can be used to form Hn+1 with (4.33), which we can use to compute
dn+2 and continue the iteration.
In this way only the steps {sk } and the expansion coefficients {αk } and {βk } need be stored.
Algorithm bfgsopt is an implementation of these ideas.
Algorithm 4.2.2. bfgsopt(x, f, )
1. g = −∇f (x), n = 0.
2. While
g
>
(a) If n = 0, dn = −H0−1 g
otherwise compute A, B, and dn using (4.28), (4.29), and (4.30).
(b) Compute λn , sn , and x = xn+1 with the Armijo rule.
(c) If n > 0 compute αn−1 and βn−1 using (4.32).
(d) g = −∇f (x), n = n + 1.
BFGS METHOD 81
and BF GS
H+ if y T s > 0,
(4.35) H+ =
Hc if y T s ≤ 0.
In the MBFGS1 method, (4.34), the model Hessian is reinitialized to I if y T s ≤ 0. In the
early phase of this iteration, where ∇2 f may have negative eigenvalues, y T s ≤ 0 is certainly
possible and the search direction could be the steepest descent direction for several iterations.
An MBFGS2 step (4.35) keeps the history of the iteration even if y T s ≤ 0. One view
is that this approach keeps as much information as possible. Another is that once y T s ≤ 0,
the iteration history is suspect and should be thrown away. Both forms are used in practice.
Our MATLAB code bfgswopt uses MFBGS1 and maintains an approximation to H −1 using
Algorithm bfgsopt. We also guard against poor scaling by using (3.50).
(y − Hc s)(y − Hc s)T
(4.38) H+ = Hc + .
(y − Hc s)T s
By preserving the symmetry of the approximate Hessians, but not the positive definiteness,
these updates present a problem for a line search globalization but an opportunity for a trust
region approach. The SR1 update has been reported to outperform BFGS algorithms in certain
cases [165], [41], [64], [65], [163], [258], [118], [250], [119], [268], [164], in which either the
approximate Hessians can be expected to be positive definite or a trust region framework is used
[41], [64], [65].
One may update the inverse of the SR1 approximate Hessian using the Sherman–Morrison
formula, (4.39), a simple relation between the inverse of a nonsingular matrix and that of a
rank-one update of that matrix [93], [239], [240], [14].
Proposition 4.3.1. Let H be a nonsingular N ×N matrix and let u, v ∈ RN . Then H +uv T
is invertible if and only if 1 + v T H −1 u = 0. In this case
(H −1 u)v T
(4.39) (H + uv T )−1 = I − H −1 .
1 + v T H −1 u
small. This update does not enforce or require positivity of the approximate Hessian and has
been used effectively to exploit negative curvature in a trust region context [165], [41].
For overdetermined nonlinear least squares problems one can try to approximate the second-
order term in ∇2 f while computing RT R exactly. Suppose
∇2 f (x) ≈ H = C(x) + A,
where the idea is that C, the computed part, is significantly easier to compute than A, the
approximated part. This is certainly the case for nonlinear least squares, where C = RT R . A
quasi-Newton method that intends to exploit this structure will update A only; hence
H+ = C(x+ ) + A+ .
The definition of y # given in (4.40) is called the default choice in [87]. This is not the only
choice for y # , and one can prove superlinear convergence for this and many other choices [87],
[84]. This idea, using several different updates, has been used in other contexts, such as optimal
control [159], [164].
An algorithm of this type, using SR1 to update A and a different choice for y # , was suggested
in [20] and [21]. The nonlinear least squares update from [77], [78], and [84] uses a DFP update
and yet another y # to compute A+ ,
T T
(y # − Ac s)y # + y # (y # − Ac s)T [(y # − Ac s)T y # ]y # y #
(4.41) A+ = Ac + T
− T
.
y# s (y # s)2
The application of this idea to large-residual least squares problems is not trivial, and scaling
issues must be considered in a successful implementation.
Our proof of superlinear convergence can be applied to updates like (4.41). We state a special
case of a result from [87] for the BFGS formulation
T
y# y# (Ac s)(Ac s)T
(4.42) A+ = Ac + − .
y #T s sT Ac s
Theorem 4.3.2. Let the standard assumptions hold and assume that
A∗ = ∇2 f (x∗ ) − C(x∗ )
x0 − x∗
≤ δ and
A0 − A∗
≤ δ,
then the quasi-Newton iterates defined by (4.42) exist and converge q-superlinearly to x∗ .
This result can be readily proved using the methods in this chapter (see [159]).
Quasi-Newton methods can also be designed to take into account special structure, such as
the sparsity pattern of the Hessian. One can update only those elements that are nonzero in the
initial approximation to the Hessian, requiring that the secant equation Hs = y holds. Such
updates have been proposed and analyzed in varying levels of generality in [83], [87], [185],
[238], [256], and [255].
BFGS METHOD 83
Another approach is to use the dependency of f on subsets of the variables, a structure that is
often present in discretizations of infinite-dimensional problems where coefficients of operators
can be updated rather than entire matrix representations of those operators. We refer the reader
to [133], [131], and [132] for an algebraic viewpoint based on finite-dimensional analysis and
to [159], [157], [164], [160], and [163] for an operator theoretic description of these methods.
When applied to discretizations of infinite-dimensional optimization problems, quasi-Newton
methods perform best when they also work well on the infinite-dimensional problem itself. Work
on BFGS in Hilbert space can be found, for example, in [135], [158], and [159].
Quasi-Newton methods have been designed for underdetermined problems [184], and Broy-
den’s method itself has been applied to linear least squares problems [111], [148].
4.4 Examples
The computations in this section were done with the MATLAB code bfgswopt. For the small
parameter ID problem, where evaluation of f is far more expensive than the cost of maintaining
or factoring the (very small!) approximate Hessian, one could also use a brute force approach
in which H is updated and factored anew with each iteration.
u0 (t) = 10.
BFGS also requires an initial approximation to the Hessian and we consider two such approxi-
mations:
(4.43) Hp = .25I and Hg = I.
The Hessian for the continuous problem is a compact perturbation of the identity and the
theory from [158] and [135] indicates that Hg is a much better approximate Hessian than Hp .
The results in Figure 4.2 support that idea. For the better Hessian, one can see the concavity
of superlinear convergence in the plot of the gradient norm. The computation for the better
Hessian required 12 iterations and roughly 572 thousand floating point operations, while the one
with the poor Hessian took 16 iterations and roughly 880 thousand floating point operations.
Stepsize reductions were not required for the good Hessian and were needed four times during
the iteration for the poor Hessian. However, the guard against poor scaling (3.50) was needed
in both cases.
[Plots: gradient norm and function value against iterations for the BFGS iterations with the good and poor initial Hessians; see Figure 4.2.]
When we used the same poor initial iterate that we used in §3.4.2
and allocated 50 vectors to Algorithm bfgsopt, there was no longer a benefit to using the
good Hessian. In fact, as is clear from Figure 4.3 the poor Hessian produced a more rapidly
convergent iteration.
Figure 4.3: BFGS–Armijo for Discrete Control Problem: Poor Initial Iterate
4.5.2. Prove Lemma 4.1.1. It might help to use the secant equation.
4.5.7. Show how the Sherman–Morrison formula can be used to implement the SR1 update in
such a way that only one vector need be stored for each iterate.
4.5.8. State and prove a local convergence theorem for DFP and/or PSB.
4.5.9. Implement the DFP and PSB update and compare their performance with BFGS on the
examples from §4.4.
4.5.10. Show that, for positive definite quadratic problems, the BFGS method with an exact line
search (i.e., one that finds the minimum of f in the search direction) is the same as CG
[201], [200].
4.5.11. Prove Theorem 4.3.2.
Chapter 5
(5.2) x∗ ∈ Ω = {x ∈ RN | Li ≤ (x)i ≤ Ui }.
or as minΩ f . The set Ω is called the feasible set and a point in Ω is called a feasible point.
Because the set Ω is compact there is always a solution to our minimization problem [229].
The inequalities Li ≤ (x)i ≤ Ui are called inequality constraints or simply constraints.
We will say that the ith constraint is active at x ∈ Ω if either (x)i = Li or (x)i = Ui . If the
ith constraint is not active we will say that it is inactive. The set of indices i such that the ith
constraint is active (inactive) will be called the set of active (inactive) indices at x.
We will write A(x) and I(x) for the active and inactive sets at x.
87
Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.
minimizer is one of the endpoints. If x∗ = a is a local minimizer, then it need not be the case
that f'(a) = 0; however, because a is a local minimizer, f(x) ≥ f(a) for all x ≥ a sufficiently
near a. Hence f'(a) ≥ 0. Nothing, however, can be said about f''. Similarly, if x∗ = b is a
local minimizer, f'(b) ≤ 0. If f is differentiable on [a, b] (i.e., on an open set containing [a, b]),
then the necessary conditions for all three possibilities, x∗ = a, x∗ = b, and a < x∗ < b, can be
neatly expressed by the following theorem.
Theorem 5.2.1. Let f be a continuously differentiable function of one variable on the
interval [a, b]. Let x∗ be a local minimum of f on [a, b]. Then
As in the unconstrained case, stationary points are said to satisfy the first-order necessary
conditions.
The fact that optimality implies stationarity is proved with Taylor’s theorem just as it was in
the unconstrained case.
Theorem 5.2.2. Let f be continuously differentiable on Ω and let x∗ be a solution of problem
(5.4). Then x∗ is a stationary point for problem (5.4).
Proof. Let x∗ be a solution of problem (5.4) and let y ∈ Ω. As Ω is convex, the line segment
joining x∗ and y is entirely in Ω. Hence, the function
φ(t) = f(x∗ + t(y − x∗))
is defined for t ∈ [0, 1] and has a local minimum at t = 0. Therefore, by Theorem 5.2.1,
0 ≤ φ'(0) = ∇f(x∗)^T (y − x∗)
as asserted.
The case of a function of a single variable is less useful in explaining the role of the second
derivative. However, we can get a complete picture by looking at functions of two variables.
To illustrate the ideas we let N = 2 and let f be a twice Lipschitz continuously differentiable
function on Ω = [0, 1] × [0, 1]. If x∗ is a solution of (5.4) and no constraints are active, then
∇2 f (x∗ ) is positive semidefinite by the same arguments used in the unconstrained case. If one
or more constraints are active, however, then, just as in the one variable case, one cannot draw
conclusions about the positivity of ∇2 f (x∗ ). Suppose the minimizer is at x∗ = (ξ, 0) with
0 < ξ < 1. While nothing can be said about ∂²f(x∗)/∂x₂², the function φ(t) = f(t, 0), defined
on [0, 1], must satisfy
φ''(ξ) = ∂²f(x∗)/∂x₁² ≥ 0.
Hence, second partials in directions corresponding to the inactive constraints must be nonnega-
tive, while nothing can be said about directions corresponding to active constraints.
φ(ξ) = f (ξ, ζ ∗ )
Theorem 5.2.4 states our final necessary condition; we defer the proof to §5.4.4.
Theorem 5.2.4. Let f be continuously differentiable. A point x∗ ∈ Ω is stationary for
problem (5.4) if and only if
(5.10) x∗ = P(x∗ − λ∇f (x∗ ))
for all λ ≥ 0.
Nondegeneracy is important not only in the formulation of sufficient conditions but also in
the design of termination criteria. The first step in the use of nondegeneracy is Lemma 5.3.1.
Lemma 5.3.1. Let x∗ be a nondegenerate stationary point. Assume that A = A(x∗ ) is not
empty. Then there is σ > 0 such that
∇f(x∗)^T (x − x∗) = ∇f(x∗)^T P_A (x − x∗) ≥ σ ‖P_A (x − x∗)‖
for all x ∈ Ω.
Proof. If i ∈ A then nondegeneracy and stationarity imply that there is σ > 0 such that either
(x∗)_i = L_i and (∇f(x∗))_i ≥ σ or (x∗)_i = U_i and (∇f(x∗))_i ≤ −σ.
If x ∈ Ω then for all i ∈ A,
(∇f(x∗))_i ((x)_i − (x∗)_i) ≥ σ |(x)_i − (x∗)_i|.
Therefore, since ‖x‖_1 ≥ ‖x‖_2,
∇f(x∗)^T P_A (x − x∗) ≥ σ ‖P_A (x − x∗)‖_1 ≥ σ ‖P_A (x − x∗)‖,
as asserted.
For a nondegenerate stationary point the sufficiency conditions are very similar to the un-
constrained case.
Theorem 5.3.2. Let x∗ ∈ Ω be a nondegenerate stationary point for problem (5.4). Let
f be twice differentiable in a neighborhood of x∗ and assume that the reduced Hessian at x∗
is positive definite. Then x∗ is a solution of problem (5.4) (and hence a nondegenerate local
minimizer).
Proof. Let x ∈ Ω and define φ(t) = f(x∗ + t(x − x∗)). We complete the proof by showing
that either (i) φ'(0) > 0 or (ii) φ'(0) = 0, φ''(0) > 0. Let e = x − x∗ and note that
∇f (x∗ )T PA e > 0
proving (ii).
where λ is a steplength parameter given by the Armijo rule or some other line search scheme.
In this section we will restrict our attention to the simplest form of the Armijo rule. In order to
implement any line search scheme, we must specify what we mean by sufficient decrease. For
λ > 0 define
(5.12) x(λ) = P(x − λ∇f (x)).
For bound constrained problems we will express the sufficient decrease condition for line searches
(compare with (3.4)) as
(5.13) f(x(λ)) − f(x) ≤ −(α/λ) ‖x − x(λ)‖².
As with (3.4), α is a parameter and is typically set to 10−4 [84].
The general algorithmic description follows in Algorithm 5.4.1.
Algorithm 5.4.1. gradproj(x, f, nmax)
1. For n = 1, . . . , nmax
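The loop body is a projected step followed by a line search. A minimal MATLAB sketch of one such iteration, not the gradproj code from the software collection, is given below; the handles f and gradf, the bound vectors L and U, the current iterate x, and the backtracking limit are assumptions of the sketch.

proj   = @(z) min(max(z, L), U);     % the projection P onto Omega = {L <= z <= U}
alpha  = 1.e-4;  beta = 0.1;         % line search parameters (assumed values)
g      = gradf(x);
lambda = 1;  xt = proj(x - lambda*g);  m = 0;
% backtrack until the sufficient decrease condition (5.13) holds
while f(xt) - f(x) > -(alpha/lambda)*norm(x - xt)^2 && m < 20
  lambda = beta*lambda;  xt = proj(x - lambda*g);  m = m + 1;
end
x = xt;                              % x(lambda) = P(x - lambda*grad f(x)), as in (5.12)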
Proof. Let
A∗ = A(x∗ ), I ∗ = I(x∗ ), A = A(x), and I = I(x).
Let
δ1 = min_{i∈I∗} {(U_i − (x∗)_i), ((x∗)_i − L_i), (U_i − L_i)/2}.
If i ∈ I∗ and ‖e‖ < δ1 then L_i < (x)_i < U_i; hence I∗ ⊂ I. Moreover,
‖e‖ < δ1 ≤ min {(U_i − L_i)/2},
Therefore, if
‖e‖ < δ3 < min(σ/2, δ2),
then i ∈ A_λ and (x(λ))_i = (x∗)_i. Hence A∗ ⊂ A_λ.
It remains to prove that A_λ ⊂ A∗. By definition of P we have
‖P(x) − P(y)‖ ≤ ‖x − y‖
and, therefore,
(5.14) ‖x∗ − x(λ)‖ = ‖P(x∗ − λ∇f(x∗)) − P(x − λ∇f(x))‖ ≤ ‖e‖ + λ‖∇f(x∗) − ∇f(x)‖ ≤ (1 + Lλ)‖e‖.
(5.15) ‖x∗ − x(λ)‖ ≥ δ1 = min_{i∈I∗} {(U_i − (x∗)_i), ((x∗)_i − L_i)}.
However, if
‖e‖ < δ4 = min(δ3, δ1/(1 + L))
then (5.14) implies that (5.15) cannot hold. This completes the proof.
Theorem 5.4.2. Let f be twice continuously differentiable on Ω and let x∗ be a nondegenerate
stationary point for problem (5.4). Assume that sufficient conditions hold at x∗ . Then there are
δ and M such that if ‖e‖ < δ and A(x) = A(x∗) then
(5.16) ‖e‖/M ≤ ‖x − x(1)‖ ≤ M ‖e‖.
‖x − x(1)‖ = ‖e − (x(1) − x∗(1))‖
≤ ‖e‖ + ‖P(x − ∇f(x)) − P(x∗ − ∇f(x∗))‖
≤ 2‖e‖ + ‖∇f(x) − ∇f(x∗)‖ ≤ (2 + L)‖e‖.
Hence, the right inequality in (5.16) holds.
To verify the left inequality in (5.16) we apply Lemma 5.4.1. Let δ1 be such that ‖e‖ < δ1
implies that the conclusions of Lemma 5.4.1 hold for λ = 1. The lemma implies that
(5.17) (x − x(1))_i = { (∇f(x))_i, i ∈ I∗,
                        (e)_i = 0,  i ∈ A∗.
The remaining case is if i ∈ I = I ∗ . The sufficiency conditions imply that there is µ > 0
such that
u^T P_{I∗} ∇²f(x∗) P_{I∗} u ≥ µ ‖P_{I∗} u‖²
for all u ∈ R^N. Hence, there is δ2 so that if ‖e‖ < δ2 then
u^T P_{I∗} ∇²f(x) P_{I∗} u ≥ µ ‖P_{I∗} u‖²/2
for all u ∈ R^N.
Therefore, since e = P_{I∗} e,
‖P_{I∗}(x − x(1))‖ ‖e‖ ≥ e^T P_{I∗}(x − x(1)) = ∫_0^1 e^T P_{I∗} ∇²f(x∗ + te) e dt
= ∫_0^1 e^T P_{I∗} ∇²f(x∗ + te) P_{I∗} e dt ≥ µ ‖P_{I∗} e‖²/2.
Therefore,
‖x − x(1)‖ ≥ min(1, µ/2) ‖e‖
and setting M = max{2 + L, 1, 2/µ} completes the proof.
Following the unconstrained case, we formulate a termination criterion based on relative and
absolute reductions in the measure of stationarity ‖x − x(1)‖. Given r0 = ‖x0 − x0(1)‖ and
relative and absolute tolerances τr and τa the termination criterion for Algorithm gradproj is
(5.18) ‖x − x(1)‖ ≤ τa + τr r0.
Proof. By definition of P,
‖x(λ) − x + λ∇f(x)‖ ≤ ‖y − x + λ∇f(x)‖
for all y ∈ Ω. Hence, the function
φ(t) = ‖(1 − t)x(λ) + ty − x + λ∇f(x)‖²/2
has a local minimum at t = 0 and, therefore,
0 ≤ φ'(0) = (y − x(λ))^T (x(λ) − x + λ∇f(x))
as asserted.
We will most often use the equivalent form of (5.19)
(5.21) ‖x − x(λ)‖² ≤ λ∇f(x)^T (x − x(λ)).
An important result in any line search analysis is that the steplengths remain bounded away
from 0.
Theorem 5.4.5. Assume that ∇f is Lipschitz continuous with Lipschitz constant L. Let
x ∈ Ω. Then the sufficient decrease condition (5.13) holds for all λ such that
(5.22) 0 < λ ≤ 2(1 − α)/L.
Proof. We begin with the fundamental theorem of calculus. Setting y = x − x(λ) we have
f(x − y) − f(x) = f(x(λ)) − f(x) = −∫_0^1 ∇f(x − ty)^T y dt.
Hence,
(5.23) f(x(λ)) = f(x) + ∇f(x)^T (x(λ) − x) − E,
where
E = ∫_0^1 (∇f(x − ty) − ∇f(x))^T y dt
and hence
|E| ≤ L ‖x − x(λ)‖²/2.
So
(5.25) λ(f(x) − f(x(λ))) ≥ λ∇f(x)^T (x − x(λ)) − λL ‖x − x(λ)‖²/2.
Therefore, using Corollary 5.4.4 we obtain
Theorem 5.4.6. Assume that ∇f is Lipschitz continuous with Lipschitz constant L. Let
{xn } be the sequence generated by the gradient projection method. Then every limit point of
the sequence is a stationary point.
Proof. Since the sequence {f (xn )} is decreasing and f is bounded from below on Ω, f (xn )
has a limit f ∗ . The sufficient decrease condition, as in the proof of Theorem 3.2.4, and (5.26)
imply that
‖xn − xn+1‖² ≤ λ(f(xn) − f(xn+1))/α ≤ (f(xn) − f(xn+1))/α → 0
as n → ∞.
Now let y ∈ Ω and n ≥ 0. By (5.20) we have
∇f(xn)^T (xn − y) ≤ λn^{-1} (xn − xn+1)^T (xn+1 − y) + ∇f(xn)^T (xn − xn+1).
Therefore,
∇f(xn)^T (xn − y) ≤ ‖xn − xn+1‖ (λn^{-1} ‖xn+1 − y‖ + ‖∇f(xn)‖),
and hence, by (5.26),
(5.27) ∇f(xn)^T (xn − y) ≤ ‖xn − xn+1‖ (λ̄^{-1} ‖xn+1 − y‖ + ‖∇f(xn)‖).
Using (5.12),
F (x) = x − x(1).
We now prove Theorem 5.2.4.
Proof. Corollary 5.4.4 states that
‖x∗ − x∗(λ)‖² ≤ λ∇f(x∗)^T (x∗ − x∗(λ)).
where Hc is spd. This is not the case as the following simple example illustrates. Let N = 2,
Li = 0, and Ui = 1 for all i. Let
f(x) = ‖x − (−1, 1/2)^T‖²/2;
then the only local minimizer for (5.3) is x∗ = (0, 1/2)T . Let xc = (0, 0) (not a local mini-
mizer!); then ∇f (xc ) = (1, −1/2)T . If
Hc^{-1} = ( 2  1
            1  2 )
The reason that xc (λ) = xc for all λ > 0 is that the search direction for the unconstrained
problem has been rotated by Hc−1 to be orthogonal to the direction of decrease in the inactive
directions for the constrained problem. Hence, unlike the unconstrained case, positivity of the
model Hessian is not sufficient and we must be able to estimate the active set and model the
reduced Hessian (rather than the Hessian) if we expect to improve convergence.
The solution proposed in [19] is to underestimate the inactive set in a careful way and
therefore maintain a useful spd approximate reduced Hessian. For x ∈ Ω and
R(xc, εc, Hc).
and then use the method of proof from Theorem 5.4.5. We do this by writing
∇f (x)T (x − xH,1 (λ)) = (PA (x) ∇f (x))T (x − xH,1 (λ)) + (PI (x) ∇f (x))T (x − xH,1 (λ))
and, therefore,
(5.33) (PA (x) ∇f (x))T (x − xH,1 (λ)) = (PA (x) ∇f (x))T (x − x(λ)).
L_i + ε ≤ (x)_i ≤ U_i − ε
and, hence, if
(5.36) λ ≤ λ̄2 = ε / max_{x∈Ω} ‖R(x, ε, H)^{-1} ∇f(x)‖_∞
then i is in the inactive set for both x_{H,1}(λ) and x(λ). Therefore, if (5.36) holds then
(5.37) (P_{I(x)} ∇f(x))^T (x − x_{H,1}(λ)) = λ (P_{I(x)} ∇f(x))^T H^{-1} P_{I(x)} ∇f(x)
≥ λ_l^{-1} λ^{-1} ‖P_{I(x)} (x − x(λ))‖²
= λ_l^{-1} (P_{I(x)} ∇f(x))^T (x − x(λ)).
(5.38) ∇f(x)^T (x − x_{H,1}(λ)) ≥ min(1, λ_l^{-1}) ∇f(x)^T (x − x(λ)) ≥ λ^{-1} min(1, λ_l^{-1}) ‖x − x(λ)‖².
The remainder of the proof is almost identical to that for Theorem 5.4.5. The fundamental
theorem of calculus and the Lipschitz continuity assumption imply that
(5.39) λ ≤ λ̄3 = (1 − α) / (max(1, λ_l) L).
1. For n = 1, . . . , nmax
(d) Find the least integer m such that (5.13) holds for λ = β m .
(e) x = x(λ).
If our model reduced Hessians remain uniformly positive definite, a global convergence
result completely analogous to Theorem 3.2.4 holds.
Theorem 5.5.2. Let ∇f be Lipschitz continuous with Lipschitz constant L. Assume that the
matrices Hn are symmetric positive definite and that there are κ̄ and λl such that κ(Hn) ≤ κ̄
and ‖Hn‖ ≤ λl for all n. Assume that there is ε̄ > 0 such that ε̄ ≤ εn < min(Ui − Li)/2 for
all n.
Then
(5.40) lim_{n→∞} ‖xn − xn(1)‖ = 0,
and hence any limit point of the sequence of iterates produced by Algorithm sgradproj is a
stationary point.
In particular, if xnl → x∗ is any convergent subsequence of {xn }, then x∗ = x∗ (1). If xn
converges to a nondegenerate local minimizer x∗ , then the active set of xn is the same as that of
x∗ after finitely many iterations.
Proof. With Lemma 5.5.1 and its proof in hand, the proof follows the outline of the proof of
Theorem 5.4.6. We invite the reader to work through it in exercise 5.8.3.
Hn = ∇2R f (xn )
in Algorithm sgradproj, then the resulting projected Newton method will take full steps (i.e.,
λ = 1) and, if n is chosen with care, converge q-quadratically to x∗ .
A specific form of the recommendation from [19], which we use here, is
(5.41) εn = min(‖xn − xn(1)‖, min(Ui − Li)/2).
Note that while xn is far from a stationary point and the reduced Hessian is spd, εn will be
bounded away from zero and Theorem 5.5.2 will be applicable. The convergence result is like
Theorem 2.3.3 for local convergence but makes the strong assumption that Hn is spd (valid near
x∗ , of course) in order to get a global result.
Algorithm projnewt is the formal description of the projected Newton algorithm. It is a
bit more than just a specific instance of Algorithm gradproj. Keep in mind that if the initial
iterate is far from x∗ and the reduced Hessian is not spd, then the line search (and hence the
entire iteration) may fail. The algorithm tests for this. This possibility of indefiniteness is the
weakness in any line search method that uses ∇2 f when far from the minimizer. The inputs to
Algorithm projnewt are the same as those for Algorithm gradproj. The algorithm exploits
the fact that
(5.42) R(x, ε, ∇²_R f(x)) = R(x, ε, ∇²f(x)),
which follows from A(x) ⊂ A^ε(x).
Algorithm 5.5.2. projnewt(x, f, τ, nmax)
1. For n = 1, . . . , nmax
(c) Compute and factor R = R(x, ε, ∇²_R f(x)). If R is not spd, terminate with a failure
message.
(d) Solve Rd = −∇f (xc ).
(e) Find the least integer m such that (5.13) holds for λ = β m .
(f) x = x(λ).
implies that
PA(xc ) ec = PA(xc ) e+ = 0.
Hence, we need only estimate PI(xc ) e+ to prove the result.
Let
δ∗ = min_{i∈I(x∗)} (|(x∗)_i − U_i|, |(x∗)_i − L_i|) > 0.
We reduce ‖e‖ if necessary so that
‖e‖ ≤ δ∗/M,
where M is the constant in Theorem 5.4.2. We may then apply Theorem 5.4.2 to conclude that
both εc < δ∗ and ‖ec‖ < δ∗. Then any index i ∈ A^{εc}(xc) must also be in A(xc) = A(x∗).
Hence
(5.43) A^{εc}(xc) = A(xc) = A(x∗).
From this we have
(5.44) R(xc, εc, ∇²_R f(xc)) = ∇²_R f(xc).
Hence, for ‖ec‖ sufficiently small the projected Newton iteration is
where
E1 = ∫_0^1 (∇²f(x∗ + t ec) − ∇²f(xc)) ec dt
and hence
‖E1‖ ≤ K1 ‖ec‖² for some K1 > 0.
By the necessary conditions,
P_{I(xc)} (∇²_R f(xc))^{-1} ∇f(xc) = (∇²_R f(xc))^{-1} P_{I(xc)} ∇f(xc) = ec + E2,
where ‖E2‖ ≤ K2 ‖ec‖² for some K2 > 0.
Since PI(xc ) Pw = PPI(xc ) w for all w ∈ RN ,
Therefore,
‖e+‖ ≤ K2 ‖ec‖²
as asserted.
and update an approximation to the part of the model Hessian that acts on the inactive set. In
this way we can hope to maintain a positive definite model reduced Hessian with, say, a BFGS
update. So if our model reduced Hessian is
R = C(x) + A,
we can use (4.42) to update A (with A0 = PI 0 (x0 ) , for example), as long as the active set does
not change. If one begins the iteration near a nondegenerate local minimizer with an accurate
approximation to the Hessian, then one would expect, based on Theorem 5.5.3, that the active
set would remain constant and that the iteration would converge q-superlinearly.
However, if the initial data is far from a local minimizer, the active set can change with each
iteration and the update must be designed to account for this. One way to do this is to use a
projected form of the BFGS update of A from (4.42),
(5.50) A+ = P_{I+} Ac P_{I+} + (y# (y#)^T)/((y#)^T s) − P_{I+} ((Ac s)(Ac s)^T)/(s^T Ac s) P_{I+},
with
y# = P_{I+}(∇f(x+) − ∇f(xc)).
Here I+ = I^{ε+}(x+). This update carries as much information as possible from the previous
model reduced Hessian while taking care about proper approximation of the active set. As in
the unconstrained case, if (y#)^T s ≤ 0 we can either skip the update or reinitialize A to P_I.
A is not spd if any constraints are active. However, we can demand that A be symmetric
indefinite, and a generalized inverse A† exists. We have
If A(xc ) = A(x+ ) any of the low-storage methods from Chapter 4 can be used to update A† .
In this case we have
(5.52) A†_+ = (I − (s (y#)^T)/((y#)^T s)) A†_c (I − (y# s^T)/((y#)^T s)) + (s s^T)/((y#)^T s).
P_{In} = P_{I^{εn}(xn)}.
1. d = P_{In} d.
2. If n = 0, d = A†_0 d; return
3. α = (s#_{n−1})^T d / ((y#_{n−1})^T s#_{n−1}); d = d − α y#_{n−1}
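A minimal MATLAB sketch of this kind of recursion, written for the unprojected update and therefore omitting the multiplications by P_{In} that the projected algorithm requires, is given below; the argument names (the stored pairs in the columns of S and Y and the handle A0inv for the initial inverse) are assumptions of the sketch.

function d = recinv(d, S, Y, A0inv, n)
  % Recursively apply the inverse built from n low-storage updates of the
  % form (5.52) to the vector d.
  if n == 0
    d = A0inv(d);
    return
  end
  s = S(:,n);  y = Y(:,n);  ys = y'*s;
  alpha = (s'*d)/ys;
  d = d - alpha*y;
  d = recinv(d, S, Y, A0inv, n-1);
  d = d + (alpha - (y'*d)/ys)*s;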
The projected BFGS–Armijo algorithm that we used in the example problems in §5.7 is
based on Algorithm bfgsrecb. Note that we reinitialize ns to zero (i.e., reinitialize A to P_I)
when (y#_{ns})^T s ≤ 0. We found experimentally that this was better than skipping the update.
Algorithm 5.5.4. bfgsoptb(x, f, τ, u, l)
1. ns = n = 0; pg0 = pg = x − P(x − ∇f(x))
2. ε = min(min(Ui − Li)/2, ‖pg‖); A = A^ε(x); I = I^ε(x); A0 = P_I
3. While ‖pg‖ > τa + τr ‖pg0‖
Theorem 4.1.3 can be applied directly once the active set has been identified and a good
initial approximation to the reduced Hessian is available. The reader is invited to construct the
(easy!) proof in exercise 5.8.6.
Theorem 5.5.4. Let x∗ be a nondegenerate local minimizer. Then if x0 is sufficiently near
to x∗, A(x0) = A(x∗), and A0 sufficiently near to P_{I(x∗)} ∇²f(x∗) P_{I(x∗)}, then the projected
BFGS iteration, with εn = ‖xn − xn(1)‖, will converge q-superlinearly to x∗.
A global convergence result for this projected BFGS algorithm can be derived by combining
Theorems 5.5.2 and 4.1.9.
Theorem 5.5.5. Let ∇f be Lipschitz continuous on Ω. Assume that the matrices Hn are
constructed with the projected BFGS method (5.50) and satisfy the assumptions of Theorem 5.5.2.
Then (5.40) and the conclusions of Theorem 5.5.2 hold.
Moreover, if x∗ is a nondegenerate local minimizer such that there is n0 such that A(xn ) =
A(x∗ ) for all n ≥ n0 , Hn0 is spd, and the set
[Plots: function value against iterations, and a plot of the computed solution.]
5.7 Examples
The computations in this section were done with the MATLAB code bfgsbound. In this code
the storage is limited to five pairs of vectors, and β = .1 was used in the line search.
[Plots: function value and projected gradient norm against iterations for the examples of this section; see Figures 5.3 and 5.4.]
We solve the constrained problem with Algorithm gradproj and Algorithm bfgsoptb.
In Figure 5.3 we plot the function value and the norm of the projected gradient u − u(1).
The projected BFGS iteration required 71 function evaluations, 36 gradient evaluations, and
roughly 5.6 million floating point operations, while the gradient projection needed 183 function
evaluations, 92 gradient evaluations, and roughly 10.4 million floating point operations.
Our second control problem example solves the same problem as in §3.4.2 using the con-
straints
−206 ≤ u ≤ 206.
We terminate the iteration when ‖u − u(1)‖ ≤ 10^{-6}, which is exactly the condition used in
§3.4.2 when the active set is empty. The solution to the unconstrained problem is feasible,
the active set is empty, and the initial iterate is feasible. Both the gradient projection iteration
and the projected BFGS iteration converge to the solution of the unconstrained problem. The
constraints are not active at either the initial iterate or the final solution but are active inside
the line search for the first iterate and for the second iterate. As is clear from a comparison
of Figures 5.4 and 3.3, this small change has a dramatic effect on the cost of the optimization,
eliminating the need for the scaling fixup (3.50). The gradient projection method, requiring 15
function evaluations, 8 gradient evaluations, and roughly 167 thousand floating point operations,
is far more efficient than the steepest descent iteration reported in §3.4.2. The projected BFGS
iteration was somewhat worse, needing 223 thousand operations, but only 13 function evaluations
and 7 gradient evaluations. In this example the cost of maintaining the BFGS update was not
compensated by a significantly reduced iteration count.
5.8.5. Suppose the unconstrained problem (1.2) has a solution x∗ at which the standard assump-
tions for unconstrained optimization hold. Consider the bound constrained problem (5.3)
for u and l such that x∗ ∈ Ω and A(x∗ ) is not empty. Is x∗ a nondegenerate local mini-
mizer? If not, how are the results in this chapter changed? You might try a computational
example to see what’s going on.
5.8.6. Prove Theorem 5.5.4.
5.8.10. What would happen in the examples if we increased the number of (y, s) pairs that were
stored? By how much would the BFGS cost be increased?
Part II
Chapter 6
The algorithms in Part I cannot be implemented at all if the gradient of f is not available,
either analytically or via a difference. Even if gradients are available, these algorithms are
not satisfactory if f has many local minima that are not of interest. We limit our coverage to
deterministic sampling algorithms which are generally applicable and are more or less easy to
implement. Of these algorithms, only the DIRECT algorithm [150] covered in §8.4.2 is truly
intended to be a global optimizer.
The study of optimization methods that do not require gradients is an active research area (see
[227] for a survey of some of this activity), even for smooth problems [61], [62]. Even though
some of the methods, such as the Nelder–Mead [204] and Hooke–Jeeves [145] algorithms are
classic, most of the convergence analysis in this part of the book was done after 1990.
The algorithms and theoretical results that we present in this part of the book are for objective
functions that are perturbations of simple, smooth functions. The surfaces in Figure 6.1 illustrate
this problem. The optimization landscape on the left of Figure 6.1, taken from [271], arose in
a problem in semiconductor design. The landscape on the right is a simple perturbation of a
convex quadratic.
We do not discuss algorithms that explicitly smooth the objective function or apply a filter,
such as the ones in [168] and [187]. For general problems, these must sample the variable
space in some way, for example by performing high-dimensional integration, and are too costly.
However, in some special cases these integrals can be performed analytically and impressive
results for special-purpose filtering algorithms for computational chemistry have been reported
in, for example, [196] and [277]. Nor do we discuss analog methods (see [149] for a well-known
111
Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.
example).
We also omit stochastic methods like the special-purpose methods discussed in [38] and [39],
or more radical general-purpose global optimization algorithms, such as simulated annealing
[166] (see [1] and [265] for surveys of recent work), interval methods [152], or genetic algorithms
[143], [144] (see [246] or [123] for a survey), which are random to some extent or random
search algorithms. These probabilistic methods, however, should be considered when the more
conservative algorithms such as the ones in this part of the book fail.
Clearly,
σ+ (S) ≤ diam(S) ≤ 2σ+ (S).
Note that the matrix of simplex directions and the vector of objective function differences
depend on which of the vertices is labeled x1 . Most of the algorithms we consider in this part
of the book use a vertex ordering or sample on a regular stencil. In this way the algorithms, in
one way or another, use a simplex gradient.
This definition of simplex gradient is motivated by the first-order estimate in Lemma 6.2.1.
Lemma 6.2.1. Let S be a simplex. Let ∇f be Lipschitz continuous in a neighborhood of S
with Lipschitz constant 2Kf . Then there is K > 0, depending only on Kf such that
(6.2) ‖∇f(x1) − D(f : S)‖ ≤ K κ(V) σ+(S).
Proof. Our smoothness assumptions on f and Taylor’s theorem imply that for all 2 ≤ j ≤
N + 1,
|f(x1) − f(xj) + vj^T ∇f(x1)| ≤ Kf ‖vj‖² ≤ Kf σ+(S)².
Hence
‖δ(f : S) − V^T ∇f(x1)‖ ≤ N^{1/2} Kf σ+(S)²
and hence, setting K = N^{1/2} Kf,
‖∇f(x1) − D(f : S)‖ ≤ K ‖V^{-T}‖ σ+(S)².
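A minimal MATLAB sketch of the computation D(f : S) = V^{-T} δ(f : S) used in the proof follows; the storage of the vertices as the columns of X and the function handle f are assumptions of the sketch.

function D = simplex_gradient(f, X)
  % Forward simplex gradient D(f:S) = V^{-T} delta(f:S) for the simplex
  % whose vertices x_1,...,x_{N+1} are the columns of X.
  [N, M] = size(X);                          % M should equal N+1
  V     = X(:,2:M) - X(:,1)*ones(1, M-1);    % simplex directions v_j = x_{j+1} - x_1
  delta = zeros(M-1, 1);
  for j = 2:M
    delta(j-1) = f(X(:,j)) - f(X(:,1));      % objective function differences
  end
  D = V' \ delta;                            % solves V^T D = delta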
A first-order estimate also holds for the simplex gradient of an objective function that satisfies
(6.1).
Lemma 6.2.2. Let S be a nonsingular simplex. Let f satisfy (6.1) and let ∇fs be Lipschitz
continuous in a neighborhood of S with Lipschitz constant 2Ks . Then there is K > 0, depending
only on Ks , such that
(6.3) ‖∇fs(x1) − D(f : S)‖ ≤ K κ(V) (σ+(S) + ‖φ‖_S / σ+(S)).
‖∇fs(x1) − D(fs : S)‖ ≤ Ks N^{1/2} κ(V) σ+(S).
‖D(f : S) − D(fs : S)‖ ≤ ‖V^{-T}‖ ‖δ(f : S) − δ(fs : S)‖ = ‖V^{-T}‖ ‖δ(φ : S)‖
≤ 2 N^{1/2} ‖V^{-T}‖ ‖φ‖_S ≤ 2 N^{1/2} κ(V) ‖φ‖_S / σ+(S).
Proof. The compactness of Ω and our smoothness assumptions on fs imply that there is β0
such that
‖∇fs(x)‖ ≥ β0 ‖x − x∗‖
for all x ∈ Ω. Hence
‖x1 − x∗‖ ≤ β0^{-1} ‖∇fs(x1)‖ ≤ β0^{-1} (‖D(f : S)‖ + K κ(V) (σ+(S) + ‖φ‖_S / σ+(S))).
Then,
1. if
lim_{k→∞} σ+(S^k) = 0,  lim_{k→∞} ‖φ‖_{S^k} / σ+(S^k) = 0,
and lim sup_{k→∞} ‖D(f : S^k)‖ = ε for some ε > 0, then there is KS > 0 such that
lim sup_{k→∞} ‖x∗ − x1^k‖ ≤ KS ε;
2. if
lim sup_{k→∞} ‖φ‖_{S^k} ≤ ε², lim inf_{k→∞} σ+(S^k) ≥ ε, and lim inf_{k→∞} ‖D(f : S^k)‖ ≤ ε,
then
lim sup_{k→∞} ‖x∗ − x1^k‖ ≤ KS (ε + lim sup_{k→∞} σ+(S^k)).
rj = x1 − vj for j = 1, . . . , N.
(6.4) ‖∇f(x1) − DC(f : S)‖ ≤ K κ(V) σ+(S)².
‖(δ(f : S) − δ(f : R))/2 − V^T ∇f(x1)‖ ≤ N^{1/2} KC σ+(S)³,
Proof. This proof is very similar to that of Lemma 6.2.2 and is left to the reader.
The quality of the information that can be obtained from the central simplex gradient is
higher than that of the forward. The difference in practice can be dramatic, as the examples
in §7.6 illustrate. The consequences of a small central simplex gradient follow directly from
Lemma 6.2.6.
Lemma 6.2.7. Let f satisfy (6.1) and let ∇2 fs be continuously differentiable in a compact
set Ω ⊂ RN . Assume that fs has a unique critical point x∗ in Ω. Then there is KΩ > 0 such
that if a simplex S and its reflection R(S) are both contained in Ω then
‖x1 − x∗‖ ≤ KΩ (‖DC(f : S)‖ + κ(V) (σ+(S)² + ‖φ‖_S / σ+(S))).
Lemma 6.2.7 is all one needs to conclude convergence from a sequence of small central
simplex gradients.
Theorem 6.2.8. Let f satisfy (6.1) and let ∇2 fs be continuously differentiable in a compact
set Ω ⊂ RN . Assume that fs has a unique critical point x∗ in Ω. Let S k be a sequence of
simplices having vertices {x_j^k}_{j=1}^{N+1}. Assume that there is M such that
Then,
1. if
lim_{k→∞} σ+(S^k) = 0,  lim_{k→∞} ‖φ‖_{S^k} / σ+(S^k) = 0,
and lim sup_{k→∞} ‖DC(f : S^k)‖ = ε, for some ε > 0, then there is KS > 0 such that
lim sup_{k→∞} ‖x∗ − x1^k‖ ≤ KS ε;
2. if
lim sup_{k→∞} ‖φ‖_{S^k} ≤ ε³, lim inf_{k→∞} σ+(S^k) ≥ ε, and lim inf_{k→∞} ‖DC(f : S^k)‖ ≤ ε²,
then
lim sup_{k→∞} ‖x∗ − x1^k‖ ≤ KS (ε + lim sup_{k→∞} σ+(S^k))².
Theorem 6.2.8, like Theorem 6.2.4, motivates using a small simplex gradient as a test for
convergence. Suppose ‖φ‖_∞ ≤ ε and an algorithm generates sequences of simplices whose
vertices are intended to approximate a minimizer of fs. We can use the results in §2.3.1 to con-
clude that simplices with σ+(S) << ε^{1/2} will result in inaccurate forward difference gradients
and those with σ+(S) << ε^{2/3} in inaccurate central difference gradients. This indicates that
the central simplex gradient will be less sensitive to noise than the forward. While this is not
usually critical in computing a difference Hessian, where the loss of accuracy may cause slow
convergence, it can cause failure of the iteration if one is computing a difference gradient.
If one wants to terminate the algorithm when the simplex gradient is small, say, ≤ τ, a rough
estimate of the minimal possible value of τ is τ = O(ε^{1/2}) for a forward difference simplex
gradient and τ = O(ε^{2/3}) for a central simplex gradient.
Moreover, if one is using a centered difference, one has information on the values of f at
enough points to make an important qualitative judgment. In order to evaluate a central simplex
gradient f must be sampled at x1 and x1 ± vj for 1 ≤ j ≤ N . If f (x1 ) ≤ f (x1 ± vj ) for
all 1 ≤ j ≤ N , then one can question the validity of using the simplex gradient as a descent
direction or as a measure of stationarity. We call this stencil failure. We will use stencil failure
as a termination criterion in most of the algorithms we discuss in this part of the book. Our basis
for that is a result from [29], which only requires differentiability of fs .
Theorem 6.2.9. Let S be a nonsingular simplex such that for some µ− ∈ (0, 1) and κ+ > 0,
Let f satisfy (6.1) and let ∇fs be Lipschitz continuously differentiable in a ball B of radius
2σ+ (S) about x1 . Assume that
Proof. Let R(S), the reflected simplex, have vertices x1 and {rj}_{j=1}^{N}. (6.7) implies that
each component of δ(f : S) and δ(f : R) is positive. Now since
V = V (S) = −V (R),
we must have
0 < δ(f : S)T δ(f : R)
Since ‖V‖ ≤ 2σ+(S) we have by (6.9)
The assumptions of the lemma give a lower estimate of the left side of (6.10),
w^T V V^T w ≥ µ− σ+(S)² ‖w‖².
Hence,
‖∇fs(x1)‖² ≤ b ‖∇fs(x1)‖ + c,
where, using (6.10),
b = 8 µ−^{-1} Ks κ+ (σ+(S) + ‖φ‖_B / σ+(S))
and
c = 4 µ−^{-1} (Ks κ+)² (σ+(S) + ‖φ‖_B / σ+(S))² = (µ−/16) b².
So b² − 4c = b²(1 − µ−/4) and the quadratic formula then implies that
‖∇fs(x1)‖ ≤ (b + √(b² − 4c))/2 = b (1 + √(1 − µ−/4))/2 ≤ b
as asserted.
6.3 Examples
Our examples are selected to represent a variety of problems that can be attacked by the methods
in this part of the book and, at the same time, are easy for the reader to implement. Many of the
problems to which these methods have been applied have complex objective functions and have
been solved as team efforts [107], [250], [121], [70], [69]. In many such cases the objective
function is not even available as a single subroutine as the optimizer, simulator, and design tool
are one package. Hence, the examples we present in this part of the book are even more artificial
than the ones in the first part. The cost of an evaluation of f is much less in these examples than
it is in practice.
so that a global optimum exists. If Σ_i wi < 0 then inf f = −∞ and there is no global optimum.
Weber’s problem is not differentiable at x = zi because the square root function is not
differentiable at 0. A gradient-based algorithm, applied in a naive way, will have difficulty with
this problem. There are special-purpose algorithms (see [182] for a survey) for Weber’s problem,
especially if all the weights are positive. Our main interest is in the case where at least one weight
is negative. In that case there may be multiple local minima.
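A minimal MATLAB sketch of the objective, assuming the form f(x) = Σ_i wi ‖x − zi‖ with weights wi and demand centers zi, is given below; the argument names are assumptions of the sketch.

function y = weber(x, w, Z)
  % Weber objective with weights w(i) and demand centers Z(:,i).
  y = 0;
  for i = 1:length(w)
    y = y + w(i)*norm(x - Z(:,i));
  end

With the data of the first example below one would call it as weber(x, [2; 4; -5], [2 90 43; 42 11 88]).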
We will consider two examples. The first, and simplest, is from [182]. This example has
three demand centers with
w = (2, 4, −5)^T and (z1, z2, z3) = ( 2  90  43
                                      42  11  88 ).
The global minimum is at x∗ = (90, 11)T , at which the gradient is not defined. The complex
contours near the minimizer in Figure 6.2 illustrate the difficulty of the problem.
Our second example has two local minimizers, at (−10, −10) and (25, 30) with the global
minimizer at (25, 30). There are four demand centers with
w = (2, −4, 2, 1)^T and (z1, z2, z3, z4) = ( −10  0  5  25
                                             −10  0  8  30 ).
See Figure 6.3.
Our third example adds the oscillatory function
φ(x) = sin(.0035xT x) + 5 sin(.003(x − y)T (x − y))
to the second example, where y = (−20, 0)T . This complicates the optimization landscape
significantly, as the surface and contour plots in Figure 6.4 show.
where {ξj }, {aj }, {bj }, {cj } are given and rand is a random number generator. f has been
designed so that the minimum value is O(a1 +a2 +a3 ). The unperturbed case a1 = a2 = a3 = 0
is also of interest for many of the algorithms in this part of the book.
dij = ‖ξi − ξj‖
and
x = (ξ1^T, . . . , ξM^T)^T ∈ R^N,
where N = 3M, we have that the Lennard–Jones energy function is
(6.13) f(x) = Σ_{i=1}^{M} Σ_{j=1}^{i−1} (dij^{−12} − 2 dij^{−6}).
f has many local minimizers (O(e^{M²}) is one conjecture [142]) and the values at the mini-
mizers are close. Hence, the Lennard–Jones function does not conform to the noisy perturbation
of a smooth function paradigm. The reader is asked in some of the exercises to see how the
methods perform.
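A minimal MATLAB sketch of (6.13) is given below; the function name and the storage of the particle coordinates in a single vector of length 3M are assumptions of the sketch.

function y = lennard_jones(x)
  % Lennard-Jones energy (6.13); x stacks the coordinates xi_1,...,xi_M in R^3.
  M  = length(x)/3;
  XI = reshape(x, 3, M);                 % column i holds xi_i
  y  = 0;
  for i = 2:M
    for j = 1:i-1
      d = norm(XI(:,i) - XI(:,j));       % d_ij = ||xi_i - xi_j||
      y = y + d^(-12) - 2*d^(-6);
    end
  end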
6.4.3. Try to minimize the Lennard–Jones functional using some of the algorithms from the first
part of the book. Vary the initial iterate and M . Compare your best results with those in
[142], [40], and [210].
Chapter 7
Implicit Filtering
(7.2) ‖∇h f(x)‖ ≤ τ h
for some τ > 0, when more than pmax iterations have been taken, after a stencil failure, or
when the line search fails by taking more than amax backtracks. Even the failures of fdsteep
123
Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.
can be used to advantage by triggering a reduction in h. The line search parameters α, β and
the parameter τ in the termination criterion (7.2) do not affect the convergence analysis that we
present here but can affect performance.
Algorithm 7.1.1. fdsteep(x, f, pmax, τ, h, amax)
1. For p = 1, . . . , pmax
(a) Compute f and ∇h f ; terminate if (6.7) or (7.2) hold.
(b) Find the least integer 0 ≤ m ≤ amax such that (7.1) holds for λ = β m . If no such
m exists, terminate.
(c) x = x − λ∇h f (x).
Algorithm fdsteep will terminate after finitely many iterations because of the limits on
the number of iterations and the number of backtracks. If the set {x | f (x) ≤ f (x0 )} is bounded
then the iterations will remain in that set. Implicit filtering calls fdsteep repeatedly, reducing
h after each termination of fdsteep. Aside from the data needed by fdsteep, one must
provide a sequence of difference increments, called scales in [120].
Algorithm 7.1.2. imfilter1(x, f, pmax, τ, {hk }, amax)
1. For k = 0, . . .
Call fdsteep(x, f, pmax, τ, hk , amax)
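A minimal MATLAB sketch of this structure, a central difference gradient at the current scale, an Armijo backtracking line search, and an outer loop over the scales, is given below. It is an illustration, not the imfil.m code from the software collection, and the line search parameters, the form of the sufficient decrease test, and the loop limits are assumptions of the sketch.

function x = imfilter_sketch(f, x, scales, pmax, amax, tau)
  alpha = 1.e-4;  beta = 0.5;            % line search parameters (assumed values)
  for k = 1:length(scales)
    h = scales(k);                       % current difference increment (scale)
    for p = 1:pmax
      g = diffgrad(f, x, h);             % difference gradient at scale h
      if norm(g) <= tau*h                % termination test in the spirit of (7.2)
        break
      end
      lambda = 1;  m = 0;                % Armijo backtracking on f(x - lambda*g)
      while f(x - lambda*g) - f(x) > -alpha*lambda*norm(g)^2 && m <= amax
        lambda = beta*lambda;  m = m + 1;
      end
      if m > amax                        % line search failure: move to the next scale
        break
      end
      x = x - lambda*g;
    end
  end

function g = diffgrad(f, x, h)
  % central difference approximation to the gradient with increment h
  n = length(x);  g = zeros(n, 1);
  for i = 1:n
    e = zeros(n, 1);  e(i) = 1;
    g(i) = (f(x + h*e) - f(x - h*e))/(2*h);
  end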
The convergence result follows from the second-order estimate, (6.5), the consequences of a
stencil failure, Theorem 6.2.9, and the equalities hk = σ+ (S k ) and κ(V k ) = 1. A similar result
for forward differences would follow from (6.3).
Theorem 7.1.1. Let f satisfy (6.1) and let ∇fs be Lipschitz continuous. Let hk → 0, {xk }
be the implicit filtering sequence, and S k = S(x, hk ). Assume that (7.1) holds (i.e., there is no
line search failure) for all but finitely many k. Then if
∇hk f (xk ) = DC (f : S k ) → 0.
∇fs (xk ) → 0,
as asserted.
In the context of implicit filtering, where N is small, the full quasi-Newton Hessian or its
inverse is maintained throughout the iteration. Our MATLAB codes store the model Hessian.
Algorithm 7.2.2. imfilter2(x, f, pmax, τ, {hk }, amax)
1. H = I.
2. For k = 0, . . .
Call fdquasi(x, f, H, pmax, τ, hk , amax).
In [250] and [120] the SR1 method was used because it performed somewhat better than the
BFGS method in the context of a particular application. The examples in §7.6 show the opposite
effect, and both methods have been successfully used in practice.
The choice of a quasi-Newton method to use with implicit filtering is an area of active research
[56], [55]. Both SR1 and BFGS have been used, with SR1 performing modestly better in some
applications with bound constraints [270], [251], [271], [250], [55]. The implementation of
implicit filtering in the collection of MATLAB codes imfil.m uses BFGS as the default but
has SR1 as an option. We found BFGS with central differences to be consistently better in the
preparation of the (unconstrained!) computational examples in this book.
where
x(λ) = P(x − λ∇h f (x)).
One could terminate the iteration at a given scale when the analogue to (7.2)
(7.7) ‖x − x(1)‖ ≤ τ h
holds or when
(7.8) f (xc ) < f (x ± rj ) for all x ± rj feasible,
which is the analogue to (6.7) for bound constrained problems.
Quasi-Newton methods for bound constraints can be constructed more simply for small
problems, like the ones to which implicit filtering is applied, where it is practical to store the
model of the inverse of the reduced Hessian as a full matrix. By using full matrix storage, the
complexity of bfgsrecb is avoided. One such alternative [53], [54], [55] to the updates in
§5.5.3 is to update the complete reduced Hessian and then correct it with information from the
new active set. This results in a two-stage update in which a model for the inverse of reduced
Hessian is updated with (4.5) to obtain
(7.9) R_{1/2}^{-1} = (I − (s y^T)/(y^T s)) R_c^{-1} (I − (y s^T)/(y^T s)) + (s s^T)/(y^T s).
Then the new reduced Hessian is computed using the active set information at the new point
(7.10) R_+^{-1} = P_{A+} + P_{I+} R_{1/2}^{-1} P_{I+}.
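A minimal MATLAB sketch of the two-stage update in (7.9) and (7.10), with the inverse of the model reduced Hessian stored as a full matrix and the projections stored as 0-1 diagonal matrices, is given below; the argument names are assumptions of the sketch.

function Rinv = reduced_inverse_update(Rinv, s, y, PI, PA)
  % Two-stage update of the inverse of the model reduced Hessian.
  ys    = y'*s;
  W     = eye(length(s)) - (s*y')/ys;
  Rhalf = W*Rinv*W' + (s*s')/ys;        % (7.9): unconstrained inverse BFGS update
  Rinv  = PA + PI*Rhalf*PI;             % (7.10): correct with the new active set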
It is easy to show that Theorems 5.5.4 and 5.5.5 hold for this form of the update.
A FORTRAN implementation [119] of implicit filtering for bound constrained problems is
in the software collection. In the original version of that implementation a projected SR1 update
was used and a Cholesky factorization of the matrix R+ was performed to verify positivity. The
model Hessian was reinitialized to the identity whenever the scale or the active set changed.
7.6 Examples
Many of these examples are from [56]. For all the examples we report results with and without a
quasi-Newton Hessian. We report results for both forward and central differences. In the figures
the solid line corresponds to the BFGS Hessian, the dashed-dotted line to the SR1 Hessian, and
the dashed line to H = I, the steepest descent form of implicit filtering.
Unlike the smooth problems considered earlier, where convergence of the gradient to zero
was supported by theory, convergence of the simplex gradient to zero is limited by the noise in
the objective. We illustrate performance by plotting both the objective function value and the
norm of the simplex gradient. From these examples it is clear that the graphs of function
value against the count of function evaluations are a better indicator of the performance of the
optimizer.
In all cases we terminated the iteration when either fdquasi had been called for each scale
or a budget of function evaluations had been exhausted. Once the code completes an iteration
and the number of function evaluations is greater than or equal to the budget, the iteration is
terminated.
The examples include both smooth and nonsmooth problems, with and without noise. A
serious problem for some algorithms of this type is their failure on very easy problems. For
most of the algorithms covered in this part of the book, we will present examples that illustrate
performance on this collection of problems.
[Plots: simplex gradient norm and function value against the number of function evaluations for the examples in this section.]
differencing. This, of course, is consistent with the theory, which does not claim that implicit
filtering is a global optimizer.
7.6.2 Parameter ID
We consider the parameter ID example from §1.6.2 using the data from §2.6.1. Recall that in this
example we use as data the values of the exact solution for c = k = 1 at the points ti = i/100
for 1 ≤ i ≤ 100. The initial iterate was (5, 5)^T; the sequence of scales was {2^{−k}}_{k=1}^{12}. Implicit
filtering, like the globally convergent algorithms in the first part of the book, is fairly insensitive
to the choice of initial iterate, as we will see when we revisit this example in §8.5.2.
We report on both low (rtol = atol = 10−3 , Figure 7.4) and high (rtol = atol = 10−6 ,
Figure 7.5) accuracy computations. Note that after 200 function evaluations the function re-
duction from the central difference BFGS form of implicit filtering flattens out in both plots at
roughly the expected level of O(tol) while the other methods have not. This effect, which is
not uncommon, is one reason for our preference for the BFGS central difference form of the
algorithm.
[Figures 7.4–7.9: function value and simplex gradient norm against the number of function evaluations for the parameter ID and quadratic examples.]
If a3 ≠ 0 then f may not return the same value when called with the same argument twice.
The reader is invited to explore the consequences of this in exercise 7.7.3.
The performance of the algorithms in this part of the book sometimes depends on the size of
the problem much more strongly than the Newton-based methods in Part I. In the case of implicit
filtering, that dependence is mostly a result of the cost of evaluation of the simplex gradient. To
illustrate this we consider our quadratic problems for N = 4 (Figures 7.6 and 7.8) and N = 32
(Figures 7.7 and 7.9).
For all the quadratic examples the initial iterate was
x0 = (1, 2, . . . , N)^T / (10N)
and the sequence of scales was {2^{−k}}_{k=0}^{10}.
7.7.4. Try to solve the Lennard–Jones problem with implicit filtering for various values of M
and various initial iterates. Compare your best results with those in [142], [40], and [210].
Are you doing any better than you did in exercise 6.4.3?
7.7.5. Show that Theorems 5.5.4 and 5.5.5 hold if the projected BFGS update is implemented us-
ing (7.9) and (7.10). How would these formulas affect an implementation like bfgsrecb,
which is designed for problems in which full matrices cannot be stored?
Chapter 8
In this chapter we discuss the class of direct search algorithms. These methods use values of
f taken from a set of sample points and use that information to continue the sampling. Unlike
implicit filtering, these methods do not explicitly use approximate gradient information. We will
focus on three such methods: the Nelder–Mead simplex algorithm [204], the multidirectional
search method [85], [261], [262], and the Hooke–Jeeves algorithm [145]. Each of these can be
analyzed using the simplex gradient techniques from Chapter 6. We will not discuss the very
general results based on the taxonomies of direct search methods from [263], [174], and [179] or
the recent research on the application of these methods to bound [173] or linear [175] constraints.
We include at the end of this chapter a short discussion of methods based on surrogate models
and a brief account of a very different search method, the DIRECT algorithm [150]. These two
final topics do not lead to algorithms that are easy to implement, and our discussions will be
very general with pointers to the literature.
(8.3) x̄ = (1/N) Σ_{i=1}^{N} xi.
135
Buy this book from SIAM at http://www.ec-securehost.com/SIAM/FR18.html.
Copyright ©1999 by the Society for Industrial and Applied Mathematics. This electronic version is for personal use and may not be duplicated or distributed.
by rules that we formally describe in Algorithm nelder. Our formulation of the algorithm
allows for termination if either f (xN +1 ) − f (x1 ) is sufficiently small or a user-specified number
of function evaluations has been expended.
1. Evaluate f at the vertices of S and sort the vertices of S so that (8.1) holds.
2. Set f count = N + 1.
(a) Compute x, (8.3), x(µr ), (8.2), and fr = f (x(µr )). f count = f count + 1.
(b) Reflect: If f count = kmax then exit. If f (x1 ) ≤ fr < f (xN ), replace xN +1 with
x(µr ) and go to step 3g.
(c) Expand: If f count = kmax then exit. If fr < f (x1 ) then compute fe = f (x(µe )).
f count = f count + 1. If fe < fr , replace xN +1 with x(µe ); otherwise replace
xN+1 with x(µr). Go to step 3g.
(d) Outside Contraction: If f count = kmax then exit. If f (xN ) ≤ fr < f (xN +1 ),
compute fc = f (x(µoc )). f count = f count + 1. If fc ≤ fr replace xN +1 with
x(µoc ) and go to step 3g; otherwise go to step 3f.
(e) Inside Contraction: If f count = kmax then exit. If fr ≥ f (xN +1 ) compute
fc = f (x(µic )). f count = f count + 1. If fc < f (xN +1 ), replace xN +1 with
x(µic ) and go to step 3g; otherwise go to step 3f.
(f) Shrink: If f count ≥ kmax − N , exit. For 2 ≤ i ≤ N + 1: set xi = x1 − (xi −
x1 )/2; compute f (xi ).
(g) Sort: Sort the vertices of S so that (8.1) holds.
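A minimal MATLAB sketch of the candidate points used in steps 3a-3e, written in terms of (8.2) and (8.3) with the usual coefficient choices µr = 1, µe = 2, µoc = 1/2, and µic = −1/2, is given below; the storage of the sorted vertices as the columns of X and the dimension N are assumptions of the sketch.

xbar  = mean(X(:,1:N), 2);                   % centroid (8.3) of the N best vertices
xofmu = @(mu) (1 + mu)*xbar - mu*X(:,N+1);   % trial point x(mu) from (8.2)
xr  = xofmu(1);        % reflection
xe  = xofmu(2);        % expansion
xoc = xofmu(1/2);      % outside contraction
xic = xofmu(-1/2);     % inside contraction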
Figure 8.1 is an illustration of the options in two dimensions. The vertices labeled 1, 2, and
3 are those of the original simplex.
The Nelder–Mead algorithm is not guaranteed to converge, even for smooth problems [89],
[188]. The failure mode is stagnation at a nonoptimal point. In §8.1.3 we will present some
examples from [188] that illustrate this failure. However, the performance of the Nelder–Mead
algorithm in practice is generally good [169], [274]. The shrink step is rare in practice and we
will assume in the analysis in §8.1.2 that shrinks do not occur. In that case, while a Nelder–Mead
iterate may not result in a reduction in the best function value, the average value
1 N +1
f= f (xj )
N +1 j=1
will be reduced.
Assumption 8.1.1 is satisfied by the Nelder–Mead sequence if no shrink steps are taken
and the initial simplex directions are linearly independent [169]. The Nelder–Mead algorithm
demands that the average function value improve, but no control is possible on which value is
improved, and the simplex condition number can become unbounded.
We can define a sufficient decrease condition for search algorithms that is analogous to the
sufficient decrease condition for steepest descent and related algorithms (3.2). We will ask that
the k + 1st iteration satisfy
(8.4) f^{k+1} − f^k < −α ‖D^k f‖².
Here α > 0 is a small parameter. Our choice of sufficient decrease condition is motivated by the
smooth case and steepest descent, where (3.2) and the lower bound λ̄ on λ from Lemma 3.2.3
lead to
f(x_{k+1}) − f(x_k) ≤ −λ̄ α ‖∇f(x_k)‖²,
which is a smooth form of (8.4). Unlike the smooth case, however, we have no descent direction
and must incorporate λ̄ into α. This leads to the possibility that if the simplex diameter is much
smaller than ‖D^k f‖, (8.4) could fail on the first iterate. We address this problem with the scaling
α = α0 σ+(S^0) / ‖D^0 f‖.
A typical choice in line search methods, which we use in our numerical results, is α0 = 10−4 .
The convergence result for smooth functions follows easily from Lemma 6.2.1.
Theorem 8.1.1. Let a sequence of simplices satisfy Assumption 8.1.1 and let the assumptions
of Lemma 6.2.1 hold, with the Lipschitz constants K k uniformly bounded. Assume that {f k } is
bounded from below. Then if (8.4) holds for all but finitely many k and
lim_{k→∞} σ+(S^k) κ(V^k) = 0,
then every accumulation point of the vertices {x1^k} is a critical point of f.
Hence, if x∗ is any accumulation point of the sequence {x1^k} then ∇f(x∗) = 0. This completes
the proof since κ(V^k) ≥ 1 and therefore σ+(S^k) → 0.
The result for the noisy functions that satisfy (6.1) with fs smooth reflects the fact that
the resolution is limited by the size of φ. In fact, if σ+(S^k) is much smaller than ‖φ‖_{S^k}, no
information on fs can be obtained by evaluating f at the vertices of S^k and once σ+(S^k) is
smaller than ‖φ‖_{S^k}^{1/2} no conclusions on ∇fs can be drawn. If, however, the noise decays to zero
sufficiently rapidly near the optimal point, the conclusions of Theorem 8.1.1 still hold.
Theorem 8.1.2. Let a sequence of simplices satisfy Assumption 8.1.1 and let the assumptions
of Lemma 6.2.2 hold with the Lipschitz constants Ksk uniformly bounded. Assume that {f k } is
bounded from below. Then if (8.4) holds for all but finitely many k and if
(8.5) lim_{k→∞} κ(V^k) (σ+(S^k) + ‖φ‖_{S^k} / σ+(S^k)) = 0,
then any accumulation point of the simplices is a critical point of fs .
Proof. Our assumptions, as in the proof of Theorem 8.1.1, imply that Dk f → 0. Recall that
Lemma 6.2.2 implies that
(8.6) ‖D^k fs‖ ≤ ‖D^k f‖ + K^k κ(V^k) (σ+(S^k) + ‖φ‖_{S^k} / σ+(S^k)),
With this data, the Nelder–Mead iteration will stagnate at the origin, which is not a critical point
for f . The stagnation mode is repeated inside contractions that leave the best point (which is not
a minimizer) unchanged.
We terminated the iteration when the difference between the best and worst function values
was < 10−8 .
We illustrate the behavior of the Nelder–Mead algorithm in Figures 8.2, 8.3, and 8.4. In all
the figures we plot, as functions of the iteration index, the difference between the best and worst
function values, σ+ , the maximum oriented length, the norm of the simplex gradient, and the l2
condition number of the matrix of simplex directions. In all three problems stagnation is evident
from the behavior of the simplex gradients. Note also how the simplex condition number is
growing rapidly.
and (Dk f )l is the lth component of Dk f . If Dk f = 0 we assume that the Nelder–Mead iteration
would have been terminated at iteration k because there is no difference between best and worst
values.
So, before ordering, the new simplex has the same first point as the old. The diameter of the
new simplex has not been increased since the diameter of the new simplex is at most σ+ (S k ).
Moreover all edge lengths have been reduced. So after reordering σ+ (S k+1 ) ≤ σ− (S k ). As for
κ, after the oriented shrink, but before reordering, κ(V ) = 1. After reordering, of course, the
best point may no longer be x1 . In any case the worst-case bound on κ is
(8.8) κ(V^{k+1}) = ‖V^{k+1}‖_2 ‖(V^{k+1})^{-1}‖_2 ≤ (1 + √N)².
[Figure: a right simplex and an equilateral simplex with their reflected (r), expanded (e), and contracted (c) vertices.]
1. Evaluate f at the vertices of S and sort the vertices of S so that (8.1) holds.
2. Set f count = N + 1.
If the function values at the vertices of S c are known, then the cost of computing S + is 2N
additional evaluations. Just as with the Nelder–Mead algorithm, the expansion step is optional
but has been observed to improve performance.
The extension of MDS to bound constrained and linearly constrained problems is not trivial.
We refer the reader to [173] and [175] for details.
This exhaustion of a lattice takes place under more general conditions [262] but is most clear for
the equilateral and right simplex cases.
Theorem 6.2.9 implies that infinitely many contractions and convergence of the simplex
diameters to zero imply convergence of the simplex gradient to zero. The similarity of The-
orem 6.2.9 to Lemma 6.2.2 and of Theorem 8.2.1, the convergence result for multidirectional
search, to Theorem 8.1.2 is no accident. The Nelder–Mead iteration, which is more aggressive
than the multidirectional search iteration, requires far stronger assumptions (well conditioning
and sufficient decrease) for convergence, but the ideas are the same. Theorems 6.2.9 and 8.2.1
can be used to extend the results in [262] to the noisy case. The observation in [85] that one
can apply any heuristic or machine-dependent idea to improve performance, say, by exploring
far away points on spare processors (the “speculative function evaluations” of [46]) without
affecting the analysis is still valid here.
Theorem 8.2.1. Let f satisfy (6.1) and assume that the set
{x | f(x) ≤ f(x_1^0)}
is bounded. Then
(8.9)    lim_{k→∞} σ+(S^k) = 0.
and the new base point is xcb . The subtle part of the algorithm begins here. Rather than center
the next exploration at xcb , which would use some of the same points that were examined in
the previous exploration, the Hooke–Jeeves pattern move step is aggressive and tries to move
further. The algorithm centers the next exploratory move at
xC = xcb + dHJ = x + 2dHJ, where dHJ = xcb − x.
If this second exploratory move fails to improve upon f(xcb), then an exploratory move with
xcb as the center is tried. If that fails h is reduced, x is set to xcb , and the process is started over.
Note that when h has just been set, the base point and the center of the stencil for the exploratory
moves are the same, but afterward they are not.
If, after the first exploratory move, xcb = x (i.e., if x is the best point in its own stencil), then
x is left unchanged and h is reduced.
Therefore, whenever h is reduced, the stencil centered at x has x itself as the best point.
This is exactly the situation that led to a shrink in the MDS algorithm and, as you might expect,
will enable us to prove a convergence result like those in the previous sections. In [145] h was
simply multiplied by a constant factor. Our description in Algorithm hooke follows the model
of implicit filtering and uses a sequence of scales. Choice of perturbation directions could be
generalized to any simplex shape, not just the right simplices used in [145].
Figure 8.9 illustrates the idea for N = 2. The base point x lies at the center of the stencil. If
f(x1+) < f(x), f(x2+) < f(x), f(x1−) ≥ f(x), and f(x2−) ≥ f(x),
then the new base point xb will be located above and to the right of x. The next exploratory
move will be centered at xC , which is the center of the stencil in the upper right corner of the
figure.
The reader, especially one who plans to implement this method, must be mindful that points
may be sampled more than once. For example, in the figure, if the exploratory move centered
at xC fails, f will be evaluated for the second time at the four points in the stencil centered
at xb unless the algorithm is implemented to avoid this. The MDS method is also at risk of
sampling points more than once. The implementations of Hooke–Jeeves and MDS in our suite
of MATLAB codes keep the most recent 4N iterations in memory to guard against this. This
reevaluation is much less likely for the Nelder–Mead and implicit filtering methods. One should
also be aware that the Hooke–Jeeves algorithm, like Nelder–Mead, does not have the natural
parallelism that implicit filtering and MDS do.
One could implement a variant of the Hooke–Jeeves iteration by using xC = x + dHJ
instead of xC = x + 2dHJ and shrinking the size of the simplex on stencil failure. This is the
discrete form of the classical coordinate descent algorithm [180] and can also be analyzed by
the methods of this section (see [279] for a different view).
Our implementation follows the model of implicit filtering as well as the description in
[145]. We begin with the exploratory phase, which uses a base point xb , base function value
fb = f (xb ), and stencil center xC . Note that in the algorithm xb = xC for the first exploration
and xC = xb +dHJ thereafter. Algorithm hjexplore takes a base point and a scale and returns
a direction and the value at the trial point x + d. We let V = I be the matrix of coordinate
directions, but any nonsingular matrix of search directions could be used. The status flag sf is
used to signal failure and trigger a shrink step.
Algorithm 8.3.1. hjexplore(xb , xC , f, h, sf )
3. if xcb ≠ xb; sf = 1; xb = xcb
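The fragment below sketches such an exploratory move in MATLAB: it samples the full stencil centered at xC and reports success if the best sampled point improves on the base point. It is a sketch only; the bookkeeping in the book's hooke.m, in particular the reuse of previously computed function values, is not reproduced.

    function [xb, fb, sf] = hjexplore_sketch(xb, xC, f, h, V)
    % Exploratory move: sample the 2N stencil points xC +/- h*V(:,i) and
    % accept the best of them if it improves on the base point xb.
    fb = f(xb); sf = 0; xcb = xb; fcb = fb;
    for i = 1:size(V, 2)
        for s = [1, -1]
            xt = xC + s*h*V(:,i);
            ft = f(xt);
            if ft < fcb
                xcb = xt; fcb = ft;
            end
        end
    end
    if any(xcb ~= xb)
        sf = 1; xb = xcb; fb = fcb;
    end
    end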
[Figure 8.9: Hooke–Jeeves exploratory and pattern moves for N = 2, showing the stencil about the base point x (trial points x1±, x2±), the new base point xb, and the next stencil center xC with its trial points xC1±, xC2±.]
The exploration is coupled to the pattern move to complete the algorithm for a single value
of the scale. The inputs for Algorithm hjsearch are an initial iterate x, the function, and the
scale. On output, a point x is returned for which the exploration has failed. There are other
considerations, such as the budget for function evaluations, that should trigger a return from the
exploratory phase in a good implementation. In our MATLAB code hooke.m we pay attention
to the number of function evaluations and change in the function value as part of the decision to
return from the exploratory phase.
Algorithm 8.3.2. hjsearch(x, f, h)
1. xb = x; xC = x; sf = 1
2. Call hjexplore(x, xC , f, h, sf )
3. While sf = 1
(a) d = x − xb ; xb = x; xC = x + d
(b) Call hjexplore(x, xC , f, h, sf );
If sf = 0; xC = x; Call hjexplore(x, xC , f, h, sf )
Step 3b requires care in implementation. If sf = 0 on exit from the first call to hjexplore,
one should only test f at those points on the stencil centered at x that have not been evaluated
before.
The Hooke–Jeeves algorithm simply calls hjsearch repeatedly as h varies over a sequence
{hk } of scales.
Algorithm 8.3.3. hooke(x, f, {hk })
1. For k = 1, . . .
Call hjsearch(x, f, hk )
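Putting the pieces together, a stripped-down version of the single-scale search and of the sweep over scales could be written as follows, using the exploratory-move sketch above; the budget of function evaluations used in hooke.m is omitted.

    function x = hjsearch_sketch(x, f, h)
    % Single-scale Hooke-Jeeves search (cf. Algorithm hjsearch).
    xb = x; xC = x; N = length(x);
    [x, fb, sf] = hjexplore_sketch(xb, xC, f, h, eye(N));
    while sf == 1
        d = x - xb; xb = x; xC = x + d;                % pattern move
        [x, fb, sf] = hjexplore_sketch(x, xC, f, h, eye(N));
        if sf == 0                                     % retry with the base point as center
            xC = x;
            [x, fb, sf] = hjexplore_sketch(x, xC, f, h, eye(N));
        end
    end
    end

    function x = hooke_sketch(x, f, scales)
    % Sweep over the scales, e.g. scales = 2.^-(1:12) (cf. Algorithm hooke).
    for k = 1:length(scales)
        x = hjsearch_sketch(x, f, scales(k));
    end
    end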
As is the case with implicit filtering, the Hooke–Jeeves algorithm can be applied to bound
constrained problems in a completely natural way [145], [227]. One simply restricts the
exploratory and pattern moves to the feasible set: stencil points that violate the bounds are
skipped, and pattern moves that leave the feasible region are not taken.
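For example, with bounds l ≤ x ≤ u one might guard each trial point with a feasibility test such as the following (a sketch, not the book's implementation):

    % A trial point is used only if it satisfies the bounds:
    feasible = @(x, l, u) all(x >= l) && all(x <= u);
    % Inside the exploratory move: evaluate f(xt) only if feasible(xt, l, u).
    % A pattern move to xC = x + d is attempted only if feasible(xC, l, u);
    % otherwise the next exploration is centered at x itself.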
lim_{k→∞} ||φ||_{B^k} / σ+(S^k) = 0,
where B^k is the ball of radius 2h^k about x^k, then every limit point of {x^k} is a critical point of
fs.
interpolatory data. The function being modeled could be too complex to be modeled in a sim-
ple way (think of the Lennard–Jones function), and very misleading results could be obtained.
However, this approach is often very productive even for smooth problems in which evaluation
of f is very expensive (see [28] for a high-flying example).
Initialization of the model requires an initial set of points at which to sample f . Selection of
this point set is not a trivial issue, and the regular stencils used in implicit filtering and the direct
search algorithms are very poor choices. The study of this issue alone is a field in itself, called
design and analysis of computer experiments (DACE) [27], [167], [230].
Having built such a model, one then finds one or more local minima of the model. One can
use either a conventional gradient-based method, a sampling algorithm of the type discussed in
Chapters 7 or 8, or an algorithm that is itself based on building models like the one described in
[62], the nongradient-based approaches being used when the model is expected to be multimodal
or nonconvex. Upon minimizing the model, one then evaluates f again at one or more new points.
The implementation of such a scheme requires careful coordination between the sampling
of the function, the optimization of the model, and the changing of the set of sample points. We
refer the reader to [28] and [4] for more information on recent progress in this area.
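The following toy MATLAB loop illustrates the basic sample, model, minimize, resample cycle for a one-dimensional problem, using a polynomial surrogate. It is only a caricature of the careful designs and models used in the DACE and surrogate-management literature cited above; the test function here is a stand-in for an expensive f.

    f  = @(x) exp(-x).*sin(5*x);           % stand-in for an expensive function
    xs = linspace(0, 2, 5)'; fs = f(xs);   % initial design on [0, 2]
    for it = 1:5
        p = polyfit(xs, fs, 4);                    % cheap polynomial surrogate
        xnew = fminbnd(@(x) polyval(p, x), 0, 2);  % minimize the surrogate
        xs = [xs; xnew]; fs = [fs; f(xnew)];       % evaluate f there and refit
    end
    [fbest, ibest] = min(fs); xbest = xs(ibest);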
(8.11)    f(x) ≥ flow(x, a, b) = max(f(a) − L(x − a), f(b) − L(b − x))
for all x ∈ [a, b]. If one samples f repeatedly, one can use (8.11) on a succession of intervals
and obtain a piecewise linear approximation to f. If In = [an, bn] ⊂ [a, b], then f(x) ≥
flow(x, an, bn) on In; the minimum value of flow(x, an, bn) is
Vn = (f(an) + f(bn))/2 − L(bn − an)/2,
attained at the point
Mn = (an + bn)/2 + (f(an) − f(bn))/(2L).
The algorithm begins with I0 = [a, b], selects the interval for which Vn is least, and divides at
Mn . This means that if K intervals have been stored we have, replacing In and adding IK+1 to
the list,
In = [an , Mn ] and IK+1 = [Mn , bn ].
The sequence of intervals is only ordered by the iteration counter, not by location. In this way
the data structure for the intervals is easy to manage.
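One step of this bookkeeping can be sketched as follows in MATLAB, assuming L is known and each stored interval is a row [an, bn, f(an), f(bn)] of an array I; the pruning test described next is omitted from the sketch.

    V = (I(:,3) + I(:,4))/2 - L*(I(:,2) - I(:,1))/2;    % lower bounds Vn
    M = (I(:,1) + I(:,2))/2 + (I(:,3) - I(:,4))/(2*L);  % minimizers Mn of flow
    [Vmin, n] = min(V);                                 % interval to divide
    a = I(n,1); b = I(n,2); fa = I(n,3); fb = I(n,4);
    m = M(n); fm = f(m);
    I(n,:)     = [a, m, fa, fm];                        % In becomes [an, Mn]
    I(end+1,:) = [m, b, fm, fb];                        % append the interval [Mn, bn]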
If there are p and k such that p ≠ k and Vp ≥ max(f(ak), f(bk)), then Ip need not be
searched any longer, since the best possible value from Ip is no better than a value already known in Ik. The
algorithm’s rule for division automatically incorporates this information and will never sample
from Ip .
There are two problems with this algorithm. One cannot expect to know the Lipschitz
constant L, so it must be estimated. An estimated Lipschitz constant that is too low can lead to
erroneous rejection of an interval. An estimate that is too large will lead to slow convergence,
since intervals that should have been discarded will be repeatedly divided. The second problem
is far more serious. The obvious generalization of the Shubert algorithm to more than one
dimension would replace intervals by N -dimensional hyperrectangles and require sampling at
each of the 2N vertices of the rectangle to be divided. This exponential complexity makes this
trivial generalization of the Shubert algorithm completely impractical.
The DIRECT algorithm [150] attempts to address these problems by sampling at the midpoint
of the hyperrectangle rather than the vertices and indirectly estimating the Lipschitz constant as
the optimization progresses. The scheme is not completely successful in that the mesh of sample
points becomes everywhere dense as the optimization progresses. Hence the algorithm becomes
an exhaustive search, a fact that is used in [150] to assert global convergence. In spite of the
exponential complexity of exhaustive search, even one with a fixed-size mesh (a problem with
any deterministic algorithm that is truly global [248]), DIRECT has been reported to perform well
in the early phases of the iteration [150], [108] and for suites of small test problems. DIRECT is
worth consideration as an intermediate algorithmic level between methods like implicit filtering,
Nelder–Mead, Hooke–Jeeves, or MDS on the conservative side and nondeterministic methods
like simulated annealing or genetic algorithms on the radical side.
We will describe DIRECT completely only for the case N = 1. This will make clear how
the algorithm implicitly estimates the Lipschitz constant. The extension to larger values of N
requires careful management of the history of subdivision of the hyperrectangles, and we will give
a simple pictorial account of that. For more details we refer to [150], [147], or the documentation
[108] of the FORTRAN implementation of DIRECT from the software collection.
As with the Shubert algorithm we begin with an interval [a, b] but base our lower bound and
our subdivision strategy on the midpoint c = (a + b)/2. If the Lipschitz constant L is known
then
If we are to divide an interval and also retain the current value c as the midpoint of an interval
in the set of intervals, we must divide an interval into three parts. If there are K intervals on
the list and an interval In = [an , bn ] with midpoint cn has been selected for division, the new
intervals are
IK+1 = [an , an + (bn − an )/3], In = [an + (bn − an )/3, bn − (bn − an )/3], and
IK+2 = [bn − (bn − an )/3, bn ].
So cn is still the midpoint of In and two new midpoints have been added.
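In MATLAB-like pseudocode, with each stored interval a row [an, bn, cn, f(cn)] of an array I and n the index of the interval selected for division, the trisection reads (a sketch, not the implementation of [108]):

    a = I(n,1); b = I(n,2); w = (b - a)/3;
    cL = a + w/2;  cR = b - w/2;                 % midpoints of the two new thirds
    I(end+1,:) = [a,     a + w, cL, f(cL)];      % new interval [an, an + (bn-an)/3]
    I(end+1,:) = [b - w, b,     cR, f(cR)];      % new interval [bn - (bn-an)/3, bn]
    I(n,1) = a + w;  I(n,2) = b - w;             % In shrinks but keeps its center cn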
The remaining part of the algorithm is the estimation of the Lipschitz constant and the
simultaneous selection of the intervals to be divided. If the Lipschitz constant were known, an
interval would be selected for division if f (c) − L(b − a)/2 were smallest. This is similar to
the Shubert algorithm. In order for there to even exist a Lipschitz constant that would force
an interval to be selected for division in this way, that interval must have the smallest midpoint
value of all intervals having the same length. Moreover, there should be no interval of a different
length for which f (c) − L(b − a)/2 was smaller.
The DIRECT algorithm applies this rule to all possible combinations of possible Lipschitz
constants and interval sizes. If one plots the values of f at the midpoints against the lengths of
the intervals in the list to obtain a plot like the one in Figure 8.10, one can visually eliminate
all but one interval for each interval length. By taking the convex hull of the lowest points, one
can eliminate interval lengths for which all function values are so high that f (c) − L(b − a)/2
would be smaller for the best point at a different length no matter what L was. For example, the
three points that intersect the line in Figure 8.10 would correspond to intervals that would be
subdivided at this step. The slopes of the line segments through the three points are estimates
of the Lipschitz constant. These estimates are not used explicitly, as they would be in the
Shubert algorithm, but implicitly in the process of selection of intervals to be divided. Unlike
the Shubert algorithm, where the Lipschitz constant is assumed known, the DIRECT algorithm
will eventually subdivide every interval.
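The selection can be expressed as a test on each interval: interval i is divided if some L > 0 makes f(ci) − L di the smallest of all f(cj) − L dj, where dj is the half-length of interval j. A minimal MATLAB sketch of this test is given below; it omits the small sufficient-improvement tolerance used in the actual DIRECT algorithm, and d and fc are assumed to be column vectors of the half-lengths and midpoint values.

    function sel = potentially_optimal(d, fc)
    % sel(i) is true if interval i lies on the lower convex hull of the
    % (half-length, midpoint value) points, i.e., if some L > 0 makes
    % fc(i) - L*d(i) <= fc(j) - L*d(j) for all j.
    K = length(d); sel = false(K, 1);
    for i = 1:K
        shorter = d < d(i); longer = d > d(i); same = d == d(i);
        Lmin = max([0; (fc(i) - fc(shorter)) ./ (d(i) - d(shorter))]);
        Lmax = min([inf; (fc(longer) - fc(i)) ./ (d(longer) - d(i))]);
        sel(i) = (fc(i) <= min(fc(same))) && (Lmin <= Lmax);
    end
    end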
The resulting algorithm may divide more than a single interval at each stage and the number
of intervals to be divided may vary. This is easy to implement for a single variable. However,
for more than one variable there are several ways to divide a hyperrectangle into parts and one
must keep track of how an interval has previously been divided in order not to cluster sample
points prematurely by repeatedly dividing an interval in the same way. Figures 8.11 and 8.12,
taken from [108], illustrate this issue for N = 2. In Figure 8.11 the entire rectangle will be
divided. Shading indicates that the rectangle has been selected for division. Four new midpoints
are sampled. The subdivision into new rectangles could be done in two ways: the figure shows
an initial horizontal subdivision followed by a vertical division of the rectangle that contains the
original center. The second division is shown in Figure 8.12. The two shaded rectangles are
selected for division. Note that four new centers are added to the small square and two to the
larger, nonsquare, rectangle. In this way the minimum number of new centers is added.
DIRECT parallelizes in a natural way. All hyperrectangles that are candidates for division
may be divided simultaneously, and for each hyperrectangle the function evaluations at each of
the new midpoints can also be done in parallel. We refer the reader to [150] and [108] for details
on the data structures and to [108] for a FORTRAN implementation and additional discussion
on the exploitation of parallelism.
8.5 Examples
In each of the examples we compare the central difference BFGS form of implicit filtering from
§7.6 (solid line) with the Nelder–Mead (dashed line), Hooke–Jeeves (solid line with circles), and
MDS (dashed-dotted line) algorithms.
For each example we specified both an initial iterate and choice of scales. This is sufficient
to initialize both implicit filtering and Hooke–Jeeves. We used the implicit filtering forward
difference stencil as the initial simplex for both Nelder–Mead and MDS.
The plots reflect the differences in the startup procedures for the varying algorithms. In
particular, Nelder–Mead and MDS sort the simplex and hence, if the initial iterate is not the best
point, report the lower value as the first iterate.
The relative performance of the various methods on these example problems should not be
taken as a definitive evaluation, nor should these examples be thought of as a complete suite of test
problems. One very significant factor that is not reflected in the results in this section is that both
implicit filtering [69], [55] and multidirectional search [85] are easy to implement in parallel,
while Nelder–Mead and Hooke–Jeeves are inherently sequential. The natural parallelism of
implicit filtering and multidirectional search can be further exploited by using idle processors to
explore other points on the line search direction or the pattern.
in Figures 8.13, 8.14, and 8.15. The other three algorithms perform equally well. Note in the
third example that MDS finds a local minimum that is not the global minimum.
8.5.2 Parameter ID
In the computations reported in this section each algorithm was allowed 500 evaluations of f
and the sequence of scales was {2^{-j}}, j = 1, ..., 12.
We begin with the two examples from §7.6.2. With the initial iterate of (5, 5)T , the exact
solution to the continuous problem lies on the grid that the Hooke–Jeeves algorithm uses to search
for the solution. This explains the unusually good performance of the Hooke–Jeeves optimization
shown in both Figures 8.16 and 8.17. When the initial iterate is changed to (5.1, 5.3)T , the
performance of Hooke–Jeeves is very different as one can see from Figures 8.18 and 8.19. The
other algorithms do not have such a sensitivity to the initial iterate for this example. We have no
explanation for the good performance turned in by the Nelder–Mead algorithm on this problem.
For each l = 1, 2, 3, apply the Nelder–Mead algorithm to the function f defined for x ∈ R4
by
(x1 − x2 x3 x4)^2 + (x2 − x3 x4)^2 + (x3 − x4)^2 + x4^2
with the initial simplex V l . What happened? This example is one of Nelder’s favorites
[203].
8.6.2. Show that if the set {x | f(x) ≤ f(x_1^0)} is bounded and S0 is either an equilateral or a
right simplex, then (8.9) holds.
8.6.3. One can modify MDS [260] by eliminating the expansion step and only computing re-
flected points until one is found that is better than x1 . If no reflected points are better,
then perform a contraction step. Prove that Theorem 8.2.1 holds for this implementation.
Implement MDS in this way and compare it with Algorithm mds. Are the savings in calls
to f for each iterate realized in a savings for the entire optimization?
8.6.4. The easiest problem in optimization is to minimize xT x. Give the algorithms in this
section a chance to show what they can do by using them to solve this problem. Try several
initial iterates (or initial simplices/patterns) and several problem dimensions (especially
N = 8, 16, 32).
8.6.5. The search methods in this section impose a structure on the sampling and thereby hope
to find a useful optimal point far more efficiently than using an unstructured deterministic
or random search. Implement an unstructured search and use your algorithm to minimize
xT x when N = 2. For an example of such a method, take the one from [6], please.
8.6.6. The Spendley, Hext, and Himsworth algorithm [244] manages the simplices in a very
different way from those we’ve discussed in the text. Use the information in [244] and
[267] to implement this algorithm. Use Theorem 6.2.9 to prove convergence for N = 2.
What happens to both your implementation and analysis when you try N = 3 or arbitrary
N ? Explain Table 5 in [244].
8.6.7. Use any means necessary to solve the Lennard–Jones problem. Have your results improved
since you tried exercises 6.4.3 and 7.7.4?
Bibliography
[1] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines, Wiley, New
York, 1989.
[2] L. Adams and J. L. Nazareth, eds., Linear and Nonlinear Conjugate Gradient-Related
Methods, SIAM, Philadelphia, 1996.
[3] M. Al-Baali, Descent property and global convergence of the Fletcher-Reeves method
with inexact line searches, IMA J. Numer. Anal., 5 (1985), pp. 121–124.
[4] N. Alexandrov, J. E. Dennis, R. M. Lewis, and V. Torczon, A Trust Region Frame-
work for Managing the Use of Approximation Models in Optimization, Tech. Rep. 97-50,
Institute for Computer Applications in Science and Engineering, Hampton, VA, October
1997.
[5] E. L. Allgower and K. Georg, Numerical path following, in Handbook of Numerical
Analysis, P. G. Ciarlet and J. L. Lions, eds., vol. 5, North–Holland, Amsterdam, 1997,
pp. 3–207.
[6] Anonymous, A new algorithm for optimization, Math. Programming, 3 (1972), pp. 124–
128.
[7] L. Armijo, Minimization of functions having Lipschitz-continuous first partial derivatives,
Pacific J. Math., 16 (1966), pp. 1–3.
[8] U. M. Ascher and L. R. Petzold, Computer Methods for Ordinary Differential Equa-
tions and Differential-Algebraic Equations, SIAM, Philadelphia, 1998.
[9] K. E. Atkinson, Iterative variants of the Nyström method for the numerical solution of
integral equations, Numer. Math., 22 (1973), pp. 17–31.
[10] B. M. Averick and J. J. Moré, User Guide for the MINPACK-2 Test Problem Collection,
Tech. Rep. ANL/MCS-TM-157, Math. and Comp. Science Div. Report, Argonne National
Laboratory, Argonne, IL, October 1991.
[11] O. Axelsson, Iterative Solution Methods, Cambridge University Press, Cambridge, 1994.
[12] S. Banach, Sur les opérations dans les ensembles abstraits et leur application aux
équations intégrales, Fund. Math., 3 (1922), pp. 133–181.
[13] H. T. Banks and H. T. Tran, Mathematical and Experimental Modeling of Physical
Processes, unpublished notes for MA 573-4, 1997, Department of Mathematics, North
Carolina State University, Raleigh, NC.
[14] M. S. Bartlett, An inverse matrix adjustment arising in discriminant analysis, Ann. Math.
Stat., 22 (1951), pp. 107–111.
[16] K. J. Bathe and A. P. Cimento, Some practical procedures for the solution of nonlinear
finite element equations, Comput. Methods. Appl. Mech. Engrg., 22 (1980), pp. 59–85.
[19] , Projected Newton methods for optimization problems with simple constraints,
SIAM J. Control Optim., 20 (1982), pp. 221–246.
[20] J. T. Betts, An improved penalty function method for solving constrained parameter
optimization problems, J. Optim. Theory Appl., 16 (1975), pp. 1–24.
[21] , Solving the nonlinear least square problem: Application of a general method, J.
Optim. Theory Appl., 18 (1976), pp. 469–483.
[23] J. T. Betts and P. D. Frank, A sparse nonlinear optimization algorithm, J. Optim. Theory
Appl., 82 (1994), pp. 519–541.
[24] P. T. Boggs, The convergence of the Ben-Israel iteration for nonlinear least squares
problems, Math. Comp., 30 (1976), pp. 512–522.
[25] P. T. Boggs and J. E. Dennis, A stability analysis for perturbed nonlinear iterative
methods, Math. Comp., 30 (1976), pp. 1–17.
[26] I. Bongartz, A. R. Conn, N. Gould, and P. L. Toint, CUTE: Constrained and unconstrained
testing environment, ACM Trans. Math. Software, 21 (1995), pp. 123–160.
[27] A. J. Booker, DOE for Computer Output, Tech. Rep. BCSTECH-94-052, Boeing Com-
puter Services, Seattle, WA, 1994.
[29] D. M. Bortz and C. T. Kelley, The simplex gradient and noisy optimization problems,
in Computational Methods in Optimal Design and Control, J.T. Borggaard, J. Burns,
E. Cliff, S. Schrenk, eds., Birkhauser, Boston, 1998, pp. 77–90.
[30] A. Bouaricha, Tensor methods for large, sparse, unconstrained optimization, SIAM J.
Optim., 7 (1997), pp. 732–756.
[31] H. Brakhage, Über die numerische Behandlung von Integralgleichungen nach der
Quadraturformelmethode, Numer. Math., 2 (1960), pp. 183–196.
[34] C. G. Broyden, A class of methods for solving nonlinear simultaneous equations, Math.
Comp., 19 (1965), pp. 577–593.
[37] C. G. Broyden, J. E. Dennis, and J. J. Moré, On the local and superlinear convergence
of quasi-Newton methods, J. Inst. Math. Appl., 12 (1973), pp. 223–246.
[42] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, A limited memory algorithm for bound
constrained optimization, SIAM J. Sci. Comput., 16 (1995), pp. 1190–1208.
[43] R. H. Byrd and J. Nocedal, A tool for the analysis of quasi-Newton methods with
application to unconstrained minimization, SIAM J. Numer. Anal., 26 (1989), pp. 727–
739.
[47] P. H. Calamai and J. Moré, Projected gradient methods for linearly constrained prob-
lems, Math. Programming, 39 (1987), pp. 93–116.
[50] S. L. Campbell and K. D. Yeomans, Behavior of the nonunique terms in general DAE
integrators, Appl. Numer. Math., (1998), to appear.
[51] R. G. Carter, On the global convergence of trust region algorithms using inexact gradient
information, SIAM J. Numer. Anal., 28 (1991), pp. 251–265.
[52] A. Cauchy, Méthode générale pour la résolution des systèmes d'équations simultanées,
Comp. Rend. Acad. Sci. Paris, (1847), pp. 536–538.
[54] , Bound Constrained Optimization, PhD thesis, North Carolina State University,
Raleigh, North Carolina, 1999.
[56] T. D. Choi and C. T. Kelley, Superlinear convergence and implicit filtering, SIAM J.
Optim., 10 (2000), pp. 1149–1162.
[57] T. F. Coleman and Y. Li, On the convergence of interior-reflective Newton methods for
nonlinear minimization subject to bounds, Math. Programming, 67 (1994), pp. 189–224.
[58] , An interior trust region approach for nonlinear minimization subject to bounds,
SIAM J. Optim., 6 (1996), pp. 418–445.
[59] T. F. Coleman and J. J. Moré, Estimation of sparse Jacobian matrices and graph
coloring problems, SIAM J. Numer. Anal., 20 (1983), pp. 187–209.
[64] , Testing a class of methods for solving minimization problems with simple bounds
on the variables, Math. Comp., 50 (1988), pp. 399–430.
[65] , Convergence of quasi-Newton matrices generated by the symmetric rank one up-
date, Math. Programming Ser. A, 50 (1991), pp. 177–195.
[68] J. W. Daniel, The conjugate gradient method for linear and nonlinear operator equations,
SIAM J. Numer. Anal., 4 (1967), pp. 10–26.
[71] W. C. Davidon, Variable Metric Methods for Minimization, Tech. Rep. ANL-5990, Argonne
National Laboratory, Argonne, IL, 1959.
[72] , Variable metric method for minimization, SIAM J. Optim., 1 (1991), pp. 1–17.
[75] R. Dembo and T. Steihaug, Truncated Newton algorithms for large-scale optimization,
Math. Programming, 26 (1983), pp. 190–212.
[76] J. E. Dennis, Nonlinear least squares and equations, in The State of the Art in Numerical
Analysis, D. Jacobs, ed., Academic Press, London, 1977, pp. 269–312.
[80] J. E. Dennis and H. H. W. Mei, Two unconstrained optimization algorithms which use
function and gradient values, J. Optim. Theory Appl., 28 (1979), pp. 453–482.
[82] , Quasi-Newton methods, methods, motivation and theory, SIAM Rev., 19 (1977),
pp. 46–89.
[83] J. E. Dennis and R. B. Schnabel, Least change secant updates for quasi-Newton meth-
ods, SIAM Rev., 21 (1979), pp. 443–459.
[102] R. Fletcher, Generalized inverse methods for the best least squares solution of systems
of nonlinear equations, Comput. J., 10 (1968), pp. 392–399.
[103] , A new approach to variable metric methods, Comput. J., 13 (1970), pp. 317–322.
[104] , Practical Methods of Optimization, John Wiley and Sons, New York, 1987.
[105] R. Fletcher and M. J. D. Powell, A rapidly convergent descent method for minimiza-
tion, Comput. J., 6 (1963), pp. 163–168.
[109] D. M. Gay, Computing optimal locally constrained steps, SIAM J. Sci. Statist. Comput.,
2 (1981), pp. 186–197.
[111] R. R. Gerber and F. T. Luk, A generalized Broyden’s method for solving simultaneous
linear equations, SIAM J. Numer. Anal., 18 (1981), pp. 882–890.
[113] P. E. Gill and W. Murray, Newton-type methods for unconstrained and linearly con-
strained optimization, Math. Programming, 28 (1974), pp. 311–350.
[116] , Algorithms for the solution of the nonlinear least-squares problem, SIAM J. Numer.
Anal., 15 (1978), pp. 977–992.
[118] P. Gilmore, An Algorithm for Optimizing Functions with Multiple Minima, Ph.D. thesis,
North Carolina State University, Raleigh, 1993.
[119] , IFFCO: Implicit Filtering for Constrained Optimization, Tech. Rep. CRSC-TR93-
7, Center for Research in Scientific Computation, North Carolina State University, May
1993. Available by anonymous ftp from math.ncsu.edu in pub/kelley/iffco/ug.ps.
[120] P. Gilmore and C. T. Kelley, An implicit filtering algorithm for optimization of functions
with many local minima, SIAM J. Optim., 5 (1995), pp. 269–285.
[121] P. A. Gilmore, S. S. Berger, R. F. Burr, and J. A. Burns, Automated optimization
techniques for phase change piezoelectric ink jet performance enhancement. in Proc.
IS&T’s NIP 13: 1997 International Conference on Digital Printing Technologies, Society
for Imaging Science and Technology, 1997, pp. 716–721.
[122] R. Glowinski, Numerical Methods for Nonlinear Variational Problems, Springer-Verlag,
New York, 1984.
[123] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning,
Addison–Wesley, Reading, MA., 1989.
[124] D. Goldfarb, A family of variable metric methods derived by variational means, Math.
Comp., 24 (1970), pp. 23–26.
[125] A. A. Goldstein, Constructive Real Analysis, Harper and Row, New York, 1967.
[126] G. H. Golub and V. Pereyra, The differentiation of pseudo-inverses and nonlinear least
squares problems whose variables separate, SIAM J. Numer. Anal., 10 (1973), pp. 413–
432.
[127] G. H. Golub and C. G. Van Loan, Matrix Computations, Johns Hopkins University
Press, Baltimore, MD, 1983.
[128] A. Greenbaum, Iterative Methods for Solving Linear Systems, SIAM, Philadelphia, 1997.
[129] A. Griewank, On automatic differentiation, in Mathematical Programming: Recent De-
velopments and Applications, M. Iri and K. Tanabe, eds., Kluwer, Dordrecht, the Nether-
lands, 1989, pp. 83–108.
[130] A. Griewank and G. F. Corliss, eds., Automatic Differentiation of Algorithms: Theory,
Implementation, and Application, SIAM, Philadelphia, 1991.
[131] A. Griewank and P. L. Toint, Local convergence analysis for partitioned quasi-Newton
updates, Numer. Math., 39 (1982), pp. 429–448.
[132] , On the unconstrained optimization of partially separable functions, in Nonlinear
Optimization, M. J. D. Powell, ed., Academic Press, London, 1982.
[133] , Partitioned variable metric methods for large sparse optimization problems, Nu-
mer. Math., 39 (1982), pp. 119–137.
[134] L. Grippo and S. Lucidi, A globally convergent version of the Polak-Ribière conjugate
gradient method, Math. Programming, 78 (1997), pp. 375–392.
[135] W. A. Gruver and E. Sachs, Algorithmic Methods in Optimal Control, Pitman, London,
1980.
[136] W. Hackbusch, On the fast solving of parabolic boundary control problems, SIAM J.
Control Optim., 17 (1979), pp. 231–244.
[137] , Optimal H p,p/2 error estimates for a parabolic Galerkin method, SIAM J. Numer.
Anal., 18 (1981), pp. 681–692.
[138] , Multi-Grid Methods and Applications, Springer Ser. Comput. Math. 4, Springer-
Verlag, New York, 1985.
[141] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear sys-
tems, J. Res. Nat. Bureau Standards, 49 (1952), pp. 409–436.
[142] M. R. Hoare and P. Pal, Physical cluster mechanics: Statics and energy surfaces for
monoatomic systems, Adv. Phys., 20 (1971), pp. 161–196.
[143] J. H. Holland, Genetic algorithms and the optimal allocation of trials, SIAM J. Comput.,
2 (1973), pp. 88–105.
[144] , Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann
Arbor, MI, 1975.
[145] R. Hooke and T. A. Jeeves, "Direct search" solution of numerical and statistical prob-
lems, J. Assoc. Comput. Mach., 8 (1961), pp. 212–229.
[149] G. Jasen, Investment dartboard: Pros and dart throwers both lose money, Your Money
Matters, Wall Street Journal, May 7, 1997.
[151] L. Kantorovich and G. Akilov, Functional Analysis, 2nd ed., Pergamon Press, New
York, 1982.
[154] , Iterative Methods for Linear and Nonlinear Equations, SIAM, Philadelphia, 1995.
[161] , Multilevel algorithms for constrained compact fixed point problems, SIAM J. Sci.
Comput., 15 (1994), pp. 645–667.
[163] , A trust region method for parabolic boundary control problems, SIAM J. Optim.,
9 (1999), pp. 1064–1081.
[168] J. Kostrowicki and L. Piela, Diffusion equation method of global minimization: Per-
formance for standard test functions, J. Optim. Theory Appl., 71 (1991), pp. 269–284.
[170] E. B. Lee and L. Markus, Foundations of Optimal Control Theory, John Wiley, New
York, London, Sydney, 1967.
[171] C. Lemaréchal, A view of line searches, in Optimization and Optimal Control, Auslander,
Oettli, and Stoer, eds., Lecture Notes in Control and Information Sciences 30, Springer-
Verlag, Berlin, 1981, pp. 59–78.
[172] K. Levenberg, A method for the solution of certain nonlinear problems in least squares,
Quart. Appl. Math., 4 (1944), pp. 164–168.
[173] R. M. Lewis and V. Torczon, Rank ordering and positive bases in pattern search algo-
rithms, Tech. Rep. 96-71, Institute for Computer Applications in Science and Engineering,
December, 1996.
[174] , Pattern search algorithms for linearly constrained minimization, SIAM J. Optim.,
9 (1999), pp. 1082–1099.
[175] , Pattern search algorithms for linearly constrained minimization, SIAM J. Optim.,
10 (2000), pp. 917–941.
[176] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large-scale opti-
mization, Math. Programming., 43 (1989), pp. 503–528.
[177] R. B. Long and W. C. Thacker, Data assimilation into a numerical equatorial ocean
model, part 2: Assimilation experiments, Dyn. Atmos. Oceans, 13 (1989), pp. 465–477.
[178] E. M. Lowndes, Vehicle Dynamics and Optimal Design, Ph.D. thesis, North Carolina
State University, Raleigh, 1998.
[179] S. Lucidi and M. Sciandrone, On the global convergence of derivative free methods
for unconstrained optimization, Università di Roma "La Sapienza", Dipartimento di
Informatica e Sistemistica, 1997, preprint.
[182] C. D. Maranas and C. A. Floudas, A global optimization method for Weber’s problem,
in Large Scale Optimization: State of the Art, W. W. Hager, D. W. Hearn, and P. Pardalos,
eds., Kluwer Academic Publishers B.V., Boston, 1994, pp. 259–293.
[185] E. S. Marwil, Exploiting Sparsity in Newton-Type Methods, Ph.D. thesis, Cornell Uni-
versity, Ithaca, NY, 1978.
[186] H. Matthies and G. Strang, The solution of nonlinear finite element equations, Internat.
J. Numer. Methods Engrg., 14 (1979), pp. 1613–1626.
[189] E. H. Moore, General Analysis, Memoirs of the American Philosophy Society, 1935. I.
[191] , Trust regions and projected gradients, in System Modelling and Optimization,
Lecture Notes in Control and Information Sciences 113, Springer-Verlag, Berlin, 1988,
pp. 1–13.
[192] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM J. Sci. Statist.
Comput., 4 (1983), pp. 553–572.
[193] J. J. Moré and D. J. Thuente, Line search algorithms with guaranteed sufficient de-
crease, ACM Trans. Math. Software, 20 (1994), pp. 286–307.
[194] J. J. Moré and G. Toraldo, On the solution of large quadratic programming problems
with bound constraints, SIAM J. Optim., 1 (1991), pp. 93–113.
[195] J. J. Moré and S. J. Wright, Optimization Software Guide, SIAM, Philadelphia, 1993.
[196] J. J. Moré and Z. Wu, Global continuation for distance geometry problems, SIAM J.
Optim., 7 (1997), pp. 814–836.
[197] W. Murray and M. L. Overton, Steplength algorithms for minimizing a class of non-
differentiable functions, Computing, 23 (1979), pp. 309–331.
[198] S. G. Nash, Newton-type minimization via the Lanczos method, SIAM J. Numer. Anal.,
21 (1984), pp. 770–788.
[200] L. Nazareth, A relationship between the BFGS and conjugate gradient algorithm and
its implications for new algorithms, SIAM J. Numer. Anal., 16 (1979), pp. 794–800.
[201] , Conjugate gradient methods less dependent on conjugacy, SIAM Rev., 28 (1986),
pp. 501–511.
[204] J. A. Nelder and R. Mead, A simplex method for function minimization, Comput. J., 7
(1965), pp. 308–313.
[205] A. Neumaier, On convergence and restart conditions for a nonlinear conjugate gradient
method. Institut für Mathematik, Universität Wien, 1997, preprint.
[206] J. Nocedal, Updating quasi-Newton matrices with limited storage, Math. Comp., 35
(1980), pp. 773–782.
[208] , Conjugate gradient methods and nonlinear optimization, in Linear and Nonlinear
Conjugate Gradient Methods, L. M. Adams and J. L. Nazareth, eds., SIAM, Philadelphia,
pp. 9–23.
[209] J. Nocedal and Y. Yuan, Combining Trust Region and Line Search Techniques, Tech.
Rep. OTC 98/04, Optimization Technology Center, Northwestern University, Chicago,
IL, 1998.
[212] R. Penrose, A generalized inverse for matrices, Proc. Cambridge Phil. Soc., 51 (1955),
pp. 406–413.
[214] S. A. Piyawskii, An algorithm for finding the absolute extremum of a function, USSR
Comp. Math. and Math. Phys., 12 (1972), pp. 57–67.
[215] E. Polak and G. Ribière, Note sur la convergence de méthodes de directions conjuguées,
Rev. Française Informat. Recherche Opérationnelle, 3e Année, 16 (1969), pp. 35–43.
[216] B. T. Polyak, The conjugate gradient method in extremal problems, USSR Comp. Math.
and Math. Phys., 9 (1969), pp. 94–112.
[218] , A hybrid method for nonlinear equations, in Numerical Methods for Nonlinear
Algebraic Equations, P. Rabinowitz, ed., Gordon and Breach, New York, 1970, pp. 87–
114.
[221] , Some global convergence properties of a variable metric algorithm without ex-
act line searches, in Nonlinear Programming, R. Cottle and C. Lemke, eds., American
Mathematical Society, Providence, RI, 1976, pp. 53–72.
[222] , Nonconvex minimization calculations and the conjugate gradient method, Lecture
Notes in Mathematics 1066, Springer-Verlag, Berlin, (1984), pp. 122–141.
[223] , On the global convergence of trust region algorithms for unconstrained minimiza-
tion, Math. Programming., 29 (1984), pp. 297–303.
[225] , How bad are the BFGS and DFP methods when the objective function is quadratic,
Math. Programming., 34 (1986), pp. 34–47.
[226] , Update conjugate directions by the BFGS formula, Math. Programming, 38 (1987),
pp. 29–46.
[227] , Direct search algorithms for optimization calculations, Acta Numerica, 7 (1998),
pp. 287–336.
[228] K. Radhakrishnan and A. C. Hindmarsh, Description and Use of LSODE, the Liver-
more Solver for Ordinary Differential Equations, Tech. Rep. UCRL-ID-113855, Lawrence
Livermore National Laboratory, Livermore, CA, December 1993.
[249] G. W. Stewart, Introduction to Matrix Computations, Academic Press, New York, 1973.
[254] T. Tian and J. C. Dunn, On the gradient projection method for optimal control problems
with nonnegative L2 inputs, SIAM J. Control Optim., 32 (1994), pp. 516–537.
[255] P. L. Toint, On sparse and symmetric matrix updating subject to a linear equation, Math.
Comp., 31 (1977), pp. 954–961.
[257] , Towards an efficient sparsity exploiting Newton method for minimization, in Sparse
Matrices and Their Uses, I. S. Duff, ed., Academic Press, New York, 1981, pp. 57–88.
[258] , On large scale nonlinear least squares calculations, SIAM J. Sci. Statist. Comput.,
8 (1987), pp. 416–435.
[261] , Multidirectional Search, Ph.D. thesis, Rice University, Houston, TX, 1989.
[264] L. N. Trefethen and D. Bau, Numerical Linear Algebra, SIAM, Philadelphia, 1996.
[265] P. van Laarhoven and E. Aarts, Simulated Annealing, Theory and Practice, Kluwer,
Dordrecht, the Netherlands, 1987.
[269] J. Werner, Über die globale Konvergenz von Variable-Metric-Verfahren mit nichtexakter
Schrittweitenbestimmung, Numer. Math., 31 (1978), pp. 321–334.
[270] T. A. Winslow, R. J. Trew, P. Gilmore, and C. T. Kelley, Doping profiles for optimum
class B performance of GaAs mesfet amplifiers, in Proc. IEEE/Cornell Conference on
Advanced Concepts in High Speed Devices and Circuits, IEEE, Piscataway, NJ, 1991,
pp. 188–197.
[272] P. Wolfe, Convergence conditions for ascent methods, SIAM Rev., 11 (1969), pp. 226–
235.
[273] , Convergence conditions for ascent methods II: Some corrections, SIAM Rev., 13
(1971), pp. 185–188.
[274] M. H. Wright, Direct search methods: Once scorned, now respectable, in Numerical
Analysis 1995, Proc. 1995 Dundee Biennial Conference in Numerical Analysis, D. F.
Griffiths and G. A. Watson, eds., 1996, Addison–Wesley Longman, Harlow, U.K., pp. 191–
208.
[275] S. J. Wright, Compact storage of Broyden-class quasi-Newton matrices, Argonne Na-
tional Laboratory, Argonne, IL, 1994, preprint.
[276] S. J. Wright and J. N. Holt, An inexact Levenberg–Marquardt method for large sparse
nonlinear least squares, J. Austral. Math. Soc. Ser. B, 26 (1985), pp. 387–403.
[277] Z. Wu, The effective energy transformation scheme as a special continuation approach
to global optimization with application to molecular conformation, SIAM J. Optim., 6
(1996), pp. 748–768.
[278] T. J. Ypma, The effect of rounding errors on Newton-like methods, IMA J. Numer. Anal.,
3 (1983), pp. 109–118.
Index
Taylor’s theorem, 5
Termination
bound constrained problems, 91
small function differences, 21
small gradients, 21
small steps, 21
unconstrained problems, 21
Tosca, Floria, xiv
TR-CG
algorithm, 64
Truncated Newton methods, 28
Trust region, 50
actual reduction, 51
algorithm
adjustment, 51
paradigm, 52
dogleg, 58
classical, 58
convergence theorem, 52
Levenberg–Marquardt
algorithm, 58
parameter, 57
methods, 50
Newton point, 58
predicted reduction, 51
problem, 50
radius, 50
trial solution, 50
trial step, 50
unidirectional, 54
Trust region CG
algorithm, 64
Youngman
Henny, 159