Wolfram Mathematica Tutorial Collection


Wolfram Mathematica® Tutorial Collection

UNCONSTRAINED OPTIMIZATION
For use with Wolfram Mathematica® 7.0 and later.

For the latest updates and corrections to this manual:
visit reference.wolfram.com

For information on additional copies of this documentation:
visit the Customer Service website at www.wolfram.com/services/customerservice
or email Customer Service at [email protected]

Comments on this manual are welcomed at:
[email protected]

Content authored by:
Rob Knapp

Printed in the United States of America.



©2008 Wolfram Research, Inc.


All rights reserved. No part of this document may be reproduced or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the copyright holder.

Wolfram Research is the holder of the copyright to the Wolfram Mathematica software system ("Software") described in this document, including without limitation such aspects of the system as its code, structure, sequence, organization, "look and feel," programming language, and compilation of command names. Use of the Software unless pursuant to the terms of a license granted by Wolfram Research or as otherwise authorized by law is an infringement of the copyright.
Wolfram Research, Inc. and Wolfram Media, Inc. ("Wolfram") make no representations, express, statutory, or implied, with respect to the Software (or any aspect thereof), including, without limitation, any implied warranties of merchantability, interoperability, or fitness for a particular purpose, all of which are expressly disclaimed. Wolfram does not warrant that the functions of the Software will meet your requirements or that the operation of the Software will be uninterrupted or error free. As such, Wolfram does not recommend the use of the software described in this document for applications in which errors or omissions could threaten life, injury or significant loss.
Mathematica, MathLink, and MathSource are registered trademarks of Wolfram Research, Inc. J/Link, MathLM, .NET/Link, and webMathematica are trademarks of Wolfram Research, Inc. Windows is a registered trademark of Microsoft Corporation in the United States and other countries. Macintosh is a registered trademark of Apple Computer, Inc. All other trademarks used herein are the property of their respective owners. Mathematica is not associated with Mathematica Policy Research, Inc.
Contents

Introduction

Methods for Local Minimization
    Introduction
    Newton's Method
    Quasi-Newton Methods
    Gauss-Newton Methods
    Nonlinear Conjugate Gradient Methods
    Principal Axis Method

Methods for Solving Nonlinear Equations
    Introduction
    Newton's Method
    The Secant Method
    Brent's Method

Step Control
    Introduction
    Line Search Methods
    Trust Region Methods

Setting Up Optimization Problems in Mathematica
    Specifying Derivatives
    Variables and Starting Conditions
    Termination Conditions
    Symbolic Evaluation

UnconstrainedProblems Package
    Plotting Search Data
    Test Problems

References
Introduction to Unconstrained Optimization
Mathematica has a collection of commands that do unconstrained optimization (FindMinimum
and FindMaximum ) and solve nonlinear equations (FindRoot) and nonlinear fitting problems
(FindFit). All these functions work, in general, by doing a search, starting at some initial
values and taking steps that decrease (or for FindMaximum , increase) an objective or merit
function.

The search process for FindMaximum is somewhat analogous to a climber trying to reach a mountain peak in a thick fog; at any given point, basically all that climbers know is their position, how steep the slope is, and the direction of the fall line. One approach is always to go uphill. As long as climbers go uphill steeply enough, they will eventually reach a peak, though it may not be the highest one. Similarly, in a search for a maximum, most methods are ascent methods, where every step increases the height and the search stops when it reaches any peak, whether it is the highest one or not.

The analogy with hill climbing can be reversed to consider descent methods for finding local minima. For the most part, the literature in optimization considers the problem of finding minima, and since this applies to most of the Mathematica commands, from here on, this documentation will follow that convention.

For example, the function $x \sin(x + 1)$ is not bounded from below, so it has no global minimum, but it has an infinite number of local minima.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This shows a plot of the function x Sin[x + 1].

In[2]:= Plot[x Sin[x + 1], {x, -10, 10}]

Out[2]= (plot of x Sin[x + 1] on the interval -10 <= x <= 10)

This shows the steps taken by FindMinimum for the function x Sin[x + 1] starting at x = 0.

In[3]:= FindMinimumPlot[x Sin[x + 1], {x, 0}]

Out[3]= {{-0.240125, {x -> -0.520269}}, {Steps -> 5, Function -> 6, Gradient -> 6}, (plot of the search)}

The FindMinimumPlot command is defined in the Optimization`UnconstrainedProblems` package loaded automatically by this notebook. It runs FindMinimum, keeps track of the function and gradient evaluations and steps taken during the search (using the EvaluationMonitor and StepMonitor options), and shows them superimposed on a plot of the function. Steps are indicated with blue lines, function evaluations are shown with green points, and gradient evaluations are shown with red points. The minimum found is shown with a large black point. From the plot, it is clear that FindMinimum has found a local minimum point.

This shows the steps taken by FindMinimum for the function x Sin[x + 1] starting at x = 2.

In[4]:= FindMinimumPlot[x Sin[x + 1], {x, 2}]

Out[4]= {{-3.83922, {x -> 3.95976}}, {Steps -> 4, Function -> 9, Gradient -> 9}, (plot of the search)}

Starting at 2, FindMinimum heads to a different local minimum, at which the function is smaller than at the first minimum found.

From these two plots, you might come to the conclusion that if you start at a point where the
function is sloping downward, you will always head toward the next minimum in that direction.
However, this is not always the case; the steps FindMinimum takes are typically determined
using the value of the function and its derivatives, so if the derivative is quite small,
FindMinimum may think it has to go quite a long way to find a minimum point.

This shows the steps taken by FindMinimum for the function x Sin[x + 1] starting at x = 7.

In[5]:= FindMinimumPlot[x Sin[x + 1], {x, 7}]

Out[5]= {{-41.4236, {x -> 41.4356}}, {Steps -> 3, Function -> 14, Gradient -> 14}, (plot of the search)}

When starting at x = 7, which is near a local maximum, the first step is quite large, so FindMinimum returns a completely different local minimum.

All these commands have "find" in their name because, in general, their design is to search to find any point where the desired condition is satisfied. The point found may not be the only one (in the case of roots) or even the best one (in the case of fits, minima, or maxima), or, as you have seen, not even the closest one to the starting condition. In other words, the goal is to find any point at which there is a root or a local maximum or minimum. In contrast, the function NMinimize tries harder to find the global minimum for the function, but NMinimize is also generally given constraints to bound the problem domain. However, there is a price to pay for this generality: NMinimize has to do much more work and, in fact, may call one of the "Find" functions to polish a result at the end of its process, so it generally takes much more time than the "Find" functions.

In two dimensions, the minimization problem is more complicated because both a step direction
and step length need to be determined.

This shows the steps taken by FindMinimum to find a local minimum of the function $\cos(x^2 - 3y) + \sin(x^2 + y^2)$ starting at the point {x, y} = {1, 1}.

In[6]:= FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1}, {y, 1}}]

Out[6]= {{-2., {x -> 1.37638, y -> 1.67868}}, {Steps -> 9, Function -> 13, Gradient -> 13}, (contour plot of the search)}

The FindMinimumPlot command for two dimensions is similar to the one-dimensional case, but
it shows the steps and evaluations superimposed on a contour plot of the function. In this
example, it is apparent that FindMinimum needed to change direction several times to get to
the local minimum. You may notice that the first step starts in the direction of steepest descent
(i.e., perpendicular to the contour or parallel to the gradient). Steepest descent is indeed a
possible strategy for local minimization, but it often does not converge quickly. In subsequent
steps in this example, you may notice that the search direction is not exactly perpendicular to
the contours. The search is using information from past steps to try to get information about
the curvature of the function, which typically gives it a better direction to go. Another strategy,
which usually converges faster, but can be more expensive, is to use the second derivative of
the function. This is usually referred to as "Newton's" method.

This shows the steps taken using Newton's method.

In[7]:= FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1}, {y, 1}}, Method -> "Newton"]

Out[7]= {{-2., {x -> 1.37638, y -> 1.67868}}, {Steps -> 5, Function -> 6, Gradient -> 6, Hessian -> 6}, (contour plot of the search)}

In this example, it is clear that the extra information that "Newton's" method uses about the curvature of the function makes a big difference in how many steps it takes to get to the minimum. Even though Newton's method takes fewer steps, it may take more total execution time since the symbolic Hessian has to be computed once and then evaluated numerically at each step.

Usually there are tradeoffs between the rate of convergence or total number of steps taken and
cost per step. Depending on the size of the problems you want to solve, you may want to pick a
particular method to best match that tradeoff for a particular problem. This documentation is
intended to help you understand those choices as well as some ways to get the best results
from the functions in Mathematica. For the most part, examples will be used to illustrate the
ideas, but a limited exposition on the mathematical theory behind the methods will be given so
that you can better understand how the examples work.

For the most part, local minimization methods for a function f are based on a quadratic model

$$q_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p. \qquad (1)$$

The subscript $k$ refers to the $k$th iterative step. In Newton's method, the model is based on the exact Hessian matrix, $B_k = \nabla^2 f(x_k)$, but other methods use approximations to $\nabla^2 f(x_k)$, which are typically less expensive to compute. A trial step $s_k$ is typically computed to be the minimizer of the model, which satisfies the system of linear equations

$$B_k s_k = -\nabla f(x_k).$$

If $f$ is sufficiently smooth and $x_k$ is sufficiently close to a local minimum, then with $B_k = \nabla^2 f(x_k)$, the sequence of steps $x_{k+1} = x_k + s_k$ is guaranteed to converge to the local minimum. However, in a typical search, the starting value is rarely close enough to give the desired convergence. Furthermore, $B_k$ is often an approximation to the actual Hessian and, at the beginning of a search, the approximation is frequently quite inaccurate. Thus, it is necessary to provide additional control to the step sequence to improve the chance and rate of convergence. There are two frequently used methods for controlling the steps: line search and trust region methods.
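
To make the model (1) concrete, here is a minimal sketch of iterating the raw Newton step $B_k s_k = -\nabla f(x_k)$ with the exact Hessian and no step control; the helper names are illustrative, not FindMinimum's internals, and without step control the iteration is only reliable sufficiently close to a minimum.

(* Sketch: iterate the full Newton step with the exact symbolic Hessian. *)
f[{x_, y_}] := Cos[x^2 - 3 y] + Sin[x^2 + y^2];
vars = {x, y};
grad = D[f[vars], {vars}];     (* symbolic gradient vector *)
hess = D[f[vars], {vars, 2}];  (* symbolic Hessian matrix *)
newtonStep[pt_] := pt + LinearSolve[hess /. Thread[vars -> pt],
    -(grad /. Thread[vars -> pt])];
NestList[newtonStep, {1., 1.}, 5]  (* five uncontrolled Newton steps *)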

In a "line search" method, for each trial step sk found, a one-dimensional search is done along
the direction of sk so that xk+1 = xk + ak sk . You could choose ak so that it minimizes f Hxk+1 L in this
direction, but this is excessive, and with conditions that require that f Hxk+1 L decreases suffi-
ciently in value and slope, convergence for reasonable approximations Bk can be proven. Mathe-
matica uses a formulation of these conditions called the Wolfe conditions.
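
For intuition, the sufficient decrease part of such conditions can be sketched with the simpler Armijo backtracking rule; this hypothetical helper is illustrative only, since the Wolfe conditions Mathematica actually uses add a slope (curvature) requirement as well.

(* Backtracking sketch: shrink a until the sufficient decrease condition
   f(x + a s) <= f(x) + c a (grad f . s) holds; slope should be negative
   for a descent direction s. *)
armijoStep[f_, gradf_, xk_, sk_, c_ : 10^-4, shrink_ : 0.5] :=
  Module[{a = 1., fk = f[xk], slope = gradf[xk].sk},
    While[f[xk + a sk] > fk + c a slope, a *= shrink];
    xk + a sk]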

In a "trust region" method, a radius Dk within which the quadratic model qk HpL in equation (1) is
“trusted” to be reasonably representative of the function. Then, instead of solving for the uncon-
strained minimum of (1), the trust region method tries to find the constrained minimum of (1)
with °p¥ § Dk . If the xk are sufficiently close to a minimum and the model is good, then often the
minimum lies within the circle, and convergence is quite rapid. However, near the start of a
search, the minimum will lie on the boundary, and there are a number of techniques to find an
approximate solution to the constrained problem. Once an approximate solution is found, the
actual reduction of the function value is compared to the predicted reduction in the function
value and, depending on how close the actual value is to the predicted, an adjustment is made
for Dk+1 .
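
The adjustment of $\Delta_{k+1}$ can be sketched as follows, where rho is the ratio of actual to predicted reduction; the thresholds and scale factors are typical textbook choices, assumed here for illustration rather than taken from Mathematica's implementation.

(* Trust region radius update sketch. *)
updateRadius[rho_, delta_] :=
  Which[
   rho < 1/4, delta/4,  (* model predicted poorly: shrink the region *)
   rho > 3/4, 2 delta,  (* model predicted very well: expand it *)
   True, delta]         (* otherwise leave the radius unchanged *)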

For the symbolic minimization of a smooth univariate function, all that is necessary is to find a point at which the derivative is zero and the second derivative is positive. In multiple dimensions, this means that the gradient vanishes and the Hessian needs to be positive definite. (If the Hessian is merely positive semidefinite, these conditions are necessary but not sufficient, and the point need not even be a minimizer.) As a numerical algorithm converges, it is necessary to keep track of the convergence and make some judgment as to when a minimum has been approached closely enough. This is based on the sequence of steps taken and the values of the function, its gradient, and possibly its Hessian at these points. Usually, the Mathematica Find… functions will issue a message if they cannot be fairly certain that this judgment is correct. However, keep in mind that discontinuous functions or functions with rapid changes of scale can fool any numerical algorithm.

When solving "nonlinear equations", many of the same issues arise as when finding a "local
minimum". In fact, by considering a so-called merit function, which is zero at the root of the
equations, it is possible to use many of the same techniques as for minimization, but with the
advantage of knowing that the minimum value of the function is 0. It is not always advanta-
geous to use this approach, and there are some methods specialized for nonlinear equations.

Most examples shown will be from one and two dimensions. This is by no means because Mathematica is restricted to computing with such small examples, but because it is much easier to visually illustrate the main principles behind the theory and methods with such examples.

Methods for Local Minimization

Introduction to Local Minimization


The essence of most methods is in the local quadratic model

$$q_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p$$

that is used to determine the next step. The FindMinimum function in Mathematica has five
essentially different ways of choosing this model, controlled by the method option. These meth-
ods are similarly used by FindMaximum and FindFit.

"Newton" use the exact Hessian or a finite difference approximation


if the symbolic derivative cannot be computed
"QuasiNewton" use the quasi-Newton BFGS approximation to the Hessian
built up by updates based on past steps
"LevenbergMarquardt" a Gauss|Newton method for least-squares problems; the
Hessian is approximated by J T J , where J is the Jacobian of
the residual function
"ConjugateGradient" a nonlinear version of the conjugate gradient method for
solving linear systems; a model Hessian is never formed
explicitly
"PrincipalAxis" works without using any derivatives, not even the gradi -
ent, by keeping values from past steps; it requires two
starting conditions in each variable

Basic method choices for FindMinimum .

With Method -> Automatic, Mathematica uses the "quasi-Newton" method unless the problem is structurally a sum of squares, in which case the Levenberg-Marquardt variant of the "Gauss-Newton" method is used. When given two starting conditions in each variable, the "principal axis" method is used.

Newton's Method
One significant advantage Mathematica provides is that it can symbolically compute derivatives. This means that when you specify Method -> "Newton" and the function is explicitly differentiable, the symbolic derivative will be computed automatically. On the other hand, if the function is not in a form that can be explicitly differentiated, Mathematica will use finite difference approximations to compute the Hessian, using structural information to minimize the number of evaluations required. Alternatively, you can specify a Mathematica expression, which will give the Hessian with numerical values of the variables.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

In this example, FindMinimum computes the Hessian symbolically and substitutes numerical values for x and y when needed.

In[2]:= FindMinimum[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1}, {y, 1}}, Method -> "Newton"]
Out[2]= {-2., {x -> 1.37638, y -> 1.67868}}

This defines a function that is only intended to evaluate for numerical values of the variables.
In[3]:= f[x_ ? NumberQ, y_ ? NumberQ] := Cos[x^2 - 3 y] + Sin[x^2 + y^2]

The derivative of this function cannot be found symbolically since the function has been defined
only to evaluate with numerical values of the variables.

This shows the steps taken by FindMinimum when it has to use finite differences to compute the gradient and Hessian.

In[4]:= FindMinimumPlot[f[x, y], {{x, 1}, {y, 1}}, Method -> "Newton"]

FindMinimum::lstol: The line search decreased the step size to within tolerance specified by AccuracyGoal and PrecisionGoal but was unable to find a sufficient decrease in the function. You may need more than MachinePrecision digits of working precision to meet these tolerances.

Out[4]= {{-2., {x -> 1.37638, y -> 1.67867}}, {Steps -> 4, Function -> 89, Gradient -> 26, Hessian -> 5}, (contour plot of the search)}

When the gradient and Hessian are both computed using finite differences, the error in the Hessian may be quite large and it may be better to use a different method. In this case, FindMinimum does find the minimum quite accurately, but cannot be sure because of inadequate derivative information. Also, the number of function and gradient evaluations is much greater than in the example with the symbolic derivatives computed automatically because extra evaluations are required to approximate the gradient and Hessian, respectively.

If it is possible to supply the gradient (or the function is such that it can be computed automatically), the method will typically work much better. You can give the gradient using the Gradient option, which has several ways you can "specify derivatives".

This defines a function that returns the gradient for numerical values of x and y.
In[5]:= g[x_ ? NumberQ, y_ ? NumberQ] = Map[D[Cos[x^2 - 3 y] + Sin[x^2 + y^2], #] &, {x, y}]

Out[5]= {2 x Cos[x^2 + y^2] - 2 x Sin[x^2 - 3 y], 2 y Cos[x^2 + y^2] + 3 Sin[x^2 - 3 y]}

This tells FindMinimum to use the supplied gradient. The Hessian is computed using finite differences of the gradient.
In[6]:= FindMinimum[f[x, y], {{x, 1}, {y, 1}}, Gradient -> g[x, y], Method -> "Newton"]
Out[6]= {-2., {x -> 1.37638, y -> 1.67868}}

If you can provide a program that gives the Hessian, you can provide this also. Because the
Hessian is only used by Newton's method, it is given as a method option of Newton.

This defines a function that returns the Hessian for numerical values of x and y.
In[7]:= h[x_ ? NumberQ, y_ ? NumberQ] =
         Outer[D[Cos[x^2 - 3 y] + Sin[x^2 + y^2], ##] &, {x, y}, {x, y}]
Out[7]= {{-4 x^2 Cos[x^2 - 3 y] + 2 Cos[x^2 + y^2] - 2 Sin[x^2 - 3 y] - 4 x^2 Sin[x^2 + y^2],
          6 x Cos[x^2 - 3 y] - 4 x y Sin[x^2 + y^2]},
         {6 x Cos[x^2 - 3 y] - 4 x y Sin[x^2 + y^2],
          -9 Cos[x^2 - 3 y] + 2 Cos[x^2 + y^2] - 4 y^2 Sin[x^2 + y^2]}}

This tells FindMinimum to use the supplied gradient and Hessian.

In[8]:= FindMinimum[f[x, y], {{x, 1}, {y, 1}},
         Gradient -> g[x, y], Method -> {"Newton", "Hessian" -> h[x, y]}]
Out[8]= {-2., {x -> 1.37638, y -> 1.67868}}

In principle, Newton's method uses the Hessian computed either by evaluating the symbolic
derivative or by using finite differences. However, the convergence for the method computed

this way depends on the function being convex, in which case the Hessian is always positive
definite. It is common that a search will start at a location where this condition is violated, so
the algorithm needs to take this possibility into account.

Here is an example where the search starts near a local maximum.

In[9]:= FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2],
         {{x, 1.2}, {y, .5}}, Method -> "Newton"]

Out[9]= {{-2., {x -> 1.37638, y -> 1.67868}}, {Steps -> 4, Function -> 11, Gradient -> 11, Hessian -> 5}, (contour plot of the search)}

When sufficiently near a local maximum, the Hessian is actually negative definite.

This computes the eigenvalues of the Hessian near the local maximum.
In[10]:= Eigenvalues[h[1.2, .5]]
Out[10]= {-15.7534, -6.0478}

If you were simply to apply the Newton step formula in cases where the Hessian is not positive definite, it would be possible to get a step direction that does not lead to a decrease in the function value.

This computes the directional derivative for the direction found by solving $\nabla^2 f(x_k)\, s_0 = -\nabla f(x_k)$. Since it is positive, moving in this direction will locally increase the function value.
In[11]:= LinearSolve[h[1.2, .5], -g[1.2, .5]].g[1.2, .5]
Out[11]= 0.0172695

It is crucial for the convergence of line search methods that the direction be computed using a positive definite quadratic model $B_k$ since the search process and convergence results derived from it depend on a direction with sufficient descent. See "Line Search Methods". Mathematica modifies the Hessian by a diagonal matrix $E_k$ with entries large enough so that $B_k = \nabla^2 f(x_k) + E_k$ is positive definite. Such methods are sometimes referred to as modified Newton methods. The modification to $B_k$ is done during the process of computing a Cholesky decomposition, somewhat along the lines described in [GMW81], both for dense and sparse Hessians. The modification is only done if $\nabla^2 f(x_k)$ is not positive definite. This decomposition method is accessible through LinearSolve if you want to use it independently.

This computes the step using $B_0 s_0 = -\nabla f(x_0)$, where $B_0$ is determined as the Cholesky factors of the Hessian are being computed.
In[12]:= LinearSolve[h[1.2, .5], -g[1.2, .5],
          Method -> {"Cholesky", "Modification" -> "Minimal"}]
Out[12]= {0.00405502, 0.0196737}

The computed step is in a descent direction.

In[13]:= %.g[1.2, .5]
Out[13]= -0.00645255

Besides the robustness of the (modified) Newton method, another key aspect is its convergence rate. Once a search is close enough to a local minimum, the convergence is said to be q-quadratic, which means that if $x^*$ is the local minimum point, then

$$\|x_{k+1} - x^*\| \le \beta \|x_k - x^*\|^2$$

for some constant $\beta > 0$.

At machine precision, this does not always make a substantial difference, since it is typical that most of the steps are spent getting near to the local minimum. However, if you want a minimum to extremely high precision, Newton's method is usually the best choice because of the rapid convergence.

This computes a very high-precision solution using Newton's method. The precision is adaptively increased from machine precision (the precision of the starting point) to the maximal working precision of 100000 digits. Reap is used with Sow to save the steps taken. Counters are used to track and print the number of function evaluations and steps used.
In[14]:= First[Timing[Block[{e = 0, s = 0}, {{min, minpoint}, {points}} =
           Reap[FindMinimum[Cos[x^2 - 3 y] + Sin[x^2 + y^2],
             {{x, 1.}, {y, 1.}}, Method -> "Newton", WorkingPrecision -> 100000,
             StepMonitor :> (s++; Sow[{x, y}]), EvaluationMonitor :> e++]];
           Print[s, " steps and ", e, " evaluations"]]]]
17 steps and 27 evaluations
Out[14]= 4.56134

When the option WorkingPrecision -> prec is used, the default for the AccuracyGoal and PrecisionGoal is prec/2. Thus, this example should find the minimum to at least 50000 digits.

This computes a symbolic solution for the position of the minimum, which the search approaches.
In[15]:= exact = {x, y} /. Last[Solve[{x^2 + y^2 == 3 Pi/2, x^2 - 3 y == -Pi}, {x, y}]]

Out[15]= {Sqrt[-(9/2) - Pi + (3/2) Sqrt[9 + 10 Pi]], (1/2) (-3 + Sqrt[9 + 10 Pi])}

This computes the norm of the distance from the search points at the end of each step to the exact minimum.
In[16]:= N[Map[Norm[exact - #] &, points]]

Out[16]= {0.140411, 0.0156607, 0.000236558, 6.09444*10^-8, 3.8255*10^-15, 1.59653*10^-29,
          3.24619*10^-58, 4.8604*10^-108, 1.26122*10^-212, 5.865676867279906*10^-406,
          1.755647053247051*10^-791, 4.345222958143836*10^-1581, 1.099183429735576*10^-3141,
          1.614858677992596*10^-6262, 5.998002325828813*10^-12514, 1.543301971989607*10^-25010,
          1.131416408748486*10^-50010}

The reason that more function evaluations were required than the number of steps is that Mathematica adaptively increases the precision from the precision of the initial value to the requested maximum WorkingPrecision. The sequence of precisions used is chosen so that as few computations as possible are done at the most expensive final precision, under the assumption that the points are converging to the minimum. Sometimes when Mathematica changes precision, it is necessary to reevaluate the function at the higher precision.

This shows a table with the precision of each of the points with the norm of their errors.
In[17]:= TableForm[Transpose[{Map[Precision, points], N[Map[Norm[exact - #] &, points]]}]]

Out[17]//TableForm=
MachinePrecision    0.140411
MachinePrecision    0.0156607
MachinePrecision    0.000236558
MachinePrecision    6.09444*10^-8
24.4141             3.8255*10^-15
48.8283             1.59653*10^-29
97.6565             3.24619*10^-58
195.313             4.8604*10^-108
390.626             1.26122*10^-212
781.252             5.865676867279906*10^-406
1562.5              1.755647053247051*10^-791
3125.01             4.345222958143836*10^-1581
6250.02             1.099183429735576*10^-3141
12500.              1.614858677992596*10^-6262
25000.1             5.998002325828813*10^-12514
50000.2             1.543301971989607*10^-25010
100000.             1.131416408748486*10^-50010

Note that typically the precision is roughly double the scale ($-\log_{10}$) of the error. For Newton's method this is appropriate since, when the step is computed, the scale of the error will effectively double according to the quadratic convergence.

FindMinimum always starts with the precision of the starting values you gave it. Thus, if you do not want it to use adaptive precision control, you can start with values that are exact or have at least the maximum WorkingPrecision.

This computes the solution using only precision 100000 throughout the computation. (Warning: this takes a very long time to complete.)
In[18]:= First[Timing[Block[{e = 0, s = 0}, {{min, minpoint}, {points}} =
           Reap[FindMinimum[Cos[x^2 - 3 y] + Sin[x^2 + y^2],
             {{x, 1}, {y, 1}}, Method -> "Newton", WorkingPrecision -> 100000,
             StepMonitor :> (s++; Sow[{x, y}]), EvaluationMonitor :> e++]];
           Print[s, " steps and ", e, " evaluations"]]]]
17 steps and 18 evaluations
Out[18]= 1259.84 Second

Even though this may use fewer function evaluations, they are all done at the highest precision,
so typically adaptive precision saves a lot of time. For example, the previous command without
adaptive precision takes more than 50 times as long as when starting from machine precision.

With Newton's method, both "line search" and "trust region" step control are implemented. The default, which is used in the preceding examples, is the line search. However, the step control may also be done with the trust region approach, which typically requires more numerical linear algebra computations per step but, because steps are better controlled, may converge in fewer iterations.

This uses the unconstrained problems package to set up the classic Rosenbrock function, which has a narrow curved valley.
In[19]:= p = GetFindMinimumProblem[Rosenbrock]
Out[19]= FindMinimumProblem[(1 - X1)^2 + 100 (-X1^2 + X2)^2, {{X1, -1.2}, {X2, 1.}}, {}, Rosenbrock, {2, 2}]

This shows the steps taken by FindMinimum with a trust region Newton method for the Rosenbrock function.
In[20]:= FindMinimumPlot[p, Method -> {"Newton", "StepControl" -> "TrustRegion"}]

Out[20]= {{2.14681*10^-26, {X1 -> 1., X2 -> 1.}}, {Steps -> 21, Function -> 22, Gradient -> 22, Hessian -> 22}, (contour plot of the search along the valley)}

This shows the steps taken by FindMinimum with a line search Newton method for the same function.
In[21]:= FindMinimumPlot[p, Method -> "Newton"]

Out[21]= {{4.96962*10^-18, {X1 -> 1., X2 -> 1.}}, {Steps -> 22, Function -> 29, Gradient -> 29, Hessian -> 23}, (contour plot of the search along the valley)}

You can see from the comparison of the two plots that the trust region method has kept the
steps within better control as the search follows the valley and consequently converges with
fewer function evaluations.

The following table summarizes the options you can use with Newton's method.

option name      default value

"Hessian"        Automatic        an expression to use for computing the Hessian matrix
"StepControl"    "LineSearch"     how to control steps; options include "LineSearch", "TrustRegion", or None

Method options for Method -> "Newton".



Quasi-Newton Methods
There are many variants of quasi-Newton methods. In all of them, the idea is to base the matrix $B_k$ in the quadratic model

$$q_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p$$

on an approximation of the Hessian matrix built up from the function and gradient values from some or all steps previously taken.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This shows a plot of the steps taken by the quasi-Newton method. The path is much less direct than for Newton's method. The quasi-Newton method is used by default by FindMinimum for problems that are not sums of squares.
In[2]:= FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1}, {y, 1}}]

Out[2]= {{-2., {x -> 1.37638, y -> 1.67868}}, {Steps -> 9, Function -> 13, Gradient -> 13}, (contour plot of the search)}

The first thing to notice about the path taken in this example is that it starts in the wrong direction. This direction is chosen because at the first step all the method has to go by is the gradient, and so it takes the direction of steepest descent. However, in subsequent steps, it incorporates information from the values of the function and gradient at the steps taken to build up an approximate model of the Hessian.

The methods used by Mathematica are the Broyden-Fletcher-Goldfarb-Shanno (BFGS) updates and, for large systems, the limited-memory BFGS (L-BFGS) methods, in which the model $B_k$ is not stored explicitly, but rather $B_k^{-1} \nabla f(x_k)$ is calculated from gradients and step directions stored from past steps.
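
For reference, the textbook BFGS update that builds the next model Hessian from the step $s_k = x_{k+1} - x_k$ and the gradient change $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ can be written directly; this is a sketch of the update formula itself, not the factored form Mathematica maintains (described next).

(* Textbook BFGS update: B - (B.s)(B.s)^T/(s.B.s) + y y^T/(y.s). *)
bfgsUpdate[B_, s_, y_] :=
  B - Outer[Times, B.s, B.s]/(s.B.s) + Outer[Times, y, y]/(y.s)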

The BFGS method is implemented such that instead of forming the model Hessian $B_k$ at each step, Cholesky factors $L_k$ such that $L_k L_k^T = B_k$ are computed, so that only $O(n^2)$ operations are needed to solve the system $B_k s_k = -\nabla f(x_k)$ [DS96] for a problem with $n$ variables.

For large-scale sparse problems, the BFGS method can be problematic because, in general, the Cholesky factors (or the Hessian approximation $B_k$ or its inverse) are dense, so the $O(n^2)$ memory and operations requirements become prohibitive compared to algorithms that take advantage of sparseness. The L-BFGS algorithm [NW99] forms an approximation to the inverse Hessian based on the last $m$ past steps, which are stored. The Hessian approximation may not be as complete, but the memory and order of operations are limited to $O(n m)$ for a problem with $n$ variables. In Mathematica 5, for problems over 250 variables, the algorithm is switched automatically to L-BFGS. You can control this with the method option "StepMemory" -> m. With m = Infinity, the full BFGS method will always be used. Choosing an appropriate value of m is a tradeoff between speed of convergence and the work done per step. With m < 3, you are most likely better off using a "conjugate gradient" algorithm.

This shows the same example function with the minimum computed using L-BFGS with m = 5.
In[3]:= FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2],
         {{x, 1}, {y, 1}}, Method -> {"QuasiNewton", "StepMemory" -> 5}]

Out[3]= {{-2., {x -> 1.37638, y -> 1.67868}}, {Steps -> 10, Function -> 13, Gradient -> 13}, (contour plot of the search)}

Quasi-Newton methods are chosen as the default in Mathematica because they are typically
quite fast and do not require computation of the Hessian matrix, which can be quite expensive
both in terms of the symbolic computation and numerical evaluation. With an adequate "line
search", they can be shown to converge superlinearly [NW99] to a local minimum where the
Hessian is positive definite. This means that
$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0;$$

in other words, the error eventually decreases faster than any linear rate. However, for very high precision, this does not compare to the q-quadratic convergence rate of "Newton's" method.

This shows the number of steps and function evaluations required to find the minimum to high
precision for the problem shown.
In[4]:= First[Timing[Block[{e = 0, s = 0}, {{min, minpoint}, {points}} =
          Reap[FindMinimum[Cos[x^2 - 3 y] + Sin[x^2 + y^2], {{x, 1.}, {y, 1.}},
            Method -> "QuasiNewton", WorkingPrecision -> 10000,
            StepMonitor :> (s++; Sow[{x, y}]), EvaluationMonitor :> e++]];
          Print[s, " steps and ", e, " evaluations"]]]]
95 steps and 106 evaluations
Out[4]= 2.79623

Newton's method is able to find ten times as many digits with far fewer steps because of its
quadratic convergence rate. However, the convergence with the quasi-Newton method is still
superlinear since the ratio of the errors is clearly going to zero.

This makes a plot showing the ratios of the errors in the computation. The ratios of the errors are shown on a logarithmic scale so that the trend can clearly be seen over a large range of magnitudes.
In[5]:= exact = {x, y} /. Last[Solve[{x^2 + y^2 == 3 Pi/2, x^2 - 3 y == -Pi}, {x, y}]];
        errs = Map[Norm[N[exact - #]] &, points];
        ListPlot[Log[10, Drop[errs, 1]/Drop[errs, -1]]]

Out[5]= (plot of the base-10 logarithm of successive error ratios, which trend downward)

The following table summarizes the options you can use with quasi-Newton methods.

option name      default value

"StepMemory"     Automatic        the effective number of steps to "remember" in the Hessian approximation; can be a positive integer or Automatic
"StepControl"    "LineSearch"     how to control steps; can be "LineSearch" or None

Method options for Method -> "QuasiNewton".



Gauss-Newton Methods

For minimization problems for which the objective function is a sum of squares,

$$f(x) = \frac{1}{2} \sum_{j=1}^{m} r_j(x)^2 = \frac{1}{2} r(x) \cdot r(x),$$

it is often advantageous to use the special structure of the problem. Time and effort can be saved by computing the residual function $r(x)$ and its derivative, the Jacobian $J(x)$. The Gauss-Newton method is an elegant way to do this. Rather than using the complete second-order Hessian matrix for the quadratic model, the Gauss-Newton method uses $B_k = J_k^T J_k$ in (1) such that the step $p_k$ is computed from the formula

$$J_k^T J_k p_k = -\nabla f_k = -J_k^T r_k,$$

where $J_k = J(x_k)$, and so on. Note that this is an approximation to the full Hessian, which is $J^T J + \sum_{j=1}^{m} r_j \nabla^2 r_j$. In the zero residual case, where $r = 0$ at the minimum, or when $r$ varies nearly as a linear function near the minimum point, the approximation to the Hessian is quite good and the quadratic convergence of "Newton's method" is commonly observed.
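
As a concrete sketch (illustrative helpers, not FindMinimum's internals), one Gauss-Newton step for the Rosenbrock residual used below can be computed as a linear least-squares problem, which is mathematically equivalent to solving $J_k^T J_k p_k = -J_k^T r_k$.

(* One Gauss-Newton step: solve J p ~ -r in the least-squares sense. *)
r[{x1_, x2_}] := {10 (x2 - x1^2), 1 - x1};
jac[{x1_, x2_}] := {{-20 x1, 10}, {-1, 0}};   (* Jacobian of r *)
gnStep[pt_] := pt + LeastSquares[jac[pt], -r[pt]];
NestList[gnStep, {-1.2, 1.}, 5]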

Objective functions that are sums of squares are quite common; in fact, this is the form of the objective function when FindFit is used with the default value of the NormFunction option. One way to view the Gauss-Newton method is in terms of least-squares problems. Solving the Gauss-Newton step is the same as solving a linear least-squares problem, so applying a Gauss-Newton method is in effect applying a sequence of linear least-squares fits to a nonlinear function. With this view, it makes sense that this method is particularly appropriate for the sort of nonlinear fitting that FindFit does.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This uses the Unconstrained Problems Package to set up the classic Rosenbrock function, which has a narrow curved valley.
In[2]:= p = GetFindMinimumProblem[Rosenbrock]
Out[2]= FindMinimumProblem[(1 - X1)^2 + 100 (-X1^2 + X2)^2, {{X1, -1.2}, {X2, 1.}}, {}, Rosenbrock, {2, 2}]

When Mathematica encounters a problem that is expressly a sum of squares, such as the Rosenbrock example, or a function that is the dot product of a vector with itself, the Gauss-Newton method will be used automatically.

This shows the steps taken by FindMinimum with the Gauss-Newton method for Rosenbrock's function using a trust region method for step control.
In[3]:= FindMinimumPlot[p, Method -> Automatic]

Out[3]= {{0., {X1 -> 1., X2 -> 1.}}, {Steps -> 15, Residual -> 21, Jacobian -> 16}, (contour plot of the search)}

If you compare this with the same example done with "Newton's method", you can see that it was done with fewer steps and evaluations because the Gauss-Newton method is taking advantage of the special structure of the problem. The convergence rate near the minimum is just as good as for Newton's method because the residual is zero at the minimum.

The Levenberg-Marquardt method is a Gauss-Newton method with "trust region" step control (though it was originally proposed before the general notion of trust regions had been developed). You can request this method specifically by using the FindMinimum option Method -> "LevenbergMarquardt" or equivalently Method -> "GaussNewton".

Sometimes it is awkward to express a function so that it will explicitly be a sum of squares or a dot product of a vector with itself. In these cases, it is possible to use the "Residual" method option to specify the residual directly. Similarly, you can specify the derivative of the residual with the "Jacobian" method option. Note that when the residual is specified through the "Residual" method option, it is not checked for consistency with the first argument of FindMinimum. The values returned will depend on the value given through the option.

This finds the minimum of Rosenbrock's function using the specification of the residual.
In[4]:= FindMinimum[1/2 ((1 - X1)^2 + 100 (-X1^2 + X2)^2), {{X1, -1.2}, {X2, 1.}},
         Method -> {"LevenbergMarquardt", "Residual" -> {1 - X1, 10 (-X1^2 + X2)}}]
Out[4]= {0., {X1 -> 1., X2 -> 1.}}

option name           default value

"Residual"            Automatic        allows you to directly specify the residual r such that f = 1/2 r.r
"EvaluationMonitor"   Automatic        an expression that is evaluated each time the residual is evaluated
"Jacobian"            Automatic        allows you to specify the (matrix) derivative of the residual
"StepControl"         "TrustRegion"    must be "TrustRegion", but allows you to change control parameters through method options

Method options for Method -> "LevenbergMarquardt".

Another natural way of setting up sums of squares problems in Mathematica is with FindFit,
which computes nonlinear fits to data. A simple example follows.

Here is a model function.

In[5]:= fm[a_, b_, c_, x_] := a If[x > 0, Cos[b x], Exp[c x]]

Here is some data generated by the function with some random perturbations added.
In[6]:= Block[{e = 0.1, a = 1.2, b = 3.4, c = 0.98},
         data = Table[{x, fm[a, b, c, x] + e RandomReal[{-.5, .5}]}, {x, -5, 5, .1}]];

This finds a nonlinear least-squares fit to the model function.

In[7]:= fit = FindFit[data, fm[a, b, c, x], {{a, 1}, {b, 3}, {c, 1}}, x]
Out[7]= {a -> 1.20826, b -> 3.40018, c -> 1.0048}

This shows the fit model with the data.

In[8]:= Show[{ListPlot[data],
         Plot[fm[a, b, c, x] /. fit, {x, -5, 5}, PlotStyle -> RGBColor[0, 1, 0]]}]

Out[8]= (plot of the fitted model curve through the data points)

In the example, FindFit internally constructs a residual function and Jacobian, which are in turn used by the Gauss-Newton method to find the minimum of the sum of squares, or the nonlinear least-squares fit. Of course, FindFit can be used with other methods, but because a residual function that evaluates rapidly can be constructed, it is often faster than the other methods.

Nonlinear Conjugate Gradient Methods


The basis for a nonlinear conjugate gradient method is to effectively apply the linear conjugate
gradient method, where the residual is replaced by the gradient. A model quadratic function is
never explicitly formed, so it is always combined with a "line search" method.

The first nonlinear conjugate gradient method was proposed by Fletcher and Reeves as follows. Given a step direction $p_k$, use the line search to find $\alpha_k$ such that $x_{k+1} = x_k + \alpha_k p_k$. Then compute

$$\beta_{k+1} = \frac{\nabla f(x_{k+1}) \cdot \nabla f(x_{k+1})}{\nabla f(x_k) \cdot \nabla f(x_k)} \qquad (1)$$

$$p_{k+1} = \beta_{k+1} p_k - \nabla f(x_{k+1}).$$

It is essential that the line search for choosing $\alpha_k$ satisfies the strong Wolfe conditions; this is necessary to ensure that the directions $p_k$ are descent directions [NW99].

An alternate method, which generally (but not always) works better in practice, is that of Polak and Ribiere, where equation (1) is replaced with

$$\beta_{k+1} = \frac{\nabla f(x_{k+1}) \cdot \left( \nabla f(x_{k+1}) - \nabla f(x_k) \right)}{\nabla f(x_k) \cdot \nabla f(x_k)}. \qquad (2)$$

In formula (2), it is possible that $\beta_{k+1}$ can become negative, in which case Mathematica uses the algorithm modified by using $p_{k+1} = \max(\beta_{k+1}, 0)\, p_k - \nabla f(x_{k+1})$. In Mathematica, the default conjugate gradient method is Polak-Ribiere, but the Fletcher-Reeves method can be chosen by using the method option

Method -> {"ConjugateGradient", Method -> "FletcherReeves"}.
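
Written out as a small sketch (with illustrative helper names, not the built-in implementation), the Polak-Ribiere coefficient and the modified direction update look like this, where gNew and gOld are the gradients at $x_{k+1}$ and $x_k$ and p is the previous direction.

(* Polak-Ribiere coefficient and direction update with the
   nonnegativity modification max(beta, 0). *)
betaPR[gNew_, gOld_] := gNew.(gNew - gOld)/(gOld.gOld)
nextDirection[gNew_, gOld_, p_] := Max[betaPR[gNew, gOld], 0] p - gNew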

The advantage of conjugate gradient methods is that they use relatively little memory for large-scale problems and require no numerical linear algebra, so each step is quite fast. The disadvantage is that they typically converge much more slowly than "Newton" or "quasi-Newton" methods. Also, steps are typically poorly scaled for length, so the "line search" algorithm may require more iterations each time to find an acceptable step.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This shows a plot of the steps taken by the nonlinear conjugate gradient method. The path is much less direct than for Newton's method.
In[2]:= FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2],
         {{x, 1}, {y, 1}}, Method -> "ConjugateGradient"]

Out[2]= {{-2., {x -> 1.37638, y -> 1.67868}}, {Steps -> 9, Function -> 22, Gradient -> 22}, (contour plot of the search)}

One issue that arises with nonlinear conjugate gradient methods is when to restart them. As the search moves, the nature of the local quadratic approximation to the function may change substantially. The local convergence of the method depends on that of the linear conjugate gradient method, where the quadratic function is constant. With a constant quadratic function for $n$ variables and an exact line search, the linear algorithm will converge in $n$ or fewer iterations. By restarting (taking a steepest descent step with $\beta_{k+1} = 0$) every so often, it is possible to eliminate information from previous points, which may not be relevant to the local quadratic model at the current search point. If you look carefully at the example, you can see where the method was restarted and a steepest descent step was taken. One option is to simply restart after every $k$ iterations, where $k \le n$. You can specify this using the method option "RestartIterations" -> k. An alternative is to restart when consecutive gradients are not sufficiently orthogonal according to the test

$$\frac{\left| \nabla f(x_k) \cdot \nabla f(x_{k-1}) \right|}{\nabla f(x_k) \cdot \nabla f(x_k)} < \nu,$$

with a threshold $\nu$ between 0 and 1. You can specify this using the method option "RestartThreshold" -> ν.

The table summarizes the options you can use with the conjugate gradient methods.

option name            default value

"Method"               "PolakRibiere"   nonlinear conjugate gradient method; can be "PolakRibiere" or "FletcherReeves"
"RestartThreshold"     1/10             threshold ν for gradient orthogonality below which a restart will be done
"RestartIterations"    Infinity         number of iterations after which to restart
"StepControl"          "LineSearch"     must be "LineSearch", but you can use this to specify line search methods

Method options for Method -> "ConjugateGradient".

It should be noted that the default method for FindMinimum in Mathematica 4 was a conjugate
gradient method with a near exact line search. This has been maintained for legacy reasons and
can be accessed by using the FindMinimum option Method -> "Gradient". Typically, this will
use more function and gradient evaluations than the newer Method -> "ConjugateGradient",
which itself often uses far more than the methods that Mathematica currently uses as defaults.

Principal Axis Method


"Gauss|Newton" and "conjugate gradient" methods use derivatives. When Mathematica cannot
compute symbolic derivatives, finite differences will be used. Computing derivatives with finite
differences can impose a significant cost in some cases and certainly affects the reliability of
derivatives, ultimately having an effect on how good an approximation to the minimum is
achievable. For functions where symbolic derivatives are not available, an alternative is to use a
derivative-free algorithm, where an approximate model is built up using only values from
function evaluations.

Mathematica uses the principal axis method of Brent [Br02] as a derivative-free algorithm. For an $n$-variable problem, take a set of search directions $u_1, u_2, \ldots, u_n$ and a point $x_0$. Take $x_i$ to be the point that minimizes $f$ along the direction $u_i$ from $x_{i-1}$ (i.e., do a "line search" from $x_{i-1}$), then replace $u_i$ with $u_{i+1}$. At the end, replace $u_n$ with $x_n - x_0$. Ideally, the new $u_i$ should be linearly independent, so that a new iteration could be undertaken, but in practice, they are not. Brent's algorithm involves using the singular value decomposition (SVD) on the matrix $U = (u_1, u_2, \ldots, u_n)$ to realign them to the principal directions for the local quadratic model. (An eigendecomposition could be used, but Brent shows that the SVD is more efficient.) With the new set of $u_i$ obtained, another iteration can be done.
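
A single sweep of the basic direction set update (without the SVD realignment) can be sketched as follows; lineMin and sweep are hypothetical helpers for illustration, with the one-dimensional searches delegated to derivative-free FindMinimum calls.

(* One principal axis sweep: line-minimize along each direction in turn,
   then replace the last direction with the net displacement x_n - x_0.
   The SVD realignment of Brent's algorithm is omitted from this sketch. *)
lineMin[f_, x0_, u_] := Module[{t},
  x0 + u (t /. Last[FindMinimum[f[x0 + t u], {t, 0, 0.1}]])]
sweep[f_, x0_, dirs_] := Module[{x = x0},
  Do[x = lineMin[f, x, u], {u, dirs}];
  {x, Append[Rest[dirs], x - x0]}]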

Two distinct starting conditions in each variable are required for this method because these are
used to define the magnitudes of the vectors ui . In fact, whenever you specify two starting
conditions in each variable, FindMinimum , FindMaximum , and FindFit will use the principal axis
algorithm by default.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This shows the search path and function evaluations for FindMinimum to find a local minimum of the function $\cos(x^2 - 3y) + \sin(x^2 + y^2)$.
In[2]:= FindMinimumPlot[Cos[x^2 - 3 y] + Sin[x^2 + y^2],
         {{x, 1.4, 1.5}, {y, 1, 1.1}}, Method -> "PrincipalAxis"]

Out[2]= {{-2., {x -> 2.12265, y -> 0.454686}}, {Steps -> 4, Function -> 148}, (contour plot of the search)}

The basics of the search algorithm can be seen quite well from the plot since the derivative-free line search algorithm requires a substantial number of function evaluations. First a line search is done in the x direction, then from that point, a line search is done in the y direction, determining the step direction. Once the step is taken, the vectors $u_i$ are realigned appropriately to the principal directions of the local quadratic approximation and the next step is similarly computed.

The algorithm is efficient in terms of convergence rate; it has quadratic convergence in terms of steps. However, in terms of function evaluations, it is quite expensive because of the derivative-free line search required. Note that since the directions given to the line search (especially at the beginning) are not necessarily descent directions, the line search has to be able to search in both directions. For problems with many variables, the individual linear searches in all directions become very expensive, so this method is typically better suited to problems without too many variables.

Methods for Solving Nonlinear Equations

Introduction to Solving Nonlinear Equations


There are some close connections between finding a "local minimum" and solving a set of nonlinear equations. Given a set of $n$ equations in $n$ unknowns, seeking a solution of $r(x) = 0$ is equivalent to minimizing the sum of squares $r(x) \cdot r(x)$ when the residual is zero at the minimum, so there is a particularly close connection to the "Gauss-Newton" methods. In fact, the Gauss-Newton step for local minimization and the "Newton" step for nonlinear equations are exactly the same. Also, for a smooth function, "Newton's method" for local minimization is the same as Newton's method for the nonlinear equations $\nabla f = 0$. Not surprisingly, many aspects of the algorithms are similar; however, there are also important differences.

Another thing in common with minimization algorithms is the need for some kind of "step control". Typically, step control is based on the same methods as minimization, except that it is applied to a merit function, usually the smooth 2-norm squared, $r(x) \cdot r(x)$.

"Newton" use the exact Jacobian or a finite difference approximation


to solve for the step based on a locally linear model
"Secant" work without derivatives by constructing a secant approxi -
mation to the Jacobian using n past steps; requires two
starting conditions in each dimension
"Brent" method in one dimension that maintains bracketing of
roots; requires two starting conditions that bracket a root

Basic method choices for FindRoot .

Newton's Method
Newton's method for nonlinear equations is based on a linear approximation

$$r(x) \approx M_k(p) = r(x_k) + J(x_k)\, p, \qquad p = x - x_k,$$

so the Newton step is found simply by setting $M_k(p) = 0$:

$$J(x_k)\, p_k = -r(x_k).$$
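
A bare-bones sketch of this iteration (illustrative only; FindRoot adds step control, Jacobian handling, and convergence tests) is:

(* One Newton step for r(x) == 0, given functions returning the
   residual vector and the Jacobian matrix at a point. *)
newtonRootStep[r_, jac_, xk_] := xk + LinearSolve[jac[xk], -r[xk]]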

Near a root of the equations, Newton's method has q-quadratic convergence, similar to
"Newton's" method for minimization. Newton's method is used as the default method for
FindRoot.

Newton's method can be used with either "line search" or "trust region" step control. When it
works, the line search control is typically faster, but the trust region approach is usually more
robust.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

Here is the Rosenbrock problem as a FindRoot problem.


In[2]:= p = GetFindRootProblem[Rosenbrock]

Out[2]= FindRootProblem[{10 (-X1^2 + X2), 1 - X1}, {{X1, -1.2}, {X2, 1.}}, {}, Rosenbrock, {2, 2}]

This finds the solution of the nonlinear system using the default line search approach. (Newton's method is the default method for FindRoot.)
In[3]:= FindRootPlot[p]

Out[3]= {{X1 -> 1., X2 -> 1.}, {Steps -> 15, Residual -> 27, Jacobian -> 15}, (contour plot of the search)}

Note that each of the line searches is directed toward the line X1 == 1. This is a particular property of the Newton step for this particular problem.

This computes the Jacobian and the Newton step symbolically for the Rosenbrock problem.

In[4]:= J = Outer[D, {10 (-X1^2 + X2), 1 - X1}, {X1, X2}];
        LinearSolve[J, -{10 (-X1^2 + X2), 1 - X1}]
Out[4]= {1 - X1, 2 X1 - X1^2 - X2}

When this step is added to the point {X1, X2}, it is easy to see why the steps go to the line X1 = 1. This is a particular feature of this problem, which is not typical for most functions.

Because the "trust region" approach does not try the Newton step unless it lies within the
region bound, this feature does not show up so strongly when the trust region step control is
used.

This finds the solution of the nonlinear system using the trust region approach. The search is almost identical to the search with the "Gauss-Newton" method for the Rosenbrock objective function in FindMinimum.
In[5]:= FindRootPlot[p, Method -> {"Newton", "StepControl" -> "TrustRegion"}]

Out[5]= {{X1 -> 1., X2 -> 1.}, {Steps -> 16, Residual -> 21, Jacobian -> 16}, (contour plot of the search)}

When the structure of the Jacobian matrix is sparse, Mathematica will use SparseArray objects
both to compute the Jacobian and to handle the necessary numerical linear algebra.

When solving nonlinear equations is used as a part of a more general numerical procedure, such as solving differential equations with implicit methods, often starting values are quite good, and complete convergence is not absolutely necessary. Often the most expensive part of computing a Newton step is finding the Jacobian and computing a matrix factorization. However, when close enough to a root, it is possible to leave the Jacobian frozen for a few steps (though this does certainly affect the convergence rate). You can do this in Mathematica using the method option "UpdateJacobian", which gives the number of steps to go before updating the Jacobian. The default is "UpdateJacobian" -> 1, so the Jacobian is updated every step.

This shows the number of steps, function evaluations, and Jacobian evaluations required to find
a simple square root when the Jacobian is only updated every three steps.
In[6]:= Block[{s = 0, e = 0, j = 0},
         {FindRoot[x^2 - 2, {{x, 1.5}}, Method -> {"Newton", "UpdateJacobian" -> 3},
           EvaluationMonitor :> e++, StepMonitor :> s++,
           Jacobian -> {Automatic, EvaluationMonitor :> j++}], s, e, j}]

Out[6]= {{x -> 1.41421}, 5, 9, 2}

This shows the number of steps, function evaluations, and Jacobian evaluations required to find
a simple square root when the Jacobian is updated every step.
In[7]:= Block[{s = 0, e = 0, j = 0},
         {FindRoot[x^2 - 2, {{x, 1.5}}, EvaluationMonitor :> e++, StepMonitor :> s++,
           Jacobian -> {Automatic, EvaluationMonitor :> j++}], s, e, j}]

Out[7]= {{x -> 1.41421}, 4, 5, 4}

Of course for a simple one-dimensional root, updating the Jacobian is trivial in cost, so holding
the update is only of use here to demonstrate the idea.

option name         default value

"UpdateJacobian"    1               number of steps to take before updating
                                    the Jacobian
"StepControl"       "LineSearch"    method for step control; can be
                                    "LineSearch", "TrustRegion", or
                                    None (which is not recommended)

Method options for Method -> "Newton" in FindRoot.

The Secant Method


When derivatives cannot be computed symbolically, Newton's method will be used, but with a
finite difference approximation to the Jacobian. This can have costs in terms of both time and
reliability. Just as for minimization, an alternative is to use an algorithm specifically designed to
work without derivatives.

In one dimension, the idea of the secant method is to use the slope of the line between two
consecutive search points to compute the step instead of the derivative at the latest point.
Similarly, in n dimensions, differences between the residuals at n points are used to construct an
approximation of sorts to the Jacobian. Note that this is similar to finite differences, but rather
than trying to make the difference interval small in order to get as good a Jacobian
approximation as possible, it effectively uses an average derivative just like the one-dimensional
secant method. Initially, the n points are constructed from two starting points that are distinct
in all n dimensions. Subsequently, as steps are taken, only the n points with the smallest merit
function values are kept. It is rare, but possible, that the steps are collinear and the secant
approximation to the Jacobian becomes singular. In this case, the algorithm is restarted with
distinct points.
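In one dimension, the secant step has an explicit form; here is a minimal illustration for
r(x) = x^2 - 2, carried out by hand rather than with a built-in function:

x0 = 1.; x1 = 1.5;
(* step using the secant slope (r(x1) - r(x0))/(x1 - x0) in place of the derivative *)
x1 - (x1^2 - 2) (x1 - x0)/((x1^2 - 2) - (x0^2 - 2))
(* 1.4, compared with Sqrt[2] ≈ 1.41421 *)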

The method requires two starting points in each dimension. In fact, if two starting points are
given in each dimension, the secant method is the default method, except in one dimension,
where Brent's method may be chosen.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This shows the solution of the Rosenbrock problem with the secant method.

In[2]:= FindRootPlot[{10 (-X1^2 + X2), 1 - X1}, {{X1, -1.2, -1.}, {X2, 1., .9}}]

Out[2]= {{X1 -> 1., X2 -> 1.}, {Steps -> 21, Residual -> 70},
         (plot of the steps and evaluations of the search)}

Note that, as compared to Newton's method, many more residual function evaluations are
required. However, the method is able to follow the relatively narrow valley without directly
using derivative information.

This shows the solution of the Rosenbrock problem with Newton's method using finite
differences to compute the Jacobian.

In[3]:= FindRootPlot[{10 (-X1^2 + X2), 1 - X1}, {{X1, -1.2}, {X2, 1.}},
         Method -> {"Newton", "StepControl" -> "TrustRegion"}, Jacobian -> "FiniteDifference"]

Out[3]= {{X1 -> 1., X2 -> 1.}, {Steps -> 17, Residual -> 70, Jacobian -> 16},
         (plot of the steps and evaluations of the search)}

However, the number of residual function evaluations here is comparable to the number needed
by Newton's method with finite differences. For larger problems with sparse Jacobian matrices,
the finite difference Newton method will usually be more efficient, since the secant method does
not take advantage of sparsity in any way.

Brent’s Method
When searching for a real simple root of a real valued function, it is possible to take advantage
of the special geometry of the problem, where the function crosses the axis from negative to
positive or vice versa. Brent’s method [Br02] is effectively a safeguarded secant method that
always keeps a point where the function is positive and one where it is negative so that the root
is always bracketed. At any given step, a choice is made between an interpolated (secant) step
and a bisection in such a way that eventual convergence is guaranteed.

If FindRoot is given two real starting conditions that bracket a root of a real function, then
Brent’s method will be used. Thus, if you are working in one dimension and can determine
initial conditions that will bracket a root, it is often a good idea to do so since Brent’s method is
the most robust algorithm available for FindRoot.
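For example (an illustrative equation), Cos[x] == x has a root bracketed by 0 and 1, since
Cos[0] - 0 > 0 and Cos[1] - 1 < 0, so giving these two starting values invokes Brent's method:

FindRoot[Cos[x] - x, {x, 0, 1}]
(* {x -> 0.739085} *)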

Even though essentially all the theory for solving nonlinear equations and local minimization is
based on smooth functions, Brent’s method is sufficiently robust that you can even get a good
estimate for a zero crossing for discontinuous functions.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This shows the steps and function evaluations used in an attempt to find the root of a
discontinuous function.

In[2]:= FindRootPlot[2 UnitStep[Sin[x]] - 1, {x, 3, 4}]

FindRoot::cvmit: Failed to converge to the requested accuracy or precision within 100 iterations.

Out[2]= {{x -> 3.14159}, {Steps -> 50, Residual -> 51},
         (plot of the steps and evaluations of the search)}

The method gives up and issues a message when the root is bracketed very closely but no point
can be found at which the function value is actually zero. This robustness carries over very well
to continuous functions that are very steep.

This shows the steps and function evaluations used to find the root of a function that varies
rapidly near its root.
In[3]:= FindRootPlot[ArcTan[10000 Sin[x]], {x, 3, 4}, PlotRange -> All]

Out[3]= {{x -> 3.14159}, {Steps -> 18, Residual -> 19},
         (plot of the steps and evaluations of the search)}

Step Control

Introduction to Step Control


Even with "Newton methods" where the local model is based on the actual Hessian, unless you
are close to a root or minimum, the model step may not bring you any closer to the solution. A
simple example is given by the following problem.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

This shows a simple example for root finding with step control disabled where the iteration
alternates between two points and does not converge. Note: On some platforms, you may see
convergence. This is due to slight variations in machine-number arithmetic, which may be
sufficient to break the oscillation.
In[2]:= FindRootPlot[Sin[x], {x, 1.1655611852072114},
         Method -> {"Newton", "StepControl" -> None}]

FindRoot::cvmit: Failed to converge to the requested accuracy or precision within 100 iterations.

Out[2]= {{x -> 21.9911}, {Steps -> 27, Residual -> 27, Jacobian -> 26},
         (plot of the steps and evaluations of the search)}

This shows the same example problem with step control enabled. Since the first evaluation
point has not reduced the size of the function, the line search restricts the step and so the
iteration converges to the solution.
In[3]:= FindRootPlot[Sin[x], {x, 1.1655611852072114}, Method -> "Newton"]

Out[3]= {{x -> 0.}, {Steps -> 2, Residual -> 3, Jacobian -> 2},
         (plot of the steps and evaluations of the search)}

A good step-size control algorithm prevents the iteration from repeating points or escaping from
areas near roots or minima. At the same time, when steps based on the model function are
appropriate, the step-size control algorithm should not restrict them; otherwise, the
convergence rate of the algorithm would be compromised. Two commonly used step-size
control algorithms are "line search" and "trust region" methods. In a line search method, the
model function gives a step direction, and a search is done along that direction to find an
adequate point that will lead to convergence. In a trust region method, a distance within which
the model function will be trusted is updated at each step. If the model step lies within that
distance, it is used; otherwise, an approximate minimum for the model function on the boundary
of the trust region is used. Generally, the trust region methods are more robust, but they
require more numerical linear algebra.

Both step control methods were developed originally with minimization in mind. However, they
apply equally well to finding roots of nonlinear equations when used with a merit function. In
Mathematica, the 2-norm merit function r(x).r(x) is used.
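For the Rosenbrock system used earlier, this merit function is exactly the Rosenbrock objective
function, which is why the root-finding searches look so much like the corresponding
minimization searches. A quick check:

r = {10 (-X1^2 + X2), 1 - X1};
Expand[r.r]
(* 1 - 2 X1 + X1^2 + 100 X1^4 - 200 X1^2 X2 + 100 X2^2 *)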

Line Search Methods


A method like "Newton’s" method chooses a step, but the validity of that step only goes as far
as the Newton quadratic model for the function really reflects the function. The idea of a line
search is to use the direction of the chosen step, but to control the length, by solving a one-
dimensional problem of minimizing

f HaL ã f Ha pk + xk L,

where pk is the search direction chosen from the position xk . Note that

f ' HaL ã “ f Ha pk + xk L.pk ,

so if you can compute the gradient, you can effectively do a one-dimensional search with deriva-
tives.

Typically, an effective line search only looks toward α > 0, since a reasonable method should
guarantee that the search direction is a descent direction, which can be expressed as f'(0) < 0.

It is typically not worth the effort to find an exact minimum of f since the search direction is
rarely exactly the right direction. Usually it is enough to move closer.

One condition that measures progress is called the Armijo or sufficient decrease condition for a
candidate α*:

f(α*) ≤ f(0) + μ α* f'(0),    0 < μ < 1.

Often with this condition, methods will converge, but for some methods, Armijo alone does not
guarantee convergence for smooth functions. With the additional curvature condition,

|f'(α*)| ≤ η |f'(0)|,    0 < μ ≤ η < 1,

many methods can be proven to converge for smooth functions. Together these conditions are
known as the strong Wolfe conditions. You can control the parameters μ and η with the
"DecreaseFactor" -> μ and "CurvatureFactor" -> η options of "LineSearch".

The default value for "CurvatureFactor" -> η is η = 0.9, except for
Method -> "ConjugateGradient", where η = 0.1 is used, since that algorithm typically works
better with a closer-to-exact line search. The smaller η is, the closer to exact the line search is.
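As a sketch of how these parameters can be adjusted (the nested option form below is an
assumption based on the option names given above), a closer-to-exact line search is requested
by lowering the curvature factor:

FindMinimum[(1 - x)^2 + 100 (y - x^2)^2, {{x, -1.2}, {y, 1}},
 Method -> {"ConjugateGradient",
   "StepControl" -> {"LineSearch", "CurvatureFactor" -> 0.1}}]
(* converges to the minimum at {x -> 1., y -> 1.} *)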

If you look at graphs showing iterative searches in two dimensions, you can see the evaluations
spread out along the directions of the line searches. Typically, it only takes a few iterations to
find a point satisfying the conditions. However, the line search is not always able to find a point
that satisfies the conditions. Usually this is because there is insufficient precision to compute
the points closely enough to satisfy the conditions, but it can also be caused by functions that
are not completely smooth or vary extremely slowly in the neighborhood of a minimum.

This loads a package that contains some utility functions.


In[1]:= << Optimization`UnconstrainedProblems`

In[2]:= FindMinimum[x^2/2 + Cos[x], {x, 1}]

FindMinimum::lstol: The line search decreased the step size to within tolerance specified by
    AccuracyGoal and PrecisionGoal but was unable to find a sufficient decrease in the function.
    You may need more than MachinePrecision digits of working precision to meet these tolerances.

Out[2]= {1., {x -> 0.000182658}}

This runs into problems because the real differences in the function are negligible compared to
evaluation differences around the point, as can be seen from the plot.

In[24]:= Plot[x^2/2 + Cos[x], {x, 0, .0004}, PlotRange -> {1 - 10^-15, 1 + 10^-15}]

Out[24]= (plot showing that near 0 the computed function values are all within roundoff of 1)

Sometimes it can help to subtract out the constants so that small changes in the function are
more significant.

In[18]:= FindMinimum[x^2/2 + Cos[x] - 1, {x, 1}]

FindMinimum::lstol: The line search decreased the step size to within tolerance specified by
    AccuracyGoal and PrecisionGoal but was unable to find a sufficient decrease in the function.
    You may need more than MachinePrecision digits of working precision to meet these tolerances.

Out[18]= {1.11022 × 10^-16, {x -> 0.00024197}}

In this case, however, the approximation is only slightly closer because the function is quite
noisy near 0, as can be seen from the plot.

In[19]:= Plot[x^2/2 + Cos[x] - 1, {x, 0, .0004}]

Out[19]= (plot of the noisy function values, of order 10^-16, between 0 and 0.0004)

Thus, to get closer to the correct value of zero, higher precision is required to compute the
function more accurately.
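For instance, rerunning the example above with higher working precision (an illustrative run)
lets the line search make progress much closer to the true minimizer at x == 0:

FindMinimum[x^2/2 + Cos[x], {x, 1}, WorkingPrecision -> 30]
(* the minimum value 1 is now located to many more digits; the accuracy of x itself
   remains limited because the function is quartically flat near x == 0 *)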

For some problems, particularly where you may be starting far from a root or a local minimum,
it may be desirable to restrict steps. With line searches, it is possible to do this by using the
"MaxRelativeStepSize" method option. The default value is designed to keep searches from
going wildly out of control, while at the same time not preventing a search from using
reasonably large steps if appropriate.
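Here is a sketch of how the option can be given (the nested form is an assumption consistent
with the other "LineSearch" options shown earlier):

FindRoot[x^2 - 2, {x, 10.},
 Method -> {"Newton", "StepControl" -> {"LineSearch", "MaxRelativeStepSize" -> 1/10}}]
(* each step is limited relative to the size of the current point, so the search
   approaches Sqrt[2] cautiously from the distant starting value *)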
