Overview
Roadmap for optimization
Yasmin H. Said1,2∗ and Edward J. Wegman1
1 Center for Computational Data Sciences, George Mason University, Fairfax, VA 22030, USA
2 Department of Statistics, Oklahoma State University, Stillwater, OK 74078, USA
∗Correspondence to: [email protected]
DOI: 10.1002/wics.016
This article focuses broadly on the area known as optimization. The intent is
to provide in broad brush strokes a perspective on the area in order to orient
the reader to more detailed treatments of specific subdisciplines of optimization
throughout WIREs: Computational Statistics. In this article we provide background
on mathematical programming, Lagrange multipliers, Karush–Kuhn–Tucker conditions, numerical optimization methods, linear programming, dynamic programming, the calculus of variations, and metaheuristic algorithms. © 2009 John Wiley & Sons, Inc. WIREs Comp Stat 2009 1 3–11
INTRODUCTION
Optimization is a simple concept that involves
a quantity that will be either minimized or
maximized. In trying to optimize this quantity, certain
parameters, sometimes called decision variables, are
subject to constraints. In general, an optimization
problem is comprised of three parts: the objective
function, variables, and constraints. The objective
function is the function to be optimized. The variables
or parameters involved define the model. Finally,
constraints determine the range of allowed values
for the unknowns. As taught in calculus, stationary
points of smooth functions are points at which the
derivative is equal to zero. In practical terms, these
points represent maxima, minima, or saddle points.
According to the Weierstrass extreme value theorem, a continuous function on a closed and bounded domain attains both a maximum and a minimum.
The goal of optimization is to find these points if they
exist.
The simplest case of mathematical optimization is mathematical programming. Mathematical
programming considers as its objective function a
scalar real-valued function of real variables. Given
a function, φ : S → R, where R is the real line, the
optimization problem is to find x0 such that x0 ∈ S
and φ(x0 ) ≤ φ(x) for every x ∈ S (minimization) or
φ(x0 ) ≥ φ(x) for every x ∈ S (maximization). S is usually a subset of Rd , the d-dimensional Euclidean space,
which may be determined by a set of constraints. The
domain of φ, S, is called the search space and the
elements of S are called feasible solutions. The function, φ, in the operations research literature is called
the objective function. In other literatures it is called
an energy function (physics) or cost function (decision or game theory). In the simplest cases, φ is a
twice-differentiable convex function over Rd with no constraints on S, in which case a minimum may be determined by setting the first partial derivatives of φ, ∂φ/∂xi, i = 1, . . . , d, equal to 0 and solving for x, or φ is a twice-differentiable concave function over Rd, in which case a maximum may similarly be determined.
In general, the objective function will not be a
convex or concave function, so that there may be a
number of local maxima or minima. For problems
with twice-differentiable objective functions without
constraints, the solutions can be found as stationary
points of φ. The stationary points occur where the
gradient of the objective function is zero, i.e., ∇φ = 0.
The Hessian matrix is used to classify each stationary
point. The Hessian matrix is
$$H(\varphi) = \begin{bmatrix}
\frac{\partial^2 \varphi}{\partial x_1^2} & \frac{\partial^2 \varphi}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 \varphi}{\partial x_1 \partial x_d} \\
\frac{\partial^2 \varphi}{\partial x_2 \partial x_1} & \frac{\partial^2 \varphi}{\partial x_2^2} & \cdots & \frac{\partial^2 \varphi}{\partial x_2 \partial x_d} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \varphi}{\partial x_d \partial x_1} & \frac{\partial^2 \varphi}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 \varphi}{\partial x_d^2}
\end{bmatrix}.$$
If the Hessian is positive definite, the point is
a local minimum. If the Hessian is negative definite,
the point is a local maximum, and if the Hessian is indefinite, the point is a saddle point.
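As a small illustration of this classification (a sketch in Python; the quadratic objective and variable names below are made up for the example), one can solve ∇φ = 0 symbolically and then inspect the eigenvalues of the Hessian:

import sympy as sp

# Hypothetical objective: phi(x1, x2) = x1**2 + 3*x2**2 - 2*x1*x2 - 4*x1 (a convex quadratic).
x1, x2 = sp.symbols('x1 x2')
phi = x1**2 + 3*x2**2 - 2*x1*x2 - 4*x1

# Stationary point: solve the gradient equations d(phi)/dx_i = 0.
grad = [sp.diff(phi, v) for v in (x1, x2)]
stationary = sp.solve(grad, (x1, x2))        # {x1: 3, x2: 1}

# Classify via the Hessian: all eigenvalues positive => positive definite => local minimum.
H = sp.hessian(phi, (x1, x2))
print(stationary, H.eigenvals())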
Because the existence of derivatives may not
always be assumed, many other methods depending
on the smoothness of the objective function have been
developed. Some of the more well-known methods
include conjugate gradient methods, steepest descent
methods, interior point methods, Newton’s method,
quasi-Newton methods, simplex methods, evolutionary algorithms, genetic algorithms, simulated annealing, and tabu search. Constrained optimization problems can often be transformed into unconstrained
problems with the help of Lagrange multipliers. This
roadmap will briefly discuss some of these methods.
LAGRANGE MULTIPLIERS AND KARUSH–KUHN–TUCKER CONDITIONS
In the Introduction, simple multivariable calculus was
the tool for optimizing a twice differentiable function.
However, in nonlinear programming, optimizing that
function may be subject to constraints. A method for
doing this is called Lagrange multipliers. Consider an
objective function φ(x) and a set of constraints gk (x) =
0, k = 1, . . . , m. The domain of φ is an open set that
contains all points that satisfy the constraints. The
functions φ and gk must have continuous first partial
derivatives on the domain. Finally, the gradients
of gk must not be zero on the domain.
Then the Lagrangian, ℒ, is
$$\mathcal{L}(x, \lambda) = \varphi(x) + \sum_{k=1}^{m} \lambda_k g_k(x).$$
Here λ = (λ1, . . . , λm). Notice that ∂ℒ/∂xi = 0 if and only if ∂φ/∂xi = −Σ_{k=1}^{m} λk ∂gk/∂xi for i = 1, . . . , d. Furthermore, ∂ℒ/∂λk = 0 ⇒ gk = 0. Thus, these new normal equations, ∂ℒ/∂xi = 0, i = 1, . . . , d, and ∂ℒ/∂λk = 0, k = 1, . . . , m, yield d + m simultaneous equations. The stationary points of this set of equations determine the optimization points of the constrained optimization problem. As before, computing the Hessian will indicate whether the stationary point is a maximum or a minimum.
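The mechanics can be sketched in SymPy (the objective and the single equality constraint below are hypothetical, chosen only to illustrate the d + m normal equations):

import sympy as sp

# Hypothetical problem: minimize phi(x, y) = x**2 + y**2 subject to g(x, y) = x + y - 1 = 0.
x, y, lam = sp.symbols('x y lambda')
phi = x**2 + y**2
g = x + y - 1

# Lagrangian and its normal equations: d/dx = d/dy = d/dlambda = 0.
L = phi + lam * g
eqs = [sp.diff(L, v) for v in (x, y, lam)]
sol = sp.solve(eqs, (x, y, lam), dict=True)
print(sol)   # [{x: 1/2, y: 1/2, lambda: -1}]: the constrained minimizer is (1/2, 1/2)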
Consider now a generalization of the constrained
optimization problem given above. Let φ(x) be the
objective function. We wish to minimize φ(x) subject
to the constraints gk (x) ≤ 0, k = 1, . . . , m and hj (x) =
0, j = 1, . . . , l. gk (x) are the inequality constraints and
hj (x) are the equality constraints. William Karush1
first published the necessary conditions for this
general equality–inequality constrained problem in his
Masters thesis. However, it was not until Harold Kuhn
and Albert Tucker2 rediscovered and popularized the
necessary conditions in the Proceedings of the Second
Berkeley Symposium that the power of the method
was appreciated. The necessary conditions are now
known as the Karush–Kuhn–Tucker conditions.
Let φ: Rd → R, gk: Rd → R, k = 1, . . . , m, and hj: Rd → R, j = 1, . . . , l. Assume these m + l + 1
functions are continuously differentiable at a point
x∗ . If x∗ is a local minimum that satisfies regularity
conditions that will be specified shortly, then there
exist constants µk, k = 1, . . . , m, and λj, j = 1, . . . , l, such that
$$\nabla \varphi(x^*) + \sum_{k=1}^{m} \mu_k \nabla g_k(x^*) + \sum_{j=1}^{l} \lambda_j \nabla h_j(x^*) = 0 \qquad \text{(stationarity condition)},$$
$$g_k(x^*) \le 0,\ k = 1, \ldots, m, \quad \text{and} \quad h_j(x^*) = 0,\ j = 1, \ldots, l \qquad \text{(primal feasibility)},$$
$$\mu_k \ge 0, \quad k = 1, \ldots, m \qquad \text{(dual feasibility)},$$
$$\mu_k\, g_k(x^*) = 0, \quad k = 1, \ldots, m \qquad \text{(complementary slackness)}.$$
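As a quick numerical check (a hypothetical toy problem, not drawn from the text), take φ(x) = x1² + x2² with the single constraint g(x) = 1 − x1 − x2 ≤ 0; the minimizer x∗ = (1/2, 1/2) with multiplier µ1 = 1 satisfies all four conditions:

import numpy as np

# Hypothetical problem: minimize x1^2 + x2^2 subject to g(x) = 1 - x1 - x2 <= 0.
x_star = np.array([0.5, 0.5])
mu = 1.0

grad_phi = 2 * x_star                  # gradient of the objective at x*
grad_g = np.array([-1.0, -1.0])        # gradient of the constraint at x*

stationarity = grad_phi + mu * grad_g          # the zero vector
primal = 1 - x_star.sum()                      # g(x*) <= 0 (active here, so = 0)
dual = mu >= 0                                 # dual feasibility
slackness = mu * (1 - x_star.sum())            # complementary slackness, = 0

print(stationarity, primal, dual, slackness)   # [0. 0.] 0.0 True 0.0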
The regularity conditions for a minimum point
to satisfy the Karush–Kuhn–Tucker conditions are
given below.
LICQ - linear independence constraint qualification: the gradients of the active inequality constraints
and the equality constraints are linearly independent
at x∗ .
MFCQ - Mangasarian–Fromowitz constraint
qualification: the gradients of the active inequality
constraints and the gradients of the equality constraints are positive-linearly independent at x∗ .
CRCQ - Constant rank constraint qualification: for each subset of the gradients of the active inequality constraints and the gradients of the equality constraints, the rank in a vicinity of x∗ is constant.
CPLD - Constant positive linear dependence constraint qualification: for each subset of the gradients of the active inequality constraints and the gradients of the equality constraints, if it is positive-linearly dependent at x∗, then it is positive-linearly dependent in a vicinity of x∗.
Slater condition - for a convex problem, there
exists a point x such that hj (x) = 0 and gk (x) < 0 for
all constraints gk active at x∗.
Linearity constraints - If φ and gk are affine functions, then no other condition is needed to assure that
the minimum point satisfies the Karush–Kuhn–Tucker
conditions.
It is known that LICQ⇒MFCQ⇒CPLD and
that LICQ⇒CRCQ⇒CPLD. The converses are not
true. Also MFCQ is not equivalent to CRCQ. The
Karush–Kuhn–Tucker conditions are widely used
in nonlinear programming. One prominent recent
example is constrained nonlinear optimization used
in the algorithm for determining the Support Vector
Machine.
NUMERICAL OPTIMIZATION
METHODS
Often, because of the nature of the functions involved,
a purely analytic solution is not available or is
hard to compute analytically. For example, in the
simultaneous set of equations described by the
Lagrange multipliers or the Karush–Kuhn–Tucker
conditions, an easy analytic solution may not be
feasible. In such cases, one would want to turn to
numerical methods. We briefly describe three such
methods.
Newton’s Method
Also known as the Newton–Raphson method, Newton's method is a simple iterative procedure for finding the roots or zeros of a function. Because in optimization we wish to find the stationary points of a function, say φ(x), we are interested in finding the zeros of the first derivative of that function, dφ/dx = φ′(x). For the sake of discussion, consider finding the roots of a function f. Suppose the root of the function is actually x∗ and that x0 is a first guess at the root. The tangent line at x0 satisfies f′(x0) = (f(x0) − 0)/(x0 − x1), where x1 is the location of the intersection of the tangent line with the x-axis. This formula may be applied recursively to obtain f′(xn) = (f(xn) − 0)/(xn − xn+1). Solving for xn+1 yields
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)},$$
which is the Newton's method formula. If f is twice continuously differentiable with f(x∗) = 0 and f′(x∗) ≠ 0, then there is a neighborhood of x∗, C(x∗), such that for all starting values x0 ∈ C(x∗), the sequence xn → x∗ as n → ∞.
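A minimal sketch of the iteration (the target function and the tolerance are illustrative; in the optimization setting one would apply it to f = φ′):

def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    """Find a root of f by Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x**2 - 2, i.e., sqrt(2).
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)   # 1.41421356...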
Gradient Descent or Ascent
Consider a function φ: Rd → R that is defined and
differentiable in a neighborhood of x∗ . Assume for
purposes of discussion that x∗ is the minimum of φ.
Suppose x0 is a starting point in the neighborhood of
x∗ . Then φ decreases from x0 at the maximum rate in
the direction of the negative gradient of φ(x0 ). This
observation allows one to formulate the recursion:
xn+1 = xn − γn ∇φ(xn ), n ≥ 0. γn is the step size of
the recursion. The gradient at xn is orthogonal to the
tangent plane of φ at xn. The step sizes γn form a decreasing sequence of real numbers that must be chosen carefully for convergence of xn to x∗. Gradient ascent functions
in a similar way with xn+1 = xn + γn ∇φ(xn ), n ≥ 0
where x∗ is the maximum of φ. This algorithm is also
known as steepest descent or steepest ascent.
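A bare-bones sketch with a constant step size (the quadratic objective and the choice γ = 0.1 are assumptions made for the example; in practice the step sizes usually need to be tuned or decreased):

import numpy as np

def gradient_descent(grad, x0, gamma=0.1, max_iter=500, tol=1e-8):
    """Iterate x_{n+1} = x_n - gamma * grad(x_n) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - gamma * g
    return x

# Illustrative objective phi(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimized at (1, -3).
grad_phi = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
print(gradient_descent(grad_phi, x0=[0.0, 0.0]))   # approximately [ 1. -3.]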
Conjugate Gradient Methods
The conjugate gradient method is a recursive numerical method for finding a solution to a system of linear equations whose matrix is symmetric and positive definite. Consider the system of equations Ax = b where A is a d × d symmetric positive definite matrix. Thus AT = A and xT A x > 0 for all non-zero vectors x ∈ Rd. Two nonzero vectors u and v are conjugate with respect to A if uT A v = 0. Define the inner product of u and v with respect to A as ⟨u, v⟩A = uT A v = ⟨AT u, v⟩ = ⟨A u, v⟩ = ⟨u, A v⟩. Thus two vectors are conjugate if they are orthogonal with respect to this inner product. Suppose now that {ui} is a sequence of d mutually conjugate vectors. Then {ui} spans Rd, so that we may write the solution x∗ to the system Ax = b as x∗ = Σ_{i=1}^{d} λi ui. In this case we have b = Ax∗ = Σ_{i=1}^{d} λi A ui, so that ukT b = ukT A x∗ = Σ_{i=1}^{d} λi ukT A ui = λk ukT A uk. Solving for λk, we have
$$\lambda_k = \frac{u_k^T b}{u_k^T A u_k} = \frac{\langle u_k, b\rangle}{\langle u_k, u_k\rangle_A}.$$
Now take φ(x) = ½ xT A x − bT x, x ∈ Rd. This is a quadratic form for which x∗ is the unique minimizer. This suggests that we can use the gradient descent method to find x∗. We take the first direction to be u1 = r0 = b − A x0, where x0 is an initial guess (so u1 = b when x0 = 0). If rk = b − A xk is the residual at the kth iteration, rk is the negative gradient of φ at xk. We take the direction closest to the gradient rk that preserves the conjugacy constraint, uk+1 = rk − (ukT A rk / ukT A uk) uk. The recursive equations are then:
$$\begin{aligned}
\lambda_k &= \frac{r_k^T r_k}{u_k^T A u_k},\\
x_{k+1} &= x_k + \lambda_k u_k,\\
r_{k+1} &= r_k - \lambda_k A u_k,\\
\mu_k &= \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k},\\
u_{k+1} &= r_{k+1} + \mu_k u_k.
\end{aligned}$$
Then iterate this system of equations until rk is sufficiently small.
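The recursion translates almost line for line into code (a sketch; the 2 × 2 test system is made up):

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A via the recursion above."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x            # residual = negative gradient of the quadratic form
    u = r.copy()             # first search direction
    max_iter = max_iter or n
    for _ in range(max_iter):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        lam = rr / (u @ A @ u)
        x = x + lam * u
        r = r - lam * (A @ u)
        mu = (r @ r) / rr
        u = r + mu * u
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite test matrix
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # the two agree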
LINEAR PROGRAMMING
Linear programming is a technique for optimization
of a linear objective function subject to linear equality
and linear inequality constraints. Given a polytope
and a real-valued affine function on the polytope, a
linear programming solution will be a point in the
polytope where that function will have the smallest
(or largest) value. Suppose φ(x) = c1 x1 + c2 x2 + · · · +
cd xd = cT x where vectors c and x are both column
vectors. In the standard maximum problem, we wish
to maximize cT x subject to some constraint of the
form Ax ≤ b. x represents the vector of variables to be
determined, c and b are vectors of known coefficients,
and A is an m × d matrix of known coefficients. In
this case, cT x is the objective function and Ax ≤ b
and x ≥ 0 are the constraints that specify the convex
polytope. The standard minimum problem is to find
y = (y1, . . . , ym) to minimize bT y subject to AT y ≥ c and y ≥ 0.
A vector x in the standard maximum problem
or y in the standard minimum problem is said to be
feasible if it satisfies the corresponding constraints.
The set of feasible vectors is the constraint set. A
linear programming problem is said to be feasible if
the constraint set is not empty; otherwise it is said to be
infeasible. A feasible maximum (minimum) problem
is said to be unbounded if the objective function
can assume arbitrarily large positive (negative) values
at feasible vectors. If a problem is not unbounded
it is said to be bounded. A minimum problem can
be changed to a maximum problem by multiplying
the objective function by −1. Similarly, constraints of the form Σ_{j=1}^{d} aij xj ≥ bi can be changed to the form Σ_{j=1}^{d} (−aij) xj ≤ −bi. A variable xj may not be
restricted to be non-negative. In this case we can
replace xj by the difference of two variables uj − vj
where both are restricted to be non-negative. Thus
corresponding to every maximum problem, called the primal problem, there is a minimum problem which is said to be the dual problem.
The dual of a dual linear problem is the original
primal problem. Also every feasible solution of a
linear problem gives a bound on the optimal value
of objective function of its dual. The weak duality
theorem states that the objective function value of the
dual at any feasible solution is always greater than or
equal to the objective function value of the primal at
any feasible solution. The strong duality theorem says
that if the primal has an optimal solution, x∗ , then
the dual also has an optimal solution, y∗ , such that
cT x∗ = bT y∗ . Finally, if the primal is unbounded, then
the dual is infeasible and if the dual is unbounded,
then the primal is infeasible. It is possible for both the
primal and the dual to be infeasible.
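To see weak and strong duality numerically, here is a small sketch using SciPy (the particular A, b, and c are invented; note that scipy.optimize.linprog minimizes, so the primal maximum is obtained by negating c):

import numpy as np
from scipy.optimize import linprog

# Illustrative primal: maximize c^T x subject to A x <= b, x >= 0.
A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([3.0, 2.0])

primal = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2, method="highs")

# Dual: minimize b^T y subject to A^T y >= c, y >= 0  (written as -A^T y <= -c).
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 2, method="highs")

print(-primal.fun, dual.fun)   # strong duality: the two optimal values coincide (7.2)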
Simplex Algorithm
The simplex algorithm is an algorithm to find a
solution to a linear programming problem and is
due to George Dantzig. We consider the linear
programming problem by maximizing cT x subject
to Ax ≤ b and x ≥ 0. Geometrically each inequality
specifies a half-space in d-dimensional Euclidean space
and their intersection is the set of all feasible values
the variables can assume. The region is either empty,
unbounded, or is a convex polytope. The set of
points on which the objective function obtains the
value v is defined by the hyperplane cT x = v. The
solution to the linear programming problem is found by determining the largest v such that the hyperplane
still intersects the feasible region. As v increases, the
hyperplanes translate in the direction of the vector c.
The hyperplane corresponding to the largest v that still
intersects the feasible region will intersect a vertex, a
whole edge, or a face of the polytope. In the case of an edge or face, the vertices of that edge or face still achieve the optimum value.
Thus the optimum value of the objective function
will always be achieved on one of the vertices of the
polytope.
The simplex algorithm is based on rewriting the
linear programming problem in an augmented form.
The augmented form changes the basic inequalities to
equalities by introducing the so-called slack variables.
In matrix form the problem becomes: maximize Z such that
$$\begin{bmatrix} 1 & -c^T & 0 \\ 0 & A & I \end{bmatrix} \begin{bmatrix} Z \\ x \\ x_s \end{bmatrix} = \begin{bmatrix} 0 \\ b \end{bmatrix} \quad \text{with } x, x_s \ge 0,$$
where x are the variables from the standard form,
xs are the slack variables from the augmentation
process, c contains the optimization coefficients, A
and b describe the system of constraint equations, and
Z is the variable to be optimized.
Suppose in the standard form of the problem there are d variables and m constraints, not counting the d non-negativity constraints. Generally, a vertex of the polytope corresponds to making d of the m + d total constraints tight, while adjacent vertices share d − 1 tight constraints. In the augmented form, this corresponds to setting d of the m + d variables (d original and m slack) to 0. Such a setting of the variables is called a basic solution, and the d variables set to 0 are called the nonbasic variables. The remaining m variables, called the basic variables, are then uniquely determined by the m constraint equations. The
simplex algorithm begins by finding a basic feasible
solution. At each step, one basic and one nonbasic
variable are chosen according to the pivot rule, and
their roles are switched.
Klee and Minty3 developed a linear programming problem in which the polytope P is a distortion
of a d-dimensional cube. In this case, the simplex
method visits all 2^d vertices before arriving at the
optimal vertex. Thus the worst-case complexity for the
simplex algorithm is exponential time. However, the
simplex method is remarkably efficient in practice. The
simplex algorithm has polynomial-time average-case
complexity under various distributions. The computational formulation of the simplex algorithm will
appear in another study.
Interior Point Methods and Karmarkar's Algorithm
Karmarkar’s
algorithm
was
introduced by
Karmarkar4 as a polynomial-time algorithm for
solving linear programming problems. Although
Dantzig’s simplex method usually performs well in
practical problems, as noted above, it can in principle
have exponential complexity. Karmarkar’s algorithm
guarantees polynomial time complexity and is
reasonably efficient in practice. If d is the number
of variables and L is the number of bits input to the
algorithm, the runtime of Karmarkar’s algorithm is
O(d^{7/2} L^2 ln(L) ln[ln(L)]). Karmarkar's algorithm is an
interior point method. That is, the current approximation to the solution does not follow the boundary
of the feasible set as with the simplex method, but
iterates through the interior of the feasible region and
asymptotically approximates the solution. In general,
interior point methods (also called barrier methods)
can be used to solve linear and nonlinear convex optimization problems. A linear programming problem is itself a convex optimization problem, since it minimizes or maximizes a linear function over a convex set. The idea of using interior methods was investigated in the early 1960s for general nonlinear programming problems, but was not pursued when more competitive methods were developed. The
introduction of Karmarkar’s algorithm established
new interest in interior point methods.
Karmarkar’s algorithm is difficult to describe
succinctly. A related interior point, but not polynomial
time algorithm is the affine-scaling algorithm. Pseudocode for the affine-scaling algorithm is given below.
Affine-Scaling Algorithm
Input: A, b, c, a strictly interior starting point x0, a stopping criterion, and a step fraction γ (0 < γ < 1).
k ← 0
Do while the stopping criterion is not satisfied
    vk ← b − A xk
    Dv ← diag(vk1, . . . , vkm)
    hx ← (A^T Dv^{-2} A)^{-1} c
    hv ← −A hx
    If hv ≥ 0, then return unbounded
    End If
    α ← γ · min{−vki/(hv)i | (hv)i < 0, i = 1, . . . , m}
    xk+1 ← xk + α hx
    k ← k + 1
End Do
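A direct Python transcription of this pseudocode (a sketch under the assumptions above; the test problem and the choice γ = 0.5 are illustrative, and the nonnegativity constraints are encoded as extra rows of A):

import numpy as np

def affine_scaling(A, b, c, x0, gamma=0.5, tol=1e-9, max_iter=200):
    """Affine-scaling iteration for: maximize c^T x subject to A x <= b,
    started from a strictly interior point x0 (b - A x0 > 0)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        v = b - A @ x                              # slacks
        D2 = np.diag(1.0 / v**2)                   # D_v^{-2}
        hx = np.linalg.solve(A.T @ D2 @ A, c)      # ascent direction
        hv = -A @ hx                               # induced change in the slacks
        if np.all(hv >= 0):
            raise ValueError("problem is unbounded")
        alpha = gamma * np.min(-v[hv < 0] / hv[hv < 0])
        if alpha * np.linalg.norm(hx) < tol:       # a simple stopping criterion
            break
        x = x + alpha * hx
    return x

A = np.array([[1.0, 2.0], [3.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # x >= 0 as -x <= 0
b = np.array([4.0, 6.0, 0.0, 0.0])
c = np.array([3.0, 2.0])
print(affine_scaling(A, b, c, x0=np.array([0.5, 0.5])))   # approaches the optimum near (1.6, 1.2)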
DYNAMIC PROGRAMMING
Dynamic programming, pioneered by Richard Bellman in the 1940s,5,6 is a method of solving problems that exhibit the properties of overlapping subproblems and optimal substructure. Although on the surface dynamic programming would seem to be related to nonlinear and linear programming, it is actually a technique for improving computational efficiency. Generally the method is substantially more computationally efficient than naïve methods. Optimal substructure means that optimal solutions of subproblems can be used to find the optimal solutions of the overall problem. Three steps are necessary:
1. Decompose the problem into smaller subproblems.
2. Optimally solve the subproblems using the three-step process iteratively.
3. Use the optimal sub-solutions to build an optimal solution for the original problem.
Overlapping subproblems means that the same
subproblems are used to solve many different larger
problems. A computation wastes time if it is required
to recompute optimal solutions to problems already
computed. In order to avoid this, optimal solutions
to problems that have already been solved are saved.
This process is known as memoization. If the problem
needs to be solved later, one may simply look up
the solution that has been saved. Two approaches
are common in dynamic programming: top-down
and bottom-up. With a top-down approach, the
problem is decomposed into subproblems, and these
subproblems are solved and the solutions memoized,
in case they need to be solved again. With the
bottom-up approach, all subproblems are solved in
advance and then used to build up solutions to
larger problems. It is not always very intuitive to
recognize all the subproblems needed for solving the
given problem.
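A minimal sketch of the two approaches (Fibonacci numbers are only a toy example of overlapping subproblems; a real application would memoize a problem-specific recursion):

from functools import lru_cache

# Top-down: recursion plus memoization of already-solved subproblems.
@lru_cache(maxsize=None)
def fib_top_down(n):
    if n < 2:
        return n
    return fib_top_down(n - 1) + fib_top_down(n - 2)

# Bottom-up: solve all smaller subproblems first and build upward.
def fib_bottom_up(n):
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_top_down(40), fib_bottom_up(40))   # both 102334155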
CALCULUS OF VARIATIONS
We only briefly mention calculus of variations. This
area of mathematics is quite involved and this brief
roadmap does not allow for a thorough treatment. Just
as ordinary calculus deals with functions, calculus
of variations deals with functionals. There are
many possible functionals, but a common example
would be integrals of an unknown function and/or
its derivatives. In statistics, some examples might
be L(f) = Π_{i=1}^{n} f(Xi), the likelihood function, where X1, . . . , Xn is a random sample and we desire to optimize L(f), where f is a density and f ∈ L2(R). Of course, without constraints on f, this maximum likelihood problem has no solution. The constraint that f satisfies some isotonic condition is one that yields a maximum likelihood solution. Similarly, if one lets
$$L(f) = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 + \lambda \int_{-\infty}^{\infty} [f''(x)]^2\, dx,$$
minimizing this functional for f ∈ L2(R) yields a penalized least squares solution.
Suppose f is a real-valued, continuous, bounded function on an abstract topological space Ω. Then the supremum norm (sup norm) is ‖f‖∞ = sup{f(ω) : ω ∈ Ω}. A functional L(f) defined on a space of functions with norm ‖·‖ has a weak minimum at f0 if there is a δ > 0 such that if ‖f − f0‖ < δ, then L(f0) ≤ L(f). A weak maximum is defined similarly. This abstract formulation may be replaced with a more typical formulation. If the space of functions consists of the r-times continuously differentiable functions on a compact subset E ⊂ R, then ‖f‖ = Σ_{k=0}^{r} ‖f^(k)‖∞. A functional L(f) has a strong minimum at f0 if there is a δ > 0 such that if ‖f − f0‖∞ < δ, then L(f0) ≤ L(f).
A typical problem might be to find the stationary values of an integral of the form L(f) = ∫_a^b f(y, ẏ, x) dx. L(f) has an extremum only if the Euler–Lagrange differential equation is satisfied, i.e., if
$$\frac{\partial f}{\partial y} - \frac{d}{dx}\frac{\partial f}{\partial \dot y} = 0.$$
The Euler–Lagrange equation plays
a prominent role in classical mechanics and differential
geometry.
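For instance, the equation can be formed symbolically (a sketch using SymPy's euler_equations helper; the arc-length integrand is chosen purely for illustration):

import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
y = sp.Function('y')

# Arc-length functional L(y) = integral of sqrt(1 + y'(x)^2) dx.
integrand = sp.sqrt(1 + y(x).diff(x)**2)

# The Euler-Lagrange equation df/dy - d/dx(df/dy') = 0 for this integrand.
eq = euler_equations(integrand, y(x), x)
print(sp.simplify(eq[0]))   # equivalent to y''(x) = 0, so the extremals are straight lines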
Suppose f is r-times continuously differentiable.
The class of functions that are r times differentiable
on a compact set E is denoted by Cr (E).
The fundamental lemma of the calculus of variations: Let f ∈ Cr[a, b] and let ∫_a^b f(x)h(x) dx = 0 for every h ∈ Cr[a, b] such that h(a) = h(b) = 0. Then f(x) ≡ 0 on (a, b).
This lemma is used to prove that the extremal functions of L(f) = ∫_a^b f(y, ẏ, x) dx are weak solutions of the Euler–Lagrange equation ∂f/∂y − (d/dx)(∂f/∂ẏ) = 0. A
generalization of the calculus of variations is called
Morse Theory.
METAHEURISTIC ALGORITHMS
Much of what has been described above has focused
on intensively mathematically based optimization.
Another class of optimization algorithms is motivated
by natural or industrial processes. Algorithms such
as evolutionary algorithms, genetic algorithms, simulated annealing, and tabu search fall into this class.
We briefly discuss each of these in turn.
Evolutionary Algorithms
Evolutionary algorithms are search methods for optimization that are inspired by the biological notion of
survival of the fittest. A fitness function is normally
hypothesized corresponding to the objective function.
In evolutionary algorithms, a hypothesized solution is
mutated by making a small random change to a single
element of the solution. The offspring solution is
measured against the fitness function and is discarded
if the mutation results in a less fit solution. In
some cases the random change may be replaced
by a purpose-driven change that may or may not
improve the fitness of the solution. Evolutionary
algorithms differ from more traditional mathematical
optimization techniques in that each iteration of
an evolutionary algorithm involves a competitive
selection that eliminates poor solutions. Mutation is
used to generate new solutions that are biased toward
regions of the solution space for which good solutions
have already been witnessed. Genetic algorithms are
closely related to evolutionary algorithms.
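A minimal (1+1)-style sketch of this mutate-and-select loop (the fitness function, mutation scale, and iteration count are illustrative assumptions):

import random

def evolve(fitness, x0, sigma=0.1, generations=2000):
    """Keep a single parent; mutate one coordinate and keep the offspring only if fitter."""
    parent = list(x0)
    best = fitness(parent)
    for _ in range(generations):
        child = list(parent)
        i = random.randrange(len(child))          # mutate a single element
        child[i] += random.gauss(0.0, sigma)
        f = fitness(child)
        if f > best:                              # discard less fit offspring
            parent, best = child, f
    return parent, best

# Illustrative fitness: maximize -(x-2)^2 - (y+1)^2, optimum at (2, -1).
fit = lambda v: -((v[0] - 2) ** 2 + (v[1] + 1) ** 2)
print(evolve(fit, [0.0, 0.0]))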
Genetic Algorithms
As with evolutionary algorithms, genetic algorithms
are an attempt at using the evolutionary laws of
Darwin to formulate an optimization algorithm.
What these algorithms boil down to are two simple
underlying principles. The first principle is that many
structures can be reduced to a collection of binary
strings. The second principle is that transformations
to these strings, similar to evolution, can improve
these structures.
During the 1960s, John Holland from the University of Michigan began to study the applications
of genetic algorithms. In 1975, he published Adaptation in Natural and Artificial Systems which laid the
foundations for genetic algorithms. In his work, Holland considered not only the principle of evolution
through mutation but also through recombination
and crossover. Thus, his algorithm is both complex
and thorough, depending on how the user defines the
operations of evolution.
Genetic algorithms work with binary or bit
strings. A population is randomly generated to
represent potential solutions. Each subject in the
population receives a fitness level depending on
its similarities with the optimum. On the basis of
these fitness levels, mating will induce crossover
and mutations. This process will continue until the
optimum is reached. Thus perhaps a minor distinction
is that evolutionary algorithms focus on mutation,
whereas genetic algorithms mimic chromosomal
recombination. Solutions from one population are
taken and used by merging chromosomes (pieces
of each parent) to form a new population. This is
motivated by the expectation that the new population
will be better than the old one. Solutions which are
selected to form new solutions (offspring) are selected
according to their fitness: the more suitable they are
the more chances they have to reproduce.
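A compact sketch of these steps on bit strings (the "one-max" fitness function and all parameter values are illustrative choices, not taken from the text):

import random

GENES, POP, GENERATIONS, P_MUT = 20, 30, 60, 0.02
fitness = lambda s: sum(s)                    # "one-max": count of 1-bits; optimum is all ones

def select(pop):
    """Fitness-proportional (roulette-wheel) selection of one parent."""
    return random.choices(pop, weights=[fitness(s) + 1 for s in pop], k=1)[0]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    new_pop = []
    for _ in range(POP):
        a, b = select(pop), select(pop)
        cut = random.randrange(1, GENES)      # single-point crossover
        child = a[:cut] + b[cut:]
        child = [bit ^ (random.random() < P_MUT) for bit in child]   # mutation
        new_pop.append(child)
    pop = new_pop

print(max(fitness(s) for s in pop))           # typically close to 20 after a few generations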
One issue with genetic algorithms involves
fitness bias. If one individual has a level of fitness that
is very high, he will consequently breed more often.
As a result, the gene pool of the population shrinks
as it converges to the genome of the fittest individual.
Several methods are implemented to deal with these
problems. One method known as windowing involves
reducing the fitness of all members of a population
by the fitness of the weakest. Essentially removing
the weakest link from society increases the chances of
the fittest to survive and reproduce. An exponential
method that entails taking the square root of all the
fitness levels of the subjects helps to reduce the effects
of any individual with extremely high fitness. Linear
transformations for each fitness value can have the
same effect as the exponential method.
Simulated Annealing
Annealing is defined as a process of heating then
cooling for the purposes of softening or making a
metal less brittle. The logical question that follows
this definition is how this is related to optimization.
The answer is that when a metal is heated, its atoms are
freed from their initial position. In a process of slow
cooling, the atoms settle into a structure of minimum
energy. Thus, simulated annealing is an algorithm for
finding a global minimum in a given system.
Simulated annealing is a probabilistic metaheuristic global optimization algorithm for locating
a good approximation to the global minimum of a
given function in a large search space. For many problems, simulated annealing may be more effective than
exhaustive enumeration provided that the goal is to
find an acceptably good solution in a fixed amount of
time, rather than the best possible solution.
Vo lu me 1, Ju ly /Au gu s t 2009
During each step of the algorithm, the current solution is replaced by a randomly chosen neighboring solution, with an acceptance probability governed by a temperature parameter, T. As the temperature of the system decreases, the probability of a worse (higher-valued) solution replacing the current one decreases, but it is always non-zero. The decrease in probability ensures a gradual decrease in the value of the current minimum. However, the non-zero stipulation allows a higher value to replace the current one. Though this may sound like a flaw in the algorithm, it makes simulated annealing very useful because it allows global minima to be found rather than only local ones. If during the course of the algorithm a certain location (a local minimum) has a lower objective value than its neighbors yet a much higher value than the overall lowest (the global minimum), this non-zero probability stipulation allows the search to back track, in a sense, and become unstuck from local minima.
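A bare-bones sketch of this acceptance rule (the objective function, cooling schedule, and proposal scale are all illustrative assumptions):

import math
import random

def simulated_annealing(f, x0, T0=1.0, cooling=0.995, steps=5000, scale=0.5):
    """Minimize f: accept worse moves with probability exp(-delta / T), with T decreasing."""
    x, fx, T = x0, f(x0), T0
    for _ in range(steps):
        y = x + random.gauss(0.0, scale)              # random neighboring solution
        fy = f(y)
        if fy < fx or random.random() < math.exp(-(fy - fx) / T):
            x, fx = y, fy                             # accept (always if better)
        T *= cooling                                  # gradual cooling
    return x, fx

# Illustrative multimodal objective with global minimum near x = 0.
f = lambda x: x**2 + 10 * math.sin(3 * x) ** 2 * math.exp(-abs(x))
print(simulated_annealing(f, x0=4.0))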
Tabu Search
As with simulated annealing and genetic algorithms,
tabu search is a metaheuristic, a trial-and-error problem-solving technique. Tabu search is
a very promising method in the field of optimization
and is designed for practical applications in many
fields including engineering and business. Tabu was
not a random word choice, but rather chosen to
emphasize that the goal of the search is to avoid
the taboo of wasting time in the wrong direction, as
other algorithms often do. This search implements
a form of memory so the search does not waste
time revisiting previously searched parts or unlikely
locations. This memory is a list of all prior moves. The
adaptive memory of tabu search is what distinguishes
it from genetic algorithms and simulated annealing.
A common saying is that the only way to learn is
through mistakes. This principle helps to define tabu
searches. Rather than making random decisions, tabu
searches can sometimes proceed in a direction of
intentional mistake so as to develop a strategy for
improvement. The next move in a tabu search is
chosen after examining all other possible moves from
a current position. Each move is then assigned a value
based on how optimal it is and the highest value is
typically chosen as the next move.
Similar to the provision in simulated annealing
that allows for a less optimal state to be chosen
so as to avoid local optimality, tabu search works
by examining neighboring configurations using its
memory to avoid reaching a state of local optimality.
Special override conditions are also included in case a
more optimal state has been marked taboo. Because tabu search is a relatively recently developed method, attributed to Glover and Laguna,7 its full implications and implementations remain unknown.
CONCLUSIONS
Of course, applications of optimization abound in
statistical theory, including maximum likelihood, penalized methods, least squares, and minimum
entropy to name just a few. As data sets become larger
and models become more subtle, computational statistical methods rely increasingly on numerical optimization techniques. Moreover, optimization techniques
allow for a much richer class of methodologies for
data analysis often developed under the rubric of data
mining or machine learning. Finally, it should be noted that the optimization algorithms described above as metaheuristic, such as simulated annealing and genetic algorithms, fundamentally involve statistical and probabilistic methods. Thus there is an elegant interplay
between optimization and computational statistics.
REFERENCES
1. Karush W. Minima of Functions of Several Variables
with Inequalities as Side Constraints. M.Sc. Dissertation. Department of Mathematics, University of
Chicago, Chicago, IL; 1939.
2. Kuhn HW, Tucker AW. Nonlinear programming, Proceedings of the Second Berkeley Symposium, University
of California Press, Berkeley; 1951, 481–492.
3. Klee V, Minty GJ. How good is the simplex algorithm?
In Shisha O, eds. Inequalities, III. New York: Academic
Press; 1972, 159–175.
4. Karmarkar N. A new polynomial time algorithm for linear programming. Combinatorica 1984, 4(4):373–395.
5. Bellman R. Dynamic Programming. Princeton, NJ:
Princeton University Press; 1957.
6. Bellman R. Dynamic Programming. Princeton, NJ:
Princeton University Press; 2003, Dover paperback edition.
7. Glover F, Laguna M. Tabu Search. Norwell, MA:
Kluwer; 1997.
Further Reading
Adler I, Karmarkar N, Resende MGC, Veiga G. An Implementation of Karmarkar’s Algorithm for Linear
Programming. Math Progr 1989, 44:297–335.
Arfken G. Calculus of variations. In Mathematical Methods for Physicists, Ch. 17. 3rd ed. Orlando, FL:
Academic Press; 1985, 925–962.
Ashlock D. Evolutionary Computation for Modeling and Optimization. New York, NY: Springer; 2006.
Avriel M. Nonlinear Programming: Analysis and Methods. New York, NY: Dover Publishing; 2003.
Banzhaf W, Nordin P, Keller R, Francone F. Genetic Programming - An Introduction. San Francisco, CA:
Morgan Kaufmann; 1998.
Bliss GA. Calculus of Variations. Chicago, IL: Open Court; 1925.
Boyd S, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.
Cerny V. A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm.
J Optim Theory Appl 1985, 45:41–51.
Cormen TH, Leiserson CE, Rivest RL, Stein C. Section 29.3: The simplex algorithm. In: Introduction to
Algorithms. 3rd ed. Cambridge, MA: MIT Press, New York: McGraw-Hill; 2001, 790–804.
Deuflhard P. Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms, Springer
Series in Computational Mathematics, 35. Berlin: Springer; 2004.
Elster KH. Modern Mathematical Methods of Optimization. Berlin: Akademie Verlag; 1993.
Fiacco AV, McCormick GP. Nonlinear Programming: Sequential Unconstrained Minimization Techniques. New
York, NY: John Wiley & Sons; 1968.
Forsyth AR. Calculus of Variations. New York, NY: Dover Publishing; 1960.
Fox C. An Introduction to the Calculus of Variations. New York, NY: Dover Publications; 1987.
Gärtner B, Matoušek J. Understanding and Using Linear Programming. Berlin: Springer; 2006.
Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA: Kluwer
Academic Publishers; 1989.
Golub GH, Van Loan CF. Matrix Computations. 3rd ed. Chapter 10. Baltimore, MD: Johns Hopkins University
Press; 1996.
Granville V, Krivanek M, Rasson JP. Simulated annealing: a proof of convergence. IEEE Trans Pattern Anal
Mach Intell 1994, 16(6):652–656. doi:10.1109/34.295910.
Hillier F, Lieberman GJ. Introduction to Operations Research. 8th ed. New York: McGraw-Hill; 2005.
Holland JH. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press; 1975
Reprinted in 1992 by Cambridge MA: MIT Press.
Isenberg C. The Science of Soap Films and Soap Bubbles. New York, NY: Dover Publications; 1992.
Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. New Series 1983,
220(4598):671–680.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equations of state calculations by fast
computing machines. J Chem Phys 1953, 21(6):1087–1092.
Nocedal J, Wright SJ. Numerical Optimization. New York, NY: Springer; 2006.
Papalambros PY, Wilde DJ. Principles of Optimal Design : Modeling and Computation. Cambridge: Cambridge
University Press; 2000.
Sagan H. Introduction to the Calculus of Variations. New York, NY: Dover publications; 1992.
Said YH. On genetic algorithms and their applications. In: Rao CR, Wegman EJ, Solka JL, eds. Handbook on
Statistics: Data Mining and Data Visualization 24. Amsterdam: Elsevier North Holland; 2005, 359–390.
Schrijver A. Theory of Linear and Integer Programming. New York, NY: John Wiley & Sons; 1998.
Snyman JA. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical
and New Gradient-Based Algorithms. New York, NY: Springer; 2005.
Taha HA. Operations Research: An Introduction. 8th ed. Saddle River, NJ: Prentice Hall; 2007.
Vapnyarskii IB. Lagrange multipliers. In: Hazewinkel M, ed. Encyclopaedia of Mathematics. Boston, MA:
Kluwer Academic Publishers; 2001.
Wright S. Primal-Dual Interior-Point Methods. Philadelphia, PA: SIAM; 1997.
Yang XS. Introduction to Mathematical Optimization: From Linear Programming to Metaheuristics.
Cambridge: Cambridge International Science Publishing; 2008.
Ypma T. Historical development of the Newton-Raphson method. SIAM Rev 1995 37(4):531–551.
doi:10.1137/1037125