
Roadmap for Optimization

2009, Wiley Interdisciplinary Reviews: Computational Statistics

This article focuses broadly on the area known as optimization. The intent is to provide in broad brush strokes a perspective on the area in order to orient the reader to more detailed treatments of specific subdisciplines of optimization throughout WIREs: Computational Statistics. In this article we provide background on mathematical programming, Lagrange multipliers, Karush–Kuhn–Tucker conditions, numerical optimization methods, linear programming, dynamic programming, the calculus of variations, and metaheuristic algorithms.

Yasmin H. Said1,2 and Edward J. Wegman1

1 Center for Computational Data Sciences, George Mason University, Fairfax, VA 22030, USA
2 Department of Statistics, Oklahoma State University, Stillwater, OK 74078, USA
Correspondence to: [email protected]
WIREs Comp Stat 2009, 1:3–11. DOI: 10.1002/wics.016

INTRODUCTION

Optimization is a simple concept that involves a quantity that is to be either minimized or maximized. In trying to optimize this quantity, certain parameters, sometimes called decision variables, are subject to constraints. In general, an optimization problem comprises three parts: the objective function, the variables, and the constraints. The objective function is the function to be optimized. The variables or parameters involved define the model. Finally, the constraints determine the range of allowed values for the unknowns.

As taught in calculus, stationary points of smooth functions are points at which the derivative is equal to zero. In practical terms, these points represent maxima, minima, or saddle points. According to the Weierstrass extreme value theorem, a continuous function on a closed and bounded domain attains both a maximum and a minimum. The goal of optimization is to find these points when they exist.

The simplest case of mathematical optimization is mathematical programming. Mathematical programming considers as its objective function a scalar real-valued function of real variables. Given a function φ : S → R, where R is the real line, the optimization problem is to find x0 such that x0 ∈ S and φ(x0) ≤ φ(x) for every x ∈ S (minimization) or φ(x0) ≥ φ(x) for every x ∈ S (maximization). S is usually a subset of R^d, the d-dimensional Euclidean space, which may be determined by a set of constraints. The domain S of φ is called the search space, and the elements of S are called feasible solutions. The function φ is called the objective function in the operations research literature; in other literatures it is called an energy function (physics) or a cost function (decision or game theory).

In the simplest cases, φ is a twice-differentiable convex function on R^d with no constraints on S, in which case a minimum may be determined by setting the first partial derivatives of φ, ∂φ/∂x_i, i = 1, ..., d, equal to 0 and solving for x; or φ is a twice-differentiable concave function, in which case a maximum may be determined similarly. In general, the objective function will be neither convex nor concave, so there may be a number of local maxima or minima.

For problems with twice-differentiable objective functions and no constraints, the solutions can be found among the stationary points of φ, the points where the gradient of the objective function is zero, i.e., ∇φ = 0. The Hessian matrix is used to classify each stationary point. The Hessian matrix is

\[
H(\varphi) =
\begin{pmatrix}
\dfrac{\partial^2 \varphi}{\partial x_1^2} & \dfrac{\partial^2 \varphi}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 \varphi}{\partial x_1 \partial x_d} \\
\dfrac{\partial^2 \varphi}{\partial x_2 \partial x_1} & \dfrac{\partial^2 \varphi}{\partial x_2^2} & \cdots & \dfrac{\partial^2 \varphi}{\partial x_2 \partial x_d} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 \varphi}{\partial x_d \partial x_1} & \dfrac{\partial^2 \varphi}{\partial x_d \partial x_2} & \cdots & \dfrac{\partial^2 \varphi}{\partial x_d^2}
\end{pmatrix}.
\]

If the Hessian is positive definite, the point is a local minimum. If the Hessian is negative definite, the point is a local maximum, and if the Hessian is indefinite, the point is a saddle point.
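To make the classification concrete, the following minimal sketch (assuming NumPy is available; the two-variable example function, its derivatives, and the tolerance are illustrative choices, not part of the article) evaluates the gradient and Hessian at a candidate point and classifies it by the signs of the Hessian eigenvalues.

import numpy as np

# Illustrative objective: phi(x, y) = x**2 - y**2 + x*y, which has a saddle at the origin.
def grad(p):
    x, y = p
    return np.array([2 * x + y, -2 * y + x])

def hessian(p):
    # The Hessian is constant for this quadratic example.
    return np.array([[2.0, 1.0], [1.0, -2.0]])

def classify(p, tol=1e-8):
    if np.linalg.norm(grad(p)) > tol:
        return "not a stationary point"
    eig = np.linalg.eigvalsh(hessian(p))   # symmetric Hessian -> real eigenvalues
    if np.all(eig > tol):
        return "local minimum"             # positive definite
    if np.all(eig < -tol):
        return "local maximum"             # negative definite
    return "saddle point or degenerate"    # indefinite (or singular)

print(classify(np.array([0.0, 0.0])))      # saddle point for this example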
Because the existence of derivatives may not always be assumed, many other methods, depending on the smoothness of the objective function, have been developed. Some of the more well-known methods include conjugate gradient methods, steepest descent methods, interior point methods, Newton's method, quasi-Newton methods, simplex methods, evolutionary algorithms, genetic algorithms, simulated annealing, and tabu search. Constrained optimization problems can often be transformed into unconstrained problems with the help of Lagrange multipliers. This roadmap briefly discusses some of these methods.

LAGRANGE MULTIPLIERS AND KARUSH–KUHN–TUCKER CONDITIONS

In the Introduction, simple multivariable calculus was the tool for optimizing a twice-differentiable function. In nonlinear programming, however, that function may have to be optimized subject to constraints. One method for doing this uses Lagrange multipliers. Consider an objective function φ(x) and a set of constraints gk(x) = 0, k = 1, ..., m. The domain of φ is an open set that contains all points satisfying the constraints. The functions φ and gk must have continuous first partial derivatives on the domain, and the gradients of the gk must not vanish on the domain. The Lagrangian, L, is

\[
L(x, \lambda) = \varphi(x) + \sum_{k=1}^{m} \lambda_k g_k(x),
\]

where λ = (λ1, ..., λm). Notice that ∂L/∂x_i = 0 if and only if ∂φ/∂x_i = −Σ_{k=1}^{m} λ_k ∂g_k/∂x_i for i = 1, ..., d. Furthermore, ∂L/∂λ_k = 0 implies g_k = 0. Thus these new normal equations, ∂L/∂x_i = 0, i = 1, ..., d, and ∂L/∂λ_k = 0, k = 1, ..., m, yield d + m simultaneous equations. The stationary points of this system determine the candidate solutions of the constrained optimization problem. As before, computing the Hessian indicates whether a stationary point is a maximum or a minimum.
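As a small illustration, the sketch below (assuming SymPy is available; the toy objective and single constraint are invented for the example) forms the Lagrangian, generates the d + m normal equations, and solves them symbolically.

import sympy as sp

# Hypothetical toy problem: minimize phi(x, y) = x**2 + y**2
# subject to g(x, y) = x + y - 1 = 0.
x, y, lam = sp.symbols('x y lambda', real=True)
phi = x**2 + y**2
g = x + y - 1

L = phi + lam * g                                    # the Lagrangian
normal_eqs = [sp.diff(L, v) for v in (x, y, lam)]    # d + m = 3 normal equations
sol = sp.solve(normal_eqs, (x, y, lam), dict=True)
print(sol)   # x = 1/2, y = 1/2, lambda = -1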
Consider now a generalization of the constrained optimization problem given above. Let φ(x) be the objective function. We wish to minimize φ(x) subject to the constraints gk(x) ≤ 0, k = 1, ..., m, and hj(x) = 0, j = 1, ..., l; the gk(x) are the inequality constraints and the hj(x) are the equality constraints. William Karush1 first published the necessary conditions for this general equality–inequality constrained problem in his Master's thesis. However, it was not until Harold Kuhn and Albert Tucker2 rediscovered and popularized the necessary conditions in the Proceedings of the Second Berkeley Symposium that the power of the method was appreciated. The necessary conditions are now known as the Karush–Kuhn–Tucker conditions.

Let φ: R^d → R, gk: R^d → R, k = 1, ..., m, and hj: R^d → R, j = 1, ..., l, and assume these m + l + 1 functions are continuously differentiable at a point x∗. If x∗ is a local minimum that satisfies regularity conditions to be specified shortly, then there exist constants µk, k = 1, ..., m, and λj, j = 1, ..., l, such that

\[
\nabla \varphi(x^*) + \sum_{k=1}^{m} \mu_k \nabla g_k(x^*) + \sum_{j=1}^{l} \lambda_j \nabla h_j(x^*) = 0 \quad \text{(stationarity)},
\]
\[
g_k(x^*) \le 0,\ k = 1, \ldots, m, \qquad h_j(x^*) = 0,\ j = 1, \ldots, l \quad \text{(primal feasibility)},
\]
\[
\mu_k \ge 0,\ k = 1, \ldots, m \quad \text{(dual feasibility)},
\]
\[
\mu_k\, g_k(x^*) = 0,\ k = 1, \ldots, m \quad \text{(complementary slackness)}.
\]

The regularity conditions for a minimum point to satisfy the Karush–Kuhn–Tucker conditions are given below.

LICQ (linear independence constraint qualification): the gradients of the active inequality constraints and of the equality constraints are linearly independent at x∗.

MFCQ (Mangasarian–Fromovitz constraint qualification): the gradients of the active inequality constraints and the gradients of the equality constraints are positive-linearly independent at x∗.

CRCQ (constant rank constraint qualification): for each subset of the gradients of the active inequality constraints and of the equality constraints, the rank is constant in a neighborhood of x∗.

CPLD (constant positive linear dependence constraint qualification): for each subset of the gradients of the active inequality constraints and of the equality constraints, if it is positive-linearly dependent at x∗, then it is positive-linearly dependent in a neighborhood of x∗.

Slater condition: for a convex problem, there exists a point x such that hj(x) = 0 for all j and gk(x) < 0 for all constraints gk active at x∗.

Linearity: if φ and the gk are affine functions, then no other condition is needed to ensure that the minimum point satisfies the Karush–Kuhn–Tucker conditions.

It is known that LICQ ⇒ MFCQ ⇒ CPLD and that LICQ ⇒ CRCQ ⇒ CPLD; the converses are not true, and MFCQ is not equivalent to CRCQ. The Karush–Kuhn–Tucker conditions are widely used in nonlinear programming. One prominent recent example is the constrained nonlinear optimization used in the algorithm for determining the support vector machine.

NUMERICAL OPTIMIZATION METHODS

Often, because of the nature of the functions involved, a purely analytic solution is not available or is hard to compute. For example, for the simultaneous equations arising from the Lagrange multipliers or the Karush–Kuhn–Tucker conditions, an easy analytic solution may not be feasible. In such cases, one turns to numerical methods. We briefly describe three such methods.

Newton's Method

Also known as the Newton–Raphson method, Newton's method is a simple iterative procedure for finding the roots (zeros) of a function. Because in optimization we wish to find the stationary points of a function, say φ(x), we are interested in finding the zeros of its first derivative, dφ/dx = φ′(x). For the sake of discussion, consider finding the roots of a function f, and suppose the root is actually x∗. Suppose x0 is a first guess at the root. Then the tangent at x0 satisfies f′(x0) = (f(x0) − 0)/(x0 − x1), where x1 is the location of the intersection of the tangent line with the x-axis. Applied recursively, this gives f′(xn) = (f(xn) − 0)/(xn − xn+1), and solving for xn+1 yields the Newton's method formula

\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.
\]

If f is twice continuously differentiable and f′(x∗) ≠ 0, then there is a neighborhood C(x∗) of x∗ such that for all starting values x0 ∈ C(x∗), the sequence xn → x∗ as n → ∞.
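The iteration translates directly into a few lines of code. The sketch below (plain Python; the example function and starting value are illustrative) finds a root of f; for optimization one would apply the same iteration to φ′, supplying φ″ as its derivative.

def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Illustrative use: the positive root of f(x) = x**2 - 2 is sqrt(2).
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)   # approximately 1.4142135623...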
Gradient Descent or Ascent

Consider a function φ: R^d → R that is defined and differentiable in a neighborhood of x∗, and assume for purposes of discussion that x∗ is the minimum of φ. Suppose x0 is a starting point in the neighborhood of x∗. Then φ decreases from x0 at the maximum rate in the direction of the negative gradient of φ at x0. This observation allows one to formulate the recursion

\[
x_{n+1} = x_n - \gamma_n \nabla \varphi(x_n), \quad n \ge 0,
\]

where γn is the step size of the recursion. The gradient at xn is orthogonal to the level set of φ through xn. The step size γn is a decreasing sequence of real numbers that must be chosen carefully for xn to converge to x∗. Gradient ascent works in a similar way, with xn+1 = xn + γn∇φ(xn), n ≥ 0, where x∗ is the maximum of φ. This algorithm is also known as steepest descent or steepest ascent.

Conjugate Gradient Methods

The conjugate gradient method is a recursive numerical method for finding a solution to a system of linear equations whose matrix is symmetric and positive definite. Consider the system of equations Ax = b, where A is a d × d symmetric positive definite matrix; thus A^T = A and x^T A x > 0 for all nonzero vectors x ∈ R^d. Two nonzero vectors u and v are conjugate with respect to A if u^T A v = 0. Define the inner product of u and v with respect to A as ⟨u, v⟩_A = u^T A v = ⟨A^T u, v⟩ = ⟨Au, v⟩ = ⟨u, Av⟩. Thus two vectors are conjugate if they are orthogonal with respect to this inner product.

Suppose now that {u_i} is a sequence of d mutually conjugate vectors. Then {u_i} spans R^d, so we may write the solution x∗ of the system Ax = b as x∗ = Σ_{i=1}^{d} λ_i u_i. In this case, b = Ax∗ = Σ_{i=1}^{d} λ_i A u_i, so that u_k^T b = u_k^T A x∗ = Σ_{i=1}^{d} λ_i u_k^T A u_i = λ_k u_k^T A u_k. Solving for λ_k gives

\[
\lambda_k = \frac{u_k^{T} b}{u_k^{T} A u_k} = \frac{\langle u_k, b \rangle}{\langle u_k, u_k \rangle_A}.
\]

Now take φ(x) = (1/2) x^T A x − b^T x, x ∈ R^d. This is a quadratic form for which x∗ is the unique minimizer, which suggests using gradient descent to find x∗. Take the first search direction to be the initial residual, u_0 = r_0 = b − A x_0. If r_k = b − A x_k is the residual at the kth iteration, then r_k is the negative gradient of φ at x_k. We take the direction closest to the gradient r_k that preserves the conjugacy constraint, u_{k+1} = r_k − (u_k^T A r_k / u_k^T A u_k) u_k. The recursive equations are then

\[
\lambda_k = \frac{r_k^{T} r_k}{u_k^{T} A u_k}, \qquad
x_{k+1} = x_k + \lambda_k u_k, \qquad
r_{k+1} = r_k - \lambda_k A u_k, \qquad
\mu_k = \frac{r_{k+1}^{T} r_{k+1}}{r_k^{T} r_k}, \qquad
u_{k+1} = r_{k+1} + \mu_k u_k.
\]

One iterates this system of equations until r_k is sufficiently small.
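The recursions above carry over almost line for line into code. The following minimal sketch (assuming NumPy is available; the small test system is purely illustrative) applies the conjugate gradient iteration to a symmetric positive definite system.

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A by conjugate gradients."""
    d = len(b)
    x = np.zeros(d) if x0 is None else x0.astype(float)
    r = b - A @ x                  # residual = negative gradient of the quadratic form
    u = r.copy()                   # first search direction
    max_iter = max_iter or d
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        lam = (r @ r) / (u @ A @ u)
        x = x + lam * u
        r_new = r - lam * (A @ u)
        mu = (r_new @ r_new) / (r @ r)
        u = r_new + mu * u
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))            # approximately [0.0909, 0.6364]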
LINEAR PROGRAMMING

Linear programming is a technique for optimization of a linear objective function subject to linear equality and linear inequality constraints. Given a polytope and a real-valued affine function on the polytope, a linear programming solution is a point in the polytope where that function attains its smallest (or largest) value. Suppose φ(x) = c1 x1 + c2 x2 + · · · + cd xd = c^T x, where c and x are both column vectors. In the standard maximum problem, we wish to maximize c^T x subject to constraints of the form Ax ≤ b. Here x represents the vector of variables to be determined, c and b are vectors of known coefficients, and A is an m × d matrix of known coefficients. In this case, c^T x is the objective function, and Ax ≤ b and x ≥ 0 are the constraints that specify the convex polytope. The standard minimum problem is to find y = (y1, ..., yd) to minimize b^T y subject to Ay ≥ c.

A vector x in the standard maximum problem, or y in the standard minimum problem, is said to be feasible if it satisfies the corresponding constraints. The set of feasible vectors is the constraint set. A linear programming problem is said to be feasible if the constraint set is not empty; otherwise it is said to be infeasible. A feasible maximum (minimum) problem is said to be unbounded if the objective function can assume arbitrarily large positive (negative) values at feasible vectors; if a problem is not unbounded, it is said to be bounded. A minimum problem can be changed to a maximum problem by multiplying the objective function by −1. Similarly, constraints of the form Σ_{j=1}^{d} a_{ij} x_j ≥ b_i can be changed to the form Σ_{j=1}^{d} (−a_{ij}) x_j ≤ −b_i. A variable x_j may not be restricted to be non-negative; in this case we can replace x_j by the difference of two variables, u_j − v_j, where both are restricted to be non-negative.

Corresponding to every maximum problem, called the primal problem, there is a corresponding minimum problem, said to be the dual problem. The dual of the dual linear problem is the original primal problem. Moreover, every feasible solution of a linear problem gives a bound on the optimal value of the objective function of its dual. The weak duality theorem states that the objective function value of the dual at any feasible solution is always greater than or equal to the objective function value of the primal at any feasible solution. The strong duality theorem says that if the primal has an optimal solution x∗, then the dual also has an optimal solution y∗ such that c^T x∗ = b^T y∗. Finally, if the primal is unbounded, then the dual is infeasible, and if the dual is unbounded, then the primal is infeasible. It is possible for both the primal and the dual to be infeasible.

Simplex Algorithm

The simplex algorithm, due to George Dantzig, is an algorithm for finding a solution to a linear programming problem. We consider the linear programming problem of maximizing c^T x subject to Ax ≤ b and x ≥ 0. Geometrically, each inequality specifies a half-space in d-dimensional Euclidean space, and their intersection is the set of all feasible values the variables can assume. The region is either empty, unbounded, or a convex polytope. The set of points on which the objective function attains the value v is defined by the hyperplane c^T x = v. The linear programming problem is solved by finding the largest v such that the hyperplane still intersects the feasible region. As v increases, the hyperplanes translate in the direction of the vector c. The hyperplane corresponding to the largest v that still intersects the feasible region will intersect a vertex, an entire edge, or a face of the polytope; in the case of an edge or face, the endpoints of the edge or face still achieve the optimum value. Thus the optimum value of the objective function is always achieved at one of the vertices of the polytope.

The simplex algorithm is based on rewriting the linear programming problem in an augmented form. The augmented form changes the basic inequalities to equalities by introducing so-called slack variables. In matrix form, the problem becomes: maximize Z such that

\[
\begin{pmatrix} 1 & -c^{T} & 0 \\ 0 & A & I \end{pmatrix}
\begin{pmatrix} Z \\ x \\ x_s \end{pmatrix}
=
\begin{pmatrix} 0 \\ b \end{pmatrix},
\qquad x, x_s \ge 0,
\]

where x are the variables from the standard form, x_s are the slack variables from the augmentation process, c contains the optimization coefficients, A and b describe the system of constraint equations, and Z is the variable to be optimized.
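As a concrete illustration of solving a small standard maximum problem in this form, the sketch below uses SciPy's linprog routine (an assumption that SciPy is available; linprog chooses its own solver internally rather than the textbook simplex tableau described here). Because linprog minimizes by default, the objective is negated; its default bounds already impose x ≥ 0. The coefficients are made up for the example.

import numpy as np
from scipy.optimize import linprog

# Illustrative standard maximum problem:
#   maximize 3*x1 + 2*x2  subject to  x1 + x2 <= 4,  x1 + 3*x2 <= 6,  x >= 0.
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])

res = linprog(-c, A_ub=A, b_ub=b)      # minimize -c^T x  <=>  maximize c^T x
print(res.x, -res.fun)                 # expected: x = [4, 0], objective value 12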
Suppose in the standard form of the problem there are d variables and m constraints, not counting the non-negativity constraints. Generally, a vertex of the polytope corresponds to making d of the m + d total constraints tight, and adjacent vertices share d − 1 tight constraints. In the augmented form, this corresponds to setting d of the m + d variables (d original and m slack) to 0. Such a setting of the variables is called a basic solution; the d variables set to 0 are called the nonbasic variables, and the remaining m variables, the basic variables, are then uniquely determined by the m constraint equations. The simplex algorithm begins by finding a basic feasible solution. At each step, one basic and one nonbasic variable are chosen according to the pivot rule, and their roles are switched.

Klee and Minty3 developed a linear programming problem in which the polytope P is a distortion of a d-dimensional cube. In this case, the simplex method visits all 2^d vertices before arriving at the optimal vertex. Thus the worst-case complexity of the simplex algorithm is exponential time. However, the simplex method is remarkably efficient in practice, and it has polynomial-time average-case complexity under various distributions. The computational formulation of the simplex algorithm will appear in another study.

Interior Point Methods and Karmarkar's Algorithm

Karmarkar's algorithm was introduced by Karmarkar4 as a polynomial-time algorithm for solving linear programming problems. Although Dantzig's simplex method usually performs well in practical problems, as noted above, it can in principle have exponential complexity. Karmarkar's algorithm guarantees polynomial time complexity and is reasonably efficient in practice. If d is the number of variables and L is the number of bits of input to the algorithm, the runtime of Karmarkar's algorithm is O(d^{7/2} L^2 ln(L) ln[ln(L)]).

Karmarkar's algorithm is an interior point method. That is, the current approximation to the solution does not follow the boundary of the feasible set, as with the simplex method, but iterates through the interior of the feasible region and asymptotically approaches the solution. In general, interior point methods (also called barrier methods) can be used to solve linear and nonlinear convex optimization problems; a linear programming problem is itself a convex optimization problem, since it minimizes or maximizes a linear function over a convex set. The idea of using interior methods was investigated in the early 1960s for general nonlinear programming problems but was not pursued when more competitive methods were developed. The introduction of Karmarkar's algorithm established new interest in interior point methods.

Karmarkar's algorithm is difficult to describe succinctly. A related interior point, but not polynomial time, algorithm is the affine-scaling algorithm. Pseudocode for the affine-scaling algorithm is given below.

Affine-Scaling Algorithm
Input: A, b, c, stopping criterion, γ
k ← 0
Do while stopping criterion is not satisfied
    v_k ← b − A x_k
    D_v ← diag(v_k1, ..., v_km)
    h_x ← (A^T D_v^{-2} A)^{-1} c
    h_v ← −A h_x
    If h_v ≥ 0, then return unbounded
    End If
    α ← γ · min{ −v_ki / (h_v)_i | (h_v)_i < 0, i = 1, ..., m }
    x_{k+1} ← x_k + α h_x
    k ← k + 1
End Do
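The pseudocode maps onto a short routine. Below is a minimal sketch (assuming NumPy is available; the strictly feasible starting point, stopping rule, step fraction γ = 0.5, and test problem are illustrative additions, not part of the pseudocode). Non-negativity constraints are included as extra rows of A so that the problem is in the Ax ≤ b form the iteration expects.

import numpy as np

def affine_scaling(A, b, c, x0, gamma=0.5, tol=1e-8, max_iter=200):
    """Affine-scaling iteration for: maximize c^T x subject to A x <= b.
    x0 must be strictly feasible (A x0 < b)."""
    x = x0.astype(float)
    for _ in range(max_iter):
        v = b - A @ x                          # slack vector, strictly positive
        Dv_inv2 = np.diag(1.0 / v**2)          # D_v^{-2}
        h_x = np.linalg.solve(A.T @ Dv_inv2 @ A, c)
        h_v = -A @ h_x
        if np.all(h_v >= 0):
            raise ValueError("problem is unbounded")
        neg = h_v < 0
        alpha = gamma * np.min(-v[neg] / h_v[neg])
        if alpha * np.linalg.norm(h_x) < tol:  # simple stopping criterion
            break
        x = x + alpha * h_x
    return x

# Same illustrative problem as before: maximize 3*x1 + 2*x2 subject to
# x1 + x2 <= 4, x1 + 3*x2 <= 6, and x >= 0 (added as -x <= 0 rows).
A = np.array([[1.0, 1.0], [1.0, 3.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([4.0, 6.0, 0.0, 0.0])
c = np.array([3.0, 2.0])
print(affine_scaling(A, b, c, x0=np.array([1.0, 1.0])))   # tends toward [4, 0]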
DYNAMIC PROGRAMMING

Dynamic programming, pioneered by Richard Bellman in the 1940s,5,6 is a method of solving problems that exhibit the properties of overlapping subproblems and optimal substructure. Although on the surface dynamic programming would seem to be related to nonlinear and linear programming, it is really a strategy for organizing a computation efficiently; generally the method is substantially more computationally efficient than naïve methods.

Optimal substructure means that optimal solutions of subproblems can be used to find the optimal solutions of the overall problem. Three steps are necessary:

1. Decompose the problem into smaller subproblems.
2. Optimally solve the subproblems, applying this three-step process recursively.
3. Use the optimal sub-solutions to build an optimal solution for the original problem.

Overlapping subproblems means that the same subproblems are used to solve many different larger problems. A computation wastes time if it is required to recompute optimal solutions to problems already solved. To avoid this, optimal solutions to subproblems that have already been solved are saved; this process is known as memoization. If the subproblem needs to be solved again later, one simply looks up the saved solution. Two approaches are common in dynamic programming: top-down and bottom-up. With the top-down approach, the problem is decomposed into subproblems, and these subproblems are solved and their solutions memoized in case they are needed again. With the bottom-up approach, all subproblems are solved in advance and then used to build up solutions to larger problems. It is not always intuitive to recognize all the subproblems needed for solving a given problem. A small sketch of memoization is given below.
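As a minimal illustration of top-down memoization (the Fibonacci recurrence is a stand-in example, not a problem discussed above), caching each subproblem's solution turns an exponential-time recursion into a linear-time one.

from functools import lru_cache

@lru_cache(maxsize=None)        # memoization: save each subproblem's solution
def fib(n):
    """Top-down dynamic programming: each fib(k) is computed only once."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(60))   # 1548008755920, computed with about 60 subproblem evaluations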
CALCULUS OF VARIATIONS

We only briefly mention the calculus of variations; this area of mathematics is quite involved, and this brief roadmap does not allow for a thorough treatment. Just as ordinary calculus deals with functions, the calculus of variations deals with functionals. There are many possible functionals, but a common example is an integral of an unknown function and/or its derivatives. In statistics, one example is the likelihood functional L(f) = Π_{i=1}^{n} f(X_i), where X_1, ..., X_n is a random sample and we wish to optimize L(f) over densities f ∈ L2(R). Of course, without constraints on f, this maximum likelihood problem has no solution; the constraint that f satisfies some isotonic restriction is one that yields a maximum likelihood solution. Similarly, if one lets

\[
L(f) = \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \int_{-\infty}^{\infty} [f''(x)]^2 \, dx,
\]

minimizing this functional over f ∈ L2(R) yields a penalized least squares solution.

Suppose f is a real-valued, continuous, bounded function on an abstract topological space Ω. Then the supremum norm (sup norm) is ‖f‖_∞ = sup{|f(ω)| : ω ∈ Ω}. A functional L(f) defined on a space of functions with norm ‖·‖ has a weak minimum at f0 if there is a δ > 0 such that if ‖f − f0‖ < δ, then L(f0) ≤ L(f); a weak maximum is defined similarly. This abstract formulation may be replaced with a more typical one. If the space is the set of r times continuously differentiable functions on a compact subset E ⊂ R, then ‖f‖ = Σ_{k=0}^{r} ‖f^{(k)}‖_∞. A functional L(f) has a strong minimum at f0 if there is a δ > 0 such that if ‖f − f0‖_∞ < δ, then L(f0) ≤ L(f).

A typical problem is to find the stationary values of an integral of the form L(y) = ∫_a^b f(y, ẏ, x) dx. L(y) has an extremum only if the Euler–Lagrange differential equation is satisfied, i.e., if

\[
\frac{\partial f}{\partial y} - \frac{d}{dx}\,\frac{\partial f}{\partial \dot{y}} = 0.
\]

The Euler–Lagrange equation plays a prominent role in classical mechanics and differential geometry. The class of functions that are r times continuously differentiable on a compact set E is denoted by C^r(E). The fundamental lemma of the calculus of variations states: let f ∈ C^r[a, b] and let ∫_a^b f(x) h(x) dx = 0 for every h ∈ C^r[a, b] such that h(a) = h(b) = 0; then f(x) ≡ 0 on (a, b). This lemma is used to prove that the extremal functions of L(y) = ∫_a^b f(y, ẏ, x) dx are weak solutions of the Euler–Lagrange equation. A generalization of the calculus of variations is called Morse theory.

METAHEURISTIC ALGORITHMS

Much of what has been described above has focused on intensively mathematical approaches to optimization. Another class of optimization algorithms is motivated by natural or industrial processes. Algorithms such as evolutionary algorithms, genetic algorithms, simulated annealing, and tabu search fall into this class. We briefly discuss each of these in turn.

Evolutionary Algorithms

Evolutionary algorithms are search methods for optimization that are inspired by the biological notion of survival of the fittest. A fitness function corresponding to the objective function is normally hypothesized. In evolutionary algorithms, a candidate solution is mutated by making a small random change to a single element of the solution. The offspring solution is measured against the fitness function and is discarded if the mutation results in a less fit solution. In some cases the random change may be replaced by a purpose-driven change that may or may not improve the fitness of the solution. Evolutionary algorithms differ from more traditional mathematical optimization techniques in that each iteration of an evolutionary algorithm involves a competitive selection that eliminates poor solutions. Mutation is used to generate new solutions that are biased toward regions of the solution space in which good solutions have already been seen. Genetic algorithms are closely related to evolutionary algorithms; a minimal mutation–selection sketch follows.
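The mutate–evaluate–select loop just described can be sketched in a few lines (plain Python with the standard random module; the real-valued solution vector, the sphere objective, and all parameter values are illustrative assumptions). Here the fitness is simply the objective value, and smaller is better.

import random

def fitness(x):
    # Illustrative objective to minimize: the sphere function.
    return sum(xi**2 for xi in x)

def evolve(dim=5, generations=2000, step=0.1):
    """(1+1)-style evolutionary search: mutate one element, keep the offspring only if it is fitter."""
    current = [random.uniform(-5, 5) for _ in range(dim)]
    for _ in range(generations):
        child = current[:]
        i = random.randrange(dim)                  # small random change to a single element
        child[i] += random.gauss(0, step)
        if fitness(child) <= fitness(current):     # competitive selection
            current = child
    return current

best = evolve()
print(best, fitness(best))   # fitness should be close to 0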
Genetic Algorithms

As with evolutionary algorithms, genetic algorithms are an attempt to use the evolutionary principles of Darwin to formulate an optimization algorithm. These algorithms rest on two simple underlying ideas. The first is that many structures can be reduced to collections of binary strings. The second is that transformations of these strings, analogous to evolution, can improve the structures. During the 1960s, John Holland of the University of Michigan began to study applications of genetic algorithms, and in 1975 he published Adaptation in Natural and Artificial Systems, which laid the foundations for genetic algorithms. In his work, Holland considered not only evolution through mutation but also evolution through recombination and crossover. Thus his algorithm is both complex and thorough, depending on how the user defines the operations of evolution.

Genetic algorithms work with binary or bit strings. A population is randomly generated to represent potential solutions. Each member of the population receives a fitness level depending on how close it is to the optimum. On the basis of these fitness levels, mating induces crossover and mutation, and the process continues until the optimum is reached. Thus, perhaps a minor distinction is that evolutionary algorithms focus on mutation, whereas genetic algorithms also mimic chromosomal recombination: solutions from one population are used to form a new population by merging chromosomes (pieces of each parent), in the expectation that the new population will be better than the old one. Solutions selected to form new solutions (offspring) are selected according to their fitness: the more suitable they are, the more chances they have to reproduce.

One issue with genetic algorithms is fitness bias. If one individual has a very high fitness level, it will breed more often; as a result, the gene pool of the population shrinks as it converges to the genome of the fittest individual. Several methods are used to deal with this problem. One method, known as windowing, reduces the fitness of all members of a population by the fitness of the weakest member; essentially, removing the weakest link increases the chances of the fittest to survive and reproduce. An exponential method, which takes the square root of all the fitness levels, helps to reduce the effect of any individual with extremely high fitness. A linear transformation of each fitness value can have the same effect as the exponential method.

Simulated Annealing

Annealing is a process of heating and then slowly cooling a metal for the purpose of softening it or making it less brittle. The logical question that follows this definition is how this relates to optimization. The answer is that when a metal is heated, its atoms are freed from their initial positions, and in a process of slow cooling the atoms settle into a configuration of minimum energy. By analogy, simulated annealing is an algorithm for finding a global minimum in a given system. Simulated annealing is a probabilistic metaheuristic global optimization algorithm for locating a good approximation to the global minimum of a given function in a large search space. For many problems, simulated annealing may be more effective than exhaustive enumeration, provided that the goal is to find an acceptably good solution in a fixed amount of time rather than the best possible solution.

During each step of the algorithm, the current solution is replaced by a randomly chosen nearby solution, accepted with a probability that depends on a temperature parameter T. As the temperature of the system decreases, the probability of accepting a solution with a higher objective value decreases, but it is always non-zero. The decreasing acceptance probability ensures a gradual decrease in the value of the objective, while the non-zero probability still allows a higher value to replace the current one. Though this may sound like a flaw in the algorithm, it is what makes simulated annealing useful, because it allows global minima to be found rather than merely local ones. If during the course of the algorithm the search settles into a local minimum whose value is still much higher than the global minimum, this non-zero acceptance probability allows the search to backtrack, in a sense, and become unstuck from the local minimum.
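A minimal sketch of the acceptance rule follows (plain Python; the one-dimensional objective, Gaussian proposal, and geometric cooling schedule are illustrative choices rather than part of the description above). A worse candidate is accepted with probability exp(−Δ/T), which shrinks as the temperature falls but never reaches zero while T > 0.

import math
import random

def objective(x):
    # Illustrative multimodal function; global minimum near x = -0.5.
    return x**2 + 10 * math.sin(3 * x)

def simulated_annealing(x0, t0=10.0, cooling=0.995, steps=5000):
    x, best = x0, x0
    t = t0
    for _ in range(steps):
        candidate = x + random.gauss(0, 0.5)          # random nearby solution
        delta = objective(candidate) - objective(x)
        # Always accept improvements; accept worse moves with probability exp(-delta/t).
        if delta <= 0 or random.random() < math.exp(-delta / t):
            x = candidate
        if objective(x) < objective(best):
            best = x
        t *= cooling                                   # gradually lower the temperature
    return best

print(simulated_annealing(x0=4.0))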
Tabu Search

As with simulated annealing and genetic algorithms, tabu search is a metaheuristic, a trial-and-error problem-solving technique. Tabu search is a promising method in the field of optimization and is designed for practical applications in many fields, including engineering and business. "Tabu" was not a random word choice; it was chosen to emphasize that the goal of the search is to avoid the taboo of wasting time searching in the wrong direction, as other algorithms often do. The search implements a form of memory so that it does not waste time revisiting previously searched or unlikely locations; this memory is a list of prior moves. The adaptive memory of tabu search is what distinguishes it from genetic algorithms and simulated annealing.

A common saying is that the only way to learn is through mistakes, and this principle helps to define tabu search. Rather than making random decisions, a tabu search can sometimes proceed in the direction of an intentional mistake so as to develop a strategy for improvement. The next move in a tabu search is chosen after examining all possible moves from the current position. Each move is assigned a value based on how optimal it is, and the move with the highest value is typically chosen as the next move. Similar to the provision in simulated annealing that allows a less optimal state to be chosen so as to avoid local optimality, tabu search examines neighboring configurations and uses its memory to avoid becoming trapped at a local optimum. Special override conditions are also included in case a more optimal state has been marked taboo. As a relatively recently developed method, attributed to Glover and Laguna7, the full implications and implementations of tabu search remain to be explored.

CONCLUSIONS

Applications of optimization abound in statistical theory, including maximum likelihood, penalized methods, least squares, and minimum entropy, to name just a few. As data sets become larger and models become more subtle, computational statistical methods rely increasingly on numerical optimization techniques. Moreover, optimization techniques allow for a much richer class of methodologies for data analysis, often developed under the rubric of data mining or machine learning. Finally, it should be noted that the optimization algorithms described above as metaheuristic, such as simulated annealing and genetic algorithms, fundamentally involve statistical and probabilistic methods. Thus there is an elegant interplay between optimization and computational statistics.

REFERENCES

1. Karush W. Minima of Functions of Several Variables with Inequalities as Side Constraints. M.Sc. Dissertation. Department of Mathematics, University of Chicago, Chicago, IL; 1939.

2. Kuhn HW, Tucker AW. Nonlinear programming. Proceedings of the Second Berkeley Symposium. Berkeley: University of California Press; 1951, 481–492.

3. Klee V, Minty GJ. How good is the simplex algorithm? In: Shisha O, ed. Inequalities, III. New York: Academic Press; 1972, 159–175.

4. Karmarkar N. A new polynomial time algorithm for linear programming. Combinatorica 1984, 4(4):373–395.

5. Bellman R. Dynamic Programming. Princeton, NJ: Princeton University Press; 1957.

6. Bellman R. Dynamic Programming. Princeton, NJ: Princeton University Press; 2003, Dover paperback edition.

7. Glover F, Laguna M. Tabu Search. Norwell, MA: Kluwer; 1997.

Further Reading

Adler I, Karmarkar N, Resende MGC, Veiga G. An implementation of Karmarkar's algorithm for linear programming. Math Progr 1989, 44:297–335.

Arfken G. Calculus of variations. In: Mathematical Methods for Physicists, Ch. 17. 3rd ed. Orlando, FL: Academic Press; 1985, 925–962.

Ashlock D. Evolutionary Computation for Modeling and Optimization. New York, NY: Springer; 2006.

Avriel M. Nonlinear Programming: Analysis and Methods. New York, NY: Dover Publishing; 2003.

Banzhaf W, Nordin P, Keller R, Francone F. Genetic Programming - An Introduction. San Francisco, CA: Morgan Kaufmann; 1998.

Bliss GA. Calculus of Variations. Chicago, IL: Open Court; 1925.

Boyd S, Vandenberghe L. Convex Optimization. Cambridge: Cambridge University Press; 2004.

Cerny V. A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm. J Optim Theory Appl 1985, 45:41–51.

Cormen TH, Leiserson CE, Rivest RL, Stein C. Section 29.3: The simplex algorithm. In: Introduction to Algorithms. 3rd ed. Cambridge, MA: MIT Press; New York: McGraw-Hill; 2001, 790–804.

Deuflhard P. Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms. Springer Series in Computational Mathematics, 35. Berlin: Springer; 2004.

Elster KH. Modern Mathematical Methods of Optimization. Berlin: Akademie Verlag; 1993.

Fiacco AV, McCormick GP. Nonlinear Programming: Sequential Unconstrained Minimization Techniques. New York, NY: John Wiley & Sons; 1968.

Forsyth AR. Calculus of Variations. New York, NY: Dover Publishing; 1960.

Fox C. An Introduction to the Calculus of Variations. New York, NY: Dover Publications; 1987.

Gärtner B, Matoušek J. Understanding and Using Linear Programming. Berlin: Springer; 2006.

Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA: Kluwer Academic Publishers; 1989.

Golub GH, Van Loan CF. Matrix Computations. 3rd ed. Chapter 10. Baltimore, MD: Johns Hopkins University Press; 1996.

Granville V, Krivanek M, Rasson JP. Simulated annealing: a proof of convergence. IEEE Trans Pattern Anal Mach Intell 1994, 16(6):652–656. doi:10.1109/34.295910.

Hillier F, Lieberman GJ. Introduction to Operations Research. 8th ed. New York: McGraw-Hill; 2005.

Holland JH. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press; 1975. Reprinted in 1992 by Cambridge, MA: MIT Press.

Isenberg C. The Science of Soap Films and Soap Bubbles. New York, NY: Dover Publications; 1992.

Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science, New Series 1983, 220(4598):671–680.

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equations of state calculations by fast computing machines. J Chem Phys 1953, 21(6):1087–1092.

Nocedal J, Wright SJ. Numerical Optimization. New York, NY: Springer; 2006.

Papalambros PY, Wilde DJ. Principles of Optimal Design: Modeling and Computation. Cambridge: Cambridge University Press; 2000.

Sagan H. Introduction to the Calculus of Variations. New York, NY: Dover Publications; 1992.

Said YH. On genetic algorithms and their applications. In: Rao CR, Wegman EJ, Solka JL, eds. Handbook of Statistics: Data Mining and Data Visualization 24. Amsterdam: Elsevier North Holland; 2005, 359–390.

Schrijver A. Theory of Linear and Integer Programming. New York, NY: John Wiley & Sons; 1998.

Snyman JA. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. New York, NY: Springer; 2005.

Taha HA. Operations Research: An Introduction. 8th ed. Saddle River, NJ: Prentice Hall; 2007.

Vapnyarskii IB. Lagrange multipliers. In: Hazewinkel M, ed. Encyclopaedia of Mathematics. Boston, MA: Kluwer Academic Publishers; 2001.

Wright S. Primal-Dual Interior-Point Methods. Philadelphia, PA: SIAM; 1997.

Yang XS. Introduction to Mathematical Optimization: From Linear Programming to Metaheuristics. Cambridge: Cambridge International Science Publishing; 2008.

Ypma T. Historical development of the Newton–Raphson method. SIAM Rev 1995, 37(4):531–551. doi:10.1137/1037125.