Engineering Design Optimization

Joaquim R. R. A. Martins, University of Michigan
Andrew Ning, Brigham Young University

First electronic edition, January 2020.
Contents

Preface
Acknowledgements
1 Introduction
  1.1 Design Optimization Process
  1.2 Optimization Problem Formulation
  1.3 Optimization Problem Classification
  1.4 Optimization Algorithms
  1.5 Selecting an Optimization Approach
  1.6 Notation
  1.7 Summary
  Problems
Bibliography
Index
1 Introduction
Optimization is a human instinct. People constantly seek to improve
their lives and the systems that surround them. Optimization is intrinsic
in biology, as exemplified by the evolution of species. Birds optimize
their wings’ shape in real time, and dogs have been shown to find
optimal trajectories. Even more broadly, many laws of physics relate to
optimization, such as the principle of minimum energy. As Leonhard
Euler once wrote, “nothing at all takes place in the universe in which
some rule of maximum or minimum does not appear.”
Optimization is often used to mean improvement, but mathemati-
cally it is a much more precise concept: finding the best possible solution
by changing variables that can be controlled, often subject to constraints.
Optimization has a broad appeal because it is applicable in all domains
and because we can all identify with a desire to make things better.
Any problem where a decision needs to be made can be cast as an
optimization problem.
While some simple optimization problems can be solved analytically,
most practical problems of interest are too complex to be solved this way.
The advent of numerical computing, together with the development of
optimization algorithms, has enabled us to solve problems of increasing
complexity.
Optimization problems occur in various areas, such as economics,
political science, management, manufacturing, biology, physics, and
engineering. A large segment of optimization applications focuses on
operations research, which deals with problems such as deciding on the
price of a product, setting up a distribution network, scheduling, or
suggesting routes.
Another large segment of applications focuses on the design of
engineering systems—the subject of this book. Design optimization
problems abound in the various engineering disciplines, such as wing
design in aerospace engineering, process control in chemical engineer-
ing, structural design in civil engineering, circuit design in electrical
engineering, and mechanism design in mechanical engineering. Most
engineering systems rarely work in isolation and are linked to other
systems. This gave rise to the field of multidisciplinary design optimization
[Figure 1.2: Conventional (top) versus design optimization process (bottom).]

and do not require intervention from the designer.
This automated process does not usually provide a “push-button”
solution; it requires human intervention and expertise (often more
expertise than in the traditional process). Human decisions are still
needed in the design optimization process. Before running an op-
timization, in addition to determining the specifications and initial
design, engineers need to formulate the design problem. This requires
expertise in both the subject area and in numerical optimization. The
designer must decide what the objective is, which parameters can be
changed, and which constraints must be enforced. These decisions
have profound effects on the outcome, so it is crucial that the designer
formulates the optimization problem well.
After running the optimization, engineers must assess the design
because it is unlikely that a valid and practical design is obtained after
the first time a formulation is developed. After evaluating the optimal
[Figure: Design optimization compared with the conventional design process in terms of system performance and cumulative cost, showing increased performance, reduced cost, and reduced time, with a band of uncertainty.]
Consider a wing design problem where the wing planform shape is rect-
angular. The planform could be parametrized by the span (𝑏) and the chord
(𝑐) as seen in Fig. 1.5, so that 𝑥 = [𝑏, 𝑐]. However, this choice is not unique.
[Figure 1.5: Wing span (𝑏) and chord (𝑐).]

[Figure: ... for two different sets of design variables, 𝑥 = [𝑏, 𝑐] and 𝑥 = [𝑆, 𝐴𝑅].]
Two other variables are often used in aircraft design: wing area (𝑆) and wing
aspect ratio (𝐴𝑅), also shown in the figure. Because these variables are not
independent (𝑆 = 𝑏𝑐 and 𝐴𝑅 = 𝑏 2 /𝑆), we cannot just add them to the set
of design variables. Instead, we must pick any two variables out of the four
to parametrize the design because we have four possible variables and two
dependency relationships.
For this wing, the variables must be positive to be physically meaningful,
so we must remember to explicitly bound these variables to be greater than
zero in an optimization. The variables should be bounded from below by small
positive values because numerical models are probably not prepared to take
zero values. No upper bound is needed unless the optimization algorithm
requires it.
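To make the dependency relationships concrete, the following short sketch (illustrative code, not from the book) converts between the two parametrizations using 𝑆 = 𝑏𝑐 and 𝐴𝑅 = 𝑏²/𝑆; the specific lower-bound values are hypothetical.

```python
import numpy as np

def bc_to_s_ar(b, c):
    """Convert span and chord to wing area and aspect ratio (S = b*c, AR = b^2/S)."""
    S = b * c
    AR = b**2 / S
    return S, AR

def s_ar_to_bc(S, AR):
    """Recover span and chord from wing area and aspect ratio."""
    b = np.sqrt(AR * S)
    c = S / b
    return b, c

# Small positive lower bounds keep the variables physically meaningful
# (hypothetical values chosen for illustration).
lower_bounds = {"b": 1e-3, "c": 1e-3}

b, c = 12.0, 1.0
S, AR = bc_to_s_ar(b, c)          # S = 12.0, AR = 12.0
print(S, AR, s_ar_to_bc(S, AR))   # the round trip recovers (12.0, 1.0)
```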
The computation of the objective function is done through a numerical model whose complexity can range from a simple explicit equation to a system of coupled implicit models (more on this in Chapter 3).

The choice of objective function is crucial for successful design optimization. If the function does not represent the true intent of the designer, it does not matter how precisely the function and its optimum point are computed—the mathematical optimum will be non-optimal from the engineering point of view. A bad choice of objective function is a common mistake in design optimization.

[Figure 1.7: A maximization problem (max 𝑓(𝑥)) can be transformed into an equivalent minimization problem (min −𝑓(𝑥)).]

Footnote (partially recovered): [...] way to turn a maximization problem into a minimization one, but it is generally less desirable because it alters the scale of the problem and could introduce a divide-by-zero problem.

The choice of objective function is not always obvious. For example, minimizing the weight of a vehicle might sound like a good idea, but this might result in a vehicle that is too expensive to manufacture. In this case, manufacturing cost would probably be a better objective. However, there is a tradeoff between manufacturing cost and the efficiency of the vehicle. It might not be obvious which of these objectives is the most appropriate one because this trade depends on customer preferences.
This issue motivates multiobjective optimization, which is the subject of
Chapter 9. Multiobjective optimization does not yield a single design
but rather a range of designs that settle for different tradeoffs between
the objectives.
Experimenting with different objectives should be part of the design
exploration process (this is represented by the outer loop in the design
optimization process in Fig. 1.2). Results from optimizing the “wrong”
objective can still yield insights into the design tradeoffs and trends for
the system at hand.
Let us consider the appropriate objective function for Ex. 1.1. A common
objective for a wing is to minimize drag. However, this does not take into
account the propulsive efficiency, which is strongly affected by speed. A better
objective might be to minimize the required power, which balances drag and
propulsive efficiency.
The contours for the required power are shown in Fig. 1.8 for the two
choices of design variable sets discussed in Ex. 1.1. We can locate the minimum
graphically (denoted by the dot). While the two optimum solutions are the
same, the shapes of the objective function contours are different. In this case,
using aspect ratio and wing area simplifies the relationship between the design
variables and the objective by aligning the two main curvature trends with
each design variable.
[Figure 1.8: Required power contours for two different choices of design variable sets. The optimal wing is the same for both cases, but the functional form of the objective is simplified in the bottom one.]
In this case, the optimal wing has an aspect ratio that is much higher
than typically seen in aircraft or birds. While the high aspect ratio increases
aerodynamic efficiency, it adversely affects the structural strength, which we
did not consider here. Thus, as in most engineering problems, we need to add
constraints.
1.2.3 Constraints
The vast majority of practical design optimization problems require the
enforcement of constraints. These are functions of the design variables
that we want to restrict in some way. Like the objective function,
constraints are computed through a model whose complexity can vary
widely. The feasible region is the set of points that satisfy all constraints.
We seek to minimize the objective function within this feasible design
space.
When we restrict a function to being equal to a fixed value, we call
this an equality constraint, denoted by ℎ(𝑥) = 0. When the function
is required to be less than or equal to a certain value, we have an
inequality constraint, denoted by 𝑔(𝑥) ≤ 0.¶ While we use “less or equal” by convention, you should be aware that some other texts and software programs use “greater or equal” instead. There is no loss of generality with either convention, as we can always multiply the constraint by −1 to convert between the two.

¶ A strict inequality, 𝑔(𝑥) < 0, is never used because then 𝑥 could be arbitrarily close to the equality. Since the optimum is at 𝑔 = 0 for an active constraint, the exact solution would then be ill-defined from a mathematical perspective. Also, the difference is not meaningful when using finite-precision arithmetic (which is always the case when using a computer).
Tip 1.3: Check the inequality convention.
[Figure: an active inequality constraint 𝑔₁(𝑥) ≤ 0 and an active equality constraint ℎ₁(𝑥) = 0 on 𝑓(𝑥).]
even if they seem to make sense. When this happens, constraints have
to be relaxed or removed.
The problem must not be over-constrained, or else there is no feasible
region in the design space over which the function can be minimized.
Thus, the number of independent equality constraints must be less or
equal to the number of design variables (𝑛 ℎ ≤ 𝑛 𝑥 ). There is no limit
on the number of inequality constraints. However, they must be such
that there is a feasible region, and the number of active constraints plus
the equality constraints must still be less or equal than the number of
design variables.
The feasible region grows when constraints are removed and shrinks
when constraints are added (unless these constraints are redundant).
As the feasible region grows, the optimum objective function usually
improves, or at least stays the same. Conversely, the optimum worsens
or stays the same when the feasible region shrinks.
One common issue in optimization problem formulation is distin-
guishing objectives from constraints. For example, we might be tempted
to minimize the stress in a structure, but this would inevitably result in
an overdesigned heavy structure. Instead, we might want minimum
weight (or cost) with sufficient safety factors on stress, which can be
enforced by an inequality constraint.
Most engineering problems require constraints—often a large num-
ber of them. While constraints may at first appear limiting, they are
what enable the optimizer to find useful solutions.
As previously mentioned, some algorithms require the user to
provide an initial guess for the design variable values. While it is easy
to assign values within the bounds, it might not be as easy to ensure
that the initial design satisfies the constraints. This is not an issue for
most optimization algorithms, but some require starting with a feasible
design.
While this is the standard formulation used in this book, other books
and software manuals might differ from this. For example, they might
use different symbols, use “greater or equal than” for the inequality
constraint, or maximize instead of minimizing. In any case, it is possible
to convert between standard formulations to get equivalent problems.
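As a small illustration of such conversions (the helper below is hypothetical, not from the book), a maximization objective and a “greater or equal” constraint can both be wrapped into the standard minimize/≤ form by multiplying by −1.

```python
def as_standard_min(f_max=None, g_geq=None):
    """Wrap a maximization objective and >= 0 constraints into the standard
    form used in this book: minimize f(x) subject to g(x) <= 0."""
    f_min = (lambda x: -f_max(x)) if f_max is not None else None
    g_leq = (lambda x: -g_geq(x)) if g_geq is not None else None
    return f_min, g_leq

# Example: maximize f(x) = -(x - 3)^2 subject to x - 1 >= 0
f_min, g_leq = as_standard_min(f_max=lambda x: -(x - 3.0)**2,
                               g_geq=lambda x: x - 1.0)
print(f_min(3.0), g_leq(0.5))  # 0.0 at the optimum; 0.5 > 0 means x = 0.5 is infeasible
```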
All single objective, continuous optimization problems can be writ-
ten in this form. Although our target applications are engineering
design problems, many other problems can be stated in this form, and
thus, the methods covered in this book can be used to solve those
problems.
The values of the objective and constraint functions for a given set of design variables are computed through the analysis, which consists of one or more numerical models. The analysis must be fully automatic so that multiple optimization cycles can be completed without human intervention, as shown in Fig. 1.11. The optimizer usually requires an initial design 𝑥⁽⁰⁾ and then queries the analysis for a sequence of designs until it finds the optimum design, 𝑥*.

[Figure 1.11: The analysis computes the objective (𝑓) and constraint values (𝑔, ℎ) for a given set of design variables (𝑥).]

Tip 1.5: Using an optimization software package.
These are passed to the optimizer, which will call them back as needed during the optimization process. The functions take the design variable values as inputs and output the function values, as shown in Fig. 1.11. Study the software documentation for the details of how to use it.** To make sure you understand how to use a given optimization package, test it on simple problems for which you know the solution first (see Prob. 1.5).

** Possible software includes: fmincon in Matlab, scipy.optimize.minimize with the SLSQP method in Python, Optim.jl with the IPNewton method in Julia, and the Solver add-in in Microsoft Excel.
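For instance, here is a minimal sketch using scipy.optimize.minimize with the SLSQP method (one of the packages listed above) on a simple test problem with a known solution 𝑥* = (0.5, 0.5); the problem itself is an illustrative assumption, in the spirit of testing the software on a problem you can verify.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0]**2 + x[1]**2            # objective to minimize

# Convention check (Tip 1.3): SciPy expects inequality constraints written as
# fun(x) >= 0, whereas this book uses g(x) <= 0, so we pass -g(x).
cons = [{"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}]

res = minimize(f, x0=np.array([2.0, 0.0]), method="SLSQP", constraints=cons)
print(res.x, res.fun)                    # approximately [0.5, 0.5] and 0.5
```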
When the optimizer queries the analysis for a given 𝑥, the constraints
do not have to be feasible. The optimizer is responsible for changing 𝑥
so that the constraints are satisfied.
The objective and constraint functions must depend on the design
variables; if a function does not depend on any variable in the whole
domain, it can be ignored and should not appear in the problem
statement.
Ideally, 𝑓 , 𝑔, and ℎ should be computable for all values of 𝑥 that
make physical sense. Lower and upper design variable bounds should
be set to avoid non-physical designs as much as possible. Even after
taking this precaution, models in the analysis sometimes fail to provide
a solution. A good optimizer can handle such eventualities gracefully.
There are some mathematical transformations that do not change
the solution of the optimization problem (Eq. 1.4). Multiplying either
the objective or the constraints by a positive constant does not change the optimal
design; it only changes the optimum objective value. Adding a constant
to the objective does not change the solution, but adding a constant to
any constraint changes the feasible space and can change the optimal
design.
Determining an appropriate set of design variables, objective, and
constraints is a crucial aspect of the outer loop shown in Fig. 1.2,
which requires human expertise in engineering design and numerical
optimization.
computation for which we only see inputs (including the design variables) and outputs (including objective and constraints), as illustrated in Fig. 1.13.

[Figure 1.13: A model is considered a black box when we only see its inputs and outputs.]
[Figure: Optimization problem classification — design variables (continuous, discrete, mixed); objective (single, multiobjective); constraints (constrained, unconstrained); smoothness (continuous, discontinuous); linearity (linear, nonlinear).]
1.3.1 Smoothness
The degree of function smoothness with respect to variations in the
design variables depends on the continuity of the function values and
their derivatives. When the value of the function varies continuously,
the function is said to be 𝐶 0 continuous. If the first derivatives also vary
continuously, then the function is 𝐶 1 continuous, and so on. A function
is smooth when the derivatives of all orders vary continuously every-
where in its domain. Function smoothness with respect to continuous
design variables affects what type of optimization algorithm can be
used. Figure 1.14 shows one-dimensional examples for a discontinuous,
𝐶 0 function, and 𝐶 1 function.
As we will see later, discontinuities in the function value or derivatives limit the type of optimization algorithm that can be used.

1.3.2 Linearity

While many problems can be formulated as linear or quadratic problems, most engineering design problems are nonlinear. However, it is common to have at least a subset of constraints that are linear, and some general nonlinear optimization algorithms take advantage of the techniques developed to solve linear and quadratic problems.

[Figure 1.15: Example of a linear optimization problem in two dimensions.]
†† Historically, optimization problems were referred to as “programming” problems, so much of the existing literature refers to these as “linear programming” and “quadratic programming”.

1.3.3 Multimodality and Convexity

Functions can be either unimodal or multimodal. Unimodal functions have a single minimum, while multimodal functions have multiple

dent that the function is unimodal with every new optimization that converges to the same optimum.‡‡

1. He et al., Robust aerodynamic shape optimization—from a circle to an airfoil. 2019
Often, we need not be too concerned about the possibility of multiple local minima. From an engineering design point of view, achieving a local optimum that is better than the initial design is already a useful result.

Convexity is a concept related to multimodality. A function is convex if all line segments connecting any two points in the function lie above the function and never intersect it. Convex functions are always unimodal. Also, all multimodal functions are non-convex, but not all unimodal functions are convex (see Fig. 1.17).

[Figure 1.17: Multimodal functions have multiple minima, while unimodal functions have only one minimum. All multimodal functions are non-convex, but not all unimodal functions are convex.]

Convex optimization seeks to minimize convex functions over convex sets. Like linear optimization, convex optimization is another subfield of numerical optimization with many applications. When the objective and constraints are convex functions, we can use specialized formulations and algorithms that are much more efficient than general
[Figure: Optimization algorithm classification — order of information (zeroth, first, second); search (local, global); optimality criterion (mathematical, heuristic); function evaluation (direct, surrogate model).]
is even richer information that tells us the rate of the change in the
gradient, which provides an idea of where the function will flatten out.
There is a distinction between the order of information provided
by the user and the order of information that is actually used in the
algorithm. For example, a user might only provide function values
to a gradient-based algorithm and rely on the algorithm to internally
estimate gradients by requesting additional function evaluations and
using finite differences (see Section 6.4). Gradient-based algorithms
can also internally estimate curvature based on gradient values (see
Section 4.4.4).
In theory, gradient-based algorithms require the functions to be
sufficiently smooth (at least 𝐶 1 continuous). However, in practice, they
can tolerate the occasional discontinuity, as long as this discontinuity
does not happen to be at the optimum point.
We devote a considerable portion of this book to gradient-based
algorithms because they generally scale better to problems with many
design variables, and they have rigorous mathematical criteria for
optimality. We also cover the various approaches for computing
gradients in detail because the accurate and efficient computation of
these gradients is crucial for the efficacy and efficiency of these methods
(see Chapter 6).
Current state-of-the-art optimization algorithms also use second-
order information to implement Newton-type methods for second-
order convergence. However, these algorithms tend to build second-
order information based on the provided gradients, as opposed to
requiring users to provide the second-order information directly (see
Section 4.4.4).
Because gradient-based methods require accurate gradients and
smooth enough functions, they require more knowledge about the mod-
els and optimization algorithm than gradient-free methods. Chapters 3
through 6 are devoted to making the power of gradient-based methods
more accessible by providing the necessary theoretical and practical
knowledge.
1.4.5 Stochasticity
This attribute is independent of the stochasticity of the model that
we mentioned previously, and it is strictly related to whether the
optimization algorithm itself contains steps that are determined at
random or not.
A deterministic optimization algorithm always evaluates the same
points and converges to the same result given the same initial conditions.
In contrast, a stochastic optimization algorithm evaluates a different set
of points if run multiple times from the same initial conditions, even
if the models for the objective and constraints are deterministic. For
example, most evolutionary algorithms include steps determined by
generating random numbers. Gradient-based algorithms are usually
deterministic, but some exceptions exist, such as stochastic gradient
descent from the machine learning community (see Section 10.5).
we can perform the time integration to solve for the full-time history
of the states and then compute the objective and constraint function
values for an optimization iteration. This means that every optimization
iteration requires solving for the complete time history. An example
of this type of problem is a trajectory optimization problem where the
design variables are the coordinates representing the path, and the
objective is to minimize the total energy expended to get to a given
destination.2 Although such a problem involves a time dependence, we solve a single optimization problem, so we still classify such a problem as static.

2. Betts, Survey of numerical methods for trajectory optimization. 1998
For another class of time-dependent optimization problems, how-
ever, we solve for a sequence or a time history of decisions at different
time instances because we must make decisions as time progresses.
These are called dynamic optimization problems (also known as dynamic programming).3,4 In such problems, the design variables are the sequence of decisions, and the decision at a given time instance is influenced by the decisions made in the previous time instances.

3. Bryson et al., Applied Optimal Control; Optimization, Estimation, and Control. 1969
4. Bertsekas, Dynamic programming and optimal control. 1995

An
example of a dynamic optimization problem would be to optimize the
throttle, braking, and steering of a car at each time instance such that
the overall time in a racecourse is minimized. This is an example of
an optimal control problem, a type of dynamic optimization problem
where a control law is optimized for a dynamical system over a period
of time. Dynamic optimization is not broadly covered in this book,
except in the context of discrete optimization (see Section 8.5). Different
approaches are used in general, but many of the concepts covered here
are instrumental in the numerical solution of optimal control problems.
[Figure 1.20: Decision tree for selecting an optimization algorithm. The tree branches on convexity, discrete variables, differentiability, constraints, and multimodality, pointing to linear and quadratic optimization (Ch. 11), discrete methods such as branch and bound, dynamic programming, SA, or binary GA (Ch. 8), gradient-based methods such as BFGS, SQP, or interior points, with multistart for multimodal problems (Chs. 4 to 6), and gradient-free methods such as Nelder–Mead, DIRECT, GA, and PS (Ch. 7).]

The first node asks about convexity. While it is often not immediately apparent if the problem is convex, with some experience, one can usually
discern whether attempting to reformulate in a convex manner is likely
to be possible. In most instances, convexity occurs for problems with
simple objectives and constraints (e.g., linear or quadratic), such as in
control applications where the optimization is performed repeatedly.
A convex problem can be solved with the more general gradient-based
or gradient-free algorithms, but it would be inefficient not to take
advantage of the convex formulation structure if we can do so.
The next node asks about discrete variables. Problems with discrete
variables are much harder to solve, so we often consider techniques
to avoid using discrete variables whenever possible. For example, a
wind turbine position in a field could be posed as a discrete variable
within a discrete set of options. Alternatively, we could represent the
wind turbine position as a continuous variable with two continuous
coordinate variables. That level of flexibility may or may not be desirable
but will almost always lead to better solutions.
Next, we consider if the model is differentiable or if it can be made
differentiable through model improvements. If the problem is high
dimensional (more than a few tens of variables as a rule of thumb),
gradient-free methods are generally intractable. We would need to
either make the model differentiable or reduce the dimensionality
of the problem. Another alternative if the problem is not readily
differentiable is to consider surrogate-based optimization (the box
labeled “noisy or expensive”). If we go the surrogate-based optimization
1.6 Notation
1.7 Summary
Problems
𝑥₁² + 𝑥₂² ≤ 1,
𝑥₁ − 3𝑥₂ + 1/2 ≥ 0,
and bound constraints:
𝑥₁ ≥ 0,  𝑥₂ ≥ 0.
Plot the constraints and identify the feasible region. Find the
constrained minimum graphically. Use optimization software
to solve the constrained minimization problem. Which of the
inequality constraints and bounds are active at the solution?
2.2 Optimization Revolution: Derivatives and Calculus

[Figure 2.2: The law of reflection can be derived by minimizing the length of the beam of light.]

The scientific revolution generated significant optimization developments in the 17th and 18th centuries that intertwined with other mathematics and physics developments.
In the early 17th century, Johannes Kepler published a book in which he derived the optimal dimensions of a wine barrel.5 He became interested in this problem when he bought a barrel of wine, and the merchant charged him based on a diagonal length (see Fig. 2.3). This outraged Kepler because he realized that the amount of wine could vary for the same diagonal length, depending on the barrel proportions.

5. Kepler, Nova stereometria doliorum vinariorum (New solid geometry of wine barrels). 1615
In 1955, Lester Ford and Delbert Fulkerson created the first known
algorithm to solve the maximum flow problem, which has applications
in transportation, electrical circuits, and data transmission. While the
problem could already be solved with the simplex algorithm, they
proposed a more efficient algorithm for this specialized problem.
In 1957, Richard Bellman derived the necessary optimality condi-
tions for dynamic programming problems. These are expressed in
what became known as the Bellman equation (Section 8.5), which was
first applied to engineering control theory, and subsequently became a
core principle in the development of economic theory.
In 1959, William Davidon developed the first quasi-Newton method
to solve nonlinear optimization problems that rely on approximations
of the curvature based on gradient information. He was motivated by
his work at Argonne National Lab, where he used a coordinate descent
method to perform an optimization that kept crashing the computer
before converging. Although Davidon’s approach was a breakthrough
in nonlinear optimization, his original paper was rejected. It was
eventually published more than 30 years later in the first issue of the
SIAM Journal on Optimization.15 Fortunately, his valuable insight had been recognized well before that by Roger Fletcher and Michael Powell, who developed the method further.16 The method became known as DFP (Section 4.4.4).

15. Davidon, Variable Metric Method for Minimization. 1991
16. Fletcher et al., A Rapidly Convergent Descent Method for Minimization. 1963
Another quasi-Newton approximation method was independently
proposed in 1970 by Charles Broyden, Roger Fletcher, Donald Goldfarb,
and David Shanno, now called the BFGS approximation. Larry Armijo,
A. Goldstein, and Philip Wolfe developed the conditions for the line search in gradient-based methods that ensure convergence (see Section 4.3.2).17

17. Wolfe, Convergence Conditions for Ascent Methods. 1969

With these developments in unconstrained optimization, researchers sought methods to solve constrained problems as well. Penalty and barrier methods were developed but fell out of favor because of numerical issues (see Section 5.3).
In another effort to solve nonlinear constrained problems, Robert
Wilson proposed the sequential quadratic programming (SQP) method
in his Ph.D. thesis.18 SQP essentially consists of applying Newton's method to solve the KKT conditions (see Section 5.4). Shih-Ping Han reinvented SQP in 197619 and Michael Powell popularized this method in a series of papers starting from 1977.20

18. Wilson, A simplicial algorithm for concave programming. 1963
19. Han, Superlinearly convergent variable metric algorithms for general nonlinear programming problems. 1976
20. Powell, Algorithms for nonlinear constraints that use Lagrangian functions. 1978

There were attempts to model the natural process of evolution starting in the 1950s. In 1975, John Holland proposed genetic algorithms (GAs) to solve optimization problems (Section 7.5).21 Research in GAs increased dramatically after that, thanks in part to the exponential increase in computing power (Section 7.5).

21. Holland, Adaptation in Natural and Artificial Systems. 1975
Hooke et al.22 proposed a gradient-free method, which they called “pattern search.” In 1965, Nelder et al.23 developed the nonlinear simplex method, another gradient-free nonlinear optimization method based on heuristics (Section 7.3). (This has no connection to the simplex algorithm for linear programming problems mentioned above.)

22. Hooke et al., “Direct Search” Solution of Numerical and Statistical Problems. 1961
23. Nelder et al., A Simplex Method for Function Minimization. 1965
The Mathematical Programming Society was founded in 1973, an
international association for researchers active in optimization. It was
eventually renamed Mathematical Optimization Society in 2010.
Narendra Karmarkar presented a revolutionary new method in
1984 to solve large-scale LPs as much as a hundred times faster than the
simplex method.24 The New York Times published a news item on the first page with the headline “Breakthrough in Problem Solving.” This started the age of interior point methods, which are related to the barrier methods dismissed in the 1960s. Interior point methods were eventually adapted to solve nonlinear problems (see Section 5.5) and contributed to the unification of linear and nonlinear optimization.

24. Karmarkar, A New Polynomial-Time Algorithm for Linear Programming. 1984
The pattern search algorithms that Hooke and Jeeves, and Nelder
and Mead developed were disparaged by applied mathematicians,
who preferred the rigor and efficiency of the gradient-based methods
developed soon after that. Nevertheless, they were further developed
and remain popular with engineering practitioners because of their sim-
plicity. Pattern search methods experienced a renaissance in the 1990s
with the development of convergence proofs that added mathematical
rigor and the availability of more powerful parallel computers.46 Today, pattern search methods remain a useful option (and sometimes the only one) for some types of optimization problems.

46. Torczon, On the Convergence of Pattern Search Algorithms. 1997
Global optimization algorithms also experienced further develop-
ments. Jones et al.47 developed the DIRECT algorithm, which uses a rigorous approach to find the global optimum (Section 7.4).

47. Jones et al., Lipschitzian optimization without the Lipschitz constant. 1993
The first genetic algorithms started the development of the broader
class of evolutionary optimization algorithms inspired more broadly by
natural and societal processes. Optimization by simulated annealing
represents one of the early examples of this broader perspective.48 Another example is particle swarm optimization (PSO) (see Section 7.6).49

48. Kirkpatrick et al., Optimization by Simulated Annealing. 1983
49. Kennedy et al., Particle Swarm Optimization. 1995

Since then, there has been an explosion in the number of evolutionary
algorithms, inspired by any process imaginable (see the side note at the
end of Section 7.2 for a partial list). Evolutionary algorithms have re-
mained heuristic and have not experienced the mathematical treatment
applied to pattern search methods.
There has been a sustained interest in surrogate models (also known
as metamodels) since the seminal contributions in the 1950s. Kriging
surrogate models are still being used and have been the focus of many
improvements, but new techniques such as radial-basis functions have
also emerged.50 Surrogate-based optimization is now an area of active research (see Chapter 10).

50. Forrester et al., Recent advances in surrogate-based optimization. 2009
Artificial intelligence (AI) has experienced a revolution in the last
decade and is connected to optimization in several ways. The early AI
efforts focused on solving problems that could be described formally,
such as design optimization problem statements. Today, AI solves
problems that are difficult to describe formally, such as face recognition.
This new capability is made possible by the development of deep
learning neural networks, the availability of large datasets for training
the neural networks, and increased computer power. Deep learning
neural networks learn to map a set of inputs to a set of outputs based
on training data and can be viewed as a type of surrogate model (see
Section 10.5). These networks are trained using optimization algorithms
that minimize a loss function (analogous to model error), but they
require specialized optimization algorithms such as stochastic gradient
descent.51 The gradients for this problem are efficiently computed with backpropagation, which is a specialization of the reverse mode of AD.52

51. Bottou et al., Optimization Methods for Large-Scale Machine Learning. 2018
52. Baydin et al., Automatic Differentiation in Machine Learning: a Survey. 2018
2.5 Summary
not possible. When that is the case, we must discretize the continuous equations to obtain the numerical model. This numerical model must
𝐾𝑢 = 𝑓 ,
where 𝐾 is the stiffness matrix, 𝑓 is the vector of applied loads, and 𝑢 are the
displacements that we want to compute. At each joint, there are two degrees of
freedom (horizontal and vertical) that describe the displacement and applied
force. Since there are 9 joints, each with 2 degrees of freedom, the size of this
linear system is 18.
𝑟ᵢ(𝑢₁, . . . , 𝑢ₙ) = 0,  𝑖 = 1, . . . , 𝑛,    (3.1)
where 𝑟 is a vector of residuals that has the same size as the vector of
state variables 𝑢. The equation defining the residuals could be any
expression that can be coded in a computer program. No matter how
complex the mathematical model, it can always be written as a set of
equations in this form, which we write more compactly as 𝑟(𝑢) = 0.
This residual notation can still be used to represent explicit functions,
so we can use it for all the functions in a model without loss of generality.
Suppose we have the explicit function 𝑢 𝑓 , 𝑓 (𝑢), where 𝑢 is a vector
and 𝑢 𝑓 is the scalar function value (and not one of the components
of 𝑢). We can rewrite this function as a residual equation by moving
all the terms to one side to get a 𝑟(𝑢 𝑓 ) = 𝑓 (𝑢) − 𝑢 𝑓 = 0. Even though
it might seem more natural to use explicit functions, we might be
motivated to use the residual form to write the whole model in the
compact notation, 𝑟(𝑢) = 0. This will be helpful in later chapters when
computing derivatives (Chapter 6) and solving systems with multiple
components (Chapter 13).
𝑢₁² + 2𝑢₂ − 1 = 0,
𝑢₁ + cos(𝑢₁) − 𝑢₂ = 0,    (3.2)
𝑓(𝑢₁, 𝑢₂) = 𝑢₁ + 𝑢₂.
The first two equations are written as implicit functions and the third equation
is given as an explicit function. The first equation could be manipulated to
obtain an explicit function of either 𝑢1 or 𝑢2 . The second equation does not have
a closed-form solution and cannot be written as an explicit function for 𝑢1 . The
third equation is an explicit function of 𝑢1 and 𝑢2 . Given these equations, we
might decide to solve the first two equations for 𝑢₁ and 𝑢₂ using a nonlinear
solver and then evaluate 𝑓 (𝑢1 , 𝑢2 ). However, we can write the whole system as
You can use the same nonlinear solver to solve for all three equations simulta-
neously.
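As a sketch of this idea (not the book's code), the three residuals of Eq. 3.2 can be passed to a general nonlinear solver such as scipy.optimize.root, treating the explicit output as a third state 𝑢_𝑓 with residual 𝑓(𝑢) − 𝑢_𝑓 = 0.

```python
import numpy as np
from scipy.optimize import root

def residuals(u):
    u1, u2, uf = u
    return [u1**2 + 2.0 * u2 - 1.0,     # first implicit equation
            u1 + np.cos(u1) - u2,        # second implicit equation
            u1 + u2 - uf]                # explicit function written in residual form

sol = root(residuals, x0=[1.0, 1.0, 1.0])            # a simple starting guess
print(sol.x, np.max(np.abs(residuals(sol.x))))       # states and residual norm
```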
The linear system from Ex. 3.1 can be obtained by a finite-element dis-
cretization of the governing equations. This is an example of a set of implicit
equations, which we can write as a set of residuals,
𝑟(𝑢) = 𝐾𝑢 − 𝑓 = 0, (3.4)
where 𝑢 are the state variables. While the solution for 𝑢 could be written as an
explicit function, 𝑢 = 𝐾 −1 𝑓 , this is usually not done. Instead, we use a linear
solver that does not explicitly form the inverse of the stiffness matrix.
In addition to computing the displacements, we might also want to compute
the axial stress in each of the 15 truss members. This is an explicit function of
the displacements, which is given by the linear relationship
𝜎 = 𝑆𝑢, (3.5)
𝑚 = ∑ᵢ₌₁¹⁵ 𝜌 𝑎ᵢ 𝑙ᵢ ,    (3.6)
[Figure 3.3: Discretization methods in one spatial dimension (mesh points, cells, and elements).]

Differential equations need to be discretized over the domain to be solved numerically. There are three main methods for the discretization
of differential equations: the finite-difference method, the finite-volume
method, and the finite-element method. The finite-difference method
approximates the derivatives in the differential equations by the value
of the relevant quantities at a discrete number of points in a mesh (see
Fig. 3.3). The finite-volume method is based on the integral form of the
PDEs. It divides the domain into control volumes called cells (which
also form a mesh), and the integral is evaluated for each cell. The values
of the relevant quantities can be defined either at the centroids of the
cells or at the cell vertices. The finite-element method divides the domain
into elements (which are similar to cells) over which the quantities
are interpolated using pre-defined shape functions. The values are
computed at specific points in the element that are not necessarily at the
element boundaries. Governing equations can also include integrals,
which can be discretized with quadrature rules.
With any of these discretization methods, the final result is a set
of algebraic equations that we can write as 𝑟(𝑢) = 0 and solve for the
state variables 𝑢. This is a potentially large set of equations depending
on the domain and discretization (it is common to have millions of
equations in three-dimensional computational fluid dynamic problems).
The number of state variables of the discretized model is equal to the
number of equations for a complete and well-defined model. In the
most general case, the set of equations could be implicit and nonlinear.
When a problem involves both space and time, the prevailing ap-
proach is to decouple the discretization in space from the discretization
in time—called the method of lines (see Fig. 3.4). The discretiza-
tion in space is performed first, yielding an ODE in time. The time
derivative can then be approximated as a finite difference, leading to a
time-integration scheme.
The discretization process usually yields implicit algebraic equations
This is a more useful error measure in most cases. When the exact value 𝑥* is close to zero, however, this definition breaks down. To address this, we avoid the division by zero by using
If we try to store 24.11 using three digits, we get 24.1. The relative error is

(24.11 − 24.1) / 24.11 ≈ 0.0004,    (3.11)

which is lower than the maximum possible representation error of 𝜖_mach = 0.005 established in Ex. 3.4.
[Figure 3.6: Subtracting two numbers that are close to each other results in a loss of the digits that match.]
computation around 𝑥 = 2 due to subtractive cancellation.

[Figure 3.7: With double precision, the minimum of this quadratic function is in an interval much larger than machine zero.]
3.5.2 Truncation Errors

In the most general sense, truncation errors arise from performing a finite number of operations where an infinite number of operations would be required to get an exact result.† Truncation errors would arise

† Roundoff error, discussed in the previous section, is sometimes also referred to as truncation error because digits are truncated, but we avoid this confusing naming and only use truncation error to refer to a truncation in the number of operations.
cedure that starts with a guess for the states 𝑢 and then improves that guess, as shown in Fig. 3.8. The norm of the residuals decreases gradually until a limit is reached (near 10⁻¹⁰ in this case). This limit represents the lowest error that can be achieved with the iterative solver and is determined by other sources of error such as roundoff and truncation errors. If we terminate before reaching the limit (either by setting a convergence tolerance to a value higher than 10⁻¹⁰, or by setting an iteration limit to lower than 400 iterations), we incur an additional error. However, it might be desirable to trade off a less precise solution for a lower computational effort.

[Figure 3.8: Norm of residuals versus the number of iterations for an iterative solver.]
Tip 3.8: Find the level of the numerical noise in your model.
It is important to know the level of error in your model because this limits
the type of optimizer you can use and how well you can optimize. In Ex. 3.6,
we saw that if we plot a function at a small enough scale, we can see discrete
steps in the function due to roundoff errors. When accumulating all sources of
error in a more elaborate model (roundoff, truncation, and iterative), we no
longer have a neat step pattern. Instead, we get numerical noise, as shown in
Fig. 3.9. The level of noise can be estimated by the amplitude of the oscillations
and gives us the order of magnitude of the total numerical error.
[Figure 3.9: To find the level of numerical noise of a function of interest with respect to an input parameter (left), we magnify both axes by several orders of magnitude and evaluate the function at points that are closely spaced (right).]
variable values.
The overall attitude towards programming should be that all code has bugs
until it is verified through testing.
which means that the norm of the error tends to zero as the number of iterations tends to infinity. This sequence converges with order 𝑟 when 𝑟 is the largest number that satisfies

0 ≤ lim_{𝑘→∞} ‖𝑥⁽ᵏ⁺¹⁾ − 𝑥*‖ / ‖𝑥⁽ᵏ⁾ − 𝑥*‖ʳ = 𝛾 < ∞,    (3.13)
constant factor for every iteration. If 𝛾 = 0.1, for example, and we start
with an initial error norm of 0.1, we get the sequence,
Thus, after six iterations, we get six-digit accuracy. Now suppose that
𝛾 = 0.9. Then we would have
If 𝛾 = 1, then the error norm sequence with a starting error norm of 0.1
would be
10⁻¹, 10⁻², 10⁻⁴, 10⁻⁸, . . . .    (3.19)
Thus we achieve more than six digits of accuracy in just three iterations!
In this case, the number of correct digits doubles at every iteration.
For 𝛾 > 1, the convergence might not be as good, but the series is still
convergent.
If 𝑟 ≥ 1 and 𝛾 → 0, we have superlinear convergence, which includes
quadratic and higher rates of convergence. There is a special case of
superlinear convergence that is relevant for optimization algorithms,
which is when 𝑟 = 1. This case is desirable because in practice it
behaves similarly to quadratic convergence and can be achieved by
gradient-based algorithms that use first derivatives (as opposed to
second derivatives). In this case, we can write
[Figure 3.10: Sample sequences for linear (𝑟 = 1, 𝛾 = 0.9 and 𝛾 = 0.1), superlinear (𝑟 = 1, 𝛾 → 0), and quadratic (𝑟 = 2) cases plotted in a linear scale (left) and logarithmic scale (right).]
When using a linear scale plot, you can only see differences in two significant
digits. To reveal changes beyond three digits, you should use a logarithmic
scale. This need occurs frequently in plotting the convergence behavior of
optimization algorithms.
iteration:

‖𝑥⁽ᵏ⁺¹⁾ − 𝑥*‖ / ‖𝑥⁽ᵏ⁾ − 𝑥*‖ ≈ ‖𝑥⁽ᵏ⁺¹⁾ − 𝑥⁽ᵏ⁾‖ / ‖𝑥⁽ᵏ⁾ − 𝑥⁽ᵏ⁻¹⁾‖,    (3.22)
The rate of convergence can then be estimated numerically with the
values of the last available four iterates using
Finally, we can also monitor any quantity by taking the step length
and normalizing it in the same way as Eq. 3.8,
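As a rough illustration of the rate-of-convergence estimate discussed above, the following sketch uses successive step lengths as stand-ins for the true error over the last four iterates. The log-ratio formula is an assumption here (the corresponding equation is not reproduced above), but it is consistent with the approximation in Eq. 3.22.

```python
import numpy as np

def estimate_rate(x3, x2, x1, x0):
    """Estimate the order of convergence; x0 is the latest iterate, x3 the oldest."""
    s0 = np.linalg.norm(np.subtract(x0, x1))   # latest step length
    s1 = np.linalg.norm(np.subtract(x1, x2))
    s2 = np.linalg.norm(np.subtract(x2, x3))
    return np.log(s0 / s1) / np.log(s1 / s2)

# Scalar iterates approaching x* = 0 with quadratically shrinking error
# (1e-1, 1e-2, 1e-4, 1e-8) give an estimated rate near 2.
iterates = [1e-1, 1e-2, 1e-4, 1e-8]
print(estimate_rate(*iterates))
```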
[Figure 3.11: Overview of solution methods for linear and nonlinear systems. Linear solvers are either direct (LU decomposition, QR decomposition) or iterative (fixed point: Jacobi, SOR; Krylov subspace: CG, GMRES); nonlinear solvers include nonlinear variants of fixed-point iterations and Newton's method combined with a linear solver.]

each iteration through explicit expressions that are easy to compute, as illustrated in Fig. 3.12. Iterative methods can be fixed-point iterations, such as Jacobi, Gauss–Seidel, and successive over-relaxation (SOR), or Krylov subspace methods, such as the conjugate gradient (CG) and generalized minimum residual (GMRES) methods.¶ Direct solvers are well established and are included in the standard libraries for most programming languages. Iterative solvers are less widespread in standard libraries, but they are becoming more commonplace.

¶ See Saad56 for more details on iterative methods in the context of large-scale numerical models. 56. Saad, Iterative Methods for Sparse Linear Systems. 2003
Because some numerical libraries have functions to compute 𝐴⁻¹, you might be tempted to do this and then multiply by a vector to compute 𝑢 = 𝐴⁻¹𝑏. This is a bad idea because finding the inverse is computationally expensive. Instead, use 𝐿𝑈 decomposition or another method from Fig. 3.11.

[Figure 3.12: While direct methods only yield the solution at the end of the process, iterative methods produce approximate intermediate results.]

Direct methods are the right choice for many problems because they are generally robust. However, for large systems where 𝐴 is sparse, the cost of direct methods can become prohibitive, while iterative methods remain viable. Iterative methods have other advantages, such as being able to trade between computational cost and precision, and to restart from a good guess (see Appendix B for details).
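To make the tip above concrete, here is a small sketch (not from the book) comparing an explicit inverse with an LU-based solve using NumPy and SciPy; the random symmetric positive-definite matrix is a stand-in for a stiffness matrix like the one in Ex. 3.1.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
A = A @ A.T + 200.0 * np.eye(200)        # make the matrix symmetric positive definite
b = rng.standard_normal(200)

u_bad = np.linalg.inv(A) @ b             # works, but wasteful and less accurate
lu, piv = lu_factor(A)                   # factor once...
u_good = lu_solve((lu, piv), b)          # ...then solve cheaply (reusable for new b)

print(np.linalg.norm(u_bad - u_good))    # the two solutions agree to roundoff
```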
When it comes to nonlinear solvers, the most efficient methods are
based on Newton’s method, which we explain later in this chapter
(Section 3.8). Newton's method solves a sequence of problems that are
linearizations of the nonlinear problem about the current iterate. The
linear problem at each Newton iteration can be solved using any linear
solver, as indicated by the incoming arrow in Fig. 3.11. Although
efficient, Newton’s method is not robust in that it does not always
converge. Therefore, it requires modifications so that it can converge
reliably.
𝑟⁽ᵏ⁾ + Δ𝑢 𝑟′⁽ᵏ⁾ = 0  ⇒  Δ𝑢 = −𝑟⁽ᵏ⁾ / 𝑟′⁽ᵏ⁾.    (3.27)

𝑢⁽ᵏ⁺¹⁾ = 𝑢⁽ᵏ⁾ − 𝑟⁽ᵏ⁾ / 𝑟′⁽ᵏ⁾.    (3.28)

If 𝑟′⁽ᵏ⁾ = 0, the algorithm will not converge because it yields a step to infinity. Small values of 𝑟′⁽ᵏ⁾ also cause an issue with large steps, but the algorithm might still converge.
𝑢₂ = 1/𝑢₁,   𝑢₂ = √𝑢₁.    (3.35)
This corresponds to the two lines shown in Fig. 3.14, where the solution is at
their intersection, 𝑢 = (1, 1). (In this example, the two equations are explicit
and we could solve them by substitution, but they could have been implicit.)
To solve this using Newton's method, we need to write these as residuals,

𝑟₁ = 𝑢₂ − 1/𝑢₁ = 0,
𝑟₂ = 𝑢₂ − √𝑢₁ = 0.    (3.36)

The Jacobian can be derived analytically and the Newton step is given by the linear system

[ 1/𝑢₁²        1 ] [Δ𝑢₁]      [ 𝑢₂ − 1/𝑢₁ ]
[ −1/(2√𝑢₁)    1 ] [Δ𝑢₂]  = − [ 𝑢₂ − √𝑢₁  ]    (3.37)

Starting from 𝑢 = (2, 3) yields the iterations shown below, with the quadratic convergence shown in Fig. 3.15.

𝑢₁          𝑢₂          ‖𝑢 − 𝑢*‖       ‖𝑟‖
2.000000    3.000000    2.23           2.50
0.485281    0.878679    5.28 × 10⁻¹    2.50

[Figure 3.14: Newton iterations.]
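A minimal sketch of these Newton iterations, using the residuals of Eq. 3.36 and the analytic Jacobian of Eq. 3.37 and starting from 𝑢 = (2, 3) as in the example (the code itself is illustrative, not the book's):

```python
import numpy as np

def r(u):
    u1, u2 = u
    return np.array([u2 - 1.0 / u1, u2 - np.sqrt(u1)])

def jacobian(u):
    u1, _ = u
    return np.array([[1.0 / u1**2, 1.0],
                     [-1.0 / (2.0 * np.sqrt(u1)), 1.0]])

u = np.array([2.0, 3.0])
for k in range(20):
    res = r(u)
    if np.linalg.norm(res) < 1e-12:
        break
    du = np.linalg.solve(jacobian(u), -res)   # Newton step from Eq. 3.37
    u = u + du
print(k, u)                                   # converges to u* = (1, 1)
```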
When performing design optimization, we ultimately need to compute the values of the objective and constraint functions in the optimization problem (1.4). There is typically an intermediate step that requires solving the governing equations for the given design 𝑥 at one or more

[Figure 3.16: For a general model, the state variables 𝑢 are implicit functions of the design variables 𝑥 through the solution of the governing equations.]
minimize       𝑓(𝑥)
by varying     𝑥ᵢ,  𝑖 = 1, . . . , 𝑛ₓ
subject to     𝑔ⱼ(𝑥, 𝑢) ≤ 0,  𝑗 = 1, . . . , 𝑛_𝑔
               ℎₖ(𝑥, 𝑢) = 0,  𝑘 = 1, . . . , 𝑛_ℎ    (3.38)
               𝑥̲ᵢ ≤ 𝑥ᵢ ≤ 𝑥̄ᵢ,  𝑖 = 1, . . . , 𝑛ₓ
while solving  𝑟ₗ(𝑥, 𝑢) = 0,  𝑙 = 1, . . . , 𝑛ᵤ
by varying     𝑢ₗ,  𝑙 = 1, . . . , 𝑛ᵤ
Here, “while solving” means that the governing equations are solved at each optimization iteration to find a valid 𝑢 for each value of 𝑥.

[Figure 3.17: The computation of the objective (𝑓) and constraint functions (𝑔, ℎ) for a given set of design variables (𝑥) usually involves the solution of a numerical model (𝑟 = 0) by varying the state variables (𝑢).]

Example 3.14: Structural sizing optimization.

Recalling the truss problem of Ex. 3.3, suppose we want to minimize the mass of the structure (𝑚) by varying the cross sectional areas of the trusses (𝑥), subject to stress constraints. We can write the problem statement as

minimize       𝑚(𝑥)
by varying     𝑥ⱼ ≥ 𝑥_min,  𝑗 = 1, . . . , 15
subject to     |𝜎ⱼ(𝑥, 𝑢)| − 𝜎_max ≤ 0,  𝑗 = 1, . . . , 15    (3.39)
while solving  𝐾𝑢 − 𝑓 = 0
by varying     𝑢ₗ,  𝑙 = 1, . . . , 18
The governing equations are a linear set of equations whose solution determines
the displacements of a given design (𝑥) for a given load condition (𝑓). We
mentioned previously that the objective and constraint functions are usually
explicit functions of the state variables, design variables, or both. As we saw in
3 Numerical Models and Solvers 65
Ex. 3.3, the mass is indeed an explicit function of the cross sectional areas. In
this case, it does not even depend on the state variables. The constraint function
is also an explicit function, but in this case it is just a function of the state
variables. This example illustrates a common situation where the solution of
the state variables requires the solution of implicit equations (structural solver),
while the constraints (stresses) and objective (weight) are explicit functions of
the states and design variables.
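The nested structure of Ex. 3.14 can be sketched in code as follows. This is a toy stand-in (a made-up two-variable "structure" solved with SciPy's SLSQP), not the 15-bar truss model, but it shows the optimizer varying 𝑥 while each function evaluation solves the governing equations for 𝑢 internally.

```python
import numpy as np
from scipy.optimize import minimize

load = np.array([1.0, 2.0])
u_max = 0.5                                   # allowable "stress" (illustrative value)

def solve_states(x):
    K = np.diag(x)                            # stand-in stiffness matrix K(x)
    return np.linalg.solve(K, load)           # "while solving" K u = f

def mass(x):
    return np.sum(x)                          # objective: minimize total size

def constraints(x):
    u = solve_states(x)
    return u_max - np.abs(u)                  # >= 0 feasible (SciPy's sign convention)

res = minimize(mass, x0=np.array([5.0, 5.0]), method="SLSQP",
               bounds=[(0.1, None)] * 2,
               constraints=[{"type": "ineq", "fun": constraints}])
print(res.x)   # each x_i shrinks until |u_i| = u_max becomes active: x ≈ [2, 4]
```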
statement as equality constraints and design variables, respectively.

[Figure 3.18: In the full-space approach, the governing equations are solved by the optimizer by varying the state variables.]

Example 3.15: Structural sizing optimization using a full-space approach.
If we wanted to solve this problem using a full-space approach, we would
forgo the linear solver by adding 𝑢 to the set of design variables and letting the
optimizer enforce the governing equations. This would result in the following
problem,
minimize     𝑚(𝑥)
by varying   𝑥ⱼ ≥ 𝑥_min,  𝑗 = 1, . . . , 15
             𝑢ₗ,  𝑙 = 1, . . . , 18    (3.41)
subject to   |𝜎ⱼ(𝑥, 𝑢)| − 𝜎_max ≤ 0,  𝑗 = 1, . . . , 15
             𝐾𝑢 − 𝑓 = 0.
3.10 Summary
Problems
3.2 Choose an engineering system that you are familiar with and
describe each of the components illustrated in Fig. 3.1 for that
system. List all the options for the mathematical and numerical
models that you can think of and describe the assumptions for
each model. What type of solver is usually used for each model (see Section 3.7), and what are the state variables for each model?
𝑢₁² + 2𝑢₂ = 1,
𝑢₁ + cos(𝑢₁) − 𝑢₂ = 0,
𝑓(𝑢₁, 𝑢₂) = 𝑢₁ + 𝑢₂.
Which equations are explicit and which ones are implicit? Write
these equations in residual form.
3.4 Reproduce a plot similar to the one shown in Fig. 3.7 for
𝑓 (𝑥) = cos(𝑥) + 1
in the neighborhood of 𝑥 = 𝜋.
𝑟(𝑢) = 𝑢³ − 6𝑢² + 12𝑢 − 8 = 0.

𝐸 − 𝑒 sin(𝐸) = 𝑀,

𝑟(𝑢) = 𝑎𝑢³ − 6𝑢² + 12𝑢 − 8 = 0.
3.8 Reproduce the solution of Ex. 3.13 and then try different initial
guesses. Can you define a distinct region from where Newton’s
method converges?
3.9 Choose a problem that you are familiar with and find the magni-
tude of numerical noise in one or more outputs of interest with
respect to one or more inputs of interest. What means do you
have to decrease the numerical noise? What is the lowest possible
level of noise you can achieve?
4 Unconstrained Gradient-Based Optimization
In this chapter we focus our attention on unconstrained minimiza-
tion problems with continuous design variables (see Fig. 1.12). Such
optimization problems can be written as
minimize    𝑓(𝑥)
by varying  𝑥ᵢ,  𝑖 = 1, . . . , 𝑛ₓ,    (4.1)
4.1 Fundamentals
[Figure 4.2: Components of the gradient vector in the 2D case.]
This defines the vector field shown in Fig. 4.3, where each vector points in the direction of steepest local increase.

[Figure 4.3: Gradient vector field, with the minimum, maximum, and saddle points marked.]
Consider the wing design problem from Ex. 1.1, where the objective
function 𝑓 is the required power. For the derivative of power with respect to
span (𝜕 𝑓 /𝜕𝑏), the units are Watts per meter (W/m). For example, for a wing
with 𝑐 = 1 m and 𝑏 = 12 m, we have 𝑓 = 1087.85 W and 𝜕 𝑓 /𝜕𝑏 = −41.65 W/m.
This means that for an increase in span of 1 m, a linear approximation predicts a decrease in power of 41.65 W. However, because the function is nonlinear, the
actual power at 𝑏 = 13 𝑚 is 1059.77 W (see Fig. 4.4). The relative derivative for
[Figure 4.4: Power versus span and the corresponding derivative.]
this same design can be computed as (𝜕 𝑓 /𝜕𝑏)(𝑏/ 𝑓 ) = −0.459, which means that
for a 1% increase in span, the linear approximation predicts a 0.459% decrease
in power; the actual decrease is 0.310%.
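A quick numeric check of the relative derivative quoted in this example (values taken from the text):

```python
# Relative derivative (df/db)(b/f) at b = 12 m, c = 1 m
f = 1087.85          # required power in W
dfdb = -41.65        # derivative of power with respect to span in W/m
b = 12.0             # span in m
print(dfdb * b / f)  # approximately -0.459, i.e., -0.459% power per 1% span increase
```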
∇ₚ𝑓(𝑥) ≡ lim_{𝜖→0} [𝑓(𝑥 + 𝜖𝑝) − 𝑓(𝑥)] / 𝜖.    (4.7)

We can find this derivative by projecting the gradient onto the desired direction 𝑝 using the dot product

∇ₚ𝑓(𝑥) = ∇𝑓ᵀ𝑝.    (4.8)
From the gradient projection, we can see why the gradient is the
direction of steepest increase. If we use this definition of the dot
product,
∇ₚ𝑓(𝑥) = ∇𝑓ᵀ𝑝 = ‖∇𝑓‖ ‖𝑝‖ cos 𝜃,    (4.9)
we can see that this is maximized when 𝜃 = 0◦ . That is, the directional
derivative is largest when 𝑝 points in the same direction as ∇ 𝑓 . If
−90◦ < 𝜃 < 90◦ , the directional derivative is positive and is thus in
a direction of increase (Fig. 4.6). If 90◦ < 𝜃 < 270◦ , the directional
derivative is negative and 𝑝 points in a descent direction. Finally, if
𝜃 = ±90◦ , the directional derivative is 0 and thus the function value
does not change and is locally flat in that direction. That condition
occurs if ∇ 𝑓 and 𝑝 are orthogonal, and thus the gradient is always
orthogonal to contour surfaces.
To get the correct slope in the original units of x, the direction should be normalized as p̂ = p/‖p‖.
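To make Eq. 4.8 concrete, the following Python sketch compares a finite-difference estimate of the directional derivative (Eq. 4.7) with the gradient projection; the quadratic function, point, and direction are illustrative assumptions rather than values from the text.

import numpy as np

f = lambda x: x[0]**2 + 2.0 * x[1]**2                  # example function (assumption)
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])  # its gradient

x = np.array([1.0, 1.0])
p = np.array([1.0, -1.0])
p_hat = p / np.linalg.norm(p)      # normalize to get a slope in the original units of x

eps = 1e-6
fd = (f(x + eps * p_hat) - f(x)) / eps   # finite-difference directional derivative (Eq. 4.7)
proj = grad_f(x) @ p_hat                 # gradient projection (Eq. 4.8)
print(fd, proj)                          # both ≈ -1.4142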
[Figure 4.6: The gradient ∇f is always orthogonal to contour lines (surfaces), and the directional derivative in the direction of p is given by ∇fᵀp; it is positive in the ascent halfspace, negative in the descent halfspace, and zero along the contour line tangent.]
[Figure 4.7: Derivative along the direction p.]
In one dimension, the gradient reduces to a scalar (the slope), and the curvature is also a scalar that can be calculated by taking the second derivative. In n dimensions, the curvature is characterized by the second partial derivatives
∂²f / (∂x_i ∂x_j). (4.14)
These can be arranged in a square symmetric matrix, the Hessian,
∇(∇f(x)) ≡ H(x) ≡
[ ∂²f/∂x1²      ∂²f/∂x1∂x2   · · ·  ∂²f/∂x1∂x_n
  ∂²f/∂x2∂x1   ∂²f/∂x2²      · · ·  ∂²f/∂x2∂x_n
  ...           ...                  ...
  ∂²f/∂x_n∂x1  ∂²f/∂x_n∂x2   · · ·  ∂²f/∂x_n² ]. (4.16)
𝐻𝑣 = 𝜅𝑣. (4.19)
whose contours are shown in Fig. 4.8. These contours are concentric ellipses that share the same center and principal axes. The Hessian of this quadratic is
H = [ 2  −1
     −1   4 ], (4.21)
which is constant. To find the curvature in the direction p = [−1/2, −√3/2]ᵀ, we compute
pᵀHp = [−1/2  −√3/2] [ 2  −1 ; −1  4 ] [−1/2, −√3/2]ᵀ = (7 − √3)/2. (4.22)
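As a quick check of Eq. 4.22, the directional curvature pᵀHp can be evaluated numerically; the sketch below uses the Hessian and direction from this example.

import numpy as np

H = np.array([[2.0, -1.0],
              [-1.0, 4.0]])                 # Hessian from Eq. 4.21
p = np.array([-0.5, -np.sqrt(3.0) / 2.0])    # unit-length direction

print(p @ H @ p, (7.0 - np.sqrt(3.0)) / 2.0)  # both ≈ 2.634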
Consider the same polynomial from Ex. 4.1. Differentiating the gradient
we obtained previously, we obtain the Hessian,
H(x1, x2) = [ 6x1   4x2
              4x2   4x1 − 6x2 ]. (4.24)
We can visualize the variation of the Hessian by plotting the principal curvatures
at different points (Fig. 4.9).
Using a Taylor series expansion (see Appendix A.6), we can write
f(x + p) = f(x) + ∇f(x)ᵀ p + ½ pᵀ H(x) p + 𝒪(‖p‖³). (4.25)
We use a second-order Taylor series (ignoring the cubic term)
because it results in a quadratic, which is the lowest order Taylor series
that can have a minimum. For a function that is 𝐶 2 continuous, this
approximation can be made arbitrarily accurate by making 𝛼 small
enough.
Using the gradient and Hessian of the two-variable polynomial from Ex. 4.1
and Ex. 4.5, we can use Eq. 4.25 to construct a second-order Taylor expansion
about 𝑥 (0) ,
f̃(p) = f(x⁽⁰⁾) + [ 3x1² + 2x2² − 20, 4x1x2 − 3x2² ] p + ½ pᵀ [ 6x1  4x2 ; 4x2  4x1 − 6x2 ] p. (4.26)
Figure 4.10 shows the resulting Taylor series expansions about different points.
[Figure 4.10: Second-order Taylor series expansions about three different points (near the saddle point, the minimum, and the maximum).]
f(x* + p) = f(x*) + ∇f(x*)ᵀ p + ½ pᵀ H(x*) p + . . . . (4.27)
For 𝑥 ∗ to be an optimal point, we must have 𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ) for all 𝑝.
This implies that the first- and second-order terms in the Taylor series
have to be non-negative, that is,
∇f(x*)ᵀ p + ½ pᵀ H(x*) p ≥ 0. (4.28)
Because the magnitude of 𝑝 is small, we can always find a 𝑝 such
that the first term dominates. Therefore, we require that
∇ 𝑓 (𝑥 ∗ )𝑇 𝑝 ≥ 0. (4.29)
Because 𝑝 can be in any arbitrary direction, the only way this inequality
can be satisfied is if all the elements of the gradient are zero,
∇ 𝑓 (𝑥 ∗ ) = 0. (4.30)
With the gradient equal to zero, the remaining requirement from Eq. 4.28 is that the second-order term be nonnegative,
pᵀ H(x*) p ≥ 0. (4.31)
From Eq. 4.18, we know that this term represents the curvature in direction p, so this means that the function curvature must be positive or zero when projected onto any direction. You may recognize this as the requirement that the Hessian be positive semidefinite.
[Figure 4.11: Quadratic functions with different types of Hessians, from positive definite to negative definite: (a) positive definite, (b) positive semidefinite, (c) indefinite, (d) negative definite.]
In summary, the necessary optimality conditions for an unconstrained optimization problem are
∇f(x*) = 0,
H(x*) is positive semidefinite.        (4.32)
The sufficient conditions for a local minimum strengthen the second requirement to positive definiteness:
∇f(x*) = 0,
H(x*) is positive definite.        (4.33)
We can find the minima of this function analytically by solving the optimality conditions.
To find the critical points of this function, we solve for the points at which the gradient is equal to zero,
∇f = [ ∂f/∂x1, ∂f/∂x2 ]ᵀ = [ 2x1³ + 6x1² + 3x1 − 2x2, 2x2 − 2x1 ]ᵀ = [ 0, 0 ]ᵀ.
From the second equation, we have that x2 = x1. Substituting this into the first equation yields
x1 (2x1² + 6x1 + 1) = 0.
The solutions of this equation yield three points:
" √ # "√ #
0 − 23 − 27 7
− 32
𝑥 (1)
= , 𝑥 (2)
= √ , 𝑥 (3)
= √2
0 − 23 − 27 7
− 32
2
Differentiating the gradient gives the Hessian,
H(x) = [ 6x1² + 12x1 + 3   −2
         −2                 2 ].
At x⁽¹⁾ = (0, 0), det H = 3 · 2 − (−2)² = 2 > 0. Since the first diagonal element is positive as well, H at this point is positive definite, so x⁽¹⁾ is a local minimum. For the second point,
det H(x⁽²⁾) = det [ 3(3 + √7)  −2 ; −2  2 ] = 14 + 6√7 > 0.
Since 3(3 + √7) > 0 as well, x⁽²⁾ is also a local minimum. For the third point,
det H(x⁽³⁾) = det [ 9 − 3√7  −2 ; −2  2 ] = 14 − 6√7 < 0.
Because the determinant is negative, the Hessian is indefinite, and x⁽³⁾ is a saddle point.
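The classification above can be checked numerically. The sketch below uses the gradient given in this example; the Hessian is obtained by differentiating that gradient by hand (an assumption, but consistent with the determinant values quoted in the text), and each critical point is classified from the Hessian eigenvalues.

import numpy as np

def grad(x):
    return np.array([2 * x[0]**3 + 6 * x[0]**2 + 3 * x[0] - 2 * x[1],
                     2 * x[1] - 2 * x[0]])

def hess(x):
    return np.array([[6 * x[0]**2 + 12 * x[0] + 3, -2.0],
                     [-2.0, 2.0]])

# The three critical points found analytically (with x2 = x1)
for r in [0.0, (-3 - np.sqrt(7)) / 2, (-3 + np.sqrt(7)) / 2]:
    x = np.array([r, r])
    eigs = np.linalg.eigvalsh(hess(x))
    kind = "minimum" if eigs.min() > 0 else ("maximum" if eigs.max() < 0 else "saddle point")
    print(np.round(x, 4), np.round(grad(x), 10), kind)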
[Figure 4.12: Minima and critical points for a polynomial of two variables: a local minimum, a saddle point, and the global minimum.]
These three critical points are shown in Fig. 4.12. To find out which of the two local minima is the global one, we evaluate the function at each of these points. Since f(x⁽²⁾) < f(x⁽¹⁾), x⁽²⁾ is the global minimum.
‖∇f‖_∞ < τ, (4.34)
The quadratic approximations match the local gradient and curvature at the respective points. However, the Taylor series quadratic about the first point (left plot) yields a quadratic without a minimum (the only critical point is a saddle point). The second point (middle plot) yields a quadratic whose minimum is closer to the true minimum. Finally, the Taylor series about the actual minimum point (right plot) yields a quadratic with the same minimum, as would be expected, but we can see how the quadratic model worsens the further we are from the point.
[Figure 4.13: Taylor series quadratic models are only guaranteed to be accurate near the point about which the series is expanded. When the point is far from the optimum, the quadratic model might result in a function without a minimum (left).]
Because the Taylor series is only guaranteed to be a good model locally, we need a globalization strategy to ensure convergence to an optimum. Globalization here means making the algorithm robust enough that it converges to a local optimum regardless of where it starts.
1. Create a model about the current point.
[Figure 4.14: Line search approach.]
For the line search subproblem, we assume that we are given a starting point x(k) and a suitable search direction p(k) along which we are going to search. The line search then operates solely on points along direction p(k) starting from x(k), which can be written as
x(k+1) = x(k) + α p(k), (4.35)
[Figure 4.16: The line search starts from a given point x(k) and searches solely along direction p(k).]
[Figure 4.17: The line search projects the n-dimensional problem onto one dimension, where the independent variable is α.]
It is not worth a lot of effort determining the exact minimum in the p(k) direction because it would not take us any closer to the minimum of the overall function (the dot on the right side of the plot). Instead, we should find a point that is good enough and move on.
The slope of φ with respect to α is φ′(α) = ∇f(x(k) + α p(k))ᵀ p(k), which is the directional derivative along the search direction. The slope at the start of a given line search is
φ′(0) = ∇f(k)ᵀ p(k). (4.38)
Because p(k) must be a descent direction, φ′(0) is always negative.
Fig. 4.20 is a version of the one-dimensional slice from Fig. 4.17 expressed in this notation. The α axis and the slopes scale with the magnitude of p(k).
The simplest line search algorithm to find a "good enough" point relies on the sufficient decrease condition and requires a guess for the step length.
[Figure: Sufficient decrease line, φ(0) + μ1 α φ′(0).]
However, in many cases we have no idea of the scale of the function, so our initial guess may not be suitable. Even if we do have an educated guess for α, it is only a guess, and the first step might not satisfy the sufficient decrease condition.
One simple algorithm that is guaranteed to find a step that satisfies
the sufficient decrease condition is backtracking (Alg. 4.9). This algo-
rithm starts with a maximum step and successively reduces the step
by a constant ratio 𝜌 until it satisfies the sufficient decrease condition
(a typical value is 𝜌 = 0.5). Because our search direction is a descent
direction, we know that if we backtrack enough we will achieve an
acceptable decrease in function value.
Algorithm 4.9: Backtracking line search
Inputs:
𝛼init > 0: Initial step length
0 < 𝜇1 < 1: Sufficient decrease factor
0 < 𝜌 < 1: Backtracking factor
Outputs:
𝛼∗ : Step size satisfying sufficient decrease condition
𝛼 = 𝛼init
while φ(α) > φ(0) + μ1 α φ′(0) do        Is the function value above the sufficient decrease line?
𝛼 = 𝜌𝛼 Backtrack
end while
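A minimal Python sketch of the backtracking idea in Alg. 4.9 follows; only the sufficient decrease test and the constant reduction factor come from the algorithm, while the quadratic test function, starting point, and iteration cap are illustrative assumptions.

import numpy as np

def backtracking(f, grad_f, x, p, alpha_init=1.0, mu1=1e-4, rho=0.5, max_iter=50):
    phi0 = f(x)
    dphi0 = grad_f(x) @ p               # slope at the start of the line search
    alpha = alpha_init
    for _ in range(max_iter):
        if f(x + alpha * p) <= phi0 + mu1 * alpha * dphi0:
            break                       # sufficient decrease satisfied
        alpha *= rho                    # backtrack
    return alpha

f = lambda x: x[0]**2 + 4.0 * x[1]**2
grad_f = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
x0 = np.array([1.0, 1.0])
p = -grad_f(x0)                         # a descent direction
print(backtracking(f, grad_f, x0, p, alpha_init=1.2))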
Suppose we do a line search starting from x = (−1.25, 1.25) in the direction p = [4, 0.75], as shown in Fig. 4.22. Applying the backtracking algorithm with μ1 = 10⁻⁴ and ρ = 0.7 produces the iterations shown in Fig. 4.23. The sufficient decrease line appears to be horizontal, but it just has a small slope because μ1 is small. Using a large initial step of α_init = 1.2 (left), several iterations are required. For a small initial step of α_init = 0.05 (right), the algorithm satisfies sufficient decrease at the first iteration but misses out on further decreases.
[Figure 4.22: Line search direction.]
[Figure 4.23: Backtracking using different initial steps.]
One undesirable scenario is when the initial step is far too large, and the step sizes that satisfy sufficient decrease are smaller than the starting step by several orders of magnitude. Depending on the value of ρ, this scenario requires a large number of backtracking evaluations.
The other undesirable scenario is where our initial guess immedi-
ately satisfies sufficient decrease, but the slope of the function at this
point is still highly negative and we could have decreased the function
value by much more if we had taken a larger step. In this case, our
guess for the initial step is far too small.
Even if our original step size is not too far from an acceptable step
size, the basic backtracking algorithm ignores any information we have
about the function values and its gradients, and blindly takes a reduced
step based on a preselected ratio 𝜌. We can make more intelligent
estimates of where an acceptable step is based on the evaluated function
values (and gradients, if available). In the next section, we introduce a
more sophisticated line search algorithm that is able to deal with these
scenarios much more efficiently.
By comparing the slope of the function at the candidate point with the slope at the start of the line search, we can get an idea of whether the function is "bottoming out," or flattening, using the curvature condition
|φ′(α)| ≤ μ2 |φ′(0)|.
This condition requires that the magnitude of the slope at the new point
be lower than the magnitude of the slope at the start of the line search
by a factor of 𝜇2 . This requirement is called the curvature condition
because by comparing the two slopes, we are effectively quantifying
the curvature of the function. Typical values of 𝜇2 range from 0.1 to 0.9,
and the best value depends on the method for determining the search
direction and is also problem dependent. To guarantee that there are
steps that satisfy both sufficient decrease and sufficient curvature, the
sufficient decrease slope must be shallower than the sufficient curvature
slope, that is, 0 < 𝜇1 ≤ 𝜇2 ≤ 1. As 𝜇2 tends to zero, enforcing the
sufficient curvature condition tends toward an exact line search.
The sign of the slope at a point satisfying this condition is not
important; all that matters is that the function be shallow enough. The
idea is that if the slope 𝜙0(𝛼) is still negative with a magnitude similar
to the slope at the start of the line search, then the step is too small, and
we expect the function to decrease even further by taking a larger step.
If the slope 𝜙0(𝛼) is positive with a magnitude similar to that at the start
of the line search, then the step is too large, and we expect to decrease
the function further by taking a smaller step. On the other hand, when
the slope is shallow enough (either positive or negative), we assume
that the candidate point is near a local minimum and additional effort
will yield only incremental benefits that are wasteful in the context of
our larger problem. The sufficient decrease and curvature conditions
are collectively known as the strong Wolfe conditions. Figure 4.24 shows
acceptable intervals that satisfy the strong Wolfe conditions.
[Figure 4.24: Intervals of α that satisfy the strong Wolfe conditions, bounded by the sufficient decrease line and the curvature condition.]
The algorithm described next finds a step satisfying the strong Wolfe conditions. Note that using the curvature condition means we require derivative information (φ′). There are various line search algorithms in the literature, including some that are derivative-free. Here, we detail a line search algorithm similar to that presented by Nocedal and Wright.⁵⁸ The algorithm has two stages:
58. Nocedal et al., Numerical Optimization. 2006
1. The bracketing stage finds an interval within which there are points that satisfy the strong Wolfe conditions.
2. The pinpointing stage finds a point that satisfies the strong Wolfe conditions within the interval provided by the bracketing stage.
1. The function value at the candidate step is higher than at the start
of the line search.
If the step satisfies sufficient decrease and the slope is negative, the step
size is increased to look for a larger function value reduction along the
line.
Algorithm 4.11: Bracketing stage of the line search
Inputs:
𝛼1 > 0: Initial step size guess
0 < 𝜇1 < 1: Sufficient decrease factor
0 < 𝜇2 < 1: Sufficient curvature factor
𝜌 > 1: Step size increase factor
Outputs:
𝛼∗ : Step size satisfying strong Wolfe conditions
𝛼0 = 0
𝑖=1
while true do
Evaluate 𝜙(𝛼 𝑖 )
if [𝜙(𝛼 𝑖 ) > 𝜙(0) + 𝜇1 𝛼 𝑖 𝜙0 (0)] or [𝜙(𝛼 𝑖 ) > 𝜙(𝛼 𝑖−1 ) and 𝑖 > 1] then
𝛼∗ = pinpoint(𝛼 𝑖−1 , 𝛼 𝑖 ) return 𝛼∗
end if
Evaluate 𝜙0 (𝛼 𝑖 )
if |𝜙0 (𝛼 𝑖 )| ≤ −𝜇2 𝜙0 (0) then return 𝛼∗ = 𝛼 𝑖
else if 𝜙0 (𝛼 𝑖 ) ≥ 0 then
𝛼∗ = pinpoint(𝛼 𝑖 , 𝛼 𝑖−1 ) return 𝛼 ∗
else
𝛼 𝑖+1 = 𝜌𝛼 𝑖 𝜌 > 1, e.g. 2
end if
𝑖 = 𝑖+1
end while
The algorithm for the second stage, the pinpoint(α_low, α_high) function, is given in Alg. 4.12. In the first step, we need to estimate a good candidate point within the interval that is expected to satisfy the strong Wolfe conditions. A number of algorithms can be used to find such a point. Since we have the function value and derivative at one endpoint of the interval, and at least the function value at the other endpoint, one option is to use quadratic or cubic interpolation.
Algorithm 4.12: Pinpointing stage of the line search
Inputs:
𝛼low : Lower limit for pinpoint function
𝛼high : Upper limit for pinpoint function
0 < 𝜇1 < 1: Sufficient decrease factor
0 < 𝜇2 < 1: Sufficient curvature factor
Outputs:
𝛼∗ : Step size satisfying strong Wolfe conditions
𝑗=0
while true do
Find 𝛼low ≤ 𝛼 𝑗 ≤ 𝛼high Using quadratic (4.41) or cubic interpolation
Evaluate 𝜙(𝛼 𝑗 )
if 𝜙(𝛼 𝑗 ) > 𝜙(0) + 𝜇1 𝛼 𝑗 𝜙0 (0) or 𝜙(𝛼 𝑗 ) > 𝜙(𝛼 low ) then
𝛼high = 𝛼 𝑗
else
Evaluate 𝜙0 (𝛼 𝑗 )
if |𝜙0 (𝛼 𝑗 )| ≤ −𝜇2 𝜙0 (0) then
𝛼∗ = 𝛼 𝑗 return 𝛼∗
else if 𝜙0 (𝛼 𝑗 )(𝛼high − 𝛼 low ) ≥ 0 then
𝛼high = 𝛼low
end if
𝛼low = 𝛼 𝑗
end if
𝑗 = 𝑗+1
end while
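The two stages can be combined into a single routine, sketched below in Python. It follows the structure of Algs. 4.11 and 4.12 but uses bisection in place of the quadratic or cubic interpolation, which keeps the sketch short at the cost of efficiency; the test function in the usage example is an assumption.

import numpy as np

def wolfe_line_search(phi, dphi, alpha_init=1.0, mu1=1e-4, mu2=0.9, rho=2.0, max_iter=50):
    phi0, dphi0 = phi(0.0), dphi(0.0)   # values at the start of the line search

    def pinpoint(alo, ahi):
        for _ in range(max_iter):
            a = 0.5 * (alo + ahi)       # bisection instead of interpolation
            if phi(a) > phi0 + mu1 * a * dphi0 or phi(a) > phi(alo):
                ahi = a
            else:
                da = dphi(a)
                if abs(da) <= -mu2 * dphi0:
                    return a            # strong Wolfe conditions satisfied
                elif da * (ahi - alo) >= 0:
                    ahi = alo
                alo = a
        return a

    a_prev, a = 0.0, alpha_init         # bracketing stage
    for i in range(max_iter):
        if phi(a) > phi0 + mu1 * a * dphi0 or (i > 0 and phi(a) > phi(a_prev)):
            return pinpoint(a_prev, a)
        da = dphi(a)
        if abs(da) <= -mu2 * dphi0:
            return a
        elif da >= 0:
            return pinpoint(a, a_prev)
        a_prev, a = a, rho * a
    return a

# Usage on an assumed quadratic, searching along the steepest-descent direction
f = lambda x: x[0]**2 + 4.0 * x[1]**2
g = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
x0 = np.array([1.0, 1.0]); p = -g(x0)
phi = lambda a: f(x0 + a * p)
dphi = lambda a: g(x0 + a * p) @ p
print(wolfe_line_search(phi, dphi, alpha_init=0.05))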
[Figure 4.26: Visual representation of the pinpointing algorithm.]
The line search defined by Alg. 4.11 followed by Alg. 4.12 is guar-
anteed to find a step length satisfying the strong Wolfe conditions
for any parameters 𝜇1 and 𝜇2 . A robust algorithm needs to consider
additional issues. One of these criteria is to ensure that the new point
in the pinpoint algorithm is not so close to an endpoint as to cause the
interpolation to be ill conditioned. A fall-back option in case the inter-
polation fails could be a simpler algorithm, such as bisection. Another
of these criteria is to ensure that the loop does not continue indefinitely
because finite-precision arithmetic leads to indistinguishable function
value changes.
[Figure 4.27: Example of a line search iteration, showing the bracketing and pinpointing phases for two different initial steps.]
Let us perform the same line search as in the previous backtracking example (Alg. 4.9), but now using bracketing and pinpointing instead. Using a large initial step of α_init = 1.2 (left), bracketing is achieved in the first iteration. Then pinpointing finds a point better than the one found using backtracking. The small initial step of α_init = 0.05 (right) does not satisfy the strong Wolfe conditions, and the bracketing stage moves forward as long as the function keeps decreasing. The end result is a point that is much better than the one obtained with backtracking.
p = −∇f, (4.42)
or as a normalized direction
p(k) = −∇f(k) / ‖∇f(k)‖. (4.43)
While steepest descent sounds like the best possible search direction
to decrease a function, it actually is not. The reason is that when a
function curvature varies greatly with direction, the gradient alone is a
poor representation of function behavior beyond a small neighborhood.
When β = 1 (left), this quadratic has the same curvature in all directions, and the steepest-descent direction points directly to the minimum. When β > 1 (middle and right), this is no longer the case, and steepest descent shows abrupt changes in the subsequent search directions. This zigzagging is an inefficient way to approach the minimum. The greater the difference in curvature, the more iterations it takes.
[Figure 4.28: Iteration history for a quadratic function using the steepest descent method with an exact line search.]
We can derive this zigzagging behavior from the condition for an exact line search: at the end of each line search, the derivative of the function with respect to α must be zero,
∂f(x(k) + α p(k)) / ∂α = 0
∂f(x(k+1)) / ∂α = 0
[∂f(x(k+1)) / ∂x(k+1)]ᵀ ∂(x(k) + α p(k)) / ∂α = 0        (4.44)
∇f(k+1)ᵀ p(k) = 0
−p(k+1)ᵀ p(k) = 0.
Hence each search direction is orthogonal to the previous one. As
discussed in the last section, exact line searches are not desirable, so
the search directions are not precisely orthogonal. However, the overall
zigzagging behavior still exists.
Another issue with steepest descent is that the gradient at the current
point on its own does not provide enough information to inform a good
guess of the initial step size. As we saw in the line search, this initial
choice has a large impact on the efficiency of the line search because
the first guess could be orders of magnitude too small or too large.
Second-order methods later in this section will help with this problem.
In the meantime we can make a guess of the step size for a given line
search based on the result of the previous one. If we assume that at the
current line search we will obtain a decrease in objective function that
is comparable to the previous one, we can write
α(k) ∇f(k)ᵀ p(k) ≈ α(k−1) ∇f(k−1)ᵀ p(k−1). (4.45)
Solving for the step length and inserting the steepest-descent direction, we get the guess
α(k) = α(k−1) ‖∇f(k−1)‖² / ‖∇f(k)‖². (4.46)
This is just the first guess in the new line search, which will then proceed as usual. If the slope of the function decreases relative to the previous line search, this guess decreases relative to the previous line search step length, and vice versa.
Example 4.15: Steepest descent applied to the bean function
Consider the bean function,
f(x1, x2) = (1 − x1)² + (1 − x2)² + ½ (2x2 − x1²)², (4.47)
which we minimize using the steepest descent algorithm with a two-stage line search.
[Figure 4.29: Steepest descent optimization path (34 iterations).]
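A compact Python sketch of steepest descent on the bean function (Eq. 4.47) is given below. It uses a simple backtracking line search rather than the two-stage line search mentioned in the example, the starting point is an assumption, and the step-size guess carried between line searches follows Eq. 4.46.

import numpy as np

def bean(x):
    return (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2

def bean_grad(x):
    return np.array([-2 * (1 - x[0]) - 2 * x[0] * (2 * x[1] - x[0]**2),
                     -2 * (1 - x[1]) + 2 * (2 * x[1] - x[0]**2)])

def steepest_descent(x, tol=1e-6, alpha=0.1, max_iter=2000):
    g = bean_grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g, np.inf) <= tol:
            break
        p = -g
        while bean(x + alpha * p) > bean(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5                       # backtracking line search
        x = x + alpha * p
        g_new = bean_grad(x)
        alpha *= (g @ g) / (g_new @ g_new)     # first guess for the next line search (Eq. 4.46)
        g = g_new
    return x

print(steepest_descent(np.array([0.0, 2.0])))  # approaches the minimum near (1.213, 0.824)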
The conjugate gradient direction combines the current steepest-descent direction with a contribution from the previous one. When the function steepens, the damping becomes larger, and vice versa.
Minimizing the same bean function from Ex. 4.15 with the same line search algorithm and settings, we get the optimization path shown in Fig. 4.32. The changes in direction for the conjugate gradient method are smaller than for steepest descent, and it takes fewer iterations to achieve the same convergence tolerance.
[Figure 4.32: Conjugate gradient optimization path (22 iterations).]
Newton's method uses a local quadratic model based on the second-order Taylor series about the current iterate,
f(x(k) + s(k)) = f(k) + ∇f(k)ᵀ s(k) + ½ s(k)ᵀ H(k) s(k) + . . . , (4.51)
where 𝑠 (𝑘) is some vector centered at 𝑥 (𝑘) . We can find the step 𝑠 (𝑘) that
minimizes this quadratic model (ignoring the higher-order terms). We
do this by taking the derivative with respect to 𝑠 (𝑘) and setting that
equal to zero:
d f(x(k) + s(k)) / d s(k) = ∇f(k) + H(k) s(k) = 0
H(k) s(k) = −∇f(k)                              (4.52)
s(k) = −H(k)⁻¹ ∇f(k).
2. Problem: The predicted new point x(k) + s(k) is based on a second-order approximation and so may not actually yield a good point. In fact, the new point could be worse: f(x(k) + s(k)) > f(x(k)). Because the search direction s(k) is a descent direction, if we backtrack enough, our search direction will yield a function decrease.
[Figure 4.33: Iteration history for a quadratic function using an exact line search and Newton's method; unsurprisingly, only one iteration is required.]
Minimizing the same bean function from Ex. 4.15 and Ex. 4.17, we get the optimization path shown in Fig. 4.34. Newton's method takes fewer iterations to achieve the same convergence tolerance.
[Figure 4.34: Newton optimization path (8 iterations).]
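A bare-bones Newton iteration on the same bean function is sketched below. The Hessian is differentiated by hand from Eq. 4.47 (an assumption, since it is not given in the text), and no line search or other safeguard is included, so the assumed starting point lies in a region where the Hessian is positive definite.

import numpy as np

bean_grad = lambda x: np.array([-2 * (1 - x[0]) - 2 * x[0] * (2 * x[1] - x[0]**2),
                                -2 * (1 - x[1]) + 2 * (2 * x[1] - x[0]**2)])
bean_hess = lambda x: np.array([[2 + 6 * x[0]**2 - 4 * x[1], -4 * x[0]],
                                [-4 * x[0], 6.0]])

def newton(x, tol=1e-10, max_iter=20):
    for _ in range(max_iter):
        g = bean_grad(x)
        if np.linalg.norm(g, np.inf) <= tol:
            break
        x = x + np.linalg.solve(bean_hess(x), -g)   # Newton step, Eq. 4.52
    return x

print(newton(np.array([1.5, 1.0])))   # converges to roughly (1.213, 0.824)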
4.4.4 Quasi-Newton Methods
f′ ≈ ( f(k+1) − f(k) ) / ( x(k+1) − x(k) ), (4.53)
Using the step from the latest line search, s(k) = x(k+1) − x(k) = α(k) p(k), we can write the secant condition
B(k+1) s(k) = ∇f(k+1) − ∇f(k). (4.55)
[Figure 4.35: A secant line at point x(k) used to estimate f′(k).]
This states that the projection of the approximate Hessian onto s(k) must yield the same curvature predicted by taking the difference between the gradients. The secant condition provides a requirement consisting
of 𝑛 equations where the step and the gradients are known. However,
there are 𝑛(𝑛 + 1)/2 unknowns in the approximate Hessian (recall that
it is a symmetric matrix), so this is not sufficient to determine 𝐵. There
is another requirement, which is that 𝐵 must be positive definite. This
yields another 𝑛 but that still leaves us with an infinite number of
possibilities for 𝐵.
Given that B must be positive definite, the secant condition (4.55) is only possible if the predicted curvature is positive along the step, that is,
s(k)ᵀ ( ∇f(k+1) − ∇f(k) ) > 0. (4.56)
This is called the curvature condition, which is automatically satisfied if the line search finds a step that satisfies the strong Wolfe conditions.
Davidon, Fletcher, and Powell devised an effective strategy to estimate the Hessian.¹⁵,¹⁶ Because there is an infinite number of solutions, they formulated a way to select H by picking the one that was "closest" to the Hessian of the previous iteration while still satisfying the requirements above: the secant condition, symmetry, and positive definiteness. This turns out to be an optimization problem in itself, but with an analytic solution. This led to the DFP method, which was a very impactful idea in the field of nonlinear optimization.
15. Davidon, Variable Metric Method for Minimization. 1991
16. Fletcher et al., A Rapidly Convergent Descent Method for Minimization. 1963
This method was soon superseded by the BFGS method developed by Broyden, Fletcher, Goldfarb, and Shanno,⁵⁹⁻⁶² so we focus on that method instead. They started with the observation that what we ultimately want is the inverse of the Hessian so that we can predict the next search direction.
59. Broyden, The Convergence of a Class of Double-rank Minimization Algorithms 1. General Considerations. 1970
60. Fletcher, A new approach to variable metric algorithms. 1970
61. Goldfarb, A family of variable-metric methods derived by variational means. 1970
Their key insight was that rather than estimating the Hessian and then solving a linear system, we should directly estimate the Hessian inverse. We denote the Hessian inverse approximation by V (with V(k) = H(k)⁻¹), so the search direction prediction becomes p(k) = −V(k) ∇f(k). The resulting update (the well-known BFGS formula for the inverse Hessian) can be written as
V(k+1) = (I − σ s(k) y(k)ᵀ) V(k) (I − σ y(k) s(k)ᵀ) + σ s(k) s(k)ᵀ,   with σ = 1 / (s(k)ᵀ y(k)),
where
s(k) = x(k+1) − x(k) = α(k) p(k)
is the step that resulted from the last line search. The other important term is the estimate of the curvature in the direction of that line search, which is given by the difference between the gradients at the end and start of the line search (the last two major iterations),
y(k) = ∇f(k+1) − ∇f(k).
If we initialize the inverse Hessian estimate with the identity matrix, the first search direction is
p⁰ = −∇f⁰, (4.62)
and thus the first step is a steepest-descent step. Subsequent iterations use information from the previous Hessian inverse, the direction and length of the last step, and the difference in the last two gradients to improve the estimate of the Hessian inverse.
The optimization problem (4.60) does not explicitly include a constraint on positive definiteness. It turns out that this update formula always produces a V(k+1) that is positive definite as long as V(k) is positive definite. Therefore, if we start with an identity matrix as suggested above, all subsequent updates produce positive definite matrices.
Minimizing the same bean function from the previous examples using BFGS, we get the optimization path shown in Fig. 4.36. We initialize the inverse Hessian to the identity matrix. Using the BFGS update procedure, after two iterations, with x(2) = (0.065647, −0.219401), the inverse Hessian approximation is
V(2) = [  0.320199  −0.100560
         −0.100560   0.219681 ].
[Figure 4.36: BFGS optimization path (7 iterations).]
Instead of the identity matrix, a common choice for the initial inverse Hessian estimate is the scaled matrix
V(0) = ( sᵀy / yᵀy ) I, (4.65)
where the 𝑠 and 𝑦 values on the right hand side would use the previous
iteration. The algorithm is summarized in Alg. 4.20.
Algorithm 4.20: Compute the product of the inverse Hessian and a vector using the BFGS update rule
Inputs:
∇f(k): Gradient at point x(k)
s(k−1,…,k−m): History of steps x(k) − x(k−1)
y(k−1,…,k−m): History of gradient differences ∇f(k) − ∇f(k−1)
Outputs:
d: Desired product V(k) ∇f(k)

d = ∇f(k)
for i = k − 1 to k − m by −1 do
    α(i) = σ(i) s(i)ᵀ d        where σ(i) = 1 / (s(i)ᵀ y(i))
    d = d − α(i) y(i)
end for
V(0) = ( s(k−1)ᵀ y(k−1) / y(k−1)ᵀ y(k−1) ) I
d = V(0) d
for i = k − m to k − 1 do
    β(i) = σ(i) y(i)ᵀ d
    d = d + (α(i) − β(i)) s(i)
end for
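A Python sketch of the two-loop recursion in Alg. 4.20 is given below. The step and gradient-difference histories are assumed to be stored from oldest to newest, and σ(i) = 1/(s(i)ᵀ y(i)) is assumed, consistent with the BFGS update.

import numpy as np

def inverse_hessian_product(grad_k, s_hist, y_hist):
    """Return d = V(k) grad_k using the limited history of steps and gradient differences."""
    d = grad_k.copy()
    sigma = [1.0 / (s @ y) for s, y in zip(s_hist, y_hist)]
    alpha = [0.0] * len(s_hist)
    for i in reversed(range(len(s_hist))):          # newest to oldest
        alpha[i] = sigma[i] * (s_hist[i] @ d)
        d = d - alpha[i] * y_hist[i]
    s, y = s_hist[-1], y_hist[-1]
    d = (s @ y) / (y @ y) * d                        # scaled initial estimate V(0)
    for i in range(len(s_hist)):                     # oldest to newest
        beta = sigma[i] * (y_hist[i] @ d)
        d = d + (alpha[i] - beta) * s_hist[i]
    return d                                         # the search direction is then p = -d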
[Figure 4.38: Two-spring system with no applied force (top) and with applied force mg (bottom).]
[Figure 4.39: Minimizing the total potential for the two-spring system using four methods, which take 59, 39, 17 (quasi-Newton), and 5 (Newton) iterations.]
The convergence history for the four methods is shown in Fig. 4.41. All four methods use an inexact line search with the same parameters and a convergence tolerance of ‖∇f‖∞ ≤ 10⁻⁶. Compared to the previous two examples, the difference between steepest descent and the other methods is much more dramatic (two orders of magnitude more iterations!), owing to the more challenging variation in the curvature (recall Ex. 4.14).
Steepest descent does converge, but it takes a large number of iterations because it bounces between the steep walls of the valley. One of the line searches gets lucky and takes a shortcut to another part of the valley, but even then it cannot make up for its inherent inefficiency. The conjugate gradient method is much more efficient because it damps the steepest-descent oscillations with a contribution from the previous direction. Eventually, the conjugate gradient method achieves superlinear convergence near the optimum, which saves many iterations in the last several orders of magnitude of the convergence criterion. The methods that use second-order information are even more efficient, exhibiting quadratic convergence in the last few iterations.
[Figure 4.41: Convergence of the four methods shows the dramatic difference between the linear convergence of steepest descent, the superlinear convergence of the conjugate gradient method, and the quadratic convergence of the methods that use second-order information.]
minimize 𝑓˜(𝑠)
by varying 𝑠 (4.67)
subject to k𝑠 k ≤ Δ,
where 𝑓˜(𝑠) is the local trust-region model, 𝑠 is the step from the current
iteration point, and Δ is the size of the trust region. Note that we use
the notation 𝑠 instead of 𝑝 to indicate that this is a step vector (direction
and magnitude) and not just the direction 𝑝 used in line search based
methods.
The subproblem above defines the trust region using a norm. The Euclidean norm, ‖s‖₂, defines a spherical trust region and is the most common type of trust region. Sometimes ∞-norms are used instead because they are easy to apply, but 1-norms are rarely used because they are just as complex as 2-norms and introduce sharp corners that are sometimes problematic.⁶⁴ The shape of the trust region, dictated by the norm, can have a significant impact on the convergence rate. The ideal trust-region shape depends on the local function space, and some algorithms allow the trust-region shape to change throughout the optimization.
64. Conn et al., Trust Region Methods. 2000
The most common trust-region subproblem uses a quadratic model,
minimize   f̃(s) = f(k) + ∇f(k)ᵀ s + ½ sᵀ B(k) s
subject to ‖s‖₂ ≤ Δ(k),        (4.68)
where B(k) is the exact or approximate Hessian. The quality of the model is assessed by comparing the actual function decrease with the decrease predicted by the model,
r = ( f(x) − f(x + s) ) / ( f̃(0) − f̃(s) ). (4.69)
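The ratio r is used both to accept or reject the step and to resize the trust region. A minimal sketch of one common update rule follows; the thresholds and the contraction and expansion factors are assumptions, not values from the text.

def trust_region_update(r, delta, delta_max, step_at_boundary):
    """Update the trust-region radius based on the actual-to-predicted decrease ratio r (Eq. 4.69)."""
    if r < 0.25:                              # poor model: shrink the trust region
        delta = 0.25 * delta
    elif r > 0.75 and step_at_boundary:
        delta = min(2.0 * delta, delta_max)   # good model and step limited by the region: expand
    accept = r > 0.0                          # accept the step only if the function actually decreased
    return delta, accept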
Example 4.26: Minimizing total potential energy of spring system with trust region
Minimizing the total potential energy function from Ex. 4.22 using a trust-region method starting from the same points as before yields the optimization path shown in Fig. 4.44. The initial trust-region size is Δ = 0.3, and the maximum allowable size is Δ = 1.5. The convergence criterion is based on the difference between subsequent iterations, such that ‖x(k) − x(k−1)‖ ≤ 10⁻⁸. The first few quadratic approximations do not have a minimum because the function has negative curvature around the starting point, but the trust region prevents steps that are too large. Once the iterations get close enough to the bowl containing the minimum, the quadratic approximation has a minimum, and the trust-region subproblem yields a minimum within the trust region. In the last few iterations, the quadratic is a good model, and therefore the region remains large.
[Figure 4.44: Minimizing the total potential for the two-spring system using a trust-region method.]
An approximate Hessian (or the exact Hessian) can be supplied. In fact, several optimization packages require the user to provide the full Hessian in order to use a trust-region approach. Trust-region methods generally require fewer iterations than quasi-Newton methods, but each iteration is more computationally expensive because of the need for at least one matrix factorization.
Scaling can also be more challenging with trust-region approaches.
Newton’s method is invariant with scaling, but the use of a Euclidean
trust-region constraint implicitly assumes that the function changes in
each direction at a similar rate. Some enhancements try to address this
issue through the use of elliptical trust regions rather than spherical
ones.
4.6 Summary
Problems
4.4 Review Kepler’s wine barrel story from Section 2.2. Approximate
the barrel as a cylinder and find the height and diameter of a
barrel that maximizes its volume for a diagonal measurement of
1 m.
4.6 Consider a slightly modified version of the function from Prob. 4.5, where we add an x2⁴ term. Can you find the critical points analytically? Plot the function contours. Locate the critical points graphically and classify them.
4.7 Line search algorithm implementation. Implement the two line search
algorithms from Section 4.3, such that they work in 𝑛 dimensions
(𝑥 and 𝑝 can be vectors of any size).
a) As a first test for your code, reproduce the results from the
examples in Section 4.3 and plot the function and iterations
for both algorithms. For the line search that satisfies the
strong Wolfe conditions, reduce the value of 𝜇2 until you get
an exact line search. How much accuracy can you achieve?
b) Test your code on another easy two-dimensional function,
such as the bean function from Ex. 4.15, starting from differ-
ent points and using different directions (but remember that
you must always provide valid descent direction, otherwise
the algorithm might not work!). Does it always find a suit-
able point? Exploration: Try different values of 𝜇2 and 𝜌 to
analyze their effect on the number of iterations.
c) Apply your line search algorithms to the two-dimensional
Rosenbrock function and then the 𝑛-dimensional variant
(see Appendix C.1.2). Again, try different points and search
directions to see how robust the algorithm is, and try to tune
𝜇2 and 𝜌.
a) For your first test problem, reproduce the results from the
examples in Section 4.4.
b) Minimize the two-dimensional Rosenbrock function (see
Appendix C.1.2) using the various algorithms and compare
your results starting from 𝑥 = (−1, 2). Compare the total
number of evaluations. Compare the number of minor
versus major iterations. Discuss the trends. Exploration: Try
different starting points and tuning parameters (e.g., 𝜌 and
𝜇2 in the line search) and compare the number of major and
minor iterations.
c) Benchmark your algorithms on the 𝑛-dimensional variant
of the Rosenbrock function (see Appendix C.1.2). Try 𝑛 = 3
and 𝑛 = 4 first, then 𝑛 = 8, 16, 32, . . .. What is the highest
number of dimensions you can solve? How does the number
of function evaluations scale with the number of variables?
d) Optional: Implement L-BFGS and compare it with BFGS.
4.11 Aircraft wing design. We will solve the aircraft wing design problem
described in Appendix C.1.6.
5 Constrained Gradient-Based Optimization
minimize   f(x)
by varying x_i,  i = 1, . . . , n_x
subject to g_j(x) ≤ 0,  j = 1, . . . , n_g        (5.1)
           h_k(x) = 0,  k = 1, . . . , n_h
           x̲_i ≤ x_i ≤ x̄_i,  i = 1, . . . , n_x
where the g_j(x) are the inequality constraints, h_k(x) are the equality constraints, and x̲ and x̄ are the lower and upper bounds on
the design variables. Both objective and constraint functions can
be nonlinear, but they should be 𝐶 2 continuous to be solved using
gradient-based optimization algorithms. The inequality constraints
are expressed as “less than” without loss of generality because they
can always be converted to “greater than” by putting a negative sign
on g_j. We could also eliminate the equality constraints without loss of generality by replacing each one with two inequality constraints, h_k ≤ ε and −h_k ≤ ε, where ε is some small number. In practice, numerical
precision and the implementations of many methods make it desirable
to distinguish between equality and inequality constraints.
𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ). (5.5)
Given the Taylor series expansion (5.4), the only way that this inequality
can be satisfied is if
∇ 𝑓 (𝑥 ∗ )𝑇 𝑝 ≥ 0. (5.6)
For a given ∇ 𝑓 (𝑥 ∗ ), there are always an infinite number of directions
along which the function decreases, which correspond to the halfspace
shown in Fig. 5.2. If the problem is unconstrained then 𝑝 can be in any
direction and the only way to satisfy this inequality is if ∇ 𝑓 (𝑥 ∗ ) = 0.
Therefore, ∇ 𝑓 (𝑥 ∗ ) = 0 is a necessary condition for an unconstrained
minimum.
[Figure 5.2: The gradient ∇f(x), which is the direction of steepest function increase, splits the design space into two halves; the highlighted halfspace contains the directions that result in a function decrease.]
Again, the step size is assumed to be small enough so that the higher-
order terms are negligible. Assuming we are at a feasible point, then
ℎ 𝑗 (𝑥) = 0 for all constraints 𝑗. To remain feasible, the step, 𝑝, must be
such that the new point is also feasible, i.e., ℎ 𝑗 (𝑥 + 𝑝) = 0 for all 𝑗. This
implies that the feasibility of the new point requires
∇ℎ 𝑗 (𝑥)𝑇 𝑝 = 0, 𝑗 = 1, . . . , 𝑛 ℎ , (5.8)
∇f(x*)ᵀ p ≥ 0 (5.9)
∇h_j(x*)ᵀ p = 0,  j = 1, . . . , n_h. (5.10)
The only way to satisfy the inequality Eq. 5.9 for every direction p that satisfies Eq. 5.10 is the case when Eq. 5.9 holds with equality (zero). This is because a hyperplane in n_x dimensions includes directions in the descent halfspace unless the hyperplane is perpendicular to ∇f(x*) (see Fig. 5.4 for an illustration in 3-D space).
∇f(x*) = − Σ_{j=1}^{n_h} λ_j ∇h_j(x*), (5.11)
Setting the derivatives of the Lagrangian with respect to the design variables and the Lagrange multipliers to zero yields
∂ℒ/∂x_i = ∂f/∂x_i + Σ_{j=1}^{n_h} λ_j ∂h_j/∂x_i = 0,  i = 1, . . . , n_x,
∂ℒ/∂λ_j = h_j = 0,  j = 1, . . . , n_h.        (5.13)
∂ℒ/∂x1 = 1 + ½ λ x1 = 0
∂ℒ/∂x2 = 2 + 2 λ x2 = 0        (5.18)
∂ℒ/∂λ = ¼ x1² + x2² − 1 = 0.
Solving these three equations for the three unknowns (x1, x2, λ), we obtain two possible solutions:
x_A = [ −√2, −√2/2 ]ᵀ,  λ_A = √2;        x_B = [ √2, √2/2 ]ᵀ,  λ_B = −√2. (5.19)
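The two solutions in Eq. 5.19 can be reproduced symbolically; the SymPy sketch below forms the Lagrangian for this example and solves the stationarity conditions of Eq. 5.18.

import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lambda", real=True)
L = x1 + 2 * x2 + lam * (x1**2 / 4 + x2**2 - 1)     # Lagrangian for this example

stationarity = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(stationarity, [x1, x2, lam], dict=True))
# -> {x1: -sqrt(2), x2: -sqrt(2)/2, lambda: sqrt(2)} and the opposite-sign solution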
These two points are shown in Fig. 5.5, together with the objective function contours and the constraint.
[Figure 5.5: The two points that satisfy the first-order optimality conditions: the minimum x_A and the maximum x_B.]
To classify these points, we examine the Hessian of the Lagrangian with respect to the design variables,
∇_xx ℒ = [ λ/2   0
           0    2λ ]. (5.20)
The Hessian is only positive definite for the case where λ_A = √2, and therefore x_A is an optimum. Recall that the Hessian only needs to be positive definite in the feasible directions, but here it is easy to show that it is positive definite (or, for λ_B, negative definite) in all directions.
[Figure 5.6: The minimum of the Lagrangian function with the optimum Lagrange multiplier value (λ = √2) is the constrained minimum of the original problem.]
For inequality constrained problems, we again require that there be no feasible descent direction at the optimum, that is,
∇f(x*)ᵀ p ≥ 0, (5.21)
for every feasible direction p.
For a given candidate point that satisfies the constraints, there are
two possibilities to consider for each constraint: whether the constraint
is inactive (𝑔 𝑗 (𝑥) < 0) or active (𝑔 𝑗 (𝑥) = 0). If a given constraint is
inactive, then we do not need to add any conditions because we can
take a step, 𝑝, in any direction and remain feasible as long as the step is
small enough. If a given constraint is active, then we can treat it as an
equality constraint.
Thus, the optimality conditions derived for the equality constrained
case can be reused here, but with a crucial modification. First, the
requirement that the gradient of the objective is a linear combination
of the gradients of the constraints, only needs to consider the active
constraints. This can be written as
∇f(x*) = − Σ_{j=1}^{n_g,active} σ_j ∇g_j(x*), (5.23)
where the σ_j are the Lagrange multipliers associated with the active inequality constraints.
[Figure 5.7: Constrained minimum conditions for a 2-D case with one inequality constraint. The objective function gradient must be parallel to the constraint gradient and have the opposite direction (corresponding to a positive Lagrange multiplier) so that there are no feasible descent directions.]
We need to include all inequality constraints in the optimality conditions because we do not know in advance which constraints are active. To do this, we replace the inequality constraints g_k ≤ 0 with the equality constraints
g_k + s_k² = 0,  k = 1, . . . , n_g, (5.24)
where each s_k is a new slack variable.
Differentiating the Lagrangian with respect to the design variables yields
∇_x ℒ = 0 ⇒ ∂ℒ/∂x_i = ∂f/∂x_i + Σ_{j=1}^{n_h} λ_j ∂h_j/∂x_i + Σ_{k=1}^{n_g} σ_k ∂g_k/∂x_i = 0,  i = 1, . . . , n_x. (5.26)
This criterion is the same as before but with additional Lagrange multipliers and constraints. Taking the derivatives with respect to the
Lagrange multipliers, we obtain
∇_λ ℒ = 0 ⇒ ∂ℒ/∂λ_j = h_j = 0,  j = 1, . . . , n_h, (5.27)
which enforces the equality constraints. Differentiating with respect to the multipliers σ_k gives
∇_σ ℒ = 0 ⇒ ∂ℒ/∂σ_k = g_k + s_k² = 0,  k = 1, . . . , n_g, (5.28)
which enforces the inequality constraints. Finally, differentiating the
Lagrangian with respect to the slack variables, we obtain
𝜕ℒ
∇𝑠 ℒ = 0 ⇒ = 2𝜎 𝑘 𝑠 𝑘 = 0, 𝑘 = 1, . . . , 𝑛 𝑔 , (5.29)
𝜕𝑠 𝑘
which is called the complementarity condition. This condition helps us to
distinguish the active constraints from the inactive ones. For each in-
equality constraint, either the Lagrange multiplier is zero (which means
that the constraint is inactive), or the slack variable is zero (which means
that the constraint is active). Unfortunately, this condition introduces a
combinatorial problem whose complexity grows exponentially with
the number of inequality constraints, since the number of combinations
of active versus inactive constraints is 2^{n_g}.
These requirements are called the Karush–Kuhn–Tucker (KKT)
conditions and are summarized below:
∂f/∂x_i + Σ_{j=1}^{n_h} λ_j ∂h_j/∂x_i + Σ_{k=1}^{n_g} σ_k ∂g_k/∂x_i = 0,   i = 1, . . . , n_x
h_j = 0,   j = 1, . . . , n_h
g_k + s_k² = 0,   k = 1, . . . , n_g        (5.30)
σ_k s_k = 0,   k = 1, . . . , n_g
σ_k ≥ 0,   k = 1, . . . , n_g
The last addition, that the Lagrange multipliers associated with the
inequality constraints must be nonnegative, was implicit in the way we
defined the Lagrangian but is now made explicit. As shown in Fig. 5.7,
the Lagrange multiplier for an inequality constraint must be positive
otherwise there is a direction that is feasible and would decrease the
objective function.
The equality and inequality constraints are often lumped together
for convenience, since the expression for the Lagrangian follows the
same form for both cases. As in the equality constrained case, these conditions are necessary but not sufficient; the second-order conditions involve the curvature along the feasible directions, which satisfy
∇h_j(x)ᵀ p = 0,  j = 1, . . . , n_h,
∇g_i(x)ᵀ p = 0,  for all i in the active set. (5.31)
Consider a variation of the simple problem (Ex. 5.4), where the equality is
replaced by an inequality as follows:
[Figure 5.8: Inequality problem with a linear objective and a feasible space within an ellipse.]
ℒ(x1, x2, σ, s) = x1 + 2x2 + σ(¼ x1² + x2² − 1 + s²). (5.33)
Differentiating this with respect to all the variables to get the first-order optimality conditions,
∂ℒ/∂x1 = 1 + ½ σ x1 = 0,
∂ℒ/∂x2 = 2 + 2 σ x2 = 0,
∂ℒ/∂σ = ¼ x1² + x2² − 1 + s² = 0,        (5.34)
∂ℒ/∂s = 2 σ s = 0.
Starting with the last equation, there are two possibilities: either s = 0 (meaning the constraint is active) or σ = 0 (meaning the constraint is inactive). However, we can see that setting σ = 0 in either of the first two equations does not yield a solution. Assuming that s = 0 and σ ≠ 0, we can solve the equations to obtain the same critical points as in the equality constrained case of Ex. 5.4.
However, now the sign of the Lagrange multiplier has meaning. According to the KKT conditions, the Lagrange multiplier has to be nonnegative. Only x_A satisfies this condition, and therefore there is no descent direction that is feasible, as shown in Fig. 5.9. The Hessian of the Lagrangian at this point is the same as in Ex. 5.4, where we showed it is positive definite. Therefore, x_A is a minimum. Unlike the equality-constrained problem, we did not need to check the Hessian of point x_B because the Lagrange multiplier is negative, and as a consequence there are feasible descent directions, as shown in Fig. 5.10.
Now consider adding a second inequality constraint,
g2(x2) = −x2 ≤ 0.
The feasible region is now the top half of the ellipse, as shown in Fig. 5.11.
[Figure 5.10: At this critical point, the Lagrange multiplier is negative and all descent directions are feasible, so this point is not a minimum.]
[Figure 5.11: Only one point satisfies the first-order KKT conditions.]
While these examples can be solved analytically, they are the ex-
ception rather than the rule. The KKT conditions quickly become
challenging to solve analytically (try solving Ex. 5.1). Furthermore,
engineering problems usually involve functions that are defined by
models with implicit equations, which are impossible to solve ana-
lytically. The reason we include these analytic examples is to better
understand the KKT conditions. For the rest of the chapter, we focus
on numerical methods, which are necessary for the vast majority of
practical problems.
The quadratic penalty function for equality constrained problems takes the form
F(x; μ) = f(x) + (μ/2) Σ_i h_i(x)². (5.41)
Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
x*_k ← minimize F(x; μ_k) with respect to x, starting from x_k
μ_{k+1} = ρ μ_k        Increase penalty parameter*
x_{k+1} = x*_k         Update starting point for next optimization
k = k + 1
end while
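A minimal Python sketch of the exterior quadratic penalty loop follows, using SciPy's unconstrained minimizer for each subproblem. The number of outer iterations and the initial penalty value are assumptions; the example constraint reproduces Eq. 5.43.

import numpy as np
from scipy.optimize import minimize

def penalty_method(f, h, x0, mu0=1.0, rho=2.0, n_outer=10):
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(n_outer):
        F = lambda x, mu=mu: f(x) + 0.5 * mu * np.sum(h(x)**2)   # penalized function, Eq. 5.41
        x = minimize(F, x).x            # unconstrained subproblem
        mu *= rho                       # increase the penalty parameter
    return x

f = lambda x: x[0] + 2 * x[1]
h = lambda x: np.array([x[0]**2 / 4 + x[1]**2 - 1])
print(penalty_method(f, h, x0=[1.0, 1.0]))   # approaches (-sqrt(2), -sqrt(2)/2)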
Consider the equality constrained problem from Ex. 5.4. The penalized function for that case is
F(x; μ) = x1 + 2x2 + (μ/2) (¼ x1² + x2² − 1)². (5.43)
This function is shown in Fig. 5.15 for different values of the penalty parameter
𝜇. The penalty is active for all points that are infeasible, but the minimum of
the penalized function does not coincide with the constrained minimum of
∗ 𝜌 may range from a conservative value (1.2) to an aggressive value (10), depending
on the problem.
the original problem. The penalty parameter needs to be increased for the
minimum of the penalized function to approach the correct solution, but this
results in a highly nonlinear function.
[Figure 5.15: Quadratic penalty function for the equality constrained problem, shown for three increasing values of the penalty parameter μ.]
For inequality constraints, the penalty should be nonzero only when the inequality constraint is violated (i.e., when g_j(x) > 0). This behavior can be achieved by defining a new penalty function as
F(x; μ) = f(x) + (μ/2) Σ_j max[0, g_j(x)]². (5.44)
[Figure 5.16: Quadratic penalty for an inequality constrained 1-D problem. The minimum of the penalized function approaches the constrained minimum from the infeasible side.]
Consider the inequality constrained problem from Ex. 5.5. The penalized function for that case is
F(x; μ) = x1 + 2x2 + (μ/2) max[0, ¼ x1² + x2² − 1]². (5.45)
This function is shown in Fig. 5.17 for different values of the penalty parameter μ. The contours of the feasible region inside the ellipse coincide with the original function contours, but outside the feasible region, the contours change to create a function whose minimum approaches the exact constrained minimum as the penalty parameter is increased.
[Figure 5.17: Quadratic penalty for the inequality constrained problem, shown for three increasing values of the penalty parameter μ.]
Augmented Lagrangian
As explained above, the quadratic penalty method requires a large
value of 𝜇 for constraint satisfaction, but the large 𝜇 degrades the
numerical conditioning. The augmented Lagrangian method alleviates
this dilemma by adding the quadratic penalty to the Lagrangian instead of the objective function:
F(x, λ; μ) = f(x) + Σ_{j=1}^{n_h} λ_j h_j(x) + (μ/2) Σ_{j=1}^{n_h} h_j(x)², (5.46)
Differentiating the augmented Lagrangian with respect to the design variables gives
∇_x F(x, λ; μ) = ∇f(x) + Σ_{j=1}^{n_h} [λ_j + μ h_j(x)] ∇h_j = 0. (5.47)
Comparing this with the stationarity condition of the Lagrangian at the true optimum,
∇_x ℒ(x*, λ*) = ∇f(x*) + Σ_{j=1}^{n_h} λ*_j ∇h_j(x*) = 0, (5.48)
suggests that the Lagrange multipliers can be estimated and updated using
λ*_j ≈ λ_j + μ h_j. (5.49)
Algorithm 5.11: Augmented Lagrangian method
Inputs:
x0: Starting point
λ0 = 0: Initial Lagrange multipliers
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
Consider the inequality constrained problem from Ex. 5.5. Assuming the inequality constraint is active, the augmented Lagrangian (Eq. 5.46) for that problem is
F(x; μ) = x1 + 2x2 + σ(¼ x1² + x2² − 1) + (μ/2)(¼ x1² + x2² − 1)². (5.53)
Applying Alg. 5.11, starting with 𝜇 = 0.5 and using 𝜌 = 1.1, we get the iterations
shown in Fig. 5.18. Compared to the quadratic penalty in Ex. 5.9, the penalized
function is much better conditioned, thanks to the term associated with the
Lagrange multiplier. The minimum of the penalized function eventually
becomes the minimum of the constrained problem without the need for a large
penalty parameter.
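The augmented Lagrangian loop for equality constraints can be sketched in a few lines following Eqs. 5.46 and 5.49. The values μ = 0.5 and ρ = 1.1 come from this example; the use of SciPy's unconstrained minimizer and the number of outer iterations are assumptions.

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, h, x0, mu=0.5, rho=1.1, n_outer=40):
    x = np.asarray(x0, dtype=float)
    lam = np.zeros(h(x).size)
    for _ in range(n_outer):
        F = lambda x, lam=lam, mu=mu: f(x) + lam @ h(x) + 0.5 * mu * np.sum(h(x)**2)
        x = minimize(F, x).x           # unconstrained subproblem (Eq. 5.46)
        lam = lam + mu * h(x)          # multiplier update, Eq. 5.49
        mu *= rho                      # modest penalty increase
    return x, lam

f = lambda x: x[0] + 2 * x[1]
h = lambda x: np.array([x[0]**2 / 4 + x[1]**2 - 1])   # treating the active constraint as an equality
print(augmented_lagrangian(f, h, x0=[1.0, 1.0]))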
[Figure 5.18: Augmented Lagrangian iterations.]
A common interior penalty is the logarithmic barrier function,
F(x; μ) = f(x) − μ Σ_{j=1}^{n_g} log(−g_j(x)). (5.56)
[Figure 5.19: Two different interior barrier functions.]
Like exterior methods, interior methods must also solve a sequence of unconstrained problems, but with μ → 0 (see Alg. 5.13).
Algorithm 5.13: Interior penalty method
Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 < 1: Penalty decrease factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝐹(𝑥 𝑘 ; 𝜇 𝑘 ) with respect to 𝑥 𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Decrease penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
[Figure: Interior versus exterior penalties for a 1-D constrained problem.]
Consider the inequality constrained problem from Ex. 5.5. The penalized function for that case using the logarithmic penalty (Eq. 5.56) is
F(x; μ) = x1 + 2x2 − μ log(−(¼ x1² + x2² − 1)). (5.57)
This function is shown in Fig. 5.22 for different values of the penalty parameter μ. The penalized function is defined only in the feasible space, so we do not plot its contours outside the ellipse.
[Figure 5.22: Logarithmic penalty for one inequality constraint. The minimum of the penalized function approaches the constrained minimum from the feasible side.]
∇𝑥 ℒ = 𝑄𝑥 + 𝑓 + 𝐴𝑇 𝜆 = 0
(5.66)
∇𝜆 ℒ = 𝐴𝑥 + 𝑏 = 0
minimize   ½ sᵀ ∇²_xx ℒ s + ∇_x ℒᵀ s
by varying s                                   (5.68)
subject to [∇h] s + h = 0
Expanding the Lagrangian gradient in the objective, we can write it as
½ sᵀ ∇²_xx ℒ s + ∇fᵀ s + λᵀ [∇h] s. (5.69)
Then, we substitute the constraint [∇h] s = −h into the objective:
½ sᵀ ∇²_xx ℒ s + ∇fᵀ s − λᵀ h. (5.70)
Now, we can remove the last term in the objective, because it does not
depend on the design variables (𝑠) resulting in the following equivalent
problem:
minimize   ½ sᵀ ∇²_xx ℒ s + ∇fᵀ s
by varying s                                   (5.71)
subject to [∇h] s + h = 0
Using the QP solution method outlined above, we obtain the following system of linear equations:
[ ∇²_xx ℒ   [∇h]ᵀ ] [ s_x     ]   [ −∇f ]
[ ∇h         0    ] [ λ_{k+1} ] = [ −h  ]        (5.72)
Subtracting the second term on both sides yields the same set of equations we found from applying Newton's method to the KKT conditions:
[ ∇²_xx ℒ   [∇h]ᵀ ] [ s_x ]   [ −∇_x ℒ ]
[ ∇h         0    ] [ s_λ ] = [ −h     ]         (5.74)
The derivation based on solving the KKT conditions is more fundamental. This alternative derivation relies on the somewhat arbitrary choices (made in hindsight) of using a QP as the subproblem and approximating the Lagrangian with constraints, rather than approximating the objective with constraints or approximating the Lagrangian with no constraints. Nevertheless, it is a useful conceptual model to consider the method as sequentially creating and solving QPs.
max(𝜎) ≤ 𝜎 𝑦 (5.75)
𝜎𝑖 ≤ 𝜎 𝑦 for 𝑖 = 1 . . . 𝑚 (5.76)
While this adds many more constraints, if an active set method is used, there
is little cost to adding more constraints as most of them will be inactive.
Alternatively, a constraint aggregation method (Section 5.7) could be used.
To have finer control over the convergence, we can use two separate tolerances: one for the norm of the optimality and one for the feasibility.
Inputs:
𝑥 0 : Starting point
𝜏opt : Optimality tolerance
𝜏feas : Feasibility tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝜙(𝛼, 𝜇) = 𝑓 (𝑥 𝑘 + 𝛼𝑝 𝑥 ) + 𝜇k ℎ(𝑥 𝑘 + 𝛼𝑝 𝑥 )k
The step in the design variable space, 𝑠 𝑘 , is the step that resulted from
the latest line search. The Lagrange multiplier is fixed to the latest value
when approximating the curvature of the Lagrangian because we only
need the curvature in the space of the design variables.
Recall that for the QP problem to have a solution, W_k must be positive definite. To ensure that W_k is always positive definite, a damped BFGS update formula was devised.²⁰ This method replaces the vector y with a modified vector r in the update.
20. Powell, Algorithms for nonlinear constraints that use Lagrangian functions. 1978
We now solve Ex. 5.4 using the SQP method (Alg. 5.16). The gradient of the equality constraint is
∇h = [ ½ x1, 2 x2 ]ᵀ = [ 1, 2 ]ᵀ,
and differentiating the Lagrangian with respect to x yields
∂ℒ/∂x = [ 1 + ½ λ x1, 2 + 2 λ x2 ]ᵀ = [ 1, 2 ]ᵀ.
[Figure 5.25: SQP algorithm iterations.]
The interior-point method applies the logarithmic barrier we have already seen. We formulate the constrained optimization problem as
minimize   f(x) − μ Σ_k ln s_k
by varying x, s                                        (5.81)
subject to h_j(x) = 0,  j = 1, . . . , n_h
           g_k(x) + s_k = 0,  k = 1, . . . , n_g
Interior-point methods tend to be more sensitive to the initial starting point and the scaling of the problem.⁵⁸ These are, of course, only generalities and are not replacements for testing multiple algorithms on the problem of interest. Many optimization frameworks make it easy to switch between optimization algorithms, facilitating this type of testing.
58. Nocedal et al., Numerical Optimization. 2006
For the first iteration, differentiating the Lagrangian with respect to x yields
∂ℒ/∂x = [ 1 + ½ σ x1, 2 + 2 σ x2 ]ᵀ = [ 1, 2 ]ᵀ.
F(x; μ) = f(x) + λ(x)ᵀ h(x) + ½ μ ‖c_vio(x)‖₂², (5.89)
where c_vio are the constraint violations, defined as
c_vio,i(x) = |h_i(x)| for equality constraints, and max(0, g_i(x)) for inequality constraints. (5.90)
There are three possible outcomes. Let us consider three example points that illustrate these three outcomes.
1. (1, 4): this point is not dominated by any point in the filter, so the step is accepted and the point is added to the filter.
The outline in this section presents only the basic ideas. A robust implementation of a filter method requires imposing sufficient decrease conditions, not unlike those in the unconstrained case, as well as a few other minor modifications.⁷³
73. Fletcher et al., A Brief History of Filter Methods. 2006
[Figure 5.29: Numerical solution of Ex. 5.1: (a) sequential quadratic programming, (b) interior point method.]
[Figure: Spring system constrained by two cables.]
[Figure 5.31: Optimization of the constrained spring system: (a) sequential quadratic programming, (b) interior point method.]
KS(x) = (1/ρ) ln( Σ_{j=1}^{m} e^{ρ g_j(x)} ), (5.92)
where ρ is an aggregation parameter, similar to the penalty parameter used in penalty methods.
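When the exponents ρ g_j are large, evaluating Eq. 5.92 directly can overflow. Factoring out the largest constraint value before exponentiating is algebraically equivalent and numerically stable; the sketch below uses that shifted form.

import numpy as np

def ks(g, rho=100.0):
    g = np.asarray(g, dtype=float)
    gmax = g.max()
    return gmax + np.log(np.sum(np.exp(rho * (g - gmax)))) / rho   # equivalent to Eq. 5.92

print(ks([-0.5, -0.1, 0.02], rho=50.0))   # slightly above max(g) = 0.02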
Consider the constrained spring system from Ex. 5.24. Aggregating the two constraints using the KS function, we can formulate a single constraint, KS(g1(x), g2(x)) ≤ 0, where
g1(x1, x2) = √( (x1 + x_c1)² + (x2 + y_c)² ) − ℓ_c1,
g2(x1, x2) = √( (x1 − x_c2)² + (x2 + y_c)² ) − ℓ_c2. (5.94)
We plot the contour of KS = 0 in Fig. 5.32 for increasing values of the aggregation parameter ρ. We can see the difference in the feasible region for the lowest value of ρ, which results in a conservative optimum. For the highest value of ρ, the optimum obtained with constraint aggregation is graphically indistinguishable from the true optimum, and the objective function value approaches the true optimal value of −22.1358.
[Figure 5.32: Spring system constrained by two cables; feasible region of the aggregated constraint for increasing values of ρ.]
5.8 Summary
The first-order optimality conditions for constrained problems, together with the nonnegativity requirement on the Lagrange multipliers associated with the inequality constraints, are called the KKT conditions.
Problems
a) Reproduce the results from Ex. 5.17 (SQP) or Ex. 5.19 (interior
point).
b) Solve the problem from Prob. 5.3.
c) Solve the problem detailed in Prob. 5.11.
d) Compare the computational cost, precision, and robustness
of your optimizer with an existing software package.
5.11 Aircraft fuel tank. A jet aircraft needs to carry a streamlined fuel tank, modeled as an ellipsoid of length ℓ and diameter d. We want to minimize the drag of the tank subject to carrying a required fuel volume:
minimize   D(ℓ, d)
by varying ℓ, d
subject to V_req − V(ℓ, d) ≤ 0.
The drag is
D = ½ ρ v² C_D S,
where the air density is ρ = 0.55 kg/m³ and the aircraft speed is v = 300 m/s. The drag coefficient of an ellipsoid can be estimated as‖
C_D = C_f [ 3 (ℓ/d) + 4.5 (d/ℓ)^{1/2} + 21 (d/ℓ)² ].
‖ Hoerner⁷⁶ provides this approximation on page 6-18.
76. Hoerner, Fluid-Dynamic Drag. 1965
5.12 Solve a variation of Ex. 5.24 where we replace the system of cables with a cable and a rod that resists both tension and compression. The cable is positioned above the spring as shown in Fig. 5.36, where x_c = 2 m and y_c = 3 m, with a maximum length of ℓ_c = 7.0 m. The rod is positioned at x_r = 2 m and y_r = 4 m, with a length of ℓ_r = 4.5 m. How does this change the formulation of the optimization problem?
[Figure 5.36: Spring system constrained by a cable and a rod.]
5.14 Solve the same three-bar truss optimization problem of Prob. 5.13
by aggregating all the constraints into a single constraint. Try
different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.13.
5.15 Solve the ten-bar truss problem described in Appendix C.2.2, formulated as
minimize $\sum_{i=1}^{10} \rho A_i \ell_i$
by varying $A_i,\; i = 1, \ldots, 10$ (cross-sectional areas)
subject to $A_i \ge A_{\min}$ (minimum area)
$|\sigma_i| \le \sigma_y$ for $i = 1, \ldots, 10$ (yield stress constraints)
5.16 Solve the same ten-bar truss optimization problem of Prob. 5.15 by aggregating all the constraints into a single constraint. Try different aggregation parameters and see how close you can get to the solution you obtained for Prob. 5.15.
Computing Derivatives
6
Derivatives play a central role in many numerical algorithms. In the
context of optimization, we are interested in computing derivatives for
the gradient-based optimization methods introduced in the previous
chapter. The accuracy and computational cost of the derivatives are
critical for the success of these optimization methods. In this chapter, we
introduce the various methods for computing derivatives and discuss
the relative advantages of each method.
$$J_{ij} = \frac{\partial f_i}{\partial x_j}. \qquad (6.3)$$
Consider the following function with two inputs and two outputs:
$$f(x) = \begin{bmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{bmatrix} = \begin{bmatrix} x_1 x_2 + \sin x_1 \\ x_1 x_2 + x_2^2 \end{bmatrix}. \qquad (6.4)$$
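As a reference for the approximation methods discussed later in this chapter, here is a small sketch of this function and its analytic Jacobian; the derivatives follow directly from Eq. 6.4.

```python
import numpy as np

def f(x):
    """Two-input, two-output test function from Eq. 6.4."""
    x1, x2 = x
    return np.array([x1 * x2 + np.sin(x1),
                     x1 * x2 + x2 ** 2])

def jacobian_exact(x):
    """Analytic Jacobian of Eq. 6.4, J[i, j] = df_i/dx_j (Eq. 6.3)."""
    x1, x2 = x
    return np.array([[x2 + np.cos(x1), x1],
                     [x2,              x1 + 2 * x2]])

print(jacobian_exact(np.array([np.pi / 4, 2.0])))
```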
[Figure: three views of a model: an explicit function f(x); an implicit model with residuals R(x, u) = 0 and outputs f(x, u); and a sequence of elementary operations v_1 = x, v_2 = V_2(v_1), v_3 = V_3(v_1, v_2), …, f = V(v_1, …).]
𝑓 = sin(𝑥 + 𝑓 ) (6.7)
dfdx =
cos(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin (2*x))))))))))*( cos(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin (2*x)))))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin(x + sin (2*x))))))))*( cos(x + sin(x + sin(x + sin
(x + sin(x + sin(x + sin (2*x)))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin (2*x))))))*( cos(x + sin(x + sin(x + sin(x + sin (2*x)))))
*( cos(x + sin(x + sin(x + sin (2*x))))*( cos(x + sin(x + sin (2*x)))*(
cos(x + sin (2*x))*(2* cos (2*x) + 1) + 1) + 1) + 1) + 1) + 1) + 1) +
1) + 1)
$$f(x + h\hat{e}_j) = f(x) + h\frac{\partial f}{\partial x_j} + \frac{h^2}{2!}\frac{\partial^2 f}{\partial x_j^2} + \frac{h^3}{3!}\frac{\partial^3 f}{\partial x_j^3} + \ldots, \qquad (6.8)$$
where $\hat{e}_j$ is the unit vector in the $j$th direction. Solving the above for the first derivative, we obtain the finite-difference formula,
$$\frac{\partial f}{\partial x_j} = \frac{f(x + h\hat{e}_j) - f(x)}{h} + \mathcal{O}(h), \qquad (6.9)$$
$$\frac{\partial f}{\partial x_j} = \lim_{h \to 0}\frac{f(x + h\hat{e}_j) - f(x)}{h} \approx \frac{f(x + h\hat{e}_j) - f(x)}{h}. \qquad (6.10)$$
Assuming each function evaluation yields the full vector $f$, the above formulas compute the $j$th column of the Jacobian (6.2). To compute the full Jacobian, we need to loop through each direction $\hat{e}_j$, add a step, recompute $f$, and compute a finite difference. Hence, the cost of computing the complete Jacobian is proportional to the number of input variables of interest, $n_x$.
[Figure 6.3: Exact derivative compared to a forward-difference finite-difference approximation.]
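The loop described above can be sketched in a few lines. This is a minimal forward-difference implementation of Eq. 6.9, not tied to any particular library; the test function is the one from Ex. 6.1.

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    """Forward-difference Jacobian (Eq. 6.9), one column per input x_j."""
    x = np.asarray(x, dtype=float)
    f0 = f(x)                        # baseline evaluation
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):          # loop over the n_x directions e_j
        xp = x.copy()
        xp[j] += h                   # step in the j-th direction
        J[:, j] = (f(xp) - f0) / h   # j-th column of the Jacobian
    return J

def f(x):
    return np.array([x[0] * x[1] + np.sin(x[0]),
                     x[0] * x[1] + x[1] ** 2])

print(jacobian_fd(f, [np.pi / 4, 2.0]))
```

The total cost is n_x + 1 function evaluations, consistent with the scaling noted above.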
For a second-order estimate of the first derivative, we can use the
expansion of 𝑓 (𝑥 − ℎ 𝑒ˆ 𝑗 ) to obtain,
$$f(x - h\hat{e}_j) = f(x) - h\frac{\partial f}{\partial x_j} + \frac{h^2}{2!}\frac{\partial^2 f}{\partial x_j^2} - \frac{h^3}{3!}\frac{\partial^3 f}{\partial x_j^3} + \ldots. \qquad (6.12)$$
Then, if we subtract this from the expansion (6.8) and solve the resulting
equation for the derivative of 𝑓 , we get the central-difference formula,
$$\frac{\partial f}{\partial x_j} = \frac{f(x + h\hat{e}_j) - f(x - h\hat{e}_j)}{2h} + \mathcal{O}(h^2). \qquad (6.13)$$
$$\frac{\partial^2 f}{\partial x_j^2} = \frac{\left.\dfrac{\partial f}{\partial x_j}\right|_{x + h\hat{e}_j} - \left.\dfrac{\partial f}{\partial x_j}\right|_{x - h\hat{e}_j}}{2h} + \mathcal{O}(h^2). \qquad (6.14)$$
Then we can use a central difference again to estimate both $f'(x + h)$ and $f'(x - h)$ in the above equation to obtain
$$\frac{\partial^2 f}{\partial x_j^2} = \frac{f(x + 2h\hat{e}_j) - 2f(x) + f(x - 2h\hat{e}_j)}{4h^2} + \mathcal{O}(h^2). \qquad (6.15)$$
$$D_p f = \frac{f(x + hp) - f(x)}{h} + \mathcal{O}(h). \qquad (6.16)$$
One application of directional derivatives is to compute the slope in
line searches (Section 4.3).
f(x + h)   +1.234567890123431
f(x)       +1.234567890123456
Δf         −0.000000000000025
Table 6.1: Subtractive cancellation leads to a loss of precision and ultimately inaccurate finite-difference estimates.
The best step size depends on the location in the design space. Therefore, repeating this study for other values of x might be required.
Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Function of interest
Outputs:
𝐽: Jacobian of 𝑓 about point 𝑥
However, if the result does not match, this directional derivative does
not tell you which components of the gradient are incorrect.
$$f(x + ih\hat{e}_j) = f(x) + ih\frac{\partial f}{\partial x_j} - \frac{h^2}{2}\frac{\partial^2 f}{\partial x_j^2} - i\frac{h^3}{6}\frac{\partial^3 f}{\partial x_j^3} + \ldots \qquad (6.17)$$
$$\operatorname{Im}\left[f(x + ih\hat{e}_j)\right] = h\frac{\partial f}{\partial x_j} - \frac{h^3}{6}\frac{\partial^3 f}{\partial x_j^3} + \ldots \qquad (6.18)$$
Dividing Eq. 6.18 by $h$ gives the complex-step derivative approximation,
$$\frac{\partial f}{\partial x_j} \approx \frac{\operatorname{Im}\left[f(x + ih\hat{e}_j)\right]}{h}. \qquad (6.19)$$
Because $f(x)$ is assumed to be a real function of a real variable, this most easily works with models that do not already involve complex numbers, but the procedure can be extended to work with functions that are already complex (multicomplex step),⁷⁹ or to provide exact second derivatives.⁸⁰ In Tip 6.10 we explain how to convert programs to handle the required complex arithmetic for the complex-step method to work in general.
79. Lantoine et al., Using Multicomplex Variables for Automatic Computation of High-Order Derivatives. 2012
80. Fike et al., The Development of Hyper-Dual Numbers for Exact Second-Derivative Calculations. 2011
Unlike finite differences, this formula has no subtraction operation and thus no subtractive cancellation error. The only source of numerical
and thus no subtractive cancellation error. The only source of numerical
error is the truncation error. However, if ℎ is decreased to a small
enough value (say, $10^{-40}$), the truncation error can be eliminated. Then,
the precision of the complex-step derivative approximation (6.19) will
match the precision of 𝑓 . This is a tremendous advantage over the
finite-difference approximations (6.9) and (6.13).
Like the finite-difference approach, each evaluation yields a column
of the Jacobian (𝜕 𝑓 /𝜕𝑥 𝑗 ), and the cost of computing all the derivatives
is proportional to the number of design variables. The cost of the
complex-step method is more comparable to that of a central difference
as opposed to a forward difference because we must now compute a
real and an imaginary part for every number in our program.
If we take the real part of the Taylor series expansion (6.17), we
obtain the value of the function on the real axis,
$$f(x) = \operatorname{Re}\left[f(x + ih\hat{e}_j)\right] + \mathcal{O}(h^2). \qquad (6.20)$$
$$\frac{h^2}{2}\left|\frac{\partial^2 f}{\partial x_j^2}\right| < \epsilon \left|f(x)\right|, \qquad (6.21)$$
$$\frac{h^2}{6}\left|\frac{\partial^3 f}{\partial x_j^3}\right| < \epsilon \left|f'(x)\right|. \qquad (6.22)$$
Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Function of interest
Outputs:
𝐽: Jacobian of 𝑓 about point 𝑥
To show how the complex-step method works, consider the following analytic function:
$$f(x) = \frac{e^x}{\sqrt{\sin^3 x + \cos^3 x}}. \qquad (6.23)$$
The exact derivative at $x = 1.5$ is computed to 16 digits based on symbolic differentiation as a reference value. The errors relative to this reference for the complex-step derivative approximation and the forward and central finite-difference approximations are computed as
$$\epsilon = \frac{\left|\left.\dfrac{\mathrm{d}f}{\mathrm{d}x}\right|_{\text{approx}} - \left.\dfrac{\mathrm{d}f}{\mathrm{d}x}\right|_{\text{exact}}\right|}{\left|\left.\dfrac{\mathrm{d}f}{\mathrm{d}x}\right|_{\text{exact}}\right|}. \qquad (6.24)$$
[Figure: Relative error of the derivative approximations as the step size decreases. Finite-difference approximations initially converge as the truncation error decreases, but when the step is too small, the subtractive cancellation errors become overwhelming. The complex-step approximation does not suffer from this issue. Note that the x-axis is oriented so that smaller step sizes are to the right.]
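The comparison in this example can be sketched as follows, assuming the implementation of Eq. 6.23 accepts complex arguments (NumPy's elementary functions do); the step sizes below are illustrative.

```python
import numpy as np

def f(x):
    # works for real and complex x (Eq. 6.23)
    return np.exp(x) / np.sqrt(np.sin(x) ** 3 + np.cos(x) ** 3)

x = 1.5
h = 1e-200                                   # the complex step can be tiny: no subtraction involved
d_cs = np.imag(f(x + 1j * h)) / h            # complex-step estimate (Eq. 6.19)
h = 1e-6
d_fd = (f(x + h) - f(x)) / h                 # forward difference (Eq. 6.9)
d_cd = (f(x + h) - f(x - h)) / (2 * h)       # central difference (Eq. 6.13)
print(d_cs, d_fd, d_cd)
```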
To apply the complex-step method to an existing code, we need to make sure that the code can handle complex arithmetic, that it handles logical operators consistently, and that certain functions yield the correct derivatives.
First, the program may need to be modified to use complex numbers. In
programming languages like Fortran or C, this involves changing real valued
type declarations (e.g., double) to complex type declarations (e.g., double
complex). In some languages, such as Matlab, this is not necessary because
functions are overloaded to accept either type automatically.
Second, some changes may be required to preserve the correct logical flow
through the program. Relational logic operators such as “greater than” and
“less than” or “if” and “else” are usually not defined for complex numbers. These
operators are often used in programs, together with conditional statements, to
redirect the execution thread. The original algorithm and its “complexified”
version should follow the same execution thread, and therefore, defining
these operators to compare only the real parts of the arguments is the correct
approach. Functions that choose one of their arguments, such as the maximum or the minimum, are based on relational operators. Following the previous argument, we should determine the maximum and minimum values based on the real parts alone.
Third, some functions need to be redefined for complex arguments. The
most common function that needs to be redefined is the absolute value function,
which for a complex number, 𝑧 = 𝑥 + 𝑖𝑦, is defined as
$$|z| = \sqrt{x^2 + y^2}. \qquad (6.25)$$
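As a sketch of the kind of redefinition meant here, one common choice is to base the sign on the real part alone, consistent with the relational-operator rule above, so that the imaginary part keeps carrying the derivative information.

```python
def cabs(z):
    """Complex-step-safe absolute value (a common redefinition, shown here as an illustration).

    The standard modulus sqrt(x^2 + y^2) destroys the derivative carried in the
    imaginary part; using the sign of the real part alone preserves it.
    """
    return -z if z.real < 0 else z

# abs(x) has derivative -1 for x < 0; the imaginary part reflects that:
h = 1e-30
print(cabs(-3.0 + 1j * h).imag / h)   # -1.0
```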
$$\frac{\mathrm{d}v_i}{\mathrm{d}v_j} = \sum_{k=j}^{i-1} \frac{\partial V_i}{\partial v_k}\frac{\mathrm{d}v_k}{\mathrm{d}v_j}, \qquad (6.29)$$
$$\dot{v}_i \equiv \frac{\mathrm{d}v_i}{\mathrm{d}v_j}. \qquad (6.30)$$
Once we are done applying the chain rule (6.29) for the chosen input
variable, we end up with the corresponding full column of the Jacobian,
i.e., the tangent vector.
Suppose we have four variables: 𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 and 𝑥 = 𝑣1 , 𝑓 = 𝑣 4 ,
and we want d 𝑓 /d𝑥. Using the above formula we set 𝑗 = 1 (as we want
the derivative with respect to 𝑣1 = 𝑥) and increment in 𝑖 to get the
sequence of derivatives
$$\begin{aligned}
\dot{v}_1 &= 1 \\
\dot{v}_2 &= \frac{\partial V_2}{\partial v_1}\dot{v}_1 \\
\dot{v}_3 &= \frac{\partial V_3}{\partial v_1}\dot{v}_1 + \frac{\partial V_3}{\partial v_2}\dot{v}_2 \\
\dot{v}_4 &= \frac{\partial V_4}{\partial v_1}\dot{v}_1 + \frac{\partial V_4}{\partial v_2}\dot{v}_2 + \frac{\partial V_4}{\partial v_3}\dot{v}_3 = \frac{\mathrm{d}f}{\mathrm{d}x},
\end{aligned} \qquad (6.31)$$
The colored derivatives show how the values are reused. In each step
we just need to compute the partial derivatives of the current operation
𝑉𝑖 and then multiply using total derivatives that have already been
computed. We move forward evaluating partial derivatives of 𝑉 in the
same sequence that we evaluate the original function. This is convenient
because all of the unknowns are partial derivatives, meaning that we
only need to compute derivatives based on the operation at hand (or
line of code).
Using forward mode AD, obtaining derivatives with respect to
additional outputs is either free (e.g., d𝑣3 /d𝑣1 , 𝑣¤ 3 in Eq. 6.31) or
requires only one more line of computation (e.g., if we had an additional
output 𝑣5 ), and thus has a negligible additional cost for a large code.
However, if we want the derivatives with respect to additional inputs
(e.g., d𝑣4 /d𝑣2 ) we would need to evaluate an entire set of similar
calculations. Thus, the cost of the forward mode scales linearly with
the number of inputs and is practically independent of the number of
outputs.
Consider the function with two inputs and two outputs from Ex. 6.1. The
explicit expressions in this function could be evaluated using only two lines of
code. However, to make the AD process more apparent, we write the code such
that each line has a single unary or binary operation, which is how a computer
ends up evaluating the expression.
$$\begin{aligned}
v_1 &= V_1(v_1) = x_1 \\
v_2 &= V_2(v_2) = x_2 \\
v_3 &= V_3(v_1, v_2) = v_1 v_2 \\
v_4 &= V_4(v_1) = \sin v_1 \\
v_5 &= V_5(v_3, v_4) = v_3 + v_4 = f_1 \\
v_6 &= V_6(v_2) = v_2^2 \\
v_7 &= V_7(v_3, v_6) = v_3 + v_6 = f_2
\end{aligned} \qquad (6.32)$$
[Figure 6.5: Dependency graph for the numerical example evaluations (6.32).]
The operations above result in the dependency graph shown in Fig. 6.5.
Say we want to compute d 𝑓2 /d𝑥1 , which in our example is d𝑣7 /d𝑣1 . The
evaluation point is the same as in Ex. 6.1: 𝑥 = (𝜋/4, 2). Using the forward
mode, set the seed for the corresponding input, 𝑣¤ 1 to one, and the seed for the
other input to zero. Then we get the sequence,
$$\begin{aligned}
\dot{v}_1 &= 1 \\
\dot{v}_2 &= 0 \\
\dot{v}_3 &= \frac{\partial V_3}{\partial v_1}\dot{v}_1 + \frac{\partial V_3}{\partial v_2}\dot{v}_2 = v_2 \dot{v}_1 = 2 \\
\dot{v}_4 &= \frac{\partial V_4}{\partial v_1}\dot{v}_1 = \cos(v_1)\,\dot{v}_1 = 0.707\ldots \\
\dot{v}_5 &= \frac{\partial V_5}{\partial v_3}\dot{v}_3 + \frac{\partial V_5}{\partial v_4}\dot{v}_4 = \dot{v}_3 + \dot{v}_4 = 2.707\ldots = \frac{\partial f_1}{\partial x_1} \\
\dot{v}_6 &= \frac{\partial V_6}{\partial v_2}\dot{v}_2 = 2 v_2 \dot{v}_2 = 0 \\
\dot{v}_7 &= \frac{\partial V_7}{\partial v_3}\dot{v}_3 + \frac{\partial V_7}{\partial v_6}\dot{v}_6 = \dot{v}_3 + \dot{v}_6 = 2 = \frac{\partial f_2}{\partial x_1},
\end{aligned} \qquad (6.33)$$
where we have evaluated the partial derivatives numerically. We now have a
procedure (not a symbolic expression) for computing d 𝑓2 /d𝑥1 for any (𝑥 1 , 𝑥2 ).
While we set out to compute d 𝑓2 /d𝑥1 , we also obtained d 𝑓1 /d𝑥 1 as a
byproduct. For a given input, we can obtain the derivatives for all outputs for
essentially the same cost as one output. In contrast, if we want the derivative
with respect to the other input, d 𝑓1 /d𝑥 2 , a new sequence of calculations is
necessary. Because this example contains so few operations the difference is
small, but for a long program with many inputs, the difference will be large.
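The forward pass in Eq. 6.33 can also be written directly in code. The sketch below is a hand-written version of the seed propagation for the operations in Eq. 6.32, not a general AD tool.

```python
import numpy as np

def forward_ad(x1, x2, x1dot=1.0, x2dot=0.0):
    """Forward-mode pass through the operations of Eq. 6.32 with seed (x1dot, x2dot)."""
    v1, v1d = x1, x1dot
    v2, v2d = x2, x2dot
    v3, v3d = v1 * v2, v2 * v1d + v1 * v2d        # v3 = v1*v2
    v4, v4d = np.sin(v1), np.cos(v1) * v1d        # v4 = sin(v1)
    v5, v5d = v3 + v4, v3d + v4d                  # f1
    v6, v6d = v2 ** 2, 2 * v2 * v2d               # v6 = v2^2
    v7, v7d = v3 + v6, v3d + v6d                  # f2
    return (v5, v7), (v5d, v7d)

# Seeding x1 gives both df1/dx1 (~2.707) and df2/dx1 (= 2) in one pass
f, fdot = forward_ad(np.pi / 4, 2.0)
print(fdot)
```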
$$\begin{aligned}
\bar{v}_4 &= 1 \\
\bar{v}_3 &= \bar{v}_4 \frac{\partial V_4}{\partial v_3} \\
\bar{v}_2 &= \bar{v}_3 \frac{\partial V_3}{\partial v_2} + \bar{v}_4 \frac{\partial V_4}{\partial v_2} \\
\bar{v}_1 &= \bar{v}_2 \frac{\partial V_2}{\partial v_1} + \bar{v}_3 \frac{\partial V_3}{\partial v_1} + \bar{v}_4 \frac{\partial V_4}{\partial v_1} = \frac{\mathrm{d}f}{\mathrm{d}x},
\end{aligned} \qquad (6.36)$$
The partial derivatives for 𝑉 must be computed for 𝑉4 first, then 𝑉3 , and
so on. Therefore, we have to traverse the code in reverse. In practice,
not every variable depends on every other variable, so a dependency
graph is created during code evaluation. Then, when computing the
adjoint derivatives we traverse the dependency graph in reverse. As
before, the derivatives we need to compute in each line are only partial
derivatives.
To compute the same derivative using the reverse mode, we need to choose the output to be $f_2$ by setting the seed for the corresponding output, $\bar{v}_7$, to one. Then we get
$$\begin{aligned}
\bar{v}_7 &= 1 \\
\bar{v}_6 &= \bar{v}_7 \frac{\partial V_7}{\partial v_6} = \bar{v}_7 = 1 \\
\bar{v}_5 &= 0 \quad \text{(nothing depends on } v_5\text{)} \\
\bar{v}_4 &= \bar{v}_5 \frac{\partial V_5}{\partial v_4} = \bar{v}_5 = 0 \\
\bar{v}_3 &= \bar{v}_7 \frac{\partial V_7}{\partial v_3} + \bar{v}_5 \frac{\partial V_5}{\partial v_3} = \bar{v}_7 + \bar{v}_5 = 1 \\
\bar{v}_2 &= \bar{v}_6 \frac{\partial V_6}{\partial v_2} + \bar{v}_3 \frac{\partial V_3}{\partial v_2} = 2 v_2 \bar{v}_6 + v_1 \bar{v}_3 = 4.785\ldots = \frac{\partial f_2}{\partial x_2} \\
\bar{v}_1 &= \bar{v}_4 \frac{\partial V_4}{\partial v_1} + \bar{v}_3 \frac{\partial V_3}{\partial v_1} = (\cos v_1)\,\bar{v}_4 + v_2 \bar{v}_3 = 2 = \frac{\partial f_2}{\partial x_1}.
\end{aligned} \qquad (6.37)$$
While we set out to evaluate $\mathrm{d}f_2/\mathrm{d}x_1$, we also computed $\mathrm{d}f_2/\mathrm{d}x_2$ as a byproduct. For each output, the derivatives with respect to all the inputs come at the cost of evaluating only one more line of code. Conversely, if we want the derivatives of $f_1$, a whole new set of computations is needed.
In the reverse computation sequence above, we needed the values of some
of the variables in the original code to be precomputed and stored. In addition,
the reverse computation requires the dependency graph information (which,
for example, is how we would know that nothing depended on 𝑣5 ). In forward
mode, the computation of a given derivative, 𝑣¤ 𝑖 , requires the partial derivatives
of the line of code that computes 𝑣 𝑖 with respect to its inputs. In the reverse case,
however, to compute a given derivative, 𝑣¯ 𝑗 , we require the partial derivatives of
the functions that the current variable 𝑣 𝑗 affects with respect to 𝑣 𝑗 . Knowledge
$f = x$;  $\dot{f} = \dot{x}$    Initialize (set $\dot{x} = 1$ to get d𝑓/d𝑥)
for 𝑖 = 1 to 20 do
    $\dot{f} = (\dot{x} + \dot{f}) \cos(x + f)$    Differentiate before overwriting 𝑓
    $f = \sin(x + f)$
end for
return 𝑓, $\dot{f}$    d𝑓/d𝑥 is given by $\dot{f}$
The reverse mode AD version of the same function is shown below. We set
𝑓¯ = 1 to get the derivative of 𝑓 . Now we need two distinct loops: a forward one
that computes the original function, and a reverse one that accumulates the
derivatives in reverse starting from the last derivative in the chain. Because the
derivatives that are accumulated in the reverse loop depend on the intermediate
values of the variables, we need to store all the variables in the forward loop.
Here we do it via a stack, which is a data structure that stores a one-dimensional
array. Using the stack concept, we can only add an element to the top of the
stack (push) and take the element from the top of the stack (pop).
Input: 𝑥, $\bar{f}$    Set $\bar{f} = 1$ to get d𝑓/d𝑥
$f = x$
for 𝑖 = 1 to 20 do
    push(𝑓)    Put current value of 𝑓 on top of stack
    $f = \sin(x + f)$
end for
$\bar{x} = 0$
for 𝑖 = 20 to 1 do    Do the reverse loop
    𝑓 = pop()    Get value of 𝑓 from top of stack
    $\bar{x} = \bar{x} + \cos(x + f)\,\bar{f}$    Contribution of the explicit 𝑥 in $\sin(x + f)$
    $\bar{f} = \cos(x + f)\,\bar{f}$
end for
$\bar{x} = \bar{x} + \bar{f}$    𝑓 was initialized to 𝑥
return 𝑓, $\bar{x}$    d𝑓/d𝑥 is given by $\bar{x}$
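A runnable sketch of the listing above, with the adjoint accumulation written out explicitly and a complex-step check, might look as follows; the value x = 0.3 is an arbitrary test point.

```python
import numpy as np

def f_and_dfdx(x, n=20):
    """Reverse-mode AD of the loop f = sin(x + f), following the listing above."""
    # Forward pass: evaluate the function and store intermediate values on a stack.
    stack = []
    f = x
    for _ in range(n):
        stack.append(f)                  # push
        f = np.sin(x + f)
    # Reverse pass: accumulate derivatives from the last operation backward.
    fbar = 1.0                           # seed
    xbar = 0.0
    for _ in range(n):
        fold = stack.pop()               # value of f before this iteration's update
        xbar += np.cos(x + fold) * fbar  # explicit x in sin(x + f)
        fbar = np.cos(x + fold) * fbar   # propagate to the previous f
    xbar += fbar                         # f was initialized to x
    return f, xbar

def f_only(x, n=20):
    f = x
    for _ in range(n):
        f = np.sin(x + f)
    return f

x = 0.3
print(f_and_dfdx(x)[1], np.imag(f_only(x + 1e-30j)) / 1e-30)  # the two values should agree
```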
Tip 6.14: Operator overloading versus source code transformation.
To use the derived types and operator overloading approach to AD, we need a language that supports these features, such as C++, Fortran 90, Python, Julia, Matlab, etc. The derived types feature is used to replace all the real numbers in the code, $v$, with a dual number type that includes both the original real number and the corresponding derivative, i.e., $u = (v, \dot{v})$. Then, all operations are redefined (overloaded) such that, in addition to the result of the original operation, they yield the derivative of that operation. All these additional operations are performed "behind the scenes" without adding much code. Except for the variable declarations and setting the seed, the code remains exactly the same as the original.
There are AD tools available for most programming languages, including Fortran,⁸³ C/C++,⁸⁴ Python,⁸⁵ Julia,⁸⁶ and Matlab.⁸⁷ They have been extensively developed and provide the user with great functionality, including the calculation of higher-order derivatives, multivariable derivatives, and reverse mode options.†
† Although some AD tools can be applied recursively to yield higher-order derivatives, this approach is not typically efficient and is sometimes unstable.⁸⁸
83. Hascoët et al., TAPENADE 2.1 User's Guide. 2004
84. Griewank et al., Algorithm 755: ADOL-C: A Package for the Automatic Differentiation of Algorithms Written in C/C++. 1996
85. Wiltschko et al., Tangent: automatic differentiation using source code transformation in Python. 2017
86. Revels et al., Forward-Mode Automatic Differentiation in Julia. 2016
87. Neidinger, Introduction to Automatic Differentiation and MATLAB Object-Oriented Programming. 2010
88. Betancourt, A geometric theory of higher-order automatic differentiation. 2018
The operator overloading approach is much more elegant, since the original
code stays practically the same and can be maintained directly. The source
code transformation approach, on the other hand, enlarges the original code
and results in code that is less readable, making it hard to debug the new
extended code. Instead of maintaining source code transformed by AD, it is
advisable to work with the original source, and devise a workflow where the
parser is rerun before compiling a new version. The advantage of the source
code transformation is that it tends to yield faster code, and it is easier to see
what operations actually take place when debugging.
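As a minimal illustration of the derived-type idea discussed above, the sketch below overloads addition and multiplication for a dual number and defines a sin that propagates the derivative. A real AD tool overloads the full set of operations, but the principle is the same.

```python
import math

class Dual:
    """Minimal forward-mode dual number u = (v, vdot)."""
    def __init__(self, v, vdot=0.0):
        self.v, self.vdot = v, vdot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.v + other.v, self.vdot + other.vdot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.v * other.v, self.vdot * other.v + self.v * other.vdot)
    __rmul__ = __mul__

def sin(u):
    return Dual(math.sin(u.v), math.cos(u.v) * u.vdot) if isinstance(u, Dual) else math.sin(u)

# d/dx1 of f1 = x1*x2 + sin(x1) at (pi/4, 2): seed x1 with derivative 1
x1 = Dual(math.pi / 4, 1.0)
x2 = Dual(2.0, 0.0)
f1 = x1 * x2 + sin(x1)
print(f1.v, f1.vdot)   # derivative is x2 + cos(x1) ~ 2.707
```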
where these equations are solved for the 𝑛𝑢 state variables 𝑢 for a given
fixed set of design variables 𝑥. This means that 𝑢 is an implicit function
of 𝑥.
The functions of interest, $f$, can in general be written as explicit functions of the state variables and the design variables, i.e., $f = F(x, u)$.
Therefore, $f$ depends not only explicitly on the design variables, but also
implicitly through the governing equations (6.38). This dependency is
illustrated in Fig. 6.6.
$$\frac{\mathrm{d}r}{\mathrm{d}x} = \frac{\partial R}{\partial x} + \frac{\partial R}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}x} = 0, \qquad (6.41)$$
where d𝑟/d𝑥 and d𝑢/d𝑥 are both (𝑛𝑢 × 𝑛 𝑥 ) matrices, and the Jacobian,
𝜕𝑅/𝜕𝑢, is a square matrix of size (𝑛𝑢 × 𝑛𝑢 ).
We can visualize the requirement for the total derivative (6.41) to
be zero in Fig. 6.7, which is a simplified representation of the set of
points that satisfy the governing equations. In this case, it is just a line
that maps a scalar 𝑥 to a scalar 𝑢. In the general case, the governing
equations map 𝑥 ∈ R𝑛 𝑥 to 𝑢 ∈ R𝑛𝑢 and the set of points that satisfy the
governing equations is a manifold in 𝑥-𝑢 space. As a result, any change,
d𝑥, must be accompanied by the appropriate change, d𝑢, so that the
governing equations are still satisfied. If we look at small perturbations
about a feasible point and want to remain feasible, the variations of 𝑥
and 𝑢 are no longer independent, because the total derivative of the
governing equation residuals (6.41) with respect to 𝑥 must be zero.
As in the total derivative of the function of interest (6.40), the partial derivatives here do not take into account the solution of the governing equations and are therefore much more easily computed than the total derivatives. However, if we provide the two partial derivative terms in
Table 6.2: Cost comparison of computing sensitivities for direct and adjoint methods.
Step                            | Direct     | Adjoint
Partial derivative computation  | Same       | Same
Linear solution                 | n_x times  | n_f times
Matrix multiplications          | Same       | Same
$$\frac{\lambda}{m} + \cos\lambda = 0 \qquad (6.51)$$
Our goal is to compute the derivative d 𝑓 /d𝑚, but 𝜆 is an implicit function of 𝑚.
In other words, we cannot find an explicit expression for 𝜆 as a function of 𝑚,
substitute that expression into Eq. 6.50, and then differentiate normally.
Fortunately, the direct and adjoint methods will allow us to compute this
derivative.
$$\frac{\partial F}{\partial x} = \frac{\partial f}{\partial m} = 2\lambda m, \qquad \frac{\partial F}{\partial u} = \frac{\partial f}{\partial \lambda} = m^2,$$
$$\frac{\partial R}{\partial x} = \frac{\partial R}{\partial m} = -\frac{\lambda}{m^2}, \qquad \frac{\partial R}{\partial u} = \frac{\partial R}{\partial \lambda} = \frac{1}{m} - \sin\lambda. \qquad (6.53)$$
Because this is a problem with only one function of interest and one design variable, there is no distinction between the direct and adjoint methods (forward and reverse), and the matrix inverse is simply a division. Substituting these partial derivatives into the total derivative equation (6.43) yields
$$\frac{\mathrm{d}f}{\mathrm{d}m} = 2\lambda m + \frac{\lambda}{\dfrac{1}{m} - \sin\lambda}. \qquad (6.54)$$
Thus, we are able to obtain the desired derivative in spite of the implicitly
defined function. Here, it was possible to get an explicit expression for the total
derivative, but in general, it is only possible to get a numeric value.
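A sketch of this calculation in code: the state λ is obtained with a root finder, the derivative is evaluated from Eq. 6.54, and the result is checked against a finite difference of the fully solved problem. The value m = 4 and the bracket (π/2, π) are arbitrary choices for which a solution of Eq. 6.51 exists; they are not taken from the text.

```python
import numpy as np
from scipy.optimize import brentq

def solve_state(m):
    """Solve the residual R = lam/m + cos(lam) = 0 for the state lam (Eq. 6.51)."""
    return brentq(lambda lam: lam / m + np.cos(lam), np.pi / 2, np.pi)

def dfdm(m):
    """Total derivative from Eq. 6.54, with f = lam * m**2."""
    lam = solve_state(m)
    return 2 * lam * m + lam / (1.0 / m - np.sin(lam))

m = 4.0
h = 1e-6
f_fd = (solve_state(m + h) * (m + h) ** 2 - solve_state(m) * m ** 2) / h
print(dfdm(m), f_fd)   # the implicit-analytic derivative matches the finite difference
```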
Of all the possible rectangles that can be inscribed in the ellipse, we want the
rectangle with maximum area (spoiler alert: this is the solution for Prob. 5.5).
Then, the area of the rectangle is constrained as,
$$r_2(u_1, u_2) = 4 u_1 u_2 - 2 x_1 x_2 = 0. \qquad (6.56)$$
[Figure 6.8: Rectangle inscribed in ellipse.]
Suppose that our functions of interest are the rectangle perimeter and the
rectangle aspect ratio,
$$f_1 = 4(u_1 + u_2), \qquad f_2 = \frac{u_1}{u_2}. \qquad (6.57)$$
We want to find the derivatives of these functions of interest with respect to the
ellipse semi-major axes.
We use index notation for clarity because the equations involve derivatives of
matrices with respect to vectors.
First, we need to derive expressions to compute all of the partial derivatives.
The residual equations can be written as $\mathcal{R}_k(x, u) = K_{kj}(x)\, u_j - f_k = 0$,
where $K$ is the stiffness matrix and we assumed that the external forces are not
functions of the mesh node locations. The partial derivatives of these equations
are:
$$\frac{\partial \mathcal{R}_k}{\partial x_i} = \frac{\partial K_{kj}}{\partial x_i} u_j, \qquad \frac{\partial \mathcal{R}_k}{\partial u_j} = K_{kj}. \qquad (6.60)$$
The second equation is convenient because we already have the stiffness matrix.
The first equation involves a new term, which are the derivatives of the stiffness
matrix with respect to each of the design variables.
One of the functions of interest is the stress, which under the elastic assumption is a linear function of the deflections and can be expressed as $\sigma_n = S_{nj} u_j$.
We can solve this either using the direct or adjoint method. The direct method
would solve the linear system for 𝜙, for one input 𝑖 at a time, and then multiply
through in the second equation:
$$K_{kj}\,\phi_j = -\frac{\partial K_{kj}}{\partial x_i} u_j, \qquad \frac{\mathrm{d}\sigma_n}{\mathrm{d}x_i} = S_{nj}\,\phi_j. \qquad (6.64)$$
This approach is preferable if we have few design variables 𝑥 𝑖 and many stresses
𝜎𝑛 . Or, we can solve this with the adjoint approach. The adjoint method solves
the linear system for 𝜓, one output, 𝑛, at a time, and then multiplies through in
the second equation:
$$K_{kj}\,\psi_k = -S_{nj}, \qquad \frac{\mathrm{d}\sigma_n}{\mathrm{d}x_i} = \psi_k \frac{\partial K_{kj}}{\partial x_i} u_j. \qquad (6.65)$$
This approach is preferable when there are many design variables and few
stresses.
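The equivalence of the two approaches can be sketched on a small made-up linear model (not the structural problem above): the direct method performs one solve per design variable, the adjoint method one solve per output, and both yield the same derivatives.

```python
import numpy as np

# Illustrative linear "structure": K(x) u = f, stresses sigma = S u.
rng = np.random.default_rng(0)
n_u, n_x, n_s = 4, 3, 2
K0 = rng.random((n_u, n_u)) + n_u * np.eye(n_u)   # baseline stiffness (well conditioned)
dK = rng.random((n_x, n_u, n_u)) * 0.1            # dK/dx_i, assumed constant here
f = rng.random(n_u)
S = rng.random((n_s, n_u))

def K(x):
    return K0 + np.einsum('i,ijk->jk', x, dK)

x = rng.random(n_x)
u = np.linalg.solve(K(x), f)

# Direct method: one linear solve per design variable x_i (Eq. 6.64)
dsig_direct = np.zeros((n_s, n_x))
for i in range(n_x):
    phi = np.linalg.solve(K(x), -dK[i] @ u)
    dsig_direct[:, i] = S @ phi

# Adjoint method: one linear solve per output sigma_n (Eq. 6.65)
dsig_adjoint = np.zeros((n_s, n_x))
for n in range(n_s):
    psi = np.linalg.solve(K(x).T, -S[n])
    for i in range(n_x):
        dsig_adjoint[n, i] = psi @ (dK[i] @ u)

print(np.allclose(dsig_direct, dsig_adjoint))     # both give the same Jacobian
```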
[𝑥1 , 𝑥 2 , . . . , 𝑥 𝑗 + ℎ, . . . , 𝑥 𝑛 𝑥 ]𝑇 (6.66)
$$J = \begin{bmatrix} J_{11} & 0 & 0 & 0 & 0 \\ 0 & J_{22} & 0 & 0 & 0 \\ 0 & 0 & J_{33} & 0 & 0 \\ 0 & 0 & 0 & J_{44} & 0 \\ 0 & 0 & 0 & 0 & J_{55} \end{bmatrix} \qquad (6.67)$$
For this scenario, the Jacobian can be evaluated with one evaluation
rather than 𝑛 𝑥 evaluations. This is because a given output 𝑓𝑖 depends
on only one input 𝑥 𝑖 . We could think of the outputs as 𝑛 𝑥 independent
functions. Thus, for finite differencing, rather than requiring $n_x$ input vectors with $n_x$ function evaluations, we can use the single input vector $[x_1 + h,\; x_2 + h,\; \ldots,\; x_{n_x} + h]^T$.
A subset of columns that do not have more than one nonzero in the same
row are said to have structurally orthogonal columns. For this example
the following columns are structurally orthogonal: (1, 3), (1, 5), (2, 3), (2,
4, 5), (2, 6), and (4, 5). Structurally orthogonal columns can be combined,
forming a smaller Jacobian that reduces the number of forward passes
required. This reduced Jacobian is referred to as compressed. There is
more than one way to compress the example Jacobian, but for this case
the minimum number of compressed columns (referred to as colors) is
three. A compressed Jacobian is shown below where columns 1 and 3
have been combined, and 2, 4, and 5 have been combined:
$$\begin{bmatrix} J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & J_{45} & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} J_{11} & J_{14} & J_{16} \\ J_{23} & J_{24} & 0 \\ J_{31} & J_{32} & 0 \\ 0 & J_{45} & 0 \\ J_{53} & J_{55} & J_{56} \end{bmatrix} \qquad (6.70)$$
Structurally orthogonal rows can be combined in the same way, compressing the Jacobian by rows:
$$\begin{bmatrix} J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & J_{45} & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} J_{11} & 0 & 0 & J_{14} & J_{45} & J_{16} \\ J_{31} & J_{32} & J_{23} & J_{24} & 0 & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} \qquad (6.71)$$
AD can also be used even more flexibly where both modes are used:
forward passes to evaluate groups of structurally orthogonal columns,
and reverse passes to evaluate groups of structurally orthogonal rows.
Rather than taking incremental steps in each direction as is done in finite
differencing, in AD we set the seed vector with ones in the directions we
wish to evaluate, similar to how the seed is set for directional derivatives
as discussed in Section 6.6.
For these small Jacobians it is fairly straightforward to determine
how best to compress the matrix. For a large matrix this is not so easy.
The approach that is used is called graph coloring. In one approach, a
graph is created with row and column indices as vertices and edges
denoting nonzero entries in the Jacobian. Graph coloring algorithms
use heuristics to estimate the fewest number of “colors” or orthogonal
columns. Graph coloring is a large field of its own with derivative
computation as just one application.§
§ Gebremedhin et al.⁹⁰ provide a review.
[Figure 6.9: Representation of the Jacobian for this example. The blocks indicate areas where a derivative exists, and the blank spots where the derivative is always zero. The left is the original Jacobian and the right is the compressed representation.]
To illustrate the potential benefits of using a sparse representation, the Jacobian was constructed for various sizes of inflow conditions using both forward AD and forward AD with graph coloring (Fig. 6.10). After about 100 inflow conditions, the difference in time required exceeds an order of magnitude (note the log-log scale). As Jacobians are needed at every iteration
[Figure 6.10: Time required to construct the Jacobian versus the number of inflow conditions, with and without graph coloring (log-log scale).]
To get a broader view of these methods, we go back to the notion of the list of variables (6.27) considered when introducing AD:
$$v_i = V_i(v_1, v_2, \ldots, v_i, \ldots, v_n). \qquad (6.72)$$
𝑟 = 𝑅(𝑣) = 0. (6.73)
For the inputs and outputs, the residuals assume that the associated
variables (𝑥 and 𝑓 ) are free, but they are constructed such that the
variables assume the correct values when the residual equations are
satisfied.
Table 6.3: Variable and residual definition needed to recover the various derivative
computation methods with the UDE (6.74). The residuals of the governing equations
are represented by 𝑅 𝑔 to distinguish them from the UDE residuals.
Using the variable and residual definitions from Table 6.3 for the monolithic method in the left-hand side of the UDE (6.74), we get
$$\begin{bmatrix} I & 0 \\ -\dfrac{\partial F}{\partial x} & I \end{bmatrix} \begin{bmatrix} I & 0 \\ \dfrac{\mathrm{d}f}{\mathrm{d}x} & I \end{bmatrix} = I, \qquad (6.75)$$
which yields the obvious result d 𝑓 /d𝑥 = 𝜕𝐹/𝜕𝑥. This is not a particu-
larly useful result, but it shows that the UDE can recover the monolithic
case.
For the analytic derivatives, the left-hand side of the UDE becomes
$$\begin{bmatrix} I & 0 & 0 \\[2pt] -\dfrac{\partial R_g}{\partial x} & -\dfrac{\partial R_g}{\partial u} & 0 \\[2pt] -\dfrac{\partial F}{\partial x} & -\dfrac{\partial F}{\partial u} & I \end{bmatrix} \begin{bmatrix} I & 0 & 0 \\[2pt] \dfrac{\mathrm{d}u}{\mathrm{d}x} & \dfrac{\mathrm{d}u}{\mathrm{d}r} & 0 \\[2pt] \dfrac{\mathrm{d}f}{\mathrm{d}x} & \dfrac{\mathrm{d}f}{\mathrm{d}r} & I \end{bmatrix} = I. \qquad (6.76)$$
Since we are only interested in the d 𝑓 /d𝑥 block in the second matrix,
we can ignore the second and third block columns of that matrix.
Multiplying the remaining blocks out and using the definition $\phi \equiv -\mathrm{d}u/\mathrm{d}x$, we get the direct linear system (6.45) and the total derivative
equation (6.46).
The right-hand side of the UDE yields the transposed system,
$$\begin{bmatrix} I & -\dfrac{\partial R_g}{\partial x}^T & -\dfrac{\partial F}{\partial x}^T \\[2pt] 0 & -\dfrac{\partial R_g}{\partial u}^T & -\dfrac{\partial F}{\partial u}^T \\[2pt] 0 & 0 & I \end{bmatrix} \begin{bmatrix} I & \dfrac{\mathrm{d}u}{\mathrm{d}x}^T & \dfrac{\mathrm{d}f}{\mathrm{d}x}^T \\[2pt] 0 & \dfrac{\mathrm{d}u}{\mathrm{d}r}^T & \dfrac{\mathrm{d}f}{\mathrm{d}r}^T \\[2pt] 0 & 0 & I \end{bmatrix} = I. \qquad (6.77)$$
$$\begin{bmatrix} 1 & & & 0 \\ -\dfrac{\partial V_2}{\partial v_1} & 1 & & \\ \vdots & \ddots & \ddots & \\ -\dfrac{\partial V_n}{\partial v_1} & \cdots & -\dfrac{\partial V_n}{\partial v_{n-1}} & 1 \end{bmatrix} \begin{bmatrix} 1 & & & 0 \\ \dfrac{\mathrm{d}v_2}{\mathrm{d}v_1} & 1 & & \\ \vdots & \ddots & \ddots & \\ \dfrac{\mathrm{d}v_n}{\mathrm{d}v_1} & \cdots & \dfrac{\mathrm{d}v_n}{\mathrm{d}v_{n-1}} & 1 \end{bmatrix} = I, \qquad (6.78)$$
where the Jacobian $\mathrm{d}f/\mathrm{d}x$ is composed of a subset of derivatives in the corner near the $\mathrm{d}v_n/\mathrm{d}v_1$ term. To compute these derivatives, we need to perform forward substitution and compute one column of the total derivative matrix at a time, where each column is associated with one of the inputs of interest.
The reverse mode yields
$$\begin{bmatrix} 1 & -\dfrac{\partial V_2}{\partial v_1} & \cdots & -\dfrac{\partial V_n}{\partial v_1} \\ & 1 & \ddots & \vdots \\ & & \ddots & -\dfrac{\partial V_n}{\partial v_{n-1}} \\ 0 & & & 1 \end{bmatrix} \begin{bmatrix} 1 & \dfrac{\mathrm{d}v_2}{\mathrm{d}v_1} & \cdots & \dfrac{\mathrm{d}v_n}{\mathrm{d}v_1} \\ & 1 & \ddots & \vdots \\ & & \ddots & \dfrac{\mathrm{d}v_n}{\mathrm{d}v_{n-1}} \\ 0 & & & 1 \end{bmatrix} = I, \qquad (6.79)$$
where the derivatives of interest are now near the top right corner of the total derivative matrix. To compute these derivatives, we need to perform back substitutions, which compute one column of the matrix at a time. Since the total derivative matrix is transposed here, the reverse mode actually computes a row of the total derivative Jacobian at a time, where each row is associated with an output of interest. This is consistent with what we concluded before: the cost of the forward mode is proportional to the number of inputs of interest, while the cost of the reverse mode is proportional to the number of outputs of interest.
Problems
6.5 Suppose you have two airplanes that are flying in a horizontal
plane defined by 𝑥 and 𝑦 coordinates. Both airplanes start at
𝑦 = 0, but airplane 1 starts at 𝑥 = 0 while airplane 2 has a head
start of 𝑥 = Δ𝑥. The airplanes fly at a constant velocity. Airplane 1
has a velocity 𝑣1 in the direction of the positive 𝑥-axis and airplane
two has a velocity 𝑣 2 at an angle 𝛾 with the 𝑥-axis. The functions
of interest are the distance (𝑑) and the angle (𝜃) between the two
airplanes as a function of time. The independent variables are Δ𝑥,
𝛾, 𝑣 1 , 𝑣2 , 𝑡. Write the code that computes the functions of interest
(outputs) for a given set of independent variables (inputs). Use
AD to differentiate the code. Choose a set of inputs, compute the
derivatives of all the outputs with respect to the inputs and verify
them against the complex-step method.
𝐸 − 𝑒 sin(𝐸) = 𝑀,
6.7 Compute the derivatives for the ten-bar truss problem described in
Appendix C.2.2 using the direct and adjoint implicit differentiation
methods. We want to compute the derivatives of the objective
(mass) with respect to the design variables (ten cross-sectional
areas), and the derivatives of the constraints (stresses in all ten
bars) with respect to the design variables (a 10 × 10 Jacobian
matrix). Compute the derivatives using:
6.8 We can now solve the ten-bar truss problem (previously solved in Prob. 5.15) using the derivatives computed in Prob. 6.7. Solve this optimization problem using both finite-difference derivatives and an implicit analytic method. Report the following:
6.9 Aggregate the constraints for the ten-bar truss problem and extend
the code from Prob. 6.7 to compute the required constraint deriva-
tives using the implicit analytic method that is most advantageous
in this case. Verify your derivatives against the complex-step
method. Solve the optimization problem and compare your re-
sults to the ones you obtained in Prob. 6.8. How close can you get
to the reference solution?
Gradient-Free Optimization
7
Gradient-free algorithms fill an important role in optimization. The
gradient-based algorithms introduced in Chapter 4 are efficient in
finding local minima for high-dimensional nonlinear problems defined
by continuous smooth functions. However, the assumptions made
for these algorithms are not always valid, which can render these
algorithms ineffective. Also, gradients might not be available, as in the
case of functions given as a black box.
In this chapter, we introduce only a few popular representative
gradient-free algorithms. Most are designed to handle unconstrained
functions only, but they can be adapted to solve constrained problems
by using the penalty or filtering methods introduced in Chapter 5. We
start by discussing the problem characteristics that are relevant to the
choice between gradient-free and gradient-based algorithms and then
give an overview of the types of gradient-free algorithms.
by numerical noise, are not the reason why one believes the physical
design space is multimodal.
[Figure 7.1: Cost of optimization for an increasing number of design variables of the n-dimensional Rosenbrock function, comparing a gradient-free algorithm with a gradient-based algorithm using finite-difference and analytic gradients. A gradient-based optimizer with analytic gradients enables much better scalability.]
[Table 7.1: Classification of gradient-free optimization methods, using the characteristics of Fig. 1.19.⁹⁵ The methods listed—Nelder–Mead, GPS, MADS, trust region, implicit filtering, DIRECT, MCS, EGO, SMFs, branch and fit, hit and run, and evolutionary—are each classified by search type (local or global), optimality criteria (mathematical or heuristic), iteration procedure (mathematical or heuristic), function evaluation (direct or surrogate), and stochasticity (deterministic or stochastic).]
95. Rios et al., Derivative-free optimization: a review of algorithms and comparison of software implementations. 2013
Model-based local-search methods include trust-region algorithms and implicit filtering. The model is an analytic approximation of the original function (also called a surrogate model), and it should be smooth, easy to evaluate, and accurate in the neighborhood of the current point. The trust-region approach detailed in Section 4.5 can be considered gradient-free if the surrogate model is constructed using just evaluations of the original function without evaluating its gradients. This does not prevent the trust-region algorithm from using gradients of the surrogate model, which can be computed analytically. Implicit filtering methods extend the trust-region method by adding a surrogate model of the function gradient and use that to guide the search. This effectively becomes a gradient-based method applied to the surrogate model instead of evaluating the function directly as done for the methods in Chapter 4.
98. Le Digabel, Algorithm 909: NOMAD: Nonlinear Optimization with the MADS Algorithm. 2011
Global-search algorithms can be broadly classified as deterministic
or stochastic, depending on whether they include random parameter
generation within the optimization algorithm.
Deterministic, global-search algorithms can be either direct or
model-based. Direct algorithms include Lipschitzian-based parti-
tioning techniques—such as the “divide a hyperrectangle” (DIRECT)
algorithm detailed in Section 7.4 and branch and bound search (dis-
cussed in Chapter 8)—and multilevel coordinate search (MCS). The
DIRECT algorithm selectively divides the space of the design variables into smaller and smaller n-dimensional boxes (hyperrectangles) and uses mathematical arguments to decide on which boxes should be subdivided.§ Branch-and-bound search also partitions the design space, but estimates lower and upper bounds for the optimum by using the
§ DIRECT is one of the few gradient-free methods that has a built-in way to handle constraints that is not a penalty or filtering method.⁹⁹
99. Jones, Direct Global Optimization Algorithm. 2009
The simplex method of Nelder et al.²³ is a deterministic, direct-search method that is among the most cited of the gradient-free methods. It is also known as the nonlinear simplex—not to be confused with the simplex algorithm used for linear programming, with which it has nothing in common.
23. Nelder et al., A Simplex Method for Function Minimization. 1965
The Nelder–Mead algorithm is based on a simplex, which is a
geometric figure defined by a set of n + 1 points in the design space of n variables, X = {x^(0), x^(1), …, x^(n)}. In two dimensions, the simplex
is a triangle, and in three dimensions it becomes a tetrahedron. Each
optimization iteration is represented by a different simplex. The
algorithm consists in modifying the simplex at each iteration using
five simple operations. The sequence of operations to be performed is
chosen based on the relative values of the objective function at each of
the points.
The first step of the simplex algorithm is to generate n + 1 points based on an initial guess for the design variables. This could be done by simply adding steps to each component of the initial point to generate n new points. However, this would generate a simplex with different edge lengths, and equal-length edges are preferable. Suppose we want the length of all sides to be l and that the first guess is x^(0). The remaining points of the simplex, x^(1), …, x^(n), can then be computed so that all edges have length l. Once the points are sorted by function value in the set X = {x^(0), x^(1), …, x^(n−1), x^(n)}, the best point is x^(0), and the worst one is x^(n).
The Nelder–Mead algorithm performs five main operations on the simplex to create a new one: reflection, expansion, outside contraction, inside contraction, and shrinking. The operations are shown in Fig. 7.3.
Each of these operations, except for shrinking, generates a new point given by
$$x = x_c + \alpha\left(x_c - x^{(n)}\right), \qquad (7.3)$$
where $\alpha$ is a scalar, and $x_c$ is the centroid of all the points except for the worst one, i.e.,
$$x_c = \frac{1}{n}\sum_{i=0}^{n-1} x^{(i)}. \qquad (7.4)$$
This generates a new point along the line that connects the worst point,
𝑥 (𝑛) , and the centroid of the remaining points, 𝑥 𝑐 . This direction can be
seen as a possible descent direction.
[Figure 7.3: Nelder–Mead algorithm operations for n = 2, including reflection, expansion, outside contraction (α = 0.5), inside contraction (α = −0.5), and shrink.]
where 𝛾 = 0.5.
Alg. 7.2 details how a new simplex is obtained for each iteration. In
each iteration, the focus is on replacing the worst point with a better
one, as opposed to improving the best. The corresponding flowchart is
shown in Fig. 7.4.
Inputs:
𝑥 (0) : Starting point
𝜏𝑥 : Simplex size tolerances
𝜏 𝑓 : Function value standard deviation tolerances
Outputs:
𝑥 ∗ : Optimal point
while not converged do    Convergence measured by simplex size and standard deviation of 𝑓 (tolerances 𝜏_𝑥, 𝜏_𝑓)
    Sort x^(0), …, x^(n−1), x^(n)    Order from the lowest (best) to the highest 𝑓(x^(j))
    $x_c = \frac{1}{n}\sum_{i=0}^{n-1} x^{(i)}$    The centroid excluding the worst point x^(n) (7.4)
    $x_r = x_c + (x_c - x^{(n)})$    Reflection, (7.3) with α = 1
    if 𝑓(x_r) < 𝑓(x^(0)) then    Is reflected point better than the best?
        $x_e = x_c + 2(x_c - x^{(n)})$    Expansion, (7.3) with α = 2
        if 𝑓(x_e) < 𝑓(x^(0)) then    Is expanded point better than the best?
            x^(n) = x_e    Accept expansion and replace worst point
        else
            x^(n) = x_r    Accept reflection
        end if
    else if 𝑓(x_r) ≤ 𝑓(x^(n−1)) then    Is reflected better than second worst?
        x^(n) = x_r    Accept reflected point
    else
        if 𝑓(x_r) > 𝑓(x^(n)) then    Is reflected point worse than the worst?
            $x_{ic} = x_c - 0.5(x_c - x^{(n)})$    Inside contraction, (7.3) with α = −0.5
            if 𝑓(x_ic) < 𝑓(x^(n)) then    Inside contraction better than worst?
                x^(n) = x_ic    Accept inside contraction
            else
                for j = 1 to n do
                    $x^{(j)} = x^{(0)} + 0.5(x^{(j)} - x^{(0)})$    Shrink, (7.5) with γ = 0.5
                end for
            end if
        else
            $x_{oc} = x_c + 0.5(x_c - x^{(n)})$    Outside contraction, (7.3) with α = 0.5
            if 𝑓(x_oc) < 𝑓(x_r) then    Is contraction better than reflection?
                x^(n) = x_oc    Accept outside contraction
            else
                for j = 1 to n do
                    $x^{(j)} = x^{(0)} + 0.5(x^{(j)} - x^{(0)})$    Shrink, (7.5) with γ = 0.5
                end for
            end if
        end if
    end if
end while
The cost for each iteration is one function evaluation if the reflection
is accepted, two function evaluations if an expansion or contraction is
performed, and 𝑛 + 2 evaluations if the iteration results in shrinking.
Although we could parallelize the 𝑛 evaluations when shrinking, it
would not be worthwhile because the other operations are sequential.
There are a number of ways to quantify the convergence of the
simplex method. One straightforward way is to use the size of the simplex, i.e.,
$$\Delta_x = \sum_{i=0}^{n-1} \left\lVert x^{(i)} - x^{(n)} \right\rVert, \qquad (7.6)$$
[Figure 7.4: Flowchart of Nelder–Mead (Alg. 7.2), showing the reflection, expansion, contraction, and shrink branches.]
Note that the methodology, like most direct-search methods, cannot directly handle constraints. One approach to handle constraints would
Figure 7.5 shows the sequence of simplices that results when minimizing
the bean function using a Nelder–Mead simplex. The initial simplex on the
upper left is equilateral. The first iteration is a reflection, followed by an
inside contraction, another reflection, and an inside contraction before the
shrinking. The simplices then shrink dramatically in size, slowly converging to
the minimum.
Using a convergence tolerance of $10^{-6}$ in the difference between $f_\text{best}$ and $f_\text{worst}$, the problem took 68 function evaluations.
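For readers who want to experiment without implementing Alg. 7.2 themselves, the sketch below uses SciPy's Nelder–Mead implementation on the n-dimensional Rosenbrock function (the function used in Fig. 7.1); the starting point and tolerances are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    return sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

x0 = np.array([-1.2, 1.0])
res = minimize(rosenbrock, x0, method='Nelder-Mead',
               options={'xatol': 1e-6, 'fatol': 1e-6})
print(res.x, res.nfev)   # solution estimate and number of function evaluations
```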
gradient-free optimization algorithms in this chapter in that it is based
47. Jones et al., Lipschitzian optimization without the Lipschitz constant. 1993
to the left and right, respectively. We show this cone in Fig. 7.6 (left), as well as cones corresponding to other values of k.
Shubert’s Algorithm
If a Lipschitz constant for a single variable function is known, Shubert’s
algorithm can find the global minimum of that function. Because the
Lipschitz constant is not available in the general case, the DIRECT
algorithm is designed so that it does not require this constant. However,
we explain Shubert’s algorithm first because it provides some of the
basic concepts used in the DIRECT algorithm.
Shubert’s algorithm starts with a domain within which we want to
find the global minimum—[𝑎, 𝑏] in Fig. 7.7. Using the property of the
Lipschitz constant 𝑘 defined in Eq. 7.10, we know that the function is
always above a cone of slope 𝑘 evaluated at any point in the domain.
We start by establishing a first lower bound on the global minimum
by finding the intersection of the cones—𝑥1 in Fig. 7.7 (left)—for the
extremes of the domain. We evaluate the function at x₁ and can now draw a cone about this point to find two more intersections (x₂ and x₃). Because these two points always intersect at the same objective lower bound value, they both need to be evaluated to see which one has the highest lower bound increase (the x₃ side in this case).
[Figure 7.7: Shubert's algorithm requires an initial domain and a valid Lipschitz constant (left) and then increases the lower bound of the global minimum with each successive iteration (right).]
Each
has the highest lower bound increase (the 𝑥3 side in this case). Each
subsequent iteration of Shubert’s algorithm adds two new points to
either side of the current point. These two points are evaluated to find
out which side has the lowest actual function value and that side gets
selected to be divided.
The lowest bound on the function increases at each iteration and
ultimately converges to the global minimum. At the same time, the
segments in 𝑥 decrease in size. The lower bound can switch from
distinct regions, as the lower bound in one region increases beyond
the lower bound in another region. Using the minimum Lipschitz
constant in this algorithm would be the most efficient because it would
correspond to the largest possible increments in the lower bound at
each iteration.
The two major shortcomings of Shubert's algorithm are that (1) a Lipschitz constant is usually not available for a general function and (2) it is not easily extended to n dimensions. These two shortcomings are addressed by the DIRECT algorithm.
One-dimensional DIRECT
Before explaining the 𝑛-dimensional DIRECT algorithm, we introduce
the one-dimensional version, which is based on principles similar to
those of the Shubert algorithm. The main difference is that instead
of evaluating at the cone intersection points, we divide the segments
evenly and evaluate the center of the segments.
Consider the closed domain [𝑎, 𝑏] shown in Fig. 7.8 (left). For each
segment, we evaluate the objective function at the midpoint of the
segment. In the first segment, which spans the whole domain, this is
𝑐 0 = (𝑎 + 𝑏)/2. Assuming some value of 𝑘, which is not known and
which we will not need, the lower bound on the minimum would be
𝑓 (𝑐) − 𝑘(𝑏 − 𝑎)/2.
[Figure 7.8: The DIRECT algorithm evaluates the middle point c = (a + b)/2 of each segment (left), and each successive iteration trisects the segments that have the greatest potential (right).]
We want to increase this lower bound on the function minimum by dividing this segment further. To do this in a regular way that reuses previously evaluated points and can be repeated indefinitely, we divide it into three segments, as shown in Fig. 7.8 (right). Now we
have increased the lower bound on the minimum. Unlike the Shubert
algorithm, the lower bound is a discontinuous function across the
segments, as shown in Fig. 7.8 (right). We now have a regular division
of segments, which is more amenable for extending the method to 𝑛
dimensions.
Instead of continuing to divide every segment into three other
segments, we only divide segments selected according to a potentially
optimal criterion. To better understand this criterion, consider a set of
segments [𝑎 𝑖 , 𝑏 𝑖 ] at a given DIRECT iteration, where segment 𝑖 has a
half length 𝑑 𝑖 = (𝑏 𝑖 − 𝑎 𝑖 )/2 and a function value 𝑓 (𝑐 𝑖 ) evaluated at the
segment center 𝑐 𝑖 = (𝑎 𝑖 + 𝑏 𝑖 )/2. If we plot 𝑓 (𝑐 𝑖 ) versus 𝑑 𝑖 for a set of
segments, we get the pattern shown in Fig. 7.9.
[Figure 7.9: f(c_i) versus d_i for the current set of segments; the piecewise-linear lower convex hull identifies the potentially optimal segments.]
The overall rationale for the potentially optimal criterion is that there
are two metrics that quantify this potential: the size of the segment and
the function value at the center of the segment. The greater the size of
the segment, the greater the potential for containing a minimum. The
lower the function value, the greater that potential is as well. For a set
of segments of the same size, we know that the one with the lowest
function value has the best potential and should be selected. If two segments had the same function value and different sizes, the one with the largest size would be selected. For a general set of segments with various sizes and value combinations, there might be multiple segments that can be considered potentially optimal.
We identify potentially optimal segments as follows. If we draw a
line with a slope corresponding to a Lipschitz constant 𝑘 from any point
in Fig. 7.9, the intersection of this line with the vertical axis is a bound
on the objective function for the corresponding segment. Therefore,
the lowest bound for a given 𝑘 can be found by drawing a line through
the point that achieves the lowest intersection.
However, we do not know 𝑘 and we do not want to assume a value
because we do not want to bias the search. If 𝑘 were high, it would favor
dividing the larger segments. Low values of 𝑘 would result in dividing
the smaller segments. The DIRECT method hinges on considering all
possible values of 𝑘, effectively eliminating the need for this constant.
To eliminate the dependence on 𝑘, we select all the points for which
there is a line with slope 𝑘 that does not go above any other point. This
corresponds to selecting the points that form a lower convex hull, as
shown by the piecewise linear function in Fig. 7.9. This establishes a
lower bound on the function for each segment size.
Mathematically, a segment 𝑗 in the set of current segments 𝑆 is said
to be potentially optimal if there is a 𝑘 ≥ 0 such that
𝑓 (𝑐 𝑗 ) − 𝑘𝑑 𝑗 ≤ 𝑓 (𝑐 𝑖 ) − 𝑘𝑑 𝑖 ∀𝑖 ∈ 𝑆 (7.11)
𝑓 (𝑐 𝑗 ) − 𝑘𝑑 𝑗 ≤ 𝑓min − 𝜀 𝑓min (7.12)
where 𝑓min is the best current objective function value, and 𝜀 is a small
positive parameter. The first condition corresponds to finding the
points in the lower convex hull mentioned previously.
The second condition in Eq. 7.12 ensures that the potential minimum
is better than the lowest function value so far by at least a small amount.
This prevents the algorithm from becoming too local, wasting function
evaluations in search of smaller function improvements. The parameter ε balances the search between local and global. A typical value is ε = 10⁻⁴, and its range is usually such that 10⁻⁷ ≤ ε ≤ 10⁻².
There are efficient algorithms for finding the convex hull of an
arbitrary set of points in two dimensions, such as the Jarvis march.
These algorithms are more than we need here, since we only require the
lower part of the convex hull, so they can be simplified for this purpose.
As in the Shubert algorithm, the division might switch from one part
of the domain to another, depending on the new function values. When
compared to the Shubert algorithm, the DIRECT algorithm produces
a discontinuous lower bound on the function values, as shown in
Fig. 7.10.
DIRECT in 𝑛 Dimensions
The n-dimensional DIRECT algorithm is similar to the one-dimensional version but becomes more complex.†† The main difference is that we deal with hyperrectangles instead of segments. A hyperrectangle can be defined by its centerpoint position c in n-dimensional space and a half-length in each direction i, δe_i, as shown in Fig. 7.11. The DIRECT
†† In this chapter, we present an improved version of DIRECT.⁹⁹
99. Jones, Direct Global Optimization Algorithm. 2009
algorithm assumes that the initial dimensions are normalized so that
we start with a hypercube.
[Figure 7.11: Hyperrectangle in three dimensions, where d is the maximum distance between the center and the vertices and δe_i is the half-length in each direction i.]
Inputs:
x̄ : Variable upper bounds
x : Variable lower bounds
Outputs:
x* : Optimal point
The values for these three points are plotted in the 2nd column from
the right in the 𝑓 -𝑑 plot, where the center point is reused, as indicated
by the arrow and the matching color. At this iteration, we have two
points that define the convex hull. In the second iteration, we have
three rectangles of the same size, so we divide the one with the lowest
value and evaluate the centers of the two new rectangles (which are
squares in this case). We now have another column of points in the 𝑓 -𝑑
plot corresponding to a smaller 𝑑 and an additional point that defines
the lower convex hull. Because the convex hull now has two points, we
trisect two different rectangles in the third iteration.
[Figure 7.13]
[Figure 7.14]
7.5 Genetic Algorithms
Genetic algorithms (GAs) are the most well-known and widely used type of evolutionary algorithm. They were also among the earliest to have been developed.‡‡ GAs, like many evolutionary algorithms, are population based: the optimization starts with a set of design points (the population) rather than a single starting point, and each optimization
iteration updates this set in some way. Each iteration in the GA is called a generation, and each generation has a population with N_p points. A chromosome is used to represent each point and contains the values for all the design variables, as shown in Fig. 7.15. Each design variable is represented by a gene. As we will see later, there are different ways for genes to represent the design variables.
GAs evolve the population using an algorithm inspired by biological reproduction and evolution using three main steps: (1) selection, (2) crossover, and (3) mutation.
‡‡ The first GA software was written in 1954, followed by other seminal work.¹⁰¹ Initially, these GAs were not written to perform optimization, but rather, to model the evolutionary process. GAs were eventually applied to optimization.¹⁰²
101. Barricelli, Esempi numerici di processi di evoluzione. 1954
102. Jong, An analysis of the behavior of a class of genetic adaptive systems. 1975
[Figure 7.15: A population of chromosomes; each chromosome represents a point x^(i) = (x_1, x_2, …, x_n), and each design variable is a gene.]
specified for the generation of the initial population, and the size of that population varies. Similarly, there are many possible methods for selecting the parents, for generating the offspring, and for selecting the survivors. Here, the new population (P_{k+1}) is formed exclusively by the offspring generated from crossover. However, some GAs add an extra selection process that selects a surviving population of size N_p among the population of parents and offspring.
Inputs:
x̄ : Variable upper bounds
x : Variable lower bounds
Outputs:
x* : Optimal point

k = 0
P_k = {x^(1), x^(2), …, x^(N_P)}    Generate initial population
while k < k_max do
    Compute 𝑓(x) ∀ x ∈ P_k    Evaluate fitness
    Select N_P/2 parent pairs from P_k for crossover    Selection
    Generate a new population of N_P offspring (P_{k+1})    Crossover
[Figure 7.16: At each GA iteration, pairs of parents are selected from the population to generate the offspring through crossover, which become the new population.]
$$\Delta x = \frac{\bar{x} - \underline{x}}{2^m - 1}. \qquad (7.14)$$
To have a more precise representation, we must use more bits.
When using binary-encoded GAs, we do not need to encode the de-
sign variables (since they are generated and manipulated directly in the
binary representation), but we do need to decode them before providing
them to the evaluation function. To decode a binary representation, we
use
$$x = \underline{x} + \left(\sum_{i=0}^{m-1} b_i 2^i\right)\Delta x. \qquad (7.15)$$
𝑖 1 2 3 4 5 6 7 8 9 10 11 12
𝑏𝑖 0 0 0 1 0 1 1 0 0 0 0 1
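Decoding a chromosome like the one above follows Eqs. 7.14 and 7.15 directly; the bounds in this sketch are illustrative, since the example's bounds are not given here, and the bits are indexed from 0 as in Eq. 7.15.

```python
def decode(bits, x_lower, x_upper):
    """Decode a binary gene into a real value (Eqs. 7.14 and 7.15)."""
    m = len(bits)
    dx = (x_upper - x_lower) / (2 ** m - 1)                   # resolution, Eq. 7.14
    return x_lower + sum(b * 2 ** i for i, b in enumerate(bits)) * dx

# The 12-bit chromosome listed above, with illustrative bounds [-2, 2]
bits = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
print(decode(bits, -2.0, 2.0))
```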
Initial Population
The first step in a genetic algorithm is to generate an initial set (pop-
ulation) of points. As a rule of thumb, the population size should be
approximately one order of magnitude larger than the number of design
variables, but in general you will need to experiment with different
population sizes.
One popular way to choose the initial population is to do it at random.
Using binary encoding, we can assign each bit in the representation of
the design variables a 50% chance of being either 1 or 0. This can be
done by generating a random number 0 ≤ 𝑟 ≤ 1 and setting the bit to 0
if r ≤ 0.5 and to 1 if r > 0.5. For a population of size N_P, with n_x design variables each encoded using m bits, the total number
of bits that needs to be generated is 𝑁𝑃 × 𝑛 𝑥 × 𝑚.
To achieve better spread in a larger dimension space, methods like
Latin hypercube sampling are generally more effective than random
populations (discussed in Section 10.2).
Evaluate Fitness
The objective function for all the points in the population must be
evaluated and then converted to a fitness value. These evaluations
could be done in parallel. The numerical optimization convention
is usually to minimize the objective, while the GA convention is to
maximize the fitness. Therefore, we can convert the objective to fitness
simply by setting 𝐹 = − 𝑓 .
For some types of selection (like the tournament selection detailed
in the next step) all the fitness values need to be positive. To achieve
that, we can perform the following conversion:
$$F_i = \frac{-f_i + \Delta F}{\max\left(1, \Delta F - f_\text{low}\right)}, \qquad (7.16)$$
where Δ𝐹 = 1.1 𝑓high −0.1 𝑓low is based on the highest and lowest function
values in the population, and the denominator is introduced to scale
the fitness.
Selection
In this step we choose points from the population for reproduction
in a subsequent step. On average, it is desirable to choose a mating
pool that improves in fitness (thus mimicking the concept of natural
selection), but it is also important to maintain diversity. In total, we
need to generate 𝑁𝑃 /2 pairs.
The simplest selection method is to randomly select two points from
the population until the requisite number of pairs is complete. This
approach is not particularly effective because there is no mechanism to
move the population toward points with better objective functions.
Tournament selection is a better method that randomly pairs up 𝑁𝑃
points, and selects the best point from each pair to join the mating pool.
The same pairing and selection process is repeated to create 𝑁𝑃 /2 more
points to complete a mating pool of 𝑁𝑃 points.
Figure 7.17 illustrates the process with a very small population. Each
member of the population ends up in the mating pool zero, one, or two times
with better points more likely to appear in the pool. The best point in the
population will always end up in the pool twice, while the worst point in the
population will always be eliminated.
[Figure 7.17: Tournament selection example.]
Roulette wheel selection gives better points a larger sector on the roulette wheel so that they have a higher probability of being selected.
To find the sizes of the sectors in the roulette wheel selection, we
use the fitness value defined by Eq. 7.16. We then take the normalized
cumulative sum of the scaled fitness values to compute an interval for
each members in the population 𝑗 as
$$ S_j = \frac{\sum_{i=1}^{j} F_i}{\sum_{i=1}^{N_P} F_i} . \tag{7.17} $$
This ensures that the probability of a member being selected for repro-
duction is proportional to its scaled fitness value.
Assume that 𝐹 = [20, 5, 45, 10]. Then 𝑆 = [0.25, 0.3125, 0.875, 1], which divides the “wheel” into four segments, shown graphically in Fig. 7.18.
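A minimal sketch of roulette wheel selection based on Eq. 7.17 (assuming NumPy; the numbers follow the example above):

```python
import numpy as np

def roulette_selection(F, n_select, rng=None):
    """Select n_select members with probability proportional to fitness F."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.cumsum(F) / np.sum(F)      # normalized cumulative sum (Eq. 7.17)
    r = rng.random(n_select)          # spins of the wheel
    return np.searchsorted(S, r)      # first interval that contains each r

# Example from the text: F = [20, 5, 45, 10] gives S = [0.25, 0.3125, 0.875, 1.0]
idx = roulette_selection(np.array([20.0, 5.0, 45.0, 10.0]), n_select=6)
```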
Figure 7.18: Roulette wheel selection example.

Crossover

In the reproduction operation, two points (offspring) are generated from a pair of points (parents). Various strategies are possible in genetic algorithms. Single-point crossover usually involves generating a random integer 1 ≤ 𝑘 ≤ 𝑚 − 1 that defines the crossover point. This is illustrated in Table 7.2. For one of the offspring, the first 𝑘 bits are taken from, say, parent 1 and the remaining bits from parent 2. For the second offspring,
the first 𝑘 bits are taken from parent 2 and the remaining ones from
parent 1. Various extensions exist like two-point crossover or 𝑛-point
crossover.
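A sketch of single-point crossover on bit strings (assuming NumPy arrays of 0s and 1s; names are illustrative):

```python
import numpy as np

def single_point_crossover(parent1, parent2, rng=None):
    """Produce two offspring by swapping the tails of two bit strings."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(parent1)
    k = rng.integers(1, m)  # crossover point, 1 <= k <= m - 1
    child1 = np.concatenate([parent1[:k], parent2[k:]])
    child2 = np.concatenate([parent2[:k], parent1[k:]])
    return child1, child2
```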
Mutation
Mutation is a random operation performed to change the genetic information and is needed because even though selection and reproduction effectively recombine existing information, occasionally some useful genetic information may be absent from the population; mutation introduces new information and helps maintain diversity. In a binary-encoded GA, each bit is typically flipped with some small probability.

The same overall algorithm can also be applied when the design variables are encoded directly as real numbers (a real-encoded GA); the operations are adapted as follows.

Initial Population

The most common approach is to pick the 𝑁𝑃 points using random sampling within the provided design bounds. For each design variable 𝑥𝑖, with bounds such that x̲𝑖 ≤ 𝑥𝑖 ≤ 𝑥̄𝑖, we could use

$$ x_i = \underline{x}_i + r \left( \bar{x}_i - \underline{x}_i \right) , \tag{7.19} $$

where 𝑟 is a random number in the interval [0, 1].
Selection
The selection operation does not depend on the design variable en-
coding, and therefore, we can just use any of the selection approaches
already described in the binary-encoded GA.
Crossover
When using real-encoding, the term “crossover” does not accurately
describe the process of creating the two offspring from a pair of points.
Instead, the approaches are more accurately described as a blending,
although the name crossover is still often used.
There are various options for the reproduction of two points encoded
using real numbers. A common method is linear crossover, which
generates two or more points in the line defined by the two parent
points. One option for linear crossover is to generate the following two
points:
$$ \begin{aligned} x_{c1} &= 0.5\, x_{p1} + 0.5\, x_{p2} , \\ x_{c2} &= 2 x_{p2} - x_{p1} , \end{aligned} \tag{7.20} $$
where parent 2 is fitter than parent 1, that is, 𝑓(𝑥𝑝2) < 𝑓(𝑥𝑝1). An
example of this linear crossover approach is shown in Fig. 7.19, where
we can see that child 1 is the average of the two parent points, while
child 2 is obtained by extrapolating in the direction of the “fitter” parent.
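A sketch of this linear crossover (Eq. 7.20), assuming NumPy arrays for the parents:

```python
import numpy as np

def linear_crossover(x_p1, x_p2):
    """Linear crossover; parent 2 is assumed to be the fitter parent (Eq. 7.20)."""
    x_c1 = 0.5 * x_p1 + 0.5 * x_p2   # average of the two parents
    x_c2 = 2.0 * x_p2 - x_p1         # extrapolation past the fitter parent
    return x_c1, x_c2
```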
Figure 7.19: Linear crossover produces two new points along the line defined by the two parent points.

Another option is a simple crossover like the binary case, where a random integer is generated to split the vectors. For example, with a split after the first index:

$$ \begin{aligned} x_{p1} &= [x_1, x_2, x_3, x_4] \\ x_{p2} &= [x_5, x_6, x_7, x_8] \\ &\Downarrow \\ x_{c1} &= [x_1, x_6, x_7, x_8] \\ x_{c2} &= [x_5, x_2, x_3, x_4] \end{aligned} \tag{7.21} $$

This simple crossover does not generate as much diversity as the binary case does and relies more heavily on effective mutation. Many other strategies have been devised for real-encoded GAs.104
104. Deb, Multi-Objective Optimization Using Evolutionary Algorithms, 2001.
Mutation
Like a binary-encoded GA, mutation should only occur with a small
probability (e.g., 𝑝 = 0.005 ∼ 0.1). However, rather than changing
each bit with probability 𝑝, we now change each design variable with
probability 𝑝.
Many mutation methods rely on random variations around an existing member, such as adding a uniform random perturbation to each selected design variable; a minimal sketch of this idea follows.
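This sketch applies a uniform random perturbation, scaled by the variable bounds, to each design variable with probability 𝑝 (assuming NumPy; the particular scaling choice is an illustrative assumption, not the book's specific operator):

```python
import numpy as np

def uniform_mutation(x, x_lower, x_upper, p=0.01, scale=0.1, rng=None):
    """Mutate each design variable with probability p.

    The perturbation is uniform in [-scale, +scale] times the variable range,
    and the result is clipped back into the bounds.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_new = x.copy()
    mutate = rng.random(x.shape) < p
    step = scale * (x_upper - x_lower) * (2.0 * rng.random(x.shape) - 1.0)
    x_new[mutate] += step[mutate]
    return np.clip(x_new, x_lower, x_upper)
```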
Figure 7.20 shows the evolution of the population when minimizing the
bean function using a genetic algorithm. The initial population size was 40,
and the simulation was run for 14 generations, requiring 2000 total function
evaluations. Convergence was assumed if the best member in the population
improved by less than 10−4 for 3 consecutive generations.
Figure 7.20: Population evolution at iterations 𝑘 using a genetic algorithm to minimize the bean function.

A common way to handle constraints in a GA is to compare points using the following rules:

1. among a feasible and an infeasible solution, choose the feasible one

2. among two feasible solutions, choose the one with a better objective
3. among two infeasible solutions, choose the one with a smaller
constraint violation
This concept is a lot like the filter methods discussed in Section 5.6.
7.5.4 Convergence
Rigorous mathematical convergence criteria, like those used in gradient-
based optimization, do not apply to genetic algorithms. The most
common way to terminate a genetic algorithm is to simply specify a
maximum number of iterations, which corresponds to a computational
budget. Another similar approach is to run indefinitely until the user
manually terminates the algorithm, usually by monitoring the trends
in population fitness.
A more automated approach is to track a running average of the population fitness, although it can be difficult to decide what tolerance to apply to this criterion because we are generally not interested in the average performance anyway. Perhaps a more direct metric of interest is to track the fitness of the best member in the population. However, this
Although these are just design points, the history for each point is relevant to the PSO algorithm, so we adopt the term “particle”.105
105. Eberhart et al., New Optimizer Using Particle Swarm Theory, 1995.
Each particle moves according to a velocity, and this velocity changes
according to the past objective function values of that particle and
the current objective values of the rest of the particles. Each particle
remembers the location where it found its best result so far and it
exchanges information with the swarm about the location where the
swarm has found the best result so far.
The position of particle 𝑖 for iteration 𝑘 + 1 is updated according to
$$ x^{(i)}_{k+1} = x^{(i)}_{k} + v^{(i)}_{k+1} \Delta t , \tag{7.24} $$
where Δ𝑡 is a constant artificial time step. The velocity for each particle
is updated as follows:
$$ v^{(i)}_{k+1} = \bar{w}\, v^{(i)}_{k} + c_1 r_1 \frac{x^{(i)}_\text{best} - x^{(i)}_{k}}{\Delta t} + c_2 r_2 \frac{x_\text{best} - x^{(i)}_{k}}{\Delta t} . \tag{7.25} $$
The first component in this update is the “inertia”, which, through the
parameter 𝑤, ¯ dictates how much the new velocity should tend to be
the same as the one in the previous iteration.
The second term represents “memory” and is a vector pointing
toward the best position particle 𝑖 has seen in all its iterations so far, 𝑥_best^(𝑖). The weight in this term consists of a constant 𝑐₁ and a random number 𝑟₁ in the interval [0, 1]. The third term represents the “social” influence: a vector pointing toward the best position found so far by the entire swarm, 𝑥_best, weighted by a constant 𝑐₂ and another random number 𝑟₂. Multiplying the velocity update (Eq. 7.25) by Δ𝑡 yields the step

$$ \Delta x^{(i)}_{k+1} = w \Delta x^{(i)}_{k} + c_1 r_1 \left( x^{(i)}_\text{best} - x^{(i)}_{k} \right) + c_2 r_2 \left( x_\text{best} - x^{(i)}_{k} \right) . \tag{7.26} $$

We then use this step to update the particle position for the next iteration, i.e.,

$$ x^{(i)}_{k+1} = x^{(i)}_{k} + \Delta x^{(i)}_{k+1} . \tag{7.27} $$
The three components of the update (7.26) are shown in Fig. 7.21 for a
two-dimensional case.
Figure 7.21: Components of the PSO update.
Typical values for the inertia parameter 𝑤 are in the interval [0.8, 1.2].
A lower value of 𝑤 reduces the particle’s inertia and tends toward faster
convergence to a minimum, while a higher value of 𝑤 increases the
particle’s inertia and tends toward increased exploration to potentially
help discover multiple minima. Thus, there is a tradeoff in this value.
Both 𝑐 1 and 𝑐2 values are in the interval [0, 2], and typically closer to 2.
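A minimal sketch of the PSO update (Eqs. 7.26 and 7.27), assuming NumPy and illustrative parameter values within the ranges given above:

```python
import numpy as np

def pso_step(x, dx, x_best_i, x_best_swarm, w=1.0, c1=1.8, c2=1.8, rng=None):
    """One PSO update for all particles.

    x            -- current particle positions, shape (n_particles, n_x)
    dx           -- previous steps, same shape
    x_best_i     -- best position each particle has seen, same shape
    x_best_swarm -- best position the swarm has seen, shape (n_x,)
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    dx_new = w * dx + c1 * r1 * (x_best_i - x) + c2 * r2 * (x_best_swarm - x)
    return x + dx_new, dx_new
```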
The first step in the PSO algorithm is to initialize the set of particles
(Alg. 7.12). Like a GA, the initial set of points can be determined at
random or can use a more sophisticated design of experiments strategy
(like Latin hypercube sampling). The main loop in the algorithm
computes the steps to be added to each particle and updates their
positions. A number of convergence criteria are possible, some of
which are similar to the simplex method and GA: the distance (sum
7 Gradient-Free Optimization 243
or norm) between each particle and the best particle falls below some
tolerance, the best particle’s fitness changes by less than some tolerance
across multiple generations, the difference between the best and worst
member falls below some tolerance. In the case of PSO, another alternative is to check whether the velocities of all particles (measured by their norm, mean, etc.) fall below some tolerance. Criteria that assume all the particles will congregate (distances, velocities) do not work well for multimodal problems; in those cases, tracking just the best particle's fitness may be more appropriate.
Figure 7.22 shows the evolution of the swarm when minimizing the bean function using a particle swarm method. The initial population size
was 40 and the optimization required 600 function evaluations. Convergence
was assumed if the best value found by the population did not improve by
more than 10−4 for 3 consecutive iterations.
Inputs:
    x_upper: Variable upper bounds
    x_lower: Variable lower bounds
    w: “Inertia” parameter
    c1: Self influence parameter
    c2: Social influence parameter
Outputs:
    x*: Optimal point

k = 0
for all i do                                   Loop to initialize all particles
    Generate position x_0^(i) within specified bounds
    x_best^(i) = x_0^(i)                        First position is the best so far
    Evaluate f(x_0^(i))
    if i = 0 then
        x_best = x_0^(i)
    else if f(x_0^(i)) < f(x_best) then
        x_best = x_0^(i)
    end if
    Initialize “velocity” Δx_k^(i)
end for
while not converged do                          Main iteration loop
    Δx_{k+1}^(i) = w Δx_k^(i) + c1 r1 (x_best^(i) − x_k^(i)) + c2 r2 (x_best − x_k^(i))
    x_{k+1}^(i) = x_k^(i) + Δx_{k+1}^(i)         Update the particle position while enforcing bounds
    if x_{k+1}^(i) < x_lower or x_{k+1}^(i) > x_upper then
        Δx_{k+1}^(i) = c1 r1 (x_best^(i) − x_k^(i)) + c2 r2 (x_best − x_k^(i))
        x_{k+1}^(i) = x_k^(i) + Δx_{k+1}^(i)
    end if
    for all x_{k+1}^(i) do
        Evaluate f(x_{k+1}^(i))
        if f(x_{k+1}^(i)) < f(x_best^(i)) then
            x_best^(i) = x_{k+1}^(i)
        end if
    end for
    k = k + 1
end while
We now return to the Jones function (Eq. 7.13 used in Ex. 7.5 to demonstrate
the DIRECT method), but make it discontinuous by adding the following
function:
$$ \Delta f = 4 \left\lceil \sin(\pi x_1) \sin(\pi x_2) \right\rceil . \tag{7.28} $$
By taking the ceiling of the product of the two sine waves, this function creates
a checkerboard pattern with zeros and fours. Adding this function to the Jones
function produces the discontinuous function shown in Fig. 7.23, where we
can clearly see the discontinuities. The global optimum remains the same as
the original function. The resulting optimization paths demonstrate that the
gradient-free algorithms are effective. Both the GA and PSO find the global
minimum, but they require a large number of evaluations for the same accuracy.
Nelder–Mead converges quickly, but not to the global minimum.
Figure 7.23: The discontinuous Jones function with optimization paths for the different algorithms (panels annotated with 2420, 760, 55, 99, and 384 function evaluations).
as you can afford with the lowest convergence tolerance possible and tabulate the
number of function evaluations and the respective objective function values.
To compare the computational cost for a specified tolerance, you can determine
the number of function evaluations that each algorithm requires to achieve
a given number of digits agreement in the objective function. Alternatively,
you can compare the objective achieved for the different algorithms for a
given number of function evaluations. Comparison becomes more challenging
for constrained problems because a better objective that is less feasible is not
necessarily better. In that case, you need to make sure that all results are feasible
to the same tolerance. When comparing algorithms that include stochastic procedures (e.g., GA, PSO), you should run each optimization multiple times
to get statistically significant data and compare the mean and variance of the
performance metrics.
7.7 Summary
Problems
7.3 Program the DIRECT algorithm and perform the following stud-
ies:
7.5 Program the PSO algorithm and perform the following studies:
a) Gradient-free algorithm
b) Gradient-based algorithm with gradients computed using
finite differences
c) Gradient-based algorithm with exact gradients
Even though a discrete optimization problem limits the options and thus conceptually sounds easier to solve, in practice discrete optimization problems are usually much more difficult and costly to solve than continuous ones. Thus, if it is reasonable to do so, it is often desirable to find ways to avoid using discrete design variables. There are a couple of ways this can be accomplished.
The first approach is an exhaustive search. We just discussed how
exhaustive search scales poorly, but sometimes we have many contin-
uous variables but only a few discrete variables with few options. In
this case enumerating all options is possible. For each combination of
discrete variables, the optimization is repeated using all continuous
variables. We then choose the best feasible solution amongst all of these optimizations. Assuming the continuous part of the problem can be solved, this approach will lead to the true optimum.
$$ \begin{aligned} \text{minimize} \quad & c^T x \\ \text{subject to} \quad & \hat{A} x \le \hat{b} \\ & A x + b = 0 \\ & x_i \in \mathbb{Z}^{+} \; \text{for some or all } i \end{aligned} \tag{8.1} $$
selected:
$$ \sum_{i=1}^{n} x_i = 1 \tag{8.2} $$
then we can prune that branch. We know this because adding more
constraints will always lead to a solution that is either the same or
worse, never better (assuming you always find the global optimum,
which we can guarantee for LP problems). The solution from a relaxed
problem provides a lower bound—the best that could be achieved if
continuing on that branch. The logic for these various possibilities
is summarized in Alg. 8.4. The initial starting point for 𝑓best can be
𝑓best = ∞ if nothing is known, but if a known feasible solution exists, or
can be found quickly by some heuristic, providing any finite best point
can often greatly speed up the optimization.
Inputs:
𝑓best : Best known solution, if any; otherwise 𝑓best = ∞
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
We begin at the first node by solving the linear relaxation. The binary
constraint is removed, and instead replaced with continuous bounds: 0 ≤ 𝑥 𝑖 ≤ 1.
The solution to this LP is:
While depth-first search was recommended above, for this example we use breadth-first search only because it is shorter in this case, giving a more concise example.
The depth-first tree is also shown at the end of the example. We solve both of the
problems at this next level as shown in Fig. 8.4. Neither of these optimizations
yields all binary values so we have to branch both of them. In this case the left
node branches on 𝑥 2 (the only fractional component) and the right node also
branches on 𝑥2 (the most fractional component).
The first branch (see Fig. 8.5) yields a feasible binary solution! The corre-
sponding function value 𝑓 = −4 is saved as the best value we have seen so far.
There is no need to continue on this branch as the solution cannot be improved
on this particular branch.
We continue solving along the rest of this row (Fig. 8.6). The third node
on this row yields another binary solution. In this case the function value is
𝑓 = −4.9, which is better, and so we save this as the best value we have seen so
far. The second and fourth nodes do not yield a solution. Normally we’d have
to branch these further but both of them have a lower bound which is worse
than the best solution we have found so far. Thus, we can prune both of these
branches.
All branches have been pruned and so we have solved the original problem:
$$ x^* = [1, 0, 1, 1]^T , \quad f^* = -4.9 \tag{8.6} $$
Alternatively, we could have used a depth-first strategy. In this case, it is
less efficient, but in general that is not known beforehand. The depth-first tree
for this same example is depicted in Fig. 8.7. Feasible solutions to the problem
are shown with 𝑓 ∗ .
Figure 8.7: The search path with a depth-first strategy instead.
Figure 8.8: A breadth-first search of the mixed-integer programming example.
Once all the branches are pruned we see that the solution is:
$$ x^* = [0, 2, 3, 0.5]^T , \quad f^* = -13.75 . \tag{8.10} $$
Greedy algorithms are perhaps the simplest approach for discrete opti-
mization problems. This approach is more of a concept than a specific
algorithm. The implementation varies with the application. The idea is
to reduce the problem to a subset of smaller problems (often down to a
single choice), and then make a locally optimal decision. That decision
is locked in, and then the next small decision is made in the same
manner. A greedy algorithm does not revisit past decisions, and so
ignores much of the coupling that may occur between design variables.
As an example consider the weighted directed graph shown in Fig. 8.9. The
objective is to traverse from node 1 to node 12 with the smallest possible
cost (cost denoted by the numbers above path segments). Note that a series
of discrete choices must be made at each step, and those decisions limit the
available options in the next step. This graph might represent a transportation
problem for shipping goods, information flow through a social network, or a
supply chain problem.
A greedy algorithm simply makes the best choice assuming each decision
is the only decision that will be made. Starting at node 1, we first choose to
move to node 3 because that is the smallest cost between the three options
(node 2 cost 2, node 3 cost 1, node 4 cost 5). We then choose to move to node 6
because that is the smallest cost between the next two available options (node
6 cost 4, node 7 cost 6) and so on. The path selected by the greedy algorithm
is highlighted in the figure and results in a total cost of 15. The algorithm is
easy to apply and scalable, but will not generally find the global optimum. The
global optimum in this case, also highlighted in the figure, results in a total
Figure 8.9: The greedy algorithm in this weighted directed graph results in a cost of 15, compared to the global optimum with a cost of 10.
cost of 10. To find that global optimum we have to consider the impact of our
choices on future decisions. A method to do this will be discussed in the next
section.
• Traveling salesman (Ex. 8.1): Always select the nearest city as the next
step.
• Propeller problem (Ex. 8.2 but with more discrete variables): optimize
the number of blades with all remaining discrete variables fixed, then
optimize the material selection with all remaining discrete variables
fixed, . . ..
• Grocery shopping (Ex. 11.1)‡: There are many possibilities for formulating a greedy solution. For example: always pick the cheapest food item next, always pick the most nutritious food item next, or always pick the food item with the most nutrition per unit cost.
‡ This is a form of the knapsack problem, which is a classic problem in discrete optimization.
𝑓0 = 0
𝑓1 = 1 (8.11)
𝑓𝑛 = 𝑓𝑛−1 + 𝑓𝑛−2
Notice that we do not need a full history; we can compute the next number in the sequence just by knowing the last two states.§ A naive implementation computes the sequence recursively:
§ We can also convert this to a standard first-order form.
procedure fib(𝑛)
    if 𝑛 ≤ 1 then
        return 𝑛
    else
        return fib(𝑛 − 1) + fib(𝑛 − 2)
    end if
end procedure
procedure fib2(𝑛)
    𝑓0 = 0
    𝑓1 = 1
    for 𝑖 = 2 to 𝑛 do
        𝑓𝑖 = 𝑓𝑖−1 + 𝑓𝑖−2
    end for
    return 𝑓𝑛
end procedure
We can also express this in terms of our transition function to show the
dependence on the current decision:
Let us solve the graph problem posed in Ex. 8.7 using dynamic programming.
For convenience, we will repeat a smaller version of the figure in Fig. 8.12. We
will use the tabulation (bottom-up) approach. To do this we construct a table
where we keep track of the cost to move from this node to the end (node 12),
and which node we should move to next.
Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost
Next

We start from the end. The last node is simple. There is no cost to move from node 12 to the end (we are already there) and there is no next node.

Figure 8.12: Small version of Fig. 8.9 for convenience.

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                                                 0
Next                                                 -

We now move back one level to consider nodes 9, 10, and 11. These nodes all lead to node 12 and so are straightforward. We will be a little more careful with the formulas as we get to the more complicated cases next.

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                                   3   6    2    0
Next                                   12  12   12   -
We now move back one level to nodes 5, 6, 7, and 8. For node 5, the cost follows from Bellman's equation: we minimize, over the nodes 𝑗 reachable from node 5, the edge cost plus the already-computed cost of node 𝑗, that is, cost(5) = min𝑗 [𝑐5𝑗 + cost(𝑗)].
Note that we have already computed the minimum value for cost(9), cost(10),
and cost(11) and so just look up these values in the table. In this case, the
minimum total value is 3 and is associated with moving to node 11. Similarly, the cost for node 6 is 8, achieved by moving next to node 9. The updated table is:

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                   3   8           3   6    2    0
Next                   11  9           12  12   12   -
We repeat this process, moving back and reusing optimal solutions to find
the global optimum. The completed table looks like the following:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost 10 8 12 9 3 8 7 4 3 6 2 0
Next 2 5 6 8 11 9 11 11 12 12 12 -
From the table, we see that the minimum cost is 10 and that it is achieved by first moving to node 2. Under node 2, we see that we next go to node 5, then 11, and finally 12. Thus, the tabulation gives us both the global minimum cost and the design decisions that achieve it.
In its present form, the problem has a linear objective and linear
constraints, so branch and bound is a good fit. However, it can also
be formulated as a Markov chain, so we can therefore use dynamic
programming. The dynamic programming version allows us to ac-
commodate variations such as stochasticity and other constraints more
easily. To see that this can be posed as a Markov chain, we define the
state as the remaining capacity of the knapsack 𝑘 and the number of
items we have already considered. In other words, we are interested in
𝑣(𝑘, 𝑖) where 𝑣 is the value function (optimal value given the inputs),
𝑘 is the remaining capacity in the knapsack and 𝑖 indicates that we
have already considered items 1 through 𝑖 (this doesn’t mean we have
added them all to our knapsack, but that we have considered them).
We iterate through a series of decisions 𝑥 𝑖 deciding whether to take
item 𝑖 or not, which transitions us to a new state where 𝑖 increases and
𝑘 may decrease depending on whether or not we took the item.
The real problem we are interested in is 𝑣(𝐾, 𝑛), which we will solve
using tabulation. Starting at the bottom, we know that 𝑣(𝑘, 0) = 0 for
any 𝑘. In words, this just means that no matter what the capacity is,
if we haven’t considered any items yet then the value is 0. To work
forward, let’s consider a general case considering item 𝑖, with the
assumption that we have already solved up to item 𝑖 − 1 for any capacity.
If item 𝑖 cannot fit in our knapsack (𝑤 𝑖 > 𝑘) then we cannot take the
item. Alternatively, if the weight is less than the capacity we need to
make a choice: select item 𝑖 or do not. If we do not, then the value is
unchanged: 𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1). If we do select item 𝑖 then our value is
𝑐 𝑖 plus the best we could do with the previous items but with a capacity
that was smaller by 𝑤 𝑖 : 𝑣(𝑘, 𝑖) = 𝑣 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1). Whichever of
these decisions yields a better value is what we should choose. This
process is summarized in Alg. 8.12.
Inputs:
𝑐 𝑖 : Cost of item 𝑖
𝑤 𝑖 : Weight of item 𝑖
𝐾: Total available capacity
Outputs:
𝑣(0 : 𝐾, 0 : 𝑛): 𝑣(𝑘, 𝑖) is the optimal cost for capacity 𝑘 considering items 1 through 𝑖 , note that
indexing starts at 0
for 𝑘 = 0 to 𝐾 do
𝑣(𝑘, 0) = 0 No items considered, value is zero for any capacity
end for
Note that we will end up filling all entries in the matrix 𝑣[𝑘, 𝑖], in
order to extract the last value 𝑣[𝐾, 𝑛]. For small numbers, filling this
matrix (or table) is often illustrated manually, hence the name tabulation.
Like the Fibonacci example, using dynamic programming instead of a fully recursive solution reduces the complexity from 𝒪(2ⁿ) to 𝒪(𝐾𝑛), which means it is pseudolinear. It is only pseudolinear because there is a dependence on the knapsack size. For small capacities the problem scales well even with many items, but as the capacity grows the problem scales much less efficiently. Note that the knapsack problem requires integer weights. Real numbers can be scaled up to integers (e.g., 1.2, 2.4 becomes 12, 24). Arbitrary-precision floats are not feasible given the number of combinations to search across.
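A minimal sketch of the tabulation described above (assuming integer weights; the names are illustrative):

```python
def knapsack_value(weights, costs, K):
    """Fill the value table v[k][i]: best cost with capacity k using items 1..i."""
    n = len(weights)
    v = [[0] * (n + 1) for _ in range(K + 1)]
    for i in range(1, n + 1):
        w, c = weights[i - 1], costs[i - 1]
        for k in range(K + 1):
            if w > k:
                v[k][i] = v[k][i - 1]                  # item cannot fit
            else:
                v[k][i] = max(v[k][i - 1],             # skip item i
                              c + v[k - w][i - 1])     # take item i
    return v

# Example from the text: the optimal cost is 12 for K = 10.
v = knapsack_value([4, 5, 2, 6, 1], [4, 3, 3, 7, 2], K=10)
print(v[10][5])  # 12
```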
Let’s consider five items with the following weights and costs:
$$ w_i = [4, 5, 2, 6, 1] , \quad c_i = [4, 3, 3, 7, 2] \tag{8.20} $$
The capacity of our knapsack is 𝐾 = 10. Using Alg. 8.12 we find that the optimal
cost is 12. The value matrix looks as follows:
$$ v = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 2 \\ 0 & 0 & 0 & 3 & 3 & 3 \\ 0 & 0 & 0 & 3 & 3 & 5 \\ 0 & 4 & 4 & 4 & 4 & 5 \\ 0 & 4 & 4 & 4 & 4 & 6 \\ 0 & 4 & 4 & 7 & 7 & 7 \\ 0 & 4 & 4 & 7 & 7 & 9 \\ 0 & 4 & 4 & 7 & 10 & 10 \\ 0 & 4 & 7 & 7 & 10 & 12 \\ 0 & 4 & 7 & 7 & 11 & 12 \end{bmatrix} \tag{8.21} $$
To determine which items produce this cost we need to add a bit more logic.
To focus on the main principles this was left out of the previous algorithm, but
for completeness is discussed in this example. To keep track of the selected
items we need to define a selection matrix 𝑆 of the same size as 𝑣 (note that this
matrix is indexed starting at zero in both dimensions). Every time we accept an
item 𝑖 in Alg. 8.12 we note that in the matrix as 𝑆 𝑘,𝑖 = 1. We would replace this
line:
𝑣(𝑘, 𝑖) = max(𝑣(𝑘, 𝑖 − 1), 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1))
with
if 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1) > 𝑣(𝑘, 𝑖 − 1) then
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1)
𝑆(𝑘, 𝑖) = 1
else
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1)
end if
Then, at the end of the algorithm, we can determine which entries were selected using the following logic:
Input: 𝑆 Selection matrix (𝐾 + 1 × 𝑛 + 1) matrix from (8.12)
𝑘=𝐾
𝑋 ∗ = {} Initialize solution 𝑋 ∗ as an empty set
for 𝑖 = 𝑛 to 1 by −1 do
if 𝑆 𝑘,𝑖 = 1 then
add 𝑖 to 𝑋 ∗ Item 𝑖 was selected
𝑘 −= 𝑤 𝑖
end if
end for
return 𝑋 ∗
For this example the selection matrix 𝑆 looks as follows:
0 0 0 0 0 0
0 1
0 0 0 0
0 0
0 0 1 0
0 1
0 0 1 0
0 1
1 0 0 0
𝑆 = 0 1 0 0 0 1 (8.22)
0 1 0 1 0 0
0 1 0 1 0 1
0 0
1 0 1 1
0 1
1 1 0 1
0 1
1 1 0 1
Following the above algorithm we find that we selected items 3, 4, 5 for a total
cost of 12, as expected, and a total weight of 9.
is accepted. If the energy level increases, the new state might still be
accepted with probability

$$ \exp\left( \frac{ -\left( f(x_\text{new}) - f(x) \right) }{ T } \right) , \tag{8.25} $$
where Boltzmann’s constant is removed because it is just an arbitrary
scale factor in the optimization context. Otherwise, the state remains
unchanged. Constraints can be handled naturally in this algorithm, without resorting to penalties, by rejecting any infeasible step.
We must supply the optimizer with a function that provides a
random neighboring design from the set of possible design configura-
tions. A neighboring design is usually related to the current design, as
opposed to picking a pure random design from the entire set. In defin-
ing the neighborhood structure, one might wish to define transition
probabilities so that all neighbors are not equally likely. This type of
structure is common in Markov chain problems.
Finally, we need to determine the annealing schedule (or cooling
schedule), a process for decreasing the temperature throughout the
optimization. A common approach is exponential decrease:
𝑇 = 𝑇0 𝛼 𝑘 (8.26)
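The acceptance rule (Eq. 8.25) and an exponential cooling schedule (Eq. 8.26) can be sketched as follows (assuming a user-supplied neighbor function; 𝛼 = 0.95 is an illustrative choice):

```python
import math
import random

def simulated_annealing(f, x0, neighbor, T0=10.0, alpha=0.95, n_iter=1000):
    """Minimize f starting from x0 using simulated annealing.

    neighbor(x) must return a random neighboring design of x.
    """
    x, fx = x0, f(x0)
    x_best, f_best = x, fx
    for k in range(n_iter):
        T = T0 * alpha**k                  # cooling schedule (Eq. 8.26)
        x_new = neighbor(x)
        f_new = f(x_new)
        # Always accept improvements; otherwise accept with the probability
        # of Eq. 8.25.
        if f_new < fx or random.random() < math.exp(-(f_new - fx) / T):
            x, fx = x_new, f_new
            if fx < f_best:
                x_best, f_best = x, fx
    return x_best, f_best
```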
Inputs:
𝑥 (0) : Starting point
𝑇 (0) : Initial temperature
Outputs:
𝑥 ∗ : Optimal point
factor of 0.95. The final design is shown in the bottom of Fig. 8.13 with a path
length of 5.61. The final path might not be the global optimum (remember these
finite time methods are only approximations of the full combinatorial search),
but the methodology is effective and fast for this problem in finding at least a
near-optimal solution. The iteration history is shown in Fig. 8.14.
Figure 8.14: Iteration history (distance versus iteration).
continuous ones.
Problems
8.2 Converting to binary variables. You have one integer design variable
𝑥 ∈ [1, 𝑛]. Let’s say, for example, that this variable represents one
of 𝑛 materials that we would like to select from. Convert this
to an equivalent binary problem so that it is more efficient for
branch and bound. To accomplish this you will need to create
additional design variables and additional constraints.
8.3 Branch and bound. Solve the following problem using a manual
branch and bound approach (i.e., show each LP subproblem) as
is done in Ex. 8.5.
A B C D Limit
Chlorine 0.74 -0.05 1.0 -0.15 97
Sodium hydroxide 0.39 0.4 0.91 0.44 99
Sulfuric acid 0.86 0.89 0.09 0.83 52
Labor (person-hours) 5 7 7 6 1000
𝑤 𝑖 = [2, 5, 3, 4, 6, 1]
(8.29)
𝑐 𝑖 = [5, 3, 1, 5, 7, 2]
a) a greedy algorithm where you take the item with the best
cost/weight ratio (that fits within the remaining capacity) at
each iteration.
b) dynamic programming
8.7 Binary genetic algorithm. Solve the same problem as above (travel-
ing salesman) with a binary genetic algorithm.
becomes
$$ \text{minimize} \quad f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_{n_f}(x) \end{bmatrix} , \quad \text{where } n_f \ge 2 \tag{9.2} $$
The constraints are unchanged, unless some of them have been refor-
mulated as objectives. This multiobjective formulation might require
tradeoffs when trying to minimize all functions simultaneously because
at some point, further reduction in one objective can only be achieved
by increasing one or more of the other objectives.
One exception occurs if the objectives are independent because they
depend on different sets of design variables. Then, the objectives are
said to be separable and they can be minimized independently. If there
are constraints, these need to be separable as well. However, separable
objectives and constraints are rare because in real engineering systems
all functions tend to be linked in some way.
Given that multiobjective optimization requires tradeoffs, we need
a new definition of optimality. In the next section, we explain how
there are an infinite number of points that are optimal, forming a
surface in the space of objective functions. After defining optimality
for multiple objectives, we present several possible methods for solving
multiobjective optimization problems.
figure, we can see that a small sacrifice in maximum power production can be
exchanged for greatly reduced noise. However, if even larger noise reductions
are sought then large power reductions will be required. Conversely, if the
left side of the figure had a flatter slope, we would know that small reductions
frequently used. The idea is to combine all of the objectives into one
objective using a weighted sum, which can be written as:
$$ \bar{f}(x) = \sum_{i=1}^{N} w_i f_i(x) , \tag{9.3} $$
method can only return points on the convex portion of the Pareto front (see Fig. 9.5). Using the Pareto front shown in Fig. 9.4, Fig. 9.5 highlights the convex portions of the Pareto front. Those are the only portions of the Pareto front that can be found using a weighted sum method.

Figure 9.5: The convex portions of this Pareto front are highlighted.
minimize 𝑓𝑖
by varying 𝑥
subject to 𝑓𝑗 ≤ 𝜖 𝑗 for all 𝑗 ≠ 𝑖, (9.7)
𝑔(𝑥) ≤ 0
ℎ(𝑥) = 0
from those points, solve optimization problems that search along lines
normal to this plane.
This procedure is shown in Fig. 9.7 for a two objective case. In this
case, the plane that passes through the anchor points is a line. We now
space points along this plane by choosing a vector of weights that we
will call 𝑏, which is illustrated in the left-hand figure. The weights are constrained such that 𝑏𝑖 ∈ [0, 1] and Σ𝑖 𝑏𝑖 = 1. If we set 𝑏𝑖 = 1 and all other entries to zero, the point 𝑃𝑏 + 𝑓* returns one of the anchor points, 𝑓(𝑥𝑖*). For two objectives, we would define 𝑏 as 𝑏 = [𝑤, 1 − 𝑤]ᵀ and vary 𝑤 in equal steps between 0 and 1.
Figure 9.7: A notional example of the normal boundary intersection method. A plane is created passing through the single-objective optima, and solutions are sought normal to that plane to allow for a more evenly spaced Pareto front.
$$ f^* = \begin{bmatrix} f_1(x_1^*) \\ f_2(x_2^*) \\ \vdots \\ f_n(x_n^*) \end{bmatrix} , \tag{9.8} $$
𝑛˜ = −𝑃1 (9.11)
maximize 𝛼
by varying 𝑥, 𝛼
subject to 𝑃𝑏 + 𝑓 ∗ + 𝛼 𝑛ˆ = 𝑓 (𝑥) (9.12)
𝑔(𝑥) ≤ 0
ℎ(𝑥) = 0
This means that we are finding the point furthest away from the
anchor point plane, starting from a given value for 𝑏, while satisfying
the original problem constraints. The process is then repeated for
additional values of 𝑏 to sweep out the Pareto front.
In contrast to the previously mentioned methods, this method yields
a more uniformly spaced Pareto front, which is desirable for computa-
tional efficiency, albeit at the cost of a more complex methodology.
Figure 9.8: Search directions are normal to the line connecting anchor points.
First, we optimize the objectives one at a time, which in our example results
in the two anchor points shown in Fig. 9.8: 𝑓 (𝑥1∗ ) = (2, 3) and 𝑓 (𝑥2∗ ) = (5, 1).
The utopia point is then:
$$ f^* = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \tag{9.13} $$
Our quasi-normal vector is given by −𝑃1 (note that the true normal is
[−2, −3]):
$$ \tilde{n} = \begin{bmatrix} -3 \\ -2 \end{bmatrix} \tag{9.15} $$
We now have all the parameters we need to solve Eq. 9.12.
version. The primary difference is in determining the fitness and the selection procedure. Here, we provide an overview of one popular approach, the NSGA-II algorithm.†
103. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II, 2002.
118. Deb, Introduction to Evolutionary Multiobjective Optimization, 2008.
Inputs:
𝑝: a population sorted by the first objective
Outputs:
𝑓 : the pareto set for the population
procedure front(𝑝)
if length(𝑝) = 1 then if there is only one point, it is the front
return f
end if
split population into two halves 𝑝 𝑡 and 𝑝 𝐵
⊲ because input was sorted, 𝑝 𝑡 will be superior to 𝑝 𝐵 in the first objective.
𝑡 = front(𝑝 𝑡 ) recursive call to find front for top half
𝑏 = front(𝑝 𝐵 ) recursive call to find front for bottom half
initialize 𝑓 with the members from 𝑡 merged population
for 𝑖 = 1 to length(𝑏) do
dominated = false track whether anything in 𝑡 dominates 𝑏 𝑖
for 𝑗 = 1 to length(𝑡) do
if 𝑡 𝑗 dominates 𝑏 𝑖 then
dominated = true
break no need to continue search through 𝑡
end if
end for
if not dominated then 𝑏 𝑖 was not dominated by anything in T
add 𝑏 𝑖 to 𝑓
end if
end for
return 𝑓
end procedure
In NSGA-II, we are interested in not just the Pareto set, but rather
in ranking all members by their dominance depth, which is also called
nondominated sorting. In this approach, all points in the population
that are nondominated (i.e., the Pareto set) are given a rank of 1. Those
points are then removed from the set and the next set of nondominated
points is given a rank of 2, and so on (see Fig. 9.10). Note that there
are alternative procedures that can perform a nondominated sorting
directly, which can sometimes be more efficient, though we don’t
highlight them here. This algorithm is summarized in Alg. 9.6.
Algorithm 9.6: Perform nondominated sorting

Inputs:
𝑝: a population
Outputs:
rank: the rank for each member in the population
The new population is filled by placing all rank 1 points in the new
population, then all rank 2 points, and so on. At some point, an entire
group of constant rank will not fit within the new population. Points
with the same rank are all equivalent as far as Pareto optimality is
concerned, so an additional sorting mechanism is needed to determine
which members of this group to include.
The way that we perform selection within a group that can only
necessarily touches its neighbors as the two closest neighbors can differ
for each objective. The sum of the dimensions of this hypercube is the
crowding distance. When summing the dimensions, each dimension is
normalized by the maximum range of that objective value. For example,
considering only 𝑓1 for the moment, if the objectives were in ascending
order, then the contribution of point 𝑖 to the crowding distance would
be:
$$ d_{1,i} = \frac{ f_{1_{i+1}} - f_{1_{i-1}} }{ f_{1_N} - f_{1_1} } \tag{9.16} $$
Sometimes, instead of using the first and last points in the current
objective set, user-supplied values are used for the min and max values
of 𝑓 that appear in that denominator. The anchor points (the single
objective optima) are assigned a crowding distance of infinity because we want to prioritize their inclusion. The algorithm for crowding distance
is shown in Alg. 9.7.
Inputs:
𝑝: a population
Outputs:
𝑑: crowding distances
We can now put together the pieces in the overall algorithm. The
crossover and mutation operations remain the same. Tournament
selection (Fig. 7.17) is modified slightly to use the ranking and crowding
metrics of this algorithm. In the tournament, a member with a lower
rank is superior. If two members have the same rank, then the one
with the larger crowding distance is selected. This procedure is called
crowded tournament selection. After reproduction/mutation, instead of
replacing the parent generation with the offspring generation, both the
Inputs:
𝑥: Variable upper bounds
x: Variable lower bounds
𝑓 (𝑥): function
Outputs:
𝑥 ∗ : Optimal point
We see that the current nondominated set consists of points D and J, and that there is not enough room in the new population for all of the points in the next group. The crowding distances for that group are:

Point                A      C     I      L
Crowding distance    1.67   ∞     1.5    ∞

We would then add these in order: C, L, A, I, but we only have room for one, so we add C and complete this iteration with a new population of [D, J, E, H, K, C].
9.4 Summary
Problems
• (20, 4)
• (18, 5)
• (34, 2)
• (19, 6)
Identify the Pareto front using the weighted sum method with
11 evenly spaced weights: 0, 0.1, 0.2, . . . , 1. If some parts of the
front are underresolved, discuss how you might select weights
for additional points.
9.5 Repeat Prob. 9.3 with the normal boundary intersection method
using the following 11 evenly spaced points: 𝑏 = [0, 1], [0.1, 0.9], [0.2, 0.8], . . . , [1, 0]
procedures infill is omitted, the surrogate is fully constructed upfront and not subsequently updated. Many of the concepts discussed in this chapter are of broader usefulness in optimization beyond just SBO.
120. Forrester et al., Engineering Design via Surrogate Modelling: A Practical Guide, 2008. This reference provides an introduction to this topic with much more depth than can be provided in this chapter.
10.2 Sampling
Sampling methods select the evaluation points for constructing the initial surrogate. These evaluation points must be chosen carefully.

Example 10.1: Full grid sampling is not scalable.

Figure 10.3: Contrasting sampling strategies that both fulfill the uniform projection requirement: (a) a strategy whose projection uniformly spans each dimension but does not fill the space well; (b) a strategy whose projection uniformly spans each dimension and fills the space more effectively.
𝑓ˆ = 𝑤 𝑇 𝜓 (10.1)
where Ψ is the matrix

$$ \Psi = \begin{bmatrix} \text{---}\; \psi(x^{(1)})^T \;\text{---} \\ \text{---}\; \psi(x^{(2)})^T \;\text{---} \\ \vdots \\ \text{---}\; \psi(x^{(n)})^T \;\text{---} \end{bmatrix} \tag{10.6} $$
Thus, the same minimization problem can be expressed as:
minimize ||Ψ𝑤 − 𝑓 || 2 (10.7)
where
$$ f = \begin{bmatrix} f^{(1)} \\ f^{(2)} \\ \vdots \\ f^{(n)} \end{bmatrix} \tag{10.8} $$
The matrix Ψ is of size (𝑚 × 𝑛) where 𝑚 > 𝑛. This means that there
should be more equations than unknowns, or that we have sampled
more points than the number of polynomial coefficients we need to
estimate. This should make sense because our polynomial function
is only an assumed form, and generally not an exact fit to the actual
underlying function. Thus, we need more data to create a reasonable
fit.
This is exactly the same problem as 𝑦 = 𝐴𝑥 where 𝐴 ∈ ℛ 𝑚×𝑛 . There
are more equations than unknowns so generally there is not a solution
(the problem is called overdetermined). Instead, we seek the solution
that minimizes the error ||𝐴𝑥 − 𝑦|| 2 .
Tip 10.3: Least squares is not the same as a linear system solution.
In Julia or MATLAB you can solve this with x = A\b, but keep in mind that for 𝐴 ∈ ℝ^(𝑚×𝑛) this syntax performs a least squares solution, not a linear system solution as it would for a full-rank 𝑛 × 𝑛 system. This overloading is generally not used in other languages; for example, in Python, rather than using numpy.linalg.solve you would use numpy.linalg.lstsq.
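For instance, a polynomial least squares fit in Python might look like the following sketch (the data arrays here are placeholders, not the data used in the examples):

```python
import numpy as np

# x, f: sampled inputs and function values (placeholders for real data)
x = np.linspace(-3, 2, 20)
f = np.random.default_rng(0).normal(size=x.size)   # stand-in data

order = 4
Psi = np.vander(x, order + 1)                  # polynomial basis, one row per sample
w, *_ = np.linalg.lstsq(Psi, f, rcond=None)    # least squares solution of Psi w = f

f_hat = Psi @ w                                # surrogate predictions at the samples
```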
We will create the data at 20 points in the interval [−3, 2] (see Fig. 10.5).
Figure 10.5: Data from a numerical model or experiments.
For a real problem we do not know the underlying function and the
dimensionality is often too high to visualize. Determining the right basis functions to use can be difficult. If we are using a polynomial basis, we might
try to determine the order by trying each case (e.g., quadratic, cubic, quartic,
etc.) and measuring the error in our fit (Fig. 10.6).
It seems as if the higher the order of the polynomial, the lower the
error. For example, a 20th order polynomial reduces the error to almost zero.
The problem is that while the error may be low on this set of data, we expect
the predictive capability of such a model for future data points to be poor. For
example, Fig. 10.7 shows a 19th order fit to the data. The model passes right
through the points, but its predictive capability is poor.
1. Randomly split your data into a training set and a validation set (e.g., a 70/30 split).

2. Train each candidate model using the training set and measure its error on the validation set.

3. Choose the model with the lowest error on the validation set, and optionally retrain that model using all of the data.
An alternative option may be useful if you have few data points and
so can’t afford to leave much out for validation. This method is called
k-fold cross validation and while it is more computationally intensive, it
makes better use of all of your data:
1. Randomly divide your data into 𝑛 sets (e.g., 10).

2. Train each candidate model using the data from all sets except one (e.g., 9 of the 10 sets), and use the remaining set for validation. Repeat for all 𝑛 possible validation sets and average your performance.

3. Choose the model with the lowest average error across all 𝑛 validation sets. A sketch of this procedure follows.
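A minimal sketch of k-fold cross validation for selecting a polynomial order (assuming NumPy and data arrays x and f; the fitting and error choices are illustrative):

```python
import numpy as np

def kfold_error(x, f, order, n_folds=10, rng=None):
    """Average validation error of a polynomial fit of a given order."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        coeffs = np.polyfit(x[train], f[train], order)   # fit on training folds
        pred = np.polyval(coeffs, x[val])                # predict on held-out fold
        errors.append(np.mean((pred - f[val]) ** 2))
    return np.mean(errors)

# Pick the order with the lowest average validation error (e.g., orders 1-10).
# best_order = min(range(1, 11), key=lambda m: kfold_error(x, f, m))
```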
This example continues from Ex. 10.4. First, we perform k-fold cross-
validation using ten divisions. The average error across the divisions using the
training data is shown in Fig. 10.8.
The error becomes extremely large as the polynomial order becomes large.
Zooming in on the flat region we see a range of options with similar errors.
Amongst similar solutions, one generally prefers the simplest model. In this
case, a fourth-order polynomial seems reasonable. A fourth-order polynomial
is compared against the data in Fig. 10.9. This model has much better predictive
capability.
Figure 10.9: A 4th-order polynomial fit to the data.
$$ \psi^{(i)} = \exp\left( -\sum_{j} \theta_j \left| x_j - x_j^{(i)} \right|^{p_j} \right) \tag{10.11} $$
A Kriging basis is a generalization of a Gaussian basis (which would
have 𝜃 = 1/𝜎2 and 𝑝 = 2). These types of models are useful because in
addition to creating a model they also predict the uncertainty in the
model through the surrogate. An example of this is shown in Fig. 10.10.
Notice how the uncertainty goes to zero at the known data points, and
becomes largest when far from known data points.
Figure 10.10: A Kriging fit to the input data (circles) and a shaded confidence interval.
The surrogate modeling toolbox‡ is a useful package for surrogate modeling ‡ https://smt.readthedocs.io/
10.4 Infill
Consider the one-dimensional function with data points and fit shown in Fig. 10.11.§ The best point we have found so far is denoted in the figure as (𝑥*, 𝑓*). For a Gaussian process model, the fit also provides an uncertainty
§ This data is based on an example from Rajnarayan et al.121
Figure 10.11: A one-dimensional function with a Gaussian process model surrogate fit and uncertainty.
Now imagine we want to evaluate this function at some new test point
𝑥test = 0.5. In Fig. 10.12 the probability distribution for the objective at 𝑥test is
shown in red (imagine that the probability distribution was coming out of the
page). The shaded blue region is the probability of improvement over the best
point. Expected improvement is similar to the probability of improvement, but rather than returning a probability, it returns the expected magnitude of improvement. That magnitude may be more helpful than a probability in defining a stopping criterion.
Figure 10.12: At a given test point (𝑥test = 0.5) we highlight the probability distribution and the expected improvement in the shaded blue region.
Now, let us evaluate the expected improvement not just at 𝑥 test = 0.5 but
across the domain. The result is shown by the red function in Fig. 10.13. The
spike on the right tells us that we expect improvement by sampling close to our
best known point, but the expected improvement is rather small. The spike on
the left tells us that there is a promising region where the surrogate suggests a
relatively high potential improvement. Notice that the metric does not simply
capture regions with high uncertainty, but rather regions with high uncertainty
in areas that are likely to lead to improvement. For our next sample, we would
choose the location with the highest expected improvement, recreate the fit
and repeat.
possible before. The phrase neural is used because these models are
inspired by neurons in a brain.
The first and last layers are the inputs and outputs of our surrogate
model. Each neuron in the hidden layer represents a function. This
means that the output from a neuron is a number, and thus the output
from a whole layer can be represented as a vector 𝑥. We call 𝑥 (𝑘) the
(𝑘)
vector of values for layer 𝑘, and 𝑥 𝑖 is the value for the 𝑖th neuron
in layer 𝑘. Let us consider just one neuron in layer 𝑘. This neuron is
connected to many neurons from the previous layer 𝑘 −1 (see first part of
Fig. 10.15). We need to choose a functional form for this neuron taking
in the values from the previous layer as inputs. A linear function is too
simple. Chaining together linear functions will only result in a linear
composite function, so the function for this neuron must be nonlinear.
The most common choice for hidden layers is a linear function passed
through a second activation function that creates the nonlinearity. Let
us first focus on the linear portion, which produces an intermediate
variable we call 𝑧 (see Fig. 10.15):

$$ z = \sum_{j=1}^{n} w_j x_j^{(k-1)} + b \tag{10.15} $$
or in vector form:
𝑧 = 𝑤 𝑇 𝑥 (𝑘−1) + 𝑏 (10.16)
Notice that the first term is just a weighted sum of the values from the
neurons in the previous layer. The 𝑤 vector contains the weights. The
𝑏 term is called the bias, which provides an offset allowing us to scale
Figure 10.15: A neuron in layer 𝑘 takes the values from the previous layer, forms the weighted sum 𝑧 = Σ𝑗 𝑤𝑗 𝑥𝑗^(𝑘−1) + 𝑏, and passes it through an activation function to produce 𝑥^(𝑘) = 𝑎(𝑧).
𝑎(𝑧). Historically, a sigmoid function (top of Fig. 10.16) was almost always used,

$$ a(z) = \frac{1}{1 + e^{-z}} , $$

where large negative values of 𝑧 produce results close to zero and large positive values produce results close to one. Most modern neural nets now use a rectified linear unit (ReLU) as the activation function (bottom of Fig. 10.16):

$$ a(z) = \max(0, z) \tag{10.18} $$

The ReLU has been found to be far more effective in producing accurate neural nets. Notice that this activation function completely eliminates negative inputs. Thus, we see that the bias term can be thought of as a threshold establishing what constitutes a significant value. This last step is summarized in the final two columns of Fig. 10.15.

Figure 10.16: Activation functions.
To compute across all the neurons in this layer, the weights 𝑤 for this
one neuron would form one row in a matrix of weights 𝑊.
$$ \begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_i^{(k)} \\ \vdots \\ x_{n_k}^{(k)} \end{bmatrix} = a\left( \begin{bmatrix} & \vdots & \\ W_{i1} & \cdots \; W_{ij} \; \cdots & W_{i,n_{k-1}} \\ & \vdots & \end{bmatrix} \begin{bmatrix} x_1^{(k-1)} \\ \vdots \\ x_j^{(k-1)} \\ \vdots \\ x_{n_{k-1}}^{(k-1)} \end{bmatrix} + \begin{bmatrix} b_1 \\ \vdots \\ b_i \\ \vdots \\ b_{n_k} \end{bmatrix} \right) \tag{10.20} $$
or
𝑥 (𝑘) = 𝑎(𝑊 𝑥 (𝑘−1) + 𝑏) (10.21)
The activation function is applied separately for each row. The below
equation is more explicit (where 𝑤 𝑖 is the 𝑖 th row of 𝑊), though we
generally use the above equation as shorthand.
$$ x_i^{(k)} = a\left( w_i^T x^{(k-1)} + b_i \right) \tag{10.22} $$
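As a sketch of the layer computation in Eq. 10.21 (assuming NumPy; the weights and biases would normally come from training, so the values here are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # ReLU activation, Eq. 10.18

def layer(x_prev, W, b):
    """Compute one hidden layer: x_k = a(W x_{k-1} + b), Eq. 10.21."""
    return relu(W @ x_prev + b)

# Example with illustrative sizes: 3 inputs -> 5 hidden neurons.
rng = np.random.default_rng(0)
x0 = rng.normal(size=3)
x1 = layer(x0, W=rng.normal(size=(5, 3)), b=np.zeros(5))
```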
$$ \max_{\theta} \; \prod_{i=1}^{n} p\!\left( y^{(i)} \,\middle|\, x^{(i)}; \theta \right) \tag{10.24} $$
We now take the log of the objective, which does not change the solution,
but changes the products to a better numerically behaved summation.
We also add a negative sign up front so that the problem is one of
minimization:
$$ \min_{\theta} \; \sum_{i=1}^{n} -\log\!\left( p\!\left( y^{(i)} \,\middle|\, x^{(i)}; \theta \right) \right) \tag{10.25} $$
$$ \min_{\theta} \; \sum_{i=1}^{n} \left( f(x^{(i)}) - y^{(i)} \right)^2 \tag{10.26} $$
where 𝑥 (𝑖) is the 𝑖 th sample from the training set and 𝑓ˆ is any function
that operates on one training sample. As seen in this section, the
objectives commonly used for many machine learning applications
fit this form (e.g., negative log likelihood, or a squared error). The
difficulty with these problems is that we often have large training sets,
sometimes with 𝑛 in the billions. That means that computing the
objective can be time consuming, but computing the gradient is even
more time consuming.
If we divide the objective by 𝑛 (which does not change the solution),
we can see that objective function is an expectation:
$$ f(x) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x^{(i)}) \tag{10.28} $$
Thus, we divide our training data into these minibatches and use a new
minibatch to estimate the gradients at each iteration in the optimization.
This approach works well for these specific problems because of the unique form of the objective. As an example, if there were one

Figure 10.17: A simplified example of how training data is randomly assigned to minibatches.
10.6 Summary
Problems
11.1 Introduction
because the linearization can be updated in the next time step. However,
this reduction in fidelity is problematic for design applications. In
design scenarios, the optimization is performed once, and the design
cannot continue to be updated after it is created. For this reason, convex optimization is less frequently used for design applications, with the
exception of some limited uses of geometric programming, a topic
discussed in more detail in Section 11.6.
This chapter is introductory in nature, focusing only on understand-
ing what convex optimization is useful for and describing some of the
most widely used forms.† The known categories of convex optimization problems and their relationships are shown in Fig. 11.2.
† Boyd et al.125 provides a good starting point.

Figure 11.2: Relationship between various convex optimization problems.

Example 11.1: Formulating a linear programming problem.

Suppose we are going shopping and want to figure out how to best meet our nutritional needs for the least amount of cost. We enumerate all the food options and use the variable 𝑥𝑗 to represent how much of food 𝑗 we will purchase. The parameter 𝑐𝑗 is the cost of a unit amount of food 𝑗. The
If we call the amount of each food 𝑥, the cost column 𝑐, and the nutrient
columns 𝑛1 , 𝑛2 , 𝑛3 then we can setup the following linear problem:
minimize 𝑐𝑇 𝑥
subject to 5 ≤ 𝑛1𝑇 𝑥 ≤ 8
7 ≤ 𝑛2𝑇 𝑥 (11.5)
1≤ 𝑛3𝑇 𝑥 ≤ 10
𝑥≤4
The last constraint was added to ensure we do not eat too much of any one
item and get tired of it. LP solvers are widely available. In fact, some solvers
suggesting that our optimal diet consists of items B, F, H, and I in the proportions
shown above. The solution hit the upper limit on nutrient 1 and the lower limit
on nutrient 2.
$$ \begin{aligned} \text{minimize} \quad & \tfrac{1}{2} x^T Q x + f^T x \\ \text{subject to} \quad & A x + b = 0 \\ & C x + d \le 0 \end{aligned} \tag{11.7} $$
where the vector 𝑏ˆ contains the estimated values (from a model, for
example), and 𝑏 contains the data points that we are trying to fit. If we
assume a linear model, then 𝑏ˆ = 𝐴𝑥, where 𝑥 are the model parameters
we want to optimize to fit the data. Here, “linear” means linear in the
coefficients, not in the data fit. For example, we could estimate the
coefficients 𝑐 𝑖 of a quadratic function that best fits some data:
𝑓 (𝜁) = 𝑐 1 𝜁 2 + 𝑐 2 𝜁 + 𝑐3 (11.9)
The left pane of Fig. 11.3 shows some example data that is both noisy and
biased relative to the true (but unknown) underlying curve represented as a
dashed line. Given the data points we would like to estimate the underlying
functional relationship. We assume that the relationship is cubic:
𝑦(𝑥) = 𝑎1 𝑥 3 + 𝑎2 𝑥 2 + 𝑎3 𝑥 + 𝑎4 (11.13)
Figure 11.3: Example data (left) and least squares fits without and with the additional constraints.

Suppose from some careful measurements, or additional data, we know
an upper bound on the function value at a few places. For this example we
assume that we know that 𝑓 (−2) ≤ −2, 𝑓 (0) ≤ 4, 𝑓 (2) ≤ 26. These requirements
can be posed as linear constraints:
$$ \begin{bmatrix} (-2)^3 & (-2)^2 & -2 & 1 \\ 0 & 0 & 0 & 1 \\ 2^3 & 2^2 & 2 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} \le \begin{bmatrix} -2 \\ 4 \\ 26 \end{bmatrix} \tag{11.14} $$
We add these linear constraints to our quadratic objective (minimizing the sum
of the squared error) and the resulting problem is still a QP. The resulting
solution is shown in the right pane of Fig. 11.3, which results in a much more
accurate fit.
where 𝑥 𝑡 is the deviation from a desired state at time 𝑡 (for example, the
positions and velocities of an aircraft), and 𝑢𝑡 are the control inputs that we
want to optimize (for example, control surface deflections). The above dynamic
equation can be used as a set of linear constraints in an optimization, but we
must decide on an objective.
One would like to have small 𝑥 𝑡 because that would mean reducing the error
in our desired state quickly, but we would also like to have small 𝑢𝑡 because
small control inputs require less energy. These are competing objectives, where
a small control input will take longer to minimize error in a state, and vice-versa.
$$ \text{minimize} \quad \frac{1}{2} \sum_{t=0}^{N} \left( x_t^T Q x_t + u_t^T R u_t \right) , \tag{11.16} $$
minimize 𝑓 𝑇𝑥
subject to ||𝐴 𝑖 𝑥 + 𝑏 𝑖 || 2 ≤ 𝑐 𝑇𝑖 𝑥 + 𝑑 𝑖 (11.17)
𝐺𝑥 + ℎ = 0
$$ \begin{aligned} \text{minimize} \quad & \tfrac{1}{2} x^T Q x + f^T x \\ \text{subject to} \quad & A x + b = 0 \\ & \tfrac{1}{2} x^T R_i x + c_i^T x + d_i \le 0 \quad \text{for } i = 1, \dots, m \end{aligned} \tag{11.18} $$
Both 𝑄 and 𝑅 must be positive semidefinite for the QCQP to be convex.
A QCQP reduces to a QP if 𝑅 = 0. We saw QCQPs when solving trust-region problems in Section 4.5, although for trust-region problems only an approximate solution method is typically used.
minimize 𝑦
subject to ||𝐹𝑥 + 𝑔|| 2 ≤ 𝑦
(11.19)
𝐴𝑥 + 𝑏 = 0
||𝐺 𝑖 𝑥 + ℎ 𝑖 || 2 ≤ 0
If we square both sides of the first and last constraint, we see that
this formulation is exactly equivalent to the QCQP where 𝑄 = 2𝐹𝑇 𝐹,
𝑓 = 2𝐹𝑇 𝑔, 𝑅 𝑖 = 2𝐺𝑇𝑖 𝐺 𝑖 , 𝑐 𝑖 = 2𝐺𝑇𝑖 ℎ 𝑖 and 𝑑 𝑖 = ℎ 𝑇𝑖 ℎ 𝑖 . The matrices 𝐹 and
𝐺 𝑖 are the square roots of the matrices 𝑄 and 𝑅 𝑖 respectively (divided
by two), and would be computed from a factorization.
• log-sum-exp: log(𝑒 𝑥1 + 𝑒 𝑥2 + . . . + 𝑒 𝑥 𝑛 )
CVX and its variants are free popular tools for disciplined convex program-
ming with interfaces for multiple programming languages.¶ ¶ https://stanford.edu/~boyd/software.
html
$$ f(x) = \sum_{j=1}^{N} c_j\, x_1^{a_{1j}} x_2^{a_{2j}} \cdots x_n^{a_{nj}} \tag{11.21} $$
$$ D = C_{D_p} q S + \frac{C_L^2}{\pi A\!R\, e}\, q S \tag{11.23} $$
minimize 𝑓0 (𝑥)
subject to 𝑓𝑖 (𝑥) ≤ 1 (11.24)
ℎ 𝑖 (𝑥) = 1
where all of the 𝑓𝑖 are posynomials and the ℎ 𝑖 are monomials. This
problem does not fit into any of the convex optimization problems
defined in the previous section, and it is not convex. The reason why
this formulation is useful is that we can convert it into an equivalent
convex optimization problem.
First, we take the logarithm of the objective and of both sides of the
constraints:
minimize log 𝑓0 (𝑥)
subject to log 𝑓𝑖 (𝑥) ≤ 0 (11.25)
log ℎ 𝑖 (𝑥) = 0.
Let us further examine the equality constraints. Recall that ℎ 𝑖 is a
monomial, so writing one of the constraints explicitly results in the
form:
$$ \log\left( c\, x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n} \right) = 0 \tag{11.26} $$
Using the properties of logarithms and the change of variables 𝑦𝑖 = log 𝑥𝑖 (so that 𝑥𝑖 = 𝑒^𝑦𝑖), this can be expanded to an equivalent expression:

$$ \begin{aligned} a_1 y_1 + a_2 y_2 + \ldots + a_n y_n + \log c &= 0 \\ a^T y + \log c &= 0 \end{aligned} \tag{11.28} $$
log( Σ_{j=1}^{N} c_j x_1^{a_{1j}} x_2^{a_{2j}} ··· x_n^{a_{nj}} )  (11.29)
log f_i = log( Σ_{j=1}^{N} c_j e^{y_1 a_{1j}} e^{y_2 a_{2j}} ··· e^{y_n a_{nj}} )
        = log( Σ_{j=1}^{N} c_j e^{y_1 a_{1j} + y_2 a_{2j} + ··· + y_n a_{nj}} )  (11.30)
        = log( Σ_{j=1}^{N} e^{a_j^T y + b_j} )   where b_j = log c_j.
This is a log-sum-exp of an affine function. As mentioned in the previous
section, log-sum-exp is convex, and a convex function composed with an
affine function is a convex function. Thus, the objective and inequality
constraints are convex in 𝑦. Because the equality constraints are affine,
we have a convex optimization problem obtained through a change of
variables.
Geometric programming has been successfully used for aircraft design applications using relationships like the simple ones shown in Ex. 11.5.¹³¹
131. Hoburg et al., Geometric Programming for Aircraft Design Optimization. 2014
Unfortunately, many other functions do not fit this form (e.g., design variables that can be positive or negative, terms with negative coefficients, trigonometric functions, logarithms, exponents). GP modelers use various techniques to extend usability, including using a Taylor series across a restricted domain, fitting functions to posynomials,¹³² and rearranging expressions into other equivalent forms, including implicit relationships. A good deal of creativity and some sacrifice in fidelity is usually needed to create a corresponding GP from a general nonlinear programming problem. Still, if the sacrifice in fidelity is not too great, there is a big upside: the formulation comes with all the benefits of convexity (guaranteed convergence, global optimality, efficiency, no parameter tuning, and limited scaling issues).
132. Hoburg et al., Data fitting with geometric-programming-compatible softmax functions. 2016
One extension to geometric programming is signomial program-
ming. A signomial program has the same form except that the coeffi-
cients 𝑐 𝑖 can be positive or negative (the design variables 𝑥 𝑖 must still
be strictly positive). Unfortunately, this problem cannot be transformed
to a convex one, so it can no longer guarantee a global optimum. Still, a
signomial program can usually be solved using a sequence of geometric
programs, so it is much more efficient than solving the general nonlinear
problem. Signomial programs have been used to extend the range
11.7 Summary
Problems
11.3 The following foods are available to you at your nearest grocer.
Minimize the amount you spend while making sure you get at
least 5 units of Nutrient 1, between 8 and 20 units of nutrient 2,
and between 5 and 30 units of nutrient 3. Also be sure not to buy
more than 4 units of any one food item, just for variety. Determine
the optimal amount of each item to purchase and the total cost.
D = C_D (1/2) ρ V^2 S  (11.31)
where the drag coefficient is a sum of parasitic drag, lift-dependent drag, and drag of the rest of the aircraft:
C_D = k C_f (S_wet / S) + C_L^2 / (π AR e) + (1/S)(C_D S)_other  (11.32)
The skin friction coefficient is a function of Reynolds number:
C_f = 0.074 / Re^{0.2}  (11.33)
where the Reynolds number is:
Re = ρ V √(S/AR) / μ  (11.34)
W = C_L (1/2) ρ V^2 S  (11.35)
where the weight is a sum of the wing weight and the fixed weight
of the rest of the aircraft:
𝑊 = 𝑊𝑤 + 𝑊other (11.36)
12.1 Introduction
A familiar example of robust design occurs when playing the board game
Monopoly. On a given turn, if you knew for certain where an opponent was
going to land next, it would make sense to put all of your funds into developing
that one property to its fullest extent. However, because their next position is
uncertain, a better strategy might be to develop multiple nearby properties (each
to a lesser extent because of a fixed monetary resource). This is an example of
robust design: the expected return is less sensitive to input variability. However,
because you develop multiple properties, a given property will have a lower
return than if you had only developed one property. This is a fundamental
tradeoff in robust design. An improvement in robustness generally is not
free; instead, it requires a tradeoff in peak performance. This is known as a
risk-reward tradeoff.
A familiar example of reliable design is when planning a trip to the airport.
Experience suggests that it is not a good idea to use average times to plan your
arrival down to the minute. Instead, if you want a high probability of making
your flight, you plan for variability in traffic and security lines and add a buffer
to your departure time. This is an example of a reliable design: it is less prone
to failure under variability. Reliability is also not free, and generally requires a
tradeoff in the objective (in this example, optimal use of time perhaps).
other random variables such as bar length, bar diameter, and material
Young’s modulus.
One measurement does not tell us anything about how variable the
axial strength is, but if we perform the test many times we can learn a
lot about its distribution. From this information we can infer various
statistical quantities like the mean value of the axial strength. The mean
of some variable 𝑥 that is measured 𝑁 times is estimated as:
μ_x = (1/N) Σ_{i=1}^{N} x_i  (12.1)
Note that this is actually a sample mean, which would differ from
the population mean (the true mean if you could measure every bar).
With enough samples the sample mean will approach the population
mean. In this brief introduction we won’t distinguish between sample
and population statistics.
Another important quantity is the variance, or standard deviation. This is a measure of spread, or how far away our samples are from the mean. The unbiased† estimate of the variance is
σ_x^2 = 1/(N − 1) Σ_{i=1}^{N} (x_i − μ_x)^2  (12.2)
† Unbiased means that the expected value of the sample variance is the same as the true population variance. If 𝑁 were used in the denominator, rather than 𝑁 − 1, the two quantities would differ by a constant factor.
and the standard deviation is just the square root of the variance. A
small variance implies that measurements are clustered tightly around
the mean, whereas a large variance means that measurements are
spread out far from the mean. The variance can also be written in the
mathematically equivalent, but more computationally friendly format:
σ_x^2 = 1/(N − 1) ( Σ_{i=1}^{N} x_i^2 − N μ_x^2 )  (12.3)
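A quick numerical check of these estimators is sketched below; the synthetic "measurements" are made up, and the loop-free NumPy calls are only shown to confirm they match Eqs. 12.1 to 12.3.

```python
# Sample mean and unbiased variance (Eqs. 12.1-12.3) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=500)    # hypothetical strength samples

mu = np.sum(x) / len(x)                                     # Eq. 12.1
var = np.sum((x - mu)**2) / (len(x) - 1)                    # Eq. 12.2
var_alt = (np.sum(x**2) - len(x) * mu**2) / (len(x) - 1)    # Eq. 12.3

print(mu, var, var_alt)                  # var and var_alt agree
print(np.mean(x), np.var(x, ddof=1))     # NumPy equivalents
```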
The total integral of the PDF must be one, since it contains all possible outcomes (100%):
∫_{−∞}^{∞} p(x) dx = 1  (12.5)
From the PDF we can also measure various statistics, like the mean:
μ_x = E[x] = ∫_{−∞}^{∞} x p(x) dx  (12.6)
[Figure 12.1: PDF (left) and CDF (right) of the axial strength distribution, 𝑝(𝜎).]
The capital 𝐹 denotes the CDF and the lowercase 𝑓 the PDF. As
an example, the CDF for the axial strength distribution is shown in
Fig. 12.1b. The CDF always approaches 1 as 𝑥 → ∞.
We often fit a named distribution to the PDF of empirical data. One
of the most popular distributions is the Gaussian or Normal distribution.
Its PDF is
p(x; μ, σ^2) = 1/(σ √(2π)) exp( −(x − μ)^2 / (2σ^2) )  (12.10)
[Figure 12.2: Two normal distributions, one with 𝜇 = 1, 𝜎 = 0.5 and one with 𝜇 = 3, 𝜎 = 1.0.]
For a Gaussian distribution the mean and variance are clearly visible
in the function, but keep in mind these quantities are defined for
any distribution. Figure 12.2 shows two normal distributions with
different means and standard deviations to illustrate the effect of those
parameters. A few other popular distributions, including the uniform, Weibull, lognormal, and exponential distributions, are shown in Fig. 12.3. These give only a flavor of the different named distributions; many others exist.
[Figure 12.3: A few other example probability distributions: (a) uniform, (b) Weibull, (c) lognormal, (d) exponential.]
Consider a simple airfoil optimization. Figure 12.4 shows the drag coefficient
of an RAE 2822 airfoil, as a function of Mach number, evaluated by an inviscid
compressible flow solver.
[Figure 12.4: Inviscid drag coefficient, in counts, of the RAE 2822 airfoil as a function of Mach number.]
This is a typical drag rise curve, where increasing Mach number leads to
stronger shock waves and an associated increase in wave drag. Now let’s try to
change the shape of the airfoil to allow us to fly a little bit faster without large
increases in drag. We could perform an optimization to minimize the drag of
this airfoil at Mach 0.71. The resulting drag curve of this optimized airfoil is
shown in Fig. 12.5 in comparison to the baseline RAE 2822 airfoil.
[Figure 12.5: The red curve shows the drag of the airfoil optimized to minimize drag at 𝑀 = 0.71, corresponding to the dot. The drag is low at the requested point, but off-design performance is poor.]
Note that the drag is low at Mach 0.71 (as requested!), but any deviation from the target Mach number causes significant drag penalties. In other words, the design is not robust.
One way to improve the design is to use what is called a multi-point optimization. We minimize a weighted sum of the drag coefficient evaluated at
three different Mach numbers (𝑀 = 0.68, 0.71, 0.725). The resulting drag curve
is shown in gray in Fig. 12.6.
[Figure 12.6: The gray curve shows the drag of the airfoil optimized to minimize the average drag at the three denoted points. The robustness of the design is greatly improved.]
[Figure 12.7: Left: probability density function of wind direction. Right: the same PDF visualized as a wind rose.]
mean values for uncertain parameters, often with the assumption that the variability is Gaussian or at least symmetric. In this case, the wind direction is periodic and very asymmetric, so instead we optimize using the most probable wind direction (261°). The second way is to treat this as an OUU problem. Instead of maximizing the power for one direction, we maximize the expected value of the power across all directions. This is straightforward to compute from the definition of expected value because this is a one-dimensional function. Section 12.5 explains other ways to perform forward propagation.
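Because the uncertain input is one-dimensional, the expected value can be computed by direct numerical integration of the definition. The sketch below does exactly that; both the direction PDF and the farm power curve are made up for illustration, since the data behind this example are not reproduced in the text.

```python
# Expected power over a one-dimensional wind-direction distribution,
# computed directly from the definition E[P] = integral of P(theta) p(theta).
import numpy as np

theta = np.linspace(0.0, 360.0, 721)                     # direction grid [deg]

# Hypothetical bimodal direction PDF, normalized to integrate to 1
pdf = np.exp(-0.5 * ((theta - 261.0) / 30.0)**2) + \
      0.5 * np.exp(-0.5 * ((theta - 81.0) / 40.0)**2)
pdf /= np.trapz(pdf, theta)

def power(direction_deg):
    # Hypothetical farm power curve [MW] as a function of wind direction
    return 60.0 + 15.0 * np.cos(np.radians(2.0 * (direction_deg - 261.0)))

expected_power = np.trapz(power(theta) * pdf, theta)
print(expected_power)
```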
Figure 12.8 shows the power as a function of wind direction for both cases.
Note that the deterministic approach does indeed allow for higher power
production when the wind comes from the west (and 180 degrees from that),
but that power drops considerably for other directions. In contrast, the OUU
result is much less sensitive to changes in wind direction. The expected value
of power is 58.6 MW for the deterministic case, and 66.1 MW for the OUU case,
which represents over a 12% improvement.‡
‡ The wind energy community does not use expected power directly, but rather annual energy production, which is just the expected power times the utilization.
[Figure 12.8: Wind farm power as a function of wind direction for two cases: optimized deterministically using the most probable direction, and optimized under uncertainty (OUU).]
We can also see the tradeoff in the optimal layouts. The left side of Fig. 12.9
shows the optimal layout using the deterministic formulation, with the wind
coming from the predominant direction (the direction we optimized for).
The wakes are shown in blue and the boundaries with a dashed line. The
wind turbines have spaced themselves out so that there is very little wake
interference. However, when the wind changes direction the performance
degrades significantly. The right side of Fig. 12.9 shows the same layout, but
when the wind is in the second-most probable direction. In this direction many
of the turbines are operating in the wake of another turbine and produce much
less power.
In contrast, the robust design is shown in Fig. 12.10 for the predominant
wind direction on the left and the second-most probable direction on the right.
In both cases the wake effects are relatively minor, though the turbines are not quite as ideally placed for the predominant direction. The tradeoff in performance for that one direction allows the design to be more robust as the wind changes direction.
This example again highlights the classic risk-reward tradeoff. The maxi-
mum power achieved at the most probable wind speed is reduced, in exchange
for reduced power variability and thus higher energy production in the long
run.
Both Examples 12.2 and 12.3 used the expected value, or mean, as
the objective function. However, there are other useful forms of OUU
12.4 Reliability
Consider the Barnes function shown on the left side of Fig. 12.11. The three
red lines are the three nonlinear constraints of the problem and the red regions
highlight regions of infeasibility. With deterministic inputs, the optimal value
sits right on the constraint. An uncertainty ellipse shown around the optimal
point highlights the fact that the solution is not reliable. Any variability in the
inputs will create a significant probability for one or more of the constraints to
be violated (just like the real life problem where you are likely to be late if you
plan your arrival assuming zero variability).
Conversely, the right side of Fig. 12.11 shows a reliable optimum, with the
same uncertainty ellipse. We see that it is highly probable that the design
will satisfy all constraints under the input variation. However, as noted in
the introduction, increased reliability presents a performance trade-off with a
corresponding increase in the objective function.
𝜎(𝑥) ≤ 𝜎 𝑦 . (12.11)
where 𝑤_𝑖 are specific weights. The nodes where the function is evaluated, and the corresponding weights, are determined by the quadrature strategy (e.g., rectangle rule, trapezoidal rule, Newton–Cotes, Clenshaw–Curtis, Gaussian, Gauss–Kronrod).
The difficulty of numerical quadrature is extending to multiple
dimensions (also known as cubature), and, unfortunately, most of the
time there is more than one uncertain variable. The most obvious
extension for multidimensional quadrature is a full-grid tensor product.
This type of grid is created by discretizing the nodes in each dimension,
and then evaluating at every combination of nodes. Mathematically,
the quadrature formula can be written as
∫ f(x) dx_1 dx_2 ··· dx_n ≈ Σ_i Σ_j ··· Σ_n f(x_i, x_j, ..., x_n) w_i w_j ··· w_n  (12.18)
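The sketch below evaluates a full tensor-product rule in two dimensions; the integrand is a made-up smooth function on [−1, 1] × [−1, 1], and the one-dimensional nodes and weights come from a Gauss–Legendre rule.

```python
# Full tensor-product Gauss-Legendre quadrature in 2D (Eq. 12.18).
import numpy as np

n = 5
nodes, weights = np.polynomial.legendre.leggauss(n)   # 1-D nodes and weights

def f(x1, x2):
    return np.exp(x1) * np.cos(x2)                     # example integrand

integral = 0.0
for xi, wi in zip(nodes, weights):
    for xj, wj in zip(nodes, weights):
        integral += wi * wj * f(xi, xj)                # every node combination

exact = (np.exp(1) - np.exp(-1)) * 2.0 * np.sin(1.0)
print(integral, exact)
```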
[Figure 12.12: Comparison between a two-dimensional full tensor grid (left) and a level 5 sparse grid (right) using the Clenshaw-Curtis exponential rule.]
From the data it appears that we may need about 10^5 samples to confidently have well-converged statistics. Using that number of samples gives the following results: μ = 6.133, σ = 1.242, reliability = 99.187%. Note that because of the random sampling, these results will vary somewhat between simulations. The corresponding histogram of the objective function is seen in Fig. 12.14. The output distribution does not appear to be normally distributed.
μ_f = f(μ_x)
σ_f^2 = Σ_{i=1}^{n} ( (∂f/∂x_i) σ_{x_i} )^2  (12.20)
4. Repeat as necessary.
where 𝑝(𝑥) is a weight function, and in our case is the probability density
function. The angle bracket notation, known as an inner product, will
be used in the remainder of this section.
The intuition is similar to that of vectors. Adding a non-orthogonal
vector to a set of vectors does not increase the span of the vector space.
In other words, the new vector could have been formed by a linear
combination of existing vectors in the set and so does not add any
new information. Similarly, we want to make sure that any new basis
functions we add are orthogonal to the existing set, so that the range of
functions that can be approximated is increased.
You may be familiar with this concept from its use in Fourier series.
In fact, this method is a truncated generalized Fourier series. Recall
that a Fourier series represents an arbitrary periodic function with a
series of sinusoidal functions. The basis functions in the Fourier series
are orthogonal.
By definition we choose the first basis function to be 𝜓0 = 1. This
just means the first term in the series is a constant (polynomial of order
0). Because the basis functions are orthogonal we know that
⟨ψ_i, ψ_j⟩ = 0  if i ≠ j  (12.27)
These three steps are overviewed below, though we begin with the last
step because it provides insight for the first two.
Compute Statistics
Using Eq. 12.6, the mean of the function 𝑓 is given by
μ_f = ∫ f(x) p(x) dx  (12.28)
The coefficients 𝛼_𝑖 are constants and so can be taken out of the integral:
μ_f = Σ_i α_i ∫ ψ_i(x) p(x) dx
    = α_0 ∫ ψ_0(x) p(x) dx + α_1 ∫ ψ_1(x) p(x) dx + α_2 ∫ ψ_2(x) p(x) dx + ···
Because the polynomials are orthogonal, all of the terms except the first
are zero (see Eq. 12.27), and by definition of a PDF, we know that the
integral of 𝑝(𝑥) over the domain must be one (see Eq. 12.5). Thus, we
have the simple result that the mean of the function is simply given by
the zeroth coefficient:
𝜇 𝑓 = 𝛼0 (12.31)
Using a similar approach, we can derive a formula for the variance. By definition, the variance is (Eq. 12.8)
σ_f^2 = ∫ f(x)^2 p(x) dx − μ_f^2  (12.32)
Substituting the polynomial expansion and using orthogonality,
σ_f^2 = Σ_i α_i^2 ∫ ψ_i(x)^2 p(x) dx − α_0^2
      = α_0^2 ∫ ψ_0^2 p(x) dx + Σ_{i=1}^{n} α_i^2 ∫ ψ_i(x)^2 p(x) dx − α_0^2
      = α_0^2 + Σ_{i=1}^{n} α_i^2 ∫ ψ_i(x)^2 p(x) dx − α_0^2
      = Σ_{i=1}^{n} α_i^2 ∫ ψ_i(x)^2 p(x) dx
      = Σ_{i=1}^{n} α_i^2 ⟨ψ_i^2⟩
In summary, the mean and variance of the expansion are
μ_f = α_0  (12.33)
σ_f^2 = Σ_{i=1}^{n} α_i^2 ⟨ψ_i^2⟩  (12.34)
ψ_0 = 1
ψ_1 = x
ψ_2 = (1/2)(3x^2 − 1)  (12.35)
ψ_3 = (1/2)(5x^3 − 3x)
⋮
These polynomials are plotted in Fig. 12.15 and are orthogonal with respect to a uniform probability distribution.
[Figure 12.15: The first few Legendre polynomials, ψ_0 through ψ_3, on −1 ≤ x ≤ 1.]
Determine Coefficients
With the polynomial basis 𝜓 𝑖 fixed, we need to determine the appro-
priate coefficients 𝛼 𝑖 in Eq. 12.22. We discuss two ways to do this. The
first is with quadrature and is also known as non-intrusive spectral pro-
jection. The second is with regression and is also known as stochastic
collocation.
First, let's use the quadrature approach. Beginning with the polynomial approximation
f(x) = Σ_i α_i ψ_i(x)  (12.36)
or
α_i = (1/⟨ψ_i^2⟩) ∫ f(x) ψ_i(x) p(x) dx  (12.39)
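A minimal non-intrusive spectral projection is sketched below for a single uniform input on [−1, 1] (so 𝑝(𝑥) = 1/2) with a Legendre basis; the model output exp(𝑥) and the truncation order are made-up choices for illustration. The projections in Eq. 12.39 are evaluated with Gauss–Legendre quadrature, and the statistics follow Eqs. 12.33 and 12.34.

```python
# Non-intrusive spectral projection sketch: Legendre basis, uniform input.
import numpy as np
from numpy.polynomial.legendre import leggauss, Legendre

def f(x):
    return np.exp(x)                         # example model output

order = 4
nodes, weights = leggauss(order + 2)         # quadrature for the projections
psi = [Legendre.basis(i) for i in range(order + 1)]   # psi_0 = 1, psi_1 = x, ...

alpha = np.zeros(order + 1)
norms = np.zeros(order + 1)
for i in range(order + 1):
    # <psi_i^2> and the projection <f, psi_i>, both with p(x) = 1/2
    norms[i] = 0.5 * np.sum(weights * psi[i](nodes)**2)
    alpha[i] = 0.5 * np.sum(weights * f(nodes) * psi[i](nodes)) / norms[i]

mu_f = alpha[0]                              # Eq. 12.33
var_f = np.sum(alpha[1:]**2 * norms[1:])     # Eq. 12.34
print(mu_f, var_f)

# Check against the exact statistics of exp(X), X ~ Uniform(-1, 1)
mu_exact = 0.5 * (np.e - 1.0 / np.e)
var_exact = 0.5 * (np.e**2 - np.e**-2) / 2.0 - mu_exact**2
print(mu_exact, var_exact)
```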
Problems
[Problem figure: three wind turbines 𝑇_1, 𝑇_2, 𝑇_3 at positions 𝑦_2 and 𝑦_3 relative to the wind, with constraints 𝑦_𝑖 > 0 (bound) and 𝑦_3 > 𝑦_2 (linear).]
[Figure: system-level view of a three-analysis coupled system with design variables 𝑥_0, 𝑥_1, 𝑥_2, 𝑥_3, coupling variables 𝑦_1, 𝑦_2, 𝑦_3, and constraints 𝑐_0, 𝑐_1, 𝑐_2, 𝑐_3.]
[Figure: aerostructural coupling. The aerodynamic solver passes surface pressures to the structural solver, which returns surface displacements; drag and lift are obtained by integrating the surface pressures.]
13.3.1 Components
In Section 3.3, we explained how all models can ultimately be written
as a system of residuals, 𝑟(𝑥, 𝑢) = 0 . When the system is large or
includes sub-models, it might be natural to partition the system into
components. We prefer to use the more general term components instead
of disciplines to refer to the sub-models resulting from the partitioning
because the partitioning of the overall model is not necessarily by
discipline (e.g., aerodynamics, structures). A system model might also
be partitioned by physical system component (e.g., wing, fuselage, or
an aircraft in a fleet) or by different conditions applied to the same
model (e.g., aerodynamic analyses at different flight conditions).
The partitioning can also be performed within a given discipline for
the same reasons cited above. In theory, the system model equations in
𝑟(𝑥, 𝑢) = 0 can be partitioned in any way, but only some partitions are
advantageous or make sense. The partitioning can also be hierarchical,
where a given component has one or multiple levels of sub-components.
Again, this might be motivated by efficiency, modularity, or both.
Let us formulate a model for the aerostructural problem described in Ex. 13.1. A possible model for the aerodynamics is a lifting line model given by the linear system,
AΓ = γ,  (13.3)
where 𝐴 is the matrix of aerodynamic influence coefficients and 𝛾 is a vector, both of which depend on the wing shape. The state Γ is a vector that represents
the circulation (vortex strength) at each spanwise position on the wing. The
lift and drag scalars can be computed explicitly for a given Γ, so we will write
these dependencies as 𝐿 = 𝐿(Γ) and 𝐷 = 𝐷(Γ), omitting the detailed explicit
expressions for conciseness.
A possible model for the structures is a cantilevered beam modeled with
Euler–Bernoulli elements,
𝐾𝑑 = 𝑓 , (13.4)
where 𝐾 is the stiffness matrix, which depends on the beam shape and sizing.
The right-hand-side vector represents the applied forces at spanwise position
on the beam. The states 𝑑 are the displacements and rotations of each element.
The weight does not depend on the states and it is an explicit function of the
beam sizing and shape, so it does not involve the structural model (13.4). The
stresses are an explicit function of the displacements, so we can write 𝜎 = 𝜎(𝑑),
where 𝜎 is a vector whose size is the number of elements.
When we couple these two models, 𝐴 and 𝛾 depend on the wing displace-
ments 𝑑, and 𝑓 depends on the Γ. We can write all the implicit and explicit
equations as residuals:
𝑟1 = 𝐴(𝑑)Γ − 𝛾(𝑑) = 0
𝑟2 = 𝐿 − 𝐿(Γ) = 0
𝑟3 = 𝐷 − 𝐷(Γ) = 0 (13.5)
𝑟4 = 𝐾𝑑 − 𝑓 (Γ) = 0
𝑟5 = 𝜎 − 𝜎(𝑑) = 0.
We used Eq. (13.2) to transform the explicit equations into residuals for 𝑟2 , 𝑟3 ,
and 𝑟_5. The states of this system are
u = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{bmatrix} \equiv \begin{bmatrix} \Gamma \\ L \\ D \\ d \\ \sigma \end{bmatrix}.  (13.6)
Because 𝑢_2, 𝑢_3, and 𝑢_5 can be explicitly determined from 𝑢_1 and 𝑢_4, we could solve only 𝑟_1 and 𝑟_4 and then use the explicit expressions. However, it can be convenient to write all the equations as a single vector 𝑟(𝑢) = 0.
R_i = y_i − Y_i(y_{j≠i}) = 0,  (13.8)
where 𝑦 𝑖 are the guesses for the coupling variables and 𝑌𝑖 are the actual
computed values.
The residual representation of the coupled system is an alternative
system level representation, where a general component, including
[Figure 13.6: Two system-level views of a coupled system with three solvers: all components exposed and written as residuals 𝑅_1, ..., 𝑅_9 and states 𝑢_1, ..., 𝑢_9 (left), and a black-box representation where only the inputs and outputs 𝑦_1, 𝑦_2, 𝑦_3 of each solver are visible (right).]
The mathematical representation of these dependencies is given by a graph (Fig. 13.7b), where the graph nodes are the components and the edges represent the information dependency. This graph is a directed graph because in general there are three possibilities for a coupling: a single coupling one way or the other, or a two-way coupling. A directed graph is said to be cyclic when there are edges that form a closed loop, or cycles. In the example of Fig. 13.7b, there is a single cycle between
or cycles. In the example of Fig. 13.7b, there is a single cycle between
components B and C. When there are no closed loops, the graph is
acyclic. In this case, the whole system can be solved by solving each
component in turn, without having to iterate. A graph can also be
represented using an adjacency matrix (Fig. 13.7c), which has the same
structure as the transpose of the DSM.
The adjacency matrix for real-world systems is often a sparse matrix; that is, it has many zeros in its entries. This means that in the corresponding DSM, each component depends only on a subset of all the other components. We can take advantage of the structure of this sparsity in the solution of coupled systems.
Because the XDSM in Fig. 13.8 shows only the data dependencies, the only difference relative to the DSM is that the coupling variables are labeled explicitly, and the data paths are drawn. In the next section, we add process to the XDSM.
[Figure 13.8: XDSM showing data dependencies for the four-component coupled system of Fig. 13.7.]
[Figure 13.7: Different representations of the dependencies of a hypothetical system: (a) design structure matrix, (b) directed graph, (c) adjacency matrix.]
has converged. The process line is shown as the thin black line to
distinguish from the data dependency connections (thick gray lines),
and follows the sequence of numbered steps. The analyses for each
component are all numbered the same (step 1), because they can be
done in parallel. Each component returns the coupling variables it
computes to the MDA iterator, closing the loop between step 2 and
step 1 (denoted as 2 → 1).
[Figure: XDSM for a multidisciplinary analysis in which the analyses are evaluated in parallel (block Jacobi) and iterated by the MDA driver (loop 2 → 1).]
Inputs:
u^(0) = [u_1^(0), ..., u_n^(0)]: Guesses for the coupling variables
Outputs:
u = [u_1, ..., u_n]: System-level states

k = 1
while ‖u^(k) − u^(k−1)‖_2 > ε do
    for all i ∈ {1, ..., n} do   (can be done in parallel)
        u_i^(k) ← solve R_i(u_i^(k), u_{j≠i}^(k−1)) = 0
    end for
    k = k + 1
end while
[Figure: XDSM for a multidisciplinary analysis in which the analyses are evaluated sequentially (block Gauss–Seidel), each using the latest available coupling variables, iterated by the MDA driver (loop 4 → 1).]
Inputs:
u^(0) = [u_1^(0), ..., u_n^(0)]: Guesses for the coupling variables
Outputs:
u = [u_1, ..., u_n]: System-level states

k = 1
while ‖u^(k) − u^(k−1)‖_2 > ε do
    for i = 1, ..., n do
        u_i^(k) ← solve R_i(u_1^(k), ..., u_{i−1}^(k), u_i^(k), u_{i+1}^(k−1), ..., u_n^(k−1)) = 0
    end for
    k = k + 1
end while
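The sketch below runs both fixed-point MDAs on a made-up two-component coupled system written in the functional form 𝑦_𝑖 = 𝑌_𝑖(𝑦_{𝑗≠𝑖}); the coupling functions and starting guesses are hypothetical, and the only point is the difference between using the old and the freshly updated coupling variables.

```python
# Block Jacobi vs. block Gauss-Seidel MDA on a made-up two-component system:
#   y1 = Y1(y2) = 0.5*y2 + 1      (Analysis 1)
#   y2 = Y2(y1) = 0.25*y1 + 2     (Analysis 2)
import numpy as np

def Y1(y2): return 0.5 * y2 + 1.0
def Y2(y1): return 0.25 * y1 + 2.0

def mda(gauss_seidel, tol=1e-10, max_iter=100):
    y1, y2 = 0.0, 0.0                       # initial guesses
    for k in range(max_iter):
        y1_new = Y1(y2)
        # Gauss-Seidel uses the freshly updated y1; Jacobi uses the old one
        y2_new = Y2(y1_new if gauss_seidel else y1)
        if np.hypot(y1_new - y1, y2_new - y2) < tol:
            return (y1_new, y2_new), k + 1
        y1, y2 = y1_new, y2_new
    return (y1, y2), max_iter

print("Jacobi:      ", mda(gauss_seidel=False))
print("Gauss-Seidel:", mda(gauss_seidel=True))
```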
(∂R/∂u)|_{u=u^(k)} Δu = −R(u^(k)),  (13.9)
where we need the partial derivatives of all the residuals with respect
to the coupling variables to form the Jacobian matrix 𝜕𝑅/𝜕𝑢.
Expanding the concatenated residual and coupling variable vectors, we get
\begin{bmatrix} \partial R_1/\partial u_1 & \cdots & \partial R_1/\partial u_n \\ \vdots & \ddots & \vdots \\ \partial R_n/\partial u_1 & \cdots & \partial R_n/\partial u_n \end{bmatrix} \begin{bmatrix} \Delta u_1 \\ \vdots \\ \Delta u_n \end{bmatrix} = -\begin{bmatrix} R_1 \\ \vdots \\ R_n \end{bmatrix},  (13.10)
where the derivatives in the block Jacobian matrix and the right-hand side are evaluated at the current iteration, 𝑢^(𝑘). These derivatives can be computed using any of the methods from Chapter 6. Note that this Jacobian matrix has exactly the same structure as the DSM and is often a sparse matrix. The full procedure is listed in Alg. 13.5.
Inputs:
u^(0) = [u_1^(0), ..., u_n^(0)]: Guesses for the coupling variables
Outputs:
u = [u_1, ..., u_n]: System-level states

k = 1
while ‖R‖_2 > ε do
    for all i ∈ {1, ..., n} do   (can be done in parallel)
        Compute R_i
        Compute ∂R_i/∂u_j for j = 1, ..., n
    end for
    Δu ← solve the block Newton system (13.10)
    u^(k+1) = u^(k) + Δu
    k = k + 1
end while
\begin{bmatrix} \partial R_1/\partial u_1 & \cdots & \partial R_1/\partial u_n \\ \vdots & \ddots & \vdots \\ \partial R_n/\partial u_1 & \cdots & \partial R_n/\partial u_n \end{bmatrix} \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix} = \begin{bmatrix} \partial R_1/\partial x \\ \vdots \\ \partial R_n/\partial x \end{bmatrix},  (13.11)
where 𝜙_𝑖 are the derivatives of the states from component 𝑖 with respect to the design variables. Once we have solved for 𝜙, we can use the coupled equivalent of the total derivative equation (6.46) to compute the derivatives:
df/dx = ∂F/∂x − [∂F/∂u_1, ..., ∂F/∂u_n] \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix}  (13.12)
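The following sketch applies Eqs. 13.11 and 13.12 to a made-up two-component coupled system with one design variable and checks the result against a finite difference. The residuals, the function of interest, and all numbers are hypothetical.

```python
# Coupled direct (forward) derivative sketch, following Eqs. 13.11-13.12.
import numpy as np

def solve_states(x):
    # Linear coupled system written in residual form:
    #   R1 = u1 - (x    + 0.5 *u2) = 0
    #   R2 = u2 - (x**2 + 0.25*u1) = 0
    A = np.array([[1.0, -0.5],
                  [-0.25, 1.0]])
    b = np.array([x, x**2])
    return np.linalg.solve(A, b)

def F(x, u):
    return u[0]**2 + u[1]                  # function of interest

def total_derivative(x):
    u = solve_states(x)
    dRdu = np.array([[1.0, -0.5],
                     [-0.25, 1.0]])        # block Jacobian of the residuals
    dRdx = np.array([-1.0, -2.0 * x])      # partials of the residuals w.r.t. x
    dFdx = 0.0
    dFdu = np.array([2.0 * u[0], 1.0])
    phi = np.linalg.solve(dRdu, dRdx)      # Eq. 13.11
    return dFdx - dFdu @ phi               # Eq. 13.12

x0, h = 1.3, 1e-6
fd = (F(x0 + h, solve_states(x0 + h)) - F(x0, solve_states(x0))) / h
print(total_derivative(x0), fd)            # should agree closely
```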
The coupled adjoint equations can be written as
\begin{bmatrix} (\partial R_1/\partial u_1)^T & \cdots & (\partial R_1/\partial u_n)^T \\ \vdots & \ddots & \vdots \\ (\partial R_n/\partial u_1)^T & \cdots & (\partial R_n/\partial u_n)^T \end{bmatrix} \begin{bmatrix} \psi_1 \\ \vdots \\ \psi_n \end{bmatrix} = \begin{bmatrix} (\partial F/\partial u_1)^T \\ \vdots \\ (\partial F/\partial u_n)^T \end{bmatrix}.  (13.13)
After solving for the coupled-adjoint vector using the equation above, we can use the total derivative equation to compute the desired derivatives,
df/dx = ∂F/∂x − [ψ_1^T, ..., ψ_n^T] \begin{bmatrix} \partial R_1/\partial x \\ \vdots \\ \partial R_n/\partial x \end{bmatrix}.  (13.14)
There is an alternative form for the coupled direct and adjoint
methods that was not useful for single models. The coupled direct and
adjoint methods derived above use the residual form of the governing
equations and are a natural extension of the corresponding methods
applied to single models. In this form, the residuals for all the equations
and the corresponding states are exposed at the system level. As
previously mentioned in Section 13.3.2, there is an alternative system-
level representation—the functional representation—that views each
model as a function relating its inputs and outputs 𝑦 = 𝑌(𝑦), where the
coupling variables 𝑦 represent the system-level states. We can derive
direct and adjoint methods from the functional representation.
The functional versions of these methods can be derived by defining the residuals as 𝑅(𝑦) ≜ 𝑦 − 𝑌(𝑦) = 0, where the states are now the coupling variables. The linear system for the direct method (13.11) then yields
\begin{bmatrix} I & -\partial Y_1/\partial y_2 & \cdots & -\partial Y_1/\partial y_n \\ -\partial Y_2/\partial y_1 & I & & \vdots \\ \vdots & & \ddots & \vdots \\ -\partial Y_n/\partial y_1 & -\partial Y_n/\partial y_2 & \cdots & I \end{bmatrix} \begin{bmatrix} \bar\phi_1 \\ \vdots \\ \bar\phi_n \end{bmatrix} = \begin{bmatrix} \partial Y_1/\partial x \\ \vdots \\ \partial Y_n/\partial x \end{bmatrix},  (13.15)

df/dx = ∂F/∂x − [∂F/∂y_1, ..., ∂F/∂y_n] \begin{bmatrix} \bar\phi_1 \\ \vdots \\ \bar\phi_n \end{bmatrix}.  (13.16)
The functional form of the coupled adjoint equations can be similarly derived, yielding
\begin{bmatrix} I & -(\partial Y_1/\partial y_2)^T & \cdots & -(\partial Y_1/\partial y_n)^T \\ \vdots & \ddots & & \vdots \\ -(\partial Y_n/\partial y_1)^T & -(\partial Y_n/\partial y_2)^T & \cdots & I \end{bmatrix} \begin{bmatrix} \bar\psi_1 \\ \vdots \\ \bar\psi_n \end{bmatrix} = \begin{bmatrix} \partial Y_1/\partial x \\ \vdots \\ \partial Y_n/\partial x \end{bmatrix},  (13.17)
After solving for the coupled-adjoint vector using the equation above, we can use the total derivative equation to compute the desired derivatives,
df/dx = ∂F/∂x − [\bar\psi_1^T, ..., \bar\psi_n^T] \begin{bmatrix} \partial F/\partial y_1 \\ \vdots \\ \partial F/\partial y_n \end{bmatrix}.  (13.18)
Finally, the unification of the methods for computing derivatives
introduced in Section 6.9 also applies to coupled systems and can be
used to derive the coupled direct and adjoint methods presented above.
Furthermore, the UDE (6.74) can also handle residual or functional
components, as long as they are ultimately expressed as residuals.
Even if exact derivatives can only be supplied for a subset of the models and
the rest are obtained by finite difference, the system derivatives will usually be
more accurate than applying finite difference at the system level.
[Figure: XDSM for the MDF architecture. The optimizer (loop 0, 7 → 1) drives an MDA (loop 0, 4 → 1) over Analyses 1 through 3, and the objective and constraint functions are evaluated from the converged coupling variables.]
for 𝑦
where unlike MDF, we do not need to make the coupling variables con-
sistent using an MDA. Instead, each component is solved independently
for each optimization iteration. Then 𝑓 and 𝑐0 are computed using
the current design variables 𝑥 and the latest available set of coupling
variables 𝑦. The XDSM for IDF is shown in Fig. 13.12.
[Figure 13.12: XDSM for the IDF architecture. The optimizer provides the design variables and coupling variable targets 𝑦ᵗ to Analyses 1 through 3, which can run in parallel, and the objective, constraints, and consistency constraints are evaluated at the optimizer level.]
[Figure 13.13: The SAND architecture lets the optimizer solve for all variables (design, coupling, and state variables), and component solvers are no longer needed.]
Because we are solving for all variables simultaneously, the SAND architecture has the potential to be the most efficient way to get to the optimal solution. In practice, however, it is unlikely to be advantageous when efficient component solvers are available.
The resulting optimization problem is the largest of all MDO archi-
tectures and requires an optimizer that scales well with the number
of variables. Therefore, a gradient-based optimization algorithm is
likely required, in which case, the derivative computation must also
be considered. Fortunately, SAND does not require derivatives of the
coupled system or even total derivatives that account for the component
solution; only partial derivatives of residuals are needed.
SAND is an intrusive approach because it requires access to the
residuals. These might not be available if components are provided as
black boxes. Rather than computing the coupling variables 𝑦_𝑖 and state variables 𝑢_𝑖 by converging the residuals to zero, each component 𝑖 just computes the current residuals ℛ_𝑖 for the current values of the coupling variables 𝑦 and the component states 𝑢_𝑖.

13.5.4 Modular Analysis and Unified Derivatives

The modular analysis and unified derivatives (MAUD) architecture is essentially MDF with built-in solvers and derivative computation that use the residual representation introduced in Section 13.3.2.‖
‖ The MAUD architecture was developed by Hwang et al.,³⁹ who realized that the UDE provided the mathematical basis for a new MDO framework that makes sophisticated parallel solvers and coupled derivative computations available through a small set of user-defined functions.
39. Hwang et al., A computational architecture for coupling heterogeneous numerical models and computing coupled derivatives. 2018
There are two main ideas in MAUD: 1) represent the coupled system as a
single nonlinear system and 2) linearize the coupled system using the
UDE (6.74) and solve it for the coupled derivatives.
To represent the coupled system as a single nonlinear system,
we view the MDA as a series of residuals and variables, 𝑅 𝑖 (𝑢) = 0,
corresponding to each component 𝑖 = 1, . . . , 𝑛, as previously written
in Eq. 13.1. Unlike the previous architectures, there is no distinction
between the coupling variables and state variables; they are all just
states, 𝑢. As previously shown in Fig. 13.5, the coupling variables can
be considered to be the states by defining explicit components that
translate the inputs and outputs.
In addition, both the design variables and functions of interest
(objective and constraints) are also concatenated in the state variable
vector. Denoting the original states for the coupled system (13.1) as 𝑢̄, the new state is
u \triangleq \begin{bmatrix} x \\ \bar u \\ f \end{bmatrix}.  (13.24)
We also need to augment the residuals to have a solvable system. The residuals corresponding to the design variables and output functions are formulated using the residual for explicit functions introduced in Eq. 13.2. The complete set of residuals is then
R(u) \triangleq \begin{bmatrix} x - x_0 \\ R_{\bar u}(x, \bar u) \\ f - F(x, \bar u) \end{bmatrix},  (13.25)
where 𝑥 0 are fixed inputs, and 𝐹(𝑥, 𝑢)
¯ is the actual computed value
of the function. Formulating fixed inputs and explicit functions as
residuals in this way might seem unnecessarily complicated, but it
facilitates the formulation of the MAUD architecture, just like it did for
the formulation of the UDE.
The two main ideas in MAUD mentioned above are directly as-
sociated with two main tasks: 1) the solution of the coupled system
and 2) the computation of the coupled derivatives. The formulation
of the concatenated states (13.24) and residuals (13.25) simplifies the
implementation of the algorithms that perform the above tasks. To
perform these tasks, MAUD assembles and solves four types of systems:
R(u) = 0
(∂R/∂u) Δu = −r
(∂R/∂u) (du/dr) = ℐ
(∂R/∂u)^T (du/dr)^T = ℐ
[Figure 13.15: Diagram (XDSM) for the CO architecture.]
For each system-level iteration, the disciplinary subproblems do not include the original objective function. Instead, the objective of each subproblem is to minimize the inconsistency function. For each discipline 𝑖, the subproblem is
minimize   J_i( x̂_{0i}, x_i, y_i(x̂_{0i}, x_i, y^t_{j≠i}) )
by varying x̂_{0i}, x_i  (13.27)
subject to c_i( x̂_{0i}, x_i, y_i(x̂_{0i}, x_i, y^t_{j≠i}) ) ≤ 0.
Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Corresponding objective value
𝑐 ∗ : Corresponding constraint values
minimize   f_0(x, y^t) + Σ_{i=1}^{N} Φ_i( x̂_{0i} − x_0, y_i^t − y_i(x_0, x_i, y^t) ) + Φ_0( c_0(x, y^t) )
by varying x_0, y^t  (13.28)

by varying x̂_{0i}, x_i
subject to c_i( x̂_{0i}, x_i, y_i(x̂_{0i}, x_i, y^t_{j≠i}) ) ≤ 0.  (13.29)
Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Corresponding objective value
𝑐 ∗ : Corresponding constraint values
[Figure 13.16: Diagram (XDSM) for the ATC architecture.]
3: Compute discipline objective and constraint functions and penalty function values
4: Update discipline design variables
until 4 → 2: Discipline optimization has converged
end for
5: Initiate system optimizer
repeat
6: Compute system objective, constraints, and all penalty functions
7: Update system design variables and coupling targets.
until 7 → 6: System optimization has converged
8: Update penalty weights
until 8 → 1: Penalty weights are large enough
Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
[Figure 13.17: XDSM for the BLISS architecture.]
Figure 13.17 shows the XDSM for BLISS and the corresponding steps
are listed in Alg. 13.9. Since BLISS uses an MDA, it is a distributed MDF
architecture. Due to the linear nature of the optimization problems,
repeated interrogation of the objective and constraint functions is not
necessary once we have the gradients. If the underlying problem is
highly nonlinear, the algorithm may converge slowly. The variable
bounds may help the convergence if these bounds are properly chosen,
such as through a trust region framework.
[Figure: XDSM for an architecture in which a system-level optimizer handles 𝑥_0, 𝑥_1, 𝑥_2 with an MDA of Analyses 1 and 2, while a separate discipline-level optimization handles 𝑥_3 and Analysis 3.]
minimizes the objective with respect to the global variables and coupling
variables while enforcing consistency constraints.
When using QSD, the objective and constraint functions are assumed
to be dependent only on the shared design variables and coupling
variables. Each discipline is assigned a “budget” for a local objective and
the discipline problems maximize the margin in their local constraints
and the budgeted objective. The system-level subproblem minimizes
the objective and budgets of each discipline while enforcing the global
constraints and a positive margin for each discipline.
IPD and EPD are applicable to MDO problems with no global objectives or constraints. They are similar to ATC in that copies of the shared variables are used for every discipline subproblem, and the consistency constraints are relaxed with a penalty function. Unlike ATC, however, the simpler structure of the discipline subproblems is exploited to compute post-optimality derivatives to guide the system-level optimization subproblem.
Like CO, ECO uses copies of the global variables. The discipline
subproblems minimize quadratic approximations of the objective while
enforcing local constraints and linear models of the nonlocal constraints.
The system-level subproblem minimizes the total violation of all con-
sistency constraints with respect to the global variables.
13.7 Summary
[Figure 13.19: Classification of MDO architectures: monolithic (MDF/MAUD, IDF, SAND) and distributed, with the distributed architectures split into distributed MDF (BLISS, CSSO, MDOIS, ASO) and distributed IDF, which is further divided into multilevel (CO, QSD) and penalty (IPD/EPD, ECO) approaches.]
multidisciplinary feasibility in some other way. Some do this by formulating an appropriate multilevel optimization (such as CO), and others use penalties to ensure this (such as ATC).‡‡
There are a number of commercial MDO frameworks available, including Isight/SEE¹⁴⁷ by Dassault Systèmes, ModelCenter/CenterLink by Phoenix Integration, modeFRONTIER by Esteco, AML Suite by TechnoSoft, Optimus by Noesis Solutions, and VisualDOC¹⁴⁸ by Vanderplaats Research and Development. These frameworks focus on making it easy for users to couple multiple disciplines and to use the optimization algorithms through graphical user interfaces. They also provide convenient wrappers to popular commercial engineering tools.
‡‡ Martins and Lambe³⁶ describe all these MDO architectures in detail.
36. Martins and Lambe, Multidisciplinary Design Optimization: A Survey of Architectures. 2013
147. Golovidov et al., Flexible implementation of approximation concepts in an MDO framework. 1998
148. Balabanov et al., VisualDOC: A Software System for General Purpose Integration and Design Optimization. 2002
While this focus has made it convenient for users to implement and
solve MDO problems, the numerical methods used to converge the
multidisciplinary analysis (MDA) and the optimization problem are
usually not as sophisticated as the methods presented in this book. For
example, these frameworks often use fixed-point iteration to converge
the MDA. When derivatives are needed for a gradient-based optimizer,
finite-difference approximations are used rather than more accurate
analytic derivatives.
Problems
(d/dx) f(g(x)) = (df/dg)(dg/dx)  (A.1)
Let 𝑓 (𝑔(𝑥)) = sin(𝑥 2 ). In this case, 𝑓 (𝑔) = sin(𝑔), and 𝑔(𝑥) = 𝑥 2 . The
derivative with respect to 𝑥 is:
(d/dx) f(g(x)) = (d/dg)(sin g) · (d/dx)(x^2) = cos(x^2)(2x)  (A.2)
(d/dx) f(g(x), h(x)) = (∂f/∂g)(dg/dx) + (∂f/∂h)(dh/dx)  (A.3)
df/dx = ∂f/∂x + (∂f/∂y)(dy/dx)
      = 2x + 2y cos(x)  (A.6)
      = 2x + 2 sin(x) cos(x)
Notice that the partial derivative and the total derivative are quite different. For this simple case we could also find the total derivative by substituting 𝑦(𝑥) = sin(𝑥) directly into the original expression for 𝑓 and then taking an ordinary one-dimensional derivative.
Expanding on our single-variable example, let 𝑔(𝑥) = cos(𝑥) and ℎ(𝑥) = sin(𝑥), and 𝑓(𝑔, ℎ) = 𝑔²ℎ³. Then 𝑓(𝑔(𝑥), ℎ(𝑥)) = cos²(𝑥) sin³(𝑥). Applying Eq. A.3 we have:
(d/dx)( f(g(x), h(x)) ) = (∂f/∂g)(dg/dx) + (∂f/∂h)(dh/dx)
    = 2g h^3 (dg/dx) + 3g^2 h^2 (dh/dx)  (A.9)
    = −2g h^3 sin(x) + 3g^2 h^2 cos(x)
    = −2 cos(x) sin^4(x) + 3 cos^3(x) sin^2(x)
The most familiar norm for vectors is the 2-norm, which corresponds
to the Euclidean length of the vector:
‖x‖_∞ = max_i |x_i|  (A.13)
[Figure A.1: Norms for the two-dimensional case.]
Consider a matrix 𝐴 ∈ ℝ^{m×n}* and a matrix 𝐵 ∈ ℝ^{n×p}. The two matrices can be multiplied together (𝐶 = 𝐴𝐵) as follows:
C_{ij} = Σ_{k=1}^{n} A_{ik} B_{kj}  (A.15)
* This means that the matrix is comprised of real numbers and that it has 𝑚 rows and 𝑛 columns.
where 𝐶 ∈ ℝ^{m×p}. Notice that two matrices can be multiplied only if their inner dimensions are equal (𝑛 in this case). The remaining products discussed in this section are just special cases of matrix multiplication, but are common enough that we discuss them separately.
u^T v = [u_1  u_2  ···  u_n] \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} = Σ_{i=1}^{n} u_i v_i  (A.16)
Notice that the order is irrelevant:
𝑢𝑇 𝑣 = 𝑣𝑇 𝑢 (A.17)
u v^T = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} [v_1  v_2  ···  v_n] = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix}  (A.19)
or in index form:
(u v^T)_{ij} = u_i v_j  (A.20)
We can see that the entries in 𝑣 are dot products between the rows of 𝐴 and 𝑢:
v = \begin{bmatrix} \text{---} a_1^T \text{---} \\ \text{---} a_2^T \text{---} \\ \vdots \\ \text{---} a_m^T \text{---} \end{bmatrix} u  (A.22)
where a_j^T is the j-th row of the matrix 𝐴.
Alternatively, it could be thought of as a linear combination of the columns of 𝐴, where the 𝑢_𝑗 are the weights:
v = a_1 u_1 + a_2 u_2 + ··· + a_n u_n  (A.23)
where a_i are the columns of 𝐴.
We can also multiply by a vector on the left, instead of on the right:
𝑣𝑇 = 𝑢𝑇 𝐴 (A.24)
In a quadratic form, the two vectors 𝑢 are identical, and thus 𝐴 is square. Also, in a quadratic form we assume that 𝐴 is symmetric (even if it is not, only the symmetric part of 𝐴 contributes anyway, so effectively it acts like a symmetric matrix).
Note that:
(A^T)^T = A  (A.29)
(A + B)^T = A^T + B^T  (A.30)
(AB)^T = B^T A^T  (A.31)
A_{ij} = A_{ji}  (A.32)
Not all matrices are invertible. Some common properties for inverses
are:
𝑥 𝑇 𝑀𝑥 ≥ 0 (A.38)
𝑥 𝑇 𝑀𝑥 < 0 (A.39)
f(x) = a^T x + b = Σ_{i=1}^{n} a_i x_i + b_i  (A.40)
where 𝑎, 𝑥, and 𝑏 are vectors of length 𝑛, and 𝑎_𝑖, 𝑥_𝑖, and 𝑏_𝑖 are the 𝑖-th elements of 𝑎, 𝑥, and 𝑏, respectively. If we take the partial derivative of each element with respect to an arbitrary element of 𝑥, namely 𝑥_𝑘, we get
(∂/∂x_k) [ Σ_{i=1}^{n} a_i x_i + b_i ] = a_k  (A.41)
Thus:
∇_x (a^T x + b) = a  (A.42)
Recalling the quadratic form presented in Appendix A.3.3, we can combine it with a linear term to form a general quadratic function:
f(x) = x^T A x + b^T x + c  (A.43)
f(x) = Σ_{i=1}^{n} ( Σ_{j=1}^{n} x_i a_{ij} x_j + b_i x_i + c_i )  (A.44)
For convenience, we'll separate the diagonal terms from the off-diagonal terms, leaving us with
f(x) = Σ_{i=1}^{n} ( a_{ii} x_i^2 + b_i x_i + c_i + Σ_{j≠i} x_i a_{ij} x_j )  (A.45)
∂f/∂x_k = 2 a_{kk} x_k + b_k + Σ_{j≠k} x_j a_{jk} + Σ_{j≠k} a_{kj} x_j  (A.46)
We now move the diagonal terms back into the sums to get
∂f/∂x_k = b_k + Σ_{j=1}^{n} ( x_j a_{jk} + a_{kj} x_j ),  (A.47)
or, in matrix form,
∇_x f(x) = A^T x + A x + b  (A.48)
For symmetric 𝐴, this simplifies to
∇_x (x^T A x + b^T x + c) = 2 A x + b  (A.49)
𝑓 (𝑥 + Δ𝑥) = 𝑎0 + 𝑎1 Δ𝑥 + 𝑎 2 Δ𝑥 2 + . . . + 𝑎 𝑘 Δ𝑥 𝑘 + . . . (A.50)
both sides and the appropriate value of the first nonzero term (which is always constant). Identifying the pattern yields the general formula for the 𝑘th-order coefficient
a_k = f^{(k)}(x) / k!  (A.52)
Substituting this into the polynomial (A.50) yields the Taylor series
f(x + Δx) = Σ_{k=0}^{∞} (Δx^k / k!) f^{(k)}(x).  (A.53)
[Figure A.2: Taylor series expansions for a 1-D example. The more terms we consider from the Taylor series, the better the approximation.]
The Taylor series in multiple dimensions is similar to the single-variable case, but more complicated. The first derivative of the function becomes a gradient vector, and the second derivatives become a Hessian matrix. Also, we need to define a direction along which we want to approximate the function, since that information is not inherent as it is in a 1-D function. The Taylor series expansion in 𝑛 dimensions along a direction 𝑝 is
f(x + αp) = f(x) + α Σ_{k=1}^{n} (∂f/∂x_k) p_k + (1/2) α^2 Σ_{k=1}^{n} Σ_{l=1}^{n} (∂^2 f / ∂x_k ∂x_l) p_k p_l + 𝒪(α^3),  (A.56)
where 𝛼 is a scalar that determines how far to go in the direction 𝑝. In matrix form, we can write
f(x + αp) = f(x) + α ∇f(x)^T p + (1/2) α^2 p^T H(x) p + 𝒪(α^3),  (A.57)
f(x_1, x_2) = (1 − x_1)^2 + (1 − x_2)^2 + (1/2)(2x_2 − x_1^2)^2.  (A.58)
Performing a Taylor series expansion about 𝑥 = [0, −2]^T, we get
f(αp) = 18 + α [−2  −14] p + (1/2) α^2 p^T \begin{bmatrix} 10 & 0 \\ 0 & 6 \end{bmatrix} p  (A.59)
The original function, the linear approximation, and the quadratic approximation are compared in Fig. A.3.
𝑟(𝑢) = 𝑏 − 𝐴𝑢 = 0, (B.1)
time, starting with the first one and progressing from left to right. This is done by subtracting multiples of each row from subsequent rows. These operations can be expressed as a sequence of multiplications with
[Figure B.1: 𝐿𝑈 decomposition.]
Inputs:
𝐴: Nonsingular square matrix
𝑏: A vector
Outputs:
𝑢: Solution to 𝐴𝑢 = 𝑏

Perform forward substitution to solve 𝐿𝑦 = 𝑏 for 𝑦:
y_1 = b_1 / L_{11},   y_i = (1/L_{ii}) ( b_i − Σ_{j=1}^{i−1} L_{ij} y_j )  for i = 2, ..., n
Perform backward substitution to solve 𝑈𝑢 = 𝑦 for 𝑢:
u_n = y_n / U_{nn},   u_i = (1/U_{ii}) ( y_i − Σ_{j=i+1}^{n} U_{ij} u_j )  for i = n − 1, ..., 1
While direct methods are usually more efficient and robust, iterative
methods have several advantages:
starting from an initial guess 𝑢0 . The function 𝐺(𝑢) is devised such that
the iterates converge to the solution 𝑢 ∗ , which satisfies 𝑟(𝑢 ∗ ) = 0. Many
fixed-point methods can be derived by splitting the matrix such that
𝐴 = 𝑀 − 𝑁. Then, 𝐴𝑢 = 𝑏 leads to 𝑀𝑢 = 𝑁𝑢 + 𝑏, and substituting this
into the linear system yields
This corresponds to the two lines shown in Fig. B.3, where the solution 𝑢* is at their intersection.
Applying the Jacobi iteration (B.9),
u_1^(k+1) = (1/2) u_2^(k)
u_2^(k+1) = (1/3)(1 + 2 u_1^(k)).  (B.15)
Starting with the guess 𝑢^(0) = (2, 1), we get the iterations shown in Fig. B.3. The Gauss–Seidel iteration (B.11) is similar, where the only change is that the second equation uses the latest state from the first one:
u_1^(k+1) = (1/2) u_2^(k)
u_2^(k+1) = (1/3)(1 + 2 u_1^(k+1)).  (B.16)
As expected, Gauss–Seidel converges faster than the Jacobi iteration, taking a more direct path. The SOR iteration is
u_1^(k+1) = (1 − ω) u_1^(k) + (ω/2) u_2^(k)
u_2^(k+1) = (1 − ω) u_2^(k) + (ω/3)(1 + 2 u_1^(k+1)).  (B.17)
SOR converges even faster for the right values of 𝜔. The result shown here is for 𝜔 = 1.2.
[Figure B.3: (a) Jacobi, (b) Gauss–Seidel, and (c) SOR iterations.]
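The three update formulas can be iterated directly, as sketched below from the same starting guess 𝑢^(0) = (2, 1); the SOR update is written as a relaxed Gauss–Seidel step, and the convergence tolerance is an arbitrary choice.

```python
# Iterating the Jacobi, Gauss-Seidel, and SOR updates of Eqs. B.15-B.17.
import numpy as np

def iterate(method, omega=1.2, tol=1e-10, max_iter=200):
    u1, u2 = 2.0, 1.0
    for k in range(max_iter):
        if method == "jacobi":
            u1_new = 0.5 * u2
            u2_new = (1.0 + 2.0 * u1) / 3.0
        elif method == "gauss-seidel":
            u1_new = 0.5 * u2
            u2_new = (1.0 + 2.0 * u1_new) / 3.0
        else:  # SOR: relaxed Gauss-Seidel update
            u1_new = (1.0 - omega) * u1 + omega * 0.5 * u2
            u2_new = (1.0 - omega) * u2 + omega * (1.0 + 2.0 * u1_new) / 3.0
        if max(abs(u1_new - u1), abs(u2_new - u2)) < tol:
            return (u1_new, u2_new), k + 1
        u1, u2 = u1_new, u2_new
    return (u1, u2), max_iter

for m in ["jacobi", "gauss-seidel", "sor"]:
    print(m, iterate(m))   # all converge to the same solution; iteration counts differ
```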
Krylov subspace methods are another class of iterative methods that include the conjugate gradient method and the generalized minimum residual (GMRES) method. Compared to stationary methods,
Krylov methods have the advantage that they use information gathered
throughout the iterations. Instead of using a fixed splitting matrix,
Krylov methods effectively vary the splitting so that 𝑀 is changed
at each iteration according to some criteria that uses the information
gathered so far. For this reason, Krylov methods are usually more
efficient than fixed-point iterations.
Like fixed-point iteration methods, Krylov methods do not require
forming or storing 𝐴. Instead, the iterations require only matrix-vector
products of the form 𝐴𝑣, where 𝑣 is some vector given by the Krylov
algorithm. To be efficient, Krylov subspace methods require a good
preconditioner.
C Test Problems
C.1 Unconstrained Problems
gradient-based optimizer:
f(x_1, x_2) = x_1^2 + x_2^2 − β x_1 x_2,  (C.1)
An intermediate value of 𝛽 = 3/2 is suitable for first tests and yields the contours shown in Fig. C.1.
[Figure C.1: Slanted quadratic function for 𝛽 = 3/2.]
Global minimum: 𝑓(𝑥*) = 0.0 at 𝑥* = (0, 0)
f(x_1, x_2) = (1 − x_1)^2 + 100 (x_2 − x_1^2)^2.  (C.2)
This is a classic benchmarking function because of its narrow turning valley. The large difference between the maximum and minimum curvatures, and the fact that the principal curvature directions change along the valley, makes it a good test for quasi-Newton methods.
[Figure C.2: Rosenbrock function.]
The Rosenbrock function can be extended to 𝑛 dimensions by defining the sum
f(x) = Σ_{i=1}^{n−1} [ 100 (x_{i+1} − x_i^2)^2 + (1 − x_i)^2 ].  (C.3)
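A minimal benchmark run on the two-dimensional Rosenbrock function is sketched below; the starting point and the choice of BFGS are arbitrary, and the analytic gradient is supplied so the run exercises a gradient-based method.

```python
# Two-dimensional Rosenbrock function (Eq. C.2) and its gradient.
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

res = minimize(rosenbrock, x0=np.array([-1.2, 1.0]),
               jac=rosenbrock_grad, method="BFGS")
print(res.x, res.fun)   # should approach the minimum at (1, 1)
```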
f(x_1, x_2) = (1 − x_1)^2 + (1 − x_2)^2 + (1/2)(2x_2 − x_1^2)^2.  (C.4)
Global minimum: 𝑓(𝑥*) = 0.09194 at 𝑥* = (1.21314, 0.82414)
with one global minimum. This function, shown in Fig. C.4 along with the local and global minima, is
[Figure C.4: Jones multimodal function.]
f(x) = − Σ_{i=1}^{4} α_i exp( − Σ_{j=1}^{3} A_{ij} (x_j − P_{ij})^2 )  (C.6)
where
α = [1.0, 1.2, 3.0, 3.2]^T,
A = \begin{bmatrix} 3 & 10 & 30 \\ 0.1 & 10 & 35 \\ 3 & 10 & 30 \\ 0.1 & 10 & 35 \end{bmatrix},
P = 10^{-4} \begin{bmatrix} 3689 & 1170 & 2673 \\ 4699 & 4387 & 7470 \\ 1091 & 8732 & 5547 \\ 381 & 5743 & 8828 \end{bmatrix}  (C.7)
[Figure C.5: An 𝑥_2–𝑥_3 slice of the Hartmann function at 𝑥_1 = 0.1148.]
𝐿 = 𝑞𝐶 𝐿 𝑆, (C.10)
𝑆 = 𝑏𝑐. (C.12)
𝐷 𝑓 = 𝑘𝐶 𝑓 𝑞𝑆wet . (C.13)
D_i = L^2 / (q π b^2 e),  (C.16)
where 𝑒 is the Oswald efficiency factor. Total drag is the sum of induced and viscous drag, 𝐷 = 𝐷_𝑖 + 𝐷_𝑓.
Our objective function, the power required by the motor for level flight, is
P(b, c) = D v / η,  (C.17)
where 𝜂 is the propulsive efficiency. We assume that our electric propellers have a Gaussian efficiency curve (real efficiency curves aren't very Gaussian, but this is simple and will be sufficient for our purposes):
η = η_max exp( −(v − v̄)^2 / (2σ^2) ).  (C.18)
This is the same problem that was presented in Ex. 1.2 of Chapter 1. The optimal wing span and chord are 𝑏 = 25.48 m and 𝑐 = 0.50 m, respectively, given the parameters. The contour and the optimal wing shape are shown in Fig. C.6. Note that there are no structural considerations in this problem, so the resulting aircraft has a higher aspect ratio wing than is realistic.
Δt_i = √( 2 (Δx_i^2 + Δy_i^2) ) / [ √g ( √(h − y_{i+1} − μ_k x_{i+1}) + √(h − y_i − μ_k x_i) ) ],  (C.23)
T = Σ_{i=1}^{n−1} Δt_i.  (C.24)
The design variables are the 𝑛−2 positions of the path parameterized
by 𝑦 𝑖 . The end points must be fixed, otherwise the problem is ill-defined,
which is why there are 𝑛 − 2 design variables instead of 𝑛. Note that
𝑥 is a parameter, meaning that it is fixed. You could space the 𝑥 𝑖 any
reasonable way and still find the same underlying optimal curve, but
it is easiest to just use uniform spacing. As the dimensionality of the
problem increases, the solution becomes more challenging. We will
use the following specifications:
The analytic solution for the case with friction is more difficult to
derive, but the analytic solution for the frictionless case (𝜇 𝑘 = 0) with
our starting and ending points is:
𝑥 = 𝑎(𝜃 − sin(𝜃)),
(C.26)
𝑦 = −𝑎(1 − cos(𝜃)) + 1,
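The travel-time objective of Eqs. C.23 and C.24 is easy to evaluate for any discretized path, as sketched below. Since the problem specifications are not reproduced here, the sketch assumes a frictionless case with ℎ = 1, 𝑔 = 9.81, and a path from (0, 1) to a hypothetical end point (1, 0); it compares a straight line with the analytic cycloid of Eq. C.26 through the same end points.

```python
# Travel time of a discretized path (Eqs. C.23-C.24), frictionless case.
import numpy as np
from scipy.optimize import brentq

def travel_time(x, y, h=1.0, g=9.81, mu_k=0.0):
    dx, dy = np.diff(x), np.diff(y)
    s_next = np.sqrt(np.maximum(h - y[1:] - mu_k * x[1:], 0.0))
    s_curr = np.sqrt(np.maximum(h - y[:-1] - mu_k * x[:-1], 0.0))
    dt = np.sqrt(2.0 * (dx**2 + dy**2)) / (np.sqrt(g) * (s_next + s_curr))
    return np.sum(dt)

n = 60
x = np.linspace(0.0, 1.0, n)
y_line = 1.0 - x                      # straight line from (0, 1) to (1, 0)

# Cycloid through the same end points (Eq. C.26); theta_f and a found numerically
theta_f = brentq(lambda t: (1 - np.cos(t)) / (t - np.sin(t)) - 1.0,
                 1e-3, 2 * np.pi - 1e-3)
a = 1.0 / (theta_f - np.sin(theta_f))
theta = np.linspace(0.0, theta_f, n)
x_cyc = a * (theta - np.sin(theta))
y_cyc = -a * (1.0 - np.cos(theta)) + 1.0

print(travel_time(x, y_line), travel_time(x_cyc, y_cyc))   # cycloid is faster
```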
a_1 = 75.196         a_2 = −3.8112
a_3 = 0.12694        a_4 = −2.0567×10^−3
a_5 = 1.0345×10^−5   a_6 = −6.8306
a_7 = 0.030234       a_8 = −1.28134×10^−3
a_9 = 3.5256×10^−5   a_10 = −2.266×10^−7
a_11 = 0.25645       a_12 = −3.4604×10^−3
a_13 = 1.3514×10^−5  a_14 = −28.106
a_15 = −5.2375×10^−6 a_16 = −6.3×10^−8
a_17 = 7.0×10^−10    a_18 = 3.4054×10^−4
a_19 = −1.6638×10^−6 a_20 = −2.8673
a_21 = 0.0005
f(x_1, x_2) = a_1 + a_2 x_1 + a_3 y_4 + a_4 y_4 x_1 + a_5 y_4^2 + a_6 x_2 + a_7 y_1 + a_8 x_1 y_1 + a_9 y_1 y_4 + a_10 y_2 y_4 + a_11 y_3 + a_12 x_2 y_3 + a_13 y_3^2 + a_14 / (x_2 + 1) + a_15 y_3 y_4 + a_16 y_1 y_4 x_2 + a_17 y_1 y_3 y_4 + a_18 x_1 y_3 + a_19 y_1 y_3 + a_20 exp(a_21 y_1)  (C.28)
1 He, X., Li, J., Mader, C. A., Yildirim, A., and Martins, J. R. R. A., cited on p. 19
“Robust aerodynamic shape optimization—from a circle to an
airfoil,” Aerospace Science and Technology, Vol. 87, April 2019, pp. 48–
61.
doi: 10.1016/j.ast.2019.01.051
2 Betts, J. T., "Survey of numerical methods for trajectory optimization," Journal of Guidance, Control, and Dynamics, Vol. 21, No. 2, 1998, pp. 193–207. cited on p. 25
3 Bryson, A. E. and Ho, Y. C., Applied Optimal Control; Optimization, cited on p. 25
Estimation, and Control. Blaisdell Publishing, 1969.
4 Bertsekas, D. P., Dynamic programming and optimal control. Belmont, cited on p. 25
MA: Athena Scientific, 1995.
5 Kepler, J., Nova stereometria doliorum vinariorum (New solid geometry cited on p. 32
of wine barrels). Linz: Johannes Planck, 1615.
6 Ferguson, T. S., “Who Solved the Secretary Problem?” Statistical cited on p. 32
Science, Vol. 4, No. 3, August 1989, pp. 282–289.
doi: 10.1214/ss/1177012493
7 Fermat, P. de, Methodus ad disquirendam maximam et minimam cited on p. 33
(Method for the study of maxima and minima). 1636, Translated by
Jason Ross.
8 Lagrange, J.-L., Mécanique analytique. Paris, France, 1788, Vol. 1. cited on p. 34
36 Martins, J. R. R. A. and Lambe, A. B., “Multidisciplinary Design cited on pp. 38, 376, 396
Optimization: A Survey of Architectures,” AIAA Journal, Vol. 51,
No. 9, September 2013, pp. 2049–2075.
doi: 10.2514/1.J051895
37 Sobieszczanski–Sobieski, J., “Sensitivity of Complex, Internally cited on p. 38
Coupled Systems,” AIAA Journal, Vol. 28, No. 1, 1990, pp. 153–160.
doi: 10.2514/3.10366
38 Martins, J. R. R. A., Alonso, J. J., and Reuther, J. J., “A Coupled- cited on p. 38
Adjoint Sensitivity Analysis Method for High-Fidelity Aero-Structural
Design,” Optimization and Engineering, Vol. 6, No. 1, March 2005,
pp. 33–62.
doi: 10.1023/B:OPTE.0000048536.47956.62
39 Hwang, J. T. and Martins, J. R. R. A., “A computational architecture cited on pp. 38, 208, 381
for coupling heterogeneous numerical models and computing
coupled derivatives,” ACM Transactions on Mathematical Software,
Vol. 44, No. 4, June 2018, Article 37.
doi: 10.1145/3182393
40 Wright, M. H., “The interior-point revolution in optimization: cited on p. 39
History, recent developments, and lasting consequences,” Bulletin
of the American Mathematical Society, Vol. 42, 2005, pp. 39–56.
41 Grant, M., Boyd, S., and Ye, Y., “Disciplined Convex Programming,” cited on p. 39
in Global Optimization: From Theory to Implementation, Liberti, L. and
Maculan, N., Eds., Springer, 2006, pp. 155–210.
42 Wengert, R. E., “A Simple Automatic Derivative Evaluation Pro- cited on p. 39
gram,” Commun. ACM, Vol. 7, No. 8, August 1964, pp. 463–464,
issn: 0001-0782.
doi: 10.1145/355586.364791
43 Speelpenning, B., “Compiling fast partial derivatives of functions cited on p. 39
given by algorithms,” Ph.D. Dissertation, University of Illinois at
Urbana–Champaign, January 1980.
doi: 10.2172/5254402
44 Squire, W. and Trapp, G., “Using Complex Variables to Estimate cited on p. 39
Derivatives of Real Functions,” SIAM Review, Vol. 40, No. 1, 1998,
pp. 110–112, issn: 0036-1445 (print), 1095-7200 (electronic).
45 Martins, J. R. R. A., Sturdza, P., and Alonso, J. J., “The Complex- cited on pp. 39, 186
Step Derivative Approximation,” ACM Transactions on Mathematical
Software, Vol. 29, No. 3, September 2003, pp. 245–262.
doi: 10.1145/838250.838251
46 Torczon, V., “On the Convergence of Pattern Search Algorithms,” cited on p. 40
SIAM Journal on Optimization, Vol. 7, No. 1, February 1997, pp. 1–25.
47 Jones, D., Perttunen, C., and Stuckman, B., “Lipschitzian optimiza- cited on pp. 40, 222, 223
tion without the Lipschitz constant,” Journal of Optimization Theory
and Applications, Vol. 79, No. 1, October 1993, pp. 157–181.
48 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization by cited on p. 40
Simulated Annealing,” Science, Vol. 220, No. 4598, 1983, pp. 671–
680.
doi: 10.1126/science.220.4598.671
49 Kennedy, J. and Eberhart, R. C., “Particle Swarm Optimization,” cited on p. 40
IEEE International Conference on Neural Networks, Vol. IV, Piscataway,
NJ, 1995, pp. 1942–1948.
50 Forrester, A. I. and Keane, A. J., “Recent advances in surrogate- cited on p. 40
based optimization,” Progress in Aerospace Sciences, Vol. 45, No. 1,
2009, pp. 50–79, issn: 0376-0421.
doi: 10.1016/j.paerosci.2008.11.001
51 Bottou, L., Curtis, F. E., and Nocedal, J., “Optimization Methods cited on p. 41
for Large-Scale Machine Learning,” SIAM Review, Vol. 60, No. 2,
2018, pp. 223–311.
doi: 10.1137/16M1080173
52 Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M., cited on p. 41
“Automatic Differentiation in Machine Learning: A Survey,” Journal
of Machine Learning Research, Vol. 18, No. 1, January 2018, pp. 5595–
5637.
53 Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., cited on p. 53
Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley,
M. D., Waugh, B., White, E. P., and Wilson, P., “Best Practices for
Scientific Computing,” PLoS Biology, Vol. 12, No. 1, 2014, e1001745.
doi: 10.1371/journal.pbio.1001745
54 Grotker, T., Holtmann, U., Keding, H., and Wloka, M., The Devel- cited on p. 54
oper’s Guide to Debugging, 2nd ed. 2012.
55 Ascher, U. M. and Greif, C., A first course in numerical methods. SIAM, cited on p. 58
2011.
56 Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed. SIAM, 2003. cited on p. 59
57 Hager, W. W. and Zhang, H., “A New Conjugate Gradient Method cited on p. 88
with Guaranteed Descent and an Efficient Line Search,” SIAM
Journal on Optimization, Vol. 16, No. 1, January 2005, pp. 170–192,
issn: 1095-7189.
doi: 10.1137/030601880
58 Nocedal, J. and Wright, S. J., Numerical Optimization, 2nd ed. Springer- cited on pp. 92, 113, 114, 152, 159
Verlag, 2006.
64 Conn, A. R., Gould, N. I. M., and Toint, P. L., Trust Region Methods. cited on pp. 112, 113, 114
SIAM, January 2000.
isbn: 0898714605
65 Steihaug, T., “The Conjugate Gradient Method and Trust Regions cited on p. 113
in Large Scale Optimization,” SIAM Journal on Numerical Analysis,
Vol. 20, No. 3, June 1983, pp. 626–637, issn: 1095-7170.
doi: 10.1137/0720042
66 Murray, W., “Analytical expressions for the eigenvalues and eigen- cited on p. 147
vectors of the Hessian matrices of barrier and penalty functions,”
Journal of Optimization Theory and Applications, Vol. 7, No. 3, March
1971, pp. 189–196.
doi: 10.1007/bf00932477
67 Forsgren, A., Gill, P. E., and Wright, M. H., “Interior Methods for cited on p. 147
Nonlinear Optimization,” SIAM Review, Vol. 44, No. 4, January
2002, pp. 525–597.
doi: 10.1137/s0036144502414942
68 Gill, P. E., Murray, W., and Saunders, M. A., “SNOPT: An SQP Al- cited on p. 152
gorithm for Large-Scale Constrained Optimization,” SIAM Review,
Vol. 47, No. 1, 2005, pp. 99–131.
doi: 10.1137/S0036144504446096
69 Liu, D. C. and Nocedal, J., “On the limited memory BFGS method cited on p. 154
for large scale optimization,” Mathematical Programming, Vol. 45,
No. 1-3, August 1989, pp. 503–528.
doi: 10.1007/bf01589116
70 Wächter, A. and Biegler, L. T., “On the implementation of an cited on p. 158
interior-point filter line-search algorithm for large-scale nonlinear
programming,” Mathematical Programming, Vol. 106, No. 1, April
2005, pp. 25–57.
doi: 10.1007/s10107-004-0559-y
71 Byrd, R. H., Hribar, M. E., and Nocedal, J., “An Interior Point cited on p. 158
Algorithm for Large-Scale Nonlinear Programming,” SIAM Journal
on Optimization, Vol. 9, No. 4, January 1999, pp. 877–900.
doi: 10.1137/s1052623497325107
72 Fletcher, R. and Leyffer, S., “Nonlinear programming without a cited on p. 161
penalty function,” Mathematical Programming, Vol. 91, No. 2, January
2002, pp. 239–269.
doi: 10.1007/s101070100244
73 Fletcher, R., Leyffer, S., and Toint, P., “A Brief History of Filter cited on p. 161
Methods,” ANL/MCS-P1372-0906, Argonne National Laboratory,
September 2006.
74 Benson, H. Y., Vanderbei, R. J., and Shanno, D. F., “Interior-Point cited on p. 161
Methods for Nonconvex Nonlinear Programming: Filter Methods
and Merit Functions,” Computational Optimization and Applications,
Vol. 23, No. 2, 2002, pp. 257–272.
doi: 10.1023/a:1020533003783
75 Kreisselmeier, G. and Steinhauser, R., “Systematic Control Design cited on p. 164
by Optimizing a Vector Performance Index,” IFAC Proceedings
Volumes, Vol. 12, No. 7, September 1979, pp. 113–117, issn: 1474-
6670.
doi: 10.1016/s1474-6670(17)65584-8
76 Hoerner, S. F., Fluid-Dynamic Drag. 1965. cited on p. 169
77 Lyness, J. N., “Numerical Algorithms Based on the Theory of Com- cited on p. 182
plex Variable,” Proceedings — ACM National Meeting, Washington
DC: Thompson Book Co., 1967, pp. 125–133.
78 Lyness, J. N. and Moler, C. B., “Numerical Differentiation of Ana- cited on p. 182
lytic Functions,” SIAM Journal on Numerical Analysis, Vol. 4, No. 2,
1967, pp. 202–210, issn: 0036-1429 (print), 1095-7170 (electronic).
79 Lantoine, G., Russell, R. P., and Dargent, T., “Using Multicomplex cited on p. 183
Variables for Automatic Computation of High-Order Derivatives,”
ACM Transactions on Mathematical Software, Vol. 38, No. 3, April
2012, pp. 1–21, issn: 0098-3500.
doi: 10.1145/2168773.2168774
80 Fike, J. and Alonso, J., “The Development of Hyper-Dual Numbers cited on p. 183
for Exact Second-Derivative Calculations,” 49th AIAA Aerospace
Sciences Meeting including the New Horizons Forum and Aerospace
Exposition, January 2011.
doi: 10.2514/6.2011-886
81 Griewank, A., Evaluating Derivatives. Philadelphia: SIAM, 2000. cited on p. 187
91 Gray, J. S., Hwang, J. T., Martins, J. R. R. A., Moore, K. T., and cited on pp. 204, 208, 383
Naylor, B. A., “OpenMDAO: An open-source framework for mul-
tidisciplinary design, analysis, and optimization,” Structural and
Multidisciplinary Optimization, Vol. 59, No. 4, April 2019, pp. 1075–
1104.
doi: 10.1007/s00158-019-02211-z
92 Ning, A., “Using Blade Element Momentum Methods with Gradient- cited on p. 205
Based Design Optimization,” 2020 (in review).
93 Martins, J. R. R. A. and Hwang, J. T., “Review and Unification of cited on p. 205
Methods for Computing Derivatives of Multidisciplinary Compu-
tational Models,” AIAA Journal, Vol. 51, No. 11, November 2013,
pp. 2582–2599.
doi: 10.2514/1.J052184
94 Yu, Y., Lyu, Z., Xu, Z., and Martins, J. R. R. A., “On the Influence cited on p. 214
of Optimization Algorithm and Starting Design on Wing Aerody-
namic Shape Optimization,” Aerospace Science and Technology, Vol.
75, April 2018, pp. 183–199.
doi: 10.1016/j.ast.2018.01.016
95 Rios, L. M. and Sahinidis, N. V., “Derivative-free optimization: A cited on pp. 214, 215
review of algorithms and comparison of software implementations,”
Journal of Global Optimization, Vol. 56, 2013, pp. 1247–1293.
doi: 10.1007/s10898-012-9951-y
96 Conn, A. R., Scheinberg, K., and Vicente, L. N., Introduction to cited on p. 216
Derivative-Free Optimization. SIAM, 2009.
97 Audet, C. and Hare, W., Derivative-Free and Blackbox Optimization. cited on p. 216
Springer, 2017.
doi: 10.1007/978-3-319-68913-5
98 Le Digabel, S., “Algorithm 909: NOMAD: Nonlinear Optimization cited on p. 216
with the MADS algorithm,” ACM Transactions on Mathematical
Software, Vol. 37, No. 4, 2011, pp. 1–15.
99 Jones, D. R., “Direct global optimization algorithm,” Encyclopedia cited on pp. 216, 228
of Optimization, Floudas, C. A. and Pardalos, P. M., Eds. Boston,
MA: Springer US, 2009, pp. 725–735, isbn: 978-0-387-74759-0.
doi: 10.1007/978-0-387-74759-0_128
100 Simon, D., Evolutionary Optimization Algorithms. John Wiley & Sons, cited on pp. 217, 237
June 2013.
isbn: 1118659503
101 Barricelli, N., “Esempi numerici di processi di evoluzione,” Metho- cited on p. 231
dos, 1954, pp. 45–68.
102 Jong, K. A. D., “An analysis of the behavior of a class of genetic cited on p. 231
adaptive systems,” Ph.D. Dissertation, University of Michigan,
Ann Arbor, MI, 1975.
103 Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., “A fast and cited on pp. 233, 286
elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions
on Evolutionary Computation, Vol. 6, No. 2, April 2002, pp. 182–197.
doi: 10.1109/4235.996017
104 Deb, K., Multi-Objective Optimization Using Evolutionary Algorithms. cited on p. 239
John Wiley & Sons, July 2001.
isbn: 047187339X
105 Eberhart, R. and Kennedy, J., “A New Optimizer Using Particle cited on p. 241
Swarm Theory,” Sixth International Symposium on Micro Machine
and Human Science, Nagoya, Japan, 1995, pp. 39–43.
106 Gutin, G., Yeo, A., and Zverovich, A., “Traveling salesman should cited on p. 261
not be greedy: Domination analysis of greedy-type heuristics for
the TSP,” Discrete Applied Mathematics, Vol. 117, No. 1-3, March
2002, pp. 81–86, issn: 0166-218X.
doi: 10.1016/s0166-218x(01)00195-0
107 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization cited on p. 270
by Simulated Annealing,” Science, Vol. 220, No. 4598, May 1983,
pp. 671–680, issn: 1095-9203.
doi: 10.1126/science.220.4598.671
108 Černý, V., “Thermodynamical approach to the traveling salesman cited on p. 270
problem: An efficient simulation algorithm,” Journal of Optimization
Theory and Applications, Vol. 45, No. 1, January 1985, pp. 41–51, issn:
1573-2878.
doi: 10.1007/bf00940812
109 Andresen, B. and Gordon, J. M., “Constant thermodynamic speed cited on p. 271
for minimizing entropy production in thermodynamic processes
and simulated annealing,” Physical Review E, Vol. 50, No. 6, Decem-
ber 1994, pp. 4346–4351, issn: 1095-3787.
doi: 10.1103/physreve.50.4346
110 Lin, S., “Computer Solutions of the Traveling Salesman Prob- cited on p. 272
lem,” Bell System Technical Journal, Vol. 44, No. 10, December 1965,
pp. 2245–2269, issn: 0005-8580.
doi: 10.1002/j.1538-7305.1965.tb04146.x
111 Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, cited on p. 273
B. P., Numerical Recipes in
C: The Art of Scientific Computing. Cambridge University Press,
October 1992.
isbn: 0521431085
112 Haimes, Y. Y., Lasdon, L. S., and Wismer, D. A., “On a Bicriterion cited on p. 283
Formulation of the Problems of Integrated System Identification
and System Optimization,” IEEE Transactions on Systems, Man, and
Cybernetics, Vol. SMC-1, No. 3, July 1971, pp. 296–297.
doi: 10.1109/tsmc.1971.4308298
113 Das, I. and Dennis, J. E., “Normal-Boundary Intersection: A New cited on p. 283
Method for Generating the Pareto Surface in Nonlinear Multicrite-
ria Optimization Problems,” SIAM Journal on Optimization, Vol. 8,
No. 3, August 1998, pp. 631–657.
doi: 10.1137/s1052623496307510
114 Ismail-Yahaya, A. and Messac, A., “Effective generation of the Pareto cited on p. 286
frontier using the Normal Constraint method,” 40th AIAA Aerospace
Sciences Meeting & Exhibit, American Institute of Aeronautics and
Astronautics, January 2002.
doi: 10.2514/6.2002-178
115 Messac, A. and Mattson, C. A., “Normal Constraint Method with cited on p. 286
Guarantee of Even Representation of Complete Pareto Frontier,”
AIAA Journal, Vol. 42, No. 10, October 2004, pp. 2101–2111.
doi: 10.2514/1.8977
116 Hancock, B. J. and Mattson, C. A., “The smart normal constraint cited on p. 286
method for directly generating a smart Pareto set,” Structural and
Multidisciplinary Optimization, Vol. 48, No. 4, June 2013, pp. 763–775.
doi: 10.1007/s00158-013-0925-6
117 Schaffer, J. D., “Some Experiments in Machine Learning Using Vec- cited on p. 286
tor Evaluated Genetic Algorithms.” Ph.D. Dissertation, Vanderbilt
University, Nashville, TN, 1984.
118 Deb, K., Introduction to evolutionary multiobjective optimization, Mul- cited on p. 286
tiobjective Optimization, Springer Berlin Heidelberg, 2008, pp. 59–96.
doi: 10.1007/978-3-540-88908-3_3
119 Kung, H. T., Luccio, F., and Preparata, F. P., “On Finding the Maxima cited on p. 287
of a Set of Vectors,” Journal of the ACM, Vol. 22, No. 4, October 1975,
pp. 469–476.
doi: 10.1145/321906.321910
120 Forrester, A., Sobester, A., and Keane, A., Engineering Design via cited on p. 296
Surrogate Modelling: A Practical Guide. John Wiley & Sons, September
2008.
isbn: 0470770791
121 Rajnarayan, D., Haas, A., and Kroo, I., “A Multifidelity Gradient- cited on p. 306
Free Optimization Method and Application to Aerodynamic De-
sign,” 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization
Conference, September 2008.
doi: 10.2514/6.2008-6020
122 Ruder, S., “An overview of gradient descent optimization algo- cited on p. 313
rithms,” arXiv:1609.04747, 2016.
123 Goh, G., “Why Momentum Really Works,” Distill, 2017. cited on p. 313
doi: 10.23915/distill.00006
124 Diamond, S. and Boyd, S., “Convex Optimization with Abstract cited on p. 317
Linear Operators,” 2015 IEEE International Conference on Computer
Vision (ICCV), IEEE, December 2015.
doi: 10.1109/iccv.2015.84
125 Boyd, S. P. and Vandenberghe, L., Convex Optimization. Cambridge cited on p. 319
University Press, March 2004.
isbn: 0521833787
126 Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H., “Applica- cited on p. 319
tions of second-order cone programming,” Linear Algebra and its
Applications, Vol. 284, No. 1-3, November 1998, pp. 193–228.
doi: 10.1016/s0024-3795(98)10032-0
127 Vandenberghe, L. and Boyd, S., “Applications of semidefinite cited on p. 319
programming,” Applied Numerical Mathematics, Vol. 29, No. 3, March
1999, pp. 283–299.
doi: 10.1016/s0168-9274(98)00098-1
128 Vandenberghe, L. and Boyd, S., “Semidefinite Programming,” cited on p. 319
SIAM Review, Vol. 38, No. 1, March 1996, pp. 49–95.
doi: 10.1137/1038003
129 Parikh, N. and Boyd, S., “Block splitting for distributed optimiza- cited on p. 319
tion,” Mathematical Programming Computation, Vol. 6, No. 1, October
2013, pp. 77–102.
doi: 10.1007/s12532-013-0061-8
130 Grant, M., Boyd, S., and Ye, Y., Disciplined convex programming, cited on p. 325
Global Optimization, Kluwer Academic Publishers, 2006, pp. 155–
210.
doi: 10.1007/0-387-30528-9_7
131 Hoburg, W. and Abbeel, P., “Geometric Programming for Aircraft cited on p. 328
Design Optimization,” AIAA Journal, Vol. 52, No. 11, November
2014, pp. 2414–2426.
doi: 10.2514/1.j052732
132 Hoburg, W., Kirschen, P., and Abbeel, P., “Data fitting with geometric- cited on p. 328
programming-compatible softmax functions,” Optimization and
Engineering, Vol. 17, No. 4, August 2016, pp. 897–918.
doi: 10.1007/s11081-016-9332-3
133 Kirschen, P. G., York, M. A., Ozturk, B., and Hoburg, W. W., “Ap- cited on p. 329
plication of Signomial Programming to Aircraft Design,” Journal of
Aircraft, Vol. 55, No. 3, May 2018, pp. 965–987.
doi: 10.2514/1.c034378
134 York, M. A., Hoburg, W. W., and Drela, M., “Turbofan Engine cited on p. 329
Sizing and Tradeoff Analysis via Signomial Programming,” Journal
of Aircraft, Vol. 55, No. 3, May 2018, pp. 988–1003.
doi: 10.2514/1.c034463
135 Smolyak, S. A., “Quadrature and interpolation formulas for tensor cited on p. 345
products of certain classes of functions,” Dokl. Akad. Nauk SSSR,
Vol. 148, No. 5, 1963, pp. 1042–1045.
136 Smith, R. C., Uncertainty Quantification: Theory, Implementation, and cited on p. 348
Applications. SIAM, December 2013.
isbn: 1611973228
137 Cacuci, D., Sensitivity & Uncertainty Analysis, Volume 1. Chapman cited on p. 348
and Hall/CRC, May 2003.
doi: 10.1201/9780203498798
138 Parkinson, A., Sorensen, C., and Pourhassan, N., “A General Ap- cited on p. 349
proach for Robust Optimal Design,” Journal of Mechanical Design,
Vol. 115, No. 1, 1993, p. 74.
doi: 10.1115/1.2919328
139 Wiener, N., “The Homogeneous Chaos,” American Journal of Mathe- cited on p. 350
matics, Vol. 60, No. 4, October 1938, p. 897.
doi: 10.2307/2371268
140 Eldred, M., Webster, C., and Constantine, P., “Evaluation of Non- cited on p. 353
Intrusive Approaches for Wiener-Askey Generalized Polynomial
Chaos,” 49th AIAA Structures, Structural Dynamics, and Materials
Conference, American Institute of Aeronautics and Astronautics,
April 2008.
doi: 10.2514/6.2008-1892
141 Adams, B., Bauman, L., Bohnhoff, W., Dalbey, K., Ebeida, M., cited on p. 355
Eddy, J., Eldred, M., Hough, P., Hu, K., Jakeman, J., Stephens, J.,
Swiler, L., Vigil, D., and Wildey, T., “Dakota, A Multilevel Parallel
Object-Oriented Framework for Design Optimization, Parameter
Estimation, Uncertainty Quantification, and Sensitivity Analysis:
Version 6.0 User’s Manual,” Sandia Technical Report SAND2014-
4633, Sandia National Laboratories, November 2015.
142 Feinberg, J. and Langtangen, H. P., “Chaospy: An open source tool cited on p. 355
for designing methods of uncertainty quantification,” Journal of
Computational Science, Vol. 11, November 2015, pp. 46–57.
doi: 10.1016/j.jocs.2015.08.008
143 Kroo, I. M., MDO for large-scale design, Multidisciplinary Design cited on p. 360
Optimization: State-of-the-Art, Alexandrov, N. and Hussaini, M. Y.,
Eds., SIAM, 1997, pp. 22–44.
144 Biegler, L. T., Ghattas, O., Heinkenschloss, M., and Bloemen cited on p. 380
Waanders, B. van, Eds., Large-Scale PDE-Constrained Optimization.
Springer–Verlag, 2003.
145 Braun, R. D., “Collaborative Optimization: An Architecture for cited on p. 386
Large-Scale Distributed Design,” Ph.D. Dissertation, Stanford Uni-
versity, Stanford, CA 94305, 1996.
146 Tedford, N. P. and Martins, J. R. R. A., “Benchmarking Multi- cited on p. 395
disciplinary Design Optimization Algorithms,” Optimization and
Engineering, Vol. 11, No. 1, February 2010, pp. 159–183.
doi: 10.1007/s11081-009-9082-6
147 Golovidov, O., Kodiyalam, S., Marineau, P., Wang, L., and Rohl, cited on p. 396
P., “Flexible implementation of approximation concepts in an
MDO framework,” 7th AIAA/USAF/NASA/ISSMO Symposium on
Multidisciplinary Analysis and Optimization, American Institute of
Aeronautics and Astronautics, 1998.
doi: 10.2514/6.1998-4959
148 Balabanov, V., Charpentier, C., Ghosh, D. K., Quinn, G., Van- cited on p. 396
derplaats, G., and Venter, G., “VisualDOC: A Software System
for General Purpose Integration and Design Optimization,” 9th
AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimiza-
tion, Atlanta, GA, 2002.
149 Barnes, G. K., “A Comparative Study of Nonlinear Optimization cited on p. 420
Codes,” Master’s thesis, The University of Texas at Austin, 1967.
150 Venkayya, V., “Design of optimum structures,” Computers & Struc- cited on p. 421
tures, Vol. 1, No. 1-2, August 1971, pp. 265–309, issn: 0045-7949.
doi: 10.1016/0045-7949(71)90013-7