
Engineering Design Optimization

Joaquim R. R. A. Martins
University of Michigan

Andrew Ning
Brigham Young University

This is a working draft that we are updating frequently.


Once the book is finalized and published, we will continue
to provide an electronic copy free of charge. If you have any
suggestions or corrections, please email [email protected]
or [email protected].

Draft compiled on Sunday 7th February, 2021 at 16:31


Copyright
© 2020 Joaquim R. R. A. Martins and Andrew Ning. All rights reserved.

Publication
First electronic edition January 2020.
Contents

Preface
Acknowledgments

1 Introduction
    1.1 Design Optimization Process
    1.2 Optimization Problem Formulation
    1.3 Optimization Problem Classification
    1.4 Optimization Algorithms
    1.5 Selecting an Optimization Approach
    1.6 Notation
    1.7 Summary
    Problems

2 A Short History of Optimization
    2.1 The First Problems: Optimizing Length and Area
    2.2 Optimization Revolution: Derivatives and Calculus
    2.3 The Birth of Optimization Algorithms
    2.4 The Last Decades
    2.5 Summary

3 Numerical Models and Solvers
    3.1 Model Development for Analysis Versus Optimization
    3.2 Modeling Process and Types of Errors
    3.3 Numerical Models as Residual Equations
    3.4 Discretization of Differential Equations
    3.5 Numerical Errors
    3.6 Rate of Convergence
    3.7 Overview of Solvers
    3.8 Newton-based Solvers
    3.9 Models and the Optimization Problem
    3.10 Summary
    Problems

4 Unconstrained Gradient-Based Optimization
    4.1 Fundamentals
    4.2 Two Overall Approaches to Finding an Optimum
    4.3 Line Search
    4.4 Search Direction
    4.5 Trust-region Methods
    4.6 Summary
    Problems

5 Constrained Gradient-Based Optimization
    5.1 Constrained Problem Formulation
    5.2 Optimality Conditions
    5.3 Penalty Methods
    5.4 Sequential Quadratic Programming
    5.5 Interior Point Methods
    5.6 Merit Functions and Filters
    5.7 Constraint Aggregation
    5.8 Summary
    Problems

6 Computing Derivatives
    6.1 Derivatives, Gradients, and Jacobians
    6.2 Overview of Methods for Computing Derivatives
    6.3 Symbolic Differentiation
    6.4 Finite Differences
    6.5 Complex Step
    6.6 Algorithmic Differentiation
    6.7 Implicit Analytic Methods—Direct and Adjoint
    6.8 Sparse Jacobians and Graph Coloring
    6.9 Unification of the Methods for Computing Derivatives
    6.10 Summary
    Problems

7 Gradient-Free Optimization
    7.1 Relevant Problem Characteristics
    7.2 Classification of Gradient-Free Algorithms
    7.3 Nelder–Mead Algorithm
    7.4 DIRECT Algorithm
    7.5 Genetic Algorithms
    7.6 Particle Swarm Optimization
    7.7 Summary
    Problems

8 Discrete Optimization
    8.1 Binary, Integer, and Discrete Variables
    8.2 Techniques to Avoid Discrete Variables
    8.3 Branch and Bound
    8.4 Greedy Algorithms
    8.5 Dynamic Programming
    8.6 Simulated Annealing
    8.7 Quantum Annealing
    8.8 Binary Genetic Algorithms
    8.9 Summary
    Problems

9 Multiobjective Optimization
    9.1 Multiple Objectives
    9.2 Pareto Optimality
    9.3 Solution Methods
    9.4 Summary
    Problems

10 Surrogate-Based Optimization
    10.1 When to Use a Surrogate
    10.2 Sampling
    10.3 Constructing a Surrogate
    10.4 Infill
    10.5 Deep Neural Networks
    10.6 Summary
    Problems

11 Convex Optimization
    11.1 Introduction
    11.2 Linear Programming
    11.3 Quadratic Programming
    11.4 Second-Order Cone Programming
    11.5 Disciplined Convex Optimization
    11.6 Geometric Programming
    11.7 Summary
    Problems

12 Optimization Under Uncertainty
    12.1 Introduction
    12.2 Statistics Review
    12.3 Robust Design
    12.4 Reliability
    12.5 Forward Propagation
    12.6 Summary
    Problems

13 Multidisciplinary Design Optimization
    13.1 Motivation
    13.2 MDO Problem Representation
    13.3 Multidisciplinary Models
    13.4 Coupled Derivative Computation
    13.5 Monolithic Architectures
    13.6 Distributed Architectures
    13.7 Summary
    Problems

A Mathematics Review
    A.1 Chain Rule, Partial Derivatives, and Total Derivatives
    A.2 Vector and Matrix Norms
    A.3 Matrix Multiplication
    A.4 Matrix Types
    A.5 Matrix Derivatives
    A.6 Taylor Series Expansion

B Linear Solvers
    B.1 Direct Methods
    B.2 Iterative Methods

C Test Problems
    C.1 Unconstrained Problems
    C.2 Constrained Problems

Bibliography

Index
Preface

Despite its usefulness, design optimization remains underused in
industry. One of the reasons for this is the shortage of design optimization
courses in undergraduate and graduate curricula. This is changing, as
most top aerospace and mechanical engineering departments nowadays
include at least one graduate-level course on numerical optimization.
We have also seen design optimization increasingly used in industry,
including the major aircraft manufacturers.
The usage of engineering in the title reflects the types of problems
and algorithms we focus on, even though the methods are applica-
ble beyond engineering. In contrast to explicit analytic mathematical
functions, most engineering problems are implemented in complex
multidisciplinary codes that involve implicit functions. Such problems
might require hierarchical solvers and coupled derivative computa-
tion. Furthermore, many engineering problems involve many design
variables with varying scales and many constraints, requiring scalable
methods.
The target audience for this book is advanced undergraduate and
beginning graduate students in science and engineering. No previous
exposure to optimization is assumed. Knowledge of linear algebra,
multivariable calculus, and numerical methods is helpful, although
these subjects’ core concepts are reviewed in the appendix. The content
of the book spans approximately two semester-length university courses.
Our approach is to start from the most general case problem and then
explain some of the special cases. The first half of the book covers
the fundamentals (along with an optional history chapter), whereas
the second half, from Chapter 8 onwards, covers more specialized or
advanced topics.
Our philosophy in the exposition is to provide a detailed enough
explanation and analysis of optimization methods so that readers can
implement a basic working version. The problems at the end of each
chapter are designed to provide a gradual progression in difficulty and
eventually require implementing the methods. Some of the problems
are open-ended to encourage students to explore a given topic on
their own. While we do not generally encourage readers to use their


implementations instead of existing tools for solving optimization


problems, implementing a method is a useful exercise to understand
the method and its behavior. A deeper understanding of these methods
is useful for developers and researchers and for users who want to
use numerical optimization more effectively. When discussing the
various optimization techniques, we also explain how to avoid the
potential pitfalls of using a particular method and how to use it more
effectively. Practical tips are included throughout the book to alert the
reader to common issues encountered in practical engineering design
optimization and how to address them.
We have created a repository with code, data, templates, and
examples as a supplementary resource for this book: https://github.
com/mdobook/resources. Some of the end-of-chapter exercises refer
to code or data from this repository. Go forth and optimize!
Acknowledgments

We are indebted to many students at our respective institutions who
provided feedback on concepts, examples, and drafts of the manuscript.
We wish to particularly thank Edmund Lee and Aaron Lu for translating
many of our figures to a readable and precise format. We are also
grateful to Judd Mehr for creating the initial draft for the mathematical
review section of the appendix and to Max Opgenoord for sharing his
thesis style file, on which the layout of this book is based.

Joaquim Martins and Andrew Ning

1 Introduction
Optimization is a human instinct. People constantly seek to improve
their lives and the systems that surround them. Optimization is intrinsic
in biology, as exemplified by the evolution of species. Birds optimize
their wings’ shape in real time, and dogs have been shown to find
optimal trajectories. Even more broadly, many laws of physics relate to
optimization, such as the principle of minimum energy. As Leonhard
Euler once wrote, “nothing at all takes place in the universe in which
some rule of maximum or minimum does not appear.”
Optimization is often used to mean improvement, but mathemati-
cally it is a much more precise concept: finding the best possible solution
by changing variables that can be controlled, often subject to constraints.
Optimization has a broad appeal because it is applicable in all domains
and because we can all identify with a desire to make things better.
Any problem where a decision needs to be made can be cast as an
optimization problem.
While some simple optimization problems can be solved analytically,
most practical problems of interest are too complex to be solved this way.
The advent of numerical computing, together with the development of
optimization algorithms, has enabled us to solve problems of increasing
complexity.
Optimization problems occur in various areas, such as economics,
political science, management, manufacturing, biology, physics, and
engineering. A large segment of optimization applications focuses on
operations research, which deals with problems such as deciding on the
price of a product, setting up a distribution network, scheduling, or
suggesting routes.
Another large segment of applications focuses on the design of
engineering systems—the subject of this book. Design optimization
problems abound in the various engineering disciplines, such as wing
design in aerospace engineering, process control in chemical engineer-
ing, structural design in civil engineering, circuit design in electrical
engineering, and mechanism design in mechanical engineering. Most
engineering systems do not work in isolation and are linked to other
systems. This gave rise to the field of multidisciplinary design optimization


(MDO), which applies numerical optimization techniques to the design


of engineering systems that involve multiple disciplines.
In the remainder of this chapter, we start by explaining the design
optimization process and contrasting it with the conventional design
process (Section 1.1). Then we explain how to formulate optimization
problems and the different types of problems that can arise (Section 1.2).
Since design optimization problems involve functions of different types,
these are also briefly discussed (Section 1.3). (A more detailed discussion
of the numerical models used to compute these functions is deferred to
Chapter 3.) We then provide an overview of the different optimization
algorithms, highlighting the algorithms covered in this book and linking
to the relevant section (Section 1.4). We connect algorithm types and
problem types by providing guidelines for selecting the right algorithm
for a given problem (Section 1.5). Finally, we introduce the notation
used throughout the book (Section 1.6).

By the end of this chapter you should be able to:

1. Understand the design optimization process.

2. Formulate an optimization problem.

3. Identify key characteristics to classify optimization problems and optimization algorithms.

4. Recognize some salient characteristics in selecting an appropriate algorithm.
1.1 Design Optimization Process

Engineering design is an iterative process that engineers follow to
develop a product that accomplishes a given task. For any product
beyond a certain complexity, this process involves teams of engineers
and multiple stages with many iterative loops that could be nested. The
engineering teams are formed to tackle different aspects of the product
at different stages.

Figure 1.1: Design phases (requirements and specifications, conceptual design, preliminary design, detailed design, final design).

The design process can be divided into the sequence of phases shown
in Fig. 1.1. Before the design process begins, we must determine the
requirements and specifications. This might involve market research,
an analysis of current similar designs, and interviews with potential
customers. In the conceptual design phase, various concepts for the
system are generated and considered. Because this phase should be

short, it usually relies on simplified models and human intuition. For


more complicated systems, the various subsystems are identified. In
the preliminary design phase, a chosen concept and subsystems are
refined by using better models to guide changes in the design, and
the performance expectations are set. The detailed design phase seeks
to complete the design down to every detail so that it can finally be
manufactured. All of these phases require iteration within themselves.
When severe issues are identified, it may be necessary to “go back to the
drawing board” and regress to an earlier phase. This is just a high-level
view; in practical design, each phase may require multiple iterative
processes.
Design optimization is a tool that can be used to replace an iterative
design process to accelerate the design cycle and obtain better results.
To understand the role of design optimization, consider a simplified
version of the conventional engineering design process with only one
iterative loop, as shown in Fig. 1.2 (top). In this process, engineers make
decisions at every stage based on intuition and background knowledge.
Each of the conventional design process steps includes human
decisions that are either challenging or impossible to program into
computer code. Determining specifications for the product requires
engineers to define the problem and do background research. The
design cycle must start with an initial design, which can be based on past
designs or a new idea. In the conventional design process, this initial
design is analyzed in some way to evaluate its performance. This could
involve numerical modeling or actual building and testing. Engineers
then evaluate the design and decide whether it is good enough or not
based on the results (the evaluation of a given design is often just
called the analysis). If the answer is no—which is likely to be the case
for at least the first few iterations—the engineer will change the design
based on intuition, experience, or trade studies. When the design is
satisfactory, the engineer will arrive at the final design.
The design optimization process can be represented using a similar
flow diagram, as shown in Fig. 1.2 (bottom). The determination
of the specification and the initial design are no different from the
conventional design process. However, design optimization requires a
formal formulation of the optimization problem that includes the design
variables that are to be changed, the objective to be minimized, and
the constraints that need to be satisfied. The evaluation of the design
is strictly based on numerical values for the objective and constraints.
When a rigorous optimization algorithm is used, the decision to finalize
the design is only made when the current design satisfies the optimality
conditions that ensure that no other design “close by” is better. The
design changes are made automatically by the optimization algorithm
and do not require intervention from the designer.

Figure 1.2: Conventional (top) versus design optimization process (bottom).

This automated process does not usually provide a “push-button”
solution; it requires human intervention and expertise (often more
expertise than in the traditional process). Human decisions are still
needed in the design optimization process. Before running an op-
timization, in addition to determining the specifications and initial
design, engineers need to formulate the design problem. This requires
expertise in both the subject area and in numerical optimization. The
designer must decide what the objective is, which parameters can be
changed, and which constraints must be enforced. These decisions
have profound effects on the outcome, so it is crucial that the designer
formulates the optimization problem well.
After running the optimization, engineers must assess the design
because it is unlikely that a valid and practical design is obtained after
the first time a formulation is developed. After evaluating the optimal

design, engineers might decide to reformulate the optimization problem


by changing the objective function, adding or removing constraints,
or changing the set of design variables. Engineers might also decide
to increase the models’ fidelity if they fail to consider critical physical
phenomena or decrease the fidelity if the models are too expensive to
evaluate in an optimization iteration.
In addition to assessing the optimization results, post-optimality
studies are often performed to interpret the optimal design and the
design trends. This might be done by performing parameter studies,
where design variables or other parameters are varied to quantify their
effect on the objective and constraints. Validation of the result can be
done by evaluating the design with higher-fidelity simulation tools, by
performing experiments, or both. It is also possible to compute post-
optimality sensitivities to evaluate which design variables are the most
influential or which constraints drive the design. These sensitivities
can inform where engineers might best allocate resources to alleviate
the driving constraints in future designs.
Design optimization can be used in any of the design phases shown
in Fig. 1.1, where each phase could involve running one or more
design optimizations. In addition to increasing the system performance
and reducing the design time, design optimization also decreases the
uncertainty at any given time compared to the conventional design
process (Fig. 1.3). Considering multiple disciplines or components
using MDO amplifies these same favorable trends.
In this book, we will tend to frame problems and discussions in
the context of engineering design. However, you should keep in
mind that the optimization methods are general and are used in other
applications that may not be called design problems, such as optimal
control, machine learning, and regression. In other words, we mean
“design” in a general sense, where variables are changed to optimize an
objective.

1.2 Optimization Problem Formulation

The design optimization process requires the designer to translate


their intent to a mathematical statement that can then be solved by
an optimization algorithm. Developing this statement has the added
benefit that it helps the designer better understand the problem. Being
methodical in the optimization problem formulation is vital because the
optimizer tends to exploit any weaknesses you might have in your formulation
or model. An inadequate problem formulation can either cause the
optimization to fail or to converge to a mathematical optimum that
is undesirable or unrealistic from an engineering point of view—the
proverbial “right answer to the wrong question”.

Figure 1.3: Compared to the conventional design process, MDO increases the system performance, decreases the design time, reduces the total cost, and reduces the uncertainty at a given point in time.
To formulate design optimization problems, we follow the procedure
outlined in Fig. 1.4. The first step requires writing a description of the
design problem, including a description of the system, and a statement
of all the goals and requirements. At this point, the description does
not necessarily involve optimization concepts and is often vague.

Figure 1.4: Optimization problem formulation steps: (1) describe the problem; (2) gather information; (3) define the design variables; (4) define the objective; (5) define the constraints.

The next step is to gather as much data and information as
possible about the problem. Some information is already specified in
the problem statement, but more research is usually required to find all
the relevant data on the performance requirements and expectations.
Raw data might need to be processed and organized to gather the
information required for the design problem. The more familiar
practitioners are with the problem, the better prepared they will be to
develop a sound formulation and to identify eventual issues in the solutions.
At this stage, it is also essential to identify the analysis procedure
and gather information on that as well. The analysis might consist of a
simple model or a set of elaborate tools. All the possible inputs and
outputs of the analysis should be identified, and its limitations should

be understood. The computational time for the analysis needs to be


considered because optimization requires repeated analysis.
It is usually impossible to learn everything about the problem before
proceeding to the next steps where we define the design variables, objec-
tive, and constraints. Therefore, information gathering and refinement
is an ongoing process in the problem formulation.

1.2.1 Design Variables


The next step is to identify the variables that describe the system, the
design variables (some texts call these “decision variables” or simply
“variables”), which we represent by the column vector:

$x = [x_1, x_2, \ldots, x_{n_x}]. \qquad (1.1)$
This vector defines a given design, so different vectors 𝑥 correspond
to different designs. The number of variables, 𝑛 𝑥 , determines the
problem’s dimensionality and is also referred to as the design degrees
of freedom.
The design variables must not depend on each other, or any other
parameter, and the optimizer must be free to choose the components of
𝑥 independently. This means that in the analysis of a given design, they
must be input parameters that remain fixed throughout the analysis
process. Otherwise, the optimizer will not have absolute control of
the design variables, and the underlying mathematical assumptions
break down. Another possible pitfall is to define a design variable that
happens to be a linear combination of other variables, which results
in an ill-defined optimization problem with an infinite number of
combinations of design variable values that correspond to the same
design.
The choice of variables is usually not unique. For example, a square
shape can be parametrized by the length of its side or by its area, and
different unit systems can be used. The choice of units affects the
problem’s scaling, but not the functional form of the problem.
The choice of design variables can affect the functional form of the
objective and constraints. For example, some nonlinear relationships
can be converted to linear ones through a change of variables. It is also
possible to introduce or eliminate discontinuities through the choice of
design variables.
A given set of design variable values defines the system’s design, but
whether this system satisfies all the requirements is a separate question
that will be addressed with the constraints in a later step. However, it
is possible and advisable to define the space of allowable values for
the design variables based on the design problem specifications and
physical limitations.

The first consideration in the definition of the allowable design


variable values is whether the design variables are continuous or discrete.
The continuous design variables are real numbers that are allowed to
vary continuously within a specified range with no gaps, which we
write as
$\underline{x}_i \le x_i \le \overline{x}_i; \quad i = 1, \ldots, n_x, \qquad (1.2)$

where $\underline{x}$ and $\overline{x}$ are the lower and upper bounds on the design variables,
respectively. These are also known as side constraints. Some design
variables may be unbounded or only bounded on one side.
We distinguish the design variable bounds from constraints because
the optimizer has direct control over their values, and they benefit from
a different numerical treatment when solving an optimization problem.
When defining these bounds, we must take care not to unnecessarily
constrain the design space, which would prevent the optimizer from
achieving a better design that is realizable. A smaller allowable range
in the design variable values should make the optimization easier.
However, design variable bounds should be based on actual physical
constraints instead of being artificially limited. An example of a phys-
ical constraint is a lower bound on structural thickness in a weight
minimization problem, where otherwise, the optimizer will discover
that negative sizes yield negative weight. Whenever a design variable
converges to the bound at the optimum, you should reconsider the
reasoning for that bound and make sure it is valid. This is because de-
signers sometimes set bounds that unnecessarily limit the optimization
from obtaining a better objective.
In a continuous optimization problem, all design variables must
be continuous.‡ Most of this book focuses on algorithms that assume ‡ Thisis not to be confused with the conti-
nuity of the objective and functions, which
continuous design variables. we discuss in Section 1.3.
When one or more variables are discrete, we have a discrete opti-
mization problem. The discrete case occurs when one or more variables
are allowed to have discrete values, which can be either real or integer.
An example of discrete design variables is structural sizing, where
only components of specific thicknesses or cross-sectional areas are
available. Integer design variables are a special case of discrete variables
where the values are integers, like the number of wheels in a vehicle.
Optimization algorithms that handle discrete variables are discussed
in Chapter 8.
At the formulation stage, we should strive to list as many indepen-
dent design variables as possible. However, it is advisable to start with
a small set of variables when solving the problem for the first time and
then gradually expand the set of design variables.

Some optimization algorithms require the user to provide initial


design variable values. This initial point is usually based on the best
guess the user can produce. This might be an already good design that
the optimization refines further by making small changes. Another
possibility is that the initial guess is a bad design or a “blank slate” that
the optimization changes significantly.

Example 1.1: Design variables for wing design.

Consider a wing design problem where the wing planform shape is rect-
angular. The planform could be parametrized by the span (𝑏) and the chord
(𝑐) as seen in Fig. 1.5, so that 𝑥 = [𝑏, 𝑐]. However, this choice is not unique.
Figure 1.5: Wing span (𝑏) and chord (𝑐).

Figure 1.6: Wing design space for two different sets of design variables, 𝑥 = [𝑏, 𝑐] and 𝑥 = [𝑆, 𝐴𝑅].

Two other variables are often used in aircraft design: wing area (𝑆) and wing
aspect ratio (𝐴𝑅), also shown in the figure. Because these variables are not
independent (𝑆 = 𝑏𝑐 and 𝐴𝑅 = 𝑏²/𝑆), we cannot just add them to the set
of design variables. Instead, we must pick any two variables out of the four
to parametrize the design because we have four possible variables and two
dependency relationships.
For this wing, the variables must be positive to be physically meaningful,
so we must remember to explicitly bound these variables to be greater than
zero in an optimization. The variables should be bound from below by small
positive values because numerical models are probably not prepared to take
zero values. No upper bound is needed unless the optimization algorithm
requires it.
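As a small illustration of the dependency between these two parametrizations, the following sketch (not from the book; the function names and numbers are arbitrary) converts between 𝑥 = [𝑏, 𝑐] and 𝑥 = [𝑆, 𝐴𝑅] for a rectangular planform:

```python
# Convert between the two wing parametrizations from Ex. 1.1:
# x = [b, c] (span and chord) and x = [S, AR] (area and aspect ratio).

def to_area_aspect_ratio(b, c):
    """Given span b and chord c of a rectangular wing, return (S, AR)."""
    S = b * c          # wing area
    AR = b**2 / S      # aspect ratio (equal to b/c for a rectangular wing)
    return S, AR

def to_span_chord(S, AR):
    """Given area S and aspect ratio AR, recover (b, c)."""
    b = (AR * S) ** 0.5
    c = S / b
    return b, c

# Round-trip check with arbitrary values
b, c = 10.0, 1.0
S, AR = to_area_aspect_ratio(b, c)     # (10.0, 10.0)
print(to_span_chord(S, AR))            # (10.0, 1.0)
```

Either pair fully determines the rectangular planform, which is why only two of the four quantities can be chosen as independent design variables.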

1.2.2 Objective Function


To find the best design, we need a quantifiable criterion to determine if
one design is better than another—the objective function. The objective
function must be a scalar that is computable for a given design variable
vector 𝑥. The objective function can be minimized or maximized,

depending on the problem. For example, a designer might want to


minimize the weight or cost of a given structure. An example of a
function to be maximized could be the range of a vehicle.
The convention adopted in this book is that the objective function, 𝑓 ,
is to be minimized. This convention does not prevent us from maximizing
a function, since we can reformulate it as a minimization problem by
finding the minimum of the negative of 𝑓 and then changing the sign,
that is:
$\max[f(x)] = -\min[-f(x)]. \qquad (1.3)$

This transformation is illustrated in Fig. 1.7. (Inverting the function, 1/𝑓,
is also a possible way to turn a maximization problem into a minimization
one, but it is generally less desirable because it alters the scale of the
problem and could introduce a divide-by-zero problem.)

Figure 1.7: A maximization problem can be transformed into an equivalent minimization one.
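In code, this convention simply means wrapping the function to be maximized. A minimal sketch (the toy function is illustrative only, not from the book):

```python
# To maximize f with a minimizer, minimize -f and flip the sign of the result.
def f(x):
    return -(x - 3.0) ** 2 + 5.0   # toy function with maximum 5 at x = 3

def neg_f(x):
    return -f(x)                   # pass this to any minimization routine

# If a minimizer returns x_star for neg_f, then max f = -min(-f) = -neg_f(x_star).
x_star = 3.0                       # minimizer of neg_f (known analytically here)
print(-neg_f(x_star))              # recovers the maximum value, 5.0
```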
The computation of the objective function is done through a numerical
model whose complexity can range from a simple explicit equation
to a system of coupled implicit models (more on this in Chapter 3).
The choice of objective function is crucial for successful design
optimization. If the function does not represent the true intent of the
designer, it does not matter how precisely the function and its optimum
point is computed—the mathematical optimum will be non-optimal
from the engineering point of view. A bad choice of objective function
is a common mistake in design optimization.
The choice of objective function is not always obvious. For example,
minimizing the weight of a vehicle might sound like a good idea, but
this might result in a vehicle that is too expensive to manufacture. In this
case, manufacturing cost would probably be a better objective. However,
there is a tradeoff between manufacturing cost and the efficiency of the
vehicle. It might not be obvious which of these objectives is the most
appropriate one because this trade depends on customer preferences.
equivalent minimization one.
This issue motivates multiobjective optimization, which is the subject of
Chapter 9. Multiobjective optimization does not yield a single design
but rather a range of designs that settle for different tradeoffs between
the objectives.
Experimenting with different objectives should be part of the design
exploration process (this is represented by the outer loop in the design
optimization process in Fig. 1.2). Results from optimizing the “wrong”
objective can still yield insights into the design tradeoffs and trends for
the system at hand.

Example 1.2: Objective function for wing design.

Let us consider the appropriate objective function for Ex. 1.1. A common
objective for a wing is to minimize drag. However, this does not take into
account the propulsive efficiency, which is strongly affected by speed. A better
objective might be to minimize the required power, which balances drag and
propulsive efficiency.

The contours for the required power are shown in Fig. 1.8 for the two
choices of design variable sets discussed in Ex. 1.1. We can locate the minimum
graphically (denoted by the dot). While the two optimum solutions are the
same, the shapes of the objective function contours are different. In this case,
using aspect ratio and wing area simplifies the relationship between the design
variables and the objective by aligning the two main curvature trends with
each design variable.

Figure 1.8: Required power contours for two different choices of design variable sets. The optimal wing is the same for both cases, but the functional form of the objective is simplified in the bottom one.

In this case, the optimal wing has an aspect ratio that is much higher
than typically seen in aircraft or birds. While the high aspect ratio increases
aerodynamic efficiency, it adversely affects the structural strength, which we
did not consider here. Thus, as in most engineering problems, we need to add
constraints.

While we can sometimes visualize the variation of the objective


function as in Ex. 1.2, this is not possible for problems with more design
variables or more computationally demanding function evaluations.
This motivates numerical optimization algorithms, which aim to find
the minimum in a multidimensional design space using as few function
evaluations as possible.

1.2.3 Constraints
The vast majority of practical design optimization problems require the
enforcement of constraints. These are functions of the design variables
that we want to restrict in some way. Like the objective function,
constraints are computed through a model whose complexity can vary
widely. The feasible region is the set of points that satisfy all constraints.
We seek to minimize the objective function within this feasible design
space.
When we restrict a function to being equal to a fixed value, we call
this an equality constraint, denoted by ℎ(𝑥) = 0. When the function
is required to be less than or equal to a certain value, we have an

inequality constraint, denoted by 𝑔(𝑥) ≤ 0. (A strict inequality, 𝑔(𝑥) < 0,
is never used because then 𝑥 could be arbitrarily close to the equality.
Since the optimum is at 𝑔 = 0 for an active constraint, the exact solution
would then be ill-defined from a mathematical perspective. Also, the
difference is not meaningful when using finite-precision arithmetic,
which is always the case when using a computer.) While we use “less
or equal” by convention, you should be aware that some other texts
and software programs use “greater or equal” instead. There is no loss
of generality with either convention, as we can always multiply the
constraint by −1 to convert between the two.
Tip 1.3: Check the inequality convention.

When using optimization software, do not forget to check the convention


for the inequality constraints (is it “less than” or “greater than”?) and convert
your constraints as needed.
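For instance, if your model returns constraints in the 𝑔(𝑥) ≤ 0 convention used in this book but a solver expects 𝑔(𝑥) ≥ 0, a thin wrapper is enough to convert between conventions. A minimal, hypothetical sketch (function names and the constraint itself are illustrative):

```python
# Model returns constraints in the "less than or equal to zero" convention.
def g_leq(x):
    return [x[0] + x[1] - 10.0]        # feasible when x0 + x1 <= 10

# Wrapper for a solver that expects "greater than or equal to zero".
def g_geq(x):
    return [-gi for gi in g_leq(x)]    # multiply each constraint by -1
```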

Some texts omit the equality constraints without loss of generality


because an equality constraint can be replaced by two inequality con-
straints. More specifically, an equality constraint, ℎ(𝑥) = 0, is equivalent
to the two inequality constraints ℎ(𝑥) ≥ 0 and ℎ(𝑥) ≤ 0.
Inequality constraints can be active or inactive at the optimum point.
An active constraint means that 𝑔(𝑥∗) = 0, whereas for an inactive one, 𝑔(𝑥∗) < 0. If
a constraint is inactive at the optimum, this constraint could have been
removed from the problem with no change in its solution, as illustrated
in Fig. 1.9. In this case, constraints 𝑔2 and 𝑔3 can be removed without
affecting the solution of the problem. Furthermore, active constraints
(𝑔1 in this case) can equivalently be replaced by equality constraints.
However, it is difficult to know in advance which constraints are active
or not at the optimum for a general problem. Constrained optimization
is the subject of Chapter 5.

Figure 1.9: Example of a two-dimensional problem with one active and two inactive inequality constraints (left). The red highlighted area indicates regions that are infeasible (i.e., the constraints are violated). If we only had the single active equality constraint in the formulation, we would obtain the same result (right).

It is possible to over-constrain the problem such that there is no


solution. This can happen due to a programming error but can also
occur at the problem formulation stage. For more complicated design
problems, it might not be possible to satisfy all the specified constraints,

even if they seem to make sense. When this happens, constraints have
to be relaxed or removed.
The problem must not be over-constrained, or else there is no feasible
region in the design space over which the function can be minimized.
Thus, the number of independent equality constraints must be less
than or equal to the number of design variables (𝑛ℎ ≤ 𝑛𝑥). There is no limit
on the number of inequality constraints. However, they must be such
that there is a feasible region, and the number of active constraints plus
the equality constraints must still be less than or equal to the number of
design variables.
The feasible region grows when constraints are removed and shrinks
when constraints are added (unless these constraints are redundant).
As the feasible region grows, the optimum objective function usually
improves, or at least stays the same. Conversely, the optimum worsens
or stays the same when the feasible region shrinks.
One common issue in optimization problem formulation is distin-
guishing objectives from constraints. For example, we might be tempted
to minimize the stress in a structure, but this would inevitably result in
an overdesigned heavy structure. Instead, we might want minimum
weight (or cost) with sufficient safety factors on stress, which can be
enforced by an inequality constraint.
Most engineering problems require constraints—often a large num-
ber of them. While constraints may at first appear limiting, they are
what enable the optimizer to find useful solutions.
As previously mentioned, some algorithms require the user to
provide an initial guess for the design variable values. While it is easy
to assign values within the bounds, it might not be as easy to ensure
that the initial design satisfies the constraints. This is not an issue for
most optimization algorithms, but some require starting with a feasible
design.

Example 1.4: Constraints for wing design.

We now add a design constraint for the power minimization problem of
Ex. 1.2. The unconstrained optimal wing has unrealistically high aspect ratios
because we did not include structural considerations. If we add an inequality
constraint on the bending stress at the root of the wing for a fixed amount
of material, we get the curve and feasible region shown in Fig. 1.10. The
unconstrained optimum violates this constraint. The constrained optimum results
in a lower span and higher chord, and the constraint is active.

Figure 1.10: Minimum power wing with a constraint on bending stress compared to the unconstrained case.

As previously mentioned, it is not possible in general to visualize
the design space as shown in Ex. 1.2 and obtain the solution graphically.

In addition to the possibility of a large number of design variables and


computationally expensive objective function evaluations, we now add
the possibility of a large number of constraints that are also expensive
to evaluate. Again, this is further motivation for the optimization
techniques covered in this book.

1.2.4 Optimization Problem Statement


Now that we have discussed the definition of design variables, ob-
jective function, and constraints, we can put them all together in an
optimization problem statement. In words, this statement is: Minimize
the objective function by varying the design variables within their bounds
subject to the constraints. (Instead of “by varying”, some textbooks
write “with respect to”.) Mathematically, we write this as:

$$
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{by varying} \quad & \underline{x}_i \le x_i \le \overline{x}_i, & i = 1, \ldots, n_x \\
\text{subject to} \quad & g_j(x) \le 0, & j = 1, \ldots, n_g \\
& h_k(x) = 0, & k = 1, \ldots, n_h
\end{aligned}
\qquad (1.4)
$$

While this is the standard formulation used in this book, other books
and software manuals might differ from this. For example, they might
use different symbols, use “greater or equal than” for the inequality
constraint, or maximize instead of minimizing. In any case, it is possible
to convert between standard formulations to get equivalent problems.
All single objective, continuous optimization problems can be writ-
ten in this form. Although our target applications are engineering
design problems, many other problems can be stated in this form, and
thus, the methods covered in this book can be used to solve those
problems.
The values of the objective and constraint functions for a given set
of design variables are computed through the analysis, which consists
of one or more numerical models. The analysis must be fully automatic
so that multiple optimization cycles can be completed without human
intervention, as shown in Fig. 1.11. The optimizer usually requires
an initial design 𝑥⁽⁰⁾ and then queries the analysis for a sequence of
designs until it finds the optimum design, 𝑥∗.

Figure 1.11: The analysis computes the objective (𝑓) and constraint values (𝑔, ℎ) for a given set of design variables (𝑥).

Tip 1.5: Using an optimization software package.

The setup of an optimization problem varies depending on the particular


software package, so read the documentation carefully. Most optimization
software requires you to define the objective and constraints as callback functions.

These are passed to the optimizer, which will call them back as needed during
the optimization process. The functions take the design variable values as
inputs and output the function values, as shown in Fig. 1.11. Study the software
documentation for the details of how to use it. (Possible software includes
fmincon in Matlab, scipy.optimize.minimize with the SLSQP method in
Python, Optim.jl with the IPNewton method in Julia, and the Solver add-in
in Microsoft Excel.) To make sure you understand how to use a given
optimization package, test it on simple problems for which you know the
solution first (see Prob. 1.5).
When the optimizer queries the analysis for a given 𝑥, the constraints
do not have to be feasible. The optimizer is responsible for changing 𝑥
so that the constraints are satisfied.
The objective and constraint functions must depend on the design
variables; if a function does not depend on any variable in the whole
domain, it can be ignored and should not appear in the problem
statement.
Ideally, 𝑓 , 𝑔, and ℎ should be computable for all values of 𝑥 that
make physical sense. Lower and upper design variable bounds should
be set to avoid non-physical designs as much as possible. Even after
taking this precaution, models in the analysis sometimes fail to provide
a solution. A good optimizer can handle such eventualities gracefully.
There are some mathematical transformations that do not change
the solution of the optimization problem (Eq. 1.4). Multiplying either
the objective or the constraints by a positive constant does not change the optimal
design; it only changes the optimum objective value. Adding a constant
to the objective does not change the solution, but adding a constant to
any constraint changes the feasible space and can change the optimal
design.
Determining an appropriate set of design variables, objective, and
constraints is a crucial aspect of the outer loop shown in Fig. 1.2,
which requires human expertise in engineering design and numerical
optimization.

Tip 1.6: Ease into the problem.

It is tempting to set up the full problem and attempt to solve it right


away. This rarely works, especially for a new problem. Before attempting any
optimization, you should run the analysis models and explore the solution
space manually. Particularly if using gradient-based methods, it helps to plot
the output functions across sweeps of the inputs to assess whether the numerical
outputs display the expected behavior and smoothness (see the sketch after this
tip). Instead of solving
the full problem, ease into it by setting up the simplest subproblem possible.
If the function evaluations are costly, consider using computational models
that are less costly (but still representative). It is advisable to start by solving
a subproblem with a small set of variables and then gradually expanding it.

Removing some constraints has to be done more carefully because it might


result in an ill-defined problem. For multidisciplinary problems, you should
run optimizations with each component separately before attempting to solve
the coupled problem.
Solving simple problems for which you know the answer (or at least
problems for which you know the trends) helps identify any issues with
the models and problem formulation. Solving a sequence of increasingly
complicated problems gradually builds an understanding of how to solve the
optimization problem and interpret its results.
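A one-dimensional input sweep of the kind suggested in the tip above can be as simple as the following sketch (the model function, the fixed value, and the range swept are placeholders, not from the book):

```python
import numpy as np
import matplotlib.pyplot as plt

def model_output(x1, x2_fixed=2.0):
    """Placeholder for an analysis output, e.g., an objective or constraint value."""
    return np.sin(x1) + 0.1 * (x1 - x2_fixed) ** 2

# Sweep one design variable while holding the others fixed,
# and inspect the output for expected trends, noise, and discontinuities.
x1_sweep = np.linspace(0.0, 10.0, 200)
outputs = [model_output(x1) for x1 in x1_sweep]

plt.plot(x1_sweep, outputs)
plt.xlabel("design variable x1")
plt.ylabel("model output")
plt.show()
```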

1.3 Optimization Problem Classification

To choose the most appropriate optimization algorithm for solving a


given optimization problem, we must classify the optimization prob-
lem and know how its attributes affect the efficacy and suitability of
the available optimization algorithms. This is important because no
optimization algorithm is efficient or even appropriate for all types of
problems.
We classify optimization problems based on two main aspects:
the problem formulation and the characteristics of the objective and
constraint functions, as shown in Fig. 1.12.
The classification based on problem formulation was already dis-
cussed in Section 1.2. The design variables can be either discrete or
continuous. Most of this book assumes continuous design variables,
but Chapter 8 provides an introduction to discrete optimization. When
the design variables include both discrete and continuous variables,
the problem is said to be mixed. Most of the book assumes a single
objective function, but we explain how to solve multiobjective prob-
lems in Chapter 9. Finally, as previously mentioned, unconstrained
problems are rare in engineering design optimization. However, we
explain unconstrained optimization algorithms (Chapter 4) because
they provide the foundations for constrained optimization algorithms
(Chapter 5).
The characteristics of the objective and constraint functions also
determine the type of optimization problem at hand and ultimately
limit the type of optimization algorithm that is appropriate to solve the
optimization problem.
In this section, we will view the function as a “black box”, that is, a
computation for which we only see inputs (including the design vari-
ables) and outputs (including objective and constraints), as illustrated
in Fig. 1.13.

Figure 1.13: A model is considered a black box when we only see its inputs and outputs.
Figure 1.12: Optimization problems can be classified by attributes associated with the different aspects of the problem. The two main aspects are the problem formulation (design variables: continuous, discrete, or mixed; objective: single or multiobjective; constraints: constrained or unconstrained) and the objective and constraint function characteristics (smoothness: continuous or discontinuous; linearity: linear or nonlinear; modality: unimodal or multimodal; convexity: convex or non-convex; stochasticity: deterministic or stochastic).

When dealing with black-box models, there is limited or no un-


derstanding of the modeling and numerical solution process used to
obtain the function values. We discuss the types of models and how to
solve them in Chapter 3, but here we can still characterize the functions
based purely on their outputs.
The black-box view is common in real-world applications. This
might be because the source code is not provided, the modeling
methods are not described, or simply because the user does not bother
to understand them.
In the remainder of this section, we discuss the attributes of objec-
tives and constraints shown in Fig. 1.12. Strictly speaking, many of these
attributes cannot typically be identified from a black-box model. For
example, while the model may appear smooth, we cannot know that it
is smooth everywhere without a more detailed inspection. However,
for this discussion, we assume that the black box’s outputs can be
exhaustively explored so that these characteristics can be identified.

1.3.1 Smoothness
The degree of function smoothness with respect to variations in the
design variables depends on the continuity of the function values and
their derivatives. When the value of the function varies continuously,
the function is said to be 𝐶 0 continuous. If the first derivatives also vary
continuously, then the function is 𝐶 1 continuous, and so on. A function
is smooth when the derivatives of all orders vary continuously every-
where in its domain. Function smoothness with respect to continuous
design variables affects what type of optimization algorithm can be
used. Figure 1.14 shows one-dimensional examples for a discontinuous,
𝐶 0 function, and 𝐶 1 function.
As we will see later, discontinuities in the function value or deriva-
tives limit the type of optimization algorithm that can be used because
some algorithms assume 𝐶 0 , 𝐶 1 , and even 𝐶 2 continuity. In practice,
these algorithms usually still work with functions that have only a few
discontinuities that are located away from the optimum.

Figure 1.14: Discontinuous function (top), 𝐶 0 continuous function (middle), and 𝐶 1 continuous function (bottom).
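As a simple illustration of these continuity classes (this example is not from the original text):

$$
f_1(x) = |x| \text{ is } C^0 \text{ but not } C^1 \text{ (its derivative jumps at } x = 0\text{)}, \qquad
f_2(x) = x\,|x| \text{ is } C^1 \text{ but not } C^2 \text{ (} f_2''(x) = 2\,\mathrm{sign}(x) \text{ jumps at } x = 0\text{)}.
$$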

1.3.2 Linearity

The functions of interest could be linear or nonlinear. When both the


objective and constraint functions are linear, the optimization problem
is known as a linear optimization problem. These problems are easier
to solve than general nonlinear ones, and there are entire books and
courses dedicated to the subject. The first numerical optimization
algorithms were developed to solve linear optimization problems, and
there are many applications in operations research (see Chapter 2). An
example of a linear optimization problem is shown in Fig. 1.15.

Figure 1.15: Example of a linear optimization problem in two dimensions.

When the objective function is quadratic, and the constraints are
linear, we have a quadratic optimization problem, which is another
type of problem for which specialized solution methods exist. (Historically,
optimization problems were referred to as “programming” problems, so
much of the existing literature refers to these as “linear programming”
and “quadratic programming”.) Linear optimization and quadratic
optimization are covered in Sections 11.2 and 11.3.
While many problems can be formulated as linear or quadratic
problems, most engineering design problems are nonlinear. However,
it is common to have at least a subset of constraints that are linear, and
some general nonlinear optimization algorithms take advantage of the
techniques developed to solve linear and quadratic problems.

1.3.3 Multimodality and Convexity

Functions can be either unimodal or multimodal. Unimodal functions
have a single minimum, while multimodal functions have multiple

minima. When we find a minimum without knowledge of whether the


function is unimodal or not, we can only say that it is a local minimum,
that is, this point is better than any point within a small neighborhood.
When we know that a local minimum is the best in the whole domain
(because we somehow know that the function is unimodal), then this
is also the global minimum, as illustrated in Fig. 1.16. Sometimes, the
function might be flat around the minimum, in which case we have a
weak minimum.
For functions involving more complicated numerical models, it is
usually impossible to prove that the function is unimodal. Proving
that such a function is unimodal would require evaluating the function
at every point in the domain, which is computationally prohibitive.
However, it is much easier to prove multimodality—we just need to find
two distinct local minima.

Figure 1.16: Types of minima (global minimum, local minimum, and weak local minimum).

Just because a function is complicated or the design space has many
dimensions, it does not mean that the function is multimodal. By

default, we should not assume that a given function is either unimodal


or multimodal. As we explore the problem and solve it starting from
different points or using different optimizers, there are two main
possibilities.
One possibility is that we find more than one minimum, thus
proving that the function is multimodal. To prove this conclusively, we
must make sure that the minima do indeed satisfy the mathematical
optimality conditions with good enough precision.
The other possibility is that the optimization consistently converges
to the same optimum. In this case, we can become increasingly confi-
dent that the function is unimodal with every new optimization that
converges to the same optimum. For example, He et al. (Robust aerodynamic
shape optimization—from a circle to an airfoil, 2019) show consistent convergence
to the same optimum in an aerodynamic shape optimization problem.
Often, we need not be too concerned about the possibility of multiple
local minima. From an engineering design point of view, achieving a
local optimum that is better than the initial design is already a useful
result.
Convexity is a concept related to multimodality. A function is
convex if all line segments connecting any two points on the function
lie above the function and never intersect it. Convex functions are
always unimodal. Also, all multimodal functions are non-convex, but
not all unimodal functions are convex (see Fig. 1.17).

Figure 1.17: Multimodal functions have multiple minima, while unimodal functions have only one minimum. All multimodal functions are non-convex, but not all unimodal functions are convex.

Convex optimization seeks to minimize convex functions over convex
sets. Like linear optimization, convex optimization is another
subfield of numerical optimization with many applications. When the
objective and constraints are convex functions, we can use specialized
formulations and algorithms that are much more efficient than general
nonlinear algorithms to find the global optimum, as explained in
Chapter 11.

1.3.4 Deterministic Versus Stochastic


Some functions are inherently stochastic. A stochastic model will yield
different function values for repeated evaluations with the same input
(Fig. 1.18). For example, the numerical value from a roll of dice is a
stochastic function.
Stochasticity can also arise from deterministic models when the
inputs are subject to uncertainty. The input variables are then described
as probability distributions, and their uncertainties need to be propa-
gated through the model. For example, the bending stress in a beam
may follow a deterministic model, but the beam’s geometric properties
may be subject to uncertainty because of manufacturing deviations. For
most of this text, we assume that functions are deterministic except in
Chapter 12, where we explain how to perform optimization where the
model inputs are uncertain.

Figure 1.18: Deterministic functions yield the same output when evaluated repeatedly for the same input, while stochastic functions do not.

1.4 Optimization Algorithms
No single optimization algorithm is effective or even appropriate for
all possible optimization problems. This is why it is important to
understand the problem before deciding which optimization algorithm
to use. By “effective” algorithm, we mean, first, that the algorithm is capable
of solving the problem at all and, second, that it does so reliably and effi-
ciently. Fig. 1.19 lists the attributes for the classification of optimization
algorithms, which we discuss in more detail below. These attributes
are often amalgamated, but they are independent and any combination
is possible. In this text, we cover a wide variety of optimization algo-
rithms corresponding to several of these combinations. However, this
overview still does not cover a wide variety of specialized algorithms
designed to solve specific problems where a particular structure can be
exploited.
When multiple disciplinary models are involved, we also need to
consider how the models are coupled, solved, and integrated with the
optimizer. These considerations lead to different MDO architectures,
which may involve multiple levels of optimization problems. Coupled
models are introduced in Section 13.3, while MDO architectures are
covered in Chapter 13.
Figure 1.19: Optimization algorithms can be classified by using the attributes on the rightmost column. As in the problem classification, these attributes are independent, and any combination is possible. [The attributes and their options are: order of information (zeroth, first, second); search (local, global); optimality criterion (mathematical, heuristic); iteration procedure (mathematical, heuristic); function evaluation (direct, surrogate model); stochasticity (deterministic, stochastic); time dependence (static, dynamic).]

1.4.1 Order of Information


At a minimum, an optimization algorithm requires users to provide
the models that compute the objective and constraint values—zeroth
order information—for any given set of allowed design variables. We
call algorithms that use just these function values gradient-free algorithms
(also known as derivative-free or zeroth-order algorithms). We cover a
selection of these algorithms in Chapter 7. The advantage of gradient-
free algorithms is that the optimization is easier to set up because they
do not need additional computations other than what the models for
the objective and constraints already provide.
Gradient-based algorithms use gradients of both the objective and
constraint functions with respect to the design variables—first order in-
formation. We first cover gradient-based algorithms for unconstrained
problems in Chapter 4 and then extend them to constrained problems
in Chapter 5. The gradients provide much richer information about the
function behavior, which the optimizer can use to converge to the opti-
mum more efficiently. In addition, the gradients are used to establish
whether the optimizer converged to a point that satisfies mathematical
optimality conditions, something that is difficult to verify in a rigor-
ous way without gradients. Gradient-based algorithms also include
algorithms that use curvature—second-order information. Curvature

is even richer information that tells us the rate of the change in the
gradient, which provides an idea of where the function will flatten out.
There is a distinction between the order of information provided
by the user and the order of information that is actually used in the
algorithm. For example, a user might only provide function values
to a gradient-based algorithm and rely on the algorithm to internally
estimate gradients by requesting additional function evaluations and
using finite differences (see Section 6.4). Gradient-based algorithms
can also internally estimate curvature based on gradient values (see
Section 4.4.4).
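To illustrate how an optimizer can build first-order information from function values alone, the following is a minimal sketch of a forward finite-difference gradient estimate (not from the text; the example function, point, and step size h are illustrative choices, and choosing h well is discussed in Section 6.4):

    import numpy as np

    def forward_difference_gradient(f, x, h=1e-6):
        # Estimate the gradient of f at x using only function values.
        f0 = f(x)
        g = np.zeros(len(x))
        for i in range(len(x)):
            xp = np.array(x, dtype=float)
            xp[i] += h  # perturb one variable at a time
            g[i] = (f(xp) - f0) / h
        return g

    # Example: for f(x) = x1^2 + 3 x2^2, the gradient at (1, 2) is (2, 12)
    print(forward_difference_gradient(lambda x: x[0]**2 + 3 * x[1]**2, [1.0, 2.0]))

Each gradient estimate costs one additional function evaluation per design variable, which is one reason this internal approach does not scale well to problems with many variables.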
In theory, gradient-based algorithms require the functions to be
sufficiently smooth (at least 𝐶 1 continuous). However, in practice, they
can tolerate the occasional discontinuity, as long as this discontinuity
does not happen to be at the optimum point.
We devote a considerable portion of this book to gradient-based
algorithms because they generally scale better to problems with many
design variables, and they have rigorous mathematical criteria for
optimality. We also cover the various approaches for computing
gradients in detail because the accurate and efficient computation of
these gradients is crucial for the efficacy and efficiency of these methods
(see Chapter 6).
Current state-of-the-art optimization algorithms also use second-
order information to implement Newton-type methods for second-
order convergence. However, these algorithms tend to build second-
order information based on the provided gradients, as opposed to
requiring users to provide the second-order information directly (see
Section 4.4.4).
Because gradient-based methods require accurate gradients and
smooth enough functions, they require more knowledge about the mod-
els and optimization algorithm than gradient-free methods. Chapters 3
through 6 are devoted to making the power of gradient-based methods
more accessible by providing the necessary theoretical and practical
knowledge.

1.4.2 Local Versus Global Search


The many ways to search the design space can be classified as being local
or global. Local search takes a series of steps starting from a single point
to form a trail of points that hopefully converges to a local optimum.
In spite of the name, local methods can traverse large portions of the
design space and can even step between convex regions (although this
happens by chance). Global search tries to span the whole design space

in the hopes of finding the global optimum. As previously mentioned


when discussing multimodality, even when using a global method,
we cannot prove that any optimum found is a global one except for
particular cases.
The local versus global search classification often gets conflated with
the gradient-based versus gradient-free attributes because gradient-
based methods usually perform a local search. However, these should
be viewed as independent attributes because it is possible to use a
global search strategy to provide starting points for a gradient-based
algorithm. Similarly, some gradient-free algorithms are based on local
search strategies.
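As a minimal sketch of combining the two (not from the text; the test function, number of starts, and use of SciPy’s minimize function are illustrative assumptions), one can launch a local gradient-based search from several random starting points and keep the best result:

    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        # A simple multimodal one-dimensional function (illustrative choice)
        return np.sin(3 * x[0]) + (x[0] - 0.5)**2

    rng = np.random.default_rng(0)
    best = None
    for x0 in rng.uniform(-3.0, 3.0, size=(10, 1)):  # 10 random starting points
        res = minimize(f, x0)                         # local search from each start
        if best is None or res.fun < best.fun:
            best = res
    print(best.x, best.fun)

This does not guarantee finding the global optimum, but it increases the likelihood of doing so compared with a single local search.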
The choice of search type is intrinsically linked to the modality of
the design space. If the design space is unimodal, then a local search
will be sufficient, and it will converge to the global optimum. If the
design space is multimodal, a local search will converge to an optimum
that might be local (or global if we are lucky enough). A global search
will increase the likelihood that we converge to a global optimum, but
this is by no means guaranteed.

1.4.3 Mathematical Versus Heuristic


There is a big divide in how much of an algorithm’s iterative process
and optimality criteria are based on provable mathematical principles
versus heuristics. The iterative process determines the sequence of
points that get evaluated in seeking the optimum, while the optimality
criteria determine when this iterative process ends. Heuristics are rules
of thumb or common sense arguments that are not based on a strict
mathematical rationale.
Gradient-based algorithms are usually based on mathematical prin-
ciples both for the iterative process and for the optimality criteria.
Gradient-free algorithms are more evenly split between the mathemati-
cal and heuristic for both optimality criteria and iterative procedure.
The mathematical ones are often classified as derivative-free optimization
algorithms. Heuristic gradient-free algorithms include a wide variety
of nature-inspired algorithms (see Section 7.2).
Heuristic optimality criteria are an issue because, strictly speaking,
they do not prove a given point is a local (let alone global) optimum; they
are only expected to find a point that is “close enough”. This contrasts
with mathematical optimality criteria, which are unambiguous about
(local) optimality and converge to the optimum within the limits of
the working precision. The mathematical criteria usually require the
gradients of the objective and constraints. This is not to suggest that

heuristic methods are not useful. Finding a better solution is often


desirable regardless of whether or not it is strictly optimal.
Iterative processes based on mathematical principles tend to be
more efficient than those based on heuristics. However, some heuristic
methods are more robust because they tend to make fewer assumptions
about the modality and smoothness of the functions and handle noisy
functions more effectively.
Algorithms often mix mathematical arguments and heuristics to
some degree. Most mathematical algorithms include constants whose
values end up being tuned based on experience. Conversely, algo-
rithms primarily based on heuristics sometimes include steps with
mathematical justification.

1.4.4 Function Evaluation


The optimization problem setup that we described above assumes that
the function evaluations are obtained by solving numerical models of
the system. We call these direct function evaluations. However, it is
possible to create a surrogate model (also known as a metamodel) of these
models and use them in the optimization process. These surrogate
models can be interpolation-based or projection-based. Surrogate-
based optimization is discussed in Chapter 10.

1.4.5 Stochasticity
This attribute is independent of the stochasticity of the model that
we mentioned previously, and it is strictly related to whether the
optimization algorithm itself contains steps that are determined at
random or not.
A deterministic optimization algorithm always evaluates the same
points and converges to the same result given the same initial conditions.
In contrast, a stochastic optimization algorithm evaluates a different set
of points if run multiple times from the same initial conditions, even
if the models for the objective and constraints are deterministic. For
example, most evolutionary algorithms include steps determined by
generating random numbers. Gradient-based algorithms are usually
deterministic, but some exceptions exist, such as stochastic gradient
descent from the machine learning community (see Section 10.5).
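To make the distinction concrete, here is a minimal sketch (not from the text) of stochastic gradient descent on a synthetic least-squares problem; the data, minibatch size, and step size are arbitrary illustrative choices. Without a fixed random seed, repeated runs evaluate different sequences of points even though the underlying model is deterministic:

    import numpy as np

    rng = np.random.default_rng()           # no fixed seed: each run differs
    A = rng.normal(size=(200, 3))           # synthetic data
    b = A @ np.array([1.0, -2.0, 0.5])      # data generated from a known solution

    x = np.zeros(3)
    for k in range(1000):
        idx = rng.integers(0, 200, size=20)                # random minibatch
        g = (2.0 / 20) * A[idx].T @ (A[idx] @ x - b[idx])  # minibatch gradient
        x -= 0.01 * g                                      # fixed step size
    print(x)  # close to (1, -2, 0.5), but the path there is random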

1.4.6 Time Dependence


In this book, we assume that the optimization problem is static, where
this means that we can solve the complete numerical model at each op-
timization iteration. For some problems that involve time dependence,

we can perform the time integration to solve for the full-time history
of the states and then compute the objective and constraint function
values for an optimization iteration. This means that every optimization
iteration requires solving for the complete time history. An example
of this type of problem is a trajectory optimization problem where the design variables are the coordinates representing the path, and the objective is to minimize the total energy expended to get to a given destination.2
2. Betts, Survey of numerical methods for trajectory optimization. 1998
Although such a problem involves a time dependence, we solve a single optimization problem, so we still classify such a problem as static.
For another class of time-dependent optimization problems, how-
ever, we solve for a sequence or a time history of decisions at different
time instances because we must make decisions as time progresses.
These are called dynamic optimization problems (also known as dynamic programming).3,4
3. Bryson et al., Applied Optimal Control; Optimization, Estimation, and Control. 1969
4. Bertsekas, Dynamic programming and optimal control. 1995
In such problems, the design variables are the sequence of decisions, and the decision at a given time instance is influenced by the decisions made in the previous time instances. An
example of a dynamic optimization problem would be to optimize the
throttle, braking, and steering of a car at each time instance such that
the overall time in a racecourse is minimized. This is an example of
an optimal control problem, a type of dynamic optimization problem
where a control law is optimized for a dynamical system over a period
of time. Dynamic optimization is not broadly covered in this book,
except in the context of discrete optimization (see Section 8.5). Different
approaches are used in general, but many of the concepts covered here
are instrumental in the numerical solution of optimal control problems.

1.5 Selecting an Optimization Approach

This section provides guidance on how to select an appropriate approach


for solving a given optimization problem. This process cannot always
be distilled to a simple decision tree; however, it is still helpful to have a
framework as a first guide. Many of these decisions will become more
apparent as you progress through the book and gain experience, so
you may want to revisit this section periodically. Eventually, selecting
an appropriate methodology will become second nature.
Figure 1.20 outlines one approach to algorithm selection and simul-
taneously serves as an overview of the chapters in this book. These
first two characteristics (convex problem and discrete variables) are
not the most common within the broad spectrum of engineering opti-
mization problems, but they are the most restrictive in terms of usable
optimization algorithms, so we list them first.
Figure 1.20: Decision tree for selecting an optimization algorithm. [Convex? yes: linear optimization, quadratic optimization, etc. (Ch. 11). No: Discrete? yes: branch and bound if linear, dynamic programming if a Markov chain, otherwise SA or a binary GA (Ch. 8). No: Differentiable (Ch. 6)? yes: BFGS if unconstrained (Ch. 4), SQP or IP if constrained (Ch. 5), with multistart if multimodal. No: gradient free (Ch. 7): DIRECT, GA, PS, etc. if multimodal, otherwise Nelder–Mead. Additional considerations across the bottom: multiple objectives (Ch. 9), noisy or expensive (Ch. 10), uncertainty (Ch. 12), multiple disciplines (Ch. 13).]

The first node asks about convexity. While it is often not immediately apparent if the problem is convex, with some experience, one can usually
discern whether attempting to reformulate in a convex manner is likely
to be possible. In most instances, convexity occurs for problems with
simple objectives and constraints (e.g., linear or quadratic), such as in
control applications where the optimization is performed repeatedly.
A convex problem can be solved with the more general gradient-based
or gradient-free algorithms, but it would be inefficient not to take
advantage of the convex formulation structure if we can do so.
The next node asks about discrete variables. Problems with discrete
variables are much harder to solve, so we often consider techniques
to avoid using discrete variables whenever possible. For example, a
wind turbine position in a field could be posed as a discrete variable
within a discrete set of options. Alternatively, we could represent the
wind turbine position as a continuous variable with two continuous
coordinate variables. That level of flexibility may or may not be desirable
but will almost always lead to better solutions.
Next, we consider if the model is differentiable or if it can be made
differentiable through model improvements. If the problem is high
dimensional (more than a few tens of variables as a rule of thumb),
gradient-free methods are generally intractable. We would need to either make the model differentiable or reduce the dimensionality
of the problem. Another alternative if the problem is not readily
differentiable is to consider surrogate-based optimization (the box
labeled “noisy or expensive”). If we go the surrogate-based optimization

route, we could still use a gradient-based approach to optimize the


surrogate model because most of them are differentiable. Finally, for
problems with a relatively small number of design variables, gradient-
free methods can be a good fit. The largest variety of algorithms are
gradient-free methods, and a combination of experience and testing is
needed to determine an appropriate algorithm for the problem at hand.
The bottom rows list additional considerations that fit within most
of the algorithms: multiple objectives, surrogate-based optimization
for noisy (non-differentiable) or computationally expensive functions,
optimizing under uncertainty in the design variables and other model
parameters, and MDO architectures.

1.6 Notation

By default, a vector 𝑥 is a column vector, and thus 𝑥 𝑇 is a row vector.


For more compact notation, we may write a column vector horizontally with its components separated by commas, e.g., $x = [x_1, x_2, \ldots, x_{n_x}]$.
We do not use bold vectors or matrices. Instead, we follow the
convention of many optimization books, which use Greek symbols
(such as 𝛼 and 𝛽) for scalars, lowercase roman letters (such as 𝑥 and 𝑢)
for vectors, and capitalized roman letters (such as 𝐴 and 𝐻) for matrices.
One of the exceptions is 𝑓 , which is used for scalar functions because
that is common usage and most objectives are scalar. Because of the
wide variety of topics covered in this book and a desire not to deviate
from standard conventions used in some fields, there are exceptions,
which are explicitly noted as needed.
A subscript index, 𝑥 𝑖 , refers to the 𝑖 th element in vector 𝑥. Similarly,
𝐴 𝑖𝑗 is the element at row 𝑖 column 𝑗 in matrix 𝐴. A superscript index in
parentheses refers to an iteration number. Thus, 𝑥 (𝑘) is the complete
vector 𝑥 at iteration 𝑘. A superscript star (𝑥 ∗ ) refers to a quantity at the
optimum.

1.7 Summary

Optimization is compelling, and there are opportunities to apply it ev-


erywhere. Numerical optimization fully automates the design process
but requires expertise in the formulation of the problem, selecting the
appropriate optimization algorithm, and using that algorithm. Finally,
design expertise is also required to interpret and critically evaluate the
optimum results.
There is no single optimization algorithm that is effective in the
solution of all types of problems. It is crucial to classify the optimization

problem and understand the optimization algorithms’ characteristics


to select the appropriate algorithm to solve the problem.
In seeking a more automated design process, we must not dismiss the
value of engineering intuition, which is often difficult (if not impossible)
to convert into a rigid problem formulation and algorithm.

Problems

1.1 Answer true or false and justify your answer.

a) MDO arose from the need to consider multiple design objec-


tives.
b) The preliminary design phase takes place after the concep-
tual design phase.
c) Design optimization is a completely automated process from
which designers can expect to get their final design.
d) The design variables for a problem consist of all the inputs
needed to compute the objective and constraint functions.
e) The design variables must always be independent of each
other.
f) An optimization algorithm that is designed for minimization
can be used to maximize an objective function without
modifying the algorithm.
g) Compared to the global optimum of a given problem, adding
more design variables to that problem results in a global
optimum that is no worse than that of the original problem.
h) Compared to the global optimum objective value of a given
problem, adding more constraints sometimes results in a
better global optimum.
i) A function is 𝐶 1 continuous if its derivative varies continu-
ously.
j) All unimodal functions are convex.
k) Global search algorithms always converge to the global
optimum.
l) Gradient-based methods are largely based on mathematical
principles as opposed to heuristics.
m) Solving a problem that involves a stochastic model requires
a stochastic optimization algorithm.

n) If a problem is multimodal, you should use a gradient-free


optimization algorithm.

1.2 Plotting a one-dimensional function. Consider the one-dimensional


function
$$ f(x) = \frac{1}{12} x^4 + x^3 - 16 x^2 + 4 x + 12 . $$
Plot the function and find the approximate location and classify
the minimum point(s). Exploration: Plot other functions to get
an intuition about their trends and minima. You can start with
simple low-order polynomials and then add higher-order terms,
trying different coefficients. Then you can also try non-algebraic
functions.
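As a starting point for this exercise (a minimal sketch using NumPy and Matplotlib; any plotting tool works, and the plotting range is a guess that you may need to adjust):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-20, 10, 500)
    f = x**4 / 12 + x**3 - 16 * x**2 + 4 * x + 12
    plt.plot(x, f)
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.show()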

1.3 Plotting a two-dimensional function. Consider the two-dimensional


function
$$ f(x_1, x_2) = x_1^3 + 2 x_1 x_2^2 - x_2^3 - 20 x_1 . $$
Plot the function contours and find the approximate location
and classify the minimum point(s). Exploration: Similarly to
the suggested exploration in the previous exercise, try plotting
different two-dimensional functions to get an intuition about the
function trends and minima.

1.4 Convert the following problem to the standard formulation (1.4):

$$
\begin{aligned}
\text{maximize} \quad & 2 x_1^2 - x_1^4 x_2^2 - e^{x_3} + e^{-x_3} + 12 \\
\text{by varying} \quad & x_1, x_2, x_3 \\
\text{subject to} \quad & x_1 \ge 1 \\
& x_2 + x_3 \ge 10 \\
& x_1^2 + 3 x_2^2 \le 4
\end{aligned}
\qquad (1.5)
$$

1.5 Using an unconstrained optimizer. Consider the two-dimensional


function
$$ f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2 x_2 - x_1^2\right)^2 . $$
Plot the contours of this function and find the minimum graphi-
cally. Then, use optimization software to find the minimum (see
Tip 1.5). Verify that the optimizer converges to the minimum you
found graphically. Exploration: 1. Try minimizing the function
from Prob. 1.3 starting from different points. 2. Minimize other
functions of your choosing. 3. Study the options provided by the
optimization software and explore different settings.
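For the software part of this exercise, the following is one possible minimal sketch using SciPy (just one of many tools; see Tip 1.5). The starting point is an arbitrary choice, and the function below assumes the reconstruction of the expression given above:

    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        return (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2

    result = minimize(f, x0=np.array([0.0, 0.0]))  # unconstrained minimization
    print(result.x, result.fun)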

1.6 Using a constrained optimizer. Now we add constraints to Prob. 1.5.


The objective is the same, but we now have two inequality con-
straints:

$$ x_1^2 + x_2^2 \le 1, \qquad x_1 - 3 x_2 + \frac{1}{2} \ge 0, $$
and bound constraints:

𝑥1 ≥ 0, 𝑥2 ≥ 0.

Plot the constraints and identify the feasible region. Find the
constrained minimum graphically. Use optimization software
to solve the constrained minimization problem. Which of the
inequality constraints and bounds are active at the solution?

1.7 Paper review. Select a paper on design optimization that seems


interesting to you, preferably from a peer-reviewed journal. Write
the full optimization problem statement in the standard form (1.4)
for the problem solved in the paper. Classify the problem ac-
cording to Fig. 1.12 and the optimization algorithm according
to Fig. 1.19. Use the decision tree in Fig. 1.20 to determine if
the optimization algorithm was chosen appropriately. Write a
critique of the paper highlighting its strengths and weaknesses.

1.8 Problem formulation. Choose an engineering system that you are


familiar with, and use the process outlined in Fig. 1.4 to formulate
a problem for the design optimization of that system. Write the
statement in the standard form (1.4). Critique your statement by
asking the following: Does the objective function truly capture the
design intent? Are there other objectives that could be considered?
Do the design variables provide enough freedom? Are the design
variables bounded such that non-physical designs are prevented?
Are you sure you have included all the constraints needed to
get a practical design? Can you think of any loophole that the
optimizer can exploit in your statement?
2 A Short History of Optimization
This chapter provides helpful historical context for algorithms dis-
cussed in this book. Nothing else in the book depends on familiarity
with this chapter, so it can be skipped. However, this history makes
connections between the various topics that will enrich the big picture
of optimization as you become familiar with the material in the rest of
the book. Optimization has a long history that started with geometry
problems solved by ancient Greek mathematicians. The invention of
algebra and calculus opened the door to many more problems, and the
advent of numerical computing increased the range of problems that
could be solved in terms of type and scale.

By the end of this chapter you should be able to:

1. Appreciate a range of historical advances in optimization.

2. Describe some current frontiers in optimization.

2.1 The First Problems: Optimizing Length and Area

Ancient Greek and Egyptian mathematicians made numerous contri-


butions to geometry, including solving optimization problems that
involved lengths and areas. They adopted a geometric approach to
solving problems that are now generally more easily solved using
calculus.
Archimedes of Syracuse (287–212 BCE) showed that of all possible
spherical caps of a given surface area, hemispherical caps have the
largest volume. Euclid of Alexandria (325–265 BCE) showed that the
shortest distance between a point and a line is the segment perpendicular
to that line. He also proved that among rectangles of a given perimeter,
the square has the largest area.
Geometric problems involving perimeter and area were of actual
practical value. The classic example of such practicality is Dido’s
problem. According to the legend, Queen Dido, who had fled to Tunis,


purchased from a local leader as much land as could be enclosed by


an ox’s hide. The leader agreed because this seemed like a modest
amount of land. To maximize her land area, Queen Dido had the hide cut into narrow strips to make the longest possible string. Then, she intuitively found the curve that maximizes the area enclosed by the string: a semicircle with the diameter segment set along the sea coast. As a result of the maximization, she acquired enough land to found the ancient city of Carthage. Later, Zenodorus (200–140 BCE) proved this optimal solution using geometrical arguments. A rigorous solution to this problem requires using calculus of variations, which was invented much later (see Section 2.2).

Figure 2.1: Queen Dido intuitively maximized the area for a given perimeter to found the city of Carthage.

Geometric optimization problems are also applicable to laws of
physics. Hero of Alexandria (10–70 CE) derived the law of reflection by finding the shortest path for light reflecting from a mirror, which results in an angle of reflection equal to the angle of incidence (Fig. 2.2).

Figure 2.2: The law of reflection can be derived by minimizing the length of the beam of light.

2.2 Optimization Revolution: Derivatives and Calculus

The scientific revolution generated significant optimization developments in the 17th and 18th centuries that intertwined with other mathematics and physics developments.
In the early 17th century, Johannes Kepler published a book in
which he derived the optimal dimensions of a wine barrel.5 He became interested in this problem when he bought a barrel of wine, and the merchant charged him based on a diagonal length (see Fig. 2.3). This outraged Kepler because he realized that the amount of wine could vary for the same diagonal length, depending on the barrel proportions.
5. Kepler, Nova stereometria doliorum vinariorum (New solid geometry of wine barrels). 1615

Incidentally, Kepler also formulated an optimization problem when


looking for his second wife, seeking to maximize the likelihood of satisfaction. This “marriage problem” later became known as the “secretary problem”, which is an optimal stopping problem that has since been solved using dynamic optimization (mentioned in Section 1.4.6 and discussed in Section 8.5).6
6. Ferguson, Who Solved the Secretary Problem? 1989

Figure 2.3: Wine barrels were measured by inserting a ruler in the taphole until it hit the corner.

Willebrord Snell discovered the law of refraction in 1621, a formula that describes the relationship between the angles of incidence and refraction when light passes through a boundary between two different media such as air, glass, or water. While Hero solved a length minimization problem to derive the law of reflection, Snell minimized time.
These laws were generalized by Fermat in the principle of least time (or
Fermat’s principle), which states that a ray of light going from one point
to another follows the path that takes the least time.

Pierre de Fermat derived Snell’s law by applying the principle of


least time, and in the process, he devised a mathematical technique for
finding maxima and minima using what amounted to derivatives (he
missed out on generalizing the notion of derivative, which came later
in the development of calculus).7
7. Fermat, Methodus ad disquirendam maximam et minimam (Method for the study of maxima and minima). 1636
Isaac Newton wrote about a numerical technique in 1669 to find the roots of polynomials by successively linearizing them, achieving quadratic convergence. In 1687, he used this technique to find the roots of a non-polynomial equation (Kepler’s equation),* but only after using polynomial expansions.
* Kepler’s equation describes orbits by E − e sin(E) = M, where M is the mean anomaly, e is the eccentricity, and E is the eccentric anomaly. This equation does not have a closed-form solution for E.
In 1690, Joseph Raphson improved on Newton’s method by keeping all the decimals in each linearization and
made it a fully iterative scheme. The multivariable “Newton’s method”
that is widely used today was actually introduced in 1740 by Thomas
Simpson. He generalized the method by using the derivatives (which
allowed for solving non-polynomial equations without expansions)
and by extending it to a system of two equations and two unknowns.
In 1685, Newton studied a shape optimization problem where he
sought the shape of a body of revolution that minimizes fluid drag
and even mentioned a possible application to ship design. Although
he used the wrong model for computing the drag, he correctly solved
what amounted to a calculus of variations problem.
In 1696, Johann Bernoulli challenged other mathematicians to find
the path of a body subject to gravity that minimizes the travel time
between two points of different heights. This is now a classic calculus
of variations problem called the brachistochrone problem (Fig. 2.4).
Bernoulli already had a solution that he kept secret. Five mathematicians responded with solutions: Newton, Jakob Bernoulli (Johann’s brother), Gottfried Wilhelm von Leibniz, Ehrenfried Walther von Tschirnhaus, and Guillaume de l’Hôpital. Newton reportedly started working on the problem as soon as he received it and stayed up all night before sending the solution anonymously to Bernoulli the next day.

Figure 2.4: Suppose that you have a bead on a wire that goes from A to B. The brachistochrone curve is the shape of the wire that minimizes the time for the bead to slide between the two points under gravity alone. It is faster than a straight-line trajectory or a circular arc.

Starting in 1736, Leonhard Euler derived the general optimality conditions for solving calculus of variations problems, but the derivation included geometric arguments. In 1755, Joseph–Louis Lagrange used a purely analytic approach to derive the same optimality conditions (he was 19 years old at the time!). Euler recognized Lagrange’s derivation, which uses variations of a function, as a superior approach and adopted it, calling it “calculus of variations”. The result is a second-order partial differential equation that has become known as the Euler–Lagrange equation. Lagrange used the Euler–Lagrange equation to develop a reformulation of classical mechanics in 1788, which became known

as Lagrangian mechanics. When deriving the general equations of


equilibrium for problems with constraints, Lagrange introduced the “method of the multipliers”.8 Lagrange multipliers eventually became a fundamental concept in constrained optimization (Chapter 5).
8. Lagrange, Mécanique analytique. 1788


In 1784, Gaspard Monge developed a geometric method to solve a
transportation problem. Although the method is not entirely correct,
it marks the establishment of combinatorial optimization, a branch of
discrete optimization (Chapter 8).

2.3 The Birth of Optimization Algorithms

There were several more theoretical contributions related to optimiza-


tion in the 19th century and the early 20th century. However, it was not
until the 1940s that optimization started to gain traction with the devel-
opment of algorithms and their use in practical applications, thanks to
the advent of computer hardware.
In 1805, Adrien–Marie Legendre described the method of least
squares, which was used to predict asteroid orbits and curve fitting.
Friedrich Gauss published a rigorous mathematical foundation for the method of least squares and claimed he used it to predict the orbit of the
asteroid Ceres in 1801. Legendre and Gauss engaged in a bitter dispute
on who first developed the method.
In one of his 789 papers, Augustin–Louis Cauchy proposed the
steepest descent method for solving systems of nonlinear equations.9 He did not seem to put much thought into it and promised a “paper to follow” on the subject that never appeared. Although he proposed this method for solving systems of nonlinear equations, it is directly applicable to unconstrained optimization (Chapter 4).
9. Cauchy, Méthode générale pour la résolution des systèmes d’équations simultanées. 1847
In 1902, Gyula Farkas proved a theorem on the solvability of a
system of linear inequalities. This became known as Farkas’ lemma,
which is crucial in the derivation of the Karush–Kuhn–Tucker optimality
conditions for constrained problems (see below and Chapter 5).
In 1917, Harris Hancock published the first textbook on optimization, which includes the optimality conditions for multivariable unconstrained and constrained problems.10
10. Hancock, Theory of Minima and Maxima. 1917
In 1932, Karl Menger presented “the messenger problem”,11 an optimization problem that seeks the shortest travel path that connects a set of destinations, observing that going to the closest point each time does not in general result in the shortest overall path. This is a combinatorial optimization problem that later became known as the traveling salesman problem, one of the most intensively studied problems in optimization (Chapter 8).
11. Menger, Das botenproblem. 1932

In 1939, William Karush derived the necessary conditions for in-


equality constrained problems in his master’s thesis. This generalizes
the method of Lagrange multipliers, which only allowed for equality
constraints. Harold Kuhn and Albert Tucker independently rediscovered these conditions and published their seminal paper in 1951.12 These became known as the Karush–Kuhn–Tucker (KKT) conditions, which constitute the foundation of gradient-based constrained optimization algorithms (Section 5.2).
12. Karush, Minima of Functions of Several Variables with Inequalities as Side Constraints. 1939
Leonid Kantorovich developed a technique to solve linear program-
ming problems in 1939 after having been given the task of optimizing
production in the Soviet government plywood industry. However,
his contribution was neglected for ideological reasons. In the United
States, Tjalling Koopmans rediscovered linear programming in the early
1940s when working on ship transportation problems. In 1947, George
Dantzig published the first complete algorithm to solve linear pro-
gramming problems—the simplex algorithm. In the same year, von
Neumann developed the theory of duality for linear programming
problems. Kantorovich and Koopmans later shared the 1975 Nobel
Memorial Prize in Economic Sciences “for their contributions to the
theory of optimum allocation of resources”. Dantzig was not included,
presumably because his work was not as applied. The development
of the simplex algorithm and the widespread practical applications
of linear programming sparked a revolution in optimization. The first
international conference on optimization, the International Symposium
on Mathematical Programming, was held in Chicago in 1949.
In 1951, George Box and Kenneth Wilson developed the response
surface methodology (surrogate modeling), which enables optimization
of systems based on experimental data (as opposed to a physics-based
model). They developed a method to build a quadratic model where
the number of data points scales linearly with the number of inputs,
instead of exponentially, striking a balance between accuracy and ease of
application. In the same year, Danie Krige develops a surrogate model
based on a stochastic process, which is now known as “kriging”. He
developed this model in his master’s thesis to estimate the most likely
distribution of gold based on a limited number of borehole samples.13
13. Krige, A statistical approach to some mine valuation and allied problems on the Witwatersrand. 1951
These approaches are foundational in surrogate-based optimization
(Chapter 10).
In 1952, Harry Markowitz published a paper on portfolio theory
that formalizes the idea of investment diversification, marking the birth
of modern financial economics.14 The theory is based on a quadratic optimization problem. He received the 1990 Nobel Memorial Prize in Economic Sciences.
14. Markowitz, Portfolio selection. 1952

In 1955, Lester Ford and Delbert Fulkerson created the first known
algorithm to solve the maximum flow problem, which has applications
in transportation, electrical circuits, and data transmission. While the
problem could already be solved with the simplex algorithm, they
proposed a more efficient algorithm for this specialized problem.
In 1957, Richard Bellman derived the necessary optimality condi-
tions for dynamic programming problems. These are expressed in
what became known as the Bellman equation (Section 8.5), which was
first applied to engineering control theory, and subsequently became a
core principle in the development of economic theory.
In 1959, William Davidon developed the first quasi-Newton method
to solve nonlinear optimization problems that rely on approximations
of the curvature based on gradient information. He was motivated by
his work at Argonne National Lab, where he used a coordinate descent
method to perform an optimization that kept crashing the computer
before converging. Although Davidon’s approach was a breakthrough
in nonlinear optimization, his original paper was rejected. It was
eventually published more than 30 years later in the first issue of the
SIAM Journal on Optimization.15 Fortunately, his valuable insight had been recognized well before that by Roger Fletcher and Michael Powell, who developed the method further.16 The method became known as DFP (Section 4.4.4).
15. Davidon, Variable Metric Method for Minimization. 1991
16. Fletcher et al., A Rapidly Convergent Descent Method for Minimization. 1963
Another quasi-Newton approximation method was independently
proposed in 1970 by Charles Broyden, Roger Fletcher, Donald Goldfarb,
and David Shanno, now called the BFGS approximation. Larry Armijo,
A. Goldstein, and Philip Wolfe developed the conditions for the line search in gradient-based methods that ensure convergence (see Section 4.3.2).17
17. Wolfe, Convergence Conditions for Ascent Methods. 1969
With these developments in unconstrained optimization, researchers
sought methods to solve constrained problems as well. Penalty and barrier methods were developed but fell out of favor because of numerical issues
(see Section 5.3).
In another effort to solve nonlinear constrained problems, Robert
Wilson proposed the sequential quadratic programming (SQP) method
in his Ph.D. thesis.18 SQP essentially consists of applying the Newton method to solve the KKT conditions (see Section 5.4). Shih–Ping Han reinvented SQP in 197619 and Michael Powell popularized this method in a series of papers starting from 1977.20
18. Wilson, A simplicial algorithm for concave programming. 1963
19. Han, Superlinearly convergent variable metric algorithms for general nonlinear programming problems. 1976
20. Powell, Algorithms for nonlinear constraints that use Lagrangian functions. 1978
There were attempts to model the natural process of evolution starting in the 1950s. In 1975, John Holland proposed genetic algorithms (GAs) to solve optimization problems (Section 7.5).21 Research in GAs increased dramatically after that, thanks in part to the exponential increase in computing power (Section 7.5).
21. Holland, Adaptation in Natural and Artificial Systems. 1975

Hooke et al.22 proposed a gradient-free method, which they called “pattern search.” In 1965, Nelder et al.23 developed the nonlinear simplex method, another gradient-free nonlinear optimization algorithm based on heuristics (Section 7.3). (This has no connection to the simplex algorithm for linear programming problems mentioned above.)
22. Hooke et al., “Direct Search” Solution of Numerical and Statistical Problems. 1961
23. Nelder et al., A Simplex Method for Function Minimization. 1965
The Mathematical Programming Society was founded in 1973, an
international association for researchers active in optimization. It was
eventually renamed Mathematical Optimization Society in 2010.
Narendra Karmarkar presented a revolutionary new method in
1984 to solve large-scale linear programming problems as much as a hundred times faster than the simplex method.24 The New York Times published a news item on the first page with the headline “Breakthrough in Problem Solving.” This started the age of interior point methods, which are related to the barrier methods dismissed in the 1960s. Interior point methods were eventually adapted to solve nonlinear problems (see Section 5.5) and contributed to the unification of linear and nonlinear optimization.
24. Karmarkar, A New Polynomial-Time Algorithm for Linear Programming. 1984

2.4 The Last Decades

The relentless exponential increase in computer power through the


1980s and beyond has made it possible to perform engineering de-
sign optimization with increasingly sophisticated models, including
multidisciplinary models. The increased computer power has also
been contributing to the gain in popularity of heuristic optimization
algorithms. Computer power has also enabled the increased use of neural networks and the explosive rise of artificial intelligence.
The field of optimal control flourished after Bellman’s contribution
to dynamic programming. Another important optimality principle for
control, the maximum principle, was derived by Pontryagin et al.25 This principle makes it possible to transform a calculus of variations problem into a nonlinear optimization problem. Gradient-based nonlinear optimization algorithms were then used to numerically solve for optimal trajectories of rockets and aircraft, with an adjoint method to compute the gradients of the objective with respect to the control histories.26
25. Pontryagin et al., The Mathematical Theory of Optimal Processes. 1961
26. Bryson Jr, Optimal Control—1950 to 1985. 1996
The adjoint method efficiently computes gradients with respect to
large numbers of variables and proved to be useful in other disciplines.
Optimal control expanded to include the optimization of feedback control
laws that guarantee closed-loop stability. Optimal control approaches
include model-predictive control, which is widely used today.
In 1960, Schmit27 proposed coupling numerical optimization with structural computational models to perform structural design, establishing the field of structural optimization.
27. Schmit, Structural Design by Systematic Synthesis. 1960
Five years later, he presented

applications, including aerodynamics and structures, which represents


the first known MDO application.28 The direct method for computing gradients for structural computational models was developed shortly after that,29 eventually followed by the adjoint method (Section 6.7).30 In this early work, the design variables were the cross-sectional areas of the members of a truss structure. Researchers then added joint positions to the set of design variables. Structural optimization was generalized further with shape optimization, which optimized the shape of arbitrary three-dimensional structural parts.31 Another significant development was topology optimization, where a structural layout emerges from a solid block of material.32 It took many years of further development in algorithms and computer hardware for structural optimization to be widely adopted by industry, but now this capability has made its way to commercial software.
28. Schmit et al., Synthesis of an Airfoil at Supersonic Mach Number. 1965
29. Fox, Constraint Surface Normals for Structural Synthesis Techniques. 1965
30. Arora et al., Methods of Design Sensitivity Analysis in Structural Optimization. 1979
31. Haftka et al., Structural shape optimization—A survey. 1986
32. Eschenauer et al., Topology optimization of continuum structures: A review. 2001
Aerodynamic shape optimization began when Pironneau33 used optimal control techniques to minimize the drag of a body by varying its shape (the “control” variables). Jameson34 extended the adjoint method with more sophisticated computational fluid dynamics models and applied it to aircraft wing design. CFD-based optimization applications spread beyond aircraft wing design to the shape optimization of wind turbines, hydrofoils, ship hulls, and automobiles. The adjoint method was then generalized for any discretized system of equations (Section 6.7).
33. Pironneau, On optimum design in fluid mechanics. 1974
34. Jameson, Aerodynamic Design via Control Theory. 1988
MDO developed rapidly in the 1980s, following the application
of numerical optimization techniques to structural design. The first
conference in MDO, the Multidisciplinary Analysis and Optimization
Conference, took place in 1985. The first MDO applications focused on
coupling the aerodynamics and structures in wing design, and other
early applications integrated structures and controls.35 The development of MDO methods progressed toward decomposing the problem into optimization subproblems, leading to distributed MDO architectures.36 Sobieszczanski–Sobieski37 proposed a formulation for computing the derivatives for coupled systems, which is needed when performing MDO with gradient-based optimizers. This concept was later combined with the adjoint method to compute coupled derivatives efficiently.38 More recently, efficient coupled derivative computation and hierarchical solvers have made it possible to solve large-scale MDO problems39 (see Chapter 13). Engineering design has been focusing on achieving improvements made possible by considering the interaction of all relevant disciplines. MDO applications have extended beyond aircraft to the design of bridges, buildings, automobiles, ships, wind turbines, and spacecraft.
35. Sobieszczanski–Sobieski et al., Multidisciplinary Aerospace Design Optimization: Survey of Recent Developments. 1997
36. Martins et al., Multidisciplinary Design Optimization: A Survey of Architectures. 2013
37. Sobieszczanski–Sobieski, Sensitivity of Complex, Internally Coupled Systems. 1990
38. Martins et al., A Coupled-Adjoint Sensitivity Analysis Method for High-Fidelity Aero-Structural Design. 2005
39. Hwang et al., A computational architecture for coupling heterogeneous numerical models and computing coupled derivatives. 2018

In continuous nonlinear optimization, SQP has remained state-of-


the-art since its popularization in the late 1970s. However, the interior
point approach, which, as mentioned previously, revolutionized linear
optimization, was successfully adapted to solve nonlinear problems
and has made great strides since the 1990s.40 Today, both SQP and interior point methods are considered to be state-of-the-art.
40. Wright, The interior-point revolution in optimization: history, recent developments, and lasting consequences. 2005
Interior point methods have contributed to the connection between
linear and nonlinear optimization, which were treated as entirely
separate fields before 1984. Today, state-of-the-art linear optimization
software packages have options for both the simplex and interior-point
approaches because the best approach depends on the problem.
Convex optimization emerged as a generalization of linear optimiza-
tion (see Chapter 11). Like linear optimization, it was initially mostly
used in operations research applications† (such as transportation, manufacturing, supply chain management, and revenue management), and there were only a few applications in engineering.
† The field of operations research was established in World War II to help make better strategic decisions.
Since the 1990s,
convex optimization has increasingly been used in engineering applica-
tions, including optimal control, signal processing, communications,
and circuit design. A disciplined convex programming methodology facilitated this expansion by providing a way to construct convex problems and convert them to a solvable form.41 New classes of convex optimization problems have also been developed, such as geometric programming (see Section 11.6), semidefinite programming, and second-order cone programming.
41. Grant et al., Global Optimization—From Theory to Implementation. 2006
As mathematical models became increasingly complex computer
programs and given the need to differentiate those models when per-
forming gradient-based optimization, new methods have been devel-
oped to compute derivatives. Wengert42 was among the first to propose the automatic differentiation of computer programs (or algorithmic differentiation). The reverse mode of algorithmic differentiation, which is equivalent to the adjoint method applied to computer programs, was proposed later (see Section 6.6).43 This field has evolved immensely since then, with techniques to handle more functions and increase efficiency. Algorithmic differentiation tools have been developed for an increasing number of programming languages. One of the more recently developed programming languages, Julia, features prominent support for algorithmic differentiation. At the same time, algorithmic differentiation has now spread to a wide range of applications.
42. Wengert, A Simple Automatic Derivative Evaluation Program. 1964
43. Speelpenning, Compiling fast partial derivatives of functions given by algorithms. 1980
Another technique to compute derivatives numerically, the complex-
step derivative approximation, was proposed by Squire et al.44 Soon after, this technique was generalized to computer programs, applied to computational fluid dynamics, and found to be related to automatic differentiation (see Section 6.5).45
44. Squire et al., Using Complex Variables to Estimate Derivatives of Real Functions. 1998
45. Martins et al., The Complex-Step Derivative Approximation. 2003

The pattern search algorithms that Hooke and Jeeves, and Nelder and Mead developed were disparaged by applied mathematicians,
who preferred the rigor and efficiency of the gradient-based methods
developed soon after that. Nevertheless, they were further developed
and remain popular with engineering practitioners because of their sim-
plicity. Pattern search methods experienced a renaissance in the 1990s
with the development of convergence proofs that added mathematical
rigor and the availability of more powerful parallel computers.46 Today, pattern search methods remain a useful option (and sometimes the only one) for some types of optimization problems.
46. Torczon, On the Convergence of Pattern Search Algorithms. 1997
Global optimization algorithms also experienced further develop-
ments. Jones et al.47 developed the DIRECT algorithm, which uses a rigorous approach to find the global optimum (Section 7.4).
47. Jones et al., Lipschitzian optimization without the Lipschitz constant. 1993
The first genetic algorithms started the development of the broader
class of evolutionary optimization algorithms inspired more broadly by
natural and societal processes. Optimization by simulated annealing
represents one of the early examples of this broader perspective.48 An- 48. Kirkpatrick et al., Optimization by
Simulated Annealing. 1983
other example is particle swarm optimization (PSO) (see Section 7.6).49
49. Kennedy et al., Particle Swarm Opti-
Since then, there has been an explosion in the number of evolutionary mization. 1995
algorithms, inspired by any process imaginable (see the side note at the
end of Section 7.2 for a partial list). Evolutionary algorithms have re-
mained heuristic and have not experienced the mathematical treatment
applied to pattern search methods.
There has been a sustained interest in surrogate models (also known
as metamodels) since the seminal contributions in the 1950s. Kriging
surrogate models are still being used and have been the focus of many
improvements, but new techniques such as radial-basis functions have
also emerged.50 Surrogate-based optimization is now an area of active research (see Chapter 10).
50. Forrester et al., Recent advances in surrogate-based optimization. 2009
Artificial intelligence (AI) has experienced a revolution in the last
decade and is connected to optimization in several ways. The early AI
efforts focused on solving problems that could be described formally,
such as design optimization problem statements. Today, AI solves
problems that are difficult to describe formally, such as face recognition.
This new capability is made possible by the development of deep
learning neural networks, the availability of large datasets for training
the neural networks, and increased computer power. Deep learning
neural networks learn to map a set of inputs to a set of outputs based
on training data and can be viewed as a type of surrogate model (see
Section 10.5). These networks are trained using optimization algorithms
that minimize a loss function (analogous to model error), but they
require specialized optimization algorithms such as stochastic gradient
descent.51 The gradients for this problem are efficiently computed with backpropagation, which is a specialization of the reverse mode of AD.52
51. Bottou et al., Optimization Methods for Large-Scale Machine Learning. 2018
52. Baydin et al., Automatic Differentiation in Machine Learning: a Survey. 2018

2.5 Summary

This history of optimization is as old as human civilization and has had


many twists and turns. Ancient geometric optimization problems that
were correctly solved by intuition required mathematical developments
that were only realized much later. The discovery of calculus laid out the
foundations for optimization. Computer hardware and algorithms then
enabled the development and deployment of numerical optimization.
Numerical optimization was first motivated by operations research
problems but eventually made its way into engineering design. Soon
after numerical models were developed to simulate engineering systems,
the idea arose to couple those models to optimization algorithms in
an automated cycle to optimize the design of such systems. The
first application was in structural design, but many other engineering
design applications followed, including applications coupling multiple
disciplines, establishing MDO. Whenever a new numerical model
becomes fast enough and sufficiently robust, there is an opportunity to
integrate it with numerical optimization to go beyond simulation and
perform design optimization.
Many insightful connections have been made in the history of
optimization, and the trend has been to unify the theory and methods.
There are no doubt more connections and contributions to be made—
hopefully from a more diverse research community.
3 Numerical Models and Solvers
In the introductory chapter, we discussed function characteristics from
the point of view of the function’s output—the black box view shown in
Fig. 1.13. Here, we discuss how the function is modeled and computed.
The more understanding and access you have to the model, the more
you can do to solve the optimization problem more effectively. We
explain the errors involved in the modeling process so that we can
interpret optimization results correctly.

By the end of this chapter you should be able to:

1. Identify different types of numerical error and understand


some of the limitations of finite precision arithmetic.

2. Estimate an algorithm’s rate of convergence.

3. Use Newton’s method to solve systems of equations.

3.1 Model Development for Analysis Versus Optimization

A good understanding of numerical models and solvers is essential as


numerical optimization demands more of the models and solvers than
does pure analysis. In an analysis, or a parametric study, we may cycle
through a range of plausible configurations. However, optimization al-
gorithms seek to explore the solutions space and therefore intermediate
evaluations may use atypical input combinations. The mathematical
model, numerical model, and solver, need to be robust enough to handle
these wide-ranging inputs without artificially constraining the solution
space. It is important to explore the behavior of the model across a
wide range of inputs before attempting optimization.
A related issue is that an optimizer exploits errors in ways that an
engineering designer would not in an analysis. For example, consider
an aerodynamic model of a car. In a parametric study, we might try
a dozen designs, compare the drag, and choose the best one. If we
passed this procedure to an optimizer, it would flatten the car to zero
height—the minimum-drag solution. Thus, for optimization we often
need to develop additional models, in this case modeling cargo and
structural requirements. The parametric designer considered these
things implicitly and approximately, but in optimization we need to
explicitly model these requirements and pose them as constraints.
Another consideration that affects both the mathematical and the
numerical model is the overall computational cost of optimization. An
analysis might only be run dozens of times, whereas an optimization
often runs the analysis thousands of times. This computational cost
can affect the level of fidelity or discretization that we can afford to use.
The level of precision that is desirable for analysis is often insufficient
for optimization. In an analysis, a few digits of precision may be
sufficient. However, using fewer significant digits limits the types of
optimization algorithms we can use effectively, convergence failures
can cause premature termination of algorithms, and noisy outputs can
severely mislead or terminate an optimization prematurely. A common
source of these errors involves programs that work through input
and output files. Even though the underlying code may use double-
precision arithmetic, output files rarely include all the significant digits
(a separate issue is that reading and writing files at every iteration
considerably slows down the analysis).
Another common source of errors involves converging systems of
equations, as discussed later in this chapter. Optimization generally
requires tighter tolerances than are used for analysis. Sometimes this
is as easy as changing a default tolerance; other times, we need to
rethink the solvers we use.
3.2 Modeling Process and Types of Errors
Design optimization problems usually involve modeling a physical
system so that we can compute the objective and constraint function
values for a given design. The steps in the modeling process are
shown in Fig. 3.1. The physical system represents the reality that
we want to model. The mathematical model can range from simple
mathematical expressions to sets of continuous differential or integral
equations for which closed-form solutions over an arbitrary domain are
not possible. When that is the case, we must discretize the continuous
equations to obtain the numerical model. This numerical model must
then be programmed using a computer language to develop a numerical
solver. Finally, the solver computes the system state variables using
finite-precision arithmetic.

Figure 3.1: Physical problems are modeled and then solved numerically to produce function values.

Each of these steps in the modeling process introduces an error.


Modeling errors are introduced in the idealization and approximations
performed in the derivation of the mathematical model. The errors
involved in the rest of the process are all numerical errors, which we
detail in Section 3.5. The total error is the sum of the modeling errors
and numerical errors.
Validation and verification processes enable us to quantify these
errors. Verification is concerned with making sure that the errors introduced
by the discretization and the numerical computations are acceptable.
In addition, verification aims to make sure there are no bugs in the code
that introduce errors. Validation involves comparing the numerical
results with experimental observations of the physical system, which
are themselves subject to experimental errors. By making these compar-
isons, we can validate the modeling step of the process and make sure
that the idealizations and assumptions in developing the mathematical
model are acceptable.
These errors relate directly to the concepts of precision and accuracy.
An accurate solution is one that compares well with the real physical
system (validation), whereas a precise solution means only that the
numerical model is coded and solved correctly (verification).

Example 3.1: Modeling a structure.

As an example of a physical system, consider the timber roof truss structure
shown in Fig. 3.2. A typical idealization of such a structure assumes that the
wood is a homogeneous material and that the joints are pinned. The loads are
applied only at the joints, and the weight of the structure does not contribute to
the loading. It is also usual to assume that the displacements are small relative
to the dimensions of the truss members. The structure is discretized using pinned
bar elements. The discrete governing equations for any truss structure can be
derived by enforcing equilibrium at each joint or, more generally, by using the
finite-element method. This leads to the linear system

𝐾𝑢 = 𝑓 ,

where 𝐾 is the stiffness matrix, 𝑓 is the vector of applied loads, and 𝑢 is the
vector of displacements that we want to compute. At each joint, there are two
degrees of freedom (horizontal and vertical) that describe the displacement and
applied force. Since there are 9 joints, each with 2 degrees of freedom, the size
of this linear system is 18.

Figure 3.2: Timber roof truss and idealized model.

3.3 Numerical Models as Residual Equations

Mathematical models vary greatly in complexity and scale. In the
simplest case, a model can be represented by one or more explicit
functions, which are easily coded and computed. Many examples in
this book use explicit functions for simplicity. In practice, however,
most numerical models are defined by implicit equations. Implicit
equations can be written in the residual form,

𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑛 ) = 0 𝑖 = 1, . . . , 𝑛. (3.1)

where 𝑟 is a vector of residuals that has the same size as the vector of
state variables 𝑢. The equation defining the residuals could be any
expression that can be coded in a computer program. No matter how
complex the mathematical model, it can always be written as a set of
equations in this form, which we write more compactly as 𝑟(𝑢) = 0.
This residual notation can still be used to represent explicit functions,
so we can use it for all the functions in a model without loss of generality.
Suppose we have the explicit function 𝑢 𝑓 ≡ 𝑓 (𝑢), where 𝑢 is a vector
and 𝑢 𝑓 is the scalar function value (and not one of the components
of 𝑢). We can rewrite this function as a residual equation by moving
all the terms to one side to get 𝑟(𝑢 𝑓 ) = 𝑓 (𝑢) − 𝑢 𝑓 = 0. Even though
it might seem more natural to use explicit functions, we might be
motivated to use the residual form to write the whole model in the
compact notation, 𝑟(𝑢) = 0. This will be helpful in later chapters when
computing derivatives (Chapter 6) and solving systems with multiple
components (Chapter 13).

Example 3.2: Expressing an explicit function as an implicit equation.

Suppose we have the following mathematical model,

𝑢1² + 2𝑢2 − 1 = 0,
𝑢1 + cos(𝑢1 ) − 𝑢2 = 0, (3.2)
𝑓 (𝑢1 , 𝑢2 ) = 𝑢1 + 𝑢2 .

The first two equations are written as implicit functions and the third equation
is given as an explicit function. The first equation could be manipulated to
obtain an explicit function of either 𝑢1 or 𝑢2 . The second equation does not have
a closed-form solution and cannot be written as an explicit function for 𝑢1 . The
third equation is an explicit function of 𝑢1 and 𝑢2 . Given these equations, we
might decide to solve the first two equations for 𝑢1 and 𝑢2 using a nonlinear
solver and then evaluate 𝑓 (𝑢1 , 𝑢2 ). However, we can write the whole system as
implicit residual equations by defining the value of 𝑓 (𝑢1 , 𝑢2 ) as 𝑢3 ,

𝑟1 (𝑢1 , 𝑢2 ) = 𝑢1² + 2𝑢2 − 1 = 0,
𝑟2 (𝑢1 , 𝑢2 ) = 𝑢1 + cos(𝑢1 ) − 𝑢2 = 0,    (3.3)
𝑟3 (𝑢1 , 𝑢2 , 𝑢3 ) = 𝑢1 + 𝑢2 − 𝑢3 = 0.

You can use the same nonlinear solver to solve all three equations simultaneously.
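As a small illustration, the following is a minimal sketch of solving all three residual equations of Eq. 3.3 simultaneously with a general-purpose nonlinear solver. The choice of SciPy's fsolve and the starting guess are our own assumptions for illustration, not part of the original example.

```python
import numpy as np
from scipy.optimize import fsolve

def residuals(u):
    """Residual form of Eq. 3.3: return r(u) for all three equations."""
    u1, u2, u3 = u
    return [
        u1**2 + 2 * u2 - 1,     # r1
        u1 + np.cos(u1) - u2,   # r2
        u1 + u2 - u3,           # r3 (the explicit function written as a residual)
    ]

u0 = np.array([1.0, 1.0, 1.0])  # arbitrary starting guess (assumed)
u = fsolve(residuals, u0)
print("u =", u, " residuals =", residuals(u))
```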

The governing equations of a model determine the state of a given
physical system at specific conditions. Many governing equations
consist of differential equations, which require discretization to obtain
implicit equations that can be solved numerically (see Section 3.4).
After discretization, the governing equations can always be written as
𝑟(𝑢) = 0.

Example 3.3: Implicit and explicit equations in structural analysis.

The linear system from Ex. 3.1 can be obtained by a finite-element dis-
cretization of the governing equations. This is an example of a set of implicit
equations, which we can write as a set of residuals,

𝑟(𝑢) = 𝐾𝑢 − 𝑓 = 0, (3.4)

where 𝑢 are the state variables. While the solution for 𝑢 could be written as an
explicit function, 𝑢 = 𝐾 −1 𝑓 , this is usually not done. Instead, we use a linear
solver that does not explicitly form the inverse of the stiffness matrix.
In addition to computing the displacements, we might also want to compute
the axial stress in each of the 15 truss members. This is an explicit function of
the displacements, which is given by the linear relationship

𝜎 = 𝑆𝑢, (3.5)

where 𝑆 is a 15 × 18 matrix. Another example of an explicit function is the
computation of the structural mass, which is

𝑚 = ∑_{𝑖=1}^{15} 𝜌 𝑎 𝑖 𝑙 𝑖 ,    (3.6)

where 𝜌 is the material density, 𝑎 𝑖 is the cross-sectional area of each member,
and 𝑙 𝑖 is the member length.

3.4 Discretization of Differential Equations

Many physical systems are modeled by differential equations defined
over a domain. The domain can be spatial (one or more dimensions),
temporal, or both. When time is considered, then we have a dynamic
model. When a differential equation is defined in a domain with one
degree of freedom (1D in space or time), then we have an ordinary
differential equation (ODE), while any domain defined by more than
one variable results in a partial differential equation (PDE).

Figure 3.3: Discretization methods in one spatial dimension: finite difference (mesh points), finite volume (cells), and finite element (elements).

Differential equations need to be discretized over the domain to be
solved numerically. There are three main methods for the discretization
of differential equations: the finite-difference method, the finite-volume
method, and the finite-element method. The finite-difference method
approximates the derivatives in the differential equations by the value
of the relevant quantities at a discrete number of points in a mesh (see
Fig. 3.3). The finite-volume method is based on the integral form of the
PDEs. It divides the domain into control volumes called cells (which
also form a mesh), and the integral is evaluated for each cell. The values
of the relevant quantities can be defined either at the centroids of the
cells or at the cell vertices. The finite-element method divides the domain
into elements (which are similar to cells) over which the quantities
are interpolated using pre-defined shape functions. The values are
computed at specific points in the element that are not necessarily at the
element boundaries. Governing equations can also include integrals,
which can be discretized with quadrature rules.
With any of these discretization methods, the final result is a set
of algebraic equations that we can write as 𝑟(𝑢) = 0 and solve for the
state variables 𝑢. This is a potentially large set of equations depending
on the domain and discretization (it is common to have millions of
equations in three-dimensional computational fluid dynamics problems).
The number of state variables of the discretized model is equal to the
number of equations for a complete and well-defined model. In the
most general case, the set of equations could be implicit and nonlinear.

When a problem involves both space and time, the prevailing ap-
proach is to decouple the discretization in space from the discretization
in time—called the method of lines (see Fig. 3.4). The discretiza-
tion in space is performed first, yielding an ODE in time. The time
derivative can then be approximated as a finite difference, leading to a
time-integration scheme.
The discretization process usually yields implicit algebraic equations
that require a solver to obtain the solution. However, discretization
in some cases yields explicit equations, in which case a solver is not
required.

Figure 3.4: PDEs in space and time are often discretized in space first to yield an ODE in time.

3.5 Numerical Errors

Numerical errors (or computation errors) can be categorized into three
main types: round-off errors, truncation errors, and errors due to
coding. Numerical errors are involved with each of the modeling steps
between the mathematical model and the states (see Fig. 3.1). The error
involved in the discretization step is a type of truncation error. The
errors introduced in the coding step are not usually discussed as a
numerical error, but we include them here because they are a likely
source of error in practice. The errors in the computation step involve
both round-off and truncation errors. Each of these error sources is
discussed in the following subsections.


An absolute error is the magnitude of the difference between the exact
value (𝑥 ∗ ) and the computed value (𝑥), which we can write as |𝑥 − 𝑥 ∗ |.
The relative error is a more intrinsic error measure and is defined as

𝜖 = |𝑥 − 𝑥 ∗ | / |𝑥 ∗ | .    (3.7)

This is a more useful error measure in most cases. When the exact value
𝑥 ∗ is close to zero, however, this definition breaks down. To address
this, we avoid the division by zero by using

𝜖 = |𝑥 − 𝑥 ∗ | / (1 + |𝑥 ∗ |) .    (3.8)

This error metric combines the properties of absolute and relative errors.
When |𝑥 ∗ | ≫ 1, this metric is similar to the relative error, but when
|𝑥 ∗ | ≪ 1, it becomes similar to the absolute error.
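These error measures translate directly into code. The following minimal sketch (the helper names are ours, chosen for illustration) computes the relative error of Eq. 3.7 and the combined metric of Eq. 3.8.

```python
def relative_error(x, x_exact):
    """Eq. 3.7: breaks down when x_exact is close to zero."""
    return abs(x - x_exact) / abs(x_exact)

def combined_error(x, x_exact):
    """Eq. 3.8: behaves like a relative error for large |x*| and an absolute error for small |x*|."""
    return abs(x - x_exact) / (1 + abs(x_exact))

print(combined_error(24.1, 24.11))  # close to the relative error because |x*| >> 1
print(combined_error(1e-9, 0.0))    # equals the absolute error because |x*| << 1
```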

3.5.1 Roundoff Errors


Roundoff errors stem from the fact that a computer cannot represent
all real numbers with exact precision. Errors in the representation of
each number lead to errors in each arithmetic operation, which in turn


might accumulate over the course of a program.
There are an infinite number of real numbers, but not all numbers can
be represented in a computer. When a number cannot be represented
exactly, it is rounded. In addition, a number might be too small or too
large to be represented.
Computers use bits to represent numbers, where each bit is either
0 or 1. Most computers use the IEEE standard for representing num-
bers and performing finite-precision arithmetic. The most common
representation uses 32 bits for integers and 64 bits for real numbers.
Basic operations that only involve integers and whose result is an
integer do not incur numerical errors. However, there is a limit on the
range of integers that can be represented. When using 32 bit integers,
1 bit is used for the sign, and the remaining 31 bits can be used for
the digits, which results in a range from −2³¹ = −2,147,483,648 to
2³¹ − 1 = 2,147,483,647. Any operation outside this range would result
in integer overflow.∗

∗ Some programming languages, such as Python, have arbitrary precision integers and are not subject to this issue, albeit with some performance tradeoffs.

Real numbers are represented using scientific notation in base 2,

𝑥 = significand × 2^exponent .    (3.9)

The 64-bit representation is known as the double-precision floating-point
format, where some digits store the significand and others store
the exponent. The greatest positive and negative real numbers that
can be represented using the IEEE double-precision representation are
approximately 10³⁰⁸ and −10³⁰⁸, respectively. Operations that produce
numbers outside this range result in overflow, which is a fatal error in
most computers and interrupts the program execution.
There is also a limit on how close a number can come to zero,
which is approximately 10⁻³²⁴ when using double precision. Numbers
smaller than this result in underflow. The computer sets such numbers
to zero by default and the program usually proceeds with no harmful
consequences.
One important number to consider in roundoff error analysis is the
machine precision, 𝜖 mach , which represents the precision of the computer
calculations. This is the smallest positive number 𝜖 such that

1+𝜖 > 1 (3.10)

when calculated using a computer. Typically, double precision divides
the 64-bit representation into 1 bit for the sign, 11 bits for the
exponent, and 52 bits for the significand. Thus, when using double
precision, 𝜖 mach = 2⁻⁵² ≈ 2.2 × 10⁻¹⁶. A double-precision number therefore
has about 16 digits of precision, and a relative representation error of
up to 𝜖 mach may occur.
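The machine precision is easy to measure directly. This minimal sketch (our own, for illustration) halves a trial value until adding it to 1 no longer changes the result, and compares the estimate with the value reported by NumPy.

```python
import numpy as np

# Estimate machine precision: halve eps until 1 + eps/2 is indistinguishable from 1.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                  # approximately 2.2e-16 in double precision
print(np.finfo(float).eps)  # the same value, reported by NumPy
```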

Example 3.4: Machine precision

Suppose that three decimal digits are available to represent a number
(and that we use base 10 for simplicity). Then, 𝜖mach = 0.005, because any
number smaller than this results in 1 + 𝜖 = 1 when rounded to three digits. For
example, 1.00 + 0.00499 = 1.00499, which rounds to 1.00. On the other hand,
1.00 + 0.005 = 1.005, which rounds to 1.01 and satisfies Eq. 3.10.

Example 3.5: Relative representation error

If we try to store 24.11 using three digits, we get 24.1. The relative error is

(24.11 − 24.1) / 24.11 ≈ 0.0004,    (3.11)

which is lower than the maximum possible representation error of 𝜖 mach = 0.005
established in Ex. 3.4.

When performing an operation with numbers that contain errors,
the result is subject to a propagated error. For multiplication and division,
the relative propagated error is approximately the sum of the relative
errors of the respective two operands.
For addition and subtraction, an error can occur even when the
two operands are represented exactly. Before addition and subtraction,
the computer must convert the two numbers to the same exponent.
When adding numbers with different exponents, several digits from
the small number will vanish (see Fig. 3.5). If the difference in the two
exponents is greater than the magnitude of the exponent of 𝜖mach , the
small number will vanish completely—a consequence of Eq. 3.10. The
relative error incurred in addition is still around 𝜖 mach .

Figure 3.5: Adding or subtracting numbers of differing exponents results in a loss in the number of digits corresponding to the difference in the exponents.

Subtraction, on the other hand, can incur much greater relative
errors when subtracting two numbers that have the same exponent and
are close to each other. In this case, the digits that match between the
two numbers cancel each other and reduce the number of significant
digits. When the relative difference between two numbers is less than
machine precision, all digits match and the subtraction result is zero (see
Fig. 3.6). This is called subtractive cancellation and is a prevailing issue
when approximating derivatives via finite differences (see Section 6.4).

Figure 3.6: Subtracting two numbers that are close to each other results in a loss of the digits that match.
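Subtractive cancellation is easy to reproduce. The following minimal sketch (the particular numbers are our own choice, and the printed values in the comments are approximate) subtracts two nearly equal double-precision numbers and shows how few significant digits of the difference survive.

```python
a = 1.0 + 1e-13
b = 1.0
exact = 1e-13

computed = a - b
# The matching leading digits cancel, so only a few significant digits remain.
print(computed)                       # roughly 9.99e-14 instead of exactly 1e-13
print(abs(computed - exact) / exact)  # relative error on the order of 1e-3
```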

Sometimes, small roundoff errors can propagate and result in much
larger errors. This can happen when a problem is ill-conditioned or
when the algorithm used to solve the problem is unstable. In both
cases, small changes in the inputs cause large changes in the output. Ill-
conditioning is not a consequence of the finite-precision computations,
but is a characteristic of the model itself. Stability is a property of the
algorithm used to solve the problem. When a problem is ill conditioned,
it is challenging to solve it in a stable way. When a problem is well
conditioned, there is a stable way to solve it, but there may still be
algorithms that are unstable.

Example 3.6: Effect of roundoff error on function representation

Let us examine the function 𝑓 (𝑥) = 𝑥² − 4𝑥 + 4 near the minimum, which is
at 𝑥 = 2. If we use double precision and plot many points in a small interval,
we can see that the function exhibits the step pattern shown in Fig. 3.7. The
numerical minimum of this function is anywhere in the interval around 𝑥 = 2
where the numerical value is zero. This interval is much larger than the machine
precision (𝜖mach = 2.2 × 10⁻¹⁶). An additional error is incurred in the function
computation around 𝑥 = 2 due to subtractive cancellation.

Figure 3.7: With double precision, the minimum of this quadratic function is in an interval much larger than machine zero.
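The step pattern of Fig. 3.7 can be observed without plotting. This minimal sketch (the evaluation points are our own choice) evaluates the quadratic at closely spaced points around the minimum.

```python
import numpy as np

def f(x):
    return x**2 - 4 * x + 4

# Evaluate the function at closely spaced points around the minimum at x = 2.
for xi in 2 + np.linspace(-5e-8, 5e-8, 11):
    print(f"{xi:.10f}  {f(xi):.3e}")
# Several of the values near x = 2 may print as exactly zero, and the nonzero
# values change in discrete steps: the quadratic cannot be resolved within
# roughly sqrt(eps) of the minimum.
```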
3.5.2 Truncation Errors

In the most general sense, truncation errors arise from performing a
finite number of operations where an infinite number of operations
would be required to get an exact result.† Truncation errors would arise
even if we could do the arithmetic with infinite precision.

† Roundoff error, discussed in the previous section, is sometimes also referred to as truncation error because digits are truncated, but we avoid this confusing naming and only use truncation error to refer to a truncation in the number of operations.


When discretizing a mathematical model with partial derivatives, the
derivatives are approximated by truncated Taylor series expansions that
ignore higher-order terms. When the model includes integrals, they
are approximated as finite sums. In either case, a mesh of points
where the relevant states and functions are evaluated is required.
Discretization errors generally decrease as the spacing between the
points decreases.

Tip 3.7: Perform a mesh refinement study

When using a model that depends on a mesh, it is a good idea to perform a
mesh refinement study. This involves solving the model for increasingly finer
meshes to check whether the metrics of interest converge in a stable way and to
verify that the convergence rate is as expected for the numerical discretization
scheme that is used. Such a study is also useful to find out which mesh provides the
best compromise between computational time and accuracy, which is especially
important in numerical optimization because of the number of times that the
model is solved.

3.5.3 Iterative Solver Tolerance Error


Many methods for solving numerical models involve an iterative procedure
that starts with a guess for the states 𝑢 and then improves that
guess at each iteration until reaching a specified convergence tolerance.
The convergence is usually measured by a norm of the residuals, ||𝑟(𝑢)||,
which we want to drive to zero. Iterative linear solvers and Newton-type
solvers are examples of iterative methods (see Section 3.7).
A typical convergence history for an iterative solver is shown in
Fig. 3.8. The norm of the residuals decreases gradually until a limit is
reached (near 10⁻¹⁰ in this case). This limit represents the lowest error
that can be achieved with the iterative solver and is determined by other
sources of error such as roundoff and truncation errors. If we terminate
before reaching the limit (either by setting a convergence tolerance to a
value higher than 10⁻¹⁰ or by setting an iteration limit to fewer than 400
iterations), we incur an additional error. However, it might be desirable
to trade off a less precise solution for a lower computational effort.

Figure 3.8: Norm of the residuals versus the number of iterations for an iterative solver.

Tip 3.8: Find the level of the numerical noise in your model.

It is important to know the level of error in your model because this limits
the type of optimizer you can use and how well you can optimize. In Ex. 3.6,
we saw that if we plot a function at a small enough scale, we can see discrete
steps in the function due to roundoff errors. When accumulating all sources of
error in a more elaborate model (roundoff, truncation, and iterative), we no
longer have a neat step pattern. Instead, we get numerical noise, as shown in
Fig. 3.9. The level of noise can be estimated by the amplitude of the oscillations
and gives us the order of magnitude of the total numerical error.

Figure 3.9: To find the level of numerical noise of a function of interest with respect to an input parameter (left), we magnify both axes by several orders of magnitude and evaluate the function at points that are closely spaced (right).

3.5.4 Programming Errors


Most of the literature on numerical methods is too optimistic and does
not explicitly discuss programming errors, which are commonly known
as bugs. Most programmers, especially beginners, underestimate the
likelihood that their code has bugs.
It is helpful to adopt sound programming practices, such as writing
clear code that is modular. Clear code has consistent formatting, mean-
ingful naming of variables and functions, and helpful comments. Modular
code re-uses and generalizes functions as much as possible and avoids
copying and pasting sections of code.53

53. Wilson et al., Best Practices for Scientific Computing. 2014
There are different types of bugs that are relevant to numerical
models: generic programming errors, incorrect memory handling, and
algorithmic or logical errors.
Programming errors are the most frequent and include typos, type
errors, copy-and-paste errors, faulty initialization, missing switch cases,
and default values. In theory, these errors can be avoided through
careful programming and code inspection, but in practice, you must
always test your code. The testing involves comparing your result with
a reference result, that is, a case for which you know the solution. You
should start with the simplest representative problem and then build
up from that. An interactive debugger is a helpful tool that enables you
to observe what the code is doing at runtime and check intermediate
variable values.

Tip 3.9: Assume there are bugs in your code.

The overall attitude towards programming should be that all code has bugs
until it is verified through testing.

Memory handling issues are much less frequent than programming
errors, but they are usually more difficult to track. These issues include
memory leaks (a failure to free unused memory), incorrect use of
memory management, buffer overruns (such as array bound violations),
and reading uninitialized memory. Memory issues are difficult to track
because they can result in strange behavior in parts of the code that
are far from the source of the error. In addition, they might manifest
themselves in specific conditions that are hard to reproduce consistently.
Memory debuggers are essential tools for addressing memory issues.
They perform a detailed bookkeeping of all allocation, deallocation,
and memory access to detect and locate any irregularities.‡

‡ See Grotker et al.54 for more details on how to debug and profile code.
54. Grotker et al., The Developer’s Guide to Debugging. 2012

Whereas programming errors are due to a mismatch between the
programmer’s intent and what is coded, the root cause of algorithmic
or logical errors is in the programmer’s intent itself. Again, testing is
the key to finding these errors, but you must be sure that the reference
result is correct.
Running the analysis within an optimization loop has the potential
to reveal bugs that do not manifest themselves in a single analysis.
Therefore, after (and only after!) you have tested the analysis code in
isolation should you run an optimization test case.
As previously mentioned, there is a higher incentive to reduce the
computational cost of an analysis when it runs in an optimization loop
because it will run many times. When you first write your code, you
should prioritize clarity and correctness as opposed to speed. Once the
code is verified through testing, you should identify any bottlenecks
using a performance profiling tool. Memory performance issues can
also arise from running the analysis in an optimization loop as opposed
to running a single case. In addition to running a memory debugger,
you can also run a memory profiling tool to identify opportunities to
reduce memory usage.

3.6 Rate of Convergence

Iterative solvers compute a sequence of approximate solutions that hopefully
converge to the exact solution. When characterizing convergence,
we need to first establish whether the algorithm converges and, if so,
how fast it converges. The first characteristic relates to the stability of
the algorithm. Here, we focus on the second characteristic, which is
quantified through the rate of convergence.
The cost of iterative algorithms is often measured by counting
the number of iterations required to achieve the solution. Iterative
algorithms often require an infinite number of iterations to converge to
the exact solution. In practice, we want to converge to an approximate
solution that is close enough to the exact one. Determining the rate of
convergence arises from the need to quantify how fast the approximate
solution is approaching the exact one.
In the following, we assume that we have a sequence of points,
𝑥 (0) , 𝑥 (1) , . . . , 𝑥 (𝑘) , . . . , that represent approximate solutions in the form
of vectors in any dimension and converge to a solution 𝑥 ∗ . Then,

lim_{𝑘→∞} ||𝑥 (𝑘) − 𝑥 ∗ || = 0,    (3.12)

which means that the norm of the error tends to zero as the number of
iterations tends to infinity. This sequence converges with order 𝑟 when
𝑟 is the largest number that satisfies

0 ≤ lim_{𝑘→∞} ||𝑥 (𝑘+1) − 𝑥 ∗ || / ||𝑥 (𝑘) − 𝑥 ∗ ||^𝑟 = 𝛾 < ∞,    (3.13)

where 𝑟 is the asymptotic convergence rate, and 𝛾 is the asymptotic convergence
error constant. “Asymptotic” here refers to the fact that we care
mostly about the behavior in the limit. There is no guarantee that the
initial and intermediate iterations satisfy this condition.
To avoid dealing with limits, let us assume that condition (3.13)
applies to all iterations. We can relate the error from one iteration to
the next by
||𝑥 (𝑘+1) − 𝑥 ∗ || = 𝛾 ||𝑥 (𝑘) − 𝑥 ∗ ||^𝑟 .    (3.14)
When 𝑟 = 1, we have linear convergence, and when 𝑟 = 2, we have
quadratic convergence. Quadratic convergence is a highly valued
property for an algorithm and in practice higher rates of convergence
are usually not considered.
When we have linear convergence, then

||𝑥 (𝑘+1) − 𝑥 ∗ || = 𝛾||𝑥 (𝑘) − 𝑥 ∗ ||. (3.15)

In this case the convergence is highly dependent on the value of the
asymptotic constant 𝛾. If 𝛾 > 1, then the sequence diverges—a situation
to be avoided. If 0 < 𝛾 < 1, then the norm of the error decreases by a
constant factor for every iteration. If 𝛾 = 0.1, for example, and we start
with an initial error norm of 0.1, we get the sequence,

10⁻¹ , 10⁻² , 10⁻³ , 10⁻⁴ , 10⁻⁵ , 10⁻⁶ , 10⁻⁷ , . . . .    (3.16)

Thus, after six iterations, we get six-digit accuracy. Now suppose that
𝛾 = 0.9. Then we would have

10⁻¹ , 9.0 × 10⁻² , 8.1 × 10⁻² , 7.3 × 10⁻² , 6.6 × 10⁻² , 5.9 × 10⁻² , 5.3 × 10⁻² , . . . ,    (3.17)

which corresponds to only one-digit accuracy after six iterations. It
would take 131 iterations to achieve the same six-digit accuracy.
When we have quadratic convergence, then

||𝑥 (𝑘+1) − 𝑥 ∗ || = 𝛾 ||𝑥 (𝑘) − 𝑥 ∗ ||² .    (3.18)

If 𝛾 = 1, then the error norm sequence with a starting error norm of 0.1
would be
10⁻¹ , 10⁻² , 10⁻⁴ , 10⁻⁸ , . . . .    (3.19)
Thus we achieve more than six digits of accuracy in just three iterations!
In this case, the number of correct digits doubles at every iteration.
For 𝛾 > 1, the convergence might not be as good, but the series is still
convergent.
If 𝑟 ≥ 1 and 𝛾 → 0, we have superlinear convergence, which includes
quadratic and higher rates of convergence. There is a special case of
superlinear convergence that is relevant for optimization algorithms,
which is when 𝑟 = 1. This case is desirable because in practice it
behaves similarly to quadratic convergence and can be achieved by
gradient-based algorithms that use first derivatives (as opposed to
second derivatives). In this case, we can write

||𝑥 (𝑘+1) − 𝑥 ∗ || = 𝛾(𝑘) ||𝑥 (𝑘) − 𝑥 ∗ ||, (3.20)

where lim_{𝑘→∞} 𝛾 (𝑘) = 0. Now we need to consider a sequence of values
for 𝛾 that tends to zero. For example, if 𝛾 = 1/(𝑘 + 1), starting with an
error norm of 0.1, we get,

10⁻¹ , 5 × 10⁻² , 1.7 × 10⁻² , 4.2 × 10⁻³ , 8.3 × 10⁻⁴ , 1.4 × 10⁻⁴ , 2.0 × 10⁻⁵ , . . . .    (3.21)

Thus, we have achieved four-digit accuracy after six iterations. This
special case of superlinear convergence is not quite as good as quadratic
convergence, but it is better than either of the linear convergence
examples above.
We plot the sequences above in Fig. 3.10. Since the points are just
scalars and the exact solution is zero, the norm of the error is just
𝑥 (𝑘) . The first plot uses a linear scale, so we cannot see any differences
beyond two digits. To examine the differences more carefully, we need
to use a logarithmic axis for the sequence values, as shown on the right
plot. In this scale, each decrease in order of magnitude represents
one more digit of accuracy. We can see how the linear convergence
shows up as a straight line in this plot, but the slope of the line varies
widely, depending on the value of the asymptotic constant. Quadratic
convergence exhibits an increasing slope, reflecting the doubling of
digits for each iteration. The superlinear sequence exhibits poorer
convergence than the best linear one, but we can see that the slope of
the superlinear curve is increasing, which means that for a high enough
𝑘, it will converge at a higher rate than the linear one.

Figure 3.10: Sample sequences for linear, superlinear, and quadratic cases plotted in a linear scale (left) and logarithmic scale (right).

Tip 3.10: Use a logarithmic scale when plotting convergence

When using a linear scale plot, you can only see differences in two significant
digits. To reveal changes beyond three digits, you should use a logarithmic
scale. This need occurs frequently in plotting the convergence behavior of
optimization algorithms.

When solving numerical models, we can monitor the norm of the
residual. Because we know that the residuals should be zero for an
exact solution, the norm of the error is simply the norm of the residual,
||𝑟(𝑢 𝑘 )||.
If we monitor another quantity, we do not usually know the exact
solution. In these cases, we can use the ratio of the step lengths of each
iteration:
||𝑥 (𝑘+1) − 𝑥 ∗ || / ||𝑥 (𝑘) − 𝑥 ∗ || ≈ ||𝑥 (𝑘+1) − 𝑥 (𝑘) || / ||𝑥 (𝑘) − 𝑥 (𝑘−1) || ,    (3.22)
The rate of convergence can then be estimated numerically with the
values of the last available four iterates using

𝑟 ≈ log( ||𝑥 (𝑘+1) − 𝑥 (𝑘) || / ||𝑥 (𝑘) − 𝑥 (𝑘−1) || ) / log( ||𝑥 (𝑘) − 𝑥 (𝑘−1) || / ||𝑥 (𝑘−1) − 𝑥 (𝑘−2) || ) .    (3.23)
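Eq. 3.23 is straightforward to apply to a solver's iteration history. The following is a minimal sketch; the helper name and the toy sequence (whose error squares at every iteration) are our own choices for illustration.

```python
import numpy as np

def convergence_rate(x_hist):
    """Estimate the rate of convergence (Eq. 3.23) from the last four iterates.

    x_hist is a sequence of iterates; each iterate can be a scalar or a vector."""
    x3, x2, x1, x0 = (np.atleast_1d(x) for x in x_hist[-4:])
    num = np.log(np.linalg.norm(x0 - x1) / np.linalg.norm(x1 - x2))
    den = np.log(np.linalg.norm(x1 - x2) / np.linalg.norm(x2 - x3))
    return num / den

# Toy quadratically convergent sequence: the error squares at each iteration.
seq = [1e-1, 1e-2, 1e-4, 1e-8, 1e-16]
print(convergence_rate(seq))  # approximately 2
```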

Finally, we can also monitor any quantity by taking the step length
and normalizing it in the same way as Eq. 3.8,

||𝑥 (𝑘+1) − 𝑥 (𝑘) || / (1 + ||𝑥 (𝑘) ||) .    (3.24)

3.7 Overview of Solvers

There are a number of methods available for solving the discretized
governing equations (3.1). We want to solve the governing equations
for a fixed set of design variables, so 𝑥 will not appear in the
solution algorithms. Our objective is to find the state variables 𝑢 such
solution algorithms. Our objective is to find the state variables 𝑢 such
that 𝑟(𝑢) = 0.
This is not a book about solvers, but it is important to understand the
characteristics of these solvers because they affect the cost and precision
of the function evaluations in the overall optimization process. Thus,
we provide an overview and some of the most relevant details in this
section. In addition, the solution of coupled systems builds on these
solvers, as we will see in Section 13.3. Finally, some of the optimization
algorithms detailed in later chapters use these solvers.§

§ Ascher et al.55 provides a more detailed introduction to the numerical methods mentioned in this chapter.
55. Ascher et al., A first course in numerical methods. 2011

There are two main types of solvers, depending on whether the
equations to be solved are linear or nonlinear (Fig. 3.11). Linear solution
methods solve systems of the form 𝑟(𝑢) = 𝐴𝑢 − 𝑏 = 0, where the matrix
𝐴 and the vector 𝑏 do not depend on 𝑢. Nonlinear methods can handle
any algebraic system of equations that can be written as 𝑟(𝑢) = 0.
Linear systems can be solved directly or iteratively. Direct meth-
ods are based on the concept of Gaussian elimination, which can be
expressed in matrix form as a decomposition into lower and upper tri-
angular matrices that are easier to solve (𝐿𝑈 decomposition). Cholesky
decomposition is a variant of 𝐿𝑈 decomposition that applies only to
symmetric positive-definite matrices.
While direct solvers obtain the solution 𝑢 at the end of a process,
iterative solvers start with a guess for 𝑢 and successively improve it with
each iteration through explicit expressions that are easy to compute, as
illustrated in Fig. 3.12. Iterative methods can be fixed-point iterations,
such as Jacobi, Gauss–Seidel, and successive over-relaxation (SOR), or
Krylov subspace methods, such as the conjugate gradient (CG) and
generalized minimum residual (GMRES) methods.¶ Direct solvers
are well established and are included in the standard libraries for
most programming languages. Iterative solvers are less widespread in
standard libraries, but they are becoming more commonplace.

¶ See Saad56 for more details on iterative methods in the context of large-scale numerical models.
56. Saad, Iterative Methods for Sparse Linear Systems. 2003

Figure 3.11: Overview of solution methods for linear and nonlinear systems. Linear systems can be solved directly (LU or QR decomposition) or iteratively, using fixed-point methods (Jacobi, Gauss–Seidel, SOR) or Krylov subspace methods (CG, GMRES). Nonlinear systems can be solved with Newton's method combined with a linear solver, or with nonlinear variants of fixed-point methods.

Tip 3.11: Do not compute the inverse of 𝐴.

Because some numerical libraries have functions to compute 𝐴⁻¹, you
might be tempted to do this and then multiply by a vector to compute 𝑢 = 𝐴⁻¹𝑏.
This is a bad idea because forming the inverse is computationally expensive.
Instead, use 𝐿𝑈 decomposition or another method from Fig. 3.11.

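As a minimal illustration of Tip 3.11, the sketch below solves a small made-up system 𝐾𝑢 = 𝑓 in three ways. The 3 × 3 matrix and right-hand side are arbitrary values chosen for this example, and the explicit factorization route is useful when the same matrix is reused for many right-hand sides.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Hypothetical small stiffness-like system K u = f (values are arbitrary).
K = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
f = np.array([1.0, 2.0, 3.0])

u_bad = np.linalg.inv(K) @ f    # works, but forms the inverse explicitly (avoid this)
u_good = np.linalg.solve(K, f)  # LU-based solve: cheaper and more accurate

# If the same matrix is reused for many right-hand sides, factorize once.
lu, piv = lu_factor(K)
u_again = lu_solve((lu, piv), f)
```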
Direct methods are the right choice for many problems because they
are generally robust. However, for large systems where 𝐴 is sparse, the
cost of direct methods can become prohibitive, while iterative methods
remain viable. Iterative methods have other advantages, such as being
able to trade between computational cost and precision, and to restart
from a good guess (see Appendix B for details).

Figure 3.12: While direct methods only yield the solution at the end of the process, iterative methods produce approximate intermediate results.
When it comes to nonlinear solvers, the most efficient methods are
based on Newton’s method, which we explain later in this chapter
(Section 3.8). Newton’s method solves a sequence of problems that are
linearizations of the nonlinear problem about the current iterate. The
linear problem at each Newton iteration can be solved using any linear
solver, as indicated by the incoming arrow in Fig. 3.11. Although
efficient, Newton’s method is not robust in that it does not always
converge. Therefore, it requires modifications so that it can converge
reliably.

Finally, it is possible to adapt linear fixed-point iteration methods
to solve nonlinear equations as well. However, unlike the linear case, it
might not be possible to derive explicit expressions for the iterations in
the nonlinear case. For this reason, fixed-point iteration methods are
not usually the best choice for solving a system of nonlinear equations.
However, as we will see in Section 13.3.4, these methods are useful for
solving systems of coupled nonlinear equations.
For time-dependent problems, we require a way to solve for the
time history of the states, 𝑢(𝑡). As mentioned in Section 3.4, the most
popular approach is to decouple the temporal discretization from the
spatial one. By discretizing a PDE in space first, this method formulates
an ODE in time of the form

d𝑢/d𝑡 = −𝑟(𝑢),    (3.25)
which is called the semi-discrete form. A time-integration scheme is
then used to solve for the time history. A time-integration scheme
can be either explicit or implicit, depending on whether it involves
evaluating explicit expressions, or requires solving implicit equations.
If a system under a certain condition has a steady state, these techniques
can be used to solve for the steady state (d𝑢/d𝑡 = 0).
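To illustrate the semi-discrete form, here is a minimal sketch of an explicit time-integration scheme (forward Euler) applied to a toy single-state residual of our own choosing; it stands in for the spatially discretized residual of a real model and is not taken from the text.

```python
import numpy as np

def r(u):
    """Toy residual for du/dt = -r(u); its steady state is u = 1."""
    return u - 1.0

# Explicit (forward Euler) time integration: each step evaluates an explicit expression.
u = np.array([3.0])  # assumed initial condition
dt = 0.1
for _ in range(100):
    u = u - dt * r(u)

print(u)  # approaches the steady state u = 1, where du/dt = 0
```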

3.8 Newton-based Solvers

As mentioned in Section 3.7, Newton’s method is the basis for many
nonlinear equation solvers. Newton’s method is also at the core of the
most efficient gradient-based optimization algorithms, so we explain it
here in more detail. We start with the single-variable case for simplicity,
and then generalize it to the 𝑛-dimensional case.
We want to find 𝑢 ∗ such that 𝑟(𝑢 ∗ ) = 0, where for now, 𝑟 and 𝑢 are
scalars. Newton’s method estimates a solution at each iteration 𝑢 (𝑘) by
approximating 𝑟(𝑢 (𝑘) ) to be a linear function. The linearization is done
by taking a Taylor series of 𝑟 about 𝑢 (𝑘) and truncating it to obtain

𝑟(𝑢 (𝑘) + Δ𝑢) ≈ 𝑟(𝑢 (𝑘) ) + Δ𝑢 𝑟′(𝑢 (𝑘) ),    (3.26)

where 𝑟′ ≡ d𝑟/d𝑢 and 𝑟 (𝑘) ≡ 𝑟(𝑢 (𝑘) ). Now we can use this to find the
step Δ𝑢 that makes the approximate residual zero,

𝑟 (𝑘) + Δ𝑢 𝑟′(𝑘) = 0   ⇒   Δ𝑢 = −𝑟 (𝑘) / 𝑟′(𝑘) ,    (3.27)

where we need to assume that 𝑟′(𝑘) ≠ 0.



Thus, the update for each step in Newton’s algorithm is

𝑢 (𝑘+1) = 𝑢 (𝑘) − 𝑟 (𝑘) / 𝑟′(𝑘) .    (3.28)
If 𝑟′(𝑘) = 0, the algorithm will not converge because it yields a step to
infinity. Small enough values of 𝑟′(𝑘) also cause an issue with large
steps, but the algorithm might still converge.

Example 3.12: Newton’s method for a single variable

Suppose we want to solve the equation 𝑟(𝑢) = 2𝑢³ + 4𝑢² + 𝑢 − 2 = 0. Since
𝑟′(𝑢) = 6𝑢² + 8𝑢 + 1, the Newton iteration is

𝑢 (𝑘+1) = 𝑢 (𝑘) − ( 2(𝑢 (𝑘) )³ + 4(𝑢 (𝑘) )² + 𝑢 (𝑘) − 2 ) / ( 6(𝑢 (𝑘) )² + 8𝑢 (𝑘) + 1 ) .    (3.29)
When we start with the guess 𝑢 (0) = 1.5 (left plot), the iterations are well
behaved and the method converges quadratically. We can see the geometric
interpretation of Newton’s method: For each iteration, it takes the tangent to
the curve and finds the intersection with 𝑟 = 0.
When we start with 𝑢 (0) = −0.5 (right plot), the first step goes in the wrong
direction, but recovers in the second iteration. The third iteration is close to the
point with zero derivative and takes a large step. In this case, the iterations
recover and then converge normally. However, we can easily envision a case
where an iteration is much closer to the point with zero derivative, causing an
arbitrarily long step.

Figure 3.13: Newton iterations starting from different starting points.
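Here is a minimal sketch of the single-variable Newton iteration (Eq. 3.28) applied to the residual of this example; the function names, tolerance, and iteration limit are our own choices, not part of the text.

```python
def newton_scalar(r, rprime, u0, tol=1e-12, max_iter=50):
    """Newton's method for a single equation r(u) = 0 (Eq. 3.28)."""
    u = u0
    for _ in range(max_iter):
        if abs(r(u)) < tol:
            break
        u = u - r(u) / rprime(u)  # Newton update
    return u

r = lambda u: 2 * u**3 + 4 * u**2 + u - 2
rprime = lambda u: 6 * u**2 + 8 * u + 1

print(newton_scalar(r, rprime, 1.5))  # converges to u* near 0.54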

Newton’s method converges quadratically when close enough to
the solution, with a convergence constant of

𝛾 = 𝑟″(𝑢 ∗ ) / ( 2 𝑟′(𝑢 ∗ ) ) .    (3.30)
This means that if the derivative is close to zero or the curvature tends
to a large number at the solution, Newton’s method will not converge
as well or not at all.

Now we consider the general case where we have 𝑛 nonlinear
equations and 𝑛 unknowns, expressed as 𝑟(𝑢) = 0. Similarly to the
single-variable case, we derive the Newton step from a truncated Taylor
series. However, the Taylor series needs to be multidimensional in
both the independent variable and the function. Consider first the
multidimensionality of the independent variable, 𝑢, for a component
of the residuals, 𝑟 𝑖 (𝑢). The first two terms of the Taylor series about
𝑢 (𝑘) for a step Δ𝑢 (which is now a vector with arbitrary direction and
magnitude) are
𝑟 𝑖 (𝑢 (𝑘) + Δ𝑢) ≈ 𝑟 𝑖 (𝑢 (𝑘) ) + ∑_{𝑗=1}^{𝑛} ( 𝜕𝑟 𝑖 /𝜕𝑢 𝑗 )|_{𝑢=𝑢 (𝑘)} Δ𝑢 𝑗 .    (3.31)

Since we have 𝑛 residuals, 𝑖 = 1, . . . , 𝑛, we can write the second
term in matrix form as 𝐽Δ𝑢, where 𝐽 is a square matrix whose elements
are

𝐽 𝑖𝑗 = 𝜕𝑟 𝑖 /𝜕𝑢 𝑗 .    (3.32)
This is called the Jacobian matrix.
Similarly to the single-variable case, we want to find the step that
makes the two terms zero, which yields the linear system,

𝐽 (𝑘) Δ𝑢 (𝑘) = −𝑟 (𝑘) , (3.33)

After solving this linear system, we can update the solution to

𝑢 (𝑘+1) = 𝑢 (𝑘) + Δ𝑢 (𝑘) . (3.34)

Thus, Newton’s method involves solving a sequence of linear systems
given by Eq. 3.33. The linear system can be solved using any of
the methods for solving linear systems mentioned in Section 3.7. One
popular option for solving for the Newton step is the Krylov method,
which results in the Newton–Krylov method for solving nonlinear
systems. Because the Krylov method only requires matrix-vector
products of the form [𝜕𝑟/𝜕𝑢]𝑣, we can avoid computing and storing the
Jacobian by computing this product directly (using finite differences or
other methods from Chapter 6).
The multivariable version of Newton’s method is subject to the same
issues we uncovered for the single-variable case: it only converges
if the starting point is within a certain region and it can be subject
to ill conditioning. Newton’s method can be modified to increase
the likelihood of convergence from any starting point, as we will see
in Chapter 4. The ill-conditioning issue has to do with the linear
system (3.33) and can be quantified by the condition number of the
Jacobian matrix. Ill-conditioning can be addressed by scaling and
preconditioning.

Example 3.13: Newton’s method applied to two nonlinear equations.

Suppose we have the nonlinear system of two equations

𝑢2 = 1/𝑢1 ,    𝑢2 = √𝑢1 .    (3.35)

This corresponds to the two lines shown in Fig. 3.14, where the solution is at
their intersection, 𝑢 = (1, 1). (In this example, the two equations are explicit
and we could solve them by substitution, but they could have been implicit.)
To solve this using Newton’s method, we need to write these as residuals,

𝑟1 = 𝑢2 − 1/𝑢1 = 0,
𝑟2 = 𝑢2 − √𝑢1 = 0.    (3.36)

The Jacobian can be derived analytically, and the Newton step is given by the
linear system

⎡  1/𝑢1²        1 ⎤ ⎡ Δ𝑢1 ⎤       ⎡ 𝑢2 − 1/𝑢1 ⎤
⎢                 ⎥ ⎢     ⎥  = −  ⎢           ⎥ .    (3.37)
⎣ −1/(2√𝑢1 )    1 ⎦ ⎣ Δ𝑢2 ⎦       ⎣ 𝑢2 − √𝑢1  ⎦

Figure 3.14: Newton iterations.

Starting from 𝑢 = (2, 3) yields the iterations shown below, with the quadratic
convergence shown in Fig. 3.15.
𝑢1         𝑢2         ||𝑢 − 𝑢 ∗ ||     ||𝑟||
2.000000   3.000000   2.23            2.50
0.485281   0.878679   5.28 × 10⁻¹     2.50
0.760064   0.893846   2.62 × 10⁻¹     1.18
0.952668   0.982278   5.05 × 10⁻²     4.21 × 10⁻¹
0.998289   0.999417   1.80 × 10⁻³     6.74 × 10⁻²
0.999998   0.999999   2.31 × 10⁻⁶     2.29 × 10⁻³
1.000000   1.000000   3.81 × 10⁻¹²    2.93 × 10⁻⁶
1.000000   1.000000   0.0             4.83 × 10⁻¹²
1.000000   1.000000   0.0             0.0

Figure 3.15: The norm of the residual exhibits quadratic convergence.
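The same iteration generalizes directly to the multivariable case by solving the linear system of Eq. 3.33 at each step. The following minimal sketch applies it to this example; the helper name, tolerance, and iteration limit are our own assumptions.

```python
import numpy as np

def newton_system(r, J, u0, tol=1e-12, max_iter=20):
    """Newton's method for r(u) = 0: solve J du = -r at each iteration (Eq. 3.33)."""
    u = np.array(u0, dtype=float)
    for _ in range(max_iter):
        if np.linalg.norm(r(u)) < tol:
            break
        du = np.linalg.solve(J(u), -r(u))
        u = u + du
    return u

# Residuals and Jacobian for Ex. 3.13.
r = lambda u: np.array([u[1] - 1 / u[0],
                        u[1] - np.sqrt(u[0])])
J = lambda u: np.array([[1 / u[0]**2, 1.0],
                        [-1 / (2 * np.sqrt(u[0])), 1.0]])

print(newton_system(r, J, [2.0, 3.0]))  # converges to [1, 1]
```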

3.9 Models and the Optimization Problem

Figure 3.16: For a general model, the state variables 𝑢 are implicit functions of the design variables 𝑥 through the solution of the governing equations.

When performing design optimization, we ultimately need to compute
the values of the objective and constraint functions in the optimization
problem (1.4). There is typically an intermediate step that requires
problem (1.4). There is typically an intermediate step that requires
through the solution of the govern-
solving the governing equations for the given design 𝑥 at one or more ing equations.

specific conditions. The governing equations define the state variables


𝑢 as an implicit function of 𝑥, as illustrated in Fig. 3.16.
The objective and constraints are typically explicit functions of the
state variables. In addition to depending implicitly on the design
variables through the state variables, these functions can also include
𝑥 explicitly, as indicated by the 𝑥 in 𝑓 (𝑥, 𝑢). The dependency of these
functions on the state and design variables is illustrated in Fig. 3.17,
which is a more detailed version of Fig. 1.11. In design optimization
applications, the solution of the governing equations is usually the
most computationally intensive part of the optimization problem.
When we first introduced the general optimization problem (1.4),
the governing equations were not included because they were assumed
to be part of the computation of the objective and constraints for a
given 𝑥. However, we can include them in the problem statement for
completeness as follows

minimize 𝑓 (𝑥)
by varying 𝑥𝑖 𝑖 = 1, . . . , 𝑛 𝑥
subject to 𝑔 𝑗 (𝑥, 𝑢) ≤ 0 𝑗 = 1, . . . , 𝑛 𝑔
ℎ 𝑘 (𝑥, 𝑢) = 0 𝑘 = 1, . . . , 𝑛 ℎ (3.38)
x̲ 𝑖 ≤ 𝑥 𝑖 ≤ x̄ 𝑖    𝑖 = 1, . . . , 𝑛 𝑥
while solving 𝑟 𝑙 (𝑥, 𝑢) = 0 𝑙 = 1, . . . , 𝑛𝑢
by varying 𝑢𝑙 𝑙 = 1, . . . , 𝑛𝑢

Here, “while solving” means that the governing equations are solved
at each optimization iteration to find a valid 𝑢 for each value of 𝑥.
Figure 3.17: The computation of the objective ( 𝑓 ) and constraint functions (𝑔, ℎ) for a given set of design variables (𝑥) usually involves the solution of a numerical model (𝑟 = 0) by varying the state variables (𝑢).

Example 3.14: Structural sizing optimization.

Recalling the truss problem of Ex. 3.3, suppose we want to minimize the
mass of the structure (𝑚) by varying the cross-sectional areas of the trusses (𝑥),
subject to stress constraints. We can write the problem statement as

minimize       𝑚(𝑥)
by varying     𝑥 𝑗 ≥ 𝑥min                     𝑗 = 1, . . . , 15
subject to     |𝜎 𝑗 (𝑥, 𝑢)| − 𝜎max ≤ 0        𝑗 = 1, . . . , 15    (3.39)
while solving  𝐾𝑢 − 𝑓 = 0
by varying     𝑢 𝑙                            𝑙 = 1, . . . , 18
The governing equations are a linear set of equations whose solution determines
the displacements of a given design (𝑥) for a given load condition ( 𝑓 ). We
mentioned previously that the objective and constraint functions are usually
explicit functions of the state variables, design variables, or both. As we saw in
Ex. 3.3, the mass is indeed an explicit function of the cross-sectional areas. In
this case, it does not even depend on the state variables. The constraint function
is also an explicit function, but in this case it is just a function of the state
variables. This example illustrates a common situation where the solution of
the state variables requires the solution of implicit equations (structural solver),
while the constraints (stresses) and objective (weight) are explicit functions of
the states and design variables.

From a mathematical point of view, the model governing equations
𝑟(𝑥, 𝑢) = 0 can be considered equality constraints in an optimization
problem. Some specialized optimization approaches add these equa-
tions to the optimization problem and let the optimization algorithm
solve both the governing equations and optimization simultaneously.
This is called a full-space approach, and is also known as simultaneous
analysis and design (SAND) or one-shot optimization. The approach is
stated as:
minimize 𝑓 (𝑥, 𝑢)
by varying 𝑥𝑖 𝑖 = 1, . . . , 𝑛 𝑥
𝑢𝑙 𝑙 = 1, . . . , 𝑛𝑢
subject to 𝑔 𝑗 (𝑥, 𝑢) ≤ 0 𝑗 = 1, . . . , 𝑛 𝑔 (3.40)
ℎ 𝑘 (𝑥, 𝑢) = 0 𝑘 = 1, . . . , 𝑛 ℎ
x̲ 𝑖 ≤ 𝑥 𝑖 ≤ x̄ 𝑖    𝑖 = 1, . . . , 𝑛 𝑥
𝑟 𝑙 (𝑥, 𝑢) = 0 𝑙 = 1, . . . , 𝑛𝑢
Unless otherwise stated, we assume that the optimization model governing
equations are solved by a dedicated solver for each optimization
iteration, as stated in Eq. 3.38.
More generally, the optimization constraints and equations in a
model are interchangeable. If a set of equations in a model can
be satisfied by varying a corresponding set of state variables, these
equations and variables can be moved to the optimization problem
statement as equality constraints and design variables, respectively.

Figure 3.18: In the full-space approach, the governing equations are solved by the optimizer by varying the state variables.

Example 3.15: Structural sizing optimization using a full-space approach.
If we wanted to solve this problem using a full-space approach, we would
forgo the linear solver by adding 𝑢 to the set of design variables and letting the
optimizer enforce the governing equations. This would result in the following
problem,
minimize 𝑚(𝑥)
by varying 𝑥 𝑗 ≥ 𝑥 min 𝑗 = 1, . . . , 15
𝑢 𝑙 𝑙 = 1, . . . , 18 (3.41)
subject to |𝜎 𝑗 (𝑥, 𝑢)| − 𝜎max ≤ 0 𝑗 = 1, . . . , 15
𝐾𝑢 − 𝑓 = 0.

3.10 Summary

It is essential to understand the models that compute the objective and
constraint functions because they directly impact the performance and
effectiveness of the optimization process.
The modeling process introduces several types of numerical errors
associated with each step of the process (discretization, programming,
computation). Knowing the level of numerical error is necessary to
establish what precision can be achieved in the optimization. Under-
standing the types of errors involved helps us find ways to reduce
those errors. Programming errors—“bugs”—are often underestimated;
thorough testing is required to verify that the numerical model is coded
correctly. A lack of understanding of a given model’s numerical errors
is often the cause of the failure in optimization, especially when using
gradient-based algorithms.
Modeling errors arise from discrepancies between the mathematical
model and the real physical system. While they do not affect the
optimization process’s performance and precision, modeling errors
affect the accuracy and thus determine how valid the result is in the real
world. Therefore, model validation and an understanding of modeling
error are also critical.
In engineering design optimization problems, the models usually
involve solving large sets of nonlinear implicit equations. The compu-
tational time required to solve these equations dominates the overall
optimization time, and therefore, solver efficiency is crucial. Solver
robustness is also crucial because optimization often asks for designs
that are very different from what a human designer would ask for,
which tests the limits of the model and the solver.
We presented an overview of the various types of solvers available
for linear and nonlinear equations. Newton-type methods are highly
desirable for solving nonlinear equations because they exhibit second-
order convergence. Because Newton-type methods involve solving a
linear system at each iteration, a linear solver is always required. These
solvers are also at the core of several of the optimization algorithms in
later chapters.

Problems

3.1 Answer true or false and justify your answer.

a) A model developed to perform well for analysis will generally do well in a numerical optimization process.
b) Modeling errors have nothing to do with computations.
c) Explicit and implicit equations can always be written in
residual form.
d) Subtractive cancellation is a type of roundoff error.
e) Programming errors can be eliminated by carefully reading
the code.
f) Quadratic convergence is only better than linear convergence
if the asymptotic convergence error constant is less than or equal to one.
g) Logarithmic scales are desirable when plotting convergence
because they show errors of all magnitudes.
h) Newton solvers always require a linear solver.
i) Some linear iterative solvers can be used to solve nonlinear
problems.
j) Direct methods allow us to trade between computational
cost and precision, while iterative methods do not.
k) Newton’s method requires the derivatives of all the state
variables.
l) In the full-space optimization approach, the state variables
become design variables, and the governing equations be-
come constraints.

3.2 Choose an engineering system that you are familiar with and
describe each of the components illustrated in Fig. 3.1 for that
system. List all the options for the mathematical and numerical
models that you can think of and describe the assumptions for
each model. What type of solver is usually used for each model
(see Section 3.7), and what are the state variables for each model?

3.3 Consider the following mathematical model:

𝑢1² + 2𝑢2 = 1,
𝑢1 + cos(𝑢1 ) − 𝑢2 = 0,
𝑓 (𝑢1 , 𝑢2 ) = 𝑢1 + 𝑢2 .

Which equations are explicit and which ones are implicit? Write
these equations in residual form.

3.4 Reproduce a plot similar to the one shown in Fig. 3.7 for

𝑓 (𝑥) = cos(𝑥) + 1

in the neighborhood of 𝑥 = 𝜋.

3.5 Consider the residual equation,

𝑟(𝑢) = 𝑢³ − 6𝑢² + 12𝑢 − 8 = 0.

a) Find the solution using your own implementation of New-


ton’s method.
b) Tabulate the residual for each iteration number.
c) What is the lowest error you can achieve?
d) Plot the residual versus the iteration number using a linear
axis; how many digits can you discern in this plot?
e) Make the same plot using a log axis for the residual and
estimate the rate of convergence.
f) Exploration: Try different starting points. Can you find a
predictable trend?

3.6 Kepler’s equation, which we mentioned in Section 2.2, defines the


relationship between a planet’s polar coordinates and the time
elapsed from a given initial point and is

𝐸 − 𝑒 sin(𝐸) = 𝑀,

where 𝑀 is the mean anomaly (a parameterization of time), 𝐸 is


the eccentric anomaly (a parameterization of polar angle), and 𝑒
is the eccentricity of the elliptical orbit.

a) Use Newton’s method to find 𝐸 when 𝑒 = 0.7 and 𝑀 = 𝜋/2.


b) Devise a fixed-point iteration to solve the same problem.
c) Compare the number of iterations and rate of convergence.

d) Exploration: Plot 𝐸 versus 𝑀 in the interval [0, 2𝜋] for 𝑒 =


[0, 0.1, 0.5, 0.9] and interpret your results physically.

3.7 Consider the equation from the previous exercise where we


replace one of the coefficients with a parameter 𝑎 as follows

𝑟(𝑢) = 𝑎𝑢³ − 6𝑢² + 12𝑢 − 8 = 0.

a) Find the level of the numerical noise in the solution 𝑢 when


solving this equation using Newton’s method.
b) Produce a plot similar to Fig. 3.9 by perturbing 𝑎 in the
neighborhood of 𝑎 = 1 using a solver convergence tolerance
of |𝑟 | ≤ 0.1.
c) Exploration: Try smaller tolerances and see how much you
can decrease the numerical noise.

3.8 Reproduce the solution of Ex. 3.13 and then try different initial
guesses. Can you define a distinct region from where Newton’s
method converges?

3.9 Choose a problem that you are familiar with and find the magni-
tude of numerical noise in one or more outputs of interest with
respect to one or more inputs of interest. What means do you
have to decrease the numerical noise? What is the lowest possible
level of noise you can achieve?
4 Unconstrained Gradient-Based Optimization
In this chapter we focus our attention on unconstrained minimiza-
tion problems with continuous design variables (see Fig. 1.12). Such
optimization problems can be written as

minimize 𝑓(𝑥)  by varying  𝑥ᵢ, 𝑖 = 1, . . . , 𝑛ₓ,        (4.1)

where 𝑥 contains the design variables that the optimization algorithm
can change. We will solve these problems using gradient information
to determine a path from a starting guess (or baseline design) to the
optimum, which consists of a series of discrete steps (see Fig. 4.1).
We assume the objective function to be nonlinear, 𝐶² continuous,
and deterministic. We make no assumption about multimodality and
there is no guarantee that the algorithm finds the global optimum.
Referring to the attributes that classify an optimization problem
(Fig. 1.19), the optimization algorithms discussed in this chapter range
from first- to second-order, perform a local search, and evaluate the
function directly. Both the iteration strategy (which is deterministic)
and the optimality criteria are based on mathematical principles as
opposed to heuristics.

[Figure 4.1: Gradient-based optimization starts with a guess, 𝑥(0), and takes a sequence of steps in 𝑛-dimensional space that converge to an optimum, 𝑥∗.]

While most engineering design problems are constrained, the con-
strained optimization algorithms in Chapter 5 build on the techniques
explained in the current chapter.

By the end of this chapter you should be able to:

1. Understand the significance of gradients, Hessians, and


directional derivatives.

2. Understand the mathematical definition of optimality for


an unconstrained problem.

3. Describe, implement, and use line-search-based methods.

4. Understand the pros and cons of various search direction


methods.

5. Understand trust region approaches and how they contrast


with line search methods.

4.1 Fundamentals

To determine the directions of the steps shown in Fig. 4.1, gradient-


based methods need the gradient (first-order information). Some
methods also use the curvature (second-order information). Gradients
and curvature are required to build a second-order Taylor series, which is
a fundamental construct that is useful in establishing optimality and in
developing gradient-based optimization algorithms.

4.1.1 Derivatives and Gradients


Recall that we are considering a scalar objective function 𝑓 (𝑥), where 𝑥
is the vector of design variables, 𝑥 = [𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ]𝑇 . The gradient of
this function, ∇ 𝑓 (𝑥), is a column vector of first-order partial derivatives
of the function with respect to each design variable:
∇𝑓(𝑥) ≡ [𝜕𝑓/𝜕𝑥₁, 𝜕𝑓/𝜕𝑥₂, . . . , 𝜕𝑓/𝜕𝑥ₙ]ᵀ,        (4.2)

where each partial derivative is defined as the limit,


𝜕𝑓/𝜕𝑥ᵢ ≡ lim_{𝜖→0} [𝑓(𝑥₁, . . . , 𝑥ᵢ + 𝜖, . . . , 𝑥ₙ) − 𝑓(𝑥₁, . . . , 𝑥ₙ)] / 𝜖.        (4.3)
Each component in the gradient vector quantifies the local rate of change
of the function with respect to the corresponding design variable, as
shown in Fig. 4.2 for the two-dimensional case. In other words, these
components represent the slope of the function projected along each
coordinate direction. The gradient is a vector pointing in the direction
of greatest function increase from the current point.

[Figure 4.2: Components of the gradient vector in the 2D case.]

The gradient vectors are normal to contour surfaces of constant 𝑓 in


𝑛-dimensional space. In the two-dimensional case, gradient vectors are
perpendicular to the function contour lines, as shown in Fig. 4.2.

Example 4.1: Gradient of a polynomial function

Consider the function of two variables,

𝑓(𝑥₁, 𝑥₂) = 𝑥₁³ + 2𝑥₁𝑥₂² − 𝑥₂³ − 20𝑥₁        (4.4)

The gradient can be obtained using analytic differentiation:

∇𝑓(𝑥₁, 𝑥₂) = [3𝑥₁² + 2𝑥₂² − 20,  4𝑥₁𝑥₂ − 3𝑥₂²]ᵀ.        (4.5)

This defines the vector field shown in Fig. 4.3, where each vector points in the
direction of steepest local increase.

[Figure 4.3: Gradient vector field shows how gradients point towards maxima and away from minima.]
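As a quick numerical check of Eqs. 4.2 to 4.5, the short Python sketch below evaluates the analytic gradient of the polynomial in Ex. 4.1 and compares it against a forward finite-difference approximation of the definition in Eq. 4.3 (the function names are illustrative, not from any particular library):

    import numpy as np

    def f(x):
        # Polynomial from Ex. 4.1
        return x[0]**3 + 2 * x[0] * x[1]**2 - x[1]**3 - 20 * x[0]

    def grad_f(x):
        # Analytic gradient (Eq. 4.5)
        return np.array([3 * x[0]**2 + 2 * x[1]**2 - 20,
                         4 * x[0] * x[1] - 3 * x[1]**2])

    def fd_grad(func, x, eps=1e-6):
        # Forward-difference approximation of Eq. 4.3, one variable at a time
        g = np.zeros_like(x)
        for i in range(len(x)):
            xp = x.copy()
            xp[i] += eps
            g[i] = (func(xp) - func(x)) / eps
        return g

    x = np.array([1.0, 2.0])
    print(grad_f(x))        # analytic gradient
    print(fd_grad(f, x))    # should agree to about six digits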

If a function is explicit, we can use symbolic differentiation
as we did in Ex. 4.1. However, recall from Chapter 3 that functions
can be the result of the solution of implicit equations. Closed-form
expressions for the derivatives of such functions are not possible, but
there are established methods for computing them, which are the
subject of Chapter 6.
Physically, each gradient component has units that correspond to the
units of the function divided by the units of the corresponding variable.
Since the variables might represent different physical quantities, each
gradient component might have different units.
From the engineering design point of view, it might be useful to think
about relative changes, where the derivative is given as the percentage
change in the function for a 1% increase in the variable. This relative
derivative can be computed by non-dimensionalizing both the function
and the variable, yielding
(𝜕𝑓/𝜕𝑥) (𝑥/𝑓),        (4.6)
𝜕𝑥 𝑓
where 𝑓 and 𝑥 are the values of the function and variable at the point
where the derivative is computed.

Example 4.2: Interpretation of derivatives for wing design problem.

Consider the wing design problem from Ex. 1.1, where the objective
function 𝑓 is the required power. For the derivative of power with respect to
span (𝜕 𝑓 /𝜕𝑏), the units are Watts per meter (W/m). For example, for a wing
with 𝑐 = 1 m and 𝑏 = 12 m, we have 𝑓 = 1087.85 W and 𝜕 𝑓 /𝜕𝑏 = −41.65 W/m.
This means that, for an increase in span of 1 m, a linear approximation predicts a
decrease in power of 41.65 W. However, because the function is nonlinear, the
actual power at 𝑏 = 13 m is 1059.77 W (see Fig. 4.4). The relative derivative for

[Figure 4.4: Power versus span and the corresponding derivative.]

this same design can be computed as (𝜕 𝑓 /𝜕𝑏)(𝑏/ 𝑓 ) = −0.459, which means that
for a 1% increase in span, the linear approximation predicts a 0.459% decrease
in power; the actual decrease is 0.310%.

The gradient components quantify the rate of change of the function



in each coordinate direction (𝑥 𝑖 ), but sometimes we are interested in


the rate of change in a direction that is not a coordinate direction. This
corresponds to a directional derivative, which is defined as

∇ₚ𝑓(𝑥) ≡ lim_{𝜖→0} [𝑓(𝑥 + 𝜖𝑝) − 𝑓(𝑥)] / 𝜖.        (4.7)
We can find this derivative by projecting the gradient onto the desired
direction 𝑝 using the dot product

∇ₚ𝑓(𝑥) = ∇𝑓ᵀ𝑝.        (4.8)

When 𝑝 is a unit vector aligned with one of the Cartesian coordinates 𝑖,


this dot product yields the corresponding partial derivative 𝜕 𝑓 /𝜕𝑥 𝑖 . A
two-dimensional example of this projection is shown in Fig. 4.5.

∇𝑓 𝑇𝑝

𝑥2 Figure 4.5: Projection of the gradi-


∇𝑓
𝑥1 ent in an arbitrary unit direction
𝑝
∇𝑓 𝑇𝑝 𝑝.

From the gradient projection, we can see why the gradient is the
direction of steepest increase. If we use this definition of the dot
product,
∇ₚ𝑓(𝑥) = ∇𝑓ᵀ𝑝 = ‖∇𝑓‖ ‖𝑝‖ cos 𝜃,        (4.9)
we can see that this is maximized when 𝜃 = 0◦ . That is, the directional
derivative is largest when 𝑝 points in the same direction as ∇ 𝑓 . If
−90◦ < 𝜃 < 90◦ , the directional derivative is positive and is thus in
a direction of increase (Fig. 4.6). If 90◦ < 𝜃 < 270◦ , the directional
derivative is negative and 𝑝 points in a descent direction. Finally, if
𝜃 = ±90◦ , the directional derivative is 0 and thus the function value
does not change and is locally flat in that direction. That condition
occurs if ∇ 𝑓 and 𝑝 are orthogonal, and thus the gradient is always
orthogonal to contour surfaces.
[Figure 4.6: The gradient ∇𝑓 is always orthogonal to contour lines (surfaces), and the directional derivative in the direction of 𝑝 is given by ∇𝑓ᵀ𝑝.]

To get the correct slope in the original units of 𝑥, the direction
should be normalized as 𝑝̂ = 𝑝/‖𝑝‖. In the gradient-based optimization
algorithms of this chapter, it might not be necessary to normalize 𝑝. If 𝑝


is not normalized, the slopes and variable axis are scaled by a constant.

Example 4.3: Directional derivative of a quadratic function

Consider the function of two variables,

𝑓(𝑥₁, 𝑥₂) = 𝑥₁² + 2𝑥₂² − 𝑥₁𝑥₂,        (4.10)

The gradient can be obtained using analytic differentiation:

∇𝑓(𝑥₁, 𝑥₂) = [2𝑥₁ − 𝑥₂,  4𝑥₂ − 𝑥₁]ᵀ.        (4.11)

At point 𝑥 = [−1, 1], the gradient is

∇𝑓(−1, 1) = [−3, 5]ᵀ.        (4.12)
Taking the derivative in the direction 𝑝 = [2/√5, −1/√5]ᵀ, we obtain

∇𝑓ᵀ𝑝 = [−3, 5] [2/√5, −1/√5]ᵀ = −11/√5,        (4.13)
which we show in Fig. 4.7. We use a 𝑝 with unit length to get the slope of the
function in the original units. A projection of the function in the 𝑝 direction can
be obtained by plotting 𝑓 along the line defined by 𝑥 + 𝛼𝑝, where 𝛼 is the
independent variable, as shown in the middle plot of Fig. 4.7. The projected slope of the
function in that direction corresponds to the slope of this univariate function.
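The projection in Eq. 4.8 is easy to verify numerically. The sketch below uses the point and direction from Ex. 4.3 and computes the directional derivative both from the gradient dot product and from a small finite step along 𝑝 (illustrative code, assuming numpy):

    import numpy as np

    def f(x):
        # Quadratic from Ex. 4.3
        return x[0]**2 + 2 * x[1]**2 - x[0] * x[1]

    def grad_f(x):
        return np.array([2 * x[0] - x[1], 4 * x[1] - x[0]])

    x = np.array([-1.0, 1.0])
    p = np.array([2.0, -1.0]) / np.sqrt(5.0)   # unit direction

    # Directional derivative from the gradient projection (Eq. 4.8)
    d_exact = grad_f(x) @ p                    # -11/sqrt(5), about -4.919

    # Finite-difference check of the definition (Eq. 4.7)
    eps = 1e-6
    d_fd = (f(x + eps * p) - f(x)) / eps

    print(d_exact, d_fd)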

4.1.2 Curvature and Hessians


The rate of change of the gradient—the curvature—is also useful in-
formation because it tells us if a function slope is increasing (positive
curvature), decreasing (negative curvature), or stationary (zero curva-
ture).

[Figure 4.7: Derivative along the direction 𝑝.]

In one dimension, the gradient reduces to a scalar (the slope) and
the curvature is also a scalar that can be calculated by taking the second

derivative of the function. To quantify curvature in 𝑛 dimensions, we


need to take the partial derivative of each gradient component 𝑗 with
respect to each coordinate direction 𝑖, which can be written as

𝜕²𝑓/(𝜕𝑥ᵢ 𝜕𝑥ⱼ).        (4.14)

If the function 𝑓 has continuous second partial derivatives, the order of


differentiation does not matter and the mixed partial derivatives are
equal, and thus
𝜕²𝑓/(𝜕𝑥ᵢ 𝜕𝑥ⱼ) = 𝜕²𝑓/(𝜕𝑥ⱼ 𝜕𝑥ᵢ).        (4.15)
This property is known as the symmetry of second derivatives or
equality of mixed partials.∗

∗ The symmetry of second derivatives has a long history and there is a lot more to it.
Considering all gradient components and their derivatives with
respect to all coordinate directions results in a second-order tensor.
This tensor can be represented as a square 𝑛 × 𝑛 matrix of second-order
partial derivatives called the Hessian:

∇∇𝑓(𝑥) ≡ 𝐻(𝑥),  where  𝐻ᵢⱼ ≡ 𝜕²𝑓/(𝜕𝑥ᵢ 𝜕𝑥ⱼ)  for  𝑖, 𝑗 = 1, . . . , 𝑛.        (4.16)

Because of the symmetry of second derivatives property, the Hessian is


a symmetric matrix with 𝑛(𝑛 + 1)/2 independent elements.
Each row 𝑖 of the Hessian is a vector that quantifies the rate of change
of all components 𝑗 of the gradient vector with respect to the 𝑖 direction.
On the other hand, each column 𝑗 of the matrix quantifies the rate

of change of component 𝑗 of the gradient vector with respect to all


coordinate directions 𝑖. Because the Hessian is symmetric, the rows
and columns are transposes of each other and these two interpretations
are equivalent.
We can find the rate of change of the gradient in an arbitrary direction
𝑝 by taking the product 𝐻𝑝. This yields an 𝑛-vector that quantifies the
rate of change in the gradient in the direction 𝑝, where each component
of the vector is the rate of the change in the corresponding partial
derivative with respect to a movement along 𝑝, so we can write

𝐻𝑝 = ∇ₚ(∇𝑓(𝑥)) = lim_{𝜖→0} [∇𝑓(𝑥 + 𝜖𝑝) − ∇𝑓(𝑥)] / 𝜖.        (4.17)
Because of the symmetry of second derivatives, we can also interpret
this as the rate of change in the directional derivative of the function
along 𝑝 with respect to each of the components of 𝑝.
To find the curvature of the one-dimensional function along a
direction 𝑝, we need to project 𝐻𝑝 onto direction 𝑝 as follows,

𝐷ₚ²𝑓(𝑥) = 𝑝ᵀ𝐻𝑝,        (4.18)

which yields a scalar quantity. Again, if we want to get the curvature in


the original units of 𝑥, 𝑝 should be normalized.
For an 𝑛-dimensional Hessian, it is possible to find 𝑛 directions 𝑣
along which projected curvature aligns with that direction, that is,

𝐻𝑣 = 𝜅𝑣. (4.19)

This is an eigenvalue problem whose eigenvectors represent the principal


curvature directions and the eigenvalues 𝜅 quantify the corresponding
curvatures. If each eigenvector is normalized as 𝑣ˆ = 𝑣/||𝑣|| 2 , then the
corresponding 𝜅 is the unscaled curvature.

Example 4.4: Hessian and principal curvature directions of a quadratic

Consider the quadratic function of two variables,

𝑓(𝑥₁, 𝑥₂) = 𝑥₁² + 2𝑥₂² − 𝑥₁𝑥₂,        (4.20)

whose contours are shown in Fig. 4.8. These contours are concentric ellipses
that share the same principal axes. The Hessian of this quadratic is

𝐻 = [2, −1; −1, 4],        (4.21)

which is constant. To find the curvature in the direction 𝑝 = [−1/2, −√3/2]ᵀ,
we compute

𝑝ᵀ𝐻𝑝 = [−1/2, −√3/2] [2, −1; −1, 4] [−1/2, −√3/2]ᵀ = (7 − √3)/2.        (4.22)

The principal curvature directions can be computed by solving the eigenvalue


problem (4.19). This yields two eigenvalues and two corresponding eigenvectors,

𝜅(1) = 3 + √2,  𝑣(1) = [1 − √2, 1]ᵀ,  and  𝜅(2) = 3 − √2,  𝑣(2) = [1 + √2, 1]ᵀ.        (4.23)
By plotting the principal curvature directions superimposed on the function
contours (Fig. 4.8 on left), we can see that they are aligned with the major and
minor axes of the ellipses.
To see how the curvature varies as a function of the direction, we make a
polar plot of the curvature 𝑝 𝑇 𝐻𝑝, where 𝑝 is normalized (Fig. 4.8 on right). We
can see that the maximum curvature aligns with the first principal curvature
direction, as expected, and the minimum one corresponds to the second
principal curvature direction.

[Figure 4.8: Contours of 𝑓 for Ex. 4.4 and the two principal curvature directions in red. The polar plot shows the curvature, with the eigenvectors pointing at the directions of principal curvature, and all other directions have curvature values in between.]
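The curvature computations in Ex. 4.4 can be reproduced with a few lines of numpy (a sketch for illustration; numpy.linalg.eigh is used because the Hessian is symmetric):

    import numpy as np

    H = np.array([[2.0, -1.0],
                  [-1.0, 4.0]])                # Hessian from Ex. 4.4

    # Curvature along a chosen direction (Eq. 4.18)
    p = np.array([-0.5, -np.sqrt(3) / 2])      # unit vector
    print(p @ H @ p)                           # (7 - sqrt(3))/2, about 2.634

    # Principal curvatures and directions (Eq. 4.19)
    kappa, V = np.linalg.eigh(H)
    print(kappa)                               # 3 - sqrt(2) and 3 + sqrt(2)
    print(V)                                   # columns are the eigenvectors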

Example 4.5: Hessian of two-variable polynomial

Consider the same polynomial from Ex. 4.1. Differentiating the gradient
we obtained previously yields the Hessian

𝐻(𝑥₁, 𝑥₂) = [6𝑥₁, 4𝑥₂; 4𝑥₂, 4𝑥₁ − 6𝑥₂].        (4.24)
We can visualize the variation of the Hessian by plotting the principal curvatures
at different points (Fig. 4.9).

[Figure 4.9: Principal curvature direction and magnitude variation.]

4.1.3 Taylor Series

The Taylor series provides a local approximation to a function and is
the foundation for gradient-based optimization algorithms.
For an 𝑛-dimensional function, the Taylor series can predict the
function along any direction 𝑝. This is done by projecting the gradient

and Hessian onto the desired direction 𝑝, to get an approximation of


the function at any nearby point 𝑥 + 𝛼𝑝:†

𝑓(𝑥 + 𝑝) = 𝑓(𝑥) + ∇𝑓(𝑥)ᵀ𝑝 + (1/2) 𝑝ᵀ𝐻(𝑥)𝑝 + 𝒪(‖𝑝‖³).        (4.25)

† For a more extensive introduction to the Taylor series, see Appendix A.6.
We use a second-order Taylor series (ignoring the cubic term)
because it results in a quadratic, which is the lowest order Taylor series
that can have a minimum. For a function that is 𝐶 2 continuous, this
approximation can be made arbitrarily accurate by making 𝛼 small
enough.

Example 4.6: Taylor series expansion of two-variable function

Using the gradient and Hessian of the two-variable polynomial from Ex. 4.1
and Ex. 4.5, we can use Eq. 4.25 to construct a second-order Taylor expansion
about 𝑥(0),

𝑓̃(𝑝) = 𝑓(𝑥(0)) + [3𝑥₁² + 2𝑥₂² − 20, 4𝑥₁𝑥₂ − 3𝑥₂²] 𝑝 + (1/2) 𝑝ᵀ [6𝑥₁, 4𝑥₂; 4𝑥₂, 4𝑥₁ − 6𝑥₂] 𝑝.        (4.26)

Figure 4.10 shows the resulting Taylor series expansions about different points.

[Figure 4.10: The second-order Taylor series expansion uses the function value, gradient, and Hessian at a point to construct a quadratic model about that point.]
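To see how the quadratic model of Eq. 4.25 behaves, the following sketch (illustrative, using the polynomial from Ex. 4.1 and its Hessian from Ex. 4.5) compares the Taylor model with the true function for successively smaller steps:

    import numpy as np

    def f(x):
        return x[0]**3 + 2 * x[0] * x[1]**2 - x[1]**3 - 20 * x[0]

    def grad_f(x):
        return np.array([3 * x[0]**2 + 2 * x[1]**2 - 20,
                         4 * x[0] * x[1] - 3 * x[1]**2])

    def hess_f(x):
        return np.array([[6 * x[0], 4 * x[1]],
                         [4 * x[1], 4 * x[0] - 6 * x[1]]])

    def taylor2(x0, p):
        # Second-order Taylor model about x0, evaluated at x0 + p (Eq. 4.25)
        return f(x0) + grad_f(x0) @ p + 0.5 * p @ hess_f(x0) @ p

    x0 = np.array([1.0, -1.0])
    for step in [0.5, 0.1, 0.01]:
        p = step * np.array([1.0, 1.0])
        print(step, f(x0 + p), taylor2(x0, p))   # values agree as the step shrinks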

4.1.4 What is an Optimum?


To find the minimum of a function, we must determine the mathematical
conditions that identify a given point 𝑥 as a minimum. There is only a
limited set of problems for which we can prove global optimality, so in
general, we are only interested in local optimality.
The extreme value theorem states that a continuous function on a
closed interval has both a maximum and a minimum in that interval. A
point 𝑥 ∗ is a local minimum if 𝑓 (𝑥 ∗ ) ≤ 𝑓 (𝑥) for all 𝑥 in the neighborhood
of 𝑥 ∗ . A second-order Taylor-series expansion about 𝑥 ∗ for small steps
of size 𝑝 yields

𝑓(𝑥∗ + 𝑝) = 𝑓(𝑥∗) + ∇𝑓(𝑥∗)ᵀ𝑝 + (1/2) 𝑝ᵀ𝐻(𝑥∗)𝑝 + . . . .        (4.27)
For 𝑥 ∗ to be an optimal point, we must have 𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ) for all 𝑝.
This implies that the first- and second-order terms in the Taylor series
have to be non-negative, that is,

∇𝑓(𝑥∗)ᵀ𝑝 + (1/2) 𝑝ᵀ𝐻(𝑥∗)𝑝 ≥ 0.        (4.28)
Because the magnitude of 𝑝 is small, we can always find a 𝑝 such
that the first term dominates. Therefore, we require that

∇ 𝑓 (𝑥 ∗ )𝑇 𝑝 ≥ 0. (4.29)

Because 𝑝 can be in any arbitrary direction, the only way this inequality
can be satisfied is if all the elements of the gradient are zero,

∇ 𝑓 (𝑥 ∗ ) = 0. (4.30)

This is the first-order optimality condition. This condition is necessary because if
any element of ∇𝑓 is nonzero, there are directions (such as 𝑝 = −∇𝑓) for
which the inequality would not be satisfied.
Since the gradient term has to be zero, we must now satisfy the
remaining term in the inequality (4.28), that is,

𝑝 𝑇 𝐻(𝑥 ∗ )𝑝 ≥ 0. (4.31)

From Eq. 4.18, we know that this term represents the curvature in
direction 𝑝, so this means that the function curvature must be positive
or zero when projected in any direction. You may recognize this

inequality as the definition of a positive semidefinite matrix. In other


words, the Hessian 𝐻(𝑥 ∗ ) must be positive semidefinite.
For a matrix to be positive semidefinite, its eigenvalues must all
be greater than or equal to zero. Recall that the eigenvalues of the
Hessian quantify the principal curvatures, so as long as all the principal
curvatures are greater than or equal to zero, the curvature along an
arbitrary direction is also greater than or equal to zero.
These conditions on the gradient and curvature are necessary con-
ditions for a local minimum, but not sufficient. These conditions are
not sufficient because if the curvature is zero in some direction 𝑝 (i.e.,
𝑝 𝑇 𝐻(𝑥 ∗ )𝑝 = 0), we have no way of knowing if it is a minimum unless
we look at the third-order term. In that case, even if it is a minimum, it
is a weak minimum.
The sufficient conditions for optimality require that the curvature
is positive in any direction, in which case we have a strong minimum.
Mathematically, this means that 𝑝 𝑇 𝐻(𝑥 ∗ )𝑝 > 0 for all nonzero 𝑝, which
is the definition of a positive definite matrix. If 𝐻 is a positive definite
matrix, every eigenvalue of 𝐻 is positive and the determinant of every
leading principal sub-matrix of 𝐻 is positive.
Figure 4.11 shows some examples of quadratic functions that are
positive definite (all positive eigenvalues), positive semidefinite (non-
negative eigenvalues), indefinite (mixed eigenvalues), and negative
definite (all negative eigenvalues).

[Figure 4.11: Quadratic functions with different types of Hessians, from positive definite to negative definite: (a) positive definite, (b) positive semidefinite, (c) indefinite, (d) negative definite.]

In summary, the necessary optimality conditions for an unconstrained
optimization problem are
∇𝑓(𝑥∗) = 0  and  𝐻(𝑥∗) is positive semidefinite.        (4.32)

The sufficient optimality conditions are

∇𝑓(𝑥∗) = 0  and  𝐻(𝑥∗) is positive definite.        (4.33)

Example 4.7: Finding minima analytically.

Consider the function of two variables,

𝑓 = 0.5𝑥₁⁴ + 2𝑥₁³ + 1.5𝑥₁² + 𝑥₂² − 2𝑥₁𝑥₂.

We can find the minima of this function by solving the optimality
conditions analytically.
To find the critical points of this function, we solve for the points at which
the gradient is equal to zero,

∇𝑓 = [2𝑥₁³ + 6𝑥₁² + 3𝑥₁ − 2𝑥₂,  2𝑥₂ − 2𝑥₁]ᵀ = [0, 0]ᵀ.
From the second equation we have that 𝑥 2 = 𝑥1 . Substituting this into the first
equation yields

𝑥₁ (2𝑥₁² + 6𝑥₁ + 1) = 0.
The solutions of this equation yield three points:

𝑥(1) = [0, 0]ᵀ,   𝑥(2) = [−3/2 − √7/2, −3/2 − √7/2]ᵀ,   𝑥(3) = [−3/2 + √7/2, −3/2 + √7/2]ᵀ.

To classify these points, we need to compute the Hessian matrix. Differentiating


the gradient, we get

𝐻(𝑥₁, 𝑥₂) = [6𝑥₁² + 12𝑥₁ + 3, −2; −2, 2].

One easy way to determine if a 2 × 2 matrix is positive definite is by checking
that both the first diagonal element and the matrix determinant are positive.
Evaluating the determinant at the first point, we get

det(𝐻(𝑥(1))) = det([3, −2; −2, 2]) = 2 > 0.

Since the first diagonal element is positive as well, 𝐻 at this point is positive
definite, so 𝑥(1) is a local minimum. For the second point,

det(𝐻(𝑥(2))) = det([3(3 + √7), −2; −2, 2]) = 14 + 6√7 > 0.

Since 3(3 + √7) > 0 as well, 𝑥(2) is also a local minimum. For the third point,

det(𝐻(𝑥(3))) = det([9 − 3√7, −2; −2, 2]) = 14 − 6√7 < 0,

and since 9 − 3√7 > 0, this is a saddle point.

[Figure 4.12: Minima and critical points for a polynomial of two variables.]
These three critical points are shown in Fig. 4.12. To find out which of the
two local minima is the global one, we evaluate the function at each of these
points. Since 𝑓(𝑥(2)) < 𝑓(𝑥(1)), 𝑥(2) is the global minimum.
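The classification in Ex. 4.7 can also be checked numerically by evaluating the Hessian at each critical point and inspecting its eigenvalues. The sketch below hard-codes the critical points from the analytic solution above (illustrative code):

    import numpy as np

    def hess(x):
        # Hessian of f = 0.5*x1^4 + 2*x1^3 + 1.5*x1^2 + x2^2 - 2*x1*x2
        return np.array([[6 * x[0]**2 + 12 * x[0] + 3, -2.0],
                         [-2.0, 2.0]])

    r = np.sqrt(7.0)
    points = [np.array([0.0, 0.0]),
              np.array([(-3 - r) / 2, (-3 - r) / 2]),
              np.array([(-3 + r) / 2, (-3 + r) / 2])]

    for x in points:
        eig = np.linalg.eigvalsh(hess(x))
        if np.all(eig > 0):
            kind = "local minimum"
        elif np.all(eig < 0):
            kind = "local maximum"
        else:
            kind = "saddle point"
        print(x, eig, kind)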

While it is possible to solve for the optimality conditions analytically,


as we did in Ex. 4.7, this is not possible in general because the resulting
equations might not be solvable in closed form. Therefore, we need
numerical methods for solving for these conditions.
When using a numerical approach, we seek points where ∇ 𝑓 (𝑥 ∗ ) = 0,
but the entries in ∇ 𝑓 do not converge to exactly zero because of finite-
precision arithmetic. Instead, we define convergence for the first
criterion based on the maximum component of the gradient, such that,

‖∇𝑓‖∞ < 𝜏,        (4.34)

where 𝜏 is some tolerance. A typical absolute tolerance is 𝜏 = 10⁻⁶,
or a six-order-of-magnitude reduction in the gradient when using a relative
tolerance. The second condition (that 𝐻 must be positive semidefinite)
is not usually checked explicitly. If we satisfy the first condition then
all we know is that we have reached a stationary point, which could
be a maximum, a minimum, or a saddle point. However, as shown in
Section 4.4, our search directions are always descent directions, so
in practice we only converge to local minima.

4.2 Two Overall Approaches to Finding an Optimum

While the optimality conditions derived in the previous section can be


solved analytically to find the function minima, this analytic approach
is not possible for functions that are the results of numerical models.

Instead, we need iterative numerical methods that can find minima


based only on the function values and its gradients.
In Chapter 3, we reviewed methods for solving simultaneous sys-
tems of nonlinear equations, which we wrote as 𝑟(𝑢) = 0. Because
the first order optimality condition (∇ 𝑓 = 0) can be written in this
residual form (where 𝑟 ≡ ∇ 𝑓 and 𝑢 ≡ 𝑥), we could try to use the
solvers from Chapter 3 directly to solve unconstrained optimization
problems. While several components of general solvers for 𝑟(𝑢) = 0 are
used in optimization algorithms, these solvers are not the most effective
approaches in their original form. Furthermore, solving ∇ 𝑓 = 0 is not
necessarily sufficient—it finds a stationary point but not necessarily a
minimum. Optimization algorithms require additional considerations
to ensure convergence to a minimum.
Similarly to the iterative solvers from Chapter 3, gradient-based
algorithms start with a guess, 𝑥 (0) , and generate a series of points,
𝑥 (1) , 𝑥 (2) , . . . , 𝑥 (𝑘) , . . . that converge to a local optimum, 𝑥 ∗ , as previously
illustrated in Fig. 4.1. At each iteration, some form of the Taylor series
about the current point is used to find the next point.
A truncated Taylor series is in general only a good model within a
small neighborhood, as shown in Fig. 4.13, which shows three quadratic
models of the same function based on three different points. All
quadratic approximations match the local gradient and curvature at
the respective points. However, the Taylor series quadratic about the
first point (left plot) yields a quadratic without a minimum (the only
critical point is a saddle point). The second point (middle plot) yields
a quadratic whose minimum is closer to the true minimum. Finally,
the Taylor series about the actual minimum point (right plot) yields a
quadratic with the same minimum, as would be expected, but we can
see how the quadratic model worsens the further we are from the point.

[Figure 4.13: Taylor series quadratic models are only guaranteed to be accurate near the point about which the series is expanded. When the point is far from the optimum, the quadratic model might result in a function without a minimum (left).]
Because the Taylor series is only guaranteed to be a good model
locally, we need a globalization strategy to ensure convergence to an
optimum. Globalization here means to make the algorithm robust
enough that it is able to converge to a local minimum starting from any
point in the domain. This should not be confused with trying to find
the global minimum, which is a separate issue (see Tip 4.24). There are
two main globalization strategies: line search and trust region.
The line search approach consists of three main steps for every
iteration (Fig. 4.14):

1. Choose a suitable search direction from the current point. The
choice of search direction is based on a Taylor series approximation.

2. Determine how far to move in that direction by performing a line
search.

3. Move to the new point and update all values.

The first two steps can be seen as two separate subproblems. We address
the line search subproblem in Section 4.3 and the search direction
subproblem in Section 4.4.

[Figure 4.14: Line search approach (flowchart): starting from 𝑥(0), determine a search direction, perform a line search, update 𝑥, and repeat until 𝑥 is a minimum, 𝑥∗.]

Trust-region methods also consist of three steps (Fig. 4.15):

1. Create a model about the current point. This model can be based
on a Taylor series approximation or another type of surrogate model.

2. Minimize the model within a trust region around the current point
to find the step.

3. Move to the new point, update values, and adapt the size of the
trust region.

We introduce the trust-region approach in Section 4.5, but we devote
more attention to algorithms that use the line search approach because
they are more common.

[Figure 4.15: Trust region approach (flowchart): starting from 𝑥(0), create a model, minimize the model within the trust region, update 𝑥 and the trust-region size Δ, and repeat until 𝑥 is a minimum, 𝑥∗.]

Both the line search and trust region approaches use iterative
processes that must be repeated until some convergence criterion is
satisfied. The first step in both approaches is usually referred to as a
major iteration, while the second step might require more function
evaluations corresponding to minor iterations.

4.3 Line Search

All gradient-based unconstrained optimization algorithms that use a


line search follow the procedure outlined in Alg. 4.8. We start with
a guess 𝑥 (0) and provide a convergence tolerance 𝜏 for the optimality
condition. The final output is an optimal point 𝑥 ∗ and the corresponding
function value 𝑓 (𝑥 ∗ ). As mentioned in the previous section, there are
two main subproblems in line-search gradient-based optimization

algorithms: choosing the search direction and determining how far to


step in that direction. In the next section, we introduce several methods
for choosing the search direction. The line search method determines
how far to step in the chosen direction and is usually independent of
the method for choosing the search direction. Therefore, different line
search methods can be combined with different methods for finding
the search direction. However, the search direction method determines
the name of the overall optimization algorithm, as we will see in the
next section.

Algorithm 4.8: Gradient-based unconstrained optimization using a line


search
Inputs:
𝑥 (0) : Starting point
𝜏: Convergence tolerance
func: Function that takes 𝑥 as input and returns 𝑓 (𝑥) and optionally ∇ 𝑓 (𝑥)
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

while ‖∇𝑓‖∞ > 𝜏 do        (optimality condition)
    Determine search direction, 𝑝(𝑘)        (use any of the methods from Section 4.4)
    Determine step length, 𝛼(𝑘)        (use a line search algorithm)
    𝑥(𝑘+1) = 𝑥(𝑘) + 𝛼(𝑘) 𝑝(𝑘)        (update design variables)
    𝑘 = 𝑘 + 1        (increment iteration index)
end while
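A minimal Python sketch of Alg. 4.8 is shown below. It is illustrative only: the function and helper names are hypothetical, steepest descent is used for the search direction, and a simple sufficient-decrease backtracking stands in for the line search methods of Section 4.3.

    import numpy as np

    def line_search_optimize(func, x0, tau=1e-6, max_iter=500):
        # Skeleton of Alg. 4.8: alternate search direction and line search
        # until the infinity norm of the gradient drops below tau.
        x = np.asarray(x0, dtype=float)
        f, g = func(x)
        k = 0
        while np.max(np.abs(g)) > tau and k < max_iter:
            p = -g                                     # steepest descent direction (Section 4.4.1)
            alpha = backtracking(func, x, f, g, p)     # line search (Section 4.3.1)
            x = x + alpha * p                          # update design variables
            f, g = func(x)
            k += 1
        return x, f

    def backtracking(func, x, f0, g0, p, alpha=1.0, mu1=1e-4, rho=0.5):
        # Simple sufficient-decrease backtracking (see Alg. 4.9)
        while func(x + alpha * p)[0] > f0 + mu1 * alpha * (g0 @ p):
            alpha *= rho
        return alpha

    # Example: a well-scaled quadratic with minimum at (1, -2)
    def quad(x):
        f = (x[0] - 1)**2 + 2 * (x[1] + 2)**2
        g = np.array([2 * (x[0] - 1), 4 * (x[1] + 2)])
        return f, g

    print(line_search_optimize(quad, [5.0, 5.0]))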

[Figure 4.16: The line search starts from a given point 𝑥(𝑘) and searches solely along direction 𝑝(𝑘).]

For the line search subproblem, we assume that we are given a
starting point 𝑥(𝑘) and a suitable search direction 𝑝(𝑘) along which we are
going to search. The line search then operates solely on points along
direction 𝑝(𝑘) starting from 𝑥(𝑘), which can be written as

𝑥(𝑘+1) = 𝑥(𝑘) + 𝛼𝑝(𝑘),        (4.35)

where the scalar 𝛼 is always positive and represents how far we go in


the direction 𝑝(𝑘). This equation produces a one-dimensional slice of
𝑛-dimensional space, as illustrated in Fig. 4.17.

[Figure 4.17: The line search projects the 𝑛-dimensional problem onto one dimension, where the independent variable is 𝛼.]

The line search determines the magnitude of the scalar 𝛼(𝑘), which in
turn determines the next point in the iteration sequence. Even though
𝑥(𝑘) and 𝑝(𝑘) are 𝑛-dimensional, the line search is a one-dimensional
problem with the goal of selecting 𝛼(𝑘).

[Figure 4.18: The line search direction must be a descent direction.]

Line search methods require that the search direction 𝑝(𝑘) be a
descent direction, so that ∇𝑓(𝑘)ᵀ𝑝(𝑘) < 0 (see Fig. 4.6). This guarantees

that 𝑓 can be reduced by stepping some distance along this direction


with a positive 𝛼.
The goal of the line search is not to find the value of 𝛼(𝑘) that

minimizes 𝑓 𝑥 (𝑘) + 𝛼(𝑘) 𝑝 (𝑘) , but to find a point that is “good enough”
using as few function evaluations as possible. This is because finding
the exact minimum along the line would require too many evaluations
of the objective function and possibly its gradient. Because the overall
optimization needs to find a point in 𝑛-dimensional space, the search
direction might change drastically between line searches, so spending
too many iterations on each line search is generally not worthwhile.
Consider the function shown in Fig. 4.19. At point 𝑥(𝑘), the direction
𝑝(𝑘) is a descent direction. However, it would be wasteful to spend a lot
of effort determining the exact minimum in the 𝑝(𝑘) direction because it
would not take us any closer to the minimum of the overall function
(the dot on the right side of the plot). Instead, we should find a point
that is good enough and then update the search direction.

[Figure 4.19: The descent direction does not necessarily point towards the minimum, in which case it would be wasteful to do an exact line search.]

To simplify the notation for the line search, we define the univariate
function

𝜙(𝛼) = 𝑓(𝑥(𝑘) + 𝛼𝑝(𝑘)),        (4.36)

where 𝛼 = 0 corresponds to the start of the line search and thus
𝜙(0) = 𝑓(𝑥(𝑘)). Then, using 𝑥 = 𝑥(𝑘) + 𝛼𝑝(𝑘), the slope of the univariate
function is

𝜙′(𝛼) = 𝜕[𝑓(𝑥)]/𝜕𝛼 = (𝜕[𝑓(𝑥)]/𝜕𝑥)(𝜕𝑥/𝜕𝛼) = ∇𝑓(𝑥)ᵀ𝑝(𝑘) = ∇𝑓(𝑥(𝑘) + 𝛼𝑝(𝑘))ᵀ𝑝(𝑘),        (4.37)

which is the directional derivative along the search direction. The slope
at the start of a given line search is

𝜙′(0) = ∇𝑓(𝑘)ᵀ𝑝(𝑘).        (4.38)

Because 𝑝(𝑘) must be a descent direction, 𝜙′(0) is always negative.
Fig. 4.20 is a version of the one-dimensional slice from Fig. 4.17 in
this notation. The 𝛼 axis and the slopes scale with the magnitude of 𝑝(𝑘).

[Figure 4.20: For the line search, we denote the function as 𝜙(𝛼) with the same value as 𝑓. The slope 𝜙′(𝛼) is the gradient of 𝑓 projected onto the search direction.]

4.3.1 Sufficient Decrease and Backtracking

The simplest line search algorithm to find a “good enough” point relies
on the sufficient decrease condition in combination with a backtracking
algorithm. The sufficient decrease condition, also known as the Armijo
condition, is given by the inequality

𝜙(𝛼) ≤ 𝜙(0) + 𝜇1 𝛼 𝜙′(0)        (4.39)

for a constant 0 < 𝜇1 ≤ 1.‡ The quantity 𝛼𝜙′(0) represents the expected
decrease of the function, assuming the function continued at the same
slope. The multiplier 𝜇1 states that we will be satisfied as long as we
achieve even a small fraction of the expected decrease. In practice, this
constant is several orders of magnitude smaller than one, typically 𝜇1 = 10⁻⁴.
Because 𝑝(𝑘) is a descent direction, and thus 𝜙′(0) = ∇𝑓(𝑘)ᵀ𝑝(𝑘) < 0,
there is always a positive 𝛼 that satisfies this condition for a smooth
function.

‡ This condition can be problematic near a local minimum because 𝜙(0) and 𝜙(𝛼) are very similar and so their subtraction is inaccurate. Hager et al.⁵⁷ introduced an approximate Wolfe condition with improved accuracy along with an efficient line search based on a secant method. [57. Hager et al., A New Conjugate Gradient Method with Guaranteed Descent and an Efficient Line Search, 2005.]

The concept is illustrated in Fig. 4.21, which shows a function with
negative slope at 𝛼 = 0 and a sufficient decrease line whose slope is a
fraction of that initial slope. When starting a line search, all we know is
the function value and its slope at 𝛼 = 0, so we do not really know how
the function varies until we evaluate it. Because we do not want to do
too many function evaluations, the first point whose value is below the
sufficient decrease line is deemed acceptable. The sufficient decrease
line slope in Fig. 4.21 is exaggerated for illustration purposes; for typical
values of 𝜇1, the line is indistinguishable from a horizontal line when
plotted.

[Figure 4.21: Sufficient decrease conditions, showing the sufficient decrease line and the acceptable ranges of 𝛼.]

Line search algorithms require a first guess for 𝛼. As we will see


later, some methods for finding the search direction also provide good

guesses for the step length. However, in many cases we have no idea of
the scale of the function, so our initial guess may not be suitable. Even if
we do have an educated guess for 𝛼, it is only a guess and the first step
might not satisfy the sufficient decrease condition.
One simple algorithm that is guaranteed to find a step that satisfies
the sufficient decrease condition is backtracking (Alg. 4.9). This algo-
rithm starts with a maximum step and successively reduces the step
by a constant ratio 𝜌 until it satisfies the sufficient decrease condition
(a typical value is 𝜌 = 0.5). Because our search direction is a descent
direction, we know that if we backtrack enough we will achieve an
acceptable decrease in function value.

Algorithm 4.9: Backtracking line search algorithm

Inputs:
𝛼init > 0: Initial step length
0 < 𝜇1 < 1: Sufficient decrease factor
0 < 𝜌 < 1: Backtracking factor
Outputs:
𝛼∗ : Step size satisfying sufficient decrease condition

𝛼 = 𝛼init
while 𝜙(𝛼) > 𝜙(0) + 𝜇1 𝛼 𝜙′(0) do        (is the function value above the sufficient decrease line?)
𝛼 = 𝜌𝛼 Backtrack
end while

Example 4.10: Backtracking line search

Consider the function

𝑓(𝑥₁, 𝑥₂) = 0.1𝑥₁⁶ − 1.5𝑥₁⁴ + 5𝑥₁² + 0.1𝑥₂⁴ + 3𝑥₂² − 9𝑥₂ + 0.5𝑥₁𝑥₂.

Suppose we do a line search starting from 𝑥 = (−1.25, 1.25) in the direction
𝑝 = [4, 0.75], as shown in Fig. 4.22. Applying the backtracking algorithm with
𝜇1 = 10⁻⁴ and 𝜌 = 0.7 produces the iterations shown in Fig. 4.23. The sufficient
decrease line appears to be horizontal, but it just has a small slope because
𝜇1 is small. Using a large initial step of 𝛼init = 1.2 (left), several iterations are
required. For a small initial step of 𝛼init = 0.05 (right), the algorithm satisfies
sufficient decrease at the first iteration but misses out on further decreases.

[Figure 4.22: Line search direction.]
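The backtracking loop of Alg. 4.9 takes only a few lines of code. The sketch below is illustrative and uses the function, starting point, and parameters of Ex. 4.10:

    import numpy as np

    def f(x):
        x1, x2 = x
        return (0.1 * x1**6 - 1.5 * x1**4 + 5 * x1**2
                + 0.1 * x2**4 + 3 * x2**2 - 9 * x2 + 0.5 * x1 * x2)

    def grad_f(x):
        x1, x2 = x
        return np.array([0.6 * x1**5 - 6 * x1**3 + 10 * x1 + 0.5 * x2,
                         0.4 * x2**3 + 6 * x2 - 9 + 0.5 * x1])

    def backtracking(x, p, alpha_init, mu1=1e-4, rho=0.7):
        # Alg. 4.9: shrink alpha until the sufficient decrease condition (Eq. 4.39) holds
        phi0 = f(x)
        dphi0 = grad_f(x) @ p        # phi'(0), must be negative
        alpha = alpha_init
        while f(x + alpha * p) > phi0 + mu1 * alpha * dphi0:
            alpha *= rho             # backtrack
        return alpha

    x = np.array([-1.25, 1.25])
    p = np.array([4.0, 0.75])
    print(backtracking(x, p, alpha_init=1.2))
    print(backtracking(x, p, alpha_init=0.05))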

Although backtracking is guaranteed to find a point that satisfies


sufficient decrease, there are two undesirable scenarios where this
algorithm performs poorly.

[Figure 4.23: Backtracking using different initial steps.]

The first scenario is that the guess for the

initial step is far too large, and the step sizes that satisfy sufficient de-
crease are smaller than the starting step by several orders of magnitude.
Depending on the value of 𝜌, this scenario requires a large number of
backtracking evaluations.
The other undesirable scenario is where our initial guess immedi-
ately satisfies sufficient decrease, but the slope of the function at this
point is still highly negative and we could have decreased the function
value by much more if we had taken a larger step. In this case, our
guess for the initial step is far too small.
Even if our original step size is not too far from an acceptable step
size, the basic backtracking algorithm ignores any information we have
about the function values and its gradients, and blindly takes a reduced
step based on a preselected ratio 𝜌. We can make more intelligent
estimates of where an acceptable step is based on the evaluated function
values (and gradients, if available). In the next section, we introduce a
more sophisticated line search algorithm that is able to deal with these
scenarios much more efficiently.

4.3.2 A Better Line Search


One major weakness of the sufficient decrease condition is that it accepts
small steps that marginally decrease the objective function, because 𝜇1
in Eq. 4.39 is rather small. We could just increase 𝜇1 (that is, tilt the red
line downward in Fig. 4.21) to prevent these small steps; however, that
would prevent us from taking large steps that result in a reasonable
decrease. A large step that provides a reasonable decrease is desirable,
because that progress generally leads to faster convergence. Instead, we
want to prevent overly small steps while not making it more difficult
to accept decent large steps. This is accomplished by adding a second
condition and using it to construct a more efficient line search algorithm.
Just like guessing the step size, it is difficult to know in advance how
much of a function value decrease to expect. However, if we compare

the slope of the function at the candidate point with the slope at the
start of the line search, we can get an idea if the function is “bottoming
out,” or flattening, using the curvature condition:

|𝜙′(𝛼)| ≤ 𝜇2 |𝜙′(0)|.        (4.40)

This condition requires that the magnitude of the slope at the new point
be lower than the magnitude of the slope at the start of the line search
by a factor of 𝜇2 . This requirement is called the curvature condition
because by comparing the two slopes, we are effectively quantifying
the curvature of the function. Typical values of 𝜇2 range from 0.1 to 0.9,
and the best value depends on the method for determining the search
direction and is also problem dependent. To guarantee that there are
steps that satisfy both sufficient decrease and sufficient curvature, the
sufficient decrease slope must be shallower than the sufficient curvature
slope, that is, 0 < 𝜇1 ≤ 𝜇2 ≤ 1. As 𝜇2 tends to zero, enforcing the
sufficient curvature condition tends toward an exact line search.
The sign of the slope at a point satisfying this condition is not
important; all that matters is that the function be shallow enough. The
idea is that if the slope 𝜙0(𝛼) is still negative with a magnitude similar
to the slope at the start of the line search, then the step is too small, and
we expect the function to decrease even further by taking a larger step.
If the slope 𝜙0(𝛼) is positive with a magnitude similar to that at the start
of the line search, then the step is too large, and we expect to decrease
the function further by taking a smaller step. On the other hand, when
the slope is shallow enough (either positive or negative), we assume
that the candidate point is near a local minimum and additional effort
will yield only incremental benefits that are wasteful in the context of
our larger problem. The sufficient decrease and curvature conditions
are collectively known as the strong Wolfe conditions. Figure 4.24 shows
acceptable intervals that satisfy the strong Wolfe conditions.

[Figure 4.24: Steps that satisfy the strong Wolfe conditions.]
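Written as code, checking whether a candidate step satisfies the strong Wolfe conditions is straightforward. In the sketch below, phi and dphi stand for the univariate function and slope of Eqs. 4.36 and 4.37 (illustrative helper, tested here on a simple quadratic):

    def strong_wolfe(phi, dphi, alpha, mu1=1e-4, mu2=0.9):
        # True if alpha satisfies both strong Wolfe conditions
        sufficient_decrease = phi(alpha) <= phi(0.0) + mu1 * alpha * dphi(0.0)   # Eq. 4.39
        sufficient_curvature = abs(dphi(alpha)) <= mu2 * abs(dphi(0.0))          # Eq. 4.40
        return sufficient_decrease and sufficient_curvature

    # Example with phi(alpha) = (alpha - 2)^2, whose 1D minimum is at alpha = 2
    phi = lambda a: (a - 2.0) ** 2
    dphi = lambda a: 2.0 * (a - 2.0)
    print(strong_wolfe(phi, dphi, 1.9))   # True: slope is shallow near the minimum
    print(strong_wolfe(phi, dphi, 0.1))   # False: slope is still steep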

We now develop a more efficient line search algorithm that finds



a step satisfying the strong Wolfe conditions. Note that using the
curvature condition means we require derivative information (𝜙0).
There are various line search algorithms in the literature, including
some that are derivative-free. Here, we detail a line search algorithm
similar to that presented by Nocedal and Wright.⁵⁸ [58. Nocedal et al., Numerical Optimization, 2006.] The algorithm has
two stages:

1. The bracketing stage finds an interval within which we are certain


to find an acceptable step.

2. The pinpointing stage finds a point that satisfies the strong Wolfe
conditions within the interval provided by the bracketing stage.

The bracketing stage is detailed in Alg. 4.11 and illustrated in


Fig. 4.25; it consists of increasing the step size until finding a step
that satisfies the strong Wolfe conditions, or until finding an interval
that must contain a point satisfying those conditions. For a smooth
continuous function, the conditions are guaranteed to be met by a point
in a given interval if:

1. The function value at the candidate step is higher than at the start
of the line search.

2. The step satisfies sufficient decrease, but the slope is positive.

If the step satisfies sufficient decrease and the slope is negative, the step
size is increased to look for a larger function value reduction along the
line.

Algorithm 4.11: Bracketing stage for the line search algorithm

Inputs:
𝛼1 > 0: Initial step size guess
0 < 𝜇1 < 1: Sufficient decrease factor
0 < 𝜇2 < 1: Sufficient curvature factor
𝜌 > 1: Step size increase factor
Outputs:
𝛼∗ : Step size satisfying strong Wolfe conditions

𝛼0 = 0
𝑖=1
while true do
Evaluate 𝜙(𝛼 𝑖 )
if [𝜙(𝛼 𝑖 ) > 𝜙(0) + 𝜇1 𝛼 𝑖 𝜙0 (0)] or [𝜙(𝛼 𝑖 ) > 𝜙(𝛼 𝑖−1 ) and 𝑖 > 1] then
𝛼∗ = pinpoint(𝛼 𝑖−1 , 𝛼 𝑖 ) return 𝛼∗
end if

Evaluate 𝜙0 (𝛼 𝑖 )
if |𝜙0 (𝛼 𝑖 )| ≤ −𝜇2 𝜙0 (0) then return 𝛼∗ = 𝛼 𝑖
else if 𝜙0 (𝛼 𝑖 ) ≥ 0 then
𝛼∗ = pinpoint(𝛼 𝑖 , 𝛼 𝑖−1 ) return 𝛼 ∗
else
𝛼 𝑖+1 = 𝜌𝛼 𝑖 𝜌 > 1, e.g. 2
end if
𝑖 = 𝑖+1
end while

[Figure 4.25: Visual representation of the bracketing algorithm: either the minimum is bracketed between 𝛼ᵢ₋₁ and 𝛼ᵢ and pinpoint is called, the conditions are met and the line search is done, or the step is increased to 𝛼ᵢ₊₁ and the iteration continues.]

The algorithm for the second stage, the pinpoint(𝛼 low , 𝛼 high ) func-
tion, is given in Alg. 4.12. In the first step, we need to estimate a good
candidate point within the interval that is expected to satisfy the strong
Wolfe conditions. A number of algorithms can be used to find such a
point. Since we have the function value and derivative at one endpoint
of the interval, and at least the function value at the other endpoint, one

option is to perform quadratic interpolation to estimate the minimum


within the interval. If the two end points are 𝛼 1 and 𝛼 2 , respectively,
the minimum can be found analytically from the function values and
derivative as

𝛼_min = [2𝛼₁ (𝜙(𝛼₂) − 𝜙(𝛼₁)) + 𝜙′(𝛼₁)(𝛼₁² − 𝛼₂²)] / {2 [𝜙(𝛼₂) − 𝜙(𝛼₁) + 𝜙′(𝛼₁)(𝛼₁ − 𝛼₂)]}.        (4.41)

If we provide analytic gradients, or we already evaluated 𝜙0(𝛼 𝑖 ) (either


as part of Alg. 4.11 or as part of checking the strong Wolfe conditions
in Alg. 4.12), then we would have the function values and derivatives
at both points and we could use cubic interpolation instead.
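As an illustration of Eq. 4.41, a possible helper for the quadratic-interpolation step (hypothetical code, not the book's implementation) is:

    def quadratic_interp(a1, a2, phi1, phi2, dphi1):
        # Minimizer of the quadratic matching phi(a1), phi'(a1), and phi(a2) (Eq. 4.41)
        numer = 2 * a1 * (phi2 - phi1) + dphi1 * (a1**2 - a2**2)
        denom = 2 * (phi2 - phi1 + dphi1 * (a1 - a2))
        return numer / denom

    # Check on phi(alpha) = (alpha - 3)^2, whose minimum is at alpha = 3
    phi = lambda a: (a - 3) ** 2
    dphi = lambda a: 2 * (a - 3)
    print(quadratic_interp(0.0, 10.0, phi(0.0), phi(10.0), dphi(0.0)))   # prints 3.0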
A graphical representation of the process is shown in Fig. 4.26. In the
leftmost figure we construct a quadratic fit based on the function value
and slope at 𝛼 low and the function value at 𝛼 high . This fit provides us
with a good estimate of where the minimum might be. Four scenarios
are possible for this new trial point. In the first three the function value
is too high, the slope is too positive, or the slope is too negative. In those
scenarios we update our bracket and restart. In the fourth scenario the
function value decreases sufficiently, and the slope is sufficiently small
in magnitude to satisfy the strong Wolfe conditions.

Algorithm 4.12: Pinpoint function for the line search algorithm

Inputs:
𝛼low : Lower limit for pinpoint function
𝛼high : Upper limit for pinpoint function
0 < 𝜇1 < 1: Sufficient decrease factor
0 < 𝜇2 < 1: Sufficient curvature factor
Outputs:
𝛼∗ : Step size satisfying strong Wolfe conditions

𝑗=0
while true do
Find 𝛼low ≤ 𝛼 𝑗 ≤ 𝛼high Using quadratic (4.41) or cubic interpolation
Evaluate 𝜙(𝛼 𝑗 )
if 𝜙(𝛼 𝑗 ) > 𝜙(0) + 𝜇1 𝛼 𝑗 𝜙0 (0) or 𝜙(𝛼 𝑗 ) > 𝜙(𝛼 low ) then
𝛼high = 𝛼 𝑗
else
Evaluate 𝜙0 (𝛼 𝑗 )
if |𝜙0 (𝛼 𝑗 )| ≤ −𝜇2 𝜙0 (0) then
𝛼∗ = 𝛼 𝑗 return 𝛼∗
else if 𝜙0 (𝛼 𝑗 )(𝛼high − 𝛼 low ) ≥ 0 then
𝛼high = 𝛼low
end if

𝛼low = 𝛼 𝑗
end if
𝑗 = 𝑗+1
end while

[Figure 4.26: Visual representation of the pinpointing algorithm.]

The line search defined by Alg. 4.11 followed by Alg. 4.12 is guar-
anteed to find a step length satisfying the strong Wolfe conditions
for any parameters 𝜇1 and 𝜇2 . A robust algorithm needs to consider
additional issues. One of these criteria is to ensure that the new point
in the pinpoint algorithm is not so close to an endpoint as to cause the
interpolation to be ill conditioned. A fall-back option in case the inter-
polation fails could be a simpler algorithm, such as bisection. Another
of these criteria is to ensure that the loop does not continue indefinitely
because finite-precision arithmetic leads to indistinguishable function

value changes.

Example 4.13: Line search with bracketing and pinpointing.

[Figure 4.27: Example of a line search iteration.]

Let us perform the same line search as in Ex. 4.10, but now using bracketing
and pinpointing instead of backtracking. Using a large initial step of 𝛼init = 1.2
(left), bracketing is achieved in the first iteration. Then pinpointing finds a
point better than the one found using backtracking. The small initial step
of 𝛼init = 0.05 (right) does not satisfy the strong Wolfe conditions and the
bracketing stage moves forward as long as the function keeps decreasing. The
end result is a point that is much better than the one obtained with backtracking.

4.4 Search Direction

As stated in the beginning of this chapter, each iteration of an uncon-


strained gradient-based algorithm consists of two main steps: determin-
ing the search direction, and performing the line search (Alg. 4.8). The
method used to find the search direction, 𝑝 (𝑘) , in this iteration is what
names the particular algorithm, which can use any of the line search
algorithms described in the previous section. We start by introducing
two first-order methods that only require the gradient and then explain
two second-order methods that require the Hessian, or at least an
approximation of the Hessian.

4.4.1 Steepest Descent


The steepest descent method (often called gradient descent) is a simple
and intuitive method for determining the search direction. As discussed
in Section 4.1.1, the gradient points in the direction of steepest increase,
so −∇ 𝑓 points in the direction of steepest descent. Thus our search

direction at iteration 𝑘 is simply

𝑝 = −∇ 𝑓 , (4.42)

or as a normalized direction

𝑝(𝑘) = −∇𝑓(𝑘) / ‖∇𝑓(𝑘)‖.        (4.43)

While steepest descent sounds like the best possible search direction
to decrease a function, it actually is not. The reason is that when a
function curvature varies greatly with direction, the gradient alone is a
poor representation of function behavior beyond a small neighborhood.

Example 4.14: Steepest descent with varying amount of curvature

Consider the quadratic function,

𝑓(𝑥₁, 𝑥₂) = 𝑥₁² + 𝛽𝑥₂²,

where 𝛽 can be set to adjust the curvature in the 𝑥₂ direction. In Fig. 4.28, we
show this function for 𝛽 = 1, 5, 15. The starting point is 𝑥(0) = (10, 1). When
𝛽 = 1 (left), this quadratic has the same curvature in all directions and the
steepest descent direction points directly to the minimum. When 𝛽 > 1 (middle
and right), this is no longer the case and steepest descent shows abrupt changes
in the subsequent search directions. This zigzagging is an inefficient way to
approach the minimum. The higher the difference in curvature, the more
iterations it takes.

[Figure 4.28: Iteration history for a quadratic function using the steepest descent method with an exact line search; the three cases take 1, 32, and 111 iterations.]

The behavior shown in Ex. 4.14 is expected and we can show it


mathematically. Assuming we perform an exact line search at each
iteration, this means selecting the optimal value for 𝛼 along the line

search:
𝜕𝑓(𝑥(𝑘) + 𝛼𝑝(𝑘))/𝜕𝛼 = 0
𝜕𝑓(𝑥(𝑘+1))/𝜕𝛼 = 0
[𝜕𝑓(𝑥(𝑘+1))/𝜕𝑥(𝑘+1)] [𝜕(𝑥(𝑘) + 𝛼𝑝(𝑘))/𝜕𝛼] = 0        (4.44)
∇𝑓(𝑘+1)ᵀ𝑝(𝑘) = 0
−𝑝(𝑘+1)ᵀ𝑝(𝑘) = 0
Hence each search direction is orthogonal to the previous one. As
discussed in the last section, exact line searches are not desirable, so
the search directions are not precisely orthogonal. However, the overall
zigzagging behavior still exists.
Another issue with steepest descent is that the gradient at the current
point on its own does not provide enough information to inform a good
guess of the initial step size. As we saw in the line search, this initial
choice has a large impact on the efficiency of the line search because
the first guess could be orders of magnitude too small or too large.
Second-order methods later in this section will help with this problem.
In the meantime we can make a guess of the step size for a given line
search based on the result of the previous one. If we assume that at the
current line search we will obtain a decrease in objective function that
is comparable to the previous one, we can write
𝛼(𝑘) ∇𝑓(𝑘)ᵀ𝑝(𝑘) ≈ 𝛼(𝑘−1) ∇𝑓(𝑘−1)ᵀ𝑝(𝑘−1).        (4.45)

Solving for the step length, and inserting the steepest descent direction,
we get the guess
𝛼(𝑘) = 𝛼(𝑘−1) ‖∇𝑓(𝑘−1)‖² / ‖∇𝑓(𝑘)‖².        (4.46)

This is just the first guess in the new line search, which will then proceed
as usual. If the slope of the function decreases relative to the previous
line search, this guess decreases relative to the previous line search step
length, and vice versa.

[Figure 4.29: Steepest descent optimization path (34 iterations).]

Example 4.15: Steepest descent applied to the bean function

We now find the minimum of the bean function,

𝑓(𝑥₁, 𝑥₂) = (1 − 𝑥₁)² + (1 − 𝑥₂)² + (1/2)(2𝑥₂ − 𝑥₁²)²,        (4.47)

using the steepest descent algorithm with a two-stage line search. We use an exact
line search (small enough 𝜇2) and a convergence tolerance of ‖∇𝑓‖∞ ≤ 10⁻⁶.


The optimization path is shown in Fig. 4.29. While it takes only a few iterations
to get close to the minimum, it takes many more to satisfy the specified
convergence tolerance.
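A compact steepest descent implementation for Ex. 4.15 might look like the sketch below. It is illustrative: it uses a simple backtracking line search rather than the two-stage line search of Section 4.3.2, and it seeds each line search with the step-length guess of Eq. 4.46.

    import numpy as np

    def bean(x):
        # Bean function (Eq. 4.47) and its gradient
        f = (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2
        g = np.array([-2 * (1 - x[0]) - 2 * x[0] * (2 * x[1] - x[0]**2),
                      -2 * (1 - x[1]) + 2 * (2 * x[1] - x[0]**2)])
        return f, g

    def steepest_descent(func, x0, tau=1e-6, max_iter=1000):
        x = np.array(x0, dtype=float)
        f, g = func(x)
        alpha = 1.0
        k = 0
        while np.max(np.abs(g)) > tau and k < max_iter:
            p = -g                                            # steepest descent direction (Eq. 4.42)
            # Backtracking line search with sufficient decrease (Alg. 4.9)
            while func(x + alpha * p)[0] > f + 1e-4 * alpha * (g @ p):
                alpha *= 0.5
            x = x + alpha * p
            f_new, g_new = func(x)
            # Step-length guess for the next line search (Eq. 4.46)
            alpha *= (g @ g) / (g_new @ g_new)
            f, g = f_new, g_new
            k += 1
        return x, f, k

    print(steepest_descent(bean, [0.0, 0.0]))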

Tip 4.16: Problem scaling is important.

Problem scaling is one of the most important practical considerations in


optimization. Steepest descent is particularly sensitive to scaling. Even though
we will see methods that are less sensitive, for general nonlinear functions poor
scaling can decrease the effectiveness of any method.
A common cause of poor scaling is unit choice. For example, consider a
problem with two types of design variables, where one type is a material
thickness on the order of 10⁻⁶ m, and the other type is the length of the structure
on the order of 1 m. If both distances are measured in meters, then the derivative
in the thickness direction will be large compared with the derivative in the length
direction. In other words, the design space will have a valley that is extremely
steep and short in one direction, and gradual and long in the other. The
optimizer will have great difficulty navigating this type of design space.
Similarly, if the objective were power and a typical value were 1,000,000 W,
then all of the gradients would likely be relatively small, and satisfying convergence
tolerances may be difficult.
A good starting point for many optimization problems is to scale the
objective and every design variable to be around unity. So in the first example
we might measure thicknesses in micrometers, and in the second example
we could report power in MW. This heuristic still does not guarantee that the
derivatives are well scaled but it often provides a reasonable starting point for
further fine tuning of the problem scaling.
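As a minimal illustration of this heuristic (not from the book), one can wrap the model so that the optimizer only ever sees variables and an objective of order one. The scale factors and the placeholder model below are hypothetical.

import numpy as np

# hypothetical raw quantities: thickness ~ 1e-6 m, length ~ 1 m, power ~ 1e6 W
x_scale = np.array([1.0e-6, 1.0])   # multiply scaled variables by this to recover raw units
f_scale = 1.0e6                     # report power in MW instead of W

def raw_power(x_raw):
    thickness, length = x_raw
    return 1.0e12 * thickness * length     # placeholder standing in for a real analysis

def scaled_objective(x_scaled):
    # the optimizer works with O(1) design variables and an O(1) objective
    return raw_power(x_scaled * x_scale) / f_scale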

4.4.2 Conjugate Gradient


Steepest descent generally performs quite poorly, especially if the
problem is not well scaled, like the quadratic example in Figure 4.28.
The conjugate gradient method corrects the search directions such that
they do not zigzag as much. This method is based on the linear conjugate
gradient method, which was designed to solve linear equations. We
first introduce the linear conjugate gradient method, and then adapt it
to the nonlinear case.
For the moment, let us assume that we have a quadratic objective
function. The Hessian of a quadratic function that is badly scaled has a
high condition number, while a quadratic with a condition number of 1
is well scaled and would have perfectly circular contours. Fortunately, a
simple change in the search directions can yield a dramatic improvement


for the badly scaled case. We can choose search directions that are less
sensitive to the problem scaling. Let us also start with the simplest
case, where the Hessian is the identity matrix. In that case, the function
contours would all be circular. If we pick the coordinate directions as
search directions, and find the minimum in each direction, then we can
reach the minimum of an 𝑛-dimensional quadratic in at most 𝑛 steps
(Fig. 4.30).
Let us now assume that our quadratic is not ideally scaled, but
rather stretched in some direction (and optionally rotated). We want
to use the same concept and choose directions that will get us to the
minimum in at most 𝑛 steps. In the previous case, we chose orthogonal
vectors as our search directions. For the more general case, we choose
conjugate vectors. You can think of conjugate vectors as a generalization
of orthogonal vectors. The vectors 𝑝 are conjugate with respect to the
Hessian (or 𝐻-orthogonal) if
$$
p_i^T H p_j = 0 \quad \text{for all } i \neq j . \qquad (4.48)
$$
If the Hessian is the identity matrix, this definition would produce an
orthogonal set of vectors. Conjugate vectors are “orthogonal” in the
stretched sense. These conjugate directions retain the property that if
we determine the best step in each conjugate direction, we can reach the
minimum of an 𝑛-dimensional quadratic in 𝑛 steps (Fig. 4.31). This is a
significant improvement over steepest descent, which, as we have seen,
can take many iterations to converge on a stretched two-dimensional
quadratic function.

Figure 4.30: For a quadratic function with perfectly circular contours (condition
number of 1) we can find the minimum in 𝑛 steps, where 𝑛 is the number of
dimensions, by using a coordinate search. A coordinate search just means that we
find the minimum in each coordinate sequentially. So in the above example we
searched in the 𝑥1 direction to find the minimum along that line, then searched
in the 𝑥2 direction.
Of course, a general function is not quadratic, and our line search
methods do not find the best step in each direction. However, we
address these issues by using a local quadratic approximation of the
function, and performing a periodic restart, where every 𝑛 iterations a
steepest descent step is taken instead.

In practice this method outperforms steepest descent significantly
with only a small modification in procedure. The required change
is to save information on the search direction and gradient from the
previous iteration:
$$
p^{(k)} = -\nabla f^{(k)} + \beta^{(k)} p^{(k-1)} , \qquad (4.49)
$$
where
$$
\beta^{(k)} = \frac{\nabla f^{(k)^T} \nabla f^{(k)}}{\nabla f^{(k-1)^T} \nabla f^{(k-1)}} . \qquad (4.50)
$$

Figure 4.31: For any quadratic function we can find the minimum in 𝑛 steps, where
𝑛 is the number of dimensions, by searching along conjugate directions. Conjugate
directions are “orthogonal” with respect to the Hessian.
The parameter 𝛽 can be interpreted as a “damping parameter” that
prevents each search direction from varying too much relative to the
previous one. When the function steepens, the damping becomes larger,
and vice versa.
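Below is a minimal Python sketch of the nonlinear conjugate gradient iteration using the expression in Eq. 4.50 with a periodic restart. As in the earlier sketch, the backtracking line search and the descent-direction safeguard are assumptions made for illustration rather than the book's line search.

import numpy as np

def conjugate_gradient(f, grad, x0, tol=1e-6, mu1=1e-4, rho=0.5, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    n = x.size
    g = grad(x)
    p = -g                                    # first direction is steepest descent
    for k in range(max_iter):
        if np.max(np.abs(g)) <= tol:
            return x, k
        if k > 0 and k % n == 0:
            p = -g                            # periodic restart every n iterations
        alpha = 1.0
        while f(x + alpha * p) > f(x) + mu1 * alpha * (g @ p):
            alpha *= rho                      # backtracking (Armijo) line search
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)      # Eq. 4.50
        p = -g_new + beta * p                 # Eq. 4.49
        if g_new @ p >= 0:
            p = -g_new                        # safeguard: ensure a descent direction
        g = g_new
    return x, max_iter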

Example 4.17: Conjugate gradient applied to the bean function

Minimizing the same bean function from Ex. 4.15 with the same line search
algorithm and settings, we get the optimization path shown in Fig. 4.32. The
changes in direction for the conjugate gradient method are smaller than for
steepest descent, and it takes fewer iterations to achieve the same convergence
tolerance.

Figure 4.32: Conjugate gradient optimization path (22 iterations).

4.4.3 Newton’s Method

The steepest descent and conjugate gradient methods use only first-order
information (the gradient). Newton’s method uses second-order
information to enable better estimates of favorable search directions.
The method is based on a Taylor’s series expansion about the current
design point:
$$
f\left(x^{(k)} + s^{(k)}\right) = f^{(k)} + \nabla f^{(k)^T} s^{(k)} + \frac{1}{2} s^{(k)^T} H^{(k)} s^{(k)} + \ldots , \qquad (4.51)
$$
where 𝑠 (𝑘) is some vector centered at 𝑥 (𝑘) . We can find the step 𝑠 (𝑘) that
minimizes this quadratic model (ignoring the higher-order terms). We
do this by taking the derivative with respect to 𝑠 (𝑘) and setting that
equal to zero:

$$
\begin{aligned}
\frac{\mathrm{d} f\left(x^{(k)} + s^{(k)}\right)}{\mathrm{d} s^{(k)}} = \nabla f^{(k)} + H^{(k)} s^{(k)} &= 0 \\
H^{(k)} s^{(k)} &= -\nabla f^{(k)} \\
s^{(k)} &= -H^{(k)^{-1}} \nabla f^{(k)} .
\end{aligned}
\qquad (4.52)
$$

Mathematically, we use the notation 𝐻 −1 , but in a computational


implementation one would typically not explicitly invert the matrix
for efficiency reasons. Instead, one would solve the linear system
𝐻 (𝑘) 𝑠 (𝑘) = −∇ 𝑓 (𝑘) .
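As a minimal sketch (not from the book), a single Newton step can be written as follows, assuming callables that return the gradient and the Hessian:

import numpy as np

def newton_step(grad, hess, x):
    g = grad(x)
    H = hess(x)
    s = np.linalg.solve(H, -g)   # solve H s = -grad f rather than inverting H
    return x + s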
Using the same quadratic example from the previous sections, we
see that Newton’s method converges in one step (Fig. 4.33). This
is not surprising. Because our function is quadratic, the quadratic
“approximation” from the Taylor’s series is exact, and so we can find
the minimum in one step. For a general nonlinear function, it will
take more iterations, but using curvature information should help us
obtain a better estimate for a search direction compared to first-order
methods. Not only does Newton’s method provide a better search
direction, but it also provides a step length embedded in 𝑠(𝑘), because the
quadratic model provides an estimate of the stationary point location.
Furthermore, Newton’s method exhibits quadratic convergence.

While Newton’s method is promising, in practice there are a few
issues. Fortunately, we can address each of these challenges.

Figure 4.33: Iteration history for the quadratic function using an exact line search
and Newton’s method. Unsurprisingly, only one iteration is required.

1. Problem: The Hessian might not be positive definite, in which
   case the search direction is not a descent direction.
   Solution: Because we are not yet at the minimum, we know a descent
   direction exists. It is possible to modify the Hessian such that
   it is positive definite, and still prove convergence. The methods
   of the next section force positive definiteness by construction.

2. Problem: The predicted new point 𝑥(𝑘) + 𝑠(𝑘) is based on a second-order
   approximation and so may not actually yield a good point.
   In fact, the new point could be worse: 𝑓(𝑥(𝑘) + 𝑠(𝑘)) > 𝑓(𝑥(𝑘)).
   Solution: Because the search direction 𝑠(𝑘) is a descent direction, if we
   backtrack enough, our search direction will yield a function decrease.

3. Problem: The Hessian can be difficult or costly to obtain.
   Solution: This is unavoidable for Newton’s method, but an alternative
   exists (the quasi-Newton methods we discuss next).

Example 4.18: Newton method applied to the bean function

Minimizing the same bean function from Ex. 4.15 and Ex. 4.17, we get the
optimization path shown in Fig. 4.34. Newton’s method takes fewer iterations
to achieve the same convergence tolerance.

Figure 4.34: Newton optimization path (8 iterations).
4.4.4 Quasi-Newton Methods

As discussed above, Newton’s method is effective because the second-order
information allows for better search directions, but it has the
major shortcoming of requiring the Hessian in the first place. Quasi-Newton
methods are designed to address this issue. The basic idea is
that we can use first-order information (gradients) along each step in
the iteration path to build an approximation of the Hessian.
In one dimension, we can use a secant line to estimate the slope of a
curve (its derivative), which can be written as the finite difference

$$
f' \approx \frac{f^{(k+1)} - f^{(k)}}{x^{(k+1)} - x^{(k)}} , \qquad (4.53)
$$
where 𝑓′ might be used to estimate the slope at 𝑘 or 𝑘 + 1 (Fig. 4.35).


We can use a similar finite difference to estimate the curvature using
the slopes and rearrange the equation to get

$$
f''^{(k+1)} \left( x^{(k+1)} - x^{(k)} \right) = f'^{(k+1)} - f'^{(k)} . \qquad (4.54)
$$

We use the same concept in 𝑛 dimensions, but now the difference
between the gradients at two different points yields a vector that
represents an approximation of the curvature in that direction. Denoting
the approximate Hessian as 𝐵 and the step determined by the latest line
search, which is 𝑠(𝑘) = 𝑥(𝑘+1) − 𝑥(𝑘) = 𝛼(𝑘)𝑝(𝑘), we can write the secant
condition
$$
B^{(k+1)} s^{(k)} = \nabla f^{(k+1)} - \nabla f^{(k)} . \qquad (4.55)
$$
This states that the projection of the approximate Hessian onto 𝑠(𝑘) must
yield the same curvature predicted by taking the difference between
the gradients. The secant condition provides a requirement consisting
of 𝑛 equations where the step and the gradients are known. However,
there are 𝑛(𝑛 + 1)/2 unknowns in the approximate Hessian (recall that
it is a symmetric matrix), so this is not sufficient to determine 𝐵. There
is another requirement, which is that 𝐵 must be positive definite. This
yields another 𝑛, but that still leaves us with an infinite number of
possibilities for 𝐵.

Figure 4.35: A secant line at point 𝑥(𝑘) used to estimate 𝑓′(𝑘).
Given that 𝐵 must be positive definite, the secant condition (4.55) is
only possible if the predicted curvature is positive along the step, that
is,
$$
s^{(k)^T} \left( \nabla f^{(k+1)} - \nabla f^{(k)} \right) > 0 . \qquad (4.56)
$$
This is called the curvature condition, which is automatically satisfied if
the line search finds a step that satisfies the strong Wolfe conditions.
Davidon, Fletcher, and Powell devised an effective strategy to estimate
the Hessian.15,16 Because there is an infinite number of solutions,
they formulated a way to select the approximate Hessian by picking the one
that was “closest” to the Hessian of the previous iteration, while still satisfying
the secant condition and the requirements of symmetry and positive definiteness.
This turns out to be an optimization problem in itself, but with an analytic
solution. This led to the DFP method, which was a very impactful idea
in the field of nonlinear optimization.

This method was soon superseded by the BFGS method developed
by Broyden, Fletcher, Goldfarb, and Shanno,59–62 and so we focus on
that method instead. They started with the observation that what we
ultimately want is the inverse of the Hessian so that we can predict the
next search direction:
$$
p^{(k)} = -B^{(k)^{-1}} \nabla f^{(k)} . \qquad (4.57)
$$

15. Davidon, Variable Metric Method for Minimization. 1991
16. Fletcher et al., A Rapidly Convergent Descent Method for Minimization. 1963
59. Broyden, The Convergence of a Class of Double-rank Minimization Algorithms 1. General Considerations. 1970
60. Fletcher, A new approach to variable metric algorithms. 1970
61. Goldfarb, A family of variable-metric methods derived by variational means. 1970
62. Shanno, Conditioning of quasi-Newton methods for function minimization. 1970
Their key insight was that rather than estimating the Hessian and then
solving a linear system, we should directly estimate the Hessian inverse
instead. We will denote the Hessian inverse as 𝑉 (that is, 𝑉(𝑘) = 𝐻(𝑘)⁻¹). Using
the Hessian inverse changes our search prediction step to:

𝑝 (𝑘) = −𝑉 (𝑘) ∇ 𝑓 (𝑘) . (4.58)

Notice that once we have 𝑉 (𝑘) , we only need to perform a matrix-vector


multiplication, which is a much faster operation than solving a linear
system.
The rest of the approach is similar to the DFP method, in the sense
that we still need 𝑉 to be symmetric and positive definite and to satisfy the
secant condition, and of all possible matrices, we choose the one closest to the
one from the previous iteration. The secant condition (4.55) can be rewritten
in terms of our new estimate 𝑉 (𝑘+1) as:

𝑉 (𝑘+1) (∇ 𝑓 (𝑘+1) − ∇ 𝑓 (𝑘) ) = 𝑥 (𝑘+1) − 𝑥 (𝑘) . (4.59)

Mathematically, the problem that we need to solve to estimate the next
Hessian inverse 𝑉(𝑘+1) is:
$$
\begin{aligned}
\text{minimize} \quad & \left\| V^{(k+1)} - V^{(k)} \right\| \\
\text{by varying} \quad & V^{(k+1)} \\
\text{subject to} \quad & V^{(k+1)} = V^{(k+1)^T} \\
& V^{(k+1)} \left( \nabla f^{(k+1)} - \nabla f^{(k)} \right) = x^{(k+1)} - x^{(k)}
\end{aligned}
\qquad (4.60)
$$
Fortunately, this optimization problem can be solved analytically,
depending on the choice of the matrix norm. The matrix norm used
in the BFGS method is a weighted Frobenius norm, with a particular
weighting (the same one used in the DFP method). For further details see
Fletcher.63

63. Fletcher, Practical Methods of Optimization. 1987

The solution for 𝑉(𝑘+1) is:
$$
V^{(k+1)} = \left( I - \frac{s^{(k)} y^{(k)^T}}{s^{(k)^T} y^{(k)}} \right) V^{(k)} \left( I - \frac{y^{(k)} s^{(k)^T}}{s^{(k)^T} y^{(k)}} \right) + \frac{s^{(k)} s^{(k)^T}}{s^{(k)^T} y^{(k)}} . \qquad (4.61)
$$

where
𝑠 (𝑘) = 𝑥 (𝑘+1) − 𝑥 (𝑘) = 𝛼(𝑘) 𝑝 (𝑘)
is the step that resulted from the last line search. The other important
term is the estimate of the curvature in the direction of that line search,
which is given by the difference between the gradients at the end and
start of the line search (the last two major iterations),

𝑦 (𝑘) = ∇ 𝑓 (𝑘+1) − ∇ 𝑓 (𝑘) .


While the denominator $s^{(k)^T} y^{(k)}$ is a dot product resulting in a scalar, the
numerators $s^{(k)} y^{(k)^T}$ and $y^{(k)} s^{(k)^T}$ are outer products that result in
$n_x \times n_x$ matrices of rank 1. The division of these matrices by the scalar is
performed elementwise.
Eq. 4.61 provides an analytic expression to update the inverse of the
Hessian at each iteration. The advantages of this approximation are
that we only need first-order information and that we do not need to
evaluate points other than the iterations we are already performing. For
the first iteration, we usually set 𝑉0 to the identity matrix, or a scaled
version of it. Using the identity matrix for 𝑉0 in Eq. 4.58 results in

𝑝 0 = −∇ 𝑓0 (4.62)

and thus the first step is a steepest descent step. Subsequent iterations
use information from the previous Hessian inverse, the direction and
length of the last step, and the difference in the last two gradients to
improve the estimate of the Hessian inverse.
The optimization problem (4.60) does not explicitly include a constraint
on positive definiteness. It turns out that this update formula
will always produce a 𝑉(𝑘+1) that is positive definite as long as 𝑉(𝑘)
is positive definite. Therefore, if we start with an identity matrix as
suggested above, all subsequent updates produce positive definite
matrices.
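The update can be expressed compactly in code. The following is a minimal BFGS sketch (not the book's implementation) that uses Eq. 4.61; the simple backtracking line search is an assumption, and the update is skipped whenever the curvature condition (Eq. 4.56) does not hold, which a strong-Wolfe line search would otherwise guarantee.

import numpy as np

def bfgs(f, grad, x0, tol=1e-6, mu1=1e-4, rho=0.5, max_iter=500):
    x = np.asarray(x0, dtype=float)
    n = x.size
    V = np.eye(n)                       # initial inverse Hessian estimate (identity)
    g = grad(x)
    for k in range(max_iter):
        if np.max(np.abs(g)) <= tol:
            return x, k
        p = -V @ g                      # Eq. 4.58: only a matrix-vector product
        alpha = 1.0
        while f(x + alpha * p) > f(x) + mu1 * alpha * (g @ p):
            alpha *= rho                # backtracking line search
        s = alpha * p                   # step taken
        x = x + s
        g_new = grad(x)
        y = g_new - g                   # difference of gradients
        sy = s @ y
        if sy > 1e-12:                  # curvature condition check
            I = np.eye(n)
            V = (I - np.outer(s, y) / sy) @ V @ (I - np.outer(y, s) / sy) \
                + np.outer(s, s) / sy   # Eq. 4.61
        g = g_new
    return x, max_iter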

Example 4.19: BFGS applied to the bean function

Minimizing the same bean function from previous examples using BFGS,
we get the optimization path shown in Fig. 4.36.

Figure 4.36: BFGS optimization path (7 iterations).

We initialize the inverse Hessian to the identity matrix. Using the BFGS update
procedure, after two iterations, with 𝑥(2) = (0.065647, −0.219401), the inverse
Hessian approximation is
$$
V^{(2)} = \begin{bmatrix} 0.320199 & -0.100560 \\ -0.100560 & 0.219681 \end{bmatrix} .
$$
Compare this to the exact inverse Hessian at the same point,
$$
H^{-1}\left(x^{(2)}\right) = \begin{bmatrix} 0.345784 & 0.015133 \\ 0.015133 & 0.167328 \end{bmatrix} .
$$
The predicted curvature is reasonable, but not accurate.
By the end of the optimization, at 𝑥* = (1.213336, 0.824181), the BFGS
estimate is
$$
V^{*} = \begin{bmatrix} 0.276503 & 0.224956 \\ 0.224956 & 0.346879 \end{bmatrix} ,
$$
while the exact one is
$$
H^{-1}(x^{*}) = \begin{bmatrix} 0.276965 & 0.224034 \\ 0.224034 & 0.347886 \end{bmatrix} .
$$
Now the estimate is much more accurate.

4.4.5 Limited Memory Quasi-Newton Methods


As the number of design variables becomes large, even just storing
the inverse Hessian matrix in memory can require an excessive amount
of resources. For example, if there are millions or billions of design
variables, the memory requirements can be prohibitive, but even for
problems with 100 design variables the techniques of this section are
often used to improve computational efficiency with minimal sacrifice
in accuracy. Recall that we are only interested in the matrix-vector
product 𝑉∇𝑓. We can compute that product without ever actually
forming the matrix 𝑉. We will discuss how this is done with the
BFGS update as this is the most popular approach (known as L-BFGS),
although other quasi-Newton update formulas can also be used.
Notice that Eq. 4.61 defines a recursive sequence. As shorthand we
define the scalar:
$$
\sigma = \frac{1}{s^T y} , \qquad (4.63)
$$
then the BFGS update is given by:
$$
V^{(k)} = \left[ \left( I - \sigma s y^T \right) V \left( I - \sigma y s^T \right) + \sigma s s^T \right]^{(k-1)} . \qquad (4.64)
$$

If we saved the sequence of 𝑠 and 𝑦 vectors, and specified a starting


value for 𝑉 (0) , then we could compute any subsequent 𝑉 (𝑘) . Of course,
what we really want is 𝑉 (𝑘) ∇ 𝑓 (𝑘) . This product can be computed
algorithmically using the recurrence relationship.
However, as described, the new algorithm doesn’t provide much
benefit yet. The goal was to avoid storing a large dense matrix, and
instead we are storing a long sequence of vectors and a starting matrix.
To alleviate this issue, we don’t store the entire history, but rather only
store the last 𝑚 vectors for 𝑠 and 𝑦. In practice, 𝑚 is usually between
5–20. Next, we make the starting Hessian diagonal so it only requires
vector storage, or scalar storage if we make all entries in the diagonal
equal. A common choice is to use a scaled identity matrix, which just
requires storing one number:

$$
V^{(0)} = \frac{s^T y}{y^T y} \, I , \qquad (4.65)
$$
where the 𝑠 and 𝑦 values on the right hand side would use the previous
iteration. The algorithm is summarized in Alg. 4.20.

Algorithm 4.20: Compute the product of the inverse Hessian and a vector using
the BFGS update rule

Inputs:
∇𝑓(𝑘): Gradient at point 𝑥(𝑘)
𝑠(𝑘−1,...,𝑘−𝑚): History of steps 𝑥(𝑘) − 𝑥(𝑘−1)
𝑦(𝑘−1,...,𝑘−𝑚): History of gradient differences ∇𝑓(𝑘) − ∇𝑓(𝑘−1)
Outputs:
𝑑: Desired product 𝑉(𝑘)∇𝑓(𝑘)

𝑑 = ∇𝑓(𝑘)
for 𝑖 = 𝑘 − 1 to 𝑘 − 𝑚 by −1 do
    𝛼(𝑖) = 𝜎(𝑖) 𝑠(𝑖)ᵀ 𝑑
    𝑑 = 𝑑 − 𝛼(𝑖) 𝑦(𝑖)
end for
𝑉(0) = (𝑠(𝑘−1)ᵀ 𝑦(𝑘−1)) / (𝑦(𝑘−1)ᵀ 𝑦(𝑘−1)) 𝐼
𝑑 = 𝑉(0) 𝑑
for 𝑖 = 𝑘 − 𝑚 to 𝑘 − 1 do
    𝛽(𝑖) = 𝜎(𝑖) 𝑦(𝑖)ᵀ 𝑑
    𝑑 = 𝑑 + (𝛼(𝑖) − 𝛽(𝑖)) 𝑠(𝑖)
end for

We no longer need to bear the memory cost of storing a large matrix,


or the computational cost of a large matrix-vector product. Instead, we
store only a small number of vectors, and require only a small number
of vector-vector products (a cost that scales linearly with 𝑛 rather than
quadratically).
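A minimal Python sketch of the two-loop recursion in Alg. 4.20 is shown below. It assumes the step history s_list and the gradient-difference history y_list contain the most recent m pairs, ordered from oldest to newest; these names are illustrative, not from the book.

import numpy as np

def lbfgs_direction_product(grad_k, s_list, y_list):
    # returns d = V^(k) grad_k without ever forming the inverse Hessian V
    d = grad_k.copy()
    alphas = []
    # first loop: newest pair to oldest pair
    for s, y in zip(reversed(s_list), reversed(y_list)):
        sigma = 1.0 / (s @ y)
        a = sigma * (s @ d)
        alphas.append(a)
        d = d - a * y
    # scaled-identity starting matrix (Eq. 4.65), using the most recent pair
    s_last, y_last = s_list[-1], y_list[-1]
    d = (s_last @ y_last) / (y_last @ y_last) * d
    # second loop: oldest pair to newest pair
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        sigma = 1.0 / (s @ y)
        b = sigma * (y @ d)
        d = d + (a - b) * s
    return d    # the search direction is then p = -d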

Example 4.21: L-BFGS compared to BFGS for the bean function


Minimizing the same bean function from the previous examples, the
optimization iterations using BFGS and L-BFGS are the same, and are shown
in Fig. 4.37. The L-BFGS method is applied to the same sequence using the last
5 iterations. The number of variables is too small to benefit from the limited-memory
approach, but we show it in this small problem as an example. At the
same 𝑥* as in Ex. 4.19, the product 𝑉∇𝑓 is estimated using Alg. 4.20 as:
$$
d^{*} = \begin{bmatrix} -7.38683 \times 10^{-5} \\ 5.75370 \times 10^{-5} \end{bmatrix}
$$
whereas the exact value is:
$$
V^{*} \nabla f^{*} = \begin{bmatrix} -7.49228 \times 10^{-5} \\ 5.90441 \times 10^{-5} \end{bmatrix}
$$

Figure 4.37: Optimization paths using BFGS and L-BFGS (7 iterations each).
Example 4.22: Total potential energy contours of spring system

Many structural mechanics models involve solving an unconstrained energy


minimization problem. Consider a mass supported by two springs, as shown
in Fig. 4.38. Formulating the total potential energy for the system as a function
of the mass position yields the problem,
$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2} k_1 \left( \sqrt{(\ell_1 + x_1)^2 + x_2^2} - \ell_1 \right)^2
+ \frac{1}{2} k_2 \left( \sqrt{(\ell_2 - x_1)^2 + x_2^2} - \ell_2 \right)^2 - m g \, x_2 \\
\text{by varying} \quad & x_1, x_2
\end{aligned}
\qquad (4.66)
$$

Figure 4.38: Two-spring system with no applied force (top) and with applied force (bottom).

The contours of this function are shown in Fig. 4.39 for the case where
ℓ1 = 12, ℓ2 = 8, 𝑘1 = 1, 𝑘2 = 10, 𝑚𝑔 = 7. There is both a minimum and a


maximum. The minimum represents the position of the mass at the stable
equilibrium condition. The maximum also represents an equilibrium point, but
it is unstable. Starting near the maximum, steepest descent, conjugate gradient,
and quasi-Newton all converge to the minimum. As expected, steepest descent
is the least efficient and quasi-Newton is the most efficient.
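As a minimal illustration (not part of the book's example), the total potential energy in Eq. 4.66 with the parameters above can be written and minimized as follows; scipy's BFGS is used here for convenience, and the starting point is an assumption.

import numpy as np
from scipy.optimize import minimize

l1, l2, k1, k2, mg = 12.0, 8.0, 1.0, 10.0, 7.0

def potential_energy(x):
    x1, x2 = x
    d1 = np.sqrt((l1 + x1)**2 + x2**2) - l1   # stretch of spring 1
    d2 = np.sqrt((l2 - x1)**2 + x2**2) - l2   # stretch of spring 2
    return 0.5 * k1 * d1**2 + 0.5 * k2 * d2**2 - mg * x2

res = minimize(potential_energy, x0=[1.0, 1.0], method='BFGS')
print(res.x)   # position of the mass at the stable equilibrium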

Example 4.23: Comparing methods for the Rosenbrock function

We now test the methods on the more challenging function,


$$
f(x_1, x_2) = (1 - x_1)^2 + 100 \left( x_2 - x_1^2 \right)^2 ,
$$
which is known as the Rosenbrock function. This is a well-known optimization
problem because a narrow, highly curved valley makes it challenging to minimize.§

§ The “bean” function we used in previous examples is a milder version of the Rosenbrock function.

Figure 4.39: Minimizing the total potential for the two-spring system: (a) steepest
descent, 59 iterations; (b) conjugate gradient, 39 iterations; (c) quasi-Newton, 17
iterations; (d) Newton, 5 iterations.

The convergence history for four methods starting from 𝑥 = (−1.2, 1.0)
is shown in Fig. 4.41. All four methods use an inexact line search with the
same parameters and a convergence tolerance of ‖∇𝑓‖∞ ≤ 10⁻⁶. Compared to
the previous two examples, the difference between steepest descent and
the other methods is much more dramatic (two orders of magnitude more
iterations!), owing to the more challenging variation in the curvature (recall
Ex. 4.14).
Steepest descent does converge but takes a large number of iterations because it
bounces between the steep walls of the valley. One of the line searches gets lucky
and takes a shortcut to another part of the valley, but even then it cannot make
up for its inherent inefficiency. The conjugate gradient method is much more
efficient because it damps the steepest descent oscillations with a contribution
from the previous direction. Eventually, conjugate gradient achieves superlinear
convergence near the optimum, which saves many iterations to get the last
several orders of magnitude in the convergence criterion. The methods that
use second-order information are even more efficient, exhibiting quadratic
convergence in the last few iterations.

Figure 4.40: Optimization paths for the Rosenbrock function using steepest descent
(13,511 iterations), conjugate gradient (112 iterations), quasi-Newton/BFGS (35
iterations), and Newton (21 iterations).

Figure 4.41: Convergence of the four methods shows the dramatic difference between
the linear convergence of steepest descent, the superlinear convergence of the
conjugate gradient method, and the quadratic convergence of the methods that use
second-order information.

The number of major iterations is not always an effective way to
compare performance. For example, Newton’s method takes fewer major
iterations, but each iteration in Newton’s method is more expensive
than each iteration in the quasi-Newton method. This is because Newton’s
method requires a linear solution, which is an 𝒪(𝑛³) operation, as
opposed to a matrix-vector multiplication, which is an 𝒪(𝑛²) operation.
For a small problem like the Rosenbrock function this is an insignificant
difference, but for large problems this is a significant difference in time.
Additionally, each major iteration includes a line search, and depending
on the quality of the search direction, the number of function calls
contained in each iteration will differ.

Tip 4.24: Gradient-based optimization can find the global optimum.


Gradient-based methods are local search methods. If the design space is


fundamentally multimodal, it may be useful to augment the gradient-based
search with a global search.
The simplest and most common approach is to use a multistart approach,
where we run a gradient-based search multiple times, starting from differ-
ent points. The starting points might be chosen from engineering intuition,
randomly generated points, or sampling methods, such as Latin hypercube
sampling (see Chapter 10).
Convergence testing is needed to determine a suitable number of starting
points. If all points converge to the same optimum, and the starting points
were well spaced, this suggests that the design space might not be multimodal
after all. By using multiple starting points, we increase the likelihood that we
find the global optimum, or at least that we find a better optimum than would
be found with a single starting point. One advantage of this approach is that it
can easily be run in parallel.
Another approach is to start with a global search strategy, like a population-
based gradient-free algorithm (see Chapter 7). After some suitable initial
exploration, the designs in the population become starting points for gradient-
based optimization. This approach allows us to find points that satisfy the
optimality conditions, which is typically not possible with a pure gradient-free
approach. It also improves the convergence rate and finds the optima more
precisely.
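A minimal multistart sketch (not from the book) is shown below; it assumes simple uniform random starting points within given bounds and uses scipy's BFGS for each local search, keeping the best result.

import numpy as np
from scipy.optimize import minimize

def multistart(f, lower, upper, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lower, upper)           # random starting point
        res = minimize(f, x0, method='BFGS')     # local gradient-based search
        if best is None or res.fun < best.fun:
            best = res
    return best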

4.5 Trust-region Methods

Trust-region methods, also known as restricted step methods, present


an alternative to the line-search based algorithms presented so far. The
motivation for trust-region methods is to address the issues caused by
non-positive definite Hessian matrices in Newton and quasi-Newton
methods, as well as other weaknesses.
Unconstrained optimization algorithms based on a line search
consist of determining a search direction, then solving the line search
subproblem, which determines the distance to move along that direction.
Trust-region methods are fundamentally different. Instead of fixing the
direction and then finding the distance, trust-region methods fix the
maximum distance, and then find the direction and distance that yield
the most improvement. The trust-region method requires a model of
the function to be minimized, and the definition of a region within
which we trust the model to be good enough for our purposes. The most
common model is a local quadratic function, but other models may also
be used. The trust-region is centered about the current iteration point,
and can be defined as an 𝑛-dimensional box, sphere, or ellipsoid of a
given size. Each trust-region iteration consists of the following main
stages:

1. Update the function model (e.g., quadratic)

2. Minimize the model within the trust region

3. Update the trust-region size and location

Figure 4.42: Trust-region approach for globalization, where the trust regions are
circular in this case.

The trust-region subproblem is

minimize 𝑓˜(𝑠)
by varying 𝑠 (4.67)
subject to k𝑠 k ≤ Δ,

where 𝑓˜(𝑠) is the local trust-region model, 𝑠 is the step from the current
iteration point, and Δ is the size of the trust region. Note that we use
the notation 𝑠 instead of 𝑝 to indicate that this is a step vector (direction
and magnitude) and not just the direction 𝑝 used in line search based
methods.
The subproblem above defines the trust region using a norm. The
Euclidean norm, ‖𝑠‖₂, defines a spherical trust region and is the most
common type of trust region. Sometimes ∞-norms are used instead as
they are easy to apply, but 1-norms are rarely used because they are just as
complex as 2-norms yet introduce sharp corners that are sometimes
problematic.64 The shape of the trust region, dictated by the norm,
can have a significant impact on the convergence rate. The ideal trust-region
shape depends on the local function space, and some algorithms
allow the trust-region shape to change throughout the optimization.

64. Conn et al., Trust Region Methods. 2000
Using a quadratic trust-region model and the Euclidean norm we


can define the more specific subproblem

$$
\begin{aligned}
\text{minimize} \quad & \tilde{f}(s) = f^{(k)} + \nabla f^{(k)^T} s + \frac{1}{2} s^T B^{(k)} s \\
\text{subject to} \quad & \| s \|_2 \leq \Delta^{(k)} ,
\end{aligned}
\qquad (4.68)
$$

where 𝐵(𝑘) is the approximate Hessian at our current iterate. This
problem has a quadratic objective and quadratic constraints and is
called a quadratically constrained quadratic program (QCQP). If the
problem is unconstrained and 𝐵 is positive definite, we can get to
the solution using a single step 𝑠 = −𝐵(𝑘)⁻¹∇𝑓(𝑘). However, due to
the constraints, there is no analytic solution for the QCQP. While the
problem is still straightforward to solve numerically (it is a convex
problem, see Chapter 11), it requires an iterative solution approach
with multiple factorizations. Similarly to the line search, where we only
obtain a sufficiently good point instead of finding the exact minimum,
in the trust-region subproblem we seek an approximate solution to the
QCQP. The inclusion of the trust-region constraint allows us to omit
the requirement that 𝐵(𝑘) be positive definite, which is used in most
of the quasi-Newton methods. We do not detail approximate solution
approaches to the QCQP, but multiple algorithms exist.58,64,65

58. Nocedal et al., Numerical Optimization. 2006
64. Conn et al., Trust Region Methods. 2000
65. Steihaug, The Conjugate Gradient Method and Trust Regions in Large Scale Optimization. 1983

The left side of Fig. 4.43 shows an example of function contours
for the Rosenbrock function, a local quadratic model (in blue), and
a spherical trust region (red circle). The trust-region step seeks the
minimum of the local quadratic model within the spherical trust region.
Notice on the right side that, unlike line search methods, as the size
of the trust region changes, the direction of the step also changes (the
solution to Eq. 4.68).

Figure 4.43: The blue contour lines are the Rosenbrock function and the gray
contours are a local quadratic approximation about the current iteration (where the
arrows originate). The red circle represents a trust region. It is a safeguard to prevent
steps beyond where the local model is likely to be valid. The trust-region step 𝑠(𝑘)
finds the minimum of the blue contours while remaining within the trust-region
boundary. The steepest descent direction 𝑝 is shown as a comparison.
4.5.1 Trust Region Sizing Strategy


This section presents an algorithm for updating the size of the trust
region at each iteration. The trust region can grow, shrink, or remain the
same, depending on how well the model predicts the actual function
decrease. The metric we use to assess the model is the actual function
decrease divided by the expected decrease

$$
r = \frac{f(x) - f(x + s)}{\tilde{f}(0) - \tilde{f}(s)} . \qquad (4.69)
$$

The denominator in this definition is the expected decrease, which is


always positive. The numerator is the actual change in the function,
which could be a reduction or an increase. An 𝑟 value close to unity
means that the model agrees well with the actual function. An 𝑟 value
larger than one is fortuitous and means the actual decrease was even
greater than expected. A negative value of 𝑟 means that the function
actually increased at the expected minimum, and therefore the model
is not suitable.
The trust-region sizing strategy detailed in Alg. 4.25 determines the
size of the trust region at each major iteration 𝑘 based on the value
of 𝑟(𝑘). The parameters in this algorithm are not derived from any
theory but are rather empirical. This example uses the basic procedure
from Nocedal et al.,58 but with recommended parameters from Conn
et al.64 The initial value of Δ is usually 1, assuming the problem is
already well scaled. One way to rationalize the trust-region method is
that the quadratic approximation of a nonlinear function is in general
reasonable only within a limited region around the current point 𝑥(𝑘).
We can overcome this limitation by minimizing the quadratic function
within a region around 𝑥(𝑘) within which we trust the quadratic model.
When our model performs well, we expand the trust region. When it
performs poorly, we shrink the trust region. If we shrink the trust region
sufficiently, our local model should eventually be a good approximation
of the real function, as dictated by the Taylor series expansion. We
should also set a maximum trust-region size, Δmax, to prevent the trust
region from expanding too much. Otherwise, if we have good fits over
part of the design space, it may take too long to reduce the trust region
to an acceptable size over other portions of the design space where
a smaller trust region is needed. The same stopping criteria used in
other gradient-based methods are applicable.¶

58. Nocedal et al., Numerical Optimization. 2006
64. Conn et al., Trust Region Methods. 2000
¶ Conn provides more detail on trust-region problems, including trust-region norms
and scaling, approaches to solving the trust-region subproblem, extensions to the
model, and other important practical considerations.

Algorithm 4.25: Trust-region algorithm

Inputs:
𝑥 (0) : Starting point


Δ(0) : Initial size of the trust region
Outputs:
𝑥 ∗ : Optimal point

while not converged do


Compute or estimate the Hessian
Solve (approximately) for 𝑠 (𝑘) Using (4.67)
Compute 𝑟 (𝑘) Using (4.69)

⊲ Resize trust region


if 𝑟 (𝑘) ≤ 0.05 then Poor model
Δ(𝑘+1) = Δ(𝑘) /4 Shrink trust region
𝑠 (𝑘) = 0 Reject step
else if 𝑟 (𝑘) ≥ 0.9 and k𝑠 (𝑘) k = Δ(𝑘) then Good model and step to edge
Δ(𝑘+1) = min(2Δ(𝑘) , Δmax ) Expand trust region
else Reasonable model and step within trust region
Δ(𝑘+1) = Δ(𝑘) Maintain trust region size
end if
𝑥 (𝑘+1) = 𝑥 (𝑘) + 𝑠 (𝑘) Update location of trust region
𝑘 = 𝑘+1 Update iteration count
end while
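The following Python sketch mirrors the sizing logic of Alg. 4.25. For brevity it solves the trust-region subproblem only very approximately with the Cauchy point (the minimizer of the quadratic model along the steepest descent direction, clipped to the trust region); practical implementations solve the subproblem more accurately, and the parameter choices here are assumptions.

import numpy as np

def cauchy_step(g, B, delta):
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(1.0, np.linalg.norm(g)**3 / (delta * gBg))
    return -tau * delta / np.linalg.norm(g) * g

def trust_region(f, grad, hess, x0, delta0=1.0, delta_max=5.0, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    delta = delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < 1e-12:
            return x
        s = cauchy_step(g, B, delta)
        expected = -(g @ s + 0.5 * s @ B @ s)      # f~(0) - f~(s)
        r = (f(x) - f(x + s)) / expected           # Eq. 4.69
        if r <= 0.05:                              # poor model: shrink and reject step
            delta /= 4.0
            s = np.zeros_like(x)
        elif r >= 0.9 and np.isclose(np.linalg.norm(s), delta):
            delta = min(2.0 * delta, delta_max)    # good model, step at edge: expand
        if r > 0.05 and np.linalg.norm(s) <= tol:
            return x + s                           # step size below tolerance
        x = x + s
    return x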

Example 4.26: Minimizing total potential energy of spring system with trust
region
Minimizing the total potential energy function from Ex. 4.22 using
a trust-region method starting from the same points as before yields the
optimization path shown in Fig. 4.44. The initial trust-region size is Δ = 0.3 and
the maximum allowable is Δ = 1.5. The convergence criterion is based on the
difference in subsequent iterations, such that ‖𝑥(𝑘) − 𝑥(𝑘−1)‖ ≤ 10⁻⁸. The first
few quadratic approximations do not have a minimum because the function
has negative curvature around the starting point, but the trust region prevents
steps that are too large. When it gets close enough to the bowl containing the
minimum, the quadratic approximation has a minimum and the trust-region
subproblem yields a minimum within the trust region. In the last few iterations,
the quadratic is a good model and therefore the region remains large.

4.5.2 Comparison with Line Search Methods


Generally speaking, trust-region methods are more strongly dependent
on accurate Hessians than line search methods. For this reason, they
are usually only effective when exact gradients (or better yet an exact
Hessian) can be supplied. In fact, several optimization packages require
the user to provide the full Hessian in order to use a trust-region
approach. Trust-region methods generally require fewer iterations
than quasi-Newton methods, but each iteration is more computationally
expensive because of the need for at least one matrix factorization.

Figure 4.44: Minimizing the total potential for the two-spring system using a
trust-region method (iterates shown at 𝑘 = 0, 3, 5, 8, 11, and 15).
Scaling can also be more challenging with trust-region approaches.
Newton’s method is invariant with scaling, but the use of a Euclidean
trust-region constraint implicitly assumes that the function changes in
each direction at a similar rate. Some enhancements try to address this
issue through the use of elliptical trust regions rather than spherical
ones.

Example 4.27: Minimizing the Rosenbrock function using trust region

We now test the trust-region method on the Rosenbrock function. The


overall path is similar to the other second-order methods. The initial trust-region
size is Δ = 1 and the maximum allowable is Δ = 5. The convergence criterion is
based on the difference in subsequent iterations, such that ‖𝑥(𝑘) − 𝑥(𝑘−1)‖ ≤ 10⁻⁸.
At any given point, the direction of maximum curvature of the quadratic
approximation matches the maximum curvature across the valley and rotates
as we track the bottom of the valley toward the minimum.
Figure 4.45: Minimization of the Rosenbrock function using a trust-region method
(iterates shown at 𝑘 = 0, 3, 7, 12, 17, and 35).

Tip 4.28: Accurate derivatives matter.

The effectiveness of gradient-based methods depends strongly on providing
accurate gradients. Convergence difficulties, or apparent multimodal behavior,
are often mistakenly identified as fundamental modeling issues when in reality
the numerical issues are caused by inaccurate gradients. Chapter 6 is devoted
to the subject of obtaining accurate derivatives.

4.6 Summary

Unsurprisingly, the gradient plays a crucial role in gradient-based


optimization: it points in the direction of steepest increase, and when
the gradient is zero we know we have reached a stationary point. If we
use descent directions in our search then the stationary point will be a
local minimum. Although the negative gradient points in the direction
of steepest descent, we have seen that following this direction is generally
not the best approach because it is prone to oscillation. Second-order
methods (e.g., Newton’s method) use curvature information to greatly
improve the rate of convergence. Because supplying second derivatives
is often prohibitive, we typically use quasi-Newton methods where the


Hessian is estimated from changes in the gradients.
All gradient-based methods are characterized by how they choose
the search direction 𝑝 (𝑘) . The four methods we have discussed yield
the following search directions:

$$
\begin{aligned}
\text{Steepest descent:} \quad & p^{(k)} = -\nabla f^{(k)} \\
\text{Conjugate gradient:} \quad & p^{(k)} = -\nabla f^{(k)} + \beta^{(k)} p^{(k-1)} \\
\text{Newton:} \quad & p^{(k)} = -H^{(k)^{-1}} \nabla f^{(k)} \\
\text{Quasi-Newton:} \quad & p^{(k)} = -V^{(k)} \nabla f^{(k)}
\end{aligned}
\qquad (4.70)
$$

After determining an appropriate descent direction we need to


determine how far to travel in that direction. This is the purpose of
a line search. Although one might ideally want to find the minimum
along that line, this turns out to be wasteful and instead we rely on
criteria that define when the point is “good enough”. These criteria
look for sufficient decrease, and a flattening of the curvature. Once a
point satisfies these conditions we repeat the process and select a new
search direction.
Trust regions provide an alternative approach where instead of
using line searches, we define a region (typically a sphere), and solve
a surrogate optimization problem within that region. Line search
methods are more commonly used, but trust region approaches can be
particularly effective if we are able to supply second derivatives.

Problems

4.1 Answer true or false and justify your answer.

a) Gradient-based optimization requires the function to be


continuous and infinitely differentiable.
b) Gradient-based methods perform a local search.
c) Gradient-based methods are only effective for problems with
one minimum.
d) When the gradient is projected into a given direction, we get
a vector pointing aligned with that direction.
e) The Hessian of a unimodal function is positive-definite or
positive-semidefinite everywhere.
f) Each column 𝑗 of the Hessian quantifies the rate of change
of component 𝑗 of the gradient vector with respect to all
coordinate directions 𝑖.
g) If the function curvature at a point is zero in some direction,


that point cannot be a local minimum.
h) A globalization strategy in a gradient-based algorithm en-
sures convergence to the global minimum.
i) The goal of the line search is to find the minimum along a
given direction.
j) For minimization, the line search must always start in a
descent direction.
k) The direction in the steepest descent algorithm for a given
iteration is orthogonal to the direction of the previous itera-
tion.
l) Newton’s method is not affected by problem scaling.
m) Quasi-Newton methods approximate the function Hessian
by finite-differencing gradients.
n) Overall, Newton’s with a line search is the best choice among
gradient-based methods because it uses exact second-order
information.
o) The trust region method does not require a line search.

4.2 Consider the function

$$
f(x_1, x_2, x_3) = x_1^2 x_2 + 4 x_2^4 - x_2 x_3 + x_3^{-1} .
$$

Find the gradient of this function. Where is the gradient not


defined? Calculate the directional derivative of the function at
𝑥 𝐴 = (2, −1, 5) in the direction 𝑝 = [6, −2, 3]. Find the Hessian
of this function. Is the curvature in the direction 𝑝 positive or
negative? Write the second-order Taylor’s series expansion of this
function. Plot the Taylor series values along the 𝑝 direction and
compare it to the actual function. Plot the contours of the Taylor
series expansion about 𝑥 𝐴 and compare them to the contours of
the original function. Zoom in your plot around 𝑥 𝐴 until the
two sets of contours are indistinguishable. What order of
magnitude in 𝑥 did you end up with?

4.3 Consider the function from Ex. 4.1,

$$
f(x_1, x_2) = x_1^3 + 2 x_1 x_2^2 - x_2^3 - 20 x_1 . \qquad (4.71)
$$

Find the critical points of this function analytically and classify


them. What is the global minimum of this function?
4.4 Review Kepler’s wine barrel story from Section 2.2. Approximate
the barrel as a cylinder and find the height and diameter of a
barrel that maximizes its volume for a diagonal measurement of
1 m.

4.5 Consider the function:

𝑓 = 𝑥14 + 3𝑥13 + 3𝑥22 + 6𝑥1 𝑥2 + 6𝑥2 − 8𝑥 2 .

Find the critical points analytically and classify them. Where is


the global minimum? Plot the function contours to verify your
results.

4.6 Consider a slightly modified version of the function from Prob. 4.5,
where we add a 𝑥24 term to get

𝑓 = 𝑥14 + 𝑥24 + 3𝑥13 + 3𝑥 22 + 6𝑥 1 𝑥2 + 6𝑥 2 − 8𝑥2 .

Can you find the critical points analytically? Plot the function
contours. Locate the critical points graphically and classify them.

4.7 Line search algorithm implementation. Implement the two line search
algorithms from Section 4.3, such that they work in 𝑛 dimensions
(𝑥 and 𝑝 can be vectors of any size).

a) As a first test for your code, reproduce the results from the
examples in Section 4.3 and plot the function and iterations
for both algorithms. For the line search that satisfies the
strong Wolfe conditions, reduce the value of 𝜇2 until you get
an exact line search. How much accuracy can you achieve?
b) Test your code on another easy two-dimensional function,
such as the bean function from Ex. 4.15, starting from differ-
ent points and using different directions (but remember that
you must always provide valid descent direction, otherwise
the algorithm might not work!). Does it always find a suit-
able point? Exploration: Try different values of 𝜇2 and 𝜌 to
analyze their effect on the number of iterations.
c) Apply your line search algorithms to the two-dimensional
Rosenbrock function and then the 𝑛-dimensional variant
(see Appendix C.1.2). Again, try different points and search
directions to see how robust the algorithm is, and try to tune
𝜇2 and 𝜌.

4.8 Effect of scaling on line search. Consider the one-dimensional


function,
𝑓 (𝑥) = −𝑥 + 𝛾𝑥 2 ,
where 𝛾 > 0 is a parameter. This problem demonstrates the impact


of poor scaling, which is often an issue in practical optimization
problems. Use your line search algorithm to investigate the effect
of function curvature by using three values of 𝛾: 0.5, 10, and 104 .
Start from 𝑥 = 0 using a reduction parameter value of 𝜌 = 0.5,
a sufficient decrease parameter of 𝜇1 = 10−6 , and an initial step
length parameter of 𝛼 = 1. Your search direction in this case is
just a scalar, 𝑝 = 1.

a) How many function evaluations are required for the three


different values of 𝛾 for each of the algorithms? Observe
and explain the trends.
b) For the highest value of 𝛾, scale 𝑥 to make the problem better
conditioned and verify that it works.
c) Exploration: Try different starting points (you might have to
set 𝑝 = −1, depending on the direction of descent).

4.9 Optimization algorithm implementation. Program the steepest


descent, conjugate gradient, and BFGS algorithms from Section 4.4.
You must have a thoroughly tested line search algorithm from
the previous exercise first. For the gradients, differentiate the
functions analytically and compute them exactly. Solve each
problem using your implementations of the various algorithms,
as well as off-the-shelf optimization software for comparison.

a) For your first test problem, reproduce the results from the
examples in Section 4.4.
b) Minimize the two-dimensional Rosenbrock function (see
Appendix C.1.2) using the various algorithms and compare
your results starting from 𝑥 = (−1, 2). Compare the total
number of evaluations. Compare the number of minor
versus major iterations. Discuss the trends. Exploration: Try
different starting points and tuning parameters (e.g., 𝜌 and
𝜇2 in the line search) and compare the number of major and
minor iterations.
c) Benchmark your algorithms on the 𝑛-dimensional variant
of the Rosenbrock function (see Appendix C.1.2). Try 𝑛 = 3
and 𝑛 = 4 first, then 𝑛 = 8, 16, 32, . . .. What is the highest
number of dimensions you can solve? How does the number
of function evaluations scale with the number of variables?
d) Optional: Implement L-BFGS and compare it with BFGS.
4.10 Trust region implementation. Implement a trust-region algorithm


and apply it to one or more of the test problems from the previous
exercise. Compare the trust region results with BFGS and the
off-the-shelf software.

4.11 Aircraft wing design. We will solve the aircraft wing design problem
described in Appendix C.1.6.

4.12 Brachistochrone problem. The brachistochrone problem seeks to find


the path that minimizes travel time between two points for a
particle under the force of gravity (think of a bead constrained to
slide on a wire whose shape you control). This was mentioned in
Section 2.2 as one of the problems that inspired the developments
in calculus of variations. Solve the discretized version of this
problem (see Appendix C.1.7 for a detailed description).

a) Plot the optimal path for the frictionless case with 𝑛 = 10


and compare it to the exact solution (see Appendix C.1.7).
b) Solve the optimal path with friction and plot the resulting
path. Report the travel time between the two points and
compare it to the frictionless case.
c) Study the effect of increased problem dimensionality. Start
with 4 points and double the dimension each time up to
128 points. Plot and discuss the increase in computational
expense with problem size. Example metrics include the
number of major iterations, function evaluations, and com-
putational time. Hint: When solving the higher-dimensional
cases, start with the solution interpolated from a lower-
dimensional case—this is called a warm start.
5  Constrained Gradient-Based Optimization
Engineering design optimization problems are rarely unconstrained.
In this chapter, we explain how to solve constrained problems. The
methods in this chapter build on the gradient-based unconstrained
methods from Chapter 4 and also assume smooth functions. We first
introduce the optimality conditions for a constrained optimization
problem and then focus on three main types of methods for handling
constraints: penalty functions, sequential quadratic optimization, and
interior-point methods.
Penalty methods are no longer used in constrained gradient-based
optimization because they have been replaced by more effective meth-
ods, but the concept of a penalty is useful when thinking about con-
straints, and is used as part of more sophisticated methods.
Sequential quadratic optimization and interior-point methods rep-
resent the state-of-the-art in nonlinear constrained optimization. We
introduce the basics for these two methods, but a complete and robust
implementation of these two methods requires rather detailed knowl-
edge of a growing body of literature that is not covered here. There are
many other methods not covered in this chapter, but they are either less
effective, or are only more effective for certain problems.
At the end of the chapter, we discuss merit functions and filters.
These considerations are an important part of the line search in both
constrained optimization approaches.

By the end of this chapter you should be able to:

1. Describe the mathematical definition of optimality for a


constrained problem.

2. Understand the motivation for and limitations of penalty


methods.

3. Understand the concepts behind state-of-the-art con-


strained optimization algorithms and be able to use them
to solve real engineering problems.


5.1 Constrained Problem Formulation

We can express a general constrained optimization problem as

$$
\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{by varying} \quad & x_i , \quad i = 1, \ldots, n_x \\
\text{subject to} \quad & g_j(x) \leq 0 , \quad j = 1, \ldots, n_g \\
& h_k(x) = 0 , \quad k = 1, \ldots, n_h \\
& \underline{x}_i \leq x_i \leq \overline{x}_i , \quad i = 1, \ldots, n_x
\end{aligned}
\qquad (5.1)
$$

where the 𝑔𝑗(𝑥) are the inequality constraints, ℎ𝑘(𝑥) are the equality
constraints, and $\underline{x}$ and $\overline{x}$ are lower and upper bound constraints on
the design variables. Both objective and constraint functions can
be nonlinear, but they should be 𝐶² continuous to be solved using
gradient-based optimization algorithms. The inequality constraints
are expressed as “less than” without loss of generality because they
can always be converted to “greater than” by putting a negative sign
on 𝑔𝑗. We could also eliminate the equality constraints without loss
of generality by replacing each one with two inequality constraints, ℎ𝑘 ≤ 𝜖
and −ℎ𝑘 ≤ 𝜖, where 𝜖 is some small number. In practice, numerical
precision and the implementations of many methods make it desirable
to distinguish between equality and inequality constraints.

Example 5.1: Graphical solution of constrained problem.

Consider the following two-variable problem with quadratic objective and


constraint functions:
$$
\begin{aligned}
\text{minimize} \quad & f(x_1, x_2) = \frac{1}{2} x_1^2 - x_1 - x_2 - 2 \\
\text{by varying} \quad & x_1, x_2 \\
\text{subject to} \quad & g_1(x_1, x_2) = x_1^2 - 4 x_1 + x_2 + 1 \leq 0 \\
& g_2(x_1, x_2) = \frac{1}{2} x_1^2 + x_2^2 - x_1 - 4 \leq 0 .
\end{aligned}
\qquad (5.2)
$$
We can plot the contours of the objective function and the constraint lines
(𝑔1 = 0 and 𝑔2 = 0), as shown in Fig. 5.1. We can see the feasible region
defined by the two constraints, and the approximate location of the minimum
is evident by inspection. We are only able to visualize the contours for this
problem because the functions can be evaluated quickly and because it has only
two dimensions. If the functions were more expensive, we would not be able
to afford the many evaluations needed to plot contours. If the problem had
more dimensions, it would become difficult or impossible to fully visualize the
functions and feasible space.

Figure 5.1: Graphical solution for the constrained problem.
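As a minimal sketch (not part of the book's example), the problem in Eq. 5.2 can be solved numerically with an off-the-shelf constrained optimizer such as scipy's SLSQP. Note that scipy expects inequality constraints in the form 𝑐(𝑥) ≥ 0, so the 𝑔 ≤ 0 constraints are negated; the starting point is an assumption.

import numpy as np
from scipy.optimize import minimize

def f(x):
    return 0.5 * x[0]**2 - x[0] - x[1] - 2.0

def g1(x):
    return x[0]**2 - 4.0 * x[0] + x[1] + 1.0

def g2(x):
    return 0.5 * x[0]**2 + x[1]**2 - x[0] - 4.0

constraints = [
    {'type': 'ineq', 'fun': lambda x: -g1(x)},   # enforces g1(x) <= 0
    {'type': 'ineq', 'fun': lambda x: -g2(x)},   # enforces g2(x) <= 0
]
res = minimize(f, x0=[1.0, 1.0], method='SLSQP', constraints=constraints)
print(res.x, res.fun)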
Tip 5.2: Do not mistake constraints for objectives.

Often a metric is posed as an objective (usually with multiple objectives) when


it is probably more appropriate as a constraint. This topic is discussed in more
detail in Chapter 9.

The constrained problem formulation above does not distinguish


between nonlinear and linear constraints or variable bounds. While
it is advantageous to make this distinction, because some algorithms
can take advantage of these differences, the methods introduced in this
chapter will just assume general nonlinear functions.

Tip 5.3: Do not specify bounds as nonlinear constraints.

Bounds are a special category and the simplest form of a constraint:


$$
\underline{x}_i \leq x_i \leq \overline{x}_i
$$
Bounds are treated differently algorithmically and so should be specified as a
bound constraint, rather than as a general nonlinear constraint. Some bounds
will come from physical limitations on the engineering system. If not otherwise
limited, the bounds should be sufficiently wide so as to not artificially constrain
the problem. It is good practice to check your solution against your bounds to
make sure you haven’t artificially constrained the problem.

We need the Jacobian of the constraints throughout this chapter.


The indexing order we use is:
$$
[\nabla h]_{ij} =
\begin{bmatrix} \nabla h_1^T \\ \vdots \\ \nabla h_{n_h}^T \end{bmatrix}
=
\begin{bmatrix}
\frac{\partial h_1}{\partial x_1} & \cdots & \frac{\partial h_1}{\partial x_{n_x}} \\
\vdots & \ddots & \vdots \\
\frac{\partial h_{n_h}}{\partial x_1} & \cdots & \frac{\partial h_{n_h}}{\partial x_{n_x}}
\end{bmatrix}
= \frac{\partial h_i}{\partial x_j} , \qquad (5.3)
$$
so this Jacobian is a matrix of size 𝑛 ℎ × 𝑛 𝑥 . The Jacobian of the inequality
constraints uses the same index ordering.
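As a minimal illustration of this convention (not from the book), the Jacobian of the inequality constraints from Example 5.1 can be approximated by forward finite differences; each row is the gradient of one constraint, so the result has shape (𝑛𝑔, 𝑛𝑥). The functions g1 and g2 are the ones defined in the earlier sketch.

import numpy as np

def constraint_jacobian(g_funcs, x, h=1e-7):
    x = np.asarray(x, dtype=float)
    J = np.zeros((len(g_funcs), x.size))      # shape (n_g, n_x)
    for i, g in enumerate(g_funcs):
        for j in range(x.size):
            xp = x.copy()
            xp[j] += h
            J[i, j] = (g(xp) - g(x)) / h      # entry dg_i / dx_j
    return J

J = constraint_jacobian([g1, g2], [1.0, 1.0])
print(J.shape)   # (2, 2): one row per constraint, one column per variable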

5.2 Optimality Conditions

The optimality conditions for constrained optimization problems are


not as straightforward as those for unconstrained optimization (Sec-
tion 4.1.4). We begin with equality constraints because the mathematics
and intuition are simpler, and then add inequality constraints. As in
the case of unconstrained optimization, the optimality conditions for
constrained problems are used not only for the termination criteria, but
they are also used as the basis for optimization algorithms.
5.2.1 Equality Constraints


First, we review the optimality conditions for an unconstrained problem.
We can take a first-order Taylor’s series expansion of the objective function
with some step 𝑝 that is small enough that the second-order term is negligible
and write

𝑓 (𝑥 + 𝑝) ≈ 𝑓 (𝑥) + ∇ 𝑓 (𝑥)𝑇 𝑝. (5.4)

If 𝑥 ∗ is a minimum point then every point in a small neighborhood


must have a greater value,

𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ). (5.5)

Given the Taylor series expansion (5.4), the only way that this inequality
can be satisfied is if
∇ 𝑓 (𝑥 ∗ )𝑇 𝑝 ≥ 0. (5.6)
For a given ∇ 𝑓 (𝑥 ∗ ), there are always an infinite number of directions
along which the function decreases, which correspond to the halfspace
shown in Fig. 5.2. If the problem is unconstrained then 𝑝 can be in any
direction and the only way to satisfy this inequality is if ∇ 𝑓 (𝑥 ∗ ) = 0.
Therefore, ∇ 𝑓 (𝑥 ∗ ) = 0 is a necessary condition for an unconstrained
minimum.

Figure 5.2: The gradient ∇𝑓(𝑥), which is the direction of steepest function increase,
splits the design space into two halves. Here we highlight the halfspace of directions
that result in function decrease.
Now consider the constrained case. The function increase condi-


tion (5.6) still applies, but 𝑝 can only take feasible directions. To find the
feasible directions, we can write a first-order Taylor series expansion
for each equality constraint function as

ℎ 𝑗 (𝑥 + 𝑝) ≈ ℎ 𝑗 (𝑥) + ∇ℎ 𝑗 (𝑥)𝑇 𝑝, 𝑗 = 1, . . . , 𝑛 ℎ . (5.7)

Again, the step size is assumed to be small enough so that the higher-
order terms are negligible. Assuming we are at a feasible point, then
ℎ 𝑗 (𝑥) = 0 for all constraints 𝑗. To remain feasible, the step, 𝑝, must be
5 Constrained Gradient-Based Optimization 127

such that the new point is also feasible, i.e., ℎ 𝑗 (𝑥 + 𝑝) = 0 for all 𝑗. This
implies that the feasibility of the new point requires

∇ℎ 𝑗 (𝑥)𝑇 𝑝 = 0, 𝑗 = 1, . . . , 𝑛 ℎ , (5.8)

which means that a direction is feasible when it is perpendicular to all equality


constraint gradients. Another way to look at this is that the feasible
directions are in the intersection of the hyperplanes corresponding to
the tangent of each constraint. These hyperplanes have at least one point
in common by construction (the point we are evaluating, 𝑥). Assuming
distinct hyperplanes (linearly independent constraint gradients), their
intersection defines a hyperplane with 𝑛 𝑥 − 𝑛 ℎ degrees of freedom (see
Fig. 5.3 for 2-D and 3-D illustrations).

Feasible space Feasible space


1 DOF 0 DOF
∇ℎ1 (𝑥)

𝑥
𝑥

∇ℎ(𝑥) ∇ℎ2 (𝑥)


ℎ2 (𝑥) = 0
ℎ(𝑥) = 0 ℎ1 (𝑥) = 0

∇ℎ2 (𝑥)

∇ℎ(𝑥) ∇ℎ1 (𝑥)

Figure 5.3: Feasible spaces for 2-


𝑥 𝑥 D examples with one constraint
(upper left), and two constraints
(upper right); 3-D examples with
one constraint (lower left) and two
Feasible space Feasible space constraints (lower right).
2 DOF 1 DOF

For a point to be a constrained minimum,

∇ 𝑓 (𝑥 ∗ )𝑇 𝑝 ≥ 0 (5.9)

for all 𝑝 such that

∇ℎ 𝑗 (𝑥 ∗ )𝑇 𝑝 = 0, 𝑗 = 1, . . . , 𝑛 ℎ . (5.10)

As previously mentioned, the feasible directions form a hyperplane


in 𝑛 𝑥 dimensions. The intersection of this hyperplane with the half
space containing all the descent directions (𝑝 such that ∇ 𝑓 (𝑥 ∗ )𝑇 𝑝 < 0)
must be zero. For this to happen, the only possibility to satisfy the
5 Constrained Gradient-Based Optimization 128

inequality Eq. 5.9 is the case when it is zero. This is because a hyperplane
in 𝑛 𝑥 dimensions includes directions in the descent halfspace of the
same dimensions unless the hyperplane is perpendicular to ∇ 𝑓 (𝑥 ∗ ) (see
Fig. 5.4 for an illustration in 3-D space).

Figure 5.4: All feasible directions


∇ 𝑓 (𝑥) must be contained in the hyper-
∇ 𝑓 (𝑥)
plane perpendicular to the gradient
∇ℎ(𝑥) of the objective function so that
∇ℎ(𝑥) there are no feasible descent direc-
tions. Here we show 3-D example
with one constraint for a point sat-
isfying optimality (left) and a point
not satisfying optimality (right).

Another way of stating this condition is that the projection of ∇ 𝑓 (𝑥 ∗ )


onto all possible feasible directions must be zero. That is because if the
projection were nonzero, there would be a feasible direction that is also
a descent direction.
For the hyperplane defining the descent halfspace to align with the
hyperplane of feasible directions, the gradient of the objective must be a
linear combination of the gradients of the constraints, i.e.,

Õ
𝑛ℎ
∇ 𝑓 (𝑥 ∗ ) = − 𝜆 𝑗 ∇ℎ 𝑗 (𝑥 ∗ ), (5.11)
𝑗=1

where 𝜆 𝑗 are called the Lagrange multipliers. There is a multiplier


associated with each constraint. The sign is arbitrary for equality
constraints, but will be significant later when dealing with inequality
constraints and we choose the negative sign for consistency with this
latter case.
To derive the full set of optimality conditions for constrained prob-
lems, it is convenient to define the Lagrangian function,

ℒ(𝑥, 𝜆) = 𝑓 (𝑥) + 𝜆𝑇 ℎ(𝑥), (5.12)

where 𝜆 is the vector of Lagrange multipliers defined above, which are


now unknown variables as well. ∗ The Lagrangian is defined such that ∗ Despite our convention of reserving
Greek symbols for scalars, we use 𝜆 to rep-
its stationary points are candidate optima for the constrained problem. resent the vector of Lagrange multipliers
To find the stationary points, we can solve for ∇ℒ = 0. Since ℒ is a as it is common usage.

function of both 𝑥 and 𝜆 we need to set both partial derivatives equal


5 Constrained Gradient-Based Optimization 129

to zero as follows,

𝜕ℒ 𝜕𝑓 Õ
𝑛ℎ
𝜕ℎ 𝑗
= + 𝜆𝑗 = 0, 𝑖 = 1, . . . , 𝑛 𝑥 ,
𝜕𝑥 𝑖 𝜕𝑥 𝑖 𝜕𝑥 𝑖
𝑗=1
(5.13)
𝜕ℒ
= ℎ 𝑗 = 0, 𝑗 = 1, . . . , 𝑛 ℎ .
𝜕𝜆 𝑗

The first condition is the constrained optimality condition we explained


previously in Eq. 5.11. The second condition enforces the equality
constraints, which must still be enforced because the first condition
could be satisfied at infeasible points.
With the Lagrangian function, we have transformed a constrained
problem into an unconstrained problem by adding new variables, 𝜆. A
constrained problem of 𝑛 𝑥 design variables and 𝑛 ℎ equality constraints
was transformed into an unconstrained problem with 𝑛 𝑥 + 𝑛 ℎ variables.
Although you might be tempted to simply use the algorithms of
Chapter 4 to solve the optimality conditions (5.13) for an unconstrained
Lagrangian function, some modifications are needed in the algorithms
to solve these problems effectively.
The optimality conditions above are first-order conditions that are
necessary, but not sufficient. To make sure that a point is a constrained
minimum, we also need to satisfy second-order conditions. For the
unconstrained case, the Hessian of the objective function had to be
positive definite. In the constrained case, we need to check the Hessian
of the Lagrangian with respect to the design variables in the space of
feasible directions. Therefore, the second order sufficient conditions
are:
𝑝 𝑇 [∇𝑥𝑥 ℒ] 𝑝 > 0, (5.14)
for all feasible 𝑝, so the projection of the curvature onto all feasible
directions must be positive. The feasible directions are all directions 𝑝
such that
∇ℎ 𝑗 (𝑥)𝑇 𝑝 = 0, 𝑗 = 1, . . . , 𝑛 ℎ . (5.15)
That is, the feasible directions are in the null space of the Jacobian of
the constraints.

Example 5.4: Simple equality-constrained problem

Consider the following constrained problem featuring a linear objective


function and a quadratic equality constraint:

minimize 𝑓 (𝑥1 , 𝑥2 ) = 𝑥 1 + 2𝑥2


1 (5.16)
subject to ℎ(𝑥1 , 𝑥2 ) = 𝑥12 + 𝑥22 − 1 = 0.
4
5 Constrained Gradient-Based Optimization 130

The Lagrangian for this problem is


 
1
ℒ(𝑥1 , 𝑥2 , 𝜆) = 𝑥 1 + 2𝑥2 + 𝜆 𝑥12 + 𝑥22 − 1 . (5.17)
4

Differentiating this to get the first-order optimality conditions,

𝜕ℒ 1
= 1 + 𝜆𝑥1 = 0
𝜕𝑥 1 2
𝜕ℒ
= 2 + 2𝜆𝑥 2 = 0 (5.18)
𝜕𝑥 2
𝜕ℒ 1
= 𝑥12 + 𝑥22 − 1 = 0.
𝜕𝜆 4
Solving these three equations for the three unknowns (𝑥1 , 𝑥2 , 𝜆), we obtain two
possible solutions:
  " √ #
𝑥1 − 2 √
𝑥𝐴 = = √ , 𝜆𝐴 = 2,
𝑥2 − 22
  "√ # (5.19)
𝑥 2 √
𝑥 𝐵 = 1 = √2 , 𝜆𝐵 = − 2.
𝑥2
2

These two points are shown in Fig. 5.5, together with the objective and

2
∇𝑓

∇ℎ
1
Maximum
∇𝑓 𝑥𝐵
𝑥2 0

Minimum
𝑥𝐴
−1

∇ℎ Figure𝑥2 5.5: Two points satisfy the


−2 first-order KKT conditions; one is a
2
constrained minimum and the other
−3 −2 −1 0 1 2 3 is a constrained maximum.
𝑥1

constraint gradients, which align with each other as expected. To determine if 𝑥∗


either of these points is a minimum, we check the second-order conditions by
evaluating the Hessian of the Lagrangian, −2

1  −2 0 2
𝜆 0 𝑥1
∇𝑥𝑥 ℒ = 2 . (5.20)
0 2𝜆
Figure 5.6: The minimum of the
√ Lagrangian function with the op-
The Hessian is only positive definite for the case where 𝜆𝐴 = 2 and therefore
timum√ Lagrange multiplier value
𝑥 𝐴 is an optimum. Recall that the Hessian only needs to be positive definite in
(𝜎 = 2) is the constrained mini-
the feasible directions, but here we can easily show that it is positive or negative mum of the original problem.
5 Constrained Gradient-Based Optimization 131

definite in all possible directions. The Hessian is negative definite for 𝑥 𝐵 , so


this is not a minimum; instead, it is a maximum in this case.

5.2.2 Inequality Constraints


We can reuse the optimality conditions we derived for equality con-
straints for the inequality constrained problems. The key insight is
that the equality conditions apply to inequality constraints that are
active, while inactive constraints can be ignored. Recall that for a
general inequality constraint 𝑔 𝑗 (𝑥) ≤ 0, constraint 𝑗 is said to be active if
𝑔 𝑗 (𝑥 ∗ ) = 0 and inactive if 𝑔 𝑗 (𝑥 ∗ ) < 0.
As before, if 𝑥 ∗ is an optimum, any small enough step 𝑝 from the
optimum must result in a function increase. Based on the Taylor series
expansion (5.4), we get the condition

∇ 𝑓 (𝑥 ∗ )𝑇 𝑝 ≥ 0, (5.21)

which is the same as for the equality constrained case.


To consider the constraints, we use the same linearization of the
constraint (5.7), but now we enforce an inequality to get

𝑔 𝑗 (𝑥 + 𝑝) ≈ 𝑔 𝑗 (𝑥) + ∇𝑔 𝑗 (𝑥)𝑇 𝑝 ≤ 0, 𝑗 = 1, . . . , 𝑛 𝑔 . (5.22)

For a given candidate point that satisfies the constraints, there are
two possibilities to consider for each constraint: whether the constraint
is inactive (𝑔 𝑗 (𝑥) < 0) or active (𝑔 𝑗 (𝑥) = 0). If a given constraint is
inactive, then we do not need to add any conditions because we can
take a step, 𝑝, in any direction and remain feasible as long as the step is
small enough. If a given constraint is active, then we can treat it as an
equality constraint.
Thus, the optimality conditions derived for the equality constrained
case can be reused here, but with a crucial modification. First, the
requirement that the gradient of the objective is a linear combination
of the gradients of the constraints, only needs to consider the active
constraints. This can be written as
Õ
𝑛 𝑔,active
∇ 𝑓 (𝑥 ∗ ) = − 𝜎 𝑗 ∇𝑔 𝑗 (𝑥 ∗ ), (5.23)
𝑗=1

where 𝜎 is the Lagrange multiplier for the inequality constraints, and


the summation occurs only over the active constraints.
Second, the sign of the Lagrange multipliers is now significant, and
must be positive (by our convention) for a feasible optimum. This is
5 Constrained Gradient-Based Optimization 132

because the feasible space is no longer a hyperplane, but the intersection


of the halfspaces defined by the constraints. An illustration of a 2-D
case with one constraint is shown in Fig. 5.7.

Feasible ∇ 𝑓 (𝑥) ∇ 𝑓 (𝑥) ∇ 𝑓 (𝑥)


directions
∇𝑔(𝑥)

∇𝑔(𝑥)

Feasible
descent Descent ∇𝑔(𝑥)
directions directions
𝜎>0 𝜎<0

We need to include all inequality constraints in the optimality Figure 5.7: Constrained minimum
conditions for 2-D case with one
conditions because we do not know in advance which constraints are
inequality constraint. The objective
active. To do this, we replace the inequality constraints 𝑔 𝑘 ≤ 0 with the function gradient must be parallel
equality constraints: and have opposite directions (cor-
responding to a positive Lagrange
multiplier) so that there are no
𝑔 𝑘 + 𝑠 2𝑘 = 0, 𝑘 = 1, . . . , 𝑛 𝑔 (5.24) feasible descent directions.

where 𝑠 𝑘 is a new unknown associated with each inequality constraint


called a slack variable. The slack variable is squared to ensure it is positive,
so that 𝑔 𝑘 is nonpositive and thus feasible for any 𝑠 𝑘 . The significance of
the slack variable is that when 𝑠 𝑘 = 0, then the corresponding inequality
constraint is active (𝑔 𝑘 = 0), and when 𝑠 𝑘 ≠ 0, the corresponding
constraint is inactive.
The Lagrangian including both equality and inequality constraints
is then

ℒ(𝑥, 𝜆, 𝜎, 𝑠) = 𝑓 (𝑥) + 𝜆𝑇 ℎ(𝑥) + 𝜎𝑇 𝑔(𝑥) + 𝑠 2 , (5.25)
where 𝜎 are the Lagrange multipliers associated with the inequality
constraints.
Similarly to the equality constrained case, we seek a stationary
point for the Lagrangian, but now we have additional unknowns: the
inequality Lagrange multipliers and the slack variables. Taking partial
derivatives with respect to the design variables and setting them to
zero, we obtain

𝜕ℒ 𝜕𝑓 Õ
𝑛ℎ
𝜕ℎ 𝑗 Õ 𝜕𝑔 𝑘
𝑛𝑔
∇𝑥 ℒ = 0 ⇒ = + 𝜆𝑗 + 𝜎𝑘 = 0, 𝑖 = 1, . . . , 𝑛 𝑥
𝜕𝑥 𝑖 𝜕𝑥 𝑖 𝜕𝑥 𝑖 𝜕𝑥 𝑖
𝑗=1 𝑘=1
(5.26)
This criteria is the same as before, but with additional Lagrange mul-
tipliers and constraints. Taking the derivatives with respect to the
5 Constrained Gradient-Based Optimization 133

equality Lagrange multipliers,

𝜕ℒ
∇𝜆 ℒ = 0 ⇒ = ℎ 𝑗 = 0, 𝑗 = 1, . . . , 𝑛 ℎ , (5.27)
𝜕𝜆 𝑗

which enforced the equality constraints as before. Taking derivatives


with respect to the inequality Lagrange multipliers, we get

𝜕ℒ
∇𝜎 ℒ = 0 ⇒ = 𝑔 𝑘 + 𝑠 2𝑘 = 0 𝑘 = 1, . . . , 𝑛 𝑔 , (5.28)
𝜕𝜎 𝑘
which enforces the inequality constraints. Finally, differentiating the
Lagrangian with respect to the slack variables, we obtain

𝜕ℒ
∇𝑠 ℒ = 0 ⇒ = 2𝜎 𝑘 𝑠 𝑘 = 0, 𝑘 = 1, . . . , 𝑛 𝑔 , (5.29)
𝜕𝑠 𝑘
which is called the complementarity condition. This condition helps us to
distinguish the active constraints from the inactive ones. For each in-
equality constraint, either the Lagrange multiplier is zero (which means
that the constraint is inactive), or the slack variable is zero (which means
that the constraint is active). Unfortunately, this condition introduces a
combinatorial problem whose complexity grows exponentially with
the number of inequality constraints, since the number of combinations
of active versus inactive constraints is 2𝑛 𝑔 .
These requirements are called the Karush–Kuhn–Tucker (KKT)
conditions and are summarized below:

𝜕𝑓 Õ
𝑛ℎ
𝜕ℎ 𝑗 Õ 𝜕𝑔 𝑘
𝑛𝑔
+ 𝜆𝑗 + 𝜎𝑘 = 0, 𝑖 = 1, . . . , 𝑛 𝑥
𝜕𝑥 𝑖 𝜕𝑥 𝑖 𝜕𝑥 𝑖
𝑗=1 𝑘=1

ℎ 𝑗 = 0, 𝑗 = 1, . . . , 𝑛 ℎ
(5.30)
𝑔 𝑘 + 𝑠 2𝑘 = 0, 𝑘 = 1, . . . , 𝑛 𝑔
𝜎 𝑘 𝑠 𝑘 = 0, 𝑘 = 1, . . . , 𝑛 𝑔
𝜎 𝑘 ≥ 0, 𝑘 = 1, . . . , 𝑛 𝑔

The last addition, that the Lagrange multipliers associated with the
inequality constraints must be nonnegative, was implicit in the way we
defined the Lagrangian but is now made explicit. As shown in Fig. 5.7,
the Lagrange multiplier for an inequality constraint must be positive
otherwise there is a direction that is feasible and would decrease the
objective function.
The equality and inequality constraints are often lumped together
for convenience, since the expression for the Lagrangian follows the
same form for both cases. As in the equality constrained case, these
5 Constrained Gradient-Based Optimization 134

conditions are necessary but not sufficient. We still need to require


that the Hessian of the Lagrangian be positive definite in all feasible
directions, as stated in Eq. 5.14. In the inequality case, the feasible
directions are all in the null space of the Jacobian of the equality
constraints and active inequality constraints, that is,

∇ℎ 𝑗 (𝑥)𝑇 𝑝 = 0, 𝑗 = 1, . . . , 𝑛 ℎ ,
(5.31)
∇𝑔𝑖 (𝑥) 𝑝 = 0,
𝑇
for all 𝑖 in active set.

Example 5.5: Simple problem with one inequality constraint

Consider a variation of the simple problem (Ex. 5.4), where the equality is
replaced by an inequality as follows:

minimize 𝑓 (𝑥1 , 𝑥2 ) = 𝑥1 + 2𝑥2


1 (5.32)
subject to 𝑔(𝑥1 , 𝑥2 ) = 𝑥12 + 𝑥22 − 1 ≤ 0.
4
The Lagrangian for this problem is

2
∇𝑓

∇𝑔
1
Maximum
∇𝑓 𝑥𝐵
𝑥2 0

Minimum
−1 𝑥𝐵

∇𝑔
−2 Figure 5.8: Inequality problem with
linear objective and feasible space
−3 −2 −1 0 1 2 3 within an ellipse.
𝑥1

 
1
ℒ(𝑥1 , 𝑥2 , 𝜎, 𝑠) = 𝑥 1 + 2𝑥2 + 𝜎 𝑥 12 + 𝑥22 − 1 + 𝑠 2 . (5.33)
4

Differentiating this with respect to all the variables to get the first-order
optimality conditions,

𝜕ℒ 1
=1+ 𝜎𝑥 = 0,
𝜕𝑥1 2 1
𝜕ℒ
= 2 + 2𝜎𝑥 2 = 0,
𝜕𝑥2
(5.34)
𝜕ℒ 1 2
= 𝑥 + 𝑥22 − 1 = 0,
𝜕𝜎 4 1
𝜕ℒ
= 2𝜎𝑠 = 0.
𝜕𝑠
5 Constrained Gradient-Based Optimization 135

Starting with the last equation, there are two possibilities: 𝑠 = 0 (meaning
the constraint is active) and 𝜎 = 0 (meaning the constraint is not active).
However, we can see that setting 𝜎 = 0 in either of the two first equations does
not yield a solution. Assuming that 𝑠 = 0 and 𝜎 ≠ 0, we can solve the equations
to obtain the same points as in Ex. 5.4:

𝑥1  − 2 𝑥1   2 


√ √
       
𝑥 𝐴 =  𝑥2  = − 2  , 𝑥 𝐵 = 𝑥2  =  2  .
√ √
(5.35)
 𝜎   √2   𝜎   √ 2 
   2   − 2

There are the same critical points as in the equality constrained case of Ex. 5.4.
However, now the sign of the Lagrange multiplier has meaning. According
to the KKT conditions, the Lagrange multiplier has to be non-negative. Only Feasible ∇𝑓
𝑥 𝐴 satisfies this condition and therefore there is no descent direction that is directions
feasible, as shown in Fig. 5.9. The Hessian of the Lagrangian at this point is the
same as in Ex. 5.4, where we showed it is positive definite. Therefore, 𝑥 𝐴 is a 𝑥∗
minimum. Unlike the equality-constrained problem, we did not need to check
the Hessian of point 𝑥 𝐵 because the Lagrange multiplier is negative, and as a
Descent
consequence there are feasible descent directions, as shown in Fig. 5.10. ∇𝑔 directions

Figure 5.9: At the minimum there


the Lagrange multiplier is positive
Example 5.6: Simple problem with two inequality constraints and there is no descent direction
that is feasible.
Consider a variation of Ex. 5.5, where we add one more inequality as follows
∇𝑓

minimize 𝑓 (𝑥1 , 𝑥2 ) = 𝑥1 + 2𝑥2


∇𝑔
1 2
subject to 𝑔1 (𝑥 1 , 𝑥2 ) =𝑥 + 𝑥22 − 1 ≤ 0, (5.36)
4 1 𝑥𝐵

𝑔2 (𝑥 2 ) = −𝑥2 ≤ 0.

The feasible region is the top half of the ellipse, as shown in Fig. 5.11. The
Feasible
descent
2 directions
∇ 𝑓 (𝑥)
∇ 𝑓 (𝑥 ∗ ) ∇𝑔1 (𝑥) ∇ 𝑓 (𝑥) Figure 5.10: At this critical point,
1
the Lagrange multiplier is negative
𝑥 and all descent directions are feasi-
∇𝑔1 (𝑥 ∗ ) Minimum 𝑥
𝑥2 0 ble, so this point is not a minium.
𝑥∗ ∇𝑔1 (𝑥)

−1
∇𝑔2 (𝑥 ∗ ) ∇𝑔2 (𝑥)

−2
Figure 5.11: Only one point satisfies
−3 −2 −1 0 1 2 3 the first-order KKT conditions.
𝑥1
5 Constrained Gradient-Based Optimization 136

Lagrangian for this problem is


   
1 2
ℒ(𝑥 1 , 𝑥2 , 𝜎, 𝑠) = 𝑥1 + 2𝑥 2 + 𝜎1 𝑥1 + 𝑥22 − 1 + 𝑠 12 + 𝜎2 −𝑥2 + 𝑠 22 . (5.37)
4
Differentiating this with respect to all the variables to get the first-order
optimality conditions,
𝜕ℒ 1
=1+ 𝜎 𝑥 = 0,
𝜕𝑥1 2 1 1
𝜕ℒ
= 2 + 2𝜎1 𝑥2 − 𝜎2 = 0,
𝜕𝑥2
𝜕ℒ 1 2
= 𝑥 + 𝑥22 − 1 + 𝑠 12 = 0,
𝜕𝜎1 4 1
(5.38)
𝜕ℒ
= −𝑥2 + 𝑠22 = 0,
𝜕𝜎2
𝜕ℒ
= 2𝜎1 𝑠 1 = 0,
𝜕𝑠 1
𝜕ℒ
= 2𝜎2 𝑠 2 = 0.
𝜕𝑠 2
We now have two complementarity conditions, which yield the four potential
Feasible ∇ 𝑓 (𝑥 ∗ )
combinations listed in Ex. 5.6. Assuming that both constraints are active yields 𝑔2 (𝑥 ∗ )
directions

Table 5.1: Two inequality constraints yield four potential combinations.


𝑥∗
Assumption Meaning 𝑥1 𝑥2 𝜎1 𝜎2 𝑠1 𝑠2 ∇𝑔1 (𝑥 ∗ )
𝑠 1 = 0, 𝑠 2 = 0 Both constraints −2 0 1 2 0 0 Feasible
are active ∇𝑔2 (𝑥 ∗ ) 𝑔1 (𝑥 ∗ )
2 0 −1 2 0 0 directions
𝜎1 = 0, 𝜎2 = 0 Neither constraint – – – – – –
is active Figure 5.12: At the minimum, the
√ √ √
2 1 intersection of the feasible direc-
𝑠 1 = 0, 𝜎2 = 0 Only constraint 1 is 2 2 − 2 0 0 2− 4 tions and descent directions is null,
active so there is no feasible descent direc-
𝜎1 = 0, 𝑠 2 = 0 Only constraint 2 is – – – – – – tion.
active
Feasible
𝑔2 (𝑥)
Feasible directions ∇ 𝑓 (𝑥)
two possible solutions corresponding to two different Lagrange multipliers.
descent
According to the KKT conditions, the Lagrange multiplier for an active inequal- directions
ity constraint has to be positive, so only the solution with 𝜎1 = 1 is a candidate
for a minimum (see Fig. 5.12). The Hessian of the Lagrangian is identical to 𝑥
the previous example and is positive definite when 𝜎1 is positive. The other ∇𝑔1 (𝑥)
solution corresponds to 𝑥 = (2, 0), where there is a cone of descent directions Feasible
𝑔1 (𝑥)
that is feasible, as shown in Fig. 5.13. Assuming that neither constraint is active ∇𝑔2 (𝑥)
directions
yields 1 = 0 for the first optimality condition, so this situation is not possible.
Assuming that the first constraint is active yields the solution corresponding to Figure 5.13: At this point, there is
the maximum that we already found in Ex. 5.5 and shown in Fig. 5.10. Finally, a cone of descent directions that is
assuming that only the second constraint is active yields no candidate point. also feasible, so it is not a minimum.
5 Constrained Gradient-Based Optimization 137

While these examples can be solved analytically, they are the ex-
ception rather than the rule. The KKT conditions quickly become
challenging to solve analytically (try solving Ex. 5.1). Furthermore,
engineering problems usually involve functions that are defined by
models with implicit equations, which are impossible to solve ana-
lytically. The reason we include these analytic examples is to better
understand the KTT conditions. For the rest of the chapter, we focus
on numerical methods, which are necessary for the vast majority of
practical problems.

5.2.3 Meaning of the Lagrange Multipliers


A useful way to think of Lagrange multipliers is that they are the sensi-
tivity of the optimal objective 𝑓 (𝑥 ∗ ) to the value of the corresponding
constraints. Here we will explain why that is the case and how it
provides design intuition.
When a constraint is inactive, the corresponding Lagrange multiplier
is zero. This indicates that changing the value of an inactive constraint
does not affect the optimum, which is intuitive. This is only valid to
first order because the KKT conditions are based on the linearization
of the objective and constraint functions. Therefore, small changes are
assumed; an inactive constraint could be made active by changing its
value by a large enough amount.
Consider taking the derivative of Eq. 5.25 with respect to the 𝑖 th
inequality constraint (𝑔𝑖 ):
𝜕ℒ
= 𝜎𝑖 (5.39)
𝜕𝑔𝑖
With a similar form for the Lagrange multiplier for the equality con-
straints. Thus, to first order, the Lagrange multipliers tell us how much
the Lagrangian would be expected to change by changing the constraint
by a unit amount. This has practical value because it tells us how much
of an improvement can be expected if a given constraint is relaxed. The
Lagrange multipliers quantify how much the corresponding constraints
drive the design.

5.3 Penalty Methods

The concept behind penalty methods is intuitive; to transform a con-


strained problem into an unconstrained one by adding a penalty to
the objective function when constraints are violated. As mentioned
in the introduction to this chapter, penalty methods are no longer
5 Constrained Gradient-Based Optimization 138

used directly in gradient-based optimization algorithms, because it is


difficult to get them to converge to the true solution. However, these
methods are still useful to discuss because: 1) they are simple and
thus ease the transition into understanding constrained optimization;
2) though not effective for gradient-based optimization, they are still
useful in some constrained gradient-free methods, as will be discussed
in Chapter 7; 3) they can be useful as merit functions in line search
algorithms, as discussed in Section 5.6.
The penalized function can be written as

𝐹(𝑥) = 𝑓 (𝑥) + 𝜇𝑃(𝑥), (5.40)

where 𝑃(𝑥) is a penalty function and the scalar 𝜇 is a penalty parameter.


This is similar in form to the Lagrangian, but one difference is that a
value for 𝜇 is fixed in advance instead of solved for.
We can use the unconstrained optimization techniques to minimize
𝐹(𝑥). However, instead of just solving a single optimization problem,
penalty methods usually solve a sequence of problems with different
values of 𝜇 to get closer to the actual constrained minimum. We will
see shortly why we need to solve a sequence of problems rather than
just one problem.
Different forms for 𝑃(𝑥) can be used, leading to different penalty
methods. There are two main types of penalty functions: exterior
penalties, which impose a penalty only when constraints are violated,
and interior penalty functions, which impose a penalty that increases
as a constraint is approached.

5.3.1 Exterior Penalty Methods


Of the many possible exterior penalty methods, we focus on two
of the most popular ones: quadratic penalties and the augmented
Lagrangian method. Quadratic penalties are continuously differentiable
and simple to implement, but suffer from numerical ill-conditioning.
The augmented Lagrangian method is more sophisticated; it is based
on the quadratic penalty but adds terms that improve the numerical
properties. Many other penalties are possible, such as L1-norms, which
are often used when continuous differentiability is not necessary.

Quadratic Penalty Method


For equality-constrained problems the quadratic penalty method takes

𝜇Õ
the form,
𝐹(𝑥; 𝜇) = 𝑓 (𝑥) + ℎ 𝑖 (𝑥)2 . (5.41)
2
𝑖
5 Constrained Gradient-Based Optimization 139

The motivation for a quadratic penalty is that it is simple and results in


a function that is continuously differentiable. The factor of one half is
unnecessary, but is included by convention as it eliminates the extra
factor of two when taking derivatives. The penalty is nonzero unless
the constraints are satisfied (ℎ 𝑖 = 0), as desired.

𝐹(𝑥; 𝜇) Figure 5.14: Quadratic penalty for


an equality constrained 1-D prob-
lem. The minimum of the penalized
𝜇↑ function (blue dots) approaches the
true constrained minimum (black
circle) as the penalty parameter 𝜇
𝑥 increases.

𝑥true

The value of the penalty parameter 𝜇 must be chosen carefully.


Mathematically, we recover the exact solution to the constrained prob-
lem only as 𝜇 tends to infinity (see Fig. 5.14). However, starting with a
large value for 𝜇 is not practical. This is because the larger the value
of 𝜇, the larger the Hessian condition number, which corresponds to
the curvature varying greatly with direction (as an example, think of
a quadratic function where the level curves are highly skewed). This
behavior makes the problem difficult to solve numerically.
To solve the problem more effectively, we begin with a small value
of 𝜇 and solve the unconstrained problem. We then increase 𝜇 and
solve the new unconstrained problem, using the previous solution
as the starting point for this problem. This process is repeated until
the optimality conditions are satisfied (or some other approximate
convergence criteria are satisfied), as outlined in Alg. 5.7. By gradually
increasing 𝜇 and reusing the solution from the previous problem, we
avoid some of the ill-conditioning issues. Thus, the original constrained
problem is transformed into a sequence of unconstrained optimization
problems.

Algorithm 5.7: Exterior penalty method

Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor
Outputs:
5 Constrained Gradient-Based Optimization 140

𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝐹(𝑥 𝑘 ; 𝜇 𝑘 ) with respect to 𝑥 𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty parameter∗

𝑥 𝑘+1 = 𝑥 𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while

There are three potential issues with the approach outlined in


Alg. 5.7. If the starting value for 𝜇 is too low, then the penalty might
not be enough to overcome a function that is unbounded from below,
and the penalized function has no minimum.
The second issue is that we cannot practically approach 𝜇 → ∞, so
the solution to the problem is always slightly infeasible. By comparing
the optimality condition of the constrained problem, ∇𝑥 ℒ = 0, and the
optimality of the penalized function, ∇𝑥 𝐹 = 0, we can show that for
each constraint 𝑖,
𝜆∗
ℎ𝑖 ≈ 𝑖 . (5.42)
𝜇
Since ℎ 𝑖 = 0 for the exact optimum, 𝜇 must be made large to satisfy the
constraints.
The third issue has to do with the curvature of the penalized function,
which is directly proportional to 𝜇. The added curvature is added in a
direction perpendicular to the constraints, which makes the Hessian
of the penalized function increasingly ill-conditioned as 𝜇 increases.
When using Newton or quasi-Newton methods, such ill-conditioning
causes numerical difficulties.

Example 5.8: Quadratic penalty for equality constrained problem

Consider the equality constrained problem from Ex. 5.4. The penalized
function for that case is
 2
𝜇 1 2 2
𝐹(𝑥; 𝜇) = 𝑥 1 + 2𝑥2 + 𝑥 + 𝑥2 − 1 . (5.43)
2 4 1

This function is shown in Fig. 5.15 for different values of the penalty parameter
𝜇. The penalty is active for all points that are infeasible, but the minimum of
the penalized function does not coincide with the constrained minimum of
∗ 𝜌 may range from a conservative value (1.2) to an aggressive value (10), depending

on the problem.
5 Constrained Gradient-Based Optimization 141

the original problem. The penalty parameter needs to be increased for the
minimum of the penalized function to approach the correct solution, but this
results in a highly nonlinear function.

2 2 2

1 1 1

𝑥2 0 𝑥2 0 𝑥2 0
𝑥∗ 𝑥∗ 𝑥∗
−1 −1 −1 𝑥 𝐹∗
𝑥 𝐹∗
𝑥 𝐹∗
−2 −2 −2

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
𝑥1 𝑥1 𝑥1

(a) 𝜇 = 0.5 (b) 𝜇 = 3.0 (c) 𝜇 = 10.0

Figure 5.15: Quadratic penalty for


one equality constraint. The min-
imum of the penalized function
The approach discussed so far handles only equality constraints, approaches the constrained min-
but we can extend it to handle inequality constraints: Instead of adding imum as the penalty parameter
a penalty to both sides of the constraints, we just add the penalty when increases.

the inequality constraint is violated (i.e., when 𝑔 𝑗 (𝑥) > 0). This behavior
can be achieved by defining a new penalty function as
𝜇Õ
𝐹(𝑥; 𝜇) = 𝑓 (𝑥) + max[0, 𝑔 𝑗 (𝑥)]2 . (5.44)
2
𝑗

The only difference relative to the equality constraint penalty shown


in Fig. 5.14 is that the penalty is removed on the feasible side of the
inequality constraint, as shown in Fig. 5.16.

𝐹(𝑥; 𝜇)
Figure 5.16: Quadratic penalty
for an inequality constrained 1-
𝜇↑ D problem. The minimum of the
penalized function approaches the
constrained minimum from the
𝑥 infeasible side.

𝑥true

Example 5.9: Quadratic penalty for inequality constrained problem


5 Constrained Gradient-Based Optimization 142

Consider the equality constrained problem from Ex. 5.5. The penalized
function for that case is
 2
𝜇 1
𝐹(𝑥; 𝜇) = 𝑥1 + 2𝑥2 + max 0, 𝑥12 + 𝑥22 − 1 . (5.45)
2 4

This function is shown in Fig. 5.17 for different values of the penalty parameter 𝜇.
The contours of the feasible region inside the ellipse coincide with the original
function contours, but outside the feasible region, the contours change to create
a function whose minimum approaches the exact constrained minimum as the
penalty parameters is increased.

2 2 2

1 1 1

𝑥2 0 𝑥2 0 𝑥2 0
𝑥∗ 𝑥∗ 𝑥∗
−1 −1 −1 𝑥 𝐹∗
𝑥 𝐹∗
𝑥 𝐹∗
−2 −2 −2

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
𝑥1 𝑥1 𝑥1

(a) 𝜇 = 0.5 (b) 𝜇 = 3.0 (c) 𝜇 = 10.0

Figure 5.17: Quadratic penalty for


one inequality constraint. The min-
imum of the penalized function
The inequality quadratic penalty can be used together with the approaches the constrained mini-
quadratic penalty for equality constraints if we need to handle both mum from the infeasible side.
types of constraints. The two penalty parameters can be incremented
in lockstep or independently.

Tip 5.10: Scaling is also important for constrained problems.

The same considerations on scaling discussed in Chapter 4 are just as


important for constrained problems. As a rule of thumb, all constraints should
be of order one.

Augmented Lagrangian
As explained above, the quadratic penalty method requires a large
value of 𝜇 for constraint satisfaction, but the large 𝜇 degrades the
numerical conditioning. The augmented Lagrangian method alleviates
this dilemma by adding the quadratic penalty to the Lagrangian instead
5 Constrained Gradient-Based Optimization 143

of just adding it to the function. The augmented Lagrangian function


for equality constraints is:

Õ
𝑛ℎ
𝜇Õℎ𝑛
𝐹(𝑥, 𝜆; 𝜇) = 𝑓 (𝑥) + 𝜆 𝑗 ℎ 𝑗 (𝑥) + ℎ 𝑗 (𝑥)2 , (5.46)
2
𝑗 𝑗

Inequality constraints can be included in a similar way using the


maximum function of Eq. 5.44 and considering only the active or
violated constraints in the second term. Unfortunately, the Lagrange
multipliers cannot be solved for in a penalty approach and so we need
some way to estimate them.
To obtain an estimate of the Lagrange multipliers, we can compare
the optimality conditions for the augmented Lagrangian,

Õ
𝑛ℎ
∇𝑥 𝐹(𝑥, 𝜆; 𝜇) = ∇ 𝑓 (𝑥) + [𝜆 𝑗 + 𝜇ℎ 𝑗 (𝑥)]∇ℎ 𝑗 = 0 (5.47)
𝑗

to those of the actual Lagrangian,

Õ
𝑛ℎ
∇𝑥 ℒ(𝑥 ∗ , 𝜆∗ ) = ∇ 𝑓 (𝑥 ∗ ) + 𝜆∗𝑗 ∇ℎ 𝑗 (𝑥 ∗ ) = 0, (5.48)
𝑗

which suggests the approximation

𝜆∗𝑗 ≈ 𝜆 𝑗 + 𝜇ℎ 𝑗 . (5.49)

Therefore, we update the vector of Lagrange multipliers based on the


current estimate of the Lagrange multipliers and constraint values using

𝜆 𝑘+1 = 𝜆 𝑘 + 𝜇 𝑘 ℎ(𝑥 𝑘 ) (5.50)

The complete algorithm is shown in Alg. 5.11.

Algorithm 5.11: Augmented Lagrangian penalty method

Inputs:
𝑥 0 : Starting point
𝜆0 = 0: Initial Langrange multiplier
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝑘=0
5 Constrained Gradient-Based Optimization 144

while not converged do


𝑥 ∗𝑘 ← minimize 𝐹(𝑥 𝑘 , 𝜆 𝑘 ; 𝜇 𝑘 ) with respect to 𝑥 𝑘
𝜆 𝑘+1 = 𝜆 𝑘 + 𝜇 𝑘 ℎ(𝑥 𝑘 ) Update Lagrange multipliers
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while

The reason why this approach is an improvement on the plain


quadratic penalty is because by updating the Lagrange multiplier
estimates at every iteration, we obtain more accurate solutions without
having to increase 𝜇 too much. We can see this by comparing the
augmented Lagrangian approximation to the constraints obtained from
Eq. 5.49,
1
ℎ 𝑖 ≈ (𝜆∗𝑖 − 𝜆 𝑖 ) (5.51)
𝜇
with the corresponding approximation in the quadratic penalty method,
𝜆∗𝑖
ℎ𝑖 ≈ . (5.52)
𝜇
To drive the constraints to zero, the quadratic penalty relies solely
on increasing 𝜇 in the denominator, However, with the augmented
Lagrangian, we also have control on the numerator through the La-
grange multiplier estimate. If the estimate is reasonably close to the
true Lagrange multiplier then the numerator will become small for
modest values of 𝜇. Thus, the augmented Lagrangian can provide a
good solution for 𝑥 ∗ while avoiding the ill-conditioning issues of the
quadratic penalty.

Example 5.12: Augmented Lagrangian for inequality constrained problem

Consider the equality constrained problem from Ex. 5.5. Assuming the
inequality constraint is active, the augmented Lagrangian (Eq. 5.46) for that
problem is
   2
1 2 2 𝜇 1 2 2
𝐹(𝑥; 𝜇) = 𝑥1 + 2𝑥2 + 𝜎 𝑥1 + 𝑥2 − 1 + 𝑥 + 𝑥2 − 1 . (5.53)
4 2 4 1
Applying Alg. 5.11, starting with 𝜇 = 0.5 and using 𝜌 = 1.1, we get the iterations
shown in Fig. 5.18. Compared to the quadratic penalty in Ex. 5.9, the penalized
function is much better conditioned, thanks to the term associated with the
Lagrange multiplier. The minimum of the penalized function eventually
becomes the minimum of the constrained problem without the need for a large
penalty parameter.
5 Constrained Gradient-Based Optimization 145

2 2 2
𝑥 (0)
1 1 1

𝑥2 0 𝑥2 0 𝑥2 0
𝑥∗
−1 −1 −1

−2 −2 −2

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
𝑥1 𝑥1 𝑥1

(a) 𝑘 = 0, 𝜇 = 0.50, 𝜎 = 0.000 (b) 𝑘 = 2, 𝜇 = 0.61, 𝜎 = 1.146 (c) 𝑘 = 9, 𝜇 = 1.18, 𝜎 = 1.413

5.3.2 Interior Penalty Methods Figure 5.18: Augmented La-


grangian applied to inequality
Interior penalty methods work the same way as exterior penalty constrained problem.
methods—they transform the constrained problem into a series of
unconstrained problems. The main difference with interior penalty
methods is that they seek to always maintain feasibility: Instead of
adding a penalty only when constraints are violated, they add a penalty
as the constraint is approached from the feasible region. This type of
penalty is particularly desirable if the objective function is ill-defined
outside the feasible region. These methods are called “interior” because
the iteration points remain on the interior of the feasible region. They
are also referred to as barrier methods because the penalty function
acts as a barrier preventing iterates from leaving the feasible region.
One possible interior penalty function to enforce 𝑔(𝑥) ≤ 0 is the
inverse function (top of Fig. 5.19),
Õ
𝑛𝑔
1
𝑃(𝑥) = − , (5.54)
𝑔 𝑗 (𝑥)
𝑗

where 𝑃(𝑥) → ∞ as 𝑐 𝑖 (𝑥) → 0− .


A more popular interior penalty function is the logarithmic barrier 𝑃(𝑥)
(bottom of Fig. 5.19), 8

Õ
𝑛𝑔
 6

𝑃(𝑥) = − log −𝑔 𝑗 (𝑥) , (5.55) 4


𝑗
2 Inverse barrier
which also approaches infinity as the constraint tends to zero from the
feasible side. The penalty function is then, 0
−2 −1 0
Logarithmic barrier
Õ
𝑔(𝑥)
𝑛𝑔 −2

𝐹(𝑥; 𝜇) = 𝑓 (𝑥) − 𝜇 log(−𝑔 𝑗 (𝑥)). (5.56) Figure 5.19: Two different interior
𝑗 barrier functions.
Like exterior methods, interior methods must also solve a sequence
of unconstrained problems but with 𝜇 → 0 (see Alg. 5.13). As the
5 Constrained Gradient-Based Optimization 146

penalty parameter is decreased, the region across which the penalty


acts decreases making it sharper and more like a barrier as shown in
Fig. 5.20.

Figure 5.20: Logarithmic barrier


𝐹(𝑥; 𝜇) penalty for an inequality con-
strained 1-D problem. The min-
imum of the penalized function
(blue circles) approaches the true
constrained minimum (black cir-
𝜇↓ cle) as the penalty parameter 𝜇
𝑥 decreases.

𝑥 true

Algorithm 5.13: Interior penalty method

Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 < 1: Penalty decrease factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝐹(𝑥 𝑘 ; 𝜇 𝑘 ) with respect to 𝑥 𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Decrease penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while

The methodology is essentially the same as is described in Alg. 5.7,


but with a decreasing penalty parameter. One major weakness of the
method is that the penalty function is not defined for infeasible points
so a feasible starting point must be provided. For some problems, pro-
viding a feasible starting point may be difficult or practically impossible.
To prevent the algorithm from going infeasible when starting from a
feasible point, the line search must be safeguarded. For the logarithmic
barrier, this can be done by checking the values of the constraints and
backtracking if any of them is greater than or equal to zero.
5 Constrained Gradient-Based Optimization 147

Both interior and exterior penalties are shown for a two-dimensional


function in Fig. 5.21. The exterior penalty leads to solutions that are
slightly infeasible, while an interior penalty leads to a feasible solution
but underpredicts the objective.

Objective Objective

𝑥∗ 𝑥∗
𝑥∗ 𝑥∗

𝑝 𝑝

Interior Exterior
Constraint Constraint
penalty penalty

Figure 5.21: Interior penalties tend


to infinity as the constraint is ap-
proached from the feasible side of
𝐹(𝑥) the constraint (left), while exterior
penalty functions activate when the
points are not feasible (right). The
minimum for both approaches is
different from the true constrained
𝛼𝑝 minimum.
∗ ∗ ∗
𝑥 interior 𝑥true 𝑥exterior

Another weakness is that, similarly to the exterior penalty methods,


the Hessian becomes ill-conditioned as the penalty parameter tends
to zero.66 There are augmented and modified barrier approaches that 66. Murray, Analytical expressions for the
eigenvalues and eigenvectors of the Hessian
can avoid the ill-conditioning issue (and other methods that remain matrices of barrier and penalty functions.
ill-conditioned but can still be solved reliably, albeit inefficiently).67 1971
67. Forsgren et al., Interior Methods for
However, these methods have been superseded by modern interior Nonlinear Optimization. 2002
point methods discussed in Section 5.5, so we do not elaborate on
further improvements to classical interior barrier methods.

Example 5.14: Logarithmic penalty for inequality constrained problem

Consider the equality constrained problem from Ex. 5.5. The penalized
function for that case using the logarithmic penalty (Eq. 5.56) is
 
1
𝐹(𝑥; 𝜇) = 𝑥1 + 2𝑥2 − 𝜇 log − 𝑥12 − 𝑥22 + 1 . (5.57)
4
This function is shown in Fig. 5.22 for different values of the penalty parameter
5 Constrained Gradient-Based Optimization 148

2 2 2

1 1 1

𝑥2 0 𝑥2 0 𝑥2 0
𝑥 𝐹∗ 𝑥 𝐹∗ 𝑥 𝐹∗
−1 𝑥∗ −1 𝑥∗ −1 𝑥∗

−2 −2 −2

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
𝑥1 𝑥1 𝑥1

(a) 𝜇 = 3.0 (b) 𝜇 = 1.0 (c) 𝜇 = 0.2

𝜇. The penalized function is defined only in the feasible space, so we do not Figure 5.22: Logarithmic penalty
plot its contours outside the ellipse. for one inequality constraint. The
minimum of the penalized func-
tion approaches the constrained
minimum from the feasible side.

5.4 Sequential Quadratic Programming

Sequential quadratic programming (SQP) is the first of the modern


constrained optimization methods we discuss. SQP is not a single algo-
rithm, but instead, it is a conceptual method from which various specific
algorithms have been derived. We begin with equality-constrained
SQP, and then add inequality constraints.

5.4.1 Equality-Constrained SQP


To derive the SQP method, we start with the KKT conditions for this
problem and treat them as residuals of equations that need to be solved.
Recall that the Lagrangian is given by:
ℒ(𝑥, 𝜆) = 𝑓 (𝑥) + ℎ(𝑥)𝑇 𝜆 (5.58)
The KKT conditions are that the derivatives of this function are zero:
   
∇𝑥 ℒ(𝑥, 𝜆) ∇ 𝑓 (𝑥) + [∇ℎ(𝑥)]𝑇 𝜆
ℛ= = =0 (5.59)
∇𝜆 ℒ(𝑥, 𝜆) ℎ(𝑥)
Recall that to solve a system of equations 𝑅(𝑢) = 0 using Newton’s
method, we solve the sequence of linear systems
[∇𝑢 ℛ(𝑢 𝑘 )] 𝑠 𝑢 = −ℛ(𝑢 𝑘 ), (5.60)
where 𝑠 𝑢 = 𝑢 𝑘+1 − 𝑢 𝑘 , and in our case 𝑢 = [𝑥 𝑇 , 𝜆𝑇 ]𝑇 . Differentiating the
residual vector, Eq. 5.59, with respect to the two concatenated vectors
in 𝑢 yields the block linear system,
    
∇2𝑥𝑥 ℒ [∇ℎ]𝑇 𝑠𝑥 −∇𝑥 ℒ
= (5.61)
[∇ℎ] 0 𝑠𝜆 −ℎ
5 Constrained Gradient-Based Optimization 149

This is a linear system of 𝑛 𝑥 + 𝑛 ℎ equations where the Jacobian matrix


is square. The shape of the matrix and its subblocks are illustrated
in Fig. 5.23. We solve a sequence of these problems to converge to
the optimal design variables and the corresponding optimal Lagrange
multipliers. At each iteration we update the design variables and 𝑛𝑥 𝑛ℎ
Lagrange multipliers as:
𝑛𝑥 ∇2𝑥𝑥 ℒ ∇ℎ 𝑇
𝑥 𝑘+1 = 𝑥 𝑘 + 𝑠 𝑥 (5.62)
𝜆 𝑘+1 = 𝜆 𝑘 + 𝑠𝜆 (5.63)
𝑛ℎ ∇ℎ 0
SQP can be derived in an alternative way that leads to different in-
sights. This alternate approach requires an understanding of quadratic Figure 5.23: Shape of different
programming (QP) discussed in more detail in Section 11.3, but briefly submatrices forming the matrix for
described here. A QP problem is an optimization problem with a the SQP problem (Eq. 5.61)
quadratic objective and linear constraints. In a general form, we can
express any equality-constrained QP as:
1 𝑇
minimize 𝑥 𝑄𝑥 + 𝑓 𝑇 𝑥
2 (5.64)
subject to 𝐴𝑥 + 𝑏 = 0

The objective is quadratic in 𝑥 and the constraint is linear in 𝑥. A two-


dimensional example is illustrated in Fig. 5.24. The constraint is a matrix
equation that represents multiple linear equality constraints—one for
every row in 𝐴. 𝑥2
𝑥∗
We can solve this optimization problem analytically from the opti-
mality conditions. First, we form the Lagrangian:
1 𝑇
(5.65)
𝑥1
ℒ(𝑥, 𝜆) = 𝑥 𝑄𝑥 + 𝑓 𝑇 𝑥 + 𝜆𝑇 (𝐴𝑥 + 𝑏)
2
Figure 5.24: Quadratic problem in
We now take the partial derivatives and set them equal to zero: two dimensions.

∇𝑥 ℒ = 𝑄𝑥 + 𝑓 + 𝐴𝑇 𝜆 = 0
(5.66)
∇𝜆 ℒ = 𝐴𝑥 + 𝑏 = 0

We can express those same equations in a block matrix form:


    
𝑄 𝐴𝑇 𝑥 −𝑓
= (5.67)
𝐴 0 𝜆 −𝑏

This is like the procedure we used in solving the KKT conditions,


except that these are just linear equations and so we can solve them
directly without any iteration. Intuitively, it shouldn’t be too surprising
that finding the minimum of a quadratic objective (which means linear
gradients) subject to linear constraints results in a set of linear equations.
5 Constrained Gradient-Based Optimization 150

As long as 𝑄 is positive definite, then the linear system always


has a solution, and it is the global minimum of the QP† . The ease
with which a QP can be solved provides an alternative motivation
for SQP. For a general constrained problem we can make a local QP
approximation of the nonlinear model, solve the QP, then repeat the
process again. This method involves iteratively solving a sequence of
quadratic programming problems, hence the name sequential quadratic
programming.
To form the QP, we use a quadratic approximation of the Lagrangian
(removing the constant term because it does not change the solution),
and a linear approximation of the constraints, for some step, 𝑠, near our
current point. In other words, we locally approximate the problem as
the following QP:

1 𝑇 2
minimize 𝑠 ∇𝑥𝑥 ℒ𝑠 + ∇𝑥 ℒ 𝑇 𝑠
2
by varying 𝑠 (5.68)

subject to [∇ℎ]𝑠 + ℎ = 0

We substitute the gradient of the Lagrangian into the objective:

1 𝑇 2
𝑠 ∇𝑥𝑥 ℒ𝑠 + ∇ 𝑓 𝑇 𝑠 + 𝜆𝑇 [∇ℎ]𝑠 (5.69)
2
Then, we substitute the constraint [∇ℎ]𝑠 = −ℎ into the objective:

1 𝑇 2
𝑠 ∇𝑥𝑥 ℒ𝑠 + ∇ 𝑓 𝑇 𝑠 − 𝜆𝑇 ℎ (5.70)
2
Now, we can remove the last term in the objective, because it does not
depend on the design variables (𝑠) resulting in the following equivalent
problem:
1 𝑇 2
minimize 𝑠 ∇𝑥𝑥 ℒ𝑠 + ∇ 𝑓 𝑇 𝑠
2
by varying 𝑠 (5.71)

subject to [∇ℎ]𝑠 + ℎ = 0
Using the QP solution method outlined above, results in the follow-
ing system of linear equations:
    
∇2𝑥𝑥 ℒ [∇ℎ]𝑇 𝑠𝑥 −∇ 𝑓
= (5.72)
∇ℎ 0 𝜆 𝑘+1 −ℎ

We replace 𝜆 𝑘+1 = 𝜆 𝑘 + 𝑠𝜆 and multiply through:


      
∇2𝑥𝑥 ℒ [∇ℎ]𝑇 𝑠𝑥 [∇ℎ]𝑇 𝜆 𝑘 −∇ 𝑓
+ = (5.73)
∇ℎ 0 𝑠𝜆 0 −ℎ
5 Constrained Gradient-Based Optimization 151

Subtracting the second term on both sides yields the same set of
equations we found from applying Newton’s method to the KKT
conditions:  2    
∇𝑥𝑥 ℒ [∇ℎ]𝑇 𝑠 𝑥 −∇𝑥 ℒ
= (5.74)
∇ℎ 0 𝑠𝜆 −ℎ
The derivation based on solving the KKT conditions is more fun-
damental. This alternative derivation relies the somewhat arbitrary
choices (made in hindsight) of choosing a QP as the subproblem and
using an approximation of the Lagrangian with constraints rather than
an approximation of the objective with constraints, or an approximation
of the Lagrangian with no constraints. Nevertheless, it is a useful
conceptual model to consider the method as sequentially creating and
solving QPs.

5.4.2 Inequality Constraints


If we have inequality constraints, we can still use the SQP algorithm
for equality constrained problems introduced in the previous section,
but we need some modifications as that approach assumes equality
constraints. A common approach to this problem is to use an active-set
method. If we knew which of the inequality constraints were active
(𝑔𝑖 (𝑥 ∗ ) = 0) and which were inactive (𝑔𝑖 (𝑥 ∗ ) < 0) at the optimum, we
could use the same solution approach, treating the active constraints as
equality constraints and ignoring all inactive constraints. Unfortunately,
we cannot assume any knowledge about which constraints are active at
the optimum in general.
Finding which constraints are active in an iterative way poses a
difficulty, as in general we would need to try all possible combinations
of active constraints. This is intractable if there are many constraints.
While the true active set is not known until a solution is found,
we can estimate it at each iteration. This estimated subset of active
constraints is called the working set. Then, the linear system (5.61) can
be solved only considering the equality constraints and the inequality
constraints from the working set. The working set is then updated at
each iteration.
As a first guess for the working set, we can simply evaluate the
values for all inequality constraints, and select the ones for which
𝑔 𝑘 ≥ −𝜖, where 𝜖 is a small positive quantity. This selects constraints
that are near active or infeasible into the working set.
Many methods exist for updating the working set, only a general
outline is discussed here. For the equality constrained case, determining
the step direction, 𝑝 𝑘 , already ensures feasibility and so 𝛼 𝑘 can be chosen
5 Constrained Gradient-Based Optimization 152

without regarding the constraints. For the inequality constrained case,


the computation of the Newton step, 𝑝 𝑘 , only involves the working set.
Because this working set may be incomplete, the line search strategy
needs to choose a step size that does not violate constraints outside the
working set. The algorithm determines 𝛼 𝑚𝑎𝑥 , which is the largest step
size for which the constraints are still feasible. The line search then
enforces 𝛼 𝑚𝑎𝑥 as an upper bound. If 𝛼 𝑘 < 𝛼 𝑚𝑎𝑥 then the working set
is unchanged. However, if 𝛼 𝑘 = 𝛼 𝑚𝑎𝑥 then the constraints that set the
value for 𝛼 𝑚𝑎𝑥 are said to be blocking (i.e., those constraints prevented
the optimizer from taking a larger step). Those constraints are then
added to the working set to improve the prediction of 𝑝 𝑘 for the next
iteration.

Tip 5.15: How to handle maximum and minimum constraints.

Often a maximum or minimum constraint is desired. For example, the


stress on a structure may be evaluated at many points and we want to make
sure the maximum stress does not exceed a given yield stress:

max(𝜎) ≤ 𝜎 𝑦 (5.75)

However, the maximum function is not continuously differentiable. That is not


always problematic, but it is also mathematically equivalent to constraining
the stress at all 𝑚 points, with the added benefit that all constraints are now
continuously differentiable:

𝜎𝑖 ≤ 𝜎 𝑦 for 𝑖 = 1 . . . 𝑚 (5.76)

While this adds many more constraints, if an active set method is used, there
is little cost to adding more constraints as most of them will be inactive.
Alternatively, a constraint aggregation method (Section 5.7) could be used.

5.4.3 Quasi-Newton SQP


The SQP method as discussed so far requires the Hessian of the
Lagrangian ∇2𝑥𝑥 ℒ, which if an exact Hessian is available, becomes
Newton’s method applied to the optimality condition. For the same
reasons, and in a similar manner as discussed in Chapter 4, an exact ‡ Sometimes linearizing the constraints
Hessian is often not available, making it desirable to approximate the can lead to an infeasible QP subproblem,
and additional techniques are needed to
Hessian. We will denote the approximate Hessian of the Lagrangian, treat these subproblems.58 , 68
at iteration 𝑘, as 𝑊𝑘 . 58. Nocedal et al., Numerical Optimization.
A high-level description of SQP with quasi-Newton approximations 2006
is provided in Alg. 5.16.‡ For the convergence criterion, we can use an 68. Gill et al., SNOPT: An SQP Algorithm
for Large-Scale Constrained Optimization.
infinity norm of the KKT system residual vector. To get more control 2005
5 Constrained Gradient-Based Optimization 153

over the convergence, we can have two separate tolerances for the norm
of the optimality and feasibility.

Algorithm 5.16: SQP with quasi-Newton approximation

Inputs:
𝑥 0 : Starting point
𝜏opt : Optimality tolerance
𝜏feas : Feasibility tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝜆0 = 0 Initial Lagrange multipliers


𝑊0 = 𝐼 Initialize Hessian of Lagrangian approximation to identity matrix
𝑘=0
while ||∇𝑥 ℒ|| ∞ > 𝜏opt or || ℎ|| ∞ > 𝜏feas do
Evaluate ∇ℎ 𝑘 , ∇𝑥 ℒ
Solve KKT system (5.61) for 𝑠 𝑥 and 𝑠𝜆
    
𝑊𝑘 [∇ℎ]𝑇 𝑠𝑥 −∇𝑥 ℒ
=
∇ℎ 0 𝑠𝜆 −ℎ

Perform a line search in direction 𝑝 𝑘 = 𝑠 𝑘 using the merit function

𝜙(𝛼, 𝜇) = 𝑓 (𝑥 𝑘 + 𝛼𝑝 𝑥 ) + 𝜇k ℎ(𝑥 𝑘 + 𝛼𝑝 𝑥 )k

𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 𝑘 Update step and active set


𝜆 𝑘+1 = 𝜆 𝑘 + 𝑠𝜆 Update the Lagrange multipliers
Update 𝑊𝑘+1 Compute quasi-Newton approximation using Eq. (5.77)
Increment penalty parameter 𝜇
𝑘 = 𝑘+1
end while

Just as we did for unconstrained optimization, we can approximate


𝑊𝑘 using the gradients of the Lagrangian and the BFGS update formula.
However, unlike unconstrained optimization, we do not want the
inverse of the Hessian directly. Instead, we make use of a version of
the BFGS formula that computes the Hessian (rather than the inverse
Hessian):
𝑊𝑘 𝑠 𝑘 𝑠 𝑇 𝑊𝑘 𝑦 𝑘 𝑦 𝑇𝑘
𝑊𝑘+1 = 𝑊𝑘 − 𝑇 𝑘 + 𝑇 (5.77)
𝑠 𝑘 𝑊𝑘 𝑠 𝑘 𝑦𝑘 𝑠 𝑘
where:
𝑠 𝑘 = 𝑥 𝑘+1 − 𝑥 𝑘
(5.78)
𝑦 𝑘 = ∇𝑥 ℒ(𝑥 𝑘+1 , 𝜆 𝑘+1 ) − ∇𝑥 ℒ(𝑥 𝑘 , 𝜆 𝑘+1 ).
5 Constrained Gradient-Based Optimization 154

The step in the design variable space, 𝑠 𝑘 , is the step that resulted from
the latest line search. The Lagrange multiplier is fixed to the latest value
when approximating the curvature of the Lagrangian because we only
need the curvature in the space of the design variables.
Recall that for the QP problem to have a solution then 𝑊𝑘 must be
positive definite. To ensure that this 𝑊𝑘 is always positive definite, a
damped BFGS update formula was devised.20§ This method replaces 20. Powell, Algorithms for nonlinear con-
straints that use Lagrangian functions. 1978
the 𝑦 with a new vector 𝑟 defined as:
§ The damped BFGS formula is not al-

ways the best approach for all problems,


𝑟 𝑘 = 𝜃𝑘 𝑦 𝑘 + (1 − 𝜃𝑘 )𝑊𝑘 𝑠 𝑘 , (5.79) and other Lagrangian Hessian approxi-
mation methods exist like the symmet-
where the scalar 𝜃𝑘 is defined as ric rank-one approximation.63 Addition-
ally, for very large problems storing a


dense Hessian may be problematic and
1
 if 𝑠 𝑇𝑘 𝑦 𝑘 ≥ 0.2𝑠 𝑇𝑘 𝑊𝑘 𝑠 𝑘 , so limited-memory update methods are
𝜃𝑘 = (5.80) used.69

0.8𝑠 𝑇𝑘 𝑊𝑘 𝑠 𝑘
 if 𝑠 𝑇𝑘 𝑦 𝑘 < 0.2𝑠 𝑇𝑘 𝑊𝑘 𝑠 𝑘 ,
 𝑠 𝑘 𝑊𝑘 𝑠 𝑘 −𝑠 𝑇𝑘 𝑦 𝑘
𝑇 63. Fletcher, Practical Methods of Optimiza-
tion. 1987
69. Liu et al., On the limited memory BFGS
which can range from 0 to 1. We then use the same BFGS update method for large scale optimization. 1989
formula, Eq. 5.77, except that we replace each 𝑦 𝑘 with 𝑟 𝑘 .
To better understand what this method is doing, take a closer look
at the two extremes for 𝜃. If 𝜃𝑘 = 0 then Eq. 5.79 in combination
with Eq. 5.77 yields 𝑊𝑘+1 = 𝑊𝑘 ; that is, the Hessian approximation is
unmodified. At the other extreme, 𝜃𝑘 = 1 yields the full BFGS update
formula (𝑟 𝑘 is set to 𝑦 𝑘 ). Thus, the parameter 𝜃𝑘 provides a linear
weighting between keeping the current Hessian approximation and a
full BFGS update.
The definition of 𝜃𝑘 (5.80) ensures that 𝑊𝑘+1 stays close enough
to 𝑊𝑘 and remains positive definite. The damping is activated when
the predicted curvature in the new latest step is below one fifth of the
curvature predicted by the latest approximate Hessian. This could
happen when the function is flattening or when the curvature becomes
negative.
Like the quasi-Newton methods we discussed in unconstrained
optimization, we solve the QP for a search direction, 𝑝 𝑥 (as opposed
to a full step 𝑠 𝑥 ), and perform a line search. However, we cannot just
use the objective function as the metric of our line search as we did
in unconstrained optimization; we need to use some kind of merit
function (a form of penalty) or filter. These are techniques used to
evaluate whether a step is acceptable in a line search. The details of
these techniques are discussed later in Section 5.6, since they are used
for both SQP and interior-point methods.

Example 5.17: SQP applied to equality-constrained problem


5 Constrained Gradient-Based Optimization 155

We now solve Ex. 5.4 using the SQP method (Alg. 5.16). The gradient of
the equality constraint is
1   
𝑥 1
∇ℎ = 2 1 = ,
2𝑥 2 2
and differentiating the Lagrangian with respect to 𝑥 yields
   
𝜕ℒ 1 + 21 𝜆𝑥 1 1
= = .
𝜕𝑥 2 + 2𝜆𝑥2 2

We start a at 𝑥 (0) = [2, 1] with an initial Lagrange multipler 𝜆 = 0, We set the


initial estimate of the Lagrangian Hessian to 𝑊0 = 𝐼. The KKT system (5.61) to
be solved is then
1 0 1 𝑠 𝑥1  −1
    
0 1 2 𝑠 𝑥  = −2 .
   2  
1 2 0  𝑠  −1
   𝜆  
The solution of 𝑠 to the above is [−0.2, −0.4, −0.8]. Performing a line search
in the direction 𝑝 = [−0.2, −0.4] yields the solution to the next iteration at
𝑥 (1) = [1.8, 0.6]. The Lagrange multiplier is updated to 𝜆1 = −0.8.
To update the approximate Hessian 𝑊𝑘 using the damped BFGS update
formula (5.79), we need to compare the values of 𝑠 0𝑇 𝑦0 = −0.272 and 𝑠 0𝑇 𝑊0 𝑠 0 =
0.2. Since 𝑠 𝑇𝑘 𝑦 𝑘 < 0.2𝑠 𝑇𝑘 𝑊𝑘 𝑠 𝑘 , we need to compute the scalar 𝜃 = 0.339 using
Eq. 5.80. This provides a partial BFGS update where 𝑟0 is a mix of the current
Hessian approximation and the step 𝑦0 . Using the quasi-Newton update
Eq. 5.77, we get the approximate Hessian for the next iteration as
 
1.076 −0.275
𝑊1 = .
−0.275 0.256

The penalty parameter 𝜇 also needs to be incremented. We use a factor of 1.1


in this example. This process is then repeated for subsequent iterations.
Fig. 5.25 shows SQP optimization at various iterations. The gray contour is
1 𝑠 𝑇 𝑊 𝑠 + ∇ 𝑓 𝑇 𝑠, the QP subproblem (5.71) that is being solved at each iteration,
2
and 𝑊 is the approximate Hessian updated using the quasi-Newton method.
The linearlized constraint [∇ℎ]𝑠 + ℎ = 0 is also shown as a straight line.
The starting point is infeasible and the iterations remain infeasible until the
last few iterations. This behavior is common for SQP, because while it satisfies
the linear approximation of the constraints at each step, it does not necessarily
satisfy the nonlinear constraint of the actual problem. As the approximation of
the constraint becomes more accurate near the solution, the nonlinear constraint
is then satisfied.

5.5 Interior Point Methods

One way to look at interior point methods is to make a seemingly small


(but important!) modification from the interior penalty method we have
5 Constrained Gradient-Based Optimization 156

2 2 2

𝑥 (0)
1 1 1

𝑥2 0 𝑥2 0 𝑥2 0
𝑥∗
−1 −1 −1

−2 −2 −2

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
𝑥1 𝑥1 𝑥1

(a) 𝑘 = 0, 𝜆 = 0.00 (b) 𝑘 = 6, 𝜆 = 1.29 (c) 𝑘 = 16, 𝜆 = 1.41

already seen. We formulate the constrained optimization problem as: Figure 5.25: SQP algorithm itera-
Õ tions.
minimize 𝑓 (𝑥) − 𝜇 ln 𝑠 𝑘
𝑘
by varying 𝑥, 𝑠 (5.81)
subject to ℎ 𝑗 (𝑥) = 0 for 𝑗 = 1, . . . , 𝑛 ℎ
𝑔 𝑘 (𝑥) + 𝑠 𝑘 = 0 for 𝑘 = 1, . . . , 𝑛 𝑔

One difference relative to the barrier method is that rather than treating the problem as unconstrained, we apply Newton's method to the KKT conditions, as we do in SQP methods. Implicit in the constraints is 𝑠ₖ ≥ 0. This condition is enforced by the logarithm, which prevents 𝑠 from approaching zero. Because 𝑠ₖ is always positive, 𝑔ₖ(𝑥*) < 0 at the solution, which satisfies the inequality constraints. Like the penalty method, this formulation is not actually equivalent to our original constrained problem, except in the limit as 𝜇 → 0, so we need to solve a sequence of these problems with 𝜇 approaching zero. Unlike SQP, this formulation uses only equality constraints, and so it avoids the combinatorial problem of trying to determine an active set.
First, we form the Lagrangian for this problem as
$$
\mathcal{L}(x, \lambda, \sigma) = f(x) - \mu \sum_{k} \ln s_k + h(x)^T \lambda + \left( g(x) + s \right)^T \sigma. \tag{5.82}
$$

By taking derivatives with respect to 𝑥, 𝜆, 𝜎, and 𝑠, respectively, the KKT conditions for this problem are
$$
\begin{aligned}
\frac{\partial f}{\partial x_i} + \sum_{j} \lambda_j \frac{\partial h_j}{\partial x_i} + \sum_{k} \sigma_k \frac{\partial g_k}{\partial x_i} &= 0 \\
h_j &= 0 \\
g_k + s_k &= 0 \\
-\frac{\mu}{s_k} + \sigma_k &= 0
\end{aligned}
\tag{5.83}
$$

We can also express this in vector form by defining a diagonal matrix 𝑆 with 𝑆ₖₖ = 𝑠ₖ, and a vector of ones, 1:
$$
\begin{aligned}
\nabla f(x) + [\nabla h(x)]^T \lambda + [\nabla g(x)]^T \sigma &= 0 \\
h &= 0 \\
g + s &= 0 \\
-\mu S^{-1} \mathbf{1} + \sigma &= 0
\end{aligned}
\tag{5.84}
$$
This form of the equations can pose numerical difficulties as 𝑠 → 0 (perhaps most obvious in the last equation of Eq. 5.83), so we multiply the last equation through by 𝑆:
$$
\begin{aligned}
\nabla f(x) + [\nabla h(x)]^T \lambda + [\nabla g(x)]^T \sigma &= 0 \\
h &= 0 \\
g + s &= 0 \\
-\mu \mathbf{1} + S \sigma &= 0
\end{aligned}
\tag{5.85}
$$
We now have a set of residual equations and can apply Newton's method, just as we did for SQP. The result is
$$
\begin{bmatrix}
\nabla^2_{xx} \mathcal{L}(x) & [\nabla h(x)]^T & [\nabla g(x)]^T & 0 \\
\nabla h(x) & 0 & 0 & 0 \\
\nabla g(x) & 0 & 0 & I \\
0 & 0 & S & \Sigma
\end{bmatrix}
\begin{bmatrix} s_x \\ s_\lambda \\ s_\sigma \\ s_s \end{bmatrix}
= -
\begin{bmatrix} \nabla_x \mathcal{L}(x, \lambda, \sigma) \\ h(x) \\ g(x) + s \\ S \sigma - \mu \mathbf{1} \end{bmatrix},
\tag{5.86}
$$
where Σ is a diagonal matrix with 𝜎 along the diagonal, and 𝐼 is the identity matrix.
For numerical efficiency, we make some small modifications to this system. The matrix is almost symmetric, and with a little work we can make it symmetric. If we multiply the last block equation by 𝑆⁻¹, we have
$$
\begin{bmatrix}
\nabla^2_{xx} \mathcal{L}(x) & [\nabla h(x)]^T & [\nabla g(x)]^T & 0 \\
\nabla h(x) & 0 & 0 & 0 \\
\nabla g(x) & 0 & 0 & I \\
0 & 0 & I & S^{-1} \Sigma
\end{bmatrix}
\begin{bmatrix} s_x \\ s_\lambda \\ s_\sigma \\ s_s \end{bmatrix}
= -
\begin{bmatrix} \nabla_x \mathcal{L}(x, \lambda, \sigma) \\ h(x) \\ g(x) + s \\ \sigma - \mu S^{-1} \mathbf{1} \end{bmatrix}.
\tag{5.87}
$$

[Figure 5.26: Shape of the blocks in the interior point system (5.87); the block rows and columns have sizes n_x, n_h, n_g, and n_g.]

The advantage of this equivalent system is that we can use a symmetric linear solver, which is more efficient than a general linear solver. The shape of the matrix and its blocks are illustrated in Fig. 5.26.
The steps in the interior point method are detailed in Alg. 5.18. Similarly to quasi-Newton SQP, we estimate the Hessian of the Lagrangian and perform a line search, but the system of equations that we solve is different. Another major difference is that there is no
active set to update.¶

¶ Many important implementation details and variations exist on things like how often and in what manner to update the penalty parameter 𝜇, ensuring that the BFGS Hessian is positive definite and is computed in a memory-efficient manner for large problems, resetting parameters, etc. Further implementation details can be found in the literature.70,71

70. Wächter et al., On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. 2005
71. Byrd et al., An Interior Point Algorithm for Large-Scale Nonlinear Programming. 1999

Algorithm 5.18: Interior point method with quasi-Newton approximation

Inputs:
𝑥₀: Starting point
𝜏opt: Optimality tolerance
𝜏feas: Feasibility tolerance
Outputs:
𝑥*: Optimal point
𝑓(𝑥*): Corresponding function value
𝜆0 = 0; 𝜎0 = 0 Initial Lagrange multipliers
𝑠0 = 1 Initial slack variables
𝑊0 = 𝐼 Initialize Hessian of Lagrangian approximation to identity matrix
𝑘=0
while ||∇𝑥 ℒ|| ∞ > 𝜏opt or || ℎ|| ∞ > 𝜏feas do
Evaluate ∇ℎ 𝑘 , ∇𝑥 ℒ
Solve the KKT system (5.87) for 𝑠
$$
\begin{bmatrix}
W_k & [\nabla h(x)]^T & [\nabla g(x)]^T & 0 \\
\nabla h(x) & 0 & 0 & 0 \\
\nabla g(x) & 0 & 0 & I \\
0 & 0 & I & S^{-1} \Sigma
\end{bmatrix}
\begin{bmatrix} s_x \\ s_\lambda \\ s_\sigma \\ s_s \end{bmatrix}
= -
\begin{bmatrix} \nabla_x \mathcal{L}(x, \lambda, \sigma) \\ h(x) \\ g(x) + s \\ \sigma - \mu S^{-1} \mathbf{1} \end{bmatrix}
$$

Perform a line search in direction 𝑝 𝑘 = 𝑠 𝑘 using the merit function


$$
\phi(\alpha; \mu) = f(x_k + \alpha p_x) + \mu \left\lVert h(x_k + \alpha p_x) + \max\left(0, g(x_k + \alpha p_x)\right) \right\rVert
$$
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 𝑘 Update step
𝜆 𝑘+1 = 𝜆 𝑘 + 𝑠𝜆 Update the Lagrange multipliers
Update 𝑊𝑘+1 Compute quasi-Newton approximation using Eq. (5.77)
Reduce penalty parameter 𝜇
𝑘 = 𝑘+1
end while

Generally speaking, active-set SQP methods are more effective on medium-scale problems, while interior-point methods are more effective on large-scale problems. Interior-point methods are also generally more sensitive to the initial starting point and the scaling of the problem.58 These are of course only generalities and are not replacements for testing multiple algorithms on the problem of interest. Many optimization frameworks make it easy to switch between optimization algorithms, facilitating this type of testing.

58. Nocedal et al., Numerical Optimization. 2006

Example 5.19: Interior point method applied to an inequality-constrained problem
Here we solve Ex. 5.5 using the interior point method (Alg. 5.18), starting at 𝑥⁽⁰⁾ = [2, 1]. The initial Lagrange multiplier is 𝜎 = 0 and the slack variable is 𝑠 = 1. Starting with a penalty parameter of 𝜇 = 20 results in the iterations shown in Fig. 5.27.

[Figure 5.27: Interior point algorithm iterations in the (𝑥₁, 𝑥₂) plane (19 iterations).]

For the first iteration, differentiating the Lagrangian with respect to 𝑥 yields
$$
\frac{\partial \mathcal{L}}{\partial x} = \begin{bmatrix} 1 + \tfrac{1}{2} \sigma x_1 \\ 2 + 2 \sigma x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix},
$$
and the gradient of the constraint is
$$
\nabla g = \begin{bmatrix} \tfrac{1}{2} x_1 \\ 2 x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.
$$
The interior point system of equations (5.87) at the starting point is
$$
\begin{bmatrix}
1 & 0 & 1 & 0 \\
0 & 1 & 2 & 0 \\
1 & 2 & 0 & 1 \\
0 & 0 & 1 & 0
\end{bmatrix}
\begin{bmatrix} s_{x_1} \\ s_{x_2} \\ s_\sigma \\ s_s \end{bmatrix}
=
\begin{bmatrix} -1 \\ -2 \\ -2 \\ 20 \end{bmatrix}.
$$
The solution is 𝑠 = [−21, −42, 20, 103]. Performing a line search in the direction 𝑝 = [−21, −42] yields 𝑥⁽¹⁾ = [1.34375, −0.3125]. The Lagrange multiplier and slack variable are updated to 𝜎₁ = 20 and 𝑠₁ = 104, respectively.
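As with the SQP example, this first interior-point step can be checked with a short NumPy sketch that assembles the system (5.87) from the values given above (𝑊₀ = 𝐼, 𝜎 = 0, 𝑠 = 1, 𝜇 = 20, and 𝑔(𝑥⁽⁰⁾) + 𝑠 = 2). It verifies only this single step and is not an implementation of Alg. 5.18.

```python
import numpy as np

# First interior-point step of Ex. 5.19 (minimal sketch).
W = np.eye(2)
grad_L = np.array([1.0, 2.0])     # gradient of the Lagrangian at x0 (sigma = 0)
grad_g = np.array([1.0, 2.0])     # constraint gradient at x0
g_plus_s = 2.0                    # g(x0) + s = 1 + 1
sigma, s, mu = 0.0, 1.0, 20.0

K = np.block([
    [W,                grad_g[:, None],  np.zeros((2, 1))],
    [grad_g[None, :],  np.zeros((1, 1)), np.ones((1, 1))],
    [np.zeros((1, 2)), np.ones((1, 1)),  np.array([[sigma / s]])],
])
rhs = -np.array([grad_L[0], grad_L[1], g_plus_s, sigma - mu / s])

step = np.linalg.solve(K, rhs)
print(step)   # [-21., -42., 20., 103.]
```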
To update the approximate Hessian 𝑊ₖ, we can use the damped BFGS update formula (5.79) to ensure that 𝑊ₖ remains positive definite. Comparing 𝑠₀ᵀ𝑦₀ = 73.21 and 𝑠₀ᵀ𝑊₀𝑠₀ = 2.15, we see that 𝑠ₖᵀ𝑦ₖ ≥ 0.2 𝑠ₖᵀ𝑊ₖ𝑠ₖ, and therefore we do a full BFGS update with 𝜃₀ = 1 and 𝑟₀ = 𝑦₀. Using the quasi-Newton update Eq. 5.77, we get the approximate Hessian
$$
W_1 = \begin{bmatrix} 1.388 & 4.306 \\ 4.306 & 37.847 \end{bmatrix}.
$$

We reduce the penalty parameter 𝜇 by a factor of 2 at each iteration. This process can be repeated for subsequent iterations.
The starting point is infeasible but the algorithm finds a feasible point after
the first iteration. From then on, it approaches the optimum from within the
feasible region, which is the expected behavior for interior point algorithms, as
shown in Fig. 5.27.

Tip 5.20: Some equality constraints can be posed as inequality constraints.

Equality constraints are less common in engineering design problems than inequality constraints. Sometimes we pose a problem as an equality constraint unnecessarily. For example, the simulation of an aircraft in steady-level flight may require the lift to equal the weight. Formally, this is an equality constraint, but it can also be posed as an inequality constraint of the form 𝐿 ≥ 𝑊. This is because there is no advantage to additional lift (it would only increase drag), and so the constraint is always active at the solution. While an equality constraint is perhaps more natural algorithmically, the flexibility of the inequality constraint can often allow the optimizer to explore the design space more effectively. Consider another example: a propeller may be designed to match a specified thrust target. While an equality constraint would likely work, it is more constraining than necessary. If the optimal design were somehow able to produce excess thrust, we would not reject that design. Thus, we should not formulate the constraint in a way that is unnecessarily restrictive.

5.6 Merit Functions and Filters

For unconstrained optimization, evaluating the effectiveness of the


line search is relatively straightforward. The primary concern is that
the objective function achieves a sufficient decrease. For constrained
optimization, the problem is not as simple. A new point may decrease the objective but increase the infeasibility. It is not obvious how best to weigh these trade-offs, so we need to combine these metrics into a single metric for the purposes of evaluating the line search. We cannot just use the Lagrangian, because it is not computed at each point in the line search. Traditionally, this issue has been dealt with by using merit functions.
Common merit functions have already been introduced in the form of penalty functions. These include the 𝑙₁ and 𝑙₂ norms of the constraint violations,
$$
F(x; \mu) = f(x) + \mu \, \lVert c_{vio}(x) \rVert_p \quad \text{for } p = 1 \text{ or } 2, \tag{5.88}
$$
and the augmented Lagrangian,
$$
F(x; \mu) = f(x) + \lambda(x)^T h(x) + \frac{1}{2} \mu \, \lVert c_{vio}(x) \rVert_2^2, \tag{5.89}
$$
where c_vio are the constraint violations, defined as
$$
c_{vio,i}(x) =
\begin{cases}
\lvert h_i(x) \rvert & \text{for equality constraints} \\
\max\left(0, g_i(x)\right) & \text{for inequality constraints.}
\end{cases}
\tag{5.90}
$$
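A minimal Python sketch of these definitions is shown below; the function names and the small example constraints are purely illustrative and are not part of the text.

```python
import numpy as np

# Sketch of the l1 merit function (5.88) using the constraint
# violations defined in (5.90). h and g return arrays of the equality
# and inequality constraint values, respectively.
def constraint_violations(h, g, x):
    return np.concatenate([np.abs(h(x)), np.maximum(0.0, g(x))])

def merit_l1(f, h, g, x, mu):
    return f(x) + mu * np.sum(constraint_violations(h, g, x))

# Illustrative example with one equality and one inequality constraint:
f = lambda x: x[0] + 2 * x[1]
h = lambda x: np.array([x[0] + x[1] - 1.0])
g = lambda x: np.array([x[0] ** 2 - 4.0])
print(merit_l1(f, h, g, np.array([2.0, 1.0]), mu=10.0))   # 4 + 10*2 = 24
```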

One downside to merit functions, similar to that seen with penalty functions, is that it is in general difficult to choose a suitable value for the penalty parameter. It needs to be large enough to ensure that there is improvement, but if it is too large, then a full Newton step is often not permitted and convergence is slowed.
A more recent approach is the filter method.72 Filter methods, in general, interfere less with a full Newton step73 and are effective for both sequential quadratic programming and interior point methods.74 The approach is based on concepts from multiobjective optimization, which is discussed in more detail in Chapter 9 (Fig. 9.1 in particular). The basic idea is that we accept a point in the line search if it is not dominated by the points in the current filter. One point dominates another if its objective is lower and the sum of its constraint violations is lower. If a point is acceptable to the line search, it is added to the filter, and any dominated points are removed from the filter (as they are no longer necessary).

72. Fletcher et al., Nonlinear programming without a penalty function. 2002
73. Fletcher et al., A Brief History of Filter Methods. 2006
74. Benson et al., Interior-Point Methods for Nonconvex Nonlinear Programming: Filter Methods and Merit Functions. 2002

Example 5.21: Using a filter.

A filter consists of pairs (𝑓(𝑥), 𝑐(𝑥)), where 𝑐(𝑥) is the sum of the constraint violations: 𝑐(𝑥) = ‖𝑐_vio(𝑥)‖₁. As an example, assume that the current filter contains these three points: (2, 5), (3, 2), and (7, 1). Notice that none of the points in the filter dominates any other. We could plot these points as shown in Fig. 5.28, where the shaded regions correspond to areas that are dominated by the points in the filter. During a line search, a new candidate point is evaluated, and there are three possible outcomes. Let us consider three example points that illustrate these outcomes.

1. (1, 4): This point is not dominated by any point in the filter. The step is accepted, and this point is added to the filter. Additionally, it dominates one of the points in the filter, (2, 5), and so that point is removed from the filter. The current filter is now (1, 4), (3, 2), and (7, 1).

2. (1, 6): This point is not dominated by any point in the filter. The step is accepted, and this new point is added to the filter. No points in the filter are dominated, and so nothing is removed. The current filter is now (1, 6), (2, 5), (3, 2), and (7, 1).

3. (4, 3): This point is dominated by a point in the filter, (3, 2). The step is rejected and the line search must continue. The filter is unchanged.

[Figure 5.28: A filter method example showing three points in the filter (objective vs. constraint infeasibility); the shaded regions correspond to points that are dominated by the filter.]
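The filter logic in this example can be captured in a few lines of Python. The sketch below uses the strict dominance rule stated above; the function names are illustrative only, and a practical filter would also include the sufficient decrease conditions mentioned next.

```python
# Minimal sketch of the filter logic in Ex. 5.21. Each filter entry is a
# pair (f, c), where c is the sum of constraint violations.
def dominates(a, b):
    """Point a dominates b if both its objective and its violation are lower."""
    return a[0] < b[0] and a[1] < b[1]

def filter_accept(filt, candidate):
    """Return the updated filter and whether the candidate is acceptable."""
    if any(dominates(p, candidate) for p in filt):
        return filt, False                      # step rejected, filter unchanged
    # step accepted: add the point and drop any points it now dominates
    new_filter = [p for p in filt if not dominates(candidate, p)]
    new_filter.append(candidate)
    return new_filter, True

filt = [(2, 5), (3, 2), (7, 1)]
print(filter_accept(filt, (1, 4)))   # accepted; (2, 5) is removed
print(filter_accept(filt, (1, 6)))   # accepted; nothing removed
print(filter_accept(filt, (4, 3)))   # rejected; filter unchanged
```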

This outline presents only the basic ideas. A robust implementation of a filter method requires imposing sufficient decrease conditions, not unlike those in the unconstrained case, as well as a few other minor modifications.73

Tip 5.22: Consider reformulating your constraints.

There are often multiple mathematically equivalent ways to pose the


problem constraints. Sometimes reformulating can yield equivalent problems
that are significantly easier to solve. In some cases it can help to add constraints
that are redundant, but guide the optimizer to more useful areas of the design
space. Similarly, one should consider whether residuals should be solved
internally or posed as constraints at the optimizer level.

Example 5.23: Numerical solution of graphical solution example.

Recall the constrained problem with a quadratic objective and quadratic


constraints introduced in Ex. 5.1. Instead of finding an approximate solution
graphically or trying to solve this analytically, we can now solve this numerically
using SQP or the interior point method. The resulting optimization paths are
shown in Fig. 5.29. The difference in the number of iterations between these two methods is not representative, because this is a simple problem and their relative performance is highly dependent on implementation details.

[Figure 5.29: Numerical solution of Ex. 5.1. (a) Sequential quadratic programming; (b) Interior point method.]

Example 5.24: Constrained spring system.

Consider the spring system from Ex. 4.22, which is an unconstrained


optimization problem. We can constrain the spring system by attaching two
cables as shown in Fig. 5.30, where ℓ_c1 = 9 m, ℓ_c2 = 6 m, y_c = 2 m, x_c1 = 7 m, and x_c2 = 3 m. Because the cables do not resist compression forces, they correspond to inequality constraints, which yields the following problem:

[Figure 5.30: Spring system constrained by two cables.]

$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2} k_1 \left( \sqrt{(\ell_1 + x_1)^2 + x_2^2} - \ell_1 \right)^2 + \frac{1}{2} k_2 \left( \sqrt{(\ell_2 - x_1)^2 + x_2^2} - \ell_2 \right)^2 - m g x_2 \\
\text{by varying} \quad & x_1, \; x_2 \\
\text{subject to} \quad & \sqrt{(x_1 + x_{c1})^2 + (x_2 + y_c)^2} \leq \ell_{c1}, \\
& \sqrt{(x_1 - x_{c2})^2 + (x_2 + y_c)^2} \leq \ell_{c2}.
\end{aligned}
\tag{5.91}
$$
The optimization paths for SQP and the interior point method are shown in Fig. 5.31.

[Figure 5.31: Optimization of the constrained spring system. (a) Sequential quadratic programming; (b) Interior point method.]

5.7 Constraint Aggregation

As discussed in Chapter 6, adjoint methods, or reverse mode AD,


are effective for scenarios with many inputs and few outputs. These
methods are desirable because they allow us to work with large numbers
of design variables. However, they are only effective if we have a small
number of constraints. In many practical engineering problems we
have many constraints, and so in these scenarios we can use constraint

aggregation methods. A constraint aggregation function combines some


or all of the constraints into a single constraint such that violation of any
single constraint causes the total function to be violated (approximately).
A simple example would be the max function. If max(𝑔(𝑥)) < 0 then
we know that all of 𝑔 𝑗 (𝑥) < 0. However, the max function is not
differentiable and so alternative functions that play a similar role are
needed.
A common constraint aggregation function is the KS function,75 which acts like a soft maximum:
$$
KS(x) = \frac{1}{\rho} \ln \left( \sum_{j=1}^{m} e^{\rho g_j(x)} \right), \tag{5.92}
$$
where 𝜌 is an aggregation parameter, which plays a role similar to the penalty parameter used in penalty methods.

75. Kreisselmeier et al., Systematic Control Design by Optimizing a Vector Performance Index. 1979
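A minimal Python sketch of Eq. 5.92 follows. The shift by max(𝑔) is a standard numerical safeguard against overflow in the exponentials (it does not change the value of the function) and is an implementation detail added here, not part of the definition above.

```python
import numpy as np

# Sketch of the KS aggregation function (5.92).
def ks(g, rho=100.0):
    g = np.asarray(g, dtype=float)
    gmax = g.max()
    # KS(x) = (1/rho) * ln(sum exp(rho*g_j)) = gmax + (1/rho) * ln(sum exp(rho*(g_j - gmax)))
    return gmax + np.log(np.sum(np.exp(rho * (g - gmax)))) / rho

g = np.array([-0.3, -0.05, -0.4])   # three satisfied constraints
print(ks(g, rho=100.0))             # slightly above max(g) = -0.05
```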

Example 5.25: Constrained spring system with aggregated constraints.

Consider the constrained spring system from Ex. 5.24. Aggregating the two
constraints using the KS function, we can formulate a single constraint as

$$
KS(x_1, x_2) = \frac{1}{\rho} \ln \left( e^{\rho g_1(x_1, x_2)} + e^{\rho g_2(x_1, x_2)} \right), \tag{5.93}
$$
where
$$
\begin{aligned}
g_1(x_1, x_2) &= \sqrt{(x_1 + x_{c1})^2 + (x_2 + y_c)^2} - \ell_{c1}, \\
g_2(x_1, x_2) &= \sqrt{(x_1 - x_{c2})^2 + (x_2 + y_c)^2} - \ell_{c2}.
\end{aligned}
\tag{5.94}
$$
We plot the contour of 𝐾𝑆 = 0 in Fig. 5.32 for increasing values of the aggregation parameter 𝜌. We can see the difference in the feasible region for the lowest value of 𝜌, which results in a conservative optimum. For the highest value of 𝜌, the optimum obtained with constraint aggregation is graphically indistinguishable from the true optimum, and the objective function value approaches the true optimal value of −22.1358.

[Figure 5.32: Spring system constrained by two cables, with the aggregated constraint: (a) 𝜌_KS = 2, 𝑓_KS = −19.448; (b) 𝜌_KS = 10, 𝑓_KS = −21.653; (c) 𝜌_KS = 100, 𝑓_KS = −22.090.]

5.8 Summary

With constraints, we can no longer use ∇𝑓 as a first-order convergence criterion; instead, we define a new function called the Lagrangian and use ∇ℒ as our first-order convergence criterion. The Lagrangian is a function not just of the design variables, but also of the Lagrange multipliers and slack variables. This first-order convergence criterion, ∇ℒ = 0 (along with a constraint on the sign of the Lagrange multipliers associated with the inequality constraints), is called the KKT conditions.

Penalty methods are useful as a beginning conceptual model, and


for use in gradient-free methods, but are no longer used for constrained
gradient-based optimization. Instead, sequential quadratic program-
ming and interior point methods are the state of the art. These methods
are applications of Newton’s method to the KKT conditions. One
primary difference is in the treatment of inequality constraints. Se-
quential quadratic programming methods try to distinguish between
active and inactive constraints, using the (potentially) active constraints
like equality constraints and ignoring the (potentially) inactive ones.
Interior point methods add slack variables to force all constraints to
behave like equality constraints.

Problems

5.1 Answer true or false and correct the false statements.

a) Penalty methods are among the most effective methods for


constrained optimization.
b) For an equality constraint in 𝑛-dimensional space, all feasible
directions about a point are perpendicular to the constraint
gradient at that point and define a plane with 𝑛 − 1 degrees
of freedom.
c) The feasible directions about a point on an inequality constraint define a half space whose dividing hyperplane is perpendicular to the gradient of the constraint at that point.
d) A point is optimal if there is only one feasible direction that
is also a descent direction.
e) For an inequality-constrained problem, if we replace the
inequalities that are active at the optimum with equality
5 Constrained Gradient-Based Optimization 166

constraints and ignore the inactive constraints, we get the


same optimum.
f) For a point to be optimal, the Lagrangian multipliers for both
equality and active inequality constraints must be positive.
g) The complementarity conditions are easy to solve because either the Lagrange multiplier is zero or the slack variable is zero.
h) At the optimum of a constrained problem, the Hessian of
the objective function must be positive semidefinite.
i) The Lagrange multipliers represent the change in the objec-
tive function we would get for a perturbation in the constraint
value.
j) Sequential quadratic programming seeks to find the solution
of the KKT system.
k) Interior point methods must start with a point in the interior
of the feasible region.
l) Constraint aggregation combines multiple constraints into a
single constraint that is equivalent.
5.2 Consider Ex. 5.4 and modify it so that the equality constraint is
the negative of the original one, that is,
$$
h(x_1, x_2) = -\frac{1}{4} x_1^2 - x_2^2 + 1 = 0.
$$
Classify the critical points and compare them with the original
solution. What does that tell you about the significance of the
Lagrange multiplier sign?
5.3 Similarly to the previous exercise, consider Ex. 5.5 and modify
it so that the inequality constraint is the negative of the original
one, that is,
$$
h(x_1, x_2) = -\frac{1}{4} x_1^2 - x_2^2 + 1 \leq 0.
$$
Classify the critical points and compare them with the original
solution.
5.4 Consider the following optimization problem,
$$
\begin{aligned}
\text{minimize} \quad & x_1^2 + 3 x_2^2 + 4 \\
\text{by varying} \quad & x_1, \; x_2 \\
\text{subject to} \quad & x_2 \geq 1 \\
& x_1^2 + 4 x_2^2 \leq 4
\end{aligned}
\tag{5.95}
$$
Find the optimum analytically.
5 Constrained Gradient-Based Optimization 167

5.5 Find the rectangle of maximum area that can be inscribed in an ellipse. Give your answer in terms of the ratio of the two areas. Check that your answer is intuitively correct for the special case of a rectangle inscribed in a circle.

5.6 In Section 2.1, we mentioned that Euclid showed that among rectangles of a given perimeter, the square has the largest area. Formulate the problem and confirm the result analytically. What are the units and the physical interpretation of the Lagrange multiplier in this problem? Exploration: Show that if you minimize the perimeter with the area constrained to the optimum value you found above, you get the same solution.

5.7 Column in Compression. Consider a thin-walled tubular column subjected to a compression force. We want to minimize the mass of the column while ensuring that the structure does not yield or buckle under a compression force of magnitude 𝐹. Our design variables are the radius of the tube (𝑅) and the wall thickness (𝑡). This design optimization problem can be stated as
$$
\begin{aligned}
\text{minimize} \quad & 2 \rho \ell \pi R t \qquad \text{(mass)} \\
\text{by varying} \quad & R, \; t \qquad \text{(radius, wall thickness)} \\
\text{subject to} \quad & \frac{F}{2 \pi R t} - \sigma_{\text{yield}} \leq 0 \qquad \text{(yield stress)} \\
& F - \frac{\pi^3 E R^3 t}{4 \ell^2} \leq 0 \qquad \text{(buckling load)}
\end{aligned}
$$

[Figure 5.33: Slender tubular column in compression, with length ℓ, radius 𝑅, wall thickness 𝑡, and applied force 𝐹.]

In the formula for the mass in the objective above, 𝜌 is the material density, and we assume that 𝑡 ≪ 𝑅. The first constraint requires that the compressive stress, which is simply the force divided by the cross-sectional area, not exceed the yield stress. The second constraint uses Euler's critical buckling load formula, where 𝐸 is the material Young's modulus, and the second moment of area is replaced with the one corresponding to a circular cross section (𝐼 = 𝜋𝑅³𝑡).
Find the optimum 𝑅 and 𝑡 as a function of the other parameters. Pick some reasonable values for the parameters and verify your solution graphically. Plot the gradients of the objective and constraints at the optimum and verify the Lagrange multipliers graphically.

5.8 Beam with H section. Consider a cantilevered beam with an H-shaped cross-section composed of a web and flanges, subject to a transverse load as shown in Fig. 5.34. The objective is to minimize the structural weight by varying the web thickness 𝑡_w and the flange thickness 𝑡_b, subject to stress constraints. The other cross-sectional parameters are fixed; the web height ℎ is 250 mm and the flange width 𝑏 is 125 mm. The axial stress in the flange and the shear stress in the web should not exceed the corresponding yield values (𝜎_yield = 200 MPa and 𝜏_yield = 116 MPa, respectively). The optimization problem can be stated as
$$
\begin{aligned}
\text{minimize} \quad & 2 b t_b + h t_w \qquad \text{(mass)} \\
\text{by varying} \quad & t_b, \; t_w \qquad \text{(flange and web thicknesses)} \\
\text{subject to} \quad & \frac{P \ell h}{2 I} - \sigma_{\text{yield}} \leq 0 \qquad \text{(axial stress)} \\
& \frac{1.5 P}{h t_w} - \tau_{\text{yield}} \leq 0 \qquad \text{(shear stress)}
\end{aligned}
$$
The second moment of area is
$$
I = \frac{h^3}{12} t_w + \frac{b}{6} t_b^3 + \frac{h^2 b}{2} t_b.
$$

[Figure 5.34: Cantilever beam with H section; beam length ℓ = 1 m, web height ℎ = 250 mm, flange width 𝑏 = 125 mm, tip load 𝑃 = 100 kN.]

Find the optimal values of 𝑡𝑏 and 𝑡𝑤 by solving the KKT conditions


analytically. Plot the objective contours and constraints to verify
your result graphically.

5.9 Penalty method implementation. Program one or more penalty


methods from Section 5.3.

a) Solve the constrained problem from Ex. 5.8 as a first test of


your implementation. Use an existing software package for
the optimization subproblem or the unconstrained optimizer
you implemented in Prob. 4.9. How far can you push the
penalty parameter until the optimizer fails? How close can
you get to the exact optimum? Try different starting points
and verify that the algorithm always converges to the same
optimum.
b) Solve the problem from Prob. 5.3.
c) Solve the problem detailed in Prob. 5.11.
d) Exploration: Solve any other problem from this section or a
problem of your choosing.

5.10 Constrained optimizer implementation. Program an SQP or interior


point algorithm. You may repurpose the BFGS algorithm that
you implemented in Prob. 4.9. For SQP, start by implementing
only equality constraints and reformulate test problems with
inequality constraints as problems with only equality constraints.
5 Constrained Gradient-Based Optimization 169

a) Reproduce the results from Ex. 5.17 (SQP) or Ex. 5.19 (interior
point).
b) Solve the problem from Prob. 5.3.
c) Solve the problem detailed in Prob. 5.11.
d) Compare the computational cost, precision, and robustness
of your optimizer with an existing software package.
5.11 Aircraft Fuel Tank. A jet aircraft needs to carry a streamlined external fuel tank with a required volume. The tank shape is approximated as an ellipsoid of length ℓ and diameter 𝑑. We want to minimize the drag of the fuel tank by varying its length and diameter, that is,
$$
\begin{aligned}
\text{minimize} \quad & D(\ell, d) \\
\text{by varying} \quad & \ell, \; d \\
\text{subject to} \quad & V_{\text{req}} - V(\ell, d) \leq 0.
\end{aligned}
$$

[Figure 5.35: Ellipsoid fuel tank with length ℓ and diameter 𝑑.]

The drag is given by
$$
D = \frac{1}{2} \rho v^2 C_D S,
$$
where the air density is 𝜌 = 0.55 kg/m³ and the aircraft speed is 𝑣 = 300 m/s. The drag coefficient of an ellipsoid can be estimated as‖
$$
C_D = C_f \left[ 3 \frac{\ell}{d} + 4.5 \left( \frac{d}{\ell} \right)^{1/2} + 21 \left( \frac{d}{\ell} \right)^2 \right].
$$

‖ Hoerner76 provides this approximation on page 6-18.
76. Hoerner, Fluid-Dynamic Drag. 1965

We assume a friction coefficient of 𝐶_f = 0.0035. The drag is proportional to the surface area of the tank, which for an ellipsoid is
$$
S = \frac{\pi}{2} d^2 \left( 1 + \frac{\ell}{d e} \arcsin e \right),
$$
where 𝑒 = √(1 − 𝑑²/ℓ²). The volume of the fuel tank is
$$
V = \frac{\pi}{6} d^2 \ell,
$$
and the required volume is 𝑉req = 2.5 m3 .
Find the optimum tank length and diameter numerically using
your own optimizer or a software package. Verify your solution
graphically by plotting the objective function contours and the
constraint.
5 Constrained Gradient-Based Optimization 170

5.12 Solve a variation of Ex. 5.24 where we replace the system of cables
with a cable and a rod that resists both tension and compression.
The cable is positioned above the spring as shown in Fig. 5.36,
where 𝑥 𝑐 = 2 m and 𝑦 𝑐 = 3 m, with a maximum length of
ℓ 𝑐 = 7.0 m. The rod is positioned at 𝑥 𝑟 = 2 m and 𝑦𝑟 = 4 m, with
a length of ℓ_r = 4.5 m. How does this change the formulation of the problem? Does the optimum change?

[Figure 5.36: Spring system constrained by a cable and a rod.]


5.13 Three-Bar Truss. Consider the truss shown in Fig. 5.37.∗∗ The truss is subjected to a load 𝑃, and we want to minimize the mass of the structure subject to stress and buckling constraints. The axial stresses in each bar are
$$
\begin{aligned}
\sigma_1 &= \frac{1}{\sqrt{2}} \left( \frac{P \cos \theta}{A_o} + \frac{P \sin \theta}{A_o + \sqrt{2} A_m} \right) \\
\sigma_2 &= \frac{\sqrt{2} \, P \sin \theta}{A_o + \sqrt{2} A_m} \\
\sigma_3 &= \frac{1}{\sqrt{2}} \left( \frac{P \sin \theta}{A_o + \sqrt{2} A_m} - \frac{P \cos \theta}{A_o} \right),
\end{aligned}
$$
where 𝐴_o is the cross-sectional area of the outer bars 1 and 3, and 𝐴_m is the cross-sectional area of the middle bar 2.

[Figure 5.37: Three-bar truss elements, with bar length ℓ = 0.5 m, load angle 𝜃 = 55°, and load 𝑃 = 500 kN.]

∗∗ This is a well-known optimization problem formulated by Schmit27 when he first proposed integrating numerical optimization with finite-element structural analysis.
27. Schmit, Structural Design by Systematic Synthesis. 1960

The full optimization problem for the three-bar truss is
$$
\begin{aligned}
\text{minimize} \quad & \rho \ell \left( 2 \sqrt{2} A_o + A_m \right) \\
\text{by varying} \quad & A_o, \; A_m \qquad \text{(cross-sectional areas)} \\
\text{subject to} \quad & A_{\min} - A_o \leq 0 \qquad \text{(minimum area)} \\
& A_{\min} - A_m \leq 0 \qquad \text{(minimum area)} \\
& \sigma_1 - \sigma_{\text{yield}} \leq 0 \qquad \text{(stress constraint for bar 1)} \\
& \sigma_2 - \sigma_{\text{yield}} \leq 0 \qquad \text{(stress constraint for bar 2)} \\
& \sigma_3 - \sigma_{\text{yield}} \leq 0 \qquad \text{(stress constraint for bar 3)} \\
& -\sigma_1 - \frac{\pi^2 E \beta A_o}{2 \ell^2} \leq 0 \qquad \text{(buckling for bar 1)} \\
& -\sigma_2 - \frac{\pi^2 E \beta A_m}{2 \ell^2} \leq 0 \qquad \text{(buckling for bar 2)} \\
& -\sigma_3 - \frac{\pi^2 E \beta A_o}{2 \ell^2} \leq 0 \qquad \text{(buckling for bar 3)}
\end{aligned}
$$
In the buckling constraints, 𝛽 relates the second moment of area to the area (𝐼 = 𝛽𝐴²) and depends on the cross-sectional shape of the bars. Assuming a square cross-section, 𝛽 = 1/12. The bars are made of an aluminum alloy with the following properties: 𝜌 = 2710 kg/m³, 𝐸 = 69 GPa, 𝜎_yield = 110 MPa.
Find the optimal bar cross-sectional areas using your own opti-
mizer or a software package. Which constraints are active? Verify
your result graphically. Exploration: Try different combinations
of unit magnitudes (e.g., Pa versus MPa for the stresses) for the
functions of interest and the design variables to observe the effect
of scaling.

5.14 Solve the same three-bar truss optimization problem of Prob. 5.13
by aggregating all the constraints into a single constraint. Try
different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.13.

5.15 Ten-bar Truss. Consider the ten-bar truss structure described in


Appendix C.2.2. The full design optimization problem is

$$
\begin{aligned}
\text{minimize} \quad & \rho \sum_{i=1}^{10} A_i \ell_i \\
\text{by varying} \quad & A_i, \; i = 1, \ldots, 10 \qquad \text{(cross-sectional areas)} \\
\text{subject to} \quad & A_i \geq A_{\min} \qquad \text{(minimum area)} \\
& \lvert \sigma_i \rvert \leq \sigma_y \quad \text{for } i = 1, \ldots, 10 \qquad \text{(yield stress constraints)}
\end{aligned}
$$

Find the optimal mass and corresponding cross-sectional areas


using your own optimizer or a software package. Show a conver-
gence plot. Report the number of function evaluations and the
number of major iterations. Exploration: Restart from different
starting points. Do you get more than one local minimum? What
can you conclude about the multimodality of the design space?

5.16 Solve the same ten-bar truss optimization problem of Prob. 5.15 by aggregating all the constraints into a single constraint. Try different aggregation parameters and see how close you can get to the solution you obtained for Prob. 5.15.
Computing Derivatives
6
Derivatives play a central role in many numerical algorithms. In the
context of optimization, we are interested in computing derivatives for
the gradient-based optimization methods introduced in the previous
chapter. The accuracy and computational cost of the derivatives are
critical for the success of these optimization methods. In this chapter, we
introduce the various methods for computing derivatives and discuss
the relative advantages of each method.

By the end of this chapter you should be able to:

1. List the various methods used to compute derivatives.

2. Describe the pros and cons of these methods.

3. Use the methods in computational analyses.

6.1 Derivatives, Gradients, and Jacobians

The derivatives we focus on are first-order derivatives of one or more


functions of interest ( 𝑓 ) with respect to a vector of variables (𝑥). In the
engineering optimization literature, the term “sensitivity analysis” is
often used to refer to the computation of derivatives, and derivatives
are sometimes referred to as “sensitivity derivatives” or “design sensi-
tivities.” While these terms are not incorrect, we prefer to use the more
specific and concise term, derivative.
For the gradient-based, unconstrained methods introduced in the
previous chapter, we need the gradient of the objective (a scalar) with
respect to the vector of variables, which is a column vector with 𝑛 𝑥
components:
$$
\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_{n_x}} \right]^T. \tag{6.1}
$$
In general, however, optimization problems in engineering design are
constrained. As we saw in Chapter 5, for constrained gradient-based optimization, we also need the gradients of all the constraints


with respect to all the design variables.
For the sake of generality, we do not distinguish between the
objective and constraints in this chapter. Instead, we refer to the
functions being differentiated as the functions of interest, and represent
them as a vector-valued function, 𝑓 = [ 𝑓1 , 𝑓2 , . . . , 𝑓𝑛 𝑓 ]𝑇 . The derivatives
of all the functions of interest with respect to all the variables form the
Jacobian matrix,
$$
J(x) =
\begin{bmatrix}
\nabla f_1^T \\ \vdots \\ \nabla f_{n_f}^T
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_{n_x}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_{n_f}}{\partial x_1} & \cdots & \dfrac{\partial f_{n_f}}{\partial x_{n_x}}
\end{bmatrix},
\tag{6.2}
$$
which is an (𝑛 𝑓 × 𝑛 𝑥 ) rectangular matrix where each row corresponds to
the gradient of each function with respect to all the variables. Sometimes
it is useful to write the Jacobian in index notation,

$$
J_{ij} = \frac{\partial f_i}{\partial x_j}. \tag{6.3}
$$

Row 𝑖 of the Jacobian is the gradient of function 𝑓𝑖 . Each column in the


Jacobian is called the tangent with respect to a given variable 𝑥 𝑗 .

Example 6.1: Jacobian of a vector-valued function.

Consider the following function with two inputs and two outputs:
   
$$
f(x) = \begin{bmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{bmatrix}
= \begin{bmatrix} x_1 x_2 + \sin x_1 \\ x_1 x_2 + x_2^2 \end{bmatrix}. \tag{6.4}
$$
We can differentiate this symbolically to obtain exact reference values:
$$
\frac{\mathrm{d} f}{\mathrm{d} x} =
\begin{bmatrix}
x_2 + \cos x_1 & x_1 \\
x_2 & x_1 + 2 x_2
\end{bmatrix}. \tag{6.5}
$$
We evaluate this at 𝑥 = (𝜋/4, 2), which yields
$$
\frac{\mathrm{d} f}{\mathrm{d} x} =
\begin{bmatrix}
2.707 & 0.785 \\
2.000 & 4.785
\end{bmatrix}. \tag{6.6}
$$
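The following short NumPy snippet evaluates the analytic Jacobian (6.5) at 𝑥 = (𝜋/4, 2) to reproduce the values in (6.6); it is provided only as a reference for the derivative methods discussed later in this chapter.

```python
import numpy as np

# Function and analytic Jacobian of Ex. 6.1.
def f(x):
    return np.array([x[0] * x[1] + np.sin(x[0]),
                     x[0] * x[1] + x[1] ** 2])

def jacobian(x):
    return np.array([[x[1] + np.cos(x[0]), x[0]],
                     [x[1],                x[0] + 2 * x[1]]])

x = np.array([np.pi / 4, 2.0])
print(jacobian(x))   # approximately [[2.707, 0.785], [2.000, 4.785]]
```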

6.2 Overview of Methods for Computing Derivatives

The methods for computing derivatives can be classified according


to the representation used for the numerical model. There are three
possible representations, as shown in Fig. 6.1. In one extreme (left),
we do not know anything about the model and consider it a black box
where we only have control over the inputs and observe the outputs.
When this is the case, we can only compute derivatives using finite
differences (Section 6.4). In the other extreme (right), we have access
to all the source code used to compute the functions of interest
and perform the differentiation line by line. This is the essence of
the algorithmic differentiation approach (Section 6.6), as well as the
complex-step method (Section 6.5). In the intermediate case we look
at the model residuals and states (middle), which are the basis for
the implicit analytic methods (Section 6.7). Finally, when the model
can be represented with multiple components, we can use a coupled
derivative approach where any of the above methods can be used for
each component (Section 13.4).

[Figure 6.1: Derivative computation methods can consider three different levels of information: function values, model states, and lines of code.]

Tip 6.2: Identify and fix the sources of numerical noise.

As mentioned in Tip 3.8, it is important to find out the level of numerical
noise in your model. This is especially important when computing derivatives
of the model because taking the derivative can amplify the noise. There are
several common sources of model numerical noise, some of which can be
mitigated. First, you should establish the level of numerical noise in your model
(see Tip 3.8). Then you can do the same for the derivative and see if you get
enough precision.
Iterative solvers can introduce numerical noise when the convergence
tolerance is too high or when they have an inherent limit in their precision (see
Section 3.5.3). When you do not have enough precision, you can try to reduce
the convergence tolerance or increase the iteration limit.
Another possible source of error is file input and output. Many legacy
codes are driven by reading and writing input and output files. However,
the numbers in the files usually have far fewer digits than a double precision
floating point number. The ideal solution is to modify the original code so that
it can be called directly so that the data is passed in memory. Another solution

is to change the output precision in the files.

Tip 6.3: Smooth model discontinuities.

Many models are defined in a piecewise manner, resulting in a discon-


tinuous function value, discontinuous derivative, or both. This can happen
even if the underlying physical behavior is continuous (for example, by using a
non-smooth interpolation of experimental data). The solution is to modify the
implementation so that it is continuous and still consistent with the physics.
If the physics is truly discontinuous, it might still be advisable to artificially
smooth the function as long as there is no serious degradation in the modeling
error. Even if the smoothed version is highly nonlinear, having a continuous
first derivative will help in the derivative computation and gradient-based
optimization.

6.3 Symbolic Differentiation

Symbolic differentiation is well known and widely used in calculus,


but it is of limited use in numerical optimization because it is only
applicable for explicit functions. Except for the simplest cases (such as
Ex. 6.1), many computational models are implicitly defined and involve
iterative solvers (see Chapter 3). While the mathematical expressions within these iterative procedures are explicit, it is challenging, or
even impossible, to use symbolic differentiation to obtain closed-form
mathematical expressions for the derivative of the procedure. Even
when it is possible, these expressions are almost always computationally
inefficient.

Example 6.4: Difficulties associated with symbolic differentiation.

Kepler’s equation describes the orbit of a body under gravity as briefly


discussed in Chapter 2. Following Prob. 6.6 we seek to compute the quantity
𝑓 , which is the difference between the eccentric and mean anomalies. For
simplicity, we set the eccentricity to 1, and call the mean anomaly 𝑥. We are
then left with the equation:

𝑓 = sin(𝑥 + 𝑓 ) (6.7)

We can see that 𝑓 is an implicit function of 𝑥. As a simple numerical procedure,


to determine the value of 𝑓 for a given input 𝑥, we use fixed point iteration.
That means that we start with a guess for 𝑓 , input the value of 𝑓 on only the
right hand side of that expression to estimate a new value for 𝑓 , and repeat.
6 Computing Derivatives 177

In this case, convergence typically happens in about 10 iterations. Arbitrarily


we choose 𝑥 as the initial guess for 𝑓 and the computational procedure is as
follows:
Input: 𝑥
𝑓 =𝑥
for 𝑖 = 1 to 10 do
𝑓 = sin(𝑥 + 𝑓 )
end for
return 𝑓
Now that we have a computational procedure, we would like to compute
the derivative d 𝑓 /d𝑥. A symbolic math toolbox can be used to find the closed-
form expression for this derivative as shown in Fig. 6.2, but the expression is
extremely long and is full of redundant calculations. This problem becomes
exponentially worse as you increase the number of iterations in the loop.
Therefore, this approach is intractable for computational models of even
moderate complexity—this is known as expression swell.

dfdx =
cos(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin (2*x))))))))))*( cos(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin (2*x)))))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin(x + sin (2*x))))))))*( cos(x + sin(x + sin(x + sin
(x + sin(x + sin(x + sin (2*x)))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin (2*x))))))*( cos(x + sin(x + sin(x + sin(x + sin (2*x)))))
*( cos(x + sin(x + sin(x + sin (2*x))))*( cos(x + sin(x + sin (2*x)))*(
cos(x + sin (2*x))*(2* cos (2*x) + 1) + 1) + 1) + 1) + 1) + 1) + 1) +
1) + 1)

Figure 6.2: Symbolic derivative of the simple function shown above.
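The expression swell above can be reproduced with a symbolic algebra package. The following SymPy sketch (the choice of SymPy is ours, purely for illustration) unrolls the fixed-point loop symbolically and differentiates the result; the size of the derivative expression grows rapidly with the number of iterations.

```python
import sympy as sp

x = sp.Symbol('x')
f = x                      # initial guess
for _ in range(10):
    f = sp.sin(x + f)      # one fixed-point iteration, unrolled symbolically
dfdx = sp.diff(f, x)       # closed-form derivative: deeply nested expression
print(sp.count_ops(dfdx))  # operation count grows rapidly with the loop length
```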

Nevertheless, symbolic differentiation is still useful to derive deriva-


tives of simple explicit components within a larger model. Furthermore,
algorithmic differentiation (discussed in a later section) relies on symbolic
differentiation to differentiate each line of code in the model.

6.4 Finite Differences

Finite-difference methods are widely used to compute derivatives due


to their simplicity. They are versatile because they require nothing more
than function values. Finite-differences are the only viable option when
dealing with “black-box” functions because they do not require any
knowledge about how the function is evaluated. Most gradient-based
optimization algorithms perform finite-differences by default when
the user does not provide the required gradients. However, finite
differences are neither accurate nor efficient.
6 Computing Derivatives 178

6.4.1 Finite-Difference Formulas


Finite-difference approximations are derived by combining Taylor series
expansions. Using the right combinations of these expansions, it is
possible to obtain finite-difference formulas that estimate an arbitrary
order derivative with any order of truncation error. The simplest
finite-difference formula can be derived directly from a Taylor series
expansion in the 𝑗 th direction,

$$
f(x + h \hat{e}_j) = f(x) + h \frac{\partial f}{\partial x_j} + \frac{h^2}{2!} \frac{\partial^2 f}{\partial x_j^2} + \frac{h^3}{3!} \frac{\partial^3 f}{\partial x_j^3} + \ldots, \tag{6.8}
$$

where 𝑒ˆ 𝑗 is the unit vector in the 𝑗 th direction. Solving the above for the
first derivative we obtain the finite-difference formula,

$$
\frac{\partial f}{\partial x_j} = \frac{f(x + h \hat{e}_j) - f(x)}{h} + \mathcal{O}(h), \tag{6.9}
$$

where ℎ is the finite-difference step size. This approximation is called the


forward difference and is directly related to the definition of a derivative
since,

$$
\frac{\partial f}{\partial x_j} = \lim_{h \to 0} \frac{f(x + h \hat{e}_j) - f(x)}{h} \approx \frac{f(x + h \hat{e}_j) - f(x)}{h}. \tag{6.10}
$$

The truncation error is 𝒪(ℎ), and therefore this is a first-order approx-


imation. The difference between this approximation and the exact
derivative is illustrated in Fig. 6.3.
The backward difference approximation can be obtained by replac-
ing ℎ with −ℎ to yield
$$
\frac{\partial f}{\partial x_j} = \frac{f(x) - f(x - h \hat{e}_j)}{h} + \mathcal{O}(h), \tag{6.11}
$$
which is also a first-order approximation.

[Figure 6.3: Exact derivative compared to a forward-difference finite-difference approximation.]

Assuming each function evaluation yields the full vector 𝑓, the above formulas compute the 𝑗th column of the Jacobian (6.2). To compute the full Jacobian, we need to loop through each direction 𝑒̂ⱼ, add a
step, recompute 𝑓 , and compute a finite difference. Hence, the cost
of computing the complete Jacobian is proportional to the number of
input variables of interest, 𝑛 𝑥 .
For a second-order estimate of the first derivative, we can use the
expansion of 𝑓 (𝑥 − ℎ 𝑒ˆ 𝑗 ) to obtain,

𝜕𝑓 ℎ 2 𝜕2 𝑓 ℎ 3 𝜕3 𝑓
𝑓 (𝑥 − ℎ 𝑒ˆ 𝑗 ) = 𝑓 (𝑥) − ℎ + − +.... (6.12)
𝜕𝑥 𝑗 2! 𝜕𝑥 𝑗 2 3! 𝜕𝑥 𝑗 3
6 Computing Derivatives 179

Then, if we subtract this from the expansion (6.8) and solve the resulting
equation for the derivative of 𝑓 , we get the central-difference formula,

$$
\frac{\partial f}{\partial x_j} = \frac{f(x + h \hat{e}_j) - f(x - h \hat{e}_j)}{2h} + \mathcal{O}(h^2). \tag{6.13}
$$

Even more accurate estimates can also be derived by combining different


Taylor series expansions to obtain higher order truncation error terms.
However, finite-precision arithmetic eventually limits the achievable
accuracy (as will be discussed in the next section). With double precision
arithmetic there are not enough significant digits to realize a significant
advantage beyond central-difference.
We can also estimate second derivatives (or higher) by nesting finite-
difference formulas. We can use, for example, the central difference
formula (6.13) to estimate the second derivative instead of the first,

$$
\frac{\partial^2 f}{\partial x_j^2} = \frac{\left. \dfrac{\partial f}{\partial x_j} \right|_{x + h \hat{e}_j} - \left. \dfrac{\partial f}{\partial x_j} \right|_{x - h \hat{e}_j}}{2h} + \mathcal{O}(h^2). \tag{6.14}
$$

Then we can use the central difference again to estimate both 𝜕𝑓/𝜕𝑥ⱼ at 𝑥 + ℎ𝑒̂ⱼ and at 𝑥 − ℎ𝑒̂ⱼ in the above equation to obtain

$$
\frac{\partial^2 f}{\partial x_j^2} = \frac{f(x + 2h \hat{e}_j) - 2 f(x) + f(x - 2h \hat{e}_j)}{4 h^2} + \mathcal{O}(h^2). \tag{6.15}
$$

The finite-difference method can also be used to compute directional


derivatives, which represent the projection of the gradient into a given
direction. To do this, instead of stepping in orthogonal directions to
get the gradient, we need to step in the direction of interest, 𝑝. Using
the forward difference, for example,

$$
D_p f = \frac{f(x + h p) - f(x)}{h} + \mathcal{O}(h). \tag{6.16}
$$

One application of directional derivatives is to compute the slope in
line searches (Section 4.3).

6.4.2 The Step-Size Dilemma


When estimating derivatives using finite-difference formulas we are
faced with the step-size dilemma. Because each estimate has a truncation
error of 𝒪(ℎ) (or 𝒪(ℎ 2 ) when second order) we would like to choose as
small of a step size as possible to reduce this error. However, as the step
size reduces, subtractive cancellation (a finite-precision arithmetic error)
6 Computing Derivatives 180

becomes dominant. In Table 6.1, for example, we show a case where


we have 16-digit precision, and the step size was made small enough
that the reference value and the perturbed value differ only in the last
two digits. This means that when we do the subtraction required by
the finite-difference formula, the final number only has two digits of
precision. If ℎ is made small enough, this difference vanishes to zero.
If there were an infinite number of digits this wouldn’t be a problem,
but with finite precision arithmetic we are limited to larger step sizes.

𝑓 (𝑥 + ℎ) +1.234567890123431
𝑓 (𝑥) +1.234567890123456
Δ𝑓 −0.000000000000025

Table 6.1: Subtractive cancellation leads to a loss of precision and ultimately inaccu-
rate finite difference estimates.

Theoretically, the optimal step size for a first-order finite difference is approximately √𝜖, where 𝜖 is the precision of 𝑓. If we assume 𝑓 is accurate to 10⁻¹² (meaning 12 digits of precision out of a maximum of about 15 for double precision), then the optimal step size is around 10⁻⁶. The error bound is also about √𝜖, meaning that in this case the finite difference would have about 6 digits of accuracy. Similarly, for a central difference, the optimal step size scales approximately with 𝜖^(1/3), with an error bound of 𝜖^(2/3), meaning that for the previous example, the optimal
step size would be around 10−4 and it would achieve about 8 digits of
accuracy. Even though 12 digits of accuracy would be expected based
on the truncation error of 𝒪(ℎ 2 ), only a couple more digits of accuracy
are realized because of finite-precision arithmetic. This step and error
bound estimate above are just approximate and assume well-scaled
problems.
Finite-difference approximations are sometimes used with larger
steps than would be desirable from an accuracy standpoint in order to
help smooth out numerical noise or discontinuities in the model. This
type of approach can sometimes work, but it is better to address these
problems within the model whenever possible.

Tip 6.5: When using finite-differencing, always perform a step-size


study.
In practice, most gradient-based optimizers use finite-differences by default
to compute the gradients. Given the potential for inaccuracies, finite differences
are often the culprit in cases where gradient-based optimizers fail to converge.
Although some of these optimizers try to estimate a good step size, there is
no substitute for a step-size study by the user. The step-size study must be
performed for all variables and does not necessarily apply to the whole design

space. Therefore, repeating this study for other values of 𝑥 might be required.
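A step-size study can be as simple as the loop below. This sketch uses an analytic test function with a known derivative so the error can be measured directly; in practice, 𝑓 would be your model, and the reference would come from a more trustworthy derivative (or from observing where the estimates stop changing).

```python
import numpy as np

# Minimal forward-difference step-size study (Tip 6.5).
def f(x):
    return np.exp(x) / np.sqrt(np.sin(x) ** 3 + np.cos(x) ** 3)

def dfdx_exact(x):   # analytic derivative, used here as the reference
    s, c = np.sin(x), np.cos(x)
    return f(x) * (1.0 - 1.5 * s * c * (s - c) / (s ** 3 + c ** 3))

x0 = 1.5
for h in [10.0 ** -k for k in range(1, 13)]:
    fd = (f(x0 + h) - f(x0)) / h
    err = abs(fd - dfdx_exact(x0)) / abs(dfdx_exact(x0))
    print(f"h = {h:.0e}   relative error = {err:.1e}")
```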

6.4.3 Practical Implementation


A procedure for computing a Jacobian using forward finite-differences
is detailed in Alg. 6.6. It is generally helpful to scale the step size based
on the value of 𝑥 𝑖 (e.g., ℎ 𝑖 = 10−6 𝑥 𝑖 ), unless 10−6 𝑥 𝑖 is already smaller
than the default step size. Although the absolute step size usually
differs for each ℎ 𝑖 , the relative step size ℎ is often the same, though that
isn’t necessary.

Algorithm 6.6: Forward finite-difference computation of the gradients of a


vector-valued function 𝑓 (𝑥)

Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Function of interest
Outputs:
𝐽: Jacobian of 𝑓 about point 𝑥

𝑓0 = 𝑓 (𝑥) Evaluate reference values


𝐽 = matrix(𝑛 𝑓 , 𝑛 𝑥 ) Initialize Jacobian as 𝑛 𝑓 × 𝑛 𝑥 matrix
ℎ = 10−6 Default relative step size. Optional input parameter(s)
for 𝑗 = 1 to 𝑛 𝑥 do
Δ𝑥 = max(ℎ, ℎ𝑥 𝑗 ) Scaled step size, but not smaller than ℎ
𝑥 𝑗 += Δ𝑥 Modification in place for efficiency although copying vector is an option
𝑓+ = 𝑓 (𝑥) Evaluate function at perturbed point
𝐽∗,𝑗 = ( 𝑓+ − 𝑓0 ) / Δ𝑥 Fill one column of Jacobian at a time
𝑥 𝑗 −= Δ𝑥 Do not forget to reset!
end for
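A NumPy version of Alg. 6.6 is sketched below; it follows the same structure (scaled step, one column of the Jacobian per perturbation) and is checked against the exact Jacobian of Ex. 6.1.

```python
import numpy as np

# Sketch of Alg. 6.6: forward finite-difference Jacobian.
def fd_jacobian(f, x, h_rel=1e-6):
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))                    # reference function values
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = max(h_rel, h_rel * abs(x[j]))   # scaled step, bounded below
        x[j] += dx
        J[:, j] = (np.asarray(f(x)) - f0) / dx
        x[j] -= dx                           # do not forget to reset!
    return J

# Check against the exact Jacobian of Ex. 6.1 at x = (pi/4, 2):
f = lambda x: np.array([x[0] * x[1] + np.sin(x[0]), x[0] * x[1] + x[1] ** 2])
print(fd_jacobian(f, [np.pi / 4, 2.0]))   # close to [[2.707, 0.785], [2.0, 4.785]]
```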

We can also use a directional derivative in arbitrary directions to


verify the gradient computation. The directional derivative is the
projection of the gradient in the chosen direction, i.e., ∇ 𝑓 𝑇 𝑝. This
can be used to verify the gradient computed by some other method
and is especially useful when the evaluation of 𝑓 is expensive and
computing the complete gradient is time consuming. We can verify a
gradient by projecting it into some direction (say 𝑝 = [1, . . . , 1]), and
then comparing it to the directional derivative in that direction. If the
result matches the reference, then all the gradient components are most
likely correct (you might try a couple more directions just to be sure).
6 Computing Derivatives 182

However, if the result does not match, this directional derivative does
not tell you which components of the gradient are incorrect.

6.5 Complex Step

The complex-step derivative approximation, strangely enough, com-


putes derivatives of real functions using complex variables. Unlike finite
differences, the complex-step method requires access to the source code,
and therefore cannot be applied to black box components. The complex-
step method is accurate, but no more efficient than finite-differences, as
the computational cost still scales linearly with the number of variables.
This method originated with the work of Lyness77 and Lyness et al.,78 who developed formulas that use complex arithmetic for computing derivatives of real functions of arbitrary order with arbitrary order of truncation error, much like the Taylor series combination approach in finite differences.

77. Lyness, Numerical Algorithms Based on the Theory of Complex Variable. 1967
78. Lyness et al., Numerical Differentiation of Analytic Functions. 1967
The simplest of the formulas approximate the first derivative using
only one complex function evaluation and can be derived using a Taylor
series expansion. Rather than using a real step ℎ, as we did for the
derivation of the finite-difference formulas, we use a pure imaginary
step, 𝑖 ℎ. If 𝑓 is a real function in real variables, and is also analytic, we
can expand it in a Taylor series about a real point 𝑥 as follows,

$$
f(x + i h \hat{e}_j) = f(x) + i h \frac{\partial f}{\partial x_j} - \frac{h^2}{2} \frac{\partial^2 f}{\partial x_j^2} - i \frac{h^3}{6} \frac{\partial^3 f}{\partial x_j^3} + \ldots \tag{6.17}
$$

Taking the imaginary parts of both sides of this equation,

$$
\operatorname{Im}\left[ f(x + i h \hat{e}_j) \right] = h \frac{\partial f}{\partial x_j} - \frac{h^3}{6} \frac{\partial^3 f}{\partial x_j^3} + \ldots \tag{6.18}
$$

Dividing this by ℎ and solving for 𝜕 𝑓 /𝜕𝑥 𝑗 yields the complex-step


derivative approximation,
 
$$
\frac{\partial f}{\partial x_j} = \frac{\operatorname{Im}\left[ f(x + i h \hat{e}_j) \right]}{h} + \mathcal{O}(h^2), \tag{6.19}
$$

which is a second order approximation. To use this approximation, you


must provide a complex number with a perturbation in the imaginary
part, compute the original function using complex arithmetic, and take
the imaginary part of the output to obtain the derivative.
In practical terms, this means that we must convert the function
evaluation so that it can take complex numbers as inputs and compute
the corresponding complex outputs. Because we have assumed that
6 Computing Derivatives 183

𝑓 (𝑥) is a real function of a real variable, this most easily works with
models that do not already involve complex numbers, but the procedure
can be extended to work with functions that are already complex
(multicomplex step),79 or to provide exact second derivatives.80 In Tip 6.10 we explain how to convert programs to handle the required complex arithmetic for the complex-step method to work in general.

79. Lantoine et al., Using Multicomplex Variables for Automatic Computation of High-Order Derivatives. 2012
80. Fike et al., The Development of Hyper-Dual Numbers for Exact Second-Derivative Calculations. 2011

Unlike finite differences, this formula has no subtraction operation and thus no subtractive cancellation error. The only source of numerical
and thus no subtractive cancellation error. The only source of numerical
error is the truncation error. However, if ℎ is decreased to a small
enough value (say, 10−40 ), the truncation error can be eliminated. Then,
the precision of the complex-step derivative approximation (6.19) will
match the precision of 𝑓 . This is a tremendous advantage over the
finite-difference approximations (6.9) and (6.13).
Like the finite-difference approach, each evaluation yields a column
of the Jacobian (𝜕 𝑓 /𝜕𝑥 𝑗 ), and the cost of computing all the derivatives
is proportional to the number of design variables. The cost of the
complex-step method is more comparable to that of a central difference
as opposed to a forward difference because we must now compute a
real and an imaginary part for every number in our program.
If we take the real part of the Taylor series expansion (6.17), we
obtain the value of the function on the real axis,
 
$$
f(x) = \operatorname{Re}\left[ f(x + i h \hat{e}_j) \right] + \mathcal{O}(h^2). \tag{6.20}
$$

Similarly to the derivative approximation, we can make the truncation


error disappear by using a small enough ℎ. This means that no separate
evaluation of 𝑓 (𝑥) is required to get the original real value of the
function, we can simply take the real part of the complex evaluation.
What is a “small enough ℎ?” For the real function value (6.20) the
truncation error is eliminated if ℎ is such that

$$
\frac{h^2}{2} \frac{\partial^2 f}{\partial x_j^2} < \epsilon \, f(x), \tag{6.21}
$$

where 𝜖 is the relative working precision of the function evaluation.


Similarly, for the derivative approximation we require

$$
\frac{h^2}{6} \frac{\partial^3 f}{\partial x_j^3} < \epsilon \, f'(x). \tag{6.22}
$$

Although the step ℎ can be set to an extremely small value, it is not always possible to satisfy these conditions, especially when 𝑓 or 𝜕𝑓/𝜕𝑥ⱼ tend to zero in the two conditions above, respectively. In practice, a step size in the 10⁻²⁰ to 10⁻⁴⁰ range typically suffices.
6 Computing Derivatives 184

The complex-step method can be used even when the evaluation


of 𝑓 involves the solution of numerical models through computer
programs. The outer loop for computing the derivatives of multiple
functions with respect to all variables (Alg. 6.7) is similar to the one for
finite-differences. A reference function evaluation is not required, but
now the function must be able to handle complex numbers correctly.

Algorithm 6.7: Computing the gradients of a vector-valued function 𝑓 (𝑥) us-


ing the complex-step method

Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Function of interest
Outputs:
𝐽: Jacobian of 𝑓 about point 𝑥

𝐽 = matrix(𝑛 𝑓 , 𝑛 𝑥 ) Initialize Jacobian as 𝑛 𝑓 × 𝑛 𝑥 matrix


ℎ = 10−40 Typical “small enough” step size
for 𝑗 = 1 to 𝑛 𝑥 do
𝑥 𝑗 += complex(0, ℎ) Modify in place; could copy vector instead
𝑓+ = 𝑓 (𝑥) Evaluate function perturbed with complex step
𝐽∗,𝑗 = imag( 𝑓+ ) / h Extract derivatives from imaginary part
𝑥 𝑗 −= complex(0, ℎ) Do not forget to reset!
end for
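A NumPy sketch of Alg. 6.7 follows. It requires that 𝑓 accept complex inputs and propagate them with complex arithmetic, as discussed in Tip 6.10; the check against Ex. 6.1 works because NumPy's elementary functions already handle complex arguments.

```python
import numpy as np

# Sketch of Alg. 6.7: complex-step Jacobian.
def cs_jacobian(f, x, h=1e-40):
    x = np.asarray(x, dtype=complex)
    f0 = np.asarray(f(x))                  # used only to size the Jacobian
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        x[j] += 1j * h                     # perturb the imaginary part
        J[:, j] = np.imag(f(x)) / h        # derivatives from the imaginary part
        x[j] -= 1j * h                     # do not forget to reset!
    return J

f = lambda x: np.array([x[0] * x[1] + np.sin(x[0]), x[0] * x[1] + x[1] ** 2])
print(cs_jacobian(f, [np.pi / 4, 2.0]))    # matches the exact Jacobian of Ex. 6.1
```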

Tip 6.8: Check the convergence of the imaginary part.

When the solver that computes 𝑓 is iterative, it is important to change the


convergence criterion so that it checks for the convergence of the imaginary
part in addition to the existing check on the real part. The imaginary part,
which contains the derivative information, usually lags relative to the real part
in terms of convergence. Therefore, if the solver only checks for the real part, it
might yield a derivative with a precision lower than the function value.

Example 6.9: Complex step accuracy compared to finite differences.

To show how the complex-step method works, consider the following
analytic function:
$$
f(x) = \frac{e^x}{\sqrt{\sin^3 x + \cos^3 x}}. \tag{6.23}
$$
The exact derivative at 𝑥 = 1.5 is computed to 16 digits based on symbolic
differentiation as a reference value. The errors relative to this reference for
6 Computing Derivatives 185

the complex-step derivative approximation and the forward and central finite-
difference approximations are computed as

$$
\epsilon = \frac{\left| \left. \dfrac{\mathrm{d} f}{\mathrm{d} x} \right|_{\text{approx}} - \left. \dfrac{\mathrm{d} f}{\mathrm{d} x} \right|_{\text{exact}} \right|}{\left| \left. \dfrac{\mathrm{d} f}{\mathrm{d} x} \right|_{\text{exact}} \right|} \tag{6.24}
$$

As the step decreases, the forward-difference estimate initially converges


at a linear rate, since its truncation error is 𝒪(ℎ), while the central-difference
converges quadratically, as shown in Fig. 6.4. However, as the step is reduced
below a value of about 10−8 for the forward difference and 10−5 for the central
difference, subtractive cancellation errors become increasingly significant.
When ℎ is so small that no difference exists in the output (for steps smaller than
10⁻¹⁶), the finite-difference estimates eventually yield zero (and 𝜖 = 1), which
means 100% error.

[Figure 6.4: Relative error, 𝜖, of the derivative approximations (forward difference,
central difference, and complex step) as the step size, ℎ, decreases. The finite-difference
approximations initially converge as the truncation error decreases, but when the step
is too small, the subtractive cancellation errors become overwhelming. The complex-step
approximation does not suffer from this issue. Note that the x-axis is oriented so that
smaller step sizes are to the right.]

The complex-step estimate converges quadratically with decreasing step
size, as predicted by the truncation error term. The relative error reduces
to machine zero at around ℎ = 10⁻⁸ and stays there until ℎ is so small that
underflow occurs (around ℎ = 10⁻³⁰⁸ in this case).
Comparing the best accuracy of each of these approaches, we can see that
with finite differences we only achieve a fraction of the accuracy that is
obtained with the complex-step approximation.
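A short script along these lines (an illustration, not the book's code) reproduces these trends for the function in Eq. 6.23; the reference derivative is differentiated by hand as 𝑓′(𝑥) = 𝑓(𝑥)(1 − 𝑠′/(2𝑠)) with 𝑠 = sin³𝑥 + cos³𝑥.

import numpy as np

def f(x):
    return np.exp(x) / np.sqrt(np.sin(x) ** 3 + np.cos(x) ** 3)

x = 1.5
s = np.sin(x) ** 3 + np.cos(x) ** 3
sp = 3 * np.sin(x) ** 2 * np.cos(x) - 3 * np.cos(x) ** 2 * np.sin(x)
exact = f(x) * (1 - sp / (2 * s))          # hand-derived reference derivative

for h in np.logspace(-1, -20, 20):
    fd_fwd = (f(x + h) - f(x)) / h                 # forward difference
    fd_ctr = (f(x + h) - f(x - h)) / (2 * h)       # central difference
    cs = np.imag(f(x + 1j * h)) / h                # complex step
    err = lambda d: abs((d - exact) / exact)
    print(f"h={h:8.1e}  fwd={err(fd_fwd):8.1e}  ctr={err(fd_ctr):8.1e}  cs={err(cs):8.1e}")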

Tip 6.10: Code refactoring to handle complex step.

As previously mentioned, the complex-step method requires access to the source
code of the program being differentiated. The reason is that changes are required to make sure
that the program can handle complex numbers and the associated arithmetic,

that it handles logical operators consistently, and that certain functions yield
the correct derivatives.
First, the program may need to be modified to use complex numbers. In
programming languages like Fortran or C, this involves changing real valued
type declarations (e.g., double) to complex type declarations (e.g., double
complex). In some languages, such as Matlab, this is not necessary because
functions are overloaded to accept either type automatically.
Second, some changes may be required to preserve the correct logical flow
through the program. Relational logic operators such as “greater than” and
“less than” or “if” and “else” are usually not defined for complex numbers. These
operators are often used in programs, together with conditional statements, to
redirect the execution thread. The original algorithm and its “complexified”
version should follow the same execution thread, and therefore, defining
these operators to compare only the real parts of the arguments is the correct
approach. Functions that choose one argument, such as the maximum or the
minimum value, are based on relational operators. Following the previous
argument, we should determine the maximum and minimum values based on
the real parts alone.
Third, some functions need to be redefined for complex arguments. The
most common function that needs to be redefined is the absolute value function,
which for a complex number, 𝑧 = 𝑥 + 𝑖𝑦, is defined as

|𝑧| = √(𝑥² + 𝑦²).    (6.25)

This definition is not complex analytic, which is required in the derivation


of the complex-step derivative approximation. The following definition of
absolute value, (
−𝑥 − 𝑖𝑦, if 𝑥 < 0
abs(𝑥 + 𝑖𝑦) = (6.26)
+𝑥 + 𝑖𝑦, if 𝑥 ≥ 0,
yields the correct result since when 𝑦 = ℎ, the imaginary part divided by ℎ
corresponds to the slope of the absolute value function. There is an exception
at 𝑥 = 0, where the function is not analytic, but a derivative does not exist in
any case. We use the “greater or equal” in the logic above so the approximation
at least yields the correct right-sided derivative at that point.
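As a sketch (assuming Python as the host language), the redefinition in Eq. 6.26 might look as follows, with the branch testing only the real part:

def cabs(z):
    # Complex-step-safe absolute value (Eq. 6.26): branch on the real part only
    if z.real < 0:
        return -z      # -x - iy
    return z           #  x + iy

# The imaginary part carries the slope of |x|: the derivative of |x| at x = -3 is -1
x, h = -3.0, 1e-40
print(cabs(x + 1j * h).imag / h)   # -1.0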
Depending on the programming language, some trigonometric functions
may also need to be redefined. This is because some default implementations,
while correct, do not maintain accurate derivatives for small complex-step sizes
using finite precision arithmetic. These must be replaced with mathematically
equivalent implementations that avoid numerical issues.
Fortunately, most of these changes can be automated by using scripts
to process the codes, and functions can be easily redefined using operator
overloading in most programming languages. Martins et al.45 provide more
details on the problematic functions and how to implement the complex-step
method in various programming languages.∗

45. Martins et al., The Complex-Step Derivative Approximation. 2003
∗ http://bit.ly/complexstep

6.6 Algorithmic Differentiation

Algorithmic differentiation (AD)—also known as computational differentiation
or automatic differentiation—is a well-known approach based on the systematic
application of the differentiation chain rule to computer programs.81,82 The
derivatives computed with AD can match the precision of the function evaluation.
We will see that the cost of computing derivatives with AD can be either
proportional to the number of variables or to the number of functions, depending
on the type of AD, which makes it flexible. Another attractive feature of
AD is the fact that its implementation is largely automatic. To explain
AD, we start by outlining the basic theory. We then show an example,
and finally explain how the method is implemented in practice.

81. Griewank, Evaluating Derivatives. 2000
82. Naumann, The Art of Differentiating Computer Programs—An Introduction to
Algorithmic Differentiation. 2011

6.6.1 Variables and Functions as Lines of Code


The basic concept of AD is straightforward. Even long, complicated
codes consist of a sequence of basic operations (e.g., addition, multipli-
cation, cosine, exponentiation). These operations are assembled in lines
of code that compute variable values using explicit expressions, which
can be differentiated symbolically with respect to all the variables in
the expression. AD performs this symbolic differentiation and adds the
code that computes the derivatives for each variable in the code. The
derivatives of each variable are then computed using the chain rule.
The fundamental building blocks of a code are unary and binary
operations. These operations can be combined to obtain more elaborate
explicit functions, which are typically expressed in one line of computer
code. We represent the variables in the computer code as the sequence
𝑣 = 𝑣 1 , . . . , 𝑣 𝑖 , . . . , 𝑣 𝑁 , where 𝑁 is the total number of variables assigned
in the code. One or more of these variables at the start of this sequence
are given and correspond to 𝑥, while one or more of the variables
toward the end of the sequence are the outputs of interest, 𝑓 . In general,
a variable assignment corresponding to a line of code can depend on
any other variable, including itself, through a function 𝑉𝑖 for line (or
operation) 𝑖:
𝑣 𝑖 = 𝑉𝑖 (𝑣1 , 𝑣2 , . . . , 𝑣 𝑖 , . . . , 𝑣 𝑛 ), (6.27)
where we adopt the convention that the lower case represents the value
of the variable, and the upper case represents the function that computes
that value. Except for the simplest codes, many of the variables in the
code are overwritten due to iterative loops.
To understand AD, it is useful to imagine a version of the code
where all the loops are unrolled. That is, instead of overwriting variables,

we just create new versions of those variables. Then, we can represent


the computations in the code in a sequence with no loops such that
each variable in this larger set only depends on previous variables, and
then
𝑣 𝑖 = 𝑉𝑖 (𝑣1 , 𝑣2 , . . . , 𝑣 𝑖−1 ). (6.28)
Given such a sequence of operations, and the derivatives for each
operation, we can apply the chain rule to obtain the derivatives for the
entire sequence. The chain rule can be applied in two ways. In the
forward mode, we fix one input variable and work forward through the
outputs until we get the desired total derivative. In the reverse mode, we
fix one output variable and work backwards through the inputs until
we get the desired total derivative.

6.6.2 Forward Mode AD


The chain rule for the forward mode can be written as:

d𝑣𝑖/d𝑣𝑗 = Σ_{𝑘=𝑗}^{𝑖−1} (𝜕𝑉𝑖/𝜕𝑣𝑘) (d𝑣𝑘/d𝑣𝑗),    (6.29)

where 𝑉𝑖 represents an explicit function. Using the forward mode, we


evaluate a sequence of these expressions by fixing 𝑗 (effectively choosing
one input 𝑣 𝑗 ) and incrementing 𝑖 to get the derivative of each variable 𝑣 𝑖 .
We only need to sum up to 𝑖 − 1 because of the form (6.28), where each
𝑣 𝑖 only depends on variables that precede it. For a more convenient
notation, we define a new variable that represents the derivative of
variable 𝑖 with respect to a fixed input 𝑗 as:

𝑣¤𝑖 ≜ d𝑣𝑖/d𝑣𝑗.    (6.30)

Once we are done applying the chain rule (6.29) for the chosen input
variable, we end up with the corresponding full column of the Jacobian,
i.e., the tangent vector.
Suppose we have four variables: 𝑣1 , 𝑣2 , 𝑣3 , 𝑣4 and 𝑥 = 𝑣1 , 𝑓 = 𝑣 4 ,
and we want d 𝑓 /d𝑥. Using the above formula we set 𝑗 = 1 (as we want
the derivative with respect to 𝑣1 = 𝑥) and increment in 𝑖 to get the

sequence of derivatives

𝑣¤1 = 1
𝑣¤2 = (𝜕𝑉2/𝜕𝑣1) 𝑣¤1
𝑣¤3 = (𝜕𝑉3/𝜕𝑣1) 𝑣¤1 + (𝜕𝑉3/𝜕𝑣2) 𝑣¤2                                   (6.31)
𝑣¤4 = (𝜕𝑉4/𝜕𝑣1) 𝑣¤1 + (𝜕𝑉4/𝜕𝑣2) 𝑣¤2 + (𝜕𝑉4/𝜕𝑣3) 𝑣¤3 = d𝑓/d𝑥,

The colored derivatives show how the values are reused. In each step
we just need to compute the partial derivatives of the current operation
𝑉𝑖 and then multiply using total derivatives that have already been
computed. We move forward evaluating partial derivatives of 𝑉 in the
same sequence that we evaluate the original function. This is convenient
because all of the unknowns are partial derivatives, meaning that we
only need to compute derivatives based on the operation at hand (or
line of code).
Using forward mode AD, obtaining derivatives with respect to
additional outputs is either free (e.g., d𝑣3 /d𝑣1 , 𝑣¤ 3 in Eq. 6.31) or
requires only one more line of computation (e.g., if we had an additional
output 𝑣5 ), and thus has a negligible additional cost for a large code.
However, if we want the derivatives with respect to additional inputs
(e.g., d𝑣4 /d𝑣2 ) we would need to evaluate an entire set of similar
calculations. Thus, the cost of the forward mode scales linearly with
the number of inputs and is practically independent of the number of
outputs.

Example 6.11: Forward mode AD for explicit functions.

Consider the function with two inputs and two outputs from Ex. 6.1. The
explicit expressions in this function could be evaluated using only two lines of
code. However, to make the AD process more apparent, we write the code such
that each line has a single unary or binary operation, which is how a computer
ends up evaluating the expression.

𝑣1 = 𝑉1 (𝑣1 ) = 𝑥1
𝑣2 = 𝑉2 (𝑣2 ) = 𝑥2
𝑣3 = 𝑉3 (𝑣1 , 𝑣2 ) = 𝑣1 𝑣 2
𝑣4 = 𝑉4 (𝑣1 ) = sin 𝑣1 (6.32)
𝑣5 = 𝑉5 (𝑣3 , 𝑣4 ) = 𝑣3 + 𝑣4 = 𝑓1
𝑣6 = 𝑉6 (𝑣2 ) = 𝑣 22
𝑣7 = 𝑉7 (𝑣3 , 𝑣6 ) = 𝑣3 + 𝑣6 = 𝑓2

[Figure 6.5: Dependency graph for the numerical example evaluations (6.32).]

The operations above result in the dependency graph shown in Fig. 6.5.
Say we want to compute d 𝑓2 /d𝑥1 , which in our example is d𝑣7 /d𝑣1 . The
evaluation point is the same as in Ex. 6.1: 𝑥 = (𝜋/4, 2). Using the forward
mode, set the seed for the corresponding input, 𝑣¤ 1 to one, and the seed for the
other input to zero. Then we get the sequence,

𝑣¤1 = 1
𝑣¤2 = 0
𝑣¤3 = (𝜕𝑉3/𝜕𝑣1) 𝑣¤1 + (𝜕𝑉3/𝜕𝑣2) 𝑣¤2 = 𝑣2 𝑣¤1 = 2
𝑣¤4 = (𝜕𝑉4/𝜕𝑣1) 𝑣¤1 = (cos 𝑣1) 𝑣¤1 = 0.707 . . .                           (6.33)
𝑣¤5 = (𝜕𝑉5/𝜕𝑣3) 𝑣¤3 + (𝜕𝑉5/𝜕𝑣4) 𝑣¤4 = 𝑣¤3 + 𝑣¤4 = 2.707 . . . = 𝜕𝑓1/𝜕𝑥1
𝑣¤6 = (𝜕𝑉6/𝜕𝑣2) 𝑣¤2 = 2𝑣2 𝑣¤2 = 0
𝑣¤7 = (𝜕𝑉7/𝜕𝑣3) 𝑣¤3 + (𝜕𝑉7/𝜕𝑣6) 𝑣¤6 = 𝑣¤3 + 𝑣¤6 = 2 = 𝜕𝑓2/𝜕𝑥1,
where we have evaluated the partial derivatives numerically. We now have a
procedure (not a symbolic expression) for computing d 𝑓2 /d𝑥1 for any (𝑥 1 , 𝑥2 ).
While we set out to compute d 𝑓2 /d𝑥1 , we also obtained d 𝑓1 /d𝑥 1 as a
byproduct. For a given input, we can obtain the derivatives for all outputs for
essentially the same cost as one output. In contrast, if we want the derivative
with respect to the other input, d 𝑓1 /d𝑥 2 , a new sequence of calculations is
necessary. Because this example contains so few operations the difference is
small, but for a long program with many inputs, the difference will be large.
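To preview how operator overloading (Tip 6.14) realizes the forward mode, here is a minimal dual-number sketch in Python applied to the function of this example; the Dual class and the helper dsin are illustrative names under our assumptions, not a library API.

import math

class Dual:
    # Minimal forward-mode AD type carrying a value and a derivative (seed)
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dsin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def func(x1, x2):
    # Same sequence of operations as Eq. 6.32
    v3 = x1 * x2
    f1 = v3 + dsin(x1)
    f2 = v3 + x2 * x2
    return f1, f2

# Seed x1 with 1 to get d/dx1 of both outputs in one pass (cf. Eq. 6.33)
f1, f2 = func(Dual(math.pi / 4, 1.0), Dual(2.0, 0.0))
print(f1.dot, f2.dot)   # approximately 2.707... and 2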

So far, we have assumed that we are computing Cartesian derivatives


(i.e., derivatives with respect to each orthogonal component of 𝑥).
However, just like for finite differences, it is possible to compute
directional derivatives using forward mode AD. This is done by setting
the appropriate seed in the 𝑣¤ ’s that correspond to the inputs in a
vectorized manner. Suppose we have 𝑥 , [𝑣1 , . . . , 𝑣 𝑛 𝑥 ]𝑇 . To compute
the Cartesian derivative with respect to 𝑥1 , for example, we would
set the seed as a vector 𝑣¤ = [1, 0, . . . , 0]𝑇 , and similarly for the other

components. To compute a directional derivative in direction 𝑝, we


would set the seed to the vector 𝑣¤ = 𝑝/||𝑝||.

6.6.3 Reverse Mode AD


The reverse mode is also based on the chain rule, but using the alternative
form:
d𝑣𝑖/d𝑣𝑗 = Σ_{𝑘=𝑗+1}^{𝑖} (𝜕𝑉𝑘/𝜕𝑣𝑗) (d𝑣𝑖/d𝑣𝑘),    (6.34)

where the summation happens in reverse (starts at 𝑖 and decrements


to 𝑗 + 1). This is less intuitive than the forward chain rule, but it is
mathematically valid. Here, we fix the index 𝑖 corresponding to the
output of interest and decrement 𝑗 until we get the desired derivative.
Similarly to the forward mode total derivative notation (6.30), we
define a more convenient notation for the variables that carry the total
derivatives with a fixed 𝑖,
𝑣¯𝑗 ≜ d𝑣𝑖/d𝑣𝑗,    (6.35)
which are sometimes called adjoint variables. These reverse mode
variables represent the derivatives of one output, 𝑖, with respect to all
the variables, instead of the derivatives of all the variables with respect
to one input, 𝑗, in the forward mode. Once we are done applying the
reverse chain rule (6.34) for the chosen output variable, we end up with
the corresponding full row of the Jacobian, i.e., the gradient vector.
Applying the reverse mode to the same four variable example as
before, we get the following sequence of derivative computations (we
set 𝑖 = 4 and decrement 𝑗):

𝑣¯4 = 1
𝑣¯3 = (𝜕𝑉4/𝜕𝑣3) 𝑣¯4
𝑣¯2 = (𝜕𝑉3/𝜕𝑣2) 𝑣¯3 + (𝜕𝑉4/𝜕𝑣2) 𝑣¯4                                   (6.36)
𝑣¯1 = (𝜕𝑉2/𝜕𝑣1) 𝑣¯2 + (𝜕𝑉3/𝜕𝑣1) 𝑣¯3 + (𝜕𝑉4/𝜕𝑣1) 𝑣¯4 = d𝑓/d𝑥,
The partial derivatives for 𝑉 must be computed for 𝑉4 first, then 𝑉3 , and
so on. Therefore, we have to traverse the code in reverse. In practice,
not every variable depends on every other variable, so a dependency
graph is created during code evaluation. Then, when computing the
adjoint derivatives we traverse the dependency graph in reverse. As
before, the derivatives we need to compute in each line are only partial
derivatives.

Using the reverse mode of AD, obtaining derivatives with respect to


additional inputs is either free (e.g., d𝑣4 /d𝑣2 , 𝑣¯ 2 ) or requires one more
line of code to be evaluated. However, if we want the derivatives with
respect to additional outputs (e.g., d𝑣 3 /𝑑𝑣 1 ) we would need to evaluate
a different sequence of derivatives. Thus, the cost of the reverse
mode scales linearly with the number of outputs and is practically
independent of the number of inputs.
One complication with the reverse mode is that the resulting se-
quence of derivatives requires the values of the variables, starting with
the last ones and progressing in reverse. Therefore, the code needs to
run in a forward pass first and all the variables must be stored for use
in the reverse pass, which has an impact on memory usage.

Example 6.12: Reverse mode AD.

To compute the same derivative using the reverse mode, we need to choose
the output 𝑓2 by setting the seed for the corresponding output, 𝑣¯7, to one.
Then we get,

𝑣¯7 = 1
𝑣¯6 = (𝜕𝑉7/𝜕𝑣6) 𝑣¯7 = 𝑣¯7 = 1
𝑣¯5 = 0                                   (nothing depends on 𝑣5)
𝑣¯4 = (𝜕𝑉5/𝜕𝑣4) 𝑣¯5 = 𝑣¯5 = 0
𝑣¯3 = (𝜕𝑉7/𝜕𝑣3) 𝑣¯7 + (𝜕𝑉5/𝜕𝑣3) 𝑣¯5 = 𝑣¯7 + 𝑣¯5 = 1                     (6.37)
𝑣¯2 = (𝜕𝑉6/𝜕𝑣2) 𝑣¯6 + (𝜕𝑉3/𝜕𝑣2) 𝑣¯3 = 2𝑣2 𝑣¯6 + 𝑣1 𝑣¯3 = 4.785 . . . = 𝜕𝑓2/𝜕𝑥2
𝑣¯1 = (𝜕𝑉4/𝜕𝑣1) 𝑣¯4 + (𝜕𝑉3/𝜕𝑣1) 𝑣¯3 = (cos 𝑣1) 𝑣¯4 + 𝑣2 𝑣¯3 = 2 = 𝜕𝑓2/𝜕𝑥1,
While we set out to evaluate d𝑓2/d𝑥1, we also computed d𝑓2/d𝑥2 as a
byproduct. For each output, the derivatives with respect to all the inputs come
at the cost of evaluating only one more line of code. Conversely, if we want the
derivatives of 𝑓1, a whole new set of computations is needed.
In the reverse computation sequence above, we needed the values of some
of the variables in the original code to be precomputed and stored. In addition,
the reverse computation requires the dependency graph information (which,
for example, is how we would know that nothing depended on 𝑣5 ). In forward
mode, the computation of a given derivative, 𝑣¤ 𝑖 , requires the partial derivatives
of the line of code that computes 𝑣 𝑖 with respect to its inputs. In the reverse case,
however, to compute a given derivative, 𝑣¯ 𝑗 , we require the partial derivatives of
the functions that the current variable 𝑣 𝑗 affects with respect to 𝑣 𝑗 . Knowledge

of the functions that a variable affects is not encoded in that variable's
computation, and that is why the dependency graph is required.
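As a sketch of the mechanics (not an AD tool), the reverse pass of Eq. 6.37 can be written out by hand in Python, with the partial derivatives coded explicitly rather than recorded on a tape:

import math

# Forward pass (Eq. 6.32) at x = (pi/4, 2)
v1, v2 = math.pi / 4, 2.0
v3 = v1 * v2
v4 = math.sin(v1)
v5 = v3 + v4        # f1
v6 = v2 ** 2
v7 = v3 + v6        # f2

# Reverse pass for f2 (seed v7_bar = 1), mirroring Eq. 6.37
v7b = 1.0
v6b = v7b * 1.0                        # dV7/dv6 = 1
v5b = 0.0                              # nothing depends on v5
v4b = v5b * 1.0                        # dV5/dv4 = 1
v3b = v7b * 1.0 + v5b * 1.0            # dV7/dv3 = dV5/dv3 = 1
v2b = v6b * 2 * v2 + v3b * v1          # df2/dx2 = 4.785...
v1b = v4b * math.cos(v1) + v3b * v2    # df2/dx1 = 2
print(v1b, v2b)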

Unlike forward mode AD and finite differences, it is not possible


to compute a directional derivative by setting the appropriate seeds.
While the seeds in the forward mode are associated with the inputs, the
seeds for the reverse mode are associated with the outputs. Suppose
we have multiple functions of interest, 𝑓 , [𝑣 𝑛−𝑛 𝑓 , . . . , 𝑣 𝑛 ]𝑇 . To find
the derivatives of 𝑓1 in a vectorized operation, we would set 𝑣¯ =
[1, 0, . . . , 0]𝑇 . A seed with multiple nonzero components computes
the derivatives of a weighted function with respect to all the variables,
where the weight for each function is determined by the corresponding
𝑣¯ value.
In summary, if we want to compute a Jacobian matrix of partial
derivatives of the outputs of interest with respect to all the inputs, the
forward mode computes a column of derivatives in the Jacobian at each
pass, while the reverse mode computes a row of the same Jacobian
at each pass. Therefore, if we have more outputs (e.g., objective and
constraints) than inputs (design variables) then the forward mode is
more efficient. On the other hand, if we have many more inputs than
outputs, then the reverse mode is more efficient. If the number of inputs
is similar to the number of outputs, neither is particularly efficient and
profiling the resulting code is necessary to determine which approach is
faster. In practice, because of the need for creating a dependency graph
the reverse mode is usually only favorable if the number of inputs is
significantly larger than the number of outputs.

Example 6.13: Comparison of source code transformations and operator


overloading.
There are two main ways to implement AD: by source code transformation or
by using derived datatypes and operator overloading.
To implement AD by source transformation, the whole source code must
be processed with a parser and all the derivative computations are introduced
as additional lines of code. This approach is demonstrated below, using a
forward mode, for the algorithm we used earlier to illustrate the difficulties of
symbolic differentiation (Ex. 6.4). Differentiating this function with AD is much
simpler. We set the seed, 𝑥¤ , to one, and for each function assignment we add
the corresponding derivative line. As the loops are repeated, 𝑓¤ accumulates
the derivative as 𝑓 is successively updated.
Input: 𝑥, 𝑥¤ Set seed 𝑥¤ = 1 to get d 𝑓 /d𝑥
𝑓 =𝑥
𝑓¤ = 𝑥¤

for 𝑖 = 1 to 20 do
𝑓¤ = ( 𝑥¤ + 𝑓¤) · cos(𝑥 + 𝑓 ) Differentiate using 𝑓 from before the update
𝑓 = sin(𝑥 + 𝑓 )
end for
return 𝑓 , 𝑓¤ d 𝑓 /d𝑥 is given by 𝑓¤
The reverse mode AD version of the same function is shown below. We set
𝑓¯ = 1 to get the derivative of 𝑓 . Now we need two distinct loops: a forward one
that computes the original function, and a reverse one that accumulates the
derivatives in reverse starting from the last derivative in the chain. Because the
derivatives that are accumulated in the reverse loop depend on the intermediate
values of the variables, we need to store all the variables in the forward loop.
Here we do it via a stack, which is a data structure that stores a one-dimensional
array. Using the stack concept, we can only add an element to the top of the
stack (push) and take the element from the top of the stack (pop).
Input: 𝑥, 𝑓¯ Set 𝑓¯ = 1 to get d 𝑓 /d𝑥
𝑓 =𝑥
for 𝑖 = 1 to 20 do
push( 𝑓 ) Put current value of 𝑓 on top of stack
𝑓 = sin(𝑥 + 𝑓 )
end for
𝑥¯ = 0
for 𝑖 = 20 to 1 do Do the reverse loop
𝑓 = pop() Get value of 𝑓 from top of stack
𝑥¯ = 𝑥¯ + cos(𝑥 + 𝑓 ) · 𝑓¯ Accumulate the direct dependence on 𝑥 in this iteration
𝑓¯ = cos(𝑥 + 𝑓 ) · 𝑓¯
end for
𝑥¯ = 𝑥¯ + 𝑓¯ Contribution through the initial assignment 𝑓 = 𝑥
return 𝑓 , 𝑥¯ d 𝑓 /d𝑥 is given by 𝑥¯
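For reference, here is a runnable Python transcription of the two versions above (our illustration, not the book's code); note that the forward derivative statement must use the value of 𝑓 before it is overwritten, and the reverse version stores those values on a stack.

import math

def f_and_dfdx_forward(x, n=20):
    # Forward mode, mirroring the source-transformed loop (seed xdot = 1)
    f, fdot = x, 1.0
    for _ in range(n):
        fdot = (1.0 + fdot) * math.cos(x + f)   # uses f from before the update
        f = math.sin(x + f)
    return f, fdot

def f_and_dfdx_reverse(x, n=20):
    # Reverse mode: forward sweep stores f on a stack, reverse sweep accumulates adjoints
    f, stack = x, []
    for _ in range(n):
        stack.append(f)                     # push the value of f used by this iteration
        f = math.sin(x + f)
    fbar, xbar = 1.0, 0.0                   # seed the output adjoint
    for _ in range(n):
        fold = stack.pop()
        xbar += math.cos(x + fold) * fbar   # direct dependence on x in this iteration
        fbar = math.cos(x + fold) * fbar    # dependence through the previous f
    xbar += fbar                            # contribution through the initial f = x
    return f, xbar

print(f_and_dfdx_forward(0.5))   # both return the same value and derivative
print(f_and_dfdx_reverse(0.5))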

Tip 6.14: Operator overloading versus source code transformation.

To use the derived types and operator overloading approach to AD, we need a
language that supports these features, such as C++, Fortran 90, Python, Julia,
Matlab, etc. The derived types feature is used to replace all the real numbers
in the code, 𝑣, with a dual number type that includes both the original real
number and the corresponding derivative as well, i.e., 𝑢 = (𝑣, 𝑣¤ ). Then, all
operations are redefined (overloaded) such that, in addition to the result of
the original operations, they yield the derivative of that operation. All these
additional operations are performed “behind the scenes” without adding much
code. Except for the variable declarations and setting the seed, the code remains
exactly the same as the original.
There are AD tools available for most programming languages, including
Fortran,83 C/C++,84 Python,85 Julia,86 and Matlab.87 They have been extensively
developed and provide the user with great functionality, including the
calculation of higher-order derivatives, multivariable derivatives, and reverse
mode options.†

83. Hascoët et al., TAPENADE 2.1 User's Guide. 2004
84. Griewank et al., Algorithm 755: ADOL-C: A Package for the Automatic
Differentiation of Algorithms Written in C/C++. 1996
85. Wiltschko et al., Tangent: automatic differentiation using source code
transformation in Python. 2017
86. Revels et al., Forward-Mode Automatic Differentiation in Julia. 2016
87. Neidinger, Introduction to Automatic Differentiation and MATLAB
Object-Oriented Programming. 2010
† Although some AD tools can be applied recursively to yield higher-order
derivatives, this approach is not typically efficient and is sometimes unstable.88
88. Betancourt, A geometric theory of higher-order automatic differentiation. 2018

The operator overloading approach is much more elegant, since the original
code stays practically the same and can be maintained directly. The source
code transformation approach, on the other hand, enlarges the original code
and results in code that is less readable, making it hard to debug the new
extended code. Instead of maintaining source code transformed by AD, it is
advisable to work with the original source, and devise a workflow where the
parser is rerun before compiling a new version. The advantage of the source
code transformation is that it tends to yield faster code, and it is easier to see
what operations actually take place when debugging.

6.7 Implicit Analytic Methods—Direct and Adjoint

Direct and adjoint methods—which we refer to jointly as implicit


analytic methods—linearize the model governing equations to obtain
a system of linear equations where the derivatives are the unknown
variables that can be solved for. Like the complex-step method and AD,
implicit analytic methods can compute derivatives with a precision
matching that of the function evaluation. Like AD, there are two main
methods—direct and adjoint—that are analogous to the AD forward
and reverse modes.
Analytic methods can be thought of as being in between the finite-
difference method and AD in terms of the variables that they deal with.
With finite differences, we only need to be aware of inputs and outputs,
while AD involves dealing with every single variable assignment in
the code. Analytic methods work at the model level, where we require
knowledge of the governing equations and the corresponding state
variables.
There are two main approaches to deriving implicit analytic meth-
ods: continuous and discrete. The continuous approach linearizes
the original governing equations in PDE form, and then discretizes
this continuous linearization. The discrete approach linearizes the
discrete form of the governing equations instead. Each approach has its
advantages and disadvantages. The discrete approach is preferred and
is easier to generalize, so we explain the discrete approach exclusively.

6.7.1 Residuals and Functions


As mentioned in Chapter 3, a numerical model can be written as a set
of residuals that need to be driven to zero,

𝑟 = 𝑅(𝑥, 𝑢(𝑥)) = 0, (6.38)



where these equations are solved for the 𝑛𝑢 state variables 𝑢 for a given
fixed set of design variables 𝑥. This means that 𝑢 is an implicit function
of 𝑥.
The functions of interest, 𝑓 , can in general be written as explicit
functions of the state variables and the design variables, i.e.,

𝑓 = 𝐹(𝑥, 𝑢(𝑥)). (6.39)

Therefore, 𝑓 depends not only explicitly on the design variables, but also
implicitly through the governing equations (6.38). This dependency is
illustrated in Fig. 6.6.

[Figure 6.6: Relationship between functions and design variables for a general set of
implicit equations. The implicit equations 𝑟 = 0 define the states 𝑢 for a given 𝑥, so the
𝑛𝑓 functions of interest 𝑓 end up depending both explicitly and implicitly on the 𝑛𝑥
design variables 𝑥.]

6.7.2 Direct and Adjoint Derivative Equations


The derivatives we ultimately want to compute are the ones in the
Jacobian d 𝑓 /d𝑥. Given the explicit and implicit dependence of 𝑓 on 𝑥,
we can use the chain rule to write the total derivative Jacobian of 𝑓 as
d𝑓/d𝑥 = 𝜕𝐹/𝜕𝑥 + (𝜕𝐹/𝜕𝑢)(d𝑢/d𝑥),    (6.40)
where the result is an (𝑛 𝑓 × 𝑛 𝑥 ) matrix.
In this context, the total derivatives, d 𝑓 /d𝑥, take into account the
change in 𝑢 that is required to keep the residuals of the governing
equations, Eq. 6.38, equal to zero. The partial derivatives represent just
the variation of 𝑓 = 𝐹(𝑥, 𝑢) with respect to changes in 𝑥 or 𝑢 without
any regard to satisfying the governing equations. To better understand
this, imagine computing these derivatives using finite differences. For
the total derivatives, we would perturb 𝑥, solve the governing equations
to obtain 𝑢, and then compute 𝑓 , which would account for both
dependency paths in Fig. 6.6. To compute the partial derivatives 𝜕𝐹/𝜕𝑥
and 𝜕𝐹/𝜕𝑢, however, we would perturb 𝑥 or 𝑢 and just recompute 𝑓
without solving the governing equations. By definition, d𝑢/d𝑥 is a total
derivative because 𝑢 depends on 𝑥 through the governing equations, a
total derivative that would be costly to compute using finite differences.

In general, partial derivative terms are easy to compute, while total


derivatives are not.
The total derivative of the governing equations residuals (6.38) with
respect to the design variables can also be derived using the chain rule.
Furthermore, the total derivatives of the residuals must be zero if the
governing equations are to remain satisfied, and we can write:

d𝑟/d𝑥 = 𝜕𝑅/𝜕𝑥 + (𝜕𝑅/𝜕𝑢)(d𝑢/d𝑥) = 0,    (6.41)
where d𝑟/d𝑥 and d𝑢/d𝑥 are both (𝑛𝑢 × 𝑛 𝑥 ) matrices, and the Jacobian,
𝜕𝑅/𝜕𝑢, is a square matrix of size (𝑛𝑢 × 𝑛𝑢 ).
We can visualize the requirement for the total derivative (6.41) to
be zero in Fig. 6.7, which is a simplified representation of the set of
points that satisfy the governing equations. In this case, it is just a line
that maps a scalar 𝑥 to a scalar 𝑢. In the general case, the governing
equations map 𝑥 ∈ R𝑛 𝑥 to 𝑢 ∈ R𝑛𝑢 and the set of points that satisfy the
governing equations is a manifold in 𝑥-𝑢 space. As a result, any change,
d𝑥, must be accompanied by the appropriate change, d𝑢, so that the
governing equations are still satisfied. If we look at small perturbations
about a feasible point and want to remain feasible, the variations of 𝑥
and 𝑢 are no longer independent, because the total derivative of the
governing equation residuals (6.41) with respect to 𝑥 must be zero.
As in the total derivative of the function of interest (6.40), the partial
derivatives here do not take into account the solution of the governing
equations and are therefore much more easily computed than the total
derivatives. However, if we provide the two partial derivative terms in
the equation above, we can compute the total derivatives by solving
the linear system, i.e.,

d𝑢/d𝑥 = −(𝜕𝑅/𝜕𝑢)⁻¹ (𝜕𝑅/𝜕𝑥),    (6.42)

[Figure 6.7: The residuals of the governing equations, 𝑅(𝑥, 𝑢) = 0, determine the values
of 𝑢 for a given 𝑥. Given a point that satisfies the equations, a perturbation in 𝑥 about
that point must be accompanied by the appropriate change in 𝑢 for the equations to
remain satisfied.]

where the inverse of the 𝜕𝑅/𝜕𝑢 matrix does not necessarily mean that
we actually find the inverse; rather, it represents the solution of the
linear system using any suitable method (e.g., LU factorization). Since
d𝑢/d𝑥 is a matrix with 𝑛𝑥 columns, this linear system needs to be
solved for each design variable, with the corresponding column of the right-hand side
matrix 𝜕𝑅/𝜕𝑥. Substituting this result into the total derivative (6.40),
we obtain:

d𝑓/d𝑥 = 𝜕𝐹/𝜕𝑥 − (𝜕𝐹/𝜕𝑢)(𝜕𝑅/𝜕𝑢)⁻¹(𝜕𝑅/𝜕𝑥),    (6.43)
where all the derivative terms on the right-hand side are partial deriva-
tives. The partial derivatives in this equation can be computed using any

of the methods that we have described earlier: symbolic differentiation,


finite difference, complex step, or AD.
The total derivative equation (6.43) shows that there are actually
two ways to compute the total derivatives: the direct method and the
adjoint method.
The direct method (which is already outlined above) consists in
solving the linear system (6.42) and substituting d𝑢/d𝑥 into Eq. 6.40.
We can express this by replacing the last two Jacobians with

𝜙 ≜ (𝜕𝑅/𝜕𝑢)⁻¹ (𝜕𝑅/𝜕𝑥).    (6.44)

Multiplying this by 𝜕𝑅/𝜕𝑢, we get the linear system,

(𝜕𝑅/𝜕𝑢) 𝜙 = 𝜕𝑅/𝜕𝑥,    (6.45)

which we can solve to compute 𝜙. Then, we can replace 𝜙 in the total
derivative equation,

d𝑓/d𝑥 = 𝜕𝐹/𝜕𝑥 − (𝜕𝐹/𝜕𝑢) 𝜙.    (6.46)
This is also called the forward mode, as it is analogous to forward-mode
AD. As previously mentioned, we need to solve the linear
system (6.45) for one component of 𝑥 at a time, and this equation has
no dependence on any of the outputs 𝐹. Solving the linear system is
the most computationally expensive operation in this procedure, and
so the cost of this approach scales with the number of inputs 𝑛 𝑥 , but is
essentially independent of the number of outputs 𝑛 𝑓 .
The adjoint method changes the linear system that needs to be solved
to compute the total derivatives (6.43). Instead of solving the linear
system with 𝜕𝑅/𝜕𝑥 as the right-hand side, we solve it with 𝜕𝐹/𝜕𝑥 in
the right-hand side. This corresponds to replacing the two Jacobians in
the middle by a new matrix of unknowns,

𝜓ᵀ ≜ (𝜕𝐹/𝜕𝑢)(𝜕𝑅/𝜕𝑢)⁻¹,    (6.47)

where the columns of 𝜓 are called the adjoint vectors. Multiplying by
𝜕𝑅/𝜕𝑢 on the right and taking the transpose of the whole equation, we
get the adjoint equations,

(𝜕𝑅/𝜕𝑢)ᵀ 𝜓 = (𝜕𝐹/𝜕𝑢)ᵀ.    (6.48)
This equation has no dependence on 𝑥. Again, this is the most expensive
operation and so the cost of the adjoint method scales with the number

of outputs 𝑛 𝑓 and is essentially independent of the number of inputs


𝑛 𝑥 . Each adjoint vector is associated with a function of interest and
is found by solving the linear system above with the corresponding
row of 𝜕𝐹/𝜕𝑢. The solution can then be used in the total derivative
equation,

d𝑓/d𝑥 = 𝜕𝐹/𝜕𝑥 − 𝜓ᵀ (𝜕𝑅/𝜕𝑥).    (6.49)
This is also called the reverse mode, as it is analogous to reverse mode
AD.
The two approaches differ in cost, depending on the total derivatives
that are required. The direct and adjoint methods require exactly the
same partial derivative matrices, and the linear systems require the
factorization of the same Jacobian matrix, 𝜕ℛ/𝜕𝑢. The difference
in cost between the two approaches boils down to how many times
the corresponding linear system needs to be solved, as summarized
in Table 6.2. The direct method requires a solution of the linear
system (6.45) for each design variable in 𝑥, while the adjoint method
requires a solution of the linear system (6.48) for each function of
interest in 𝑓 .

Table 6.2: Cost comparison of computing sensitivities for direct and adjoint methods.

Step                              Direct       Adjoint
Partial derivative computation    Same         Same
Linear solution                   𝑛𝑥 times     𝑛𝑓 times
Matrix multiplications            Same         Same

Example 6.15: Differentiating an implicit function.

Consider the following simplified equation for the natural frequency of a


beam:
𝑓 = 𝜆𝑚²,    (6.50)

where 𝜆 is a function of 𝑚 through the following relationship:

𝜆/𝑚 + cos 𝜆 = 0.    (6.51)
Our goal is to compute the derivative d 𝑓 /d𝑚, but 𝜆 is an implicit function of 𝑚.
In other words, we cannot find an explicit expression for 𝜆 as a function of 𝑚,
substitute that expression into Eq. 6.50, and then differentiate normally.
Fortunately, the direct and adjoint methods will allow us to compute this
derivative.

Referring back to our nomenclature,

𝐹(𝑥, 𝑢(𝑥)) ≜ 𝐹(𝑚, 𝜆(𝑚)) = 𝜆𝑚²,
                                                        (6.52)
𝑅(𝑥, 𝑢(𝑥)) ≜ 𝑅(𝑚, 𝜆(𝑚)) = 𝜆/𝑚 + cos 𝜆 = 0,
where 𝑚 is the design variable and 𝜆 is the state variable. The partial derivatives
that we need to compute the total derivative (6.43) are:

𝜕𝐹 𝜕𝑓 𝜕𝐹 𝜕𝑓
= = 2𝜆𝑚, = = 𝑚2
𝜕𝑥 𝜕𝑚 𝜕𝑢 𝜕𝜆
(6.53)
𝜕𝑅 𝜕𝑅 𝜆 𝜕𝑅 𝜕𝑅 1
= =− 2, = = − sin 𝜆
𝜕𝑥 𝜕𝑚 𝑚 𝜕𝑢 𝜕𝜆 𝑚
Because this is a problem of only one function of interest and one design
variable there is no distinction between the direct and adjoint methods (forward
and reverse), and the matrix inverse is simply a division. Substituting these
partial derivatives into the total derivative equation (6.43) yields:

d𝑓 𝜆
= 2𝜆𝑚 + 1 . (6.54)
d𝑚 − sin 𝜆
𝑚

Thus, we are able to obtain the desired derivative in spite of the implicitly
defined function. Here, it was possible to get an explicit expression for the total
derivative, but in general, it is only possible to get a numeric value.
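As a numerical check (an illustration, not from the text), the following script solves the residual for 𝜆 with a few Newton iterations and compares Eq. 6.54 against a complex-step derivative of the coupled solve; the value 𝑚 = 3 and the starting guess are arbitrary choices for which the residual has a positive root.

import numpy as np

def solve_lam(m, lam=2.0, iters=50):
    # Newton iterations on R(m, lam) = lam/m + cos(lam) = 0
    # (runs in complex arithmetic when m is complex; see Tip 6.8)
    for _ in range(iters):
        r = lam / m + np.cos(lam)
        lam = lam - r / (1.0 / m - np.sin(lam))
    return lam

def f(m):
    return solve_lam(m) * m ** 2        # f = lam * m^2 with lam from the residual

m = 3.0                                 # illustrative design variable value
lam = solve_lam(m)
dfdm_analytic = 2 * lam * m + lam / (1.0 / m - np.sin(lam))   # Eq. 6.54
dfdm_cs = np.imag(f(m + 1e-30j)) / 1e-30                      # complex-step check
print(dfdm_analytic, dfdm_cs)           # the two values should agree closely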

Example 6.16: Implicit analytic differentiation for two equations

Suppose we want to find the rectangle of a given area that is inscribed in


an ellipse with given semi-major axes 𝑥1, 𝑥2, as shown in Fig. 6.8. The equation
for the ellipse can be written as the residual,

𝑟1(𝑢1, 𝑢2) = 𝑢1²/𝑥1² + 𝑢2²/𝑥2² − 1 = 0.    (6.55)

[Figure 6.8: Rectangle of area 𝐴 = 4𝑢1𝑢2 with corner (𝑢1, 𝑢2) inscribed in an ellipse
with semi-major axes 𝑥1 and 𝑥2.]

Of all the possible rectangles that can be inscribed in the ellipse, we want the
rectangle with maximum area (spoiler alert: this is the solution for Prob. 5.5).
Then, the area of the rectangle is constrained as,

𝑟2(𝑢1, 𝑢2) = 4𝑢1𝑢2 − 2𝑥1𝑥2 = 0.    (6.56)

Suppose that our functions of interest are the rectangle perimeter and the
rectangle aspect ratio,
𝑓1 = 4(𝑢1 + 𝑢2),    𝑓2 = 𝑢1/𝑢2.    (6.57)
We want to find the derivatives of these functions of interest with respect to the
ellipse semi-major axes.
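A small numpy sketch along these lines (our illustration under the stated definitions, not the book's worked solution) assembles the partial derivatives of Eqs. 6.55 to 6.57 and computes d𝑓/d𝑥 with both the direct and adjoint forms. One can check that 𝑢 = (𝑥1/√2, 𝑥2/√2) satisfies both residuals, so we use it directly in place of a nonlinear solve.

import numpy as np

x1, x2 = 3.0, 2.0                            # ellipse semi-major axes (test values)
u1, u2 = x1 / np.sqrt(2), x2 / np.sqrt(2)    # satisfies r1 = r2 = 0

# Partial derivatives of the residuals (Eqs. 6.55, 6.56) and outputs (Eq. 6.57)
dRdu = np.array([[2 * u1 / x1**2, 2 * u2 / x2**2],
                 [4 * u2,         4 * u1        ]])
dRdx = np.array([[-2 * u1**2 / x1**3, -2 * u2**2 / x2**3],
                 [-2 * x2,            -2 * x1            ]])
dFdu = np.array([[4.0,      4.0         ],
                 [1.0 / u2, -u1 / u2**2 ]])
dFdx = np.zeros((2, 2))

# Direct method: one linear solve per design variable (Eqs. 6.45-6.46)
phi = np.linalg.solve(dRdu, dRdx)
dfdx_direct = dFdx - dFdu @ phi

# Adjoint method: one linear solve per output (Eqs. 6.48-6.49)
psi = np.linalg.solve(dRdu.T, dFdu.T)
dfdx_adjoint = dFdx - psi.T @ dRdx

print(np.allclose(dfdx_direct, dfdx_adjoint))   # True: both give the same 2x2 Jacobian
print(dfdx_direct)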

Example 6.17: Direct/adjoint methods applied to finite element analysis.

Now we consider a more complex example that applies to finite element


structural analysis in general. The variables involved in finite element analysis
map to the nomenclature as follows:

𝑥 𝑖 : design variables (element dimension, shape parameters)


𝑢 𝑗 : state variables (structural displacements)
ℛ 𝑘 : residuals or governing equations
𝑓𝑛 : functions of interest (mass, stress)

We use index notation for clarity because the equations involve derivatives of
matrices with respect to vectors.
First, we need to derive expressions to compute all of the partial derivatives.
The residual equations can be written as:

ℛ 𝑘 (𝑥, 𝑢) = 𝑁 𝑘 (𝑥, 𝑢) − 𝐹 𝑘 = 0 (6.58)

where 𝑁 is some nonlinear function and 𝐹 is a vector of forces. Often, small


displacements and elastic deformations are assumed, which leads to the linear
form of the governing equations:

ℛ 𝑘 (𝑥, 𝑢) = 𝐾 𝑘 𝑗 (𝑥)𝑢 𝑗 − 𝐹 𝑘 = 0 (6.59)

where 𝐾 is the stiffness matrix and we assumed that the external forces are not
functions of the mesh node locations. The partial derivatives of these equations
are:

𝜕ℛ𝑘/𝜕𝑥𝑖 = (𝜕𝐾𝑘𝑗/𝜕𝑥𝑖) 𝑢𝑗,
                                                        (6.60)
𝜕ℛ𝑘/𝜕𝑢𝑗 = 𝐾𝑘𝑗.
The second equation is convenient because we already have the stiffness matrix.
The first equation involves a new term: the derivatives of the stiffness
matrix with respect to each of the design variables.
One of the functions of interest is the stress, which under the elastic
assumption, is a linear function of the deflections that can be expressed as:

𝑓𝑛 (𝑢) = 𝜎𝑛 (𝑢) = 𝑆𝑛 𝑗 𝑢 𝑗 (6.61)

Taking the partial derivatives,


𝜕𝑓𝑛/𝜕𝑥𝑖 = 0,
                                                        (6.62)
𝜕𝑓𝑛/𝜕𝑢𝑗 = 𝑆𝑛𝑗.
All of the partial derivatives above, except for one, require data that
we already know. Putting everything together using the total derivative
equation (6.43) yields,
 
d𝜎𝑛/d𝑥𝑖 = −𝑆𝑛𝑗 𝐾𝑘𝑗⁻¹ (𝜕𝐾𝑘𝑗/𝜕𝑥𝑖) 𝑢𝑗.    (6.63)

We can solve this either using the direct or adjoint method. The direct method
would solve the linear system for 𝜙, for one input 𝑖 at a time, and then multiply
through in the second equation:
 
𝐾𝑘𝑗 𝜙𝑗 = −(𝜕𝐾𝑘𝑗/𝜕𝑥𝑖) 𝑢𝑗,
                                                        (6.64)
d𝜎𝑛/d𝑥𝑖 = 𝑆𝑛𝑗 𝜙𝑗.
This approach is preferable if we have few design variables 𝑥 𝑖 and many stresses
𝜎𝑛 . Or, we can solve this with the adjoint approach. The adjoint method solves
the linear system for 𝜓, one output, 𝑛, at a time, and then multiplies through in
the second equation:
𝐾𝑘𝑗 𝜓𝑘 = −𝑆𝑛𝑗,
                                                        (6.65)
d𝜎𝑛/d𝑥𝑖 = 𝜓𝑘 (𝜕𝐾𝑘𝑗/𝜕𝑥𝑖) 𝑢𝑗.
This approach is preferable when there are many design variables and few
stresses.
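To make Eqs. 6.64 and 6.65 concrete, here is a small numpy sketch with made-up matrices (not a real finite-element model): 𝐾 is a random symmetric positive-definite stiffness matrix, 𝑆 maps displacements to stresses, and dK/dx is a third-order array of stiffness derivatives.

import numpy as np

rng = np.random.default_rng(0)
n_u, n_x, n_s = 6, 3, 2                    # states, design variables, stresses (toy sizes)
A = rng.standard_normal((n_u, n_u))
K = A @ A.T + n_u * np.eye(n_u)            # symmetric positive-definite stiffness
F = rng.standard_normal(n_u)               # loads
S = rng.standard_normal((n_s, n_u))        # stress = S @ u
dKdx = rng.standard_normal((n_x, n_u, n_u))
dKdx = dKdx + dKdx.transpose(0, 2, 1)      # keep each dK/dx_i symmetric

u = np.linalg.solve(K, F)                  # governing equations K u = F

# Direct method (Eq. 6.64): one solve per design variable
dsig_direct = np.zeros((n_s, n_x))
for i in range(n_x):
    phi = np.linalg.solve(K, -dKdx[i] @ u)
    dsig_direct[:, i] = S @ phi

# Adjoint method (Eq. 6.65): one solve per stress output
dsig_adjoint = np.zeros((n_s, n_x))
for n in range(n_s):
    psi = np.linalg.solve(K.T, -S[n])
    for i in range(n_x):
        dsig_adjoint[n, i] = psi @ (dKdx[i] @ u)

print(np.allclose(dsig_direct, dsig_adjoint))   # True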

6.8 Sparse Jacobians and Graph Coloring

In this chapter we have discussed various ways to compute a Jacobian.


If the Jacobian is large and has many entries that are zero, it is said
to be sparse. In many cases we can take advantage of that sparsity to
greatly accelerate the computational time required to construct the
Jacobian.
When applying a forward mode, whether forward mode AD, finite
differencing, or complex step, the cost of computing the Jacobian scales
with 𝑛 𝑥 . Each forward pass completes one column of the Jacobian.
For example, if using forward mode finite differencing, 𝑛 𝑥 evaluations
would be required where, at iteration 𝑗, the input vector would be:

[𝑥1 , 𝑥 2 , . . . , 𝑥 𝑗 + ℎ, . . . , 𝑥 𝑛 𝑥 ]𝑇 (6.66)

For many sparsity patterns, we can reduce computational cost


greatly. As a motivating example, consider a square diagonal Jacobian:

[ 𝐽11  0    0    0    0   ]
[ 0    𝐽22  0    0    0   ]
[ 0    0    𝐽33  0    0   ]    (6.67)
[ 0    0    0    𝐽44  0   ]
[ 0    0    0    0    𝐽55 ]

For this scenario, the Jacobian can be evaluated with one evaluation
rather than 𝑛 𝑥 evaluations. This is because a given output 𝑓𝑖 depends
on only one input 𝑥 𝑖 . We could think of the outputs as 𝑛 𝑥 independent
functions. Thus, for finite differencing rather than requiring 𝑛 𝑥 input
vectors with 𝑛 𝑥 function evaluations we can use one input vector:

[𝑥1 + ℎ, 𝑥 2 + ℎ, . . . , 𝑥 𝑗 + ℎ, . . . , 𝑥 𝑛 𝑥 + ℎ]𝑇 (6.68)

and compute all the nonzero entries in one shot.‡

‡ Curtis et al.89 were the first to show that the number of function evaluations
could be reduced in evaluating the Jacobian for sparse matrices.
89. Curtis et al., On the Estimation of Sparse Jacobian Matrices. 1974

While the diagonal case is perhaps easier to understand, it is fairly
specialized. To continue the discussion more generally we will use the
following 5 × 6 matrix as an example:
[ 𝐽11  0    0    𝐽14  0    𝐽16 ]
[ 0    0    𝐽23  𝐽24  0    0   ]
[ 𝐽31  𝐽32  0    0    0    0   ]    (6.69)
[ 0    0    0    0    𝐽45  0   ]
[ 0    0    𝐽53  0    𝐽55  𝐽56 ]

A subset of columns that do not have more than one nonzero in the same
row are said to have structurally orthogonal columns. For this example
the following columns are structurally orthogonal: (1, 3), (1, 5), (2, 3), (2,
4, 5), (2, 6), and (4, 5). Structurally orthogonal columns can be combined,
forming a smaller Jacobian that reduces the number of forward passes
required. This reduced Jacobian is referred to as compressed. There is
more than one way to compress the example Jacobian, but for this case
the minimum number of compressed columns (referred to as colors) is
three. A compressed Jacobian is shown below where columns 1 and 3
have been combined, and 2, 4, and 5 have been combined:
[ 𝐽11  0    0    𝐽14  0    𝐽16 ]        [ 𝐽11  𝐽14  𝐽16 ]
[ 0    0    𝐽23  𝐽24  0    0   ]        [ 𝐽23  𝐽24  0   ]
[ 𝐽31  𝐽32  0    0    0    0   ]   ⇒   [ 𝐽31  𝐽32  0   ]    (6.70)
[ 0    0    0    0    𝐽45  0   ]        [ 0    𝐽45  0   ]
[ 0    0    𝐽53  0    𝐽55  𝐽56 ]        [ 𝐽53  𝐽55  𝐽56 ]

For finite differencing or complex step, only compression amongst


columns is possible. But for AD the reverse mode provides the oppor-
tunity to take advantage of compression along rows. The concept is the
same, but instead we look for structurally orthogonal rows. One such
compression is shown below.

[ 𝐽11  0    0    𝐽14  0    𝐽16 ]
[ 0    0    𝐽23  𝐽24  0    0   ]        [ 𝐽11  0    0    𝐽14  𝐽45  𝐽16 ]
[ 𝐽31  𝐽32  0    0    0    0   ]   ⇒   [ 0    0    𝐽23  𝐽24  0    0   ]    (6.71)
[ 0    0    0    0    𝐽45  0   ]        [ 𝐽31  𝐽32  𝐽53  0    𝐽55  𝐽56 ]
[ 0    0    𝐽53  0    𝐽55  𝐽56 ]

AD can also be used even more flexibly where both modes are used:
forward passes to evaluate groups of structurally orthogonal columns,
and reverse passes to evaluate groups of structurally orthogonal rows.
Rather than taking incremental steps in each direction as is done in finite
differencing, in AD we set the seed vector with ones in the directions we
wish to evaluate, similar to how the seed is set for directional derivatives
as discussed in Section 6.6.
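As an illustration (not from the text), the sketch below applies compressed forward differencing to a toy function whose Jacobian has the sparsity pattern of Eq. 6.69, using the column groups (1, 3), (2, 4, 5), and (6), and then scatters the compressed columns back into the full Jacobian.

import numpy as np

def func(x):
    # Toy function whose Jacobian has the sparsity pattern of Eq. 6.69
    return np.array([x[0]**2 + 2*x[3] + np.sin(x[5]),
                     x[2]*x[3],
                     x[0] + x[1]**2,
                     np.exp(x[4]),
                     x[2] + x[4]*x[5]])

# Sparsity pattern (True = nonzero) and column groups from the coloring (0-based)
pattern = np.array([[1, 0, 0, 1, 0, 1],
                    [0, 0, 1, 1, 0, 0],
                    [1, 1, 0, 0, 0, 0],
                    [0, 0, 0, 0, 1, 0],
                    [0, 0, 1, 0, 1, 1]], dtype=bool)
groups = [[0, 2], [1, 3, 4], [5]]          # structurally orthogonal columns

x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 0.5])
f0 = func(x0)
h = 1e-6
J = np.zeros((5, 6))
for group in groups:                       # one evaluation per color, not per variable
    xp = x0.copy()
    xp[group] += h
    df = (func(xp) - f0) / h               # compressed column for this color
    for j in group:                        # scatter back using the sparsity pattern
        rows = pattern[:, j]
        J[rows, j] = df[rows]

print(np.round(J, 4))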
For these small Jacobians it is fairly straightforward to determine
how best to compress the matrix. For a large matrix this is not so easy.
The approach that is used is called graph coloring. In one approach, a
graph is created with row and column indices as vertices and edges
denoting nonzero entries in the Jacobian. Graph coloring algorithms
use heuristics to estimate the fewest number of “colors” or orthogonal
columns. Graph coloring is a large field of its own with derivative
computation as just one application.§

§ Gebremedhin et al.90 provide a review paper of graph coloring in the context of
computing derivatives. Gray et al.91 show how to use graph coloring to compute
total coupled derivatives.
90. Gebremedhin et al., What Color Is Your Jacobian? Graph Coloring for Computing
Derivatives. 2005
91. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design,
analysis, and optimization. 2019

Example 6.18: Speed up from sparse derivatives.

In static aerodynamic analyses the forces and moments produced at two
different wind speeds are independent of each other, and if many different
wind speeds are of interest, the resulting Jacobian will display a high degree
of sparsity. Some examples include evaluating the power produced by a
wind turbine across different wind speeds, or assessing an aircraft's thrust

throughout a flight envelope. Many other engineering analyses have similar
structure. In this example we consider a typical blade optimization. The
Jacobian is fully dense with respect to geometry changes, but as discussed is
diagonal with respect to the various inflow conditions (left side of Fig. 6.9). We
can compress the Jacobian as shown on the right side of Fig. 6.9.

[Figure 6.9: Representation of the Jacobian for this example, with rows corresponding
to the outputs and columns to the geometry and inflow variables. The blocks indicate
areas where a derivative exists, and the blank spots where the derivative is always zero.
The left is the original Jacobian and the right is the compressed representation.]

To illustrate the potential benefits of using a sparse representation, the
Jacobian was constructed for various sizes of inflow conditions using both
forward AD, and forward AD with graph coloring (Fig. 6.10). After about
100 inflow conditions, the difference in time required exceeds an order of
magnitude (note the log-log scale). As Jacobians are needed at every iteration

in the optimization, this is a tremendous speedup, enabled just by leveraging
the existing sparsity pattern.¶

¶ The full details of this example are available in a preprint.92
92. Ning, Using Blade Element Momentum Methods with Gradient-Based Design
Optimization, 2020

[Figure 6.10: The compressed Jacobian. Jacobian time (s) versus the number of inflow
conditions for AD and AD with coloring, on a log-log scale.]

6.9 Unification of the Methods for Computing Derivatives

Now that we have introduced all the methods for computing derivatives,
we will see how they are connected. For example, we have mentioned
that the direct and adjoint methods are analogous to the forward and
reverse mode of AD, respectively, but we have not formalized this
relationship.
To get a broader view of these methods, we go back to the notion of
the list of variables (6.27) considered when introducing AD:
𝑣 𝑖 = 𝑉𝑖 (𝑣 1 , 𝑣2 , . . . , 𝑣 𝑖 , . . . , 𝑣 𝑛 ). (6.72) bian.

However, instead of defining these as every variable in a program, we


are going to use a more flexible interpretation. At the very minimum,
these variables include the inputs of interest, 𝑥, and the outputs of
interest, 𝑓 . The variables might also include intermediate variables,
which we do not define for now. We are also going to use the concept
of residuals that we already introduced, where,

𝑟 = 𝑅(𝑣) = 0. (6.73)

We use a more flexible interpretation of what these residuals are, but
they must be consistent with the definition of the variables: The number
they must be consistent with the definition of the variables: The number
of residuals must be the same as the number of variables, and driving
the residuals to zero must result in the correct variable values.
The unified derivatives equation (UDE) is based on these variables and
residuals and can be written as93:

(𝜕𝑅/𝜕𝑣)(d𝑣/d𝑟) = 𝐼 = (𝜕𝑅/𝜕𝑣)ᵀ(d𝑣/d𝑟)ᵀ,    (6.74)

93. Martins et al., Review and Unification of Methods for Computing Derivatives of
Multidisciplinary Computational Models. 2013
where 𝐼 is the identity matrix and all the matrices are square and have
the same size. The derivatives that we ultimately want to compute,
d 𝑓 /d𝑥 are a subset of d𝑣/d𝑟. The matrix of partial derivatives needs
to be constructed, and then the linear system is solved for the total
derivatives. With the appropriate definition of the variables and the
corresponding residuals (shown in Table 6.3), we can recover all the
derivative computation methods using the UDE (6.74). The left-hand
side represents forward (or direct) derivative computation, while the
right-hand side represents reverse (or adjoint) derivative computation.
6 Computing Derivatives 206

For the inputs and outputs, the residuals assume that the associated
variables (𝑥 and 𝑓 ) are free, but they are constructed such that the
variables assume the correct values when the residual equations are
satisfied.

Table 6.3: Variable and residual definition needed to recover the various derivative
computation methods with the UDE (6.74). The residuals of the governing equations
are represented by 𝑅 𝑔 to distinguish them from the UDE residuals.

Method       Level                       Variables, 𝑣     Residuals, 𝑅
Monolithic   Inputs and outputs          [𝑥, 𝑓]           [𝑥 − 𝑥0, 𝑓 − 𝐹(𝑥)]
Analytic     Governing equations and     [𝑥, 𝑢, 𝑓]        [𝑥 − 𝑥0, 𝑟 − 𝑅𝑔(𝑥, 𝑢), 𝑓 − 𝐹(𝑥, 𝑢)]
             state variables
AD           Lines of code               𝑣                𝑣 − 𝑉(𝑥, 𝑣)

Using the variable and residual definitions from Table 6.3 for the
monolithic method in the left-hand side of the UDE (6.74), we get

[ 𝐼        0 ] [ 𝐼       0 ]
[ 𝜕𝐹/𝜕𝑥   𝐼 ] [ d𝑓/d𝑥  𝐼 ] = 𝐼,    (6.75)
which yields the obvious result d 𝑓 /d𝑥 = 𝜕𝐹/𝜕𝑥. This is not a particu-
larly useful result, but it shows that the UDE can recover the monolithic
case.
For the analytic derivatives, the left-hand side of the UDE becomes,

[ 𝐼           0           0 ] [ 𝐼       0      0 ]
[ −𝜕𝑅𝑔/𝜕𝑥   −𝜕𝑅𝑔/𝜕𝑢   0 ] [ d𝑢/d𝑥  d𝑢/d𝑟  0 ] = 𝐼.    (6.76)
[ −𝜕𝐹/𝜕𝑥    −𝜕𝐹/𝜕𝑢    𝐼 ] [ d𝑓/d𝑥  d𝑓/d𝑟  𝐼 ]
Since we are only interested in the d 𝑓 /d𝑥 block in the second matrix,
we can ignore the second and third block columns of that matrix.
Multiplying the remaining blocks out and using the definition 𝜙 ≜ −d𝑢/d𝑥,
we get the direct linear system (6.45) and the total derivative
equation (6.46).
The right-hand side of the UDE yields the transposed system,

[ 𝐼   −(𝜕𝑅𝑔/𝜕𝑥)ᵀ   −(𝜕𝐹/𝜕𝑥)ᵀ ] [ 𝐼   (d𝑢/d𝑥)ᵀ   (d𝑓/d𝑥)ᵀ ]
[ 0   −(𝜕𝑅𝑔/𝜕𝑢)ᵀ   −(𝜕𝐹/𝜕𝑢)ᵀ ] [ 0   (d𝑢/d𝑟)ᵀ   (d𝑓/d𝑟)ᵀ ] = 𝐼.    (6.77)
[ 0    0              𝐼          ] [ 0    0            𝐼         ]


Similarly to the forward case, we ignore the block columns of the


matrix of unknowns to focus on the block column containing d 𝑓 /d𝑥.
Multiplying this out, and defining 𝜓 ≜ −d𝑓/d𝑟, we get the adjoint linear
system (6.48) and the corresponding total derivative equation (6.49).
The definition of 𝜓 here is significant, since the adjoint vector can
indeed be interpreted as the derivatives of the objective function with
respect to the residuals of the governing equations.
Finally, we can recover AD from the UDE as well by defining the
vector of variables as all the variables assigned in a code (with unrolled
loops), and constructing the corresponding residuals. The forward
mode yields

 1 0  1 0
 0 ...  0 ...
 𝜕𝑉2 ..   d𝑣2 .. 
− ..
.   ..
. 
 𝜕𝑣 1 .  d𝑣 1 .
   1 
 .   .  = 𝐼,
1
(6.78)
 .. .. ..
0  .. .. ..
0
 . .  . .
 𝜕𝑉𝑛   d𝑣 𝑛 d𝑣 𝑛 
− 𝜕𝑉𝑛
1  1
 𝜕𝑣 ... −  d𝑣 ...
 1
𝜕𝑣 𝑛−1   1 d𝑣 𝑛−1 
where the Jacobian d 𝑓 /d𝑥 is composed of a subset of derivatives in the
corners near the d𝑣 𝑛 /d𝑣1 term. To compute these derivatives, we need
to perform forward substitution and compute one column of the total
derivative matrix at a time, where each column is associated with one
of the inputs of interest.
The reverse mode yields

[ 1    −𝜕𝑉2/𝜕𝑣1    ⋯    −𝜕𝑉𝑛/𝜕𝑣1     ] [ 1    d𝑣2/d𝑣1    ⋯    d𝑣𝑛/d𝑣1     ]
[ 0     1            ⋱     ⋮             ] [ 0     1           ⋱     ⋮           ]
[ ⋮           ⋱      1    −𝜕𝑉𝑛/𝜕𝑣𝑛−1  ] [ ⋮          ⋱      1    d𝑣𝑛/d𝑣𝑛−1  ] = 𝐼,    (6.79)
[ 0     ⋯            0     1             ] [ 0     ⋯           0     1           ]

where the derivatives of interest are now near the top right corner of
the total derivative matrix. To compute these derivatives, we need to
perform back substitutions, which compute one column of the matrix
at a time. Since the total derivative matrix is transposed here, the
reverse mode actually computes a row of the total derivative Jacobian
at a time, where each row is associated with an output of interest.
This is consistent with what we concluded before: The cost of the
forward mode is proportional to the number of inputs of interest,
while the cost of the reverse mode is proportional to the number of
outputs of interest.

In this unification, we have found nothing new except for a new


perspective on how all methods relate. This was achieved by gener-
alizing the concept of “variable” and “residual.” As we will see later,
these concepts and the UDE (6.74) have been used to develop a general
framework for solving models and computing their derivatives.39,91

39. Hwang et al., A computational architecture for coupling heterogeneous numerical
models and computing coupled derivatives. 2018
91. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design,
analysis, and optimization. 2019

6.10 Summary

We discussed the methods available for computing derivatives. Each of

these is summarized in the following list.


Symbolic differentiation is accurate, but only scalable for simple,
explicit functions of low dimensionality. Although it cannot be used to
derive a closed-form expression for models that are solved iteratively,
symbolic differentiation is used by AD at each line of code, and can
also be used in direct and adjoint methods to derive expressions for
computing the required partial derivatives.
Finite difference formulas are popular because of their ease of use
and because they are a black-box method able to work with almost any
algorithm. The downsides are that they are not accurate and the
cost scales linearly with the number of variables. Many of the issues
optimization practitioners experience with gradient-based optimization
can be traced to errors in the gradients when optimization algorithms
automatically compute these gradients using finite differences.
The complex-step method is accurate and relatively easy to imple-
ment. It usually requires some changes to the analysis source code,
but this process can be scripted. Its main advantage is that it produces
analytically accurate derivatives. However, like the finite difference
method, the cost scales linearly with the number of inputs, and each
individual simulation requires almost twice the cost because of the use
of complex arithmetic.
Algorithmic differentiation produces analytically accurate deriva-
tives and can be scalable. Many implementations can be fully automated.
The implementation requires access to the source code, but is still rela-
tively straightforward to apply. Both forward and reverse modes are
available; the former scales with the number of inputs and the latter
scales with the number of outputs. The scaling factor for forward mode
is generally much lower than finite differences, and in reverse mode is
independent of the number of design variables.
Direct and adjoint methods are accurate and scalable, but require
significant changes to source code. These methods are exact (depending
on how the partial derivatives are obtained) and, like algorithmic
differentiation, provide both forward and reverse modes with the same

scaling advantages. The disadvantage is that the method is strongly


intrusive and often considerable development effort is required.

Problems

6.1 Answer true or false and justify your answer.

a) A first-order derivative is only one of many types of sensitiv-


ity analysis.
b) Each column of the Jacobian matrix represents the gradient
of one of the functions of interest with respect to all the
variables.
c) You can only compute derivatives of models for which you
have the source code, or at the very least understand how
the model computes the functions of interest.
d) Symbolic differentiation is intractable for all but the simplest
models because of expression swell.
e) Finite-difference approximations can compute first deriva-
tives with a precision matching that of the function being
differentiated.
f) The complex-step method can only be used to compute
derivatives of complex functions.
g) Algorithmic differentiation uses a code parser to differentiate
each line of code symbolically.
h) The forward mode of algorithmic differentiation computes
the derivatives of all outputs with respect to one input, while
the reverse mode computes the derivative of one output
with respect to all inputs.
i) The adjoint method requires the same partial derivatives as
the direct method.
j) Of the two implicit analytic methods, the direct method is
more widely used than the adjoint method because more
problems have more design variables than functions of
interest.
k) Graph coloring makes Jacobians sparse by selectively replac-
ing small-valued entries with zeros to trade accuracy for
speed.
l) The unified derivatives equation can represent implicit an-
alytic and algorithmic differentiation approaches, but not
monolithic differentiation.

6.2 Reproduce the comparison between the complex-step and finite-


difference methods from Ex. 6.9. Reversing the 𝑥-axis as we did in
Fig. 6.4 is not necessary. Do you get any complex-step derivatives
with zero error compared to the analytic reference? What does
that mean, and how should you show those points on the plot?
Estimate the value of ℎ required to eliminate the truncation error in the
derivative using Eq. 6.22. Is this estimate consistent with your
plot?

6.3 Compute the derivative using symbolic differentiation and using


algorithmic differentiation (either forward or reverse mode) for the
iterative code in Ex. 6.4. Use a package to facilitate the AD portion.
Most scientific computing languages have AD packages.

6.4 Implement a forward-mode-AD tool using operator overloading


to differentiate the function of Ex. 6.9. You need to define your
own type and provide it with overloaded functions for exp , sin,
cos, sqrt, addition, division, and exponentiation.

6.5 Suppose you have two airplanes that are flying in a horizontal
plane defined by 𝑥 and 𝑦 coordinates. Both airplanes start at
𝑦 = 0, but airplane 1 starts at 𝑥 = 0 while airplane 2 has a head
start of 𝑥 = Δ𝑥. The airplanes fly at a constant velocity. Airplane 1
has a velocity 𝑣1 in the direction of the positive 𝑥-axis and airplane
two has a velocity 𝑣 2 at an angle 𝛾 with the 𝑥-axis. The functions
of interest are the distance (𝑑) and the angle (𝜃) between the two
airplanes as a function of time. The independent variables are Δ𝑥,
𝛾, 𝑣 1 , 𝑣2 , 𝑡. Write the code that computes the functions of interest
(outputs) for a given set of independent variables (inputs). Use
AD to differentiate the code. Choose a set of inputs, compute the
derivatives of all the outputs with respect to the inputs and verify
them against the complex-step method.

6.6 Kepler’s equation, which we mentioned in Section 2.2, defines the


relationship between a planet’s polar coordinates and the time
elapsed from a given initial point through the implicit equation,

𝐸 − 𝑒 sin(𝐸) = 𝑀,

where 𝑀 is the mean anomaly (a parameterization of time), 𝐸


is the eccentric anomaly (a parameterization of the polar angle),
and 𝑒 is the eccentricity of the elliptical orbit. Suppose that the
function of interest is the difference between the eccentric and
mean anomalies,
𝑓 (𝐸, 𝑀) = 𝐸 − 𝑀.

Derive an analytic expression for d 𝑓 /d𝑒 and d 𝑓 /d𝑀. Verify your


result against the complex-step method or AD (you will need a
solver for Kepler’s equation, which was the subject of Prob. 3.6).

6.7 Compute the derivatives for the ten-bar truss problem described in
Appendix C.2.2 using the direct and adjoint implicit differentiation
methods. We want to compute the derivatives of the objective
(mass) with respect to the design variables (ten cross-sectional
areas), and the derivatives of the constraints (stresses in all ten
bars) with respect to the design variables (a 10 × 10 Jacobian
matrix). Compute the derivatives using:

a) A finite-difference formula of your choice.


b) The complex-step derivative method.
c) Algorithmic differentiation.
d) The implicit analytic method (direct and adjoint).

Report the errors relative to the implicit analytic methods. Discuss


your findings and the relative merits of each approach.

6.8 We can now solve the ten-bar truss problem (previously solved in
Prob. 5.15) using the derivatives computed in Prob. 6.7. Solve this
optimization problem using both finite-difference derivatives and
an implicit analytic method. Report the following:

a) Convergence plot with two curves for the different derivative


computation approaches on the same plot.
b) Number of function calls required to converge for each
method. This metric is more meaningful if you use more
than one starting point and average the results.

Discuss your findings.

6.9 Aggregate the constraints for the ten-bar truss problem and extend
the code from Prob. 6.7 to compute the required constraint deriva-
tives using the implicit analytic method that is most advantageous
in this case. Verify your derivatives against the complex-step
method. Solve the optimization problem and compare your re-
sults to the ones you obtained in Prob. 6.8. How close can you get
to the reference solution?
7 Gradient-Free Optimization
Gradient-free algorithms fill an important role in optimization. The
gradient-based algorithms introduced in Chapter 4 are efficient in
finding local minima for high-dimensional nonlinear problems defined
by continuous smooth functions. However, the assumptions made
for these algorithms are not always valid, which can render these
algorithms ineffective. Also, gradients might not be available, as in the
case of functions given as a black box.
In this chapter, we introduce only a few popular representative
gradient-free algorithms. Most are designed to handle unconstrained
functions only, but they can be adapted to solve constrained problems
by using the penalty or filtering methods introduced in Chapter 5. We
start by discussing the problem characteristics that are relevant to the
choice between gradient-free and gradient-based algorithms and then
give an overview of the types of gradient-free algorithms.

By the end of this chapter you should be able to:

1. Identify problems that are well-suited for gradient-free


optimization.

2. Describe the characteristics and approach of more than


one gradient-free optimization method.

3. Use gradient-free optimization algorithms to solve real


engineering problems.

7.1 Relevant Problem Characteristics

Gradient-free methods can be useful when gradients are not available, such as


when dealing with black-box functions. Although gradients can always
be approximated with finite differences, these approximations suffer
from potentially large inaccuracies (see Section 6.4.2). Gradient-based
algorithms require a more experienced user because they take more


effort to set up and run. Overall, gradient-free algorithms are easier


to get up and running but are much less efficient, particularly as the
dimension of the problem increases.
One major advantage of gradient-free algorithms is that they do not
assume function continuity. For gradient-based algorithms, function
smoothness is essential when deriving the optimality conditions, both
for unconstrained functions and constrained functions. More specifi-
cally, the KKT conditions (5.13) require that the function be continuous
in value (𝐶 0 ), gradient (𝐶 1 ), and Hessian (𝐶 2 ) in at least a small neigh-
borhood of the optimum. If, for example, the gradient is discontinuous
at the optimum, it is undefined and the KKT conditions are not valid.
Away from optimum points, this requirement is not as stringent. While
gradient-based algorithms work on the same continuity assumptions,
they can usually tolerate the occasional discontinuity as long as it is
away from an optimum point. However, for functions with excessive
numerical noise and discontinuities, gradient-free algorithms might be
the only option.
Many considerations are involved when choosing between a gradient-
based and a gradient-free algorithm. Some of these considerations
are common sources of misconception. One problem characteristic
that is often cited as a reason for choosing gradient-free methods is
multimodality. Design space multimodality can be due to an objective
function with multiple local minima, or in the case of a constrained
problem, the multimodality can arise from the constraints that define
disconnected or nonconvex feasible regions.
As we will see shortly, some gradient-free methods feature a global
search that increases the likelihood of finding the global minimum.
This feature is a reason why gradient-free methods are often used for
multimodal problems. However, not all gradient-free methods are
global search methods; some perform only a local search. Additionally,
even though gradient-based methods are by themselves local search
methods, they are often combined with global search strategies as
discussed in Tip 4.24. It is not necessarily true that a global search,
gradient-free method is more likely to find a global optimum than a
multistart gradient-based method. As always, problem-specific testing
is needed.
Furthermore, it is assumed far too often that any complex problem is
multimodal, but that is often not the case. While it might be impossible
to prove that a function is unimodal, it is easy to prove that a function
is multimodal by just finding another local minimum. Therefore, one
should assume that a function is unimodal until proven otherwise.
Additionally, one must be careful that artificial local optima, created

by numerical noise, are not the reason why one believes the physical
design space is multimodal.

Tip 7.1: Choose your bounds carefully for gradient-free methods.

Although many gradient-free algorithms are not designed for nonlinear


constraints, many do use bound constraints. Unlike gradient-based methods
where generous boundaries are often used, for global search methods one must
be careful in choosing bounds. Because the optimizer will want to explore
throughout the design space, if bounds are unnecessarily wide, the effectiveness
of the algorithm will be diminished considerably.

Another reason that is often cited for using a gradient-free method


is multiple objectives. Some gradient-free algorithms, like the genetic
algorithm discussed in this chapter, can be naturally applied to multiple
objectives. However, it is a misconception that gradient-free methods
are always preferable just because there are multiple objectives. This
topic is discussed in more detail in Chapter 9.
Another common reason for using gradient-free methods is when
there are discrete design variables. Since the notion of a derivative
with respect to a discrete variable is invalid, gradient-based algorithms
cannot be used directly (although there are ways around this limitation
as discussed in Chapter 8). Some of the gradient-free algorithms (but
not all) can handle discrete variables directly.
The preceding discussion highlights that although multimodality,
multiple objectives, or discrete variables, are commonly mentioned as
reasons for choosing a gradient-free algorithm, they are not necessarily
valid. One of the most relevant factors when choosing between a
gradient-free and a gradient-based approach is the dimension of the
problem. Figure 7.1 shows how many function evaluations are required
to minimize the 𝑛-dimensional Rosenbrock function for varying num-
bers of design variables. Three classes of algorithms are shown in the
plot: gradient-free, gradient-based with finite differenced gradients,
and gradient-based with numerically exact gradients. While the exact
numbers are problem dependent, similar scaling has been observed
on large-scale computational fluid dynamics based optimization.94 The
general takeaway is that for problems of small size (usually ≤ 30
variables95), gradient-free methods can be useful in finding a solution.
Furthermore, because gradient-free methods usually take much less
developer time to use, for these smaller problems a gradient-free
solution may even be preferable. However, if the problem is large in
dimension, then a gradient-based method may be the only feasible path
forward, despite the need for more developer time.

94. Yu et al., On the Influence of Optimization Algorithm and Starting Design on Wing Aerodynamic Shape Optimization. 2018
95. Rios et al., Derivative-free optimization: a review of algorithms and comparison of software implementations. 2013

[Figure 7.1: Cost of optimization (number of function evaluations) for increasing number of design variables of the 𝑛-dimensional Rosenbrock function. A gradient-free algorithm is compared with a gradient-based algorithm using finite-difference gradients and with a gradient-based algorithm using analytic gradients; the gradient-based optimizer with analytic gradients enables much better scalability.]

7.2 Classification of Gradient-Free Algorithms

There is a much wider variety of gradient-free algorithms compared to


their gradient-based counterparts. While gradient-based algorithms
tend to perform local searches, have a mathematical rationale, and be
deterministic, gradient-free algorithms exhibit different combinations
of these characteristics. We list the most widely known gradient-free
algorithms in Table 7.1 and classify them according to the characteristics
introduced in Fig. 1.19.∗

∗ Rios et al.95 review and benchmark a large selection of gradient-free optimization algorithms.
95. Rios et al., Derivative-free optimization: a review of algorithms and comparison of software implementations. 2013

[Table 7.1: Classification of gradient-free optimization methods, using the characteristics of Fig. 1.19. The methods listed (Nelder–Mead, GPS, MADS, trust region, implicit filtering, DIRECT, MCS, EGO, SMFs, branch and fit, hit and run, and evolutionary) are each classified by search (local or global), optimality criteria (mathematical or heuristic), iteration procedure (mathematical or heuristic), function evaluation (direct or surrogate), and stochasticity (deterministic or stochastic).]

Local-search, gradient-free algorithms that use direct function eval-


uations include the Nelder–Mead algorithm, generalized pattern search
(GPS), and mesh-adaptive direct search (MADS). The Nelder–Mead
algorithm (which we detail in Section 7.3) is heuristic, while the other
two are not.
GPS and MADS are examples of derivative-free optimization (DFO)
algorithms, which, in spite of the name, do not include all gradient-free
algorithms. DFO algorithms are understood to be largely heuristic-free
and focus on local search.† GPS is actually a family of methods that
iteratively seek an improvement using a set of points around the current
point. In its simplest versions, GPS uses a pattern of points based on the
coordinate directions, but there are more sophisticated versions that
use a more general set of vectors. MADS is an improvement on GPS
algorithms, allowing an infinite set of such vectors and improving
convergence.‡

† The textbooks by Conn et al.96 and Audet et al.97 provide a more extensive treatment of gradient-free optimization algorithms that are based on mathematical criteria.
96. Conn et al., Introduction to Derivative-Free Optimization. 2009
97. Audet et al., Derivative-Free and Blackbox Optimization. 2017
‡ The NOMAD software is an open-source implementation of MADS.98
98. Le Digabel, Algorithm 909: NOMAD: Nonlinear Optimization with the MADS algorithm. 2011

Model-based, local-search algorithms include trust-region algorithms
and implicit filtering. The model is an analytic approximation
be smooth, easy to evaluate, and accurate in the neighborhood of
the current point. The trust-region approach detailed in Section 4.5
can be considered gradient-free if the surrogate model is constructed
using just evaluations of the original function without evaluating its
gradients. This does not prevent the trust-region algorithm from using
gradients of the surrogate model, which can be computed analytically.
Implicit filtering methods extend the trust region method by adding
a surrogate model of the function gradient and use that to guide the
search. This effectively becomes a gradient-based method applied to
the surrogate model instead of evaluating the function directly as done
for the methods in Chapter 4.
Global-search algorithms can be broadly classified as deterministic
or stochastic, depending on whether they include random parameter
generation within the optimization algorithm.
Deterministic, global-search algorithms can be either direct or
model-based. Direct algorithms include Lipschitzian-based parti-
tioning techniques—such as the “divide a hyperrectangle” (DIRECT)
algorithm detailed in Section 7.4 and branch and bound search (dis-
cussed in Chapter 8)—and multilevel coordinate search (MCS). The
DIRECT algorithm selectively divides the space of the design variables
into smaller and smaller 𝑛-dimensional boxes (hyperrectangles) and
uses mathematical arguments to decide on which boxes should be
subdivided.§ Branch-and-bound search also partitions the design space,
but estimates lower and upper bounds for the optimum by using the
function variation between partitions. MCS is another algorithm that
partitions the design space into boxes, where a limit is imposed on how
small the boxes can get based on their “level”—the number of times they
have been divided.

§ DIRECT is one of the few gradient-free methods that has a built-in way to handle constraints that is not a penalty or filtering method.99
99. Jones, Direct Global Optimization Algorithm. 2009
Model-based global-search algorithms—sometimes called response
surface methods (RSMs)—are similar to their local-search algorithm
counterparts, but instead of using convex surrogate models, they use
surrogate models that can reproduce the features of a multimodal
function. One of the most widely used of these algorithms is efficient
global optimization (EGO), which uses kriging surrogate models and
uses the idea of expected improvement to maximize the likelihood of
finding the optimum more efficiently (introduced in Chapter 10). Other
algorithms use radial basis functions (RBFs) as the surrogate model
and also maximize the probability of improvement at new iterates.
Stochastic algorithms rely on one or more non-deterministic pro-
cedures; they include hit and run algorithms, and the broad class of
evolutionary algorithms. When performing benchmarks of a stochastic
algorithm, you should run a large enough number of optimizations to
obtain statistically significant results.
Hit-and-run algorithms generate random steps about the current
iterate in search of better points. A new point is accepted when it is
better than the current one, and this process is repeated until the point
cannot be improved.
What constitutes an evolutionary algorithm is not well defined.¶
Evolutionary algorithms are inspired by processes that occur in nature
or society. There is a plethora of evolutionary algorithms in the literature,
thanks to the fertile imagination of the research community and a
never-ending supply of phenomena for inspiration.‖ These algorithms
are more of an analogy of the phenomenon than an actual model
because they are at best oversimplified models and at worst completely
wrong. Nature-inspired algorithms tend to invent their own terminology
for the mathematical terms in the optimization problem. For
example, a design point might be called a “member of the population,”
or the objective function might be referred to as the “fitness.”
The vast majority of evolutionary algorithms are population-based,
which means they involve a set of points at each iteration instead
of a single one. Because the population is spread out in the design
space, evolutionary algorithms perform a global search. The stochastic
elements in these algorithms contribute to global exploration and
reduce the susceptibility to getting stuck in local minima. These
features increase the likelihood of getting close to the global minimum,
but by no means guarantee it. The reason it may only get close is that
heuristic algorithms have a poor convergence rate, especially in higher
dimensions, and that they lack a first-order optimality criterion.

¶ Simon100 provides a more comprehensive review of evolutionary algorithms.
100. Simon, Evolutionary Optimization Algorithms. 2013
‖ These algorithms include ant colony optimization, artificial bee colony algorithm, design by shopping paradigm, dolphin echolocation algorithm, bacterial foraging optimization, bat algorithm, big bang-big crunch algorithm, biogeography-based optimization, bird mating optimizer, cat swarm optimization, cuckoo search, hybrid glowworm swarm optimization algorithm, mine bomb algorithm, quantum-behaved particle swarm optimization, artificial fish swarm, firefly algorithm, invasive weed optimization, moth-flame optimization algorithm, galactic swarm optimization, hummingbirds optimization algorithm, flower pollination algorithm, artificial flora optimization algorithm, whale optimization algorithm, cockroach swarm optimization, grey wolf optimizer, fruit fly optimization algorithm, imperialist competitive algorithm, harmony search algorithm, penguins search optimization algorithm, intelligent water drops, grenade explosion method, salp swarm algorithm, teaching–learning-based optimization, and water cycle algorithm.
In this chapter, we cover four gradient-free algorithms: the Nelder–
Mead algorithm, genetic algorithms, particle swarm optimization, and
the DIRECT method. Simulated annealing is covered in Chapter 8
because it was originally developed to solve discrete problems.

7.3 Nelder–Mead Algorithm

The simplex method of Nelder et al.23 is a deterministic, direct-search
method that is among the most cited of the gradient-free methods. It
is also known as the nonlinear simplex—not to be confused with the
simplex algorithm used for linear programming, with which it has
nothing in common.

23. Nelder et al., A Simplex Method for Function Minimization. 1965
The Nelder–Mead algorithm is based on a simplex, which is a
geometric figure defined by a set of 𝑛 + 1 points in the design space of
𝑛 variables, 𝑋 = {𝑥 (0) , 𝑥 (1) , . . . , 𝑥 (𝑛) }. In two dimensions, the simplex
is a triangle, and in three dimensions it becomes a tetrahedron. Each
optimization iteration is represented by a different simplex. The
algorithm consists in modifying the simplex at each iteration using
five simple operations. The sequence of operations to be performed is
chosen based on the relative values of the objective function at each of
the points.
The first step of the simplex algorithm is to generate 𝑛 + 1 points
based on an initial guess for the design variables. This could be done by
simply adding steps to each component of the initial point to generate
𝑛 new points. However, this will generate a simplex with different
edge lengths. Equal length edges are preferable. Suppose we want the
length of all sides to be 𝑙 and that the first guess is 𝑥 (0) . The remaining
points of the simplex, 𝑥 (1) , . . . , 𝑥 (𝑛) , can be computed by

𝑥 (𝑖) = 𝑥 (0) + 𝑠 (𝑖) , (7.1)

where 𝑠 (𝑖) is a vector whose components 𝑗 are defined by

$$
s_j^{(i)} =
\begin{cases}
\dfrac{l}{n\sqrt{2}}\left(\sqrt{n+1}-1\right) + \dfrac{l}{\sqrt{2}}, & \text{if } j = i \\
\dfrac{l}{n\sqrt{2}}\left(\sqrt{n+1}-1\right), & \text{if } j \neq i .
\end{cases}
\tag{7.2}
$$

For example, Fig. 7.2 shows a starting simplex for a two-dimensional
problem.

[Figure 7.2: Starting simplex for 𝑛 = 2, with vertices 𝑥 (0) , 𝑥 (1) , and 𝑥 (2) .]
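For concreteness, the following Python sketch builds this starting simplex from Eq. 7.2. It is only an illustration; the function name and use of NumPy are our own choices, not part of the text.

```python
import numpy as np

def initial_simplex(x0, l=1.0):
    """Regular simplex with edge length l around the starting point x0 (Eq. 7.2)."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    b = l / (n * np.sqrt(2)) * (np.sqrt(n + 1) - 1)  # offset applied when j != i
    a = b + l / np.sqrt(2)                           # offset applied when j == i
    simplex = [x0.copy()]
    for i in range(n):
        s = np.full(n, b)
        s[i] = a
        simplex.append(x0 + s)
    return np.array(simplex)  # shape (n + 1, n)

# Example: equilateral triangle of edge length 1 around the origin in 2-D
print(initial_simplex([0.0, 0.0], l=1.0))
```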
At any given iteration, the objective 𝑓 is evaluated for every point,
and the points are ordered based on the respective values of 𝑓 , from
the lowest to the highest. Thus, in the ordered list of simplex points


𝑋 = 𝑥 (0) , 𝑥 (1) , . . . , 𝑥 (𝑛−1) , 𝑥 (𝑛) , the best point is 𝑥 (0) , and the worst one
is 𝑥 (𝑛) .
The Nelder–Mead algorithm performs five main operations on the
simplex to create a new one: reflection, expansion, outside contraction,
inside contraction, and shrinking. The operations are shown in Fig. 7.3.
Each of these operations, except for the shrinking, generates a new
point given by

$$x = x_c + \alpha \left( x_c - x^{(n)} \right), \tag{7.3}$$

where 𝛼 is a scalar, and 𝑥 𝑐 is the centroid of all the points except for the
worst one, i.e.,

$$x_c = \frac{1}{n} \sum_{i=0}^{n-1} x^{(i)} . \tag{7.4}$$

This generates a new point along the line that connects the worst point,
𝑥 (𝑛) , and the centroid of the remaining points, 𝑥 𝑐 . This direction can be
seen as a possible descent direction.

[Figure 7.3: Nelder–Mead algorithm operations for 𝑛 = 2: (a) initial simplex, (b) reflection (𝛼 = 1), (c) expansion (𝛼 = 2), (d) outside contraction (𝛼 = 0.5), (e) inside contraction (𝛼 = −0.5), and (f) shrink.]

The objective of each iteration is to replace the worst point with a


better one to form a new simplex. Each iteration always starts with
reflection, which generates a new point using Eq. 7.3 with 𝛼 = 1 as
shown in Fig. 7.3. If the reflected point is better than the best, then the
“search direction” was a good one and we go further by performing an
expansion using Eq. 7.3 with 𝛼 = 2. If the reflected point is between
the second worst and the worst, then the direction wasn’t great but it

was at least somewhat of an improvement so we perform an outside


contraction (𝛼 = 1/2). If the reflected point is worse than our worst
point we try an inside contraction instead (𝛼 = −1/2). Shrinking is a
last resort operation that is performed when none of the points along the line
connecting 𝑥 (𝑛) and 𝑥 𝑐 produces a better point. This operation
consists in reducing the size of the simplex by moving all the points
closer to the best point,
 
𝑥 (𝑖) = 𝑥 (0) + 𝛾 𝑥 (𝑖) − 𝑥 (0) for 𝑖 = 1, . . . , 𝑛, (7.5)

where 𝛾 = 0.5.
Alg. 7.2 details how a new simplex is obtained for each iteration. In
each iteration, the focus is on replacing the worst point with a better
one, as opposed to improving the best. The corresponding flowchart is
shown in Fig. 7.4.

Algorithm 7.2: Nelder–Mead algorithm

Inputs:
𝑥 (0) : Starting point
𝜏𝑥 : Simplex size tolerances
𝜏 𝑓 : Function value standard deviation tolerances
Outputs:
𝑥 ∗ : Optimal point

for 𝑗 = 1 to 𝑛 do Create a simplex with edge length 𝑙


𝑥 (𝑗) = 𝑥 (0) + 𝑠 (𝑗) 𝑠 (𝑗) given by (7.2)
end for

while Δ𝑥 > 𝜏𝑥 or Δ 𝑓 > 𝜏 𝑓 do    Simplex size (7.6) and standard deviation (7.7)
    Sort {𝑥 (0) , . . . , 𝑥 (𝑛−1) , 𝑥 (𝑛) }    Order from the lowest (best) to the highest 𝑓 (𝑥 (𝑗) )
    𝑥 𝑐 = (1/𝑛) ∑ 𝑥 (𝑖) , 𝑖 = 0, . . . , 𝑛 − 1    Centroid excluding the worst point 𝑥 (𝑛) (7.4)
    𝑥 𝑟 = 𝑥 𝑐 + (𝑥 𝑐 − 𝑥 (𝑛) )    Reflection, (7.3) with 𝛼 = 1
    if 𝑓 (𝑥 𝑟 ) < 𝑓 (𝑥 (0) ) then    Is the reflected point better than the best?
        𝑥 𝑒 = 𝑥 𝑐 + 2(𝑥 𝑐 − 𝑥 (𝑛) )    Expansion, (7.3) with 𝛼 = 2
        if 𝑓 (𝑥 𝑒 ) < 𝑓 (𝑥 (0) ) then    Is the expanded point better than the best?
            𝑥 (𝑛) = 𝑥 𝑒    Accept expansion and replace worst point
        else
            𝑥 (𝑛) = 𝑥 𝑟    Accept reflection
        end if
    else if 𝑓 (𝑥 𝑟 ) ≤ 𝑓 (𝑥 (𝑛−1) ) then    Is the reflected point better than the second worst?
        𝑥 (𝑛) = 𝑥 𝑟    Accept reflected point
    else
        if 𝑓 (𝑥 𝑟 ) > 𝑓 (𝑥 (𝑛) ) then    Is the reflected point worse than the worst?
            𝑥 𝑖𝑐 = 𝑥 𝑐 − 0.5(𝑥 𝑐 − 𝑥 (𝑛) )    Inside contraction, (7.3) with 𝛼 = −0.5
            if 𝑓 (𝑥 𝑖𝑐 ) < 𝑓 (𝑥 (𝑛) ) then    Is the inside contraction better than the worst?
                𝑥 (𝑛) = 𝑥 𝑖𝑐    Accept inside contraction
            else
                for 𝑗 = 1 to 𝑛 do
                    𝑥 (𝑗) = 𝑥 (0) + 0.5(𝑥 (𝑗) − 𝑥 (0) )    Shrink, (7.5) with 𝛾 = 0.5
                end for
            end if
        else
            𝑥 𝑜𝑐 = 𝑥 𝑐 + 0.5(𝑥 𝑐 − 𝑥 (𝑛) )    Outside contraction, (7.3) with 𝛼 = 0.5
            if 𝑓 (𝑥 𝑜𝑐 ) < 𝑓 (𝑥 𝑟 ) then    Is the contraction better than the reflection?
                𝑥 (𝑛) = 𝑥 𝑜𝑐    Accept outside contraction
            else
                for 𝑗 = 1 to 𝑛 do
                    𝑥 (𝑗) = 𝑥 (0) + 0.5(𝑥 (𝑗) − 𝑥 (0) )    Shrink, (7.5) with 𝛾 = 0.5
                end for
            end if
        end if
    end if
end while
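A minimal Python sketch of Alg. 7.2 follows. It is only meant to illustrate the logic; the function name, defaults, and convergence settings are our own assumptions rather than a reference implementation.

```python
import numpy as np

def nelder_mead(f, x0, l=1.0, tau_x=1e-6, tau_f=1e-6, max_iter=500):
    """Minimal sketch of Alg. 7.2; returns the best point found and its value."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    # Regular starting simplex with edge length l (Eq. 7.2)
    b = l / (n * np.sqrt(2)) * (np.sqrt(n + 1) - 1)
    X = np.tile(x0, (n + 1, 1))
    X[1:] += b + np.eye(n) * (l / np.sqrt(2))
    F = np.array([f(x) for x in X])
    for _ in range(max_iter):
        order = np.argsort(F)                          # best (lowest f) first
        X, F = X[order], F[order]
        # Convergence: simplex size (7.6) and standard deviation of f (7.7)
        dx = sum(np.linalg.norm(X[i] - X[-1]) for i in range(n))
        if dx <= tau_x and np.std(F) <= tau_f:
            break
        xc = X[:-1].mean(axis=0)                       # centroid excluding worst (7.4)
        xr = xc + (xc - X[-1]); fr = f(xr)             # reflection, alpha = 1
        if fr < F[0]:                                  # better than the best
            xe = xc + 2.0 * (xc - X[-1]); fe = f(xe)   # expansion, alpha = 2
            X[-1], F[-1] = (xe, fe) if fe < F[0] else (xr, fr)
        elif fr <= F[-2]:                              # better than the second worst
            X[-1], F[-1] = xr, fr
        else:
            if fr > F[-1]:                             # worse than the worst
                xn = xc - 0.5 * (xc - X[-1])           # inside contraction, alpha = -0.5
                fn = f(xn); accept = fn < F[-1]
            else:
                xn = xc + 0.5 * (xc - X[-1])           # outside contraction, alpha = 0.5
                fn = f(xn); accept = fn < fr
            if accept:
                X[-1], F[-1] = xn, fn
            else:                                      # shrink toward the best point (7.5)
                X[1:] = X[0] + 0.5 * (X[1:] - X[0])
                F[1:] = [f(x) for x in X[1:]]
    i = int(np.argmin(F))
    return X[i], F[i]

# Minimal usage check on a smooth quadratic
print(nelder_mead(lambda x: (x[0] - 1.0)**2 + (x[1] - 2.0)**2, [0.0, 0.0]))
```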

The cost for each iteration is one function evaluation if the reflection
is accepted, two function evaluations if an expansion or contraction is
performed, and 𝑛 + 2 evaluations if the iteration results in shrinking.
Although we could parallelize the 𝑛 evaluations when shrinking, it
would not be worthwhile because the other operations are sequential.
There are a number of ways to quantify the convergence of the
simplex method. One straightforward way is to use the size of simplex,
i.e.,
$$\Delta x = \sum_{i=0}^{n-1} \left\lVert x^{(i)} - x^{(n)} \right\rVert, \tag{7.6}$$

and specify that it must be less than a certain tolerance. Another


measure of convergence we can use is the standard deviation of the
function values,

$$\Delta f = \sqrt{\frac{\sum_{i=0}^{n} \left( f^{(i)} - \bar{f} \right)^2}{n+1}}, \tag{7.7}$$
where 𝑓¯ is the mean of the 𝑛 + 1 function values. Another possible
convergence criterion is the difference between the best and worst value
in the simplex.

[Figure 7.4: Flowchart of the Nelder–Mead algorithm (Alg. 7.2).]

Note that the methodology, like most direct-search methods, cannot
directly handle constraints. One approach to handle constraints would

be to use a penalty method (discussed in Section 5.3) to form an


unconstrained problem. In this case, the penalty does not need to be
differentiable, so a linear penalty method would suffice.

Example 7.3: Nelder–Mead algorithm applied to the bean function.

Figure 7.5 shows the sequence of simplices that results when minimizing
the bean function using a Nelder–Mead simplex. The initial simplex on the
upper left is equilateral. The first iteration is a reflection, followed by an
inside contraction, another reflection, and an inside contraction before the
shrinking. The simplices then shrink dramatically in size, slowly converging to
the minimum.
Using a convergence tolerance of 10−6 in the difference between 𝑓best and
𝑓worst the problem took 68 function evaluations.

[Figure 7.5: Sequence of simplices that minimize the bean function (Ex. 7.3).]

7.4 DIRECT Algorithm

The divided rectangles (DIRECT) algorithm is different from the other
gradient-free optimization algorithms in this chapter in that it is based
on rigorous mathematics.∗∗ This is a deterministic method that is
guaranteed to converge to the global optimum under conditions that
are not too restrictive (although it might require a prohibitive number
of function evaluations).

∗∗ This method was developed by Jones et al.47, who was motivated to develop a global search that did not rely on any tunable parameters (such as population size in genetic algorithms).
47. Jones et al., Lipschitzian optimization without the Lipschitz constant. 1993
One way to guarantee finding the global optimum within a finite
design space is by dividing this space into a regular rectangular grid
and evaluating every point in this grid. This is called an exhaustive search,
and the precision of the minimum depends on how fine the grid is. The
cost of this brute-force strategy is high and goes up exponentially with
the number of design variables.
The DIRECT method also relies on a grid, but it uses an adaptive
meshing scheme that greatly reduces the cost. It starts with a single 𝑛-
dimensional hypercube that spans the whole design space—like genetic
algorithms, DIRECT requires upper and lower bounds on all the design
variables. Each iteration divides this hypercube into smaller ones and
evaluates the objective function at the center of each of these. At each
iteration, the algorithm only divides rectangles that are determined to
be potentially optimal. The key strategy in the DIRECT method is how it
determines this subset of potentially optimal rectangles, which is based
on the mathematical concept of Lipschitz continuity.
We start by explaining the concept of Lipschitz continuity and
then explain an algorithm for finding the global minimum of a one-
dimensional function using this concept—Shubert’s algorithm. While
Shubert’s algorithm is not practical in general, it will help us under-
stand the mathematical rationale for the DIRECT algorithm. Then we
explain the DIRECT algorithm for one-dimensional functions before
generalizing it for 𝑛 dimensions.

The Lipschitz Constant


Consider the single variable function 𝑓 (𝑥) shown in Fig. 7.6. For a trial
point 𝑥 ∗ , we can draw a cone with slope 𝑘 by drawing the lines,

𝑓+ (𝑥) = 𝑓 (𝑥 ∗ ) + 𝑘(𝑥 − 𝑥 ∗ ), (7.8)


𝑓− (𝑥) = 𝑓 (𝑥 ∗ ) − 𝑘(𝑥 − 𝑥 ∗ ), (7.9)

to the left and right, respectively. We show this cone in Fig. 7.6 (left), as
well as cones corresponding to other values of 𝑘.

[Figure 7.6: From a given trial point 𝑥 ∗ , we can draw a cone with slope 𝑘 (left). For a function to be Lipschitz continuous, we need all cones with slope 𝑘 to lie under the function for all points in the domain (right).]

A function 𝑓 is said to be Lipschitz continuous if there is a cone slope


𝑘 such that the cones for all possible trial points in the domain remain
under the function. This means that there is a positive constant 𝑘 such
that
| 𝑓 (𝑥) − 𝑓 (𝑥 ∗ )| ≤ 𝑘|𝑥 − 𝑥 ∗ |, for all 𝑥, 𝑥 ∗ ∈ 𝐷, (7.10)
where 𝐷 is the function domain. Graphically, this condition means
that we can draw a cone with slope 𝑘 from any trial point evaluation
𝑓 (𝑥 ∗ ) such that the function is always bounded by the cone, as shown
in Fig. 7.6 (right). Any 𝑘 that satisfies condition (7.10) is a Lipschitz
constant for the corresponding function.

Shubert’s Algorithm
If a Lipschitz constant for a single variable function is known, Shubert’s
algorithm can find the global minimum of that function. Because the
Lipschitz constant is not available in the general case, the DIRECT
algorithm is designed so that it does not require this constant. However,
we explain Shubert’s algorithm first because it provides some of the
basic concepts used in the DIRECT algorithm.
Shubert’s algorithm starts with a domain within which we want to
find the global minimum—[𝑎, 𝑏] in Fig. 7.7. Using the property of the
Lipschitz constant 𝑘 defined in Eq. 7.10, we know that the function is
always above a cone of slope 𝑘 evaluated at any point in the domain.
We start by establishing a first lower bound on the global minimum
by finding the intersection of the cones—𝑥1 in Fig. 7.7 (left)—for the

extremes of the domain. We evaluate the function at 𝑥1 and can now
draw a cone about this point to find two more intersections (𝑥2 and
𝑥3 ). Because these two points always intersect at the same objective
lower bound value, they both need to be evaluated to see which one
has the highest lower bound increase (the 𝑥3 side in this case). Each
subsequent iteration of Shubert’s algorithm adds two new points to
either side of the current point. These two points are evaluated to find
out which side has the lowest actual function value, and that side gets
selected to be divided.

[Figure 7.7: Shubert’s algorithm requires an initial domain and a valid Lipschitz constant (left) and then increases the lower bound of the global minimum with each successive iteration (right).]
The lowest bound on the function increases at each iteration and
ultimately converges to the global minimum. At the same time, the
segments in 𝑥 decrease in size. The lower bound can switch from
distinct regions, as the lower bound in one region increases beyond
the lower bound in another region. Using the minimum Lipschitz
constant in this algorithm would be the most efficient because it would
correspond to the largest possible increments in the lower bound at
each iteration.
The two major shortcomings of Shubert’s algorithm are that: (1) a
Lipschitz constant is usually not available for a general function, and (2)
it is not easily extended to 𝑛 dimensions. These two shortcomings are
addressed by the DIRECT algorithm.

One-dimensional DIRECT
Before explaining the 𝑛-dimensional DIRECT algorithm, we introduce
the one-dimensional version, which is based on principles similar to
those of the Shubert algorithm. The main difference is that instead
of evaluating at the cone intersection points, we divide the segments
evenly and evaluate the center of the segments.
Consider the closed domain [𝑎, 𝑏] shown in Fig. 7.8 (left). For each
segment, we evaluate the objective function at the midpoint of the
segment. In the first segment, which spans the whole domain, this is
𝑐 0 = (𝑎 + 𝑏)/2. Assuming some value of 𝑘, which is not known and

which we will not need, the lower bound on the minimum would be
𝑓 (𝑐) − 𝑘(𝑏 − 𝑎)/2.

[Figure 7.8: The DIRECT algorithm evaluates the middle point 𝑐 = (𝑎 + 𝑏)/2 of each segment (left), and each successive iteration trisects the segments that have the greatest potential (right).]

We want to increase this lower bound on the function minimum
by dividing this segment further. To do this in a regular way that
reuses previously evaluated points and can be repeated indefinitely,
we divide it into three segments, as shown in Fig. 7.8 (right). Now we
have increased the lower bound on the minimum. Unlike the Shubert
algorithm, the lower bound is a discontinuous function across the
segments, as shown in Fig. 7.8 (right). We now have a regular division
of segments, which is more amenable for extending the method to 𝑛
dimensions.
Instead of continuing to divide every segment into three other
segments, we only divide segments selected according to a potentially
optimal criterion. To better understand this criterion, consider a set of
segments [𝑎 𝑖 , 𝑏 𝑖 ] at a given DIRECT iteration, where segment 𝑖 has a
half length 𝑑 𝑖 = (𝑏 𝑖 − 𝑎 𝑖 )/2 and a function value 𝑓 (𝑐 𝑖 ) evaluated at the
segment center 𝑐 𝑖 = (𝑎 𝑖 + 𝑏 𝑖 )/2. If we plot 𝑓 (𝑐 𝑖 ) versus 𝑑 𝑖 for a set of
segments, we get the pattern shown in Fig. 7.9.

[Figure 7.9: Potentially optimal segments in the DIRECT algorithm are identified by the lower convex hull of the plot of 𝑓 (𝑐 𝑖 ) versus 𝑑 𝑖 , offset by 𝑓min − 𝜀 | 𝑓min |.]

The overall rationale for the potentially optimal criterion is that there
are two metrics that quantify this potential: the size of the segment and

the function value at the center of the segment. The greater the size of
the segment, the greater the potential for containing a minimum. The
lower the function value, the greater that potential is as well. For a set
of segments of the same size, we know that the one with the lowest
function value has the best potential and should be selected. If two
segments had the same function value and different sizes, the one with
the largest size should be selected. For a general set of segments
with various sizes and value combinations, there might be multiple
that can be considered potentially optimal.
We identify potentially optimal segments as follows. If we draw a
line with a slope corresponding to a Lipschitz constant 𝑘 from any point
in Fig. 7.9, the intersection of this line with the vertical axis is a bound
on the objective function for the corresponding segment. Therefore,
the lowest bound for a given 𝑘 can be found by drawing a line through
the point that achieves the lowest intersection.
However, we do not know 𝑘 and we do not want to assume a value
because we do not want to bias the search. If 𝑘 were high, it would favor
dividing the larger segments. Low values of 𝑘 would result in dividing
the smaller segments. The DIRECT method hinges on considering all
possible values of 𝑘, effectively eliminating the need for this constant.
To eliminate the dependence on 𝑘, we select all the points for which
there is a line with slope 𝑘 that does not go above any other point. This
corresponds to selecting the points that form a lower convex hull, as
shown by the piecewise linear function in Fig. 7.9. This establishes a
lower bound on the function for each segment size.
Mathematically, a segment 𝑗 in the set of current segments 𝑆 is said
to be potentially optimal if there is a 𝑘 ≥ 0 such that

𝑓 (𝑐 𝑗 ) − 𝑘𝑑 𝑗 ≤ 𝑓 (𝑐 𝑖 ) − 𝑘𝑑 𝑖 ∀𝑖 ∈ 𝑆 (7.11)
𝑓 (𝑐 𝑗 ) − 𝑘𝑑 𝑗 ≤ 𝑓min − 𝜀 | 𝑓min | (7.12)

where 𝑓min is the best current objective function value, and 𝜀 is a small
positive parameter. The first condition corresponds to finding the
points in the lower convex hull mentioned previously.
The second condition in Eq. 7.12 ensures that the potential minimum
is better than the lowest function value so far by at least a small amount.
This prevents the algorithm from becoming too local, wasting function
evaluations in search of smaller function improvements. The parameter
𝜀 balances the search between local and global search. A typical value
is 𝜀 = 10−4 , and its the range is usually such that 10−2 ≤ 𝜀 ≤ 10−7 .
There are efficient algorithms for finding the convex hull of an
arbitrary set of points in two dimensions, such as the Jarvis march.

These algorithms are more than we need here, since we only require the
lower part of the convex hull, so they can be simplified for this purpose.
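As an illustration, the sketch below identifies the potentially optimal set by brute force, checking Eqs. 7.11 and 7.12 directly for each rectangle. Practical DIRECT implementations use a lower-convex-hull pass instead; the function shown here is our own illustrative construction, not the one used for the figures in this chapter.

```python
import numpy as np

def potentially_optimal(d, fc, eps=1e-4):
    """Indices of potentially optimal segments/rectangles per Eqs. 7.11-7.12.

    d  : half-lengths (or center-to-vertex distances) of the rectangles
    fc : objective values at the rectangle centers
    """
    d, fc = np.asarray(d, float), np.asarray(fc, float)
    fmin = fc.min()
    selected = []
    for j in range(len(d)):
        smaller, larger = d < d[j], d > d[j]
        # Feasible range of the slope k for which j gives the lowest bound (Eq. 7.11)
        k_low = np.max((fc[j] - fc[smaller]) / (d[j] - d[smaller])) if smaller.any() else 0.0
        k_high = np.min((fc[larger] - fc[j]) / (d[larger] - d[j])) if larger.any() else np.inf
        if fc[j] > fc[d == d[j]].min():   # a same-size rectangle has a lower value
            continue
        if max(k_low, 0.0) > k_high:      # no k >= 0 satisfies Eq. 7.11
            continue
        if np.isinf(k_high):
            selected.append(j)            # largest rectangles: Eq. 7.12 holds for large k
        elif fc[j] - k_high * d[j] <= fmin - eps * abs(fmin):
            selected.append(j)            # Eq. 7.12 with the largest admissible slope
    return selected

# Small usage check with illustrative values
print(potentially_optimal(d=[0.5, 0.5, 0.17, 0.17], fc=[1.0, 3.0, 0.4, 0.9]))
```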
As in the Shubert algorithm, the division might switch from one part
of the domain to another, depending on the new function values. When
compared to the Shubert algorithm, the DIRECT algorithm produces
a discontinuous lower bound on the function values, as shown in
Fig. 7.10.

[Figure 7.10: The lower bound on function values for the DIRECT method is discontinuous at the segment boundaries.]

DIRECT in 𝑛 Dimensions
The 𝑛-dimensional DIRECT algorithm is similar to the one-dimensional
version but becomes more complex.†† The main difference is that we
deal with hyperrectangles instead of segments. A hyperrectangle can
be defined by its centerpoint position 𝑐 in 𝑛-dimensional space and a
half length in each direction 𝑖, 𝛿𝑒 𝑖 , as shown in Fig. 7.11. The DIRECT
algorithm assumes that the initial dimensions are normalized so that
we start with a hypercube.

†† In this chapter, we present an improved version of DIRECT.99
99. Jones, Direct Global Optimization Algorithm. 2009

[Figure 7.11: Hyperrectangle in three dimensions, where 𝑑 is the maximum distance between the center and the vertices and 𝛿𝑒 𝑖 is the half-length in each direction 𝑖.]

To identify the potentially optimal rectangles at a given iteration, we


use exactly the same conditions in Eqs. (7.11) and (7.12), but 𝑐 𝑖 is now
the center of the hyperrectangle, and 𝑑 𝑖 is the maximum distance from
the center to a vertex. The explanation illustrated in Fig. 7.9 still applies
in the 𝑛-dimensional case and still just involves finding the lower convex
hull of a set of points with different combinations of 𝑓 and 𝑑.
The main complication introduced in the 𝑛-dimensional case is the

division of a selected hyperrectangle. The question is which directions


should be divided first. The logic to handle this in the DIRECT algorithm
is to prioritize the reduction of the dimensions with the maximum
length, which ensures that hyperrectangles do not deviate too much
from the proportions of a hypercube. First, we select the set of longest
dimensions for trisection (there are often multiple dimensions with
the same length). Among this set of longest dimensions, we select the
direction that has been split the least over the whole history of the
search. If there are still multiple dimensions in the selection, we simply
select the one with the lowest index. Alg. 7.4 provides the details of
this selection and its place in the overall algorithm.

Algorithm 7.4: DIRECT in 𝑛-dimensions

Inputs:
𝑥: Variable upper bounds
x: Variable lower bounds
Outputs:
𝑥 ∗ : Optimal point

Normalize the design space to be the unit hypercube.


Compute center of the hypercube, 𝑐0
𝑓min = 𝑓 (𝑐0 )
while not converged do
Find the set 𝑆 of potentially optimal hyperrectangles
for each hyperrectangle 𝑟 ∈ 𝑆
Find the set 𝐼 of dimensions that have maximum side length, 𝑙max .
if There is only one maximum side length then select it
else Select the dimension with the lowest number of divisions over the
whole history
if There is more than one selected dimension then select the one
with the lowest dimension index
end if
end if
Divide the rectangle into thirds along the selected dimension
𝑘 = 𝑘+1
end while

Fig. 7.12 shows the first three iterations for a two-dimensional


example and the corresponding visualization of conditions expressed
in Eqs. (7.11) and (7.12). We start with a square that contains the whole
domain and evaluate the center point. The value of this point is plotted
on the 𝑓 -𝑑 plot on the far right. The first iteration trisects the starting
square along the first dimension and evaluates the two new points.

The values for these three points are plotted in the 2nd column from
the right in the 𝑓 -𝑑 plot, where the center point is reused, as indicated
by the arrow and the matching color. At this iteration, we have two
points that define the convex hull. In the second iteration, we have
three rectangles of the same size, so we divide the one with the lowest
value and evaluate the centers of the two new rectangles (which are
squares in this case). We now have another column of points in the 𝑓 -𝑑
plot corresponding to a smaller 𝑑 and an additional point that defines
the lower convex hull. Because the convex hull now has two points, we
trisect two different rectangles in the third iteration.

[Figure 7.12: DIRECT iterations for the two-dimensional case (left: at each iteration, select rectangles, then trisect and sample) and corresponding identification of potentially optimal rectangles in the 𝑓 versus 𝑑 plot (right).]

Example 7.5: Minimization of multimodal function with DIRECT.

Consider the Jones function, which is a two-dimensional analytic function


defined as
$$f(x_1, x_2) = x_1^4 + x_2^4 - 4 x_1^3 - 3 x_2^3 + 2 x_1^2 + 2 x_1 x_2, \tag{7.13}$$
which has multiple local minima. Applying the DIRECT method to this
function, we get the sequence of rectangles shown below.

[Figure 7.13: Sequence of rectangles generated by DIRECT on the Jones function, in the 𝑥1 versus 𝑥2 design space.]
[Figure 7.14: Objective value 𝑓 versus rectangle size 𝑑 over the DIRECT iterations on the Jones function.]
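To experiment with this example, one option is to apply an existing DIRECT implementation to the Jones function. The sketch below assumes SciPy 1.9 or later, which provides scipy.optimize.direct; the bounds are our assumption based on the plotted domain, and SciPy's variant of the algorithm will not reproduce the figures above exactly.

```python
import numpy as np
from scipy.optimize import direct  # available in SciPy >= 1.9

def jones(x):
    """Jones function, Eq. 7.13."""
    x1, x2 = x
    return x1**4 + x2**4 - 4*x1**3 - 3*x2**3 + 2*x1**2 + 2*x1*x2

# Illustrative bounds covering the plotted design space
res = direct(jones, bounds=[(-2.0, 4.0), (-2.0, 4.0)])
print(res.x, res.fun)
```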
7.5 Genetic Algorithms

Genetic algorithms (GAs) are the most well-known and widely used
type of evolutionary algorithm. They were also among the earliest to
have been developed.‡‡ GAs, like many evolutionary algorithms, are
population based: The optimization starts with a set of design points (the
population) rather than a single starting point, and each optimization
iteration updates this set in some way. Each iteration in the GA is called
a generation, and each generation has a population with 𝑁𝑝 points. A
chromosome is used to represent each point and contains the values for
all the design variables, as shown in Fig. 7.15. Each design variable is
represented by a gene. As we will see later, there are different ways for
genes to represent the design variables.

‡‡ The first GA software was written in 1954, followed by other seminal work.101 Initially, these GAs were not written to perform optimization, but rather, to model the evolutionary process. GAs were eventually applied to optimization.102
101. Barricelli, Esempi numerici di processi di evoluzione. 1954
102. Jong, An analysis of the behavior of a class of genetic adaptive systems. 1975

[Figure 7.15: Each GA iteration involves a population of design points, where each design is represented by a chromosome and each design variable is represented by a gene.]

GAs evolve the population using an algorithm inspired by biological
reproduction and evolution using three main steps: 1) selection, 2)
crossover, and 3) mutation. Selection is based on natural selection, where
members of the population that acquire favorable adaptations survive
longer and contribute more to the population gene pool. Crossover is
inspired by chromosomal crossover, which is the exchange of genetic
material between chromosomes during sexual reproduction. In this
step, two parents produce two offspring. Mutation mimics genetic
mutation, which is a permanent change in the gene sequence that occurs
naturally.
Alg. 7.6 and Fig. 7.16 show how these three steps are used to perform
optimization. Although most GAs follow this general procedure, there
is a great degree of flexibility in how the steps are performed, leading
to many variations in GAs. For example, there is no one single method
specified for the generation of the initial population, and the size of
that population varies. Similarly, there are many possible methods for
selecting the parents, for generating the offspring, and for selecting the
survivors. Here, the new population (𝑃𝑘+1 ) is formed exclusively by
the offspring generated from crossover. However, some GAs add an
extra selection process that selects a surviving population of size 𝑁𝑝
among the population of parents and offspring.
among the population of parents and offspring.

Algorithm 7.6: Genetic algorithm

Inputs:
𝑥: Variable upper bounds
x: Variable lower bounds
Outputs:
𝑥 ∗ : Optimal point

𝑘 = 0
𝑃 𝑘 = {𝑥 (1) , 𝑥 (2) , . . . , 𝑥 (𝑁𝑃 ) }    Generate initial population
while 𝑘 < 𝑘max do
    Compute 𝑓 (𝑥) ∀ 𝑥 ∈ 𝑃 𝑘    Evaluate fitness
    Select 𝑁𝑝 /2 parent pairs from 𝑃 𝑘 for crossover    Selection
    Generate a new population of 𝑁𝑝 offspring (𝑃 𝑘+1 )    Crossover
    Randomly mutate some points in the population    Mutation
    𝑘 = 𝑘 + 1
end while

[Figure 7.16: At each GA iteration, pairs of parents are selected from the population to generate the offspring through crossover, which become the new population.]

In addition to the flexibility in the various operations, there are also


different methods for representing the design variables in a genetic
algorithm. The design variable representation can be used to classify
genetic algorithms into two broad categories: binary-encoded and real-
encoded genetic algorithms. Binary-encoded algorithms use bits to
represent the design variables, while the real-encoded algorithms keep
the same real value representation used in most other algorithms. The
details of the operations in Alg. 7.6 depend on whether we are using
one or the other of these representations, but the principles remain the
same. In the rest of this section, we describe in more detail a particular
way of performing these operations for each of the possible design
variable representations.

7.5.1 Binary-encoded Genetic Algorithms


The original genetic algorithms were based on binary encoding because
they more naturally mimic chromosome encoding. Binary-coded
GAs are widely used and are applicable to discrete or mixed-integer
problems.§§ When using binary encoding, we represent each variable
as a binary number with 𝑚 bits. Each bit in the binary representation
has a location, 𝑖, and a value, 𝑏 𝑖 (which is either 0 or 1). If we want
to represent a real-valued variable, we first need to consider a finite
interval 𝑥 ∈ [x, 𝑥¯ ], which we can then divide into 2^𝑚 − 1 intervals. The
size of the interval is given by

$$\Delta x = \frac{\bar{x} - \underline{x}}{2^m - 1}. \tag{7.14}$$

§§ One popular binary-encoded genetic algorithm implementation is NSGA-II.103
103. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II. 2002
To have a more precise representation, we must use more bits.
When using binary-encoded GAs, we do not need to encode the de-
sign variables (since they are generated and manipulated directly in the
binary representation), but we do need to decode them before providing
them to the evaluation function. To decode a binary representation, we
use
$$x = \underline{x} + \sum_{i=0}^{m-1} b_i \, 2^i \, \Delta x. \tag{7.15}$$

Example 7.7: Binary representation of a real number.



Suppose we have a continuous design variable 𝑥 that we want to represent


in the interval [−20, 80] using 12 bits. Then, we have 2^12 − 1 = 4,095 intervals,
and using Eq. 7.14, we get Δ𝑥 ≈ 0.0244. This interval is the error in this finite
precision representation. For the sample binary representation shown below,
we can use Eq. 7.15 to compute the equivalent real number, which turns out to
be 𝑥 ≈ 32.55.

𝑖 1 2 3 4 5 6 7 8 9 10 11 12
𝑏𝑖 0 0 0 1 0 1 1 0 0 0 0 1
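The decoding in Eqs. 7.14 and 7.15 can be written in a few lines of Python. The sketch below is our own illustration; it reproduces the value from this example when the bit listed at position 𝑖 = 1 is treated as the least significant one.

```python
def decode(bits, xl, xu):
    """Decode a bit string into a real value in [xl, xu] (Eqs. 7.14-7.15).

    bits[0] is taken as the least significant bit.
    """
    m = len(bits)
    dx = (xu - xl) / (2**m - 1)                      # interval size, Eq. 7.14
    return xl + sum(b * 2**i for i, b in enumerate(bits)) * dx

# Sample bit string from Ex. 7.7 (b_1 listed first)
bits = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
print(decode(bits, -20.0, 80.0))                     # about 32.55
```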

Initial Population
The first step in a genetic algorithm is to generate an initial set (pop-
ulation) of points. As a rule of thumb, the population size should be
approximately one order of magnitude larger than the number of design
variables, but in general you will need to experiment with different
population sizes.
One popular way to choose the initial population is to do it at random.
Using binary encoding, we can assign each bit in the representation of
the design variables a 50% chance of being either 1 or 0. This can be
done by generating a random number 0 ≤ 𝑟 ≤ 1 and setting the bit to 0
if 𝑟 ≤ 0.5 and 1 if 𝑟 > 0.5. For a population of size 𝑁𝑃 , with 𝑛 𝑥 design
variables, and each variable is encoded using 𝑚 bits, the total number
of bits that needs to be generated is 𝑁𝑃 × 𝑛 𝑥 × 𝑚.
To achieve a better spread in higher-dimensional spaces, methods like
Latin hypercube sampling (discussed in Section 10.2) are generally more
effective than random populations.
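A random binary initial population can be generated directly from these probabilities. The following sketch is illustrative only; it draws each bit with equal probability using NumPy.

```python
import numpy as np

rng = np.random.default_rng()

def random_binary_population(n_pop, n_x, m):
    """Random initial population: n_pop members, n_x variables, m bits each.

    Each bit is 0 or 1 with 50% probability (n_pop * n_x * m bits in total).
    """
    return rng.integers(0, 2, size=(n_pop, n_x, m))

pop = random_binary_population(n_pop=40, n_x=4, m=12)
print(pop.shape)
```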

Evaluate Fitness
The objective function for all the points in the population must be
evaluated and then converted to a fitness value. These evaluations
could be done in parallel. The numerical optimization convention
is usually to minimize the objective, while the GA convention is to
maximize the fitness. Therefore, we can convert the objective to fitness
simply by setting 𝐹 = − 𝑓 .
For some types of selection (like the roulette wheel selection detailed
further below), all the fitness values need to be positive. To achieve
that, we can perform the following conversion:

$$F = \frac{-f_i + \Delta F}{\max\left(1, \Delta F - f_\mathrm{low}\right)}, \tag{7.16}$$

where Δ𝐹 = 1.1 𝑓high −0.1 𝑓low is based on the highest and lowest function
values in the population, and the denominator is introduced to scale
the fitness.

Selection
In this step we choose points from the population for reproduction
in a subsequent step. On average, it is desirable to choose a mating
pool that improves in fitness (thus mimicking the concept of natural
selection), but it is also important to maintain diversity. In total, we
need to generate 𝑁𝑃 /2 pairs.
The simplest selection method is to randomly select two points from
the population until the requisite number of pairs is complete. This
approach is not particularly effective because there is no mechanism to
move the population toward points with better objective functions.
Tournament selection is a better method that randomly pairs up 𝑁𝑃
points, and selects the best point from each pair to join the mating pool.
The same pairing and selection process is repeated to create 𝑁𝑃 /2 more
points to complete a mating pool of 𝑁𝑃 points.

Example 7.8: Tournament selection process.

Figure 7.17 illustrates the process with a very small population. Each
member of the population ends up in the mating pool zero, one, or two times
with better points more likely to appear in the pool. The best point in the
population will always end up in the pool twice, while the worst point in the
population will always be eliminated.

[Figure 7.17: Tournament selection example.]
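A possible implementation of tournament selection is sketched below. It assumes an even population size and that larger fitness values are better; the fitness values used in the usage line are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def tournament_selection(F):
    """Return indices of an N_P-member mating pool from fitness values F.

    The population is randomly paired twice; the fitter member of each pair
    joins the pool, so the pool again has N_P members.
    """
    n = len(F)
    pool = []
    for _ in range(2):                       # two rounds of pairing give N_P winners
        perm = rng.permutation(n)
        for a, b in perm.reshape(-1, 2):     # assumes an even population size
            pool.append(a if F[a] > F[b] else b)
    return pool

print(tournament_selection(np.array([12.0, 10.0, 7.0, 15.0, 2.0, 6.0])))
```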

Another common method is roulette wheel selection. This concept


is patterned after a roulette wheel used in a casino. It assigns better

points a larger sector on the roulette wheel so that they have a higher
probability of being selected.
To find the sizes of the sectors in the roulette wheel selection, we
use the fitness value defined by Eq. 7.16. We then take the normalized
cumulative sum of the scaled fitness values to compute an interval for
each member 𝑗 in the population as

$$S_j = \frac{\sum_{i=1}^{j} F_i}{\sum_{i=1}^{N_P} F_i} . \tag{7.17}$$

We can now create a mating pool of 𝑁𝑃 points by turning the roulette


wheel 𝑁𝑃 times. We do this by generating a random number 0 ≤ 𝑟 ≤ 1
at each turn. The 𝑗 th member is copied to the mating pool if

𝑆 𝑗−1 < 𝑟 ≤ 𝑆 𝑗 (7.18)

This ensures that the probability of a member being selected for repro-
duction is proportional to its scaled fitness value.
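The construction of Eqs. 7.17 and 7.18 maps naturally to a cumulative sum followed by a search. The sketch below is our own illustration and uses the fitness values of the example that follows.

```python
import numpy as np

rng = np.random.default_rng()

def roulette_selection(F, n_select):
    """Pick n_select members using roulette wheel selection on positive fitness F."""
    S = np.cumsum(F) / np.sum(F)          # normalized cumulative fitness, Eq. 7.17
    r = rng.random(n_select)              # one spin of the wheel per selection
    return np.searchsorted(S, r)          # member j such that S_{j-1} < r <= S_j (Eq. 7.18)

# With the fitness values of the example below, S = [0.25, 0.3125, 0.875, 1.0]
F = np.array([20.0, 5.0, 45.0, 10.0])
print(roulette_selection(F, n_select=4))
```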

Example 7.9: Roulette wheel selection process.

Assume that 𝐹 = [20, 5, 45, 10]. Then 𝑆 = [0.25, 0.3125, 0.875, 1], which
divides the “wheel” into the four segments shown graphically in Fig. 7.18.

[Figure 7.18: Roulette wheel selection example.]

Crossover

In the reproduction operation, two points (offspring) are generated
from a pair of points (parents). Various strategies are possible in genetic
algorithms. Single-point crossover usually involves generating a random
integer 1 ≤ 𝑘 ≤ 𝑚 − 1 that defines the crossover point. This is illustrated
in Table 7.2. For one of the offspring, the first 𝑘 bits are taken from, say,
parent 1 and the remaining bits from parent 2. For the second offspring,
the first 𝑘 bits are taken from parent 2 and the remaining ones from
parent 1. Various extensions exist like two-point crossover or 𝑛-point
crossover.

Mutation
Mutation is a random operation performed to change the genetic infor-
mation and is needed because even though selection and reproduction
effectively recombine existing information, occasionally some useful

Table 7.2: Single-point crossover operation example.


Before crossover After crossover
11 111 11 000
00 000 00 111

genetic information might be lost. The mutation operation protects


against such irrecoverable loss and introduces additional diversity into
the population.
When using bit representation, every bit is assigned a small mutation
probability, say 𝑝 = 0.005 ∼ 0.1. This is done by generating a
random number 0 ≤ 𝑟 ≤ 1 for each bit, which is changed if 𝑟 < 𝑝. An
example is illustrated in Table 7.3.

Table 7.3: Mutation example where only one bit changed.


Before mutation After mutation
11111 11011
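Single-point crossover and bitwise mutation can be sketched as follows; the probabilities and function names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng()

def single_point_crossover(p1, p2):
    """Create two offspring by swapping tails after a random crossover point."""
    m = len(p1)
    k = rng.integers(1, m)                     # crossover point, 1 <= k <= m - 1
    c1 = np.concatenate([p1[:k], p2[k:]])
    c2 = np.concatenate([p2[:k], p1[k:]])
    return c1, c2

def mutate(bits, p=0.01):
    """Flip each bit independently with a small probability p."""
    flip = rng.random(len(bits)) < p
    return np.where(flip, 1 - bits, bits)

c1, c2 = single_point_crossover(np.array([1, 1, 1, 1, 1]), np.array([0, 0, 0, 0, 0]))
print(c1, c2, mutate(c1))
```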

7.5.2 Real-encoded Genetic Algorithms


As the name implies, real-encoded GAs represent the design variables
in their original representation as real numbers. This has several
advantages over the binary-encoded approach. First, real-encoding
represents numbers up to machine precision rather than being limited by
the initial choice of string length required in binary-encoding. Second,
it avoids the “Hamming cliff” issue of binary-encoding, which is caused
by the fact that a large number of bits must change to move between
adjacent real numbers (e.g., 0111 to 1000). Third, some real-encoded
GAs are able to generate points outside the design variable bounds
used to create the initial population; in many problems, the design
variables are not bounded. Finally, it avoids the burden of binary
coding and decoding. The main disadvantage is that integer or discrete
variables cannot be handled in a straightforward way. For problems
that are continuous, a real-encoded GA is generally more efficient than
a binary-encoded GA.100 We now describe the required changes to the
GA operations in the real-encoded approach.
100. Simon, Evolutionary Optimization Algorithms. 2013

Initial Population
The most common approach is to pick the 𝑁𝑃 points using random
sampling within the provided design bounds. Each member is often
chosen at random within some initial bounds. For each design variable
𝑥 𝑖 , with bounds such that \underline{x}_i \le x_i \le \bar{x}_i , we could use

x_i = \underline{x}_i + r\,(\bar{x}_i - \underline{x}_i)                (7.19)

where 𝑟 is a random number such that 0 ≤ 𝑟 ≤ 1.


Again, for higher dimensional spaces Latin hypercube sampling can
provide better coverage.
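A sketch of this random initialization (Eq. 7.19) for a whole population might look as follows; for Latin hypercube sampling, a library routine such as scipy.stats.qmc.LatinHypercube could replace the uniform samples.

import numpy as np

def initial_population(x_lower, x_upper, n_pop, rng=np.random.default_rng()):
    # generate n_pop members uniformly at random within the bounds (Eq. 7.19)
    n_x = len(x_lower)
    r = rng.random((n_pop, n_x))                 # random numbers in [0, 1)
    return x_lower + r * (x_upper - x_lower)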

Selection
The selection operation does not depend on the design variable en-
coding, and therefore, we can just use any of the selection approaches
already described in the binary-encoded GA.

Crossover
When using real-encoding, the term “crossover” does not accurately
describe the process of creating the two offspring from a pair of points.
Instead, the approaches are more accurately described as a blending,
although the name crossover is still often used.
There are various options for the reproduction of two points encoded
using real numbers. A common method is linear crossover, which
generates two or more points on the line defined by the two parent
points. One option for linear crossover is to generate the following two
points:
𝑥 𝑐1 = 0.5𝑥 𝑝1 + 0.5𝑥 𝑝2 ,
(7.20)
𝑥 𝑐2 = 2𝑥 𝑝2 − 𝑥 𝑝1 ,
where parent 2 is fitter than parent 1 ( 𝑓 (𝑥 𝑝2 ) < 𝑓 (𝑥 𝑝1 )). An
example of this linear crossover approach is shown in Fig. 7.19, where
we can see that child 1 is the average of the two parent points, while
child 2 is obtained by extrapolating in the direction of the “fitter” parent.
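A sketch of the linear crossover of Eq. 7.20, assuming the parents are NumPy arrays and that parent 2 is the fitter one, is shown below.

import numpy as np

def linear_crossover(x_p1, x_p2):
    # assumes f(x_p2) < f(x_p1), i.e., parent 2 is the fitter parent
    x_c1 = 0.5 * x_p1 + 0.5 * x_p2     # child 1: average of the two parents
    x_c2 = 2.0 * x_p2 - x_p1           # child 2: extrapolate beyond the fitter parent
    return x_c1, x_c2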
Another option is a simple crossover like the binary case where a
random integer is generated to split the vectors. For example with a
split after the first index:

x_{p1} = [x_1 , x_2 , x_3 , x_4 ]
x_{p2} = [x_5 , x_6 , x_7 , x_8 ]
              ⇓                                          (7.21)
x_{c1} = [x_1 , x_6 , x_7 , x_8 ]
x_{c2} = [x_5 , x_2 , x_3 , x_4 ]

Figure 7.19: Linear crossover produces two new points along the line defined by the two parent points.

This simple crossover does not generate as much diversity as the binary
case does and relies more heavily on effective mutation. Many other
strategies have been devised for real-encoded GAs.104
104. Deb, Multi-Objective Optimization Using Evolutionary Algorithms. 2001

Mutation
Like a binary-encoded GA, mutation should only occur with a small
probability (e.g., 𝑝 = 0.005 ∼ 0.1). However, rather than changing
each bit with probability 𝑝, we now change each design variable with
probability 𝑝.
Many mutation methods rely on random variations around an
existing member such as a uniform random operator:

x_{\text{new},i} = x_i + (r_i - 0.5)\,\Delta_i , \quad \text{for } i = 1, \ldots, n_x                (7.22)

where 𝑟 𝑖 is a random number between 0 and 1, and Δ𝑖 is a pre-selected
maximum perturbation in the 𝑖th direction. Many non-uniform methods
exist as well. For example, we can use a Gaussian distribution

x_{\text{new},i} = x_i + \mathcal{N}(0, \sigma_i) , \quad \text{for } i = 1, \ldots, n_x                (7.23)

where 𝜎𝑖 is a pre-selected standard deviation and random samples are


drawn from the normal distribution. During the mutation operations,
bound checking is necessary to ensure the mutations stay within the
upper and lower limits.
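The following sketch combines the two mutation operators (Eqs. 7.22 and 7.23) with the required bound checking; the argument names and the use of np.clip are illustrative assumptions.

import numpy as np

def mutate_real(x, x_lower, x_upper, p=0.01, sigma=None, delta=None,
                rng=np.random.default_rng()):
    # mutate each design variable with probability p; provide either sigma (Gaussian,
    # Eq. 7.23) or delta (uniform, Eq. 7.22)
    x_new = x.copy()
    for i in range(len(x)):
        if rng.random() < p:
            if sigma is not None:
                x_new[i] += rng.normal(0.0, sigma[i])          # Eq. 7.23
            else:
                x_new[i] += (rng.random() - 0.5) * delta[i]    # Eq. 7.22
    # bound checking keeps the mutated point within the limits
    return np.clip(x_new, x_lower, x_upper)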

Example 7.10: Genetic algorithm applied to the bean function.

Figure 7.20 shows the evolution of the population when minimizing the
bean function using a genetic algorithm. The initial population size was 40,
and the simulation was run for 14 generations, requiring 2000 total function
evaluations. Convergence was assumed if the best member in the population
improved by less than 10−4 for 3 consecutive generations.

7.5.3 Constraint Handling


Various approaches exist for handling constraints. Like the Nelder-
Mead method, we can use a penalty method (e.g., Augmented La-
grangian, linear penalty, etc.). However, there are additional options
for GAs. In the tournament selection, we can use other selection criteria
that do not depend on penalty parameters. One such approach for
choosing the better of two competitors is:

1. prefer a feasible solution
2. among two feasible solutions, choose the one with a better objective
3. among two infeasible solutions, choose the one with a smaller constraint violation

Figure 7.20: Population evolution at iterations 𝑘 using a genetic algorithm to minimize the bean function.

This concept is a lot like the filter methods discussed in Section 5.6.
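A sketch of this comparison rule is given below; it assumes the constraint violations have already been aggregated into a single scalar g (with g = 0 meaning feasible), which is one possible choice rather than a prescribed one.

def better(f1, g1, f2, g2):
    # return True if design 1 wins the tournament against design 2
    feas1, feas2 = g1 <= 0.0, g2 <= 0.0
    if feas1 and not feas2:
        return True                    # rule 1: prefer a feasible solution
    if feas2 and not feas1:
        return False
    if feas1 and feas2:
        return f1 < f2                 # rule 2: better objective among feasible designs
    return g1 < g2                     # rule 3: smaller violation among infeasible designs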

7.5.4 Convergence
Rigorous mathematical convergence criteria, like those used in gradient-
based optimization, do not apply to genetic algorithms. The most
common way to terminate a genetic algorithm is to simply specify a
maximum number of iterations, which corresponds to a computational
budget. Another similar approach is to run indefinitely until the user
manually terminates the algorithm, usually by monitoring the trends
in population fitness.
A more automated approach is to track a running average of the
population fitness, although it can be difficult to decide what tolerance
to apply to this criterion as we generally are not interested in the average
performance anyway. Perhaps a more direct metric of interest is to
track the fitness of the best member in the population. However, this
can be a problematic criterion to use because the best member can


disappear due to crossover or mutation. To avoid this, and to improve
convergence, many genetic algorithms employ elitism. This means that
the fittest member in the population is retained so that the population
is guaranteed to never regress. Even without this behavior, the best
member often changes slowly, so one should not terminate unless the
best member has not improved for several generations.

7.6 Particle Swarm Optimization

Like a GA, particle swarm optimization (PSO) is a stochastic population-


based optimization algorithm based on the concept of “swarm intelli-
gence”. Swarm intelligence is the property of a system whereby the
collective behaviors of unsophisticated agents interacting locally with
their environment cause coherent global patterns to emerge. In other
words: dumb agents, properly connected into a swarm, can yield smart
results.¶¶

¶¶ PSO was first proposed by Eberhart and Kennedy.105 Eberhart was an electrical engineer and Kennedy was a social psychologist.
105. Eberhart et al., New Optimizer Using Particle Swarm Theory. 1995

The “swarm” in PSO is a set of design points (“agents” or “particles”)
that move in 𝑛-dimensional space looking for the best solution.
Although these are just design points, the history for each point is
relevant to the PSO algorithm, so we adopt the term “particle”.
Each particle moves according to a velocity, and this velocity changes
according to the past objective function values of that particle and
the current objective values of the rest of the particles. Each particle
remembers the location where it found its best result so far and it
exchanges information with the swarm about the location where the
swarm has found the best result so far.
The position of particle 𝑖 for iteration 𝑘 + 1 is updated according to
x_{k+1}^{(i)} = x_k^{(i)} + v_{k+1}^{(i)} \Delta t ,                (7.24)

where Δ𝑡 is a constant artificial time step. The velocity for each particle
is updated as follows:
v_{k+1}^{(i)} = \bar{w}\, v_k^{(i)} + c_1 r_1 \frac{x_{\text{best}}^{(i)} - x_k^{(i)}}{\Delta t} + c_2 r_2 \frac{x_{\text{best}} - x_k^{(i)}}{\Delta t} .                (7.25)

The first component in this update is the “inertia”, which, through the
parameter 𝑤¯ , dictates how much the new velocity should tend to be
the same as the one in the previous iteration.
The second term represents “memory” and is a vector pointing
toward the best position particle 𝑖 has seen in all its iterations so far,
x_{\text{best}}^{(i)} . The weight in this term consists of a constant 𝑐 1 , and a random
parameter 𝑟1 in the interval [0, 1] that introduces a stochastic component


to the algorithm. Thus, 𝑐 1 controls how much of an influence the best
point found by the particle so far has on the next direction.
The third term represents “social” influence. It behaves similarly
to the memory component, except that 𝑥best is the best point the entire
swarm has found so far, and 𝑐 2 controls how much of an influence this
best point has in the next direction. The relative values of 𝑐 1 and 𝑐 2 thus
control the tendency toward local versus global search, respectively.
Since the time step is artificial, we can eliminate it by multiplying
Eq. 7.25 by Δ𝑡 to yield a step
   
\Delta x_{k+1}^{(i)} = w \Delta x_k^{(i)} + c_1 r_1 \left( x_{\text{best}}^{(i)} - x_k^{(i)} \right) + c_2 r_2 \left( x_{\text{best}} - x_k^{(i)} \right) .                (7.26)

We then use this step to update the particle position for the next iteration,
i.e.,
x_{k+1}^{(i)} = x_k^{(i)} + \Delta x_{k+1}^{(i)} .                (7.27)
The three components of the update (7.26) are shown in Fig. 7.21 for a
two-dimensional case.

Figure 7.21: Components of the PSO update.

Typical values for the inertia parameter 𝑤 are in the interval [0.8, 1.2].
A lower value of 𝑤 reduces the particle’s inertia and tends toward faster
convergence to a minimum, while a higher value of 𝑤 increases the
particle’s inertia and tends toward increased exploration to potentially
help discover multiple minima. Thus, there is a tradeoff in this value.
Both 𝑐 1 and 𝑐2 values are in the interval [0, 2], and typically closer to 2.
The first step in the PSO algorithm is to initialize the set of particles
(Alg. 7.12). Like a GA, the initial set of points can be determined at
random or can use a more sophisticated design of experiments strategy
(like Latin hypercube sampling). The main loop in the algorithm
computes the steps to be added to each particle and updates their
positions. A number of convergence criteria are possible, some of
which are similar to the simplex method and GA: the distance (sum

or norm) between each particle and the best particle falls below some
tolerance, the best particle’s fitness changes by less than some tolerance
across multiple generations, the difference between the best and worst
member falls below some tolerance. In the case of PSO, another
alternative is to check whether the velocities for all particles (norm,
mean, etc.) fall below some tolerance. Some of these criteria that
assume all the particles will congregate (distance, velocities) don’t work
well for multimodal problems. In those cases, tracking just the best
particle’s fitness may be more desirable.

Example 7.11: PSO algorithm applied to the bean function.

Figure 7.22 shows the sequence of particle positions that results when minimizing
the bean function using a particle swarm method. The initial population size
was 40 and the optimization required 600 function evaluations. Convergence
was assumed if the best value found by the population did not improve by
more than 10−4 for 3 consecutive iterations.

Figure 7.22: Sequence of particles at iterations 𝑘 that minimize the bean function.

Algorithm 7.12: Particle swarm optimization algorithm

Inputs:
\bar{x}: Variable upper bounds
\underline{x}: Variable lower bounds
𝑤: “Inertia” parameter
𝑐 1 : Self influence parameter
𝑐 2 : Social influence parameter
Outputs:
𝑥 ∗ : Optimal point

𝑘 = 0
for all 𝑖 do                                              Loop to initialize all particles
    Generate position x_0^(i) within specified bounds
    x_best^(i) = x_0^(i)                                  First position is the best so far
    Evaluate f(x_0^(i))
    if 𝑖 = 0 then
        x_best = x_0^(i)
    else
        if f(x_0^(i)) < f(x_best) then
            x_best = x_0^(i)
        end if
    end if
    Initialize “velocity” Δx_k^(i)
end for
while not converged do                                    Main iteration loop
    Δx_{k+1}^(i) = w Δx_k^(i) + c_1 r_1 (x_best^(i) − x_k^(i)) + c_2 r_2 (x_best − x_k^(i))
    x_{k+1}^(i) = x_k^(i) + Δx_{k+1}^(i)                  Update the particle position while enforcing bounds
    if x_{k+1}^(i) < x_lower or x_{k+1}^(i) > x_upper then
        Δx_{k+1}^(i) = c_1 r_1 (x_best^(i) − x_k^(i)) + c_2 r_2 (x_best − x_k^(i))
        x_{k+1}^(i) = x_k^(i) + Δx_{k+1}^(i)
    end if
    for all x_{k+1}^(i) do
        Evaluate f(x_{k+1}^(i))
        if f(x_{k+1}^(i)) < f(x_best^(i)) then
            x_best^(i) = x_{k+1}^(i)
        end if
    end for
    𝑘 = 𝑘 + 1
end while
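A compact Python sketch in the spirit of Alg. 7.12 is shown below. The function name and the parameter defaults are illustrative (the defaults are typical values consistent with the ranges discussed above), and, as a simplification, out-of-bounds positions are clipped to the bounds rather than recomputed without the inertia term.

import numpy as np

def particle_swarm(f, x_lower, x_upper, n_particles=40, w=0.9, c1=1.8, c2=1.8,
                   max_iter=200, tol=1e-4, n_stall=3, rng=np.random.default_rng()):
    # f maps a 1-D array to a scalar; x_lower and x_upper are 1-D bound arrays
    n_x = len(x_lower)
    x = x_lower + rng.random((n_particles, n_x)) * (x_upper - x_lower)
    dx = np.zeros((n_particles, n_x))              # initial "velocity" (step)
    f_val = np.array([f(xi) for xi in x])
    x_best_i, f_best_i = x.copy(), f_val.copy()    # best point seen by each particle
    i_best = np.argmin(f_val)                      # best point seen by the swarm
    x_best, f_best = x[i_best].copy(), f_val[i_best]
    stall = 0
    for _ in range(max_iter):
        r1 = rng.random((n_particles, n_x))
        r2 = rng.random((n_particles, n_x))
        dx = w * dx + c1 * r1 * (x_best_i - x) + c2 * r2 * (x_best - x)   # Eq. 7.26
        x = np.clip(x + dx, x_lower, x_upper)                             # Eq. 7.27 with bound clipping
        f_val = np.array([f(xi) for xi in x])
        improved = f_val < f_best_i                # update each particle's best point
        x_best_i[improved] = x[improved]
        f_best_i[improved] = f_val[improved]
        f_prev = f_best
        i_best = np.argmin(f_best_i)               # update the swarm's best point
        if f_best_i[i_best] < f_best:
            x_best, f_best = x_best_i[i_best].copy(), f_best_i[i_best]
        stall = stall + 1 if abs(f_prev - f_best) < tol else 0
        if stall >= n_stall:                       # best value stagnated for n_stall iterations
            break
    return x_best, f_best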

Example 7.13: Comparison of algorithms for multimodal function



We now return to the Jones function (Eq. 7.13 used in Ex. 7.5 to demonstrate
the DIRECT method), but make it discontinuous by adding the following
function:
Δ 𝑓 = 4 ⌈sin(𝜋𝑥1 ) sin(𝜋𝑥2 )⌉ .                (7.28)
By taking the ceiling of the product of the two sine waves, this function creates
a checkerboard pattern with zeros and fours. Adding this function to the Jones
function produces the discontinuous function shown in Fig. 7.23, where we
can clearly see the discontinuities. The global optimum remains the same as
the original function. The resulting optimization paths demonstrate that the
gradient-free algorithms are effective. Both the GA and PSO find the global
minimum, but they require a large number of evaluations for the same accuracy.
Nelder–Mead converges quickly, but not to the global minimum.

Figure 7.23: Convergence path for gradient-free algorithms compared to gradient-based with multistart. (a) Genetic algorithm, 2420 evaluations; (b) Particle swarm optimization, 760 evaluations; (c) Nelder-Mead algorithm, 55 evaluations; (d) DIRECT algorithm, 99 evaluations; (e) Quasi-Newton method, 384 evaluations.

Tip 7.14: Compare optimization algorithms fairly.

It is difficult to make a fair comparison between different algorithms,


especially when they use different convergence criteria. You can either compare
the computational cost of achieving an objective with a specified accuracy, or
compare the objective achieved for a specified computational cost. To compare
algorithms that use different convergence criteria, you can run them for as long
as you can afford with the lowest convergence tolerance possible and tabulate the
number of function evaluations and the respective objective function values.
To compare the computational cost for a specified tolerance, you can determine
the number of function evaluations that each algorithm requires to achieve
a given number of digits agreement in the objective function. Alternatively,
you can compare the objective achieved for the different algorithms for a
given number of function evaluations. Comparison becomes more challenging
for constrained problems because a better objective that is less feasible is not
necessarily better. In that case, you need to make sure that all results are feasible
to the same tolerance. When comparing algorithms that include stochastic
procedures (e.g., GA, PSO), you should run each optimization multiple times
to get statistically significant data and compare the mean and variance of the
performance metrics.

7.7 Summary

Gradient-free optimization algorithms are needed when the objective


and constraint functions are not smooth enough or when it is not
possible to compute derivatives with enough precision. One major
advantage of gradient-free methods is that they tend to be robust to
numerical noise and discontinuities, which makes them easier to use
than gradient-based methods.
However, the overall cost of gradient-free optimization is sensitive
to the cost of the function evaluations because the number of iterations
required for convergence scales poorly with the number of design
variables.
There is a wide variety of gradient-free methods. They can perform
local or global search, use mathematical or heuristic criteria, and
be deterministic or stochastic. A global search does not guarantee
convergence to the global optimum, but increases the likelihood of
such convergence. We should be wary when heuristics are used to
establish convergence because the result might not correspond to the
true mathematical optimum. Heuristics in the optimization algorithm
also limit the rate of convergence compared to algorithms based on
mathematical principles.
Evolutionary algorithms are global search methods based on the
evolution of a population of designs. They are based on heuristics
inspired by natural or societal phenomena and have some stochastic
element in their algorithm. The genetic algorithm (GA) and particle
swarm optimization (PSO) covered in this chapter are only two of the
many evolutionary algorithms that have been invented.

The methods presented in this chapter do not directly address


the solution of constrained problems. The assumption is that we can
use penalty or filtering methods to enforce constraints. The DIRECT
method is one of the few methods that handles constraints without
resorting to penalties or filtering.

Problems

7.1 Answer true or false and justify your answer.

a) Gradient-free optimization algorithms are not as efficient as


gradient-based algorithms but they converge to the global
optimum.
b) None of the gradient-free algorithms check the KKT condi-
tions for optimality.
c) The Nelder–Mead algorithm is a deterministic local search
algorithm using heuristic criteria and direct function evalua-
tions.
d) The simplex is a geometric figure defined by a set of 𝑛 points,
where 𝑛 is the dimensionality of the design variable space.
e) The DIRECT algorithm is a deterministic global search al-
gorithm using mathematical criteria and direct function
evaluations.
f) The DIRECT method favors small rectangles with better
function values over large rectangles with worse function
values.
g) Evolutionary algorithms are stochastic global search algo-
rithms based on heuristics and direct function evaluations.
h) Genetic algorithms start with a population of designs that
gradually decreases to a single individual design at the
optimum.
i) Each design in the initial population of a genetic algorithm
should be carefully selected to ensure a successful optimiza-
tion.
j) Stochastic procedures in the genetic algorithms are necessary
to maintain population diversity and therefore avoid getting
stuck in local minima.
k) Particle swarm optimization follows a model developed by
biologists in the research of how bee swarms search for
pollen and nectar.

l) All evolutionary algorithms are based on either evolutionary


genetics or animal behavior.

7.2 Program the Nelder–Mead algorithm and perform the following


studies:

a) Reproduce the bean function results shown in Ex. 7.3.


b) Add random noise to the function with a magnitude of
10−4 using a Gaussian distribution and see if that makes a
difference in the convergence of the Nelder–Mead algorithm.
Compare the results to those of a gradient-based algorithm.
c) Consider the function,

𝑓 (𝑥1 , 𝑥2 , 𝑥3 ) = |𝑥1 | + 2|𝑥2 | + 𝑥3^2 .                (7.29)

Minimize this function with the Nelder–Mead algorithm


and a gradient-based algorithm. Discuss your results.
d) Exploration: Study the logic of the Nelder–Mead algorithm
and devise possible improvements. For example, is it a good
idea to be greedier and do multiple expansions?

7.3 Program the DIRECT algorithm and perform the following stud-
ies:

a) Reproduce the Jones function results shown in Ex. 7.5.


b) Use a gradient-based algorithm with a multistart strategy to
minimize the same function. On average, how many different
starting points do you need to find the global minimum?
c) Minimize the Hartmann function (defined in Appendix C.1.5)
using both methods. Compare and discuss your results.
d) Exploration: Develop a hybrid approach that starts with
DIRECT and then switches to the gradient-based algorithm.
Are you able to reduce the computational cost of DIRECT
significantly while converging to the global minimum?

7.4 Program a GA algorithm and perform the following studies:

a) Reproduce the bean function results shown in Ex. 7.10.


b) Use your GA to minimize the Hartmann function. Estimate
the rate of convergence and compare the performance of the
GA with a gradient-based algorithm.

c) Study the effect of adding checkerboard steps (Eq. 7.28) with


a suitable magnitude to this function. How does this affect
the performance of the GA and the gradient-based algorithm
compared to the smooth case? Study the effect of reducing
the magnitude of the steps.
d) Exploration: Experiment with different population sizes,
types of crossover, and mutation probability. Can you
improve on your original algorithm? Is that improvement
still observed for other problems?

7.5 Program the PSO algorithm and perform the following studies:

a) Reproduce the bean function results shown in Ex. 7.11.


b) Use your PSO to minimize the 𝑛-dimensional Rosenbrock
function (defined in Appendix C.1.2) with 𝑛 = 4. Estimate
the convergence rate and discuss the performance of PSO
compared to a gradient-based algorithm.
c) Study the effect of adding noise to the objective function for
both algorithms (see Prob. 7.2). Experiment with different
levels of noise.
d) Exploration: Experiment with different population sizes, and
the values of the coefficients in Eq. 7.26. Are you able
to improve the performance of your implementation for
multiple problems?

7.6 Study the effect of increased problem dimensionality using the


𝑛-dimensional Rosenbrock function defined in Appendix C.1.2.
Solve the problem using three approaches:

a) Gradient-free algorithm
b) Gradient-based algorithm with gradients computed using
finite differences
c) Gradient-based algorithm with exact gradients

You can either use an off-the-shelf optimizer or your own im-


plementation. In each case, repeat the minimization for 𝑛 =
2, 4, 8, 16, . . . up to at least 128 and see how far you can get with
each approach. Plot the number of function calls required as a
function of the problem dimension (𝑛) for all three methods on
one figure. Discuss any differences in optimal solutions found by
the various algorithms and dimensions. Compare and discuss
your results.
8 Discrete Optimization
Most algorithms in this book assume that the design variables are
continuous. However, sometimes design variables must be discrete.
Common examples of discrete optimization include scheduling, net-
work problems, and resource allocation. This chapter introduces some
techniques for dealing with discrete optimization problems.

By the end of this chapter you should be able to:

1. Identify situations where discrete variables can be avoided.

2. Convert problems with integer variables to ones with


binary variables.

3. Understand the basics of various discrete optimization


algorithms (branch and bound, greedy, dynamic program-
ming, simulated annealing, binary genetic algorithms).

4. Identify which algorithms are likely to be most suitable


for a given problem.

8.1 Binary, Integer, and Discrete Variables

Discrete optimization can be classified with three different labels: binary


(sometimes called zero-one), integer, and discrete. A light switch, for
example, can only be on or off and would be represented with a binary
decision variable that is either 0 or 1. The number of wheels on a vehicle
is an integer design variable, as it is not useful to build a vehicle with half
a wheel. The material in a structure that is restricted to one of titanium,
steel, or aluminum is an example of a discrete variable. All of these
cases can be represented as integers (the discrete categories are simply
mapped to integers), and an optimization problem with integer design
variables is referred to as integer programming, discrete optimization, or
combinatorial optimization.∗ Problems that have both continuous and
discrete variables are referred to as mixed integer programming or mixed
integer optimization.
∗ These phrases are often used interchangeably, but differences in meaning are intended based on the way the problem is posed.
Unfortunately, discrete optimization is NP-complete, which means
that we can easily verify a solution, but there is no known approach to
efficiently find a solution. Furthermore, the time required to solve the
problem becomes much worse as the problem size grows.

Example 8.1: The drawback of an exhaustive search.

The scaling difficulty is illustrated by a well-known discrete optimization


problem: the traveling salesman problem. Consider a set of cities represented
graphically on the left of Fig. 8.1. The problem is to find the shortest possible
route that visits each city exactly once and returns to the starting city. The right
figure of Fig. 8.1 envisions one such solution (not necessarily the optimum).
If there were only a handful of cities you could imagine doing an exhaustive
search: enumerate all possible paths, evaluate them, and return the one with
the shortest distance. Unfortunately, this is not a scalable algorithm. The
number of possible paths is (𝑛 − 1)! where 𝑛 is the number of cities. If, for
example, we used all fifty U.S. state capitals as the set of cities, then there would
be 49! = 6.08 × 1062 possible paths! That is an amazingly large number that
could not be evaluated using an exhaustive search.

Figure 8.1: An example instance of the traveling salesman problem.

The apparent advantage of a discrete optimization problem is that


we can construct algorithms that will find the global optimum, such
as an exhaustive search. Exhaustive search ideas can also be used for
continuous problems (see Section 7.4, for example), but the cost is much
higher. The downside is that while an algorithm may eventually arrive
at the right answer, as Ex. 8.1 highlights, in practice executing that
algorithm to completion is often not practical. The goal of discrete
optimization algorithms is to allow us to search the large combinatorial
space more efficiently, often by using heuristics and approximate
solutions.

8.2 Techniques to Avoid Discrete Variables

Even though a discrete optimization problem limits the options and thus
conceptually sounds easier to solve, in practice discrete optimization
problems are usually much more difficult and inefficient compared
to continuous problems. Thus, if it is reasonable to do so, it is often
desirable to find ways to avoid using discrete design variables. There
are a couple ways this can be accomplished.
The first approach is an exhaustive search. We just discussed how
exhaustive search scales poorly, but sometimes we have many contin-
uous variables but only a few discrete variables with few options. In
this case enumerating all options is possible. For each combination of
discrete variables, the optimization is repeated using all continuous
variables. We then choose the best feasible solution amongst all the
optimizations. Assuming the continuous part of the problem can be
solved, this approach will lead to the true optimum.

Example 8.2: Exhaustively evaluating discrete variables when the number


of combinations is small.
Consider optimizing a propeller. While most of the design variables will be
continuous, the number of blades on a propeller is not. Fortunately, the number
of blades falls within a reasonably small set (2 to 6). Assuming there are no other
discrete variables, we could just perform five optimizations corresponding to
each option and choose the best solution.
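A sketch of this enumeration strategy is shown below; the objective function objective(x, b) is hypothetical, and scipy.optimize.minimize stands in for whatever continuous optimizer is appropriate for the problem.

from scipy.optimize import minimize

def enumerate_discrete(objective, discrete_options, x0, bounds=None):
    # objective(x, b): hypothetical function of continuous variables x and discrete choice b
    best = None
    for b in discrete_options:                            # e.g., range(2, 7) for 2 to 6 blades
        res = minimize(lambda x, b=b: objective(x, b), x0, bounds=bounds)
        if res.success and (best is None or res.fun < best[2]):
            best = (b, res.x, res.fun)
    return best                                           # (discrete choice, continuous optimum, objective)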

A second technique is rounding. For some problems, we can opti-


mize with a continuous representation, then round to integer values
afterward. This is usually justifiable if the magnitude of the design
variables is large or if there are many continuous variables and few
discrete variables. After rounding, it is usually best to repeat the opti-
mization once more, allowing only the continuous design variables to
vary. This process may not lead to the exact optimum, and sometimes
may not even lead to a feasible solution, but for many problems this is
an effective approach.
One variation of this method is called dynamic rounding. The idea
is that rather than round all continuous variables at once, perform an
iterative process where you round only one, or a subset, of variables, fix
them and then reoptimize using a continuous formulation. The process
is repeated until all discrete variables are fixed, followed by one last
optimization with the continuous variables.
Sometimes, exhaustive search is not feasible, or rounding is unac-
ceptable as is typically the case for binary variables, or an intermediate

continuous representation is not possible. For these cases, we can


utilize discrete optimization methods.

8.3 Branch and Bound

A popular method to solve integer optimization problems is the branch


and bound method. It is popular not because it is the most efficient
method (much better methods exist that leverage specific problem
structure, some of which are discussed in this chapter), but rather because
it is robust and so can generally apply to a wide variety of discrete
problems. This approach is particularly effective with convex integer
programming problems, as the method is guaranteed to find the global
optimum. The most common type of convex integer problem is linear
integer problems (all the objectives and constraints are linear in the
design variables). The methodology can be extended to nonconvex
integer optimization problems, but is generally far less effective and
has no such guarantees. In this section we will assume linear mixed
integer problems, with a short discussion on nonconvex problems at
the end. Mathematically a linear mixed integer optimization problem
can be expressed as:

minimize    c^T x
subject to  \hat{A} x \le \hat{b}
            A x + b = 0
            x_i \in \mathbb{Z}^+ for some or all i†                (8.1)

† Zahlen, Z, is a standard symbol for the set of all integers, whereas Z+ represents the set of all positive integers (including zero).
8.3.1 Binary Variables
Before exploring the integer case, we first explore the binary case where
the discrete entries in 𝑥 𝑖 must be 0 or 1. Most integer problems can
be converted to binary problems by adding additional variables and
constraints. Even though the new problem is larger it is usually far
easier to solve.

Example 8.3: Converting an integer problem to a binary one.

Consider a problem where an engineering device may use one of 𝑛 different


materials: 𝑦 ∈ (1 . . . 𝑛). Rather than have one design variable 𝑦, we could
convert the problem to have 𝑛 binary variables 𝑥 𝑖 where each 𝑥 𝑖 is 0 if material
𝑖 is not selected and 1 if material 𝑖 is selected. We would also need to add an
additional linear constraint to make sure that one (and only one) material is
selected:

\sum_{i=1}^{n} x_i = 1                (8.2)

The key to a successful branch and bound method is a good relax-


ation. Relaxation means approximating an optimization problem, often
by removing constraints. For a given problem many types of relaxation
are possible, but for linear mixed integer programming problems, the
most natural relaxation is to remove the integer constraints. In other
words, we solve the corresponding continuous linear programming
problem, also known as an LP (discussed in Section 11.2). If the solution
to the original LP happened to return all binary values then we would
have the solution and would terminate the search. If the LP returned
fractional values then we need to branch.
Branching is done by adding additional constraints and solving
additional optimization problems. For example, we could branch by
adding constraints on 𝑥 1 , creating two new optimization problems: the
LP from above but with 𝑥1 = 0 and the LP from above but with 𝑥1 = 1.
This procedure is then repeated with additional branching as needed.

Figure 8.2: Enumerating the options for a binary problem with branching.

Figure 8.2 illustrates the branching concept for binary variables. If


we explored all of those branches then we would be conducting an
exhaustive search. The main benefit of branch and bound algorithm
is that we can find ways to eliminate branches (referred to as pruning)
to narrow down the search scope. There are two ways to prune. If
any of the relaxed problems is infeasible then we know that everything
from that node downward (i.e., that branch) is also infeasible. Adding
more constraints cannot make an infeasible problem suddenly feasible
again. Thus, that branch is pruned and we back up the tree. The
other way we can eliminate branches is by determining that a better
solution cannot exist on that branch. The algorithm keeps track of
the best solution to the problem found so far. If one of the relaxed
problems returns an objective that is worse than the best we have found,
then we can prune that branch. We know this because adding more
constraints will always lead to a solution that is either the same or
worse, never better (assuming you always find the global optimum,
which we can guarantee for LP problems). The solution from a relaxed
problem provides a lower bound—the best that could be achieved if
continuing on that branch. The logic for these various possibilities
is summarized in Alg. 8.4. The initial starting point for 𝑓best can be
𝑓best = ∞ if nothing is known, but if a known feasible solution exists, or
can be found quickly by some heuristic, providing any finite best point
can often greatly speed up the optimization.

Algorithm 8.4: Branch and bound algorithm.

Inputs:
𝑓best : Best known solution, if any; otherwise 𝑓best = ∞
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

Let 𝒮 be the set of indices for binary constrained design variables


while branches remain do
Solve relaxed problem for 𝑥ˆ , 𝑓ˆ
if relaxed problem is infeasible then
prune this branch, back up tree
else
if 𝑥ˆ𝑖 ∈ {0, 1}∀𝑖 ∈ 𝒮 then A solution is found
𝑓best = min( 𝑓best , 𝑓ˆ), back up tree
else
if 𝑓ˆ > 𝑓best then
prune this branch, back up tree
else A better solution might exist.
branch further
end if
end if
end if
end while

Many variations exist for these algorithms. One design variation


is the choice of which variables to branch on at a given node. One
common strategy is to branch on the variable with the largest fractional
component. For example, if 𝑥ˆ = [1.0, 0.4, 0.9, 0.0] we could branch on
𝑥2 or 𝑥3 since both are fractional. We hypothesize that we are more
likely to force the algorithm to make faster progress by branching on
variables that are closer to midway between integers. In this case that

value would be 𝑥2 = 0.4. Mathematically, we would choose to branch


on the value closest to 0.5:

\min_i \; | x_i - 0.5 |                (8.3)

Another design variation is on how to search the tree. Two com-


mon strategies are depth-first or breadth-first. A depth-first strategy
continues as far down as possible (for example, by always branching
left) until it cannot go further and then right branches are followed. A
breadth-first strategy would explore all nodes on a given level before
increasing depth. Various other strategies exist, and in general, we do
not know beforehand what is best. Depth-first is a common strategy as,
in the absence of other information, it is likely the fastest way to find a
solution (reaching the bottom of the tree generally forces a solution).
Finding a solution quickly is desirable because that solution can then be
used as a bound on other branches. Additionally, a depth-first strategy
requires less memory storage because breadth-first must maintain a
longer history as the number of levels increases, whereas depth-first
only requires node storage equal to the number of levels.
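A depth-first Python sketch of Alg. 8.4 for binary problems is given below; it uses scipy.optimize.linprog for the LP relaxations and the branching rule of Eq. 8.3. The function name and tolerances are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

def branch_and_bound_binary(c, A_ub, b_ub):
    # minimize c^T x subject to A_ub x <= b_ub, with x_i in {0, 1}
    n = len(c)
    best = {"f": np.inf, "x": None}

    def solve(bounds):
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)   # LP relaxation
        if not res.success:
            return                                   # infeasible: prune this branch
        if res.fun >= best["f"]:
            return                                   # lower bound worse than best: prune
        frac = np.abs(res.x - np.round(res.x))
        if np.all(frac < 1e-6):                      # all values are binary: a solution
            best["f"], best["x"] = res.fun, np.round(res.x)
            return
        i = np.argmin(np.abs(res.x - 0.5))           # most fractional variable (Eq. 8.3)
        for val in (0, 1):                           # branch on x_i = 0 and x_i = 1
            new_bounds = list(bounds)
            new_bounds[i] = (val, val)
            solve(new_bounds)

    solve([(0, 1)] * n)                              # root node: relaxed bounds 0 <= x_i <= 1
    return best["x"], best["f"]

Applied to the data of Ex. 8.5 below, this sketch recovers the same solution, 𝑥 ∗ = [1, 0, 1, 1] with 𝑓 ∗ = −4.9.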

Example 8.5: A binary branch and bound optimization.

Consider the following problem:

minimize − 2.5𝑥 1 − 1.1𝑥 2 − 0.9𝑥3 − 1.5𝑥4


subject to 4.3𝑥1 + 3.8𝑥2 + 1.6𝑥3 + 2.1𝑥4 ≤ 9.2
(8.4)
4𝑥1 + 2𝑥2 + 1.9𝑥3 + 3𝑥4 ≤ 9
𝑥 𝑖 ∈ {0, 1} for all 𝑖

We begin at the first node by solving the linear relaxation. The binary
constraint is removed, and instead replaced with continuous bounds: 0 ≤ 𝑥 𝑖 ≤ 1.
The solution to this LP is:

𝑥 ∗ = [1, 0.5274, 0.4975, 1]𝑇
𝑓 ∗ = −5.0279                (8.5)

There are non-binary values in the solution, so we need to branch. As


discussed, a typical choice is to branch on the variable with the most fractional
component. In this case, that is 𝑥3 so we now create two additional problems
which add the constraints 𝑥3 = 0 and 𝑥3 = 1 respectively (Fig. 8.3).

Figure 8.3: Initial binary branch.

While depth-first was recommended above, for this example we will use
breadth-first only because it is shorter in this case, giving a more concise example.
The depth-first tree is also shown at the end of the example. We solve both of the
problems at this next level as shown in Fig. 8.4. Neither of these optimizations
yields all binary values so we have to branch both of them. In this case the left
node branches on 𝑥 2 (the only fractional component) and the right node also
branches on 𝑥2 (the most fractional component).

Figure 8.4: Solutions along these two branches (𝑥3 = 0: 𝑥 ∗ = [1, 0.74, 0, 1], 𝑓 ∗ = −4.81; 𝑥3 = 1: 𝑥 ∗ = [1, 0.47, 1, 0.72], 𝑓 ∗ = −5.00).

The first branch (see Fig. 8.5) yields a feasible binary solution! The corre-
sponding function value 𝑓 = −4 is saved as the best value we have seen so far.
There is no need to continue on this branch as the solution cannot be improved
on this particular branch.

Figure 8.5: The first feasible solution (𝑥 ∗ = [1, 0, 0, 1], 𝑓 ∗ = −4).

We continue solving along the rest of this row (Fig. 8.6). The third node
on this row yields another binary solution. In this case the function value is
𝑓 = −4.9, which is better, and so we save this as the best value we have seen so
far. The second and fourth nodes do not yield a solution. Normally we’d have
to branch these further but both of them have a lower bound which is worse
than the best solution we have found so far. Thus, we can prune both of these
branches.

Figure 8.6: The rest of the solutions on this row (𝑥 ∗ = [1, 0, 0, 1], 𝑓 ∗ = −4; 𝑥 ∗ = [0.77, 1, 0, 1], 𝑓 ∗ = −4.52; 𝑥 ∗ = [0.40, 1, 1, 1], 𝑓 ∗ = −4.49; 𝑥 ∗ = [1, 0, 1, 1], 𝑓 ∗ = −4.9).

All branches have been pruned and so we have solved the original problem:

𝑥 ∗ = [1, 0, 1, 1]𝑇
(8.6)
𝑓 ∗ = −4.9
Alternatively, we could have used a depth-first strategy. In this case, it is
less efficient, but in general that is not known beforehand. The depth-first tree
for this same example is depicted in Fig. 8.7. Feasible solutions to the problem
are shown with 𝑓 ∗ .

Figure 8.7: The search path with a depth-first strategy instead.

8.3.2 Integer Variables


If the problem cannot be put in binary form, we can use essentially the
same procedure with integer variables. Instead of branching with two
constraints: 𝑥 𝑖 = 0 or 𝑥 𝑖 = 1 we branch with two inequality constraints
that will encourage solutions to find an integer solution. For example,
if the variable we branched on was 𝑥 𝑖 = 3.4 we would branch with two
new problems with the constraints: 𝑥 𝑖 ≤ 3 or 𝑥 𝑖 ≥ 4. An example of this
is shown below.

Example 8.6: Branch and bound with integer variables.

Consider the following problem:

minimize    −𝑥1 − 2𝑥2 − 3𝑥3 − 1.5𝑥4
subject to  𝑥1 + 𝑥2 + 2𝑥3 + 2𝑥4 ≤ 10
            7𝑥1 + 8𝑥2 + 5𝑥3 + 𝑥4 = 31.5                (8.7)
            𝑥 𝑖 ∈ Z+ for 𝑖 = 1, 2, 3
            𝑥4 ≥ 0

We begin by solving the LP relaxation (i.e., we remove the integer con-


straints), but with a lower bound of 0. The solution to that problem is:

𝑥 ∗ = [0, 1.1818, 4.4091, 0], 𝑓 ∗ = −15.59 (8.8)

We begin by branching on the most fractional value, which is 𝑥3 . We create


two new branches:
• The original LP but with the added constraint that 𝑥3 ≤ 4
• The original LP but with the added constraint that 𝑥3 ≥ 5
Even though depth-first is usually more efficient, we will use breadth first as it
is easier to display on a figure. The solution to that first problem is:

𝑥 ∗ = [0, 1.4, 4, 0.3], 𝑓 ∗ = −15.25 (8.9)

The second problem is infeasible so we can prune that branch.


Recall that the last variable is allowed to be continuous, so we now branch
on 𝑥2 by creating two new problems with additional constraints: 𝑥2 ≤ 1, and
𝑥2 ≥ 2.
The problem continues with the same procedure, as shown in the breadth-
first tree in Fig. 8.8. The figure gives some indication why solving integer
problems is more time consuming than solving binary ones. Unlike the binary
case, the same value is revisited with tighter constraints. For example, early on
the constraint 𝑥3 ≤ 4 is enforced. Later, two additional problems are created
with tighter bounds on the same variable: 𝑥3 ≤ 2 or 𝑥3 ≥ 4. In general, the same
variable could be revisited many times as the constraints are slowly tightened,
whereas in the binary case each variable is only visited once since the values
can only be 0 or 1.

Figure 8.8: A breadth-first search of the mixed integer programming example.

Once all the branches are pruned we see that the solution is:

𝑥 ∗ = [0, 2, 3, 0.5]𝑇
(8.10)
𝑓 ∗ = −13.75.

Nonconvex mixed integer problems can also be solved with the


branch and bound technique, and generally will use this latter strategy
of forming two branches of continuous constraints. In this case the
relaxed problem is not a convex problem and so we cannot provide
any guarantees that we have found a lower bound for that branch.
Furthermore, the cost of each suboptimization problem is increased.
Thus, for nonconvex discrete problems the methodology is usually only
practical for a relatively small number of discrete design variables.

8.4 Greedy Algorithms

Greedy algorithms are perhaps the simplest approach for discrete opti-
mization problems. This approach is more of a concept than a specific
algorithm. The implementation varies with the application. The idea is
to reduce the problem to a subset of smaller problems (often down to a
single choice), and then make a locally optimal decision. That decision
is locked in, and then the next small decision is made in the same
manner. A greedy algorithm does not revisit past decisions, and so
ignores much of the coupling that may occur between design variables.

Example 8.7: A weighted directed graph.

As an example consider the weighted directed graph shown in Fig. 8.9. The
objective is to traverse from node 1 to node 12 with the smallest possible
cost (cost denoted by the numbers above path segments). Note that a series
of discrete choices must be made at each step, and those decisions limit the
available options in the next step. This graph might represent a transportation
problem for shipping goods, information flow through a social network, or a
supply chain problem.
A greedy algorithm simply makes the best choice assuming each decision
is the only decision that will be made. Starting at node 1, we first choose to
move to node 3 because that is the smallest cost between the three options
(node 2 cost 2, node 3 cost 1, node 4 cost 5). We then choose to move to node 6
because that is the smallest cost between the next two available options (node
6 cost 4, node 7 cost 6) and so on. The path selected by the greedy algorithm
is highlighted in the figure and results in a total cost of 15. The algorithm is
easy to apply and scalable, but will not generally find the global optimum. The
global optimum in this case, also highlighted in the figure, results in a total
cost of 10. To find that global optimum we have to consider the impact of our
choices on future decisions. A method to do this will be discussed in the next
section.

Figure 8.9: The greedy algorithm in this weighted directed graph results in a cost of 15, compared to the global optimum with a cost of 10.

Even for a fixed problem there are, in general, many ways to


construct a greedy algorithm. The advantage of this approach is
that these algorithms are relatively easy to construct, and they allow
us to bound the computational expense of the problem. The main
disadvantages are that we usually will not find an optimal solution (in
fact sometimes it can produce the worst possible solution106), and that
it may not even produce a feasible solution. Despite the disadvantages
there are times when the solutions, although suboptimal, are reasonably
close to an optimal solution, and can be found quickly.
106. Gutin et al., Traveling salesman should not be greedy: domination analysis of greedy-type heuristics for the TSP. 2002

Example 8.8: Some greedy algorithms.

Here are some examples of greedy algorithms for different problems.

• Traveling salesman (Ex. 8.1): Always select the nearest city as the next
step.
• Propeller problem (Ex. 8.2 but with more discrete variables): optimize
the number of blades with all remaining discrete variables fixed, then
optimize the material selection with all remaining discrete variables
fixed, . . ..
• Grocery shopping (Ex. 11.1)‡ : There are many possibilities for formulating
  a greedy solution. For example: always pick the cheapest food item
  next, or always pick the most nutritious food item next, or always pick
  the food item with the most nutrition per unit cost.

‡ This is a form of the knapsack problem, which is a classic problem in discrete optimization.

8.5 Dynamic Programming

Dynamic programming is a useful technique for discrete optimization


problems with a special structure. This structure also allows for usage
with continuous problems and for algorithms beyond optimization.
The required structure is that the problem can be posed as a Markov
chain (for continuous problems this is called a Markov process). A
Markov chain or process satisfies the Markov property, which means
that a future state can be predicted from the current state without
needing to know a full history. The concept can be generalized to a
finite number of states (i.e., more than one but not the full history) and
is called a variable-order Markov chain. If this property holds then we
can break up the problem into a recursive one where a small problem
is solved, and larger problems are solved by using the solutions of
the smaller problems. For example, we could solve the grocery store
problem for the case where the store only carried one item, then the
case where the store carried two items reusing the previous solution,
and so on. On the surface this may sound like a greedy optimization,
but it is not. We are not using a heuristic, but fully solving the smaller
problems and because of the problem structure we can reuse those
solutions (we will see this in some examples shortly). This approach has
become particularly useful in optimal control as well as some areas of
economics and computational biology. More general design problems,
like the propeller example (Ex. 8.2), do not fit this type of structure
(i.e., choosing the number of blades cannot be broken up into a smaller
problem separate from choosing the material).
A classic example of a Markov chain, though not an optimization
problem, is the Fibonacci sequence. Its definition is:

𝑓0 = 0
𝑓1 = 1 (8.11)
𝑓𝑛 = 𝑓𝑛−1 + 𝑓𝑛−2

Notice that we do not need a full history, but can compute the next
number in the sequence just by knowing the last two states.§ We could
implement this with recursion as shown algorithmically in Alg. 8.9,
and graphically in Fig. 8.10 for 𝑓5 .
§ We can also convert this to a standard first-order Markov chain by defining 𝑔𝑛 = 𝑓𝑛−1 and considering our state to be ( 𝑓𝑛 , 𝑔𝑛 ). Then, each state only depends on the previous state.

Algorithm 8.9: Fibonacci with recursion.

procedure fib(𝑛)
if 𝑛 ≤ 1 then
return 𝑛
else
return fib(𝑛 − 1) + fib(𝑛 − 2)
end if
end procedure

Figure 8.10: Computing Fibonacci sequence using recursion. The function fib(2) is highlighted as an example to show the repetition that occurs in this recursive procedure.

While this recursive procedure works and is simple, it is inefficient.


For example, the calculation for fib(2) is highlighted showing that
the same calculation is repeated multiple times. There are two main
approaches to avoid this inefficiency. The first is a top-down procedure
called memoization. This just means that we store previously computed
values to avoid having to compute them again. For example, the first
time we need fib(2) we call the fib function and store the result (the
value 1). As we progress down the tree, if we need fib(2) again we do
not call the function but rather just retrieve the stored value.
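A memoized version of Alg. 8.9 in Python might look like the following, where the dictionary acts as the storage for previously computed values; the argument name _cache is an illustrative choice.

def fib(n, _cache={0: 0, 1: 1}):
    # the mutable default dictionary persists across calls and acts as the memo
    if n not in _cache:
        _cache[n] = fib(n - 1) + fib(n - 2)   # compute once, then store
    return _cache[n]                          # reuse the stored value on later requests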
More commonly we use a bottom-up approach called tabulation.
This procedure is how one would typically show what the Fibonacci
sequence looks like. We start from the bottom ( 𝑓0 ) and work our
way forward computing each new value using the previous states.
Rather than use recursion this involves a simple loop as shown in
Alg. 8.10. Whereas memoization fills entries on demand, tabulation
systematically works its way up filling in entries. In either case, we
reduce the computational complexity of this algorithm from exponential
complexity (approximately 𝒪(2𝑛 )) to linear complexity (𝒪(𝑛)).

Algorithm 8.10: Fibonacci with tabulation.

procedure fib2(𝑛)
𝑓0 = 0
𝑓1 = 1
for 𝑖 = 2 to 𝑛 do
𝑓𝑖 = 𝑓𝑖−1 + 𝑓𝑖−2
end for
return 𝑓𝑛
end procedure

The ideas are essentially the same in an optimization context, but


before jumping into those examples we will formalize the mathematics
of the approach. One main difference in optimization is that we do not
have a set formula like a Fibonacci sequence. Rather at each state we
will need to make a design decision, which will then change the next
state. For example, with the simple graph problem shown in Fig. 8.9
we will make decisions on which path to take.
Mathematically, we express a given state as 𝑠 𝑖 , make a design
decision 𝑥 𝑖 , that will transition us to the next state 𝑠 𝑖+1 (Fig. 8.11).

𝑠 𝑖+1 = 𝑡 𝑖 (𝑠 𝑖 , 𝑥 𝑖 )                (8.12)

Figure 8.11: Diagram of state transitions in a Markov chain.

where 𝑡 is a transition function. For some variants this transition


function is stochastic. At each transition we compute the cost function
𝑐. A common variation uses a discount factor on future costs. For
generality we simply specify a cost function that may change at each
iteration 𝑖:
𝑐 𝑖 (𝑠 𝑖 , 𝑥 𝑖 ) (8.13)
We want to make a set of decisions that minimize not only the
current cost but the sum of all future costs as well. This is
called the value function 𝑣.

𝑣(𝑠 𝑖 ) = \min_{𝑥 𝑖 ,...,𝑥 𝑛 } [𝑐 𝑖 (𝑠 𝑖 , 𝑥 𝑖 ) + 𝑐 𝑖+1 (𝑠 𝑖+1 , 𝑥 𝑖+1 ) + . . . + 𝑐 𝑛 (𝑠 𝑛 , 𝑥 𝑛 )]                (8.14)

where 𝑛 defines the time horizon we seek to optimize across. For


continuous problems the time horizon may be infinite. To repeat,
the value function is the minimum cost, not simply the cost for some
arbitrary set of decisions. Note that 𝑣 and 𝑐 are scalar functions, but
rather than use Greek symbols we use 𝑣 and 𝑐 here as the connection to
“value” and “cost” is clearer and more common.
Bellman’s principle of optimality notes that because of the structure of
the problem (where the next state only depends on the current state), we
can determine the best solution at this iteration 𝑥 ∗𝑖 if we already know all
the future optimal decisions 𝑥 ∗𝑖+1 . . . 𝑥 𝑛∗ . Thus, we can recursively solve
this problem from the back (bottom) determining 𝑥 ∗𝑛 , then 𝑥 ∗𝑛−1 , and so
on back to 𝑥 ∗𝑖 . Mathematically, this recursive procedure is captured by


Bellman’s equation:

𝑣(𝑠 𝑖 ) = \min_{𝑥 𝑖 } {𝑐(𝑠 𝑖 , 𝑥 𝑖 ) + 𝑣(𝑠 𝑖+1 )} .                (8.15)

We can also express this in terms of our transition function to show the
dependence on the current decision:

𝑣(𝑠 𝑖 ) = \min_{𝑥 𝑖 } {𝑐(𝑠 𝑖 , 𝑥 𝑖 ) + 𝑣(𝑡 𝑖 (𝑠 𝑖 , 𝑥 𝑖 ))} .                (8.16)

Example 8.11: Dynamic programming applied to graph problem.

Let us solve the graph problem posed in Ex. 8.7 using dynamic programming.
For convenience, we will repeat a smaller version of the figure in Fig. 8.12. We
will use the tabulation (bottom-up) approach. To do this we construct a table
where we keep track of the cost to move from each node to the end (node 12),
and which node we should move to next.

Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost
Next

We start from the end. The last node is simple. There is no cost to move
from node 12 to the end (we are already there) and there is no next node.

Figure 8.12: Small version of Fig. 8.9 for convenience.

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                                                   0
Next                                                   -

We now move back one level to consider nodes 9, 10, and 11. These nodes
all lead to node 12 and so are straightforward. We will be a little more careful
with the formulas as we get to the more complicated cases next.
Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                                   3   6    2    0
Next                                   12  12   12   -

We now move back one level to nodes 5, 6, 7, 8. For node 5 the cost is the
following (i.e., Bellman’s equation):

cost(5) = min(3 + cost(9), 2 + cost(10), 1 + cost(11)) (8.17)

Note that we have already computed the minimum value for cost(9), cost(10),
and cost(11) and so just look up these values in the table. In this case, the

minimum total value is 3 and is associated with moving to node 11. Similarly,
the cost for node 6 is:

cost(6) = min(5 + cost(9), 4 + cost(10)) (8.18)

The result is 8, and is realized by moving to node 9.

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                   3   8           3   6    2    0
Next                   11  9           12  12   12   -

We repeat this process, moving back and reusing optimal solutions to find
the global optimum. The completed table looks like the following:

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost   10  8   12  9   3   8   7   4   3   6    2    0
Next   2   5   6   8   11  9   11  11  12  12   12   -

From the table we see that the minimum cost is 10, and is achieved by
moving to node 2; under node 2 we see that we next go to node 5, then 11, and
finally 12. Thus, the tabulation gives us the global minimum for cost and the
design decisions to achieve that.

To illustrate the concepts more generally, let us consider the knapsack


problem, another classic problem in discrete optimization. In this
problem we have a fixed set of items we can select from. Each item has a
weight 𝑤 𝑖 and a cost 𝑐 𝑖 (since cost usually implies something to minimize
we would typically use the word value here, but will stick with cost
as it is consistent with our earlier notation). Our knapsack has a fixed
capacity 𝐾 (a scalar) that we cannot exceed. The objective is to choose
the items that will yield the highest total cost subject to the capacity
of our knapsack. The design variables 𝑥 𝑖 are either 1 or 0 indicating
whether we take or do not take item 𝑖 respectively. This problem has
many variations with practical applications such as shipping, data
transfer, and investment portfolio selection. Mathematically we pose
the problem as:
maximize    \sum_{i=1}^{n} c_i x_i
subject to  \sum_{i=1}^{n} w_i x_i \le K                (8.19)
            x_i \in \{0, 1\}

In its present form, the problem has a linear objective and linear
constraints, so branch and bound is a good fit. However, it can also
be formulated as a Markov chain, so we can therefore use dynamic
programming. The dynamic programming version allows us to ac-
commodate variations such as stochasticity and other constraints more
easily. To see that this can be posed as a Markov chain, we define the
state as the remaining capacity of the knapsack 𝑘 and the number of
items we have already considered. In other words, we are interested in
𝑣(𝑘, 𝑖) where 𝑣 is the value function (optimal value given the inputs),
𝑘 is the remaining capacity in the knapsack and 𝑖 indicates that we
have already considered items 1 through 𝑖 (this doesn’t mean we have
added them all to our knapsack, but that we have considered them).
We iterate through a series of decisions 𝑥 𝑖 deciding whether to take
item 𝑖 or not, which transitions us to a new state where 𝑖 increases and
𝑘 may decrease depending on whether or not we took the item.
The real problem we are interested in is 𝑣(𝐾, 𝑛), which we will solve
using tabulation. Starting at the bottom, we know that 𝑣(𝑘, 0) = 0 for
any 𝑘. In words, this just means that no matter what the capacity is,
if we haven’t considered any items yet then the value is 0. To work
forward, let’s consider a general case considering item 𝑖, with the
assumption that we have already solved up to item 𝑖 − 1 for any capacity.
If item 𝑖 cannot fit in our knapsack (𝑤 𝑖 > 𝑘), then we cannot take the
item. Alternatively, if the weight does not exceed the remaining capacity, we need to
make a choice: select item 𝑖 or do not. If we do not, then the value is
unchanged: 𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1). If we do select item 𝑖, then our value is
𝑐 𝑖 plus the best we could do with the previous items but with a capacity
that was smaller by 𝑤 𝑖 : 𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1). Whichever of
these decisions yields a better value is what we should choose. This
process is summarized in Alg. 8.12.

Algorithm 8.12: Knapsack with tabulation.

Inputs:
𝑐 𝑖 : Cost of item 𝑖
𝑤 𝑖 : Weight of item 𝑖
𝐾: Total available capacity
Outputs:
𝑣(0 : 𝐾, 0 : 𝑛): 𝑣(𝑘, 𝑖) is the optimal cost for capacity 𝑘 considering items 1 through 𝑖 , note that
indexing starts at 0

for 𝑘 = 0 to 𝐾 do
𝑣(𝑘, 0) = 0 No items considered, value is zero for any capacity
end for

for 𝑖 = 1 to 𝑛 do Iterate forward solving for one additional item at a time


for 𝑘 = 0 to 𝐾 do
if 𝑤 𝑖 > 𝑘 then
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1) Weight exceeds capacity, value unchanged
else
𝑣(𝑘, 𝑖) = max(𝑣(𝑘, 𝑖 − 1), 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1)) Choose to either reject or
take item using previous solutions
end if
end for
end for

Note that we will end up filling all entries in the matrix 𝑣[𝑘, 𝑖], in
order to extract the last value 𝑣[𝐾, 𝑛]. For small numbers, filling this
matrix (or table) is often illustrated manually, hence the name tabulation.
Like the Fibonacci example, using dynamic programming instead of a
fully recursive solution reduces the complexity from 𝒪(2ⁿ) to 𝒪(𝐾𝑛),
which means it is pseudolinear. It is only pseudolinear because there is
a dependence on the knapsack size. For small capacities the problem
scales well even with many items, but as the capacity grows, the problem
scales much less efficiently. Note that the knapsack problem requires
integer weights. Real numbers can be scaled up to integers
(e.g., 1.2, 2.4 becomes 12, 24). Arbitrary precision floats are not feasible
given the number of combinations to search across.
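As a concrete illustration, here is a minimal Python sketch of Alg. 8.12; the function name and the list-based storage of 𝑣 are our own choices. It uses the data of Ex. 8.13 below, for which the optimal cost is 12.

def knapsack_value(c, w, K):
    """Tabulate v[k][i], the best cost with capacity k using items 1..i (Alg. 8.12)."""
    n = len(c)
    v = [[0] * (n + 1) for _ in range(K + 1)]   # (K+1) capacities by (n+1) items considered
    for i in range(1, n + 1):                   # solve for one additional item at a time
        for k in range(K + 1):
            if w[i - 1] > k:                    # item i does not fit; value unchanged
                v[k][i] = v[k][i - 1]
            else:                               # reject or take item i, reusing previous solutions
                v[k][i] = max(v[k][i - 1], c[i - 1] + v[k - w[i - 1]][i - 1])
    return v

v = knapsack_value(c=[4, 3, 3, 7, 2], w=[4, 5, 2, 6, 1], K=10)
print(v[10][5])   # 12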

Example 8.13: Knapsack problem with dynamic programming.

Let’s consider five items with the following weights and costs:

𝑤 𝑖 = [4, 5, 2, 6, 1]
(8.20)
𝑐 𝑖 = [4, 3, 3, 7, 2]

The capacity of our knapsack is 𝐾 = 10. Using Alg. 8.12 we find that the optimal
cost is 12. The value matrix looks as follows:
0 0 0 0 0 0
 
0 2 
 0 0 0 0
0 3 
 0 0 3 3
0 5 
 0 0 3 3
0 5 
 4 4 4 4
 
0 4 4 4 4 6 (8.21)
 
0 4 4 7 7 7
 
0 4 4 7 7 9 

0 10
 4 4 7 10
0 12
 4 7 7 10
0 12
 4 7 7 11

To determine which items produce this cost we need to add a bit more logic.
To keep the focus on the main principles, this was left out of the previous algorithm,
but for completeness it is discussed in this example. To keep track of the selected
items we need to define a selection matrix 𝑆 of the same size as 𝑣 (note that this
matrix is indexed starting at zero in both dimensions). Every time we accept an
item 𝑖 in Alg. 8.12 we note that in the matrix as 𝑆 𝑘,𝑖 = 1. We would replace this
line:
𝑣(𝑘, 𝑖) = max(𝑣(𝑘, 𝑖 − 1), 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1))
with
if 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1) > 𝑣(𝑘, 𝑖 − 1) then
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1)
𝑆(𝑘, 𝑖) = 1
else
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1)
end if
Then, at the end of the algorithm, we can determine which items were
selected using the following logic:
Input: 𝑆 Selection matrix ((𝐾 + 1) × (𝑛 + 1)) from Alg. 8.12
𝑘=𝐾
𝑋 ∗ = {} Initialize solution 𝑋 ∗ as an empty set
for 𝑖 = 𝑛 to 1 by −1 do
if 𝑆 𝑘,𝑖 = 1 then
add 𝑖 to 𝑋 ∗ Item 𝑖 was selected
𝑘 −= 𝑤 𝑖
end if
end for
return 𝑋 ∗
For this example the selection matrix 𝑆 looks as follows:

0 0 0 0 0 0
 
0 1
 0 0 0 0
0 0
 0 0 1 0
0 1
 0 0 1 0
0 1
 1 0 0 0
 
𝑆 = 0 1 0 0 0 1 (8.22)
 
0 1 0 1 0 0
 
0 1 0 1 0 1

0 0
 1 0 1 1
0 1
 1 1 0 1
0 1
 1 1 0 1

Following the above algorithm we find that we selected items 3, 4, 5 for a total
cost of 12, as expected, and a total weight of 9.
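For completeness, the sketch below extends the earlier tabulation with the selection matrix and backtracking logic just described; again, the function and variable names are our own choices rather than notation from the text.

def knapsack_select(c, w, K):
    """Tabulation with a selection matrix S, then backtracking to recover the items."""
    n = len(c)
    v = [[0] * (n + 1) for _ in range(K + 1)]
    S = [[0] * (n + 1) for _ in range(K + 1)]    # S[k][i] = 1 if item i is taken at capacity k
    for i in range(1, n + 1):
        for k in range(K + 1):
            take = c[i - 1] + v[k - w[i - 1]][i - 1] if w[i - 1] <= k else -1
            if take > v[k][i - 1]:
                v[k][i] = take
                S[k][i] = 1
            else:
                v[k][i] = v[k][i - 1]
    items, k = [], K                             # backtrack from (K, n)
    for i in range(n, 0, -1):
        if S[k][i] == 1:
            items.append(i)
            k -= w[i - 1]
    return v[K][n], sorted(items)

print(knapsack_select(c=[4, 3, 3, 7, 2], w=[4, 5, 2, 6, 1], K=10))   # (12, [3, 4, 5])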

Like greedy algorithms, dynamic programming is more of a technique
than a specific algorithm. The implementation will, in general, vary with
the particular application.

8.6 Simulated Annealing

Simulated annealing¶ is a methodology designed for discrete optimization
problems, although it can also be effective for continuous multimodal
problems, as we will discuss. Inspiration for the algorithm comes from the
annealing process of metals. The atoms in a metal form a crystal lattice
structure. If the metal is heated, the atoms move around freely. As the metal
is cooled down, the atoms slow down, and if the cooling is slow enough, they
reconfigure into a minimum energy state.

¶ First developed by Kirkpatrick et al.107 and Černý.108
107. Kirkpatrick et al., Optimization by Simulated Annealing. 1983
108. Černý, Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. 1985


Alternatively, if the metal is quenched, or cooled rapidly, the metal
recrystallizes with a different higher energy state (called an amorphous
metal).
From statistical mechanics, the Boltzmann distribution (also called the
Gibbs distribution) describes the probability of a system occupying a
given energy state:

prob(𝑒) ∝ exp(−𝑒 / (𝑘 𝐵 𝑇))    (8.23)
where 𝑒 is the energy level, 𝑇 is the temperature, and 𝑘 𝐵 is Boltzmann’s
constant. This equation shows that as the temperature is decreased,
the probability of occupying a higher energy state is decreased, but it is
not zero. Therefore, unlike classical mechanics, an atom could jump to
a higher energy state, with some small probability. It is this property,
applied to an optimization algorithm, that gives the methodology an
exploratory nature avoiding premature convergence to a local minimum.
The temperature level provides some control on the level of expected
exploration.
In the Metropolis algorithm the probability of transitioning from
energy state 𝑒1 to energy state 𝑒2 is taken to be:

prob = exp(−(𝑒2 − 𝑒1 ) / (𝑘 𝐵 𝑇))    (8.24)

If 𝑒2 < 𝑒1 then the predicted probability would be greater than 1, and


so is capped at 1.
In the optimization analogy, the energy level is our objective function.
Temperature is a parameter controlled by the optimizer, which begins
high and is slowly “cooled” to drive towards convergence. At a given
iteration, the design variables are given by 𝑥, and the objective (or
energy) is given by 𝑓 (𝑥). A new state 𝑥new is selected at random in
the neighborhood of 𝑥. If the energy level decreases then the new state

is accepted. If the energy level increases, the new state might still be
accepted with probability

prob = exp(−(𝑓 (𝑥new ) − 𝑓 (𝑥)) / 𝑇) ,    (8.25)

where Boltzmann's constant is removed because it is just an arbitrary
scale factor in the optimization context. Otherwise, the state remains
unchanged. Constraints can be naturally handled in this algorithm
without resorting to penalties by rejecting any infeasible step.
We must supply the optimizer with a function that provides a
random neighboring design from the set of possible design configura-
tions. A neighboring design is usually related to the current design, as
opposed to picking a pure random design from the entire set. In defin-
ing the neighborhood structure, one might wish to define transition
probabilities so that all neighbors are not equally likely. This type of
structure is common in Markov chain problems.
Finally, we need to determine the annealing schedule (or cooling
schedule), a process for decreasing the temperature throughout the
optimization. A common approach is exponential decrease:

𝑇 = 𝑇0 𝛼 𝑘 (8.26)

where 𝑇0 is the initial temperature, 𝛼 is the cooling rate, and 𝑘 is the
iteration number. The cooling rate 𝛼 is a number close to one, typically
in the range 0.8–0.99. Another simple approach that iterates toward zero
temperature is:

𝑇 = 𝑇0 (1 − 𝑘/𝑘max )^𝛽    (8.27)
where the exponent 𝛽 is usually in the range of 1–4, with higher expo-
nents spending more time at low temperatures. In many approaches the
temperature is kept constant for a fixed number of iterations (or a fixed
number of successful moves) before moving to the next decrease. Many
methods are simple schedules with a predetermined rate, although
more complex adaptive methods also exist.‖ The annealing schedule
can have a strong impact on the algorithm's performance, and some
experimentation is required to select an appropriate schedule for the
problem at hand. The important principles are that the temperature
should start high enough to allow for exploration (significantly higher
than the maximum expected energy change, i.e., change in objective), but
not so high that computational time is wasted on random searching, and
that cooling should occur slowly to improve the ability to recover from
local optima, imitating the annealing process as opposed to the quenching
process.

‖ See, for example, Andresen et al.109
109. Andresen et al., Constant thermodynamic speed for minimizing entropy production in thermodynamic processes and simulated annealing. 1994

The basic algorithm is summarized in Alg. 8.14 where for simplicity


in the description the annealing schedule uses an exponential decrease
at every iteration.

Algorithm 8.14: Simulated Annealing

Inputs:
𝑥 (0) : Starting point
𝑇 (0) : Initial temperature
Outputs:
𝑥 ∗ : Optimal point

for 𝑘 = 0 to 𝑘max do Simple iteration; convergence metrics can be used instead.


𝑥new = neighbor(𝑥 (𝑘) ) Randomly generate from neighbors
if 𝑓 (𝑥new ) ≤ 𝑓 (𝑥 (𝑘) ) then Energy decreased, jump to new state
𝑥 (𝑘+1) = 𝑥new
else
𝑟 ∈ U[0, 1]    Randomly draw from uniform distribution
𝑝 = exp(−(𝑓 (𝑥new ) − 𝑓 (𝑥 (𝑘) )) / 𝑇)
if 𝑝 ≥ 𝑟 then    Probability high enough to jump


𝑥 (𝑘+1) = 𝑥new
else
𝑥 (𝑘+1) = 𝑥 (𝑘) Otherwise remain at current state
end if
end if
𝑇 = 𝛼𝑇 Reduce temperature
end for
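A minimal Python sketch of Alg. 8.14 follows. The objective, neighbor function, and parameter values are arbitrary illustrative choices (not from the text), and a fixed iteration budget is used in place of a convergence check.

import math
import random

def simulated_annealing(f, x0, neighbor, T0=10.0, alpha=0.95, k_max=1000):
    """Simulated annealing with exponential cooling applied at every iteration."""
    x, T = x0, T0
    for _ in range(k_max):
        x_new = neighbor(x)                               # random neighboring design
        df = f(x_new) - f(x)
        if df <= 0 or random.random() <= math.exp(-df / T):
            x = x_new                                     # accept: better, or worse but allowed
        T *= alpha                                        # reduce temperature
    return x

# Illustrative use on a simple 1-D multimodal function
f = lambda x: x**2 + 3.0 * math.sin(5.0 * x)
neighbor = lambda x: x + random.gauss(0.0, 0.5)           # step size could also shrink with T
x_best = simulated_annealing(f, x0=4.0, neighbor=neighbor)
print(x_best, f(x_best))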

Example 8.15: Traveling Salesman with Simulated Annealing

This example sets up the traveling salesman problem with 50 points


randomly distributed (from uniform sampling) on a square grid with sides
of length 1 (top of Fig. 8.13). The objective is the total Euclidean distance of
a path that traverses all points and returns to the starting point. The design
variables are simply a sequence of integers that correspond to the order in
which to traverse the points. We generate new neighboring designs using
the technique from Lin,110 where one of two options is randomly chosen at
each iteration: 1) randomly choose two points and flip the direction of the
path segments between those two points, or 2) randomly choose two points
and move the path segments to follow another randomly chosen point. The
distance traveled by the randomly generated initial set of points is 26.2. We
specify an iteration budget of 25,000 iterations, set the initial temperature to
10, and every 100 iterations we decrease the temperature by a multiplicative
factor of 0.95. The final design is shown in the bottom of Fig. 8.13 with a path
length of 5.61. The final path might not be the global optimum (remember, these
finite-time methods are only approximations of the full combinatorial search),
but the methodology is effective and fast in finding at least a near-optimal
solution for this problem. The iteration history is shown in Fig. 8.14.

110. Lin, Computer Solutions of the Traveling Salesman Problem. 1965

Figure 8.13: Initial and final path for the traveling salesman problem.

Figure 8.14: Convergence history of the simulated annealing algorithm (total distance versus iteration).
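A short Python sketch of the neighbor function described above (segment reversal or segment relocation) is given below; the list-based tour representation and the 50/50 choice between the two moves are our own assumptions.

import random

def neighbor(tour):
    """Return a new tour using one of the two moves: reverse a segment,
    or remove a segment and reinsert it elsewhere."""
    n = len(tour)
    i, j = sorted(random.sample(range(n), 2))
    new = tour.copy()
    if random.random() < 0.5:
        new[i:j + 1] = reversed(new[i:j + 1])       # flip the segment between the two points
    else:
        segment = new[i:j + 1]
        del new[i:j + 1]
        k = random.randrange(len(new) + 1)          # reinsert at a randomly chosen position
        new[k:k] = segment
    return new

print(neighbor(list(range(8))))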

The simulated annealing algorithm can be applied to continuous


multimodal problems as well. The motivation is similar in that the
initial high temperature would permit the optimizer to escape local
minima whereas a pure descent-based approach would not. By slowly
cooling, the initial exploration gives way to exploitation. The only
real change in the procedure is in the neighbor function. A typical
approach is to generate a random direction, and choose a step size
proportional to the temperature. Thus, smaller, more conservative
steps are taken as the algorithm progresses. If bound constraints
are used, they would be enforced at this step. Purely random step
directions are not particularly efficient for many continuous problems,
particularly when most directions are bad (e.g., a narrow valley, or
near convergence). One variation adopts concepts from the Nelder–Mead
simplex (Section 7.3) to improve efficiency.111 Overall, simulated annealing
has made more of an impact on discrete problems than on continuous ones.

111. Press et al., Numerical Recipes in C: The Art of Scientific Computing. 1992

8.7 Quantum Annealing

Quantum annealing is similar to simulated annealing but borrows ideas


from quantum mechanics instead of from statistical mechanics. While
simulated annealing allows for a design to probabilistically jump over
an energy barrier based on the concept of thermal energy, quantum
annealing allows for a design to probabilistically tunnel through an
energy barrier based on the Heisenberg uncertainty principle. In brief,
one major difference is that in simulated annealing the probability
of accepting a worse step is related to the change in function value,
whereas in quantum annealing the probability of accepting a worse
step is related to the change in function value and the change in design
variables (Fig. 8.15). Thus, the intuition is that a function space with
local minima that are deep (high thermal barrier), but close to other
minima (small tunneling distance), may be more suitable for quantum
annealing as opposed to simulated annealing.
While the algorithm can be represented classically, it is more ef-
ficiently implemented on a quantum computer. A set of entangled
quantum bits (qubits) already possesses the desired properties for the
algorithm, allowing these problems to be solved very quickly.
8.8 Binary Genetic Algorithms

The binary form of a genetic algorithm (GA) can be directly used


with discrete variables. Since the binary form requires a discrete
representation for the members of the population anyway, using discrete
design variables is a natural fit. The details of this method were
discussed in Section 7.5.1.

Figure 8.15: Illustration of the main difference between simulated annealing (SA) and quantum annealing (QA). The black dot is the starting point, and the light dot a candidate point. The probability that SA takes the worse step is related to the function increase, whereas the probability that QA accepts the worse step is related to both the function increase and the magnitude of the design variable changes.

8.9 Summary

This chapter discussed various strategies to approach discrete optimization
problems. Some problems can be well approximated using rounding, or they
only have a few discrete combinations, allowing for explicit enumeration. For
problems that can be posed as linear, branch and bound is very effective. If
the problem can be posed as a Markov chain, then dynamic programming is a
useful technique. If these categorizations are not applicable, then a stochastic
method like simulated
annealing or genetic algorithms may work well. These stochastic meth-
ods typically struggle as the dimensionality of the problem increases

(although some simulated annealing problems can scale better if there


are clever ways to quickly evaluate designs in the neighborhood, as is
done with the traveling salesman problem). An alternative to these
various algorithms is to use a greedy strategy, which can scale well but
is a heuristic that usually entails some loss in solution quality.

Problems

8.1 Answer true or false and justify your answer.

a) All discrete variables can be represented by integers.


b) Discrete optimization algorithms sometimes use heuristics
and find only approximate solutions.
c) The rounding technique solves a discrete optimization prob-
lem with continuous variables and then rounds each result-
ing design variable, objective, and constraint to the nearest
integer.
d) Exhaustive search is the only way to be sure we have found
the global minimum for a problem that involves discrete
variables.
e) The branch and bound method is guaranteed to find the
global optimum for convex problems.
f) When using the branch and bound method for binary vari-
ables, the same variable might have to be revisited.
g) When using the branch and bound method, the breadth-first
strategy requires less memory storage than the depth-first
strategy.
h) Greedy algorithms never reconsider a decision once it has
been made.
i) The Markov property applies when a future state can be
predicted from the current state without needing to know
any previous state.
j) Both memoization and tabulation reduce the computational
complexity of dynamic programming such that it no longer
scales exponentially.
k) Simulated annealing can be used to minimize smooth uni-
modal functions of continuous design variables.
l) Simulated annealing, genetic algorithms, and dynamic pro-
gramming include stochastic procedures.

8.2 Converting to binary variables. You have one integer design variable
𝑥 ∈ [1, 𝑛]. Let’s say, for example, that this variable represents one
of 𝑛 materials that we would like to select from. Convert this
to an equivalent binary problem so that it is more efficient for
branch and bound. To accomplish this you will need to create
additional design variables and additional constraints.

8.3 Branch and bound. Solve the following problem using a manual
branch and bound approach (i.e., show each LP subproblem) as
is done in Ex. 8.5.

maximize 0.5𝑥1 + 2𝑥2 + 3.5𝑥3 + 4.5𝑥 4


subject to 5.5𝑥1 + 0.5𝑥 2 + 3.5𝑥3 + 2.3𝑥4 ≤ 9.2
2𝑥 1 + 4𝑥2 + 2𝑥4 ≤ 8 (8.28)
1𝑥 1 + 3𝑥2 + 3𝑥3 + 4𝑥4 ≤ 4
𝑥 𝑖 ∈ {0, 1} for all 𝑖

8.4 Solve an integer linear programming problem. A chemical company


produces four types of products: A, B, C, and D. Each requires
labor to produce and uses some combination of chlorine, sulfuric
acid, and sodium hydroxide in the process. The production
process can also produce these chemicals as a byproduct, rather
than just consuming them. The chemical mixture and labor
required for the production of the three products are listed in
the table below, along with the availability per day. The market
value for one barrel of A, B, C, and D are $50, $30, $80, and $30
respectively. Determine the number of barrels of each to produce
in order to maximize profit using three different approaches:

a) As a continuous linear programming problem with round-


ing.
b) As an integer linear programming problem.
c) Exhaustive search.

A B C D Limit
Chlorine 0.74 -0.05 1.0 -0.15 97
Sodium hydroxide 0.39 0.4 0.91 0.44 99
Sulfuric acid 0.86 0.89 0.09 0.83 52
Labor (person-hours) 5 7 7 6 1000

8.5 Solve a dynamic programming problem. Solve the knapsack problem


with the following weights and costs:

𝑤 𝑖 = [2, 5, 3, 4, 6, 1]
(8.29)
𝑐 𝑖 = [5, 3, 1, 5, 7, 2]

and a capacity of 𝐾 = 12. Maximize the cost subject to capacity


constraint. Use the following two approaches:

a) a greedy algorithm where you take the item with the best
cost/weight ratio (that fits within the remaining capacity) at
each iteration.
b) dynamic programming

8.6 Simulated annealing. Construct a traveling salesman problem with


50 randomly generated points. Implement a simulated annealing
algorithm to solve it.

8.7 Binary genetic algorithm. Solve the same problem as above (travel-
ing salesman) with a binary genetic algorithm.

8.8 Binary genetic algorithm II. Something more suited to a GA.


9  Multiobjective Optimization
Up to this point in the book, all of our optimization problem for-
mulations have had a single objective function. In this chapter, we
consider multiobjective optimization problems, that is, problems whose
formulations have more than one objective function. A classic exam-
ple of multiobjective optimization is risk versus reward in financial
investments.

By the end of this chapter you should be able to:

1. Determine the scenarios where multiobjective optimiza-


tion is useful.

2. Understand the concept of dominance and identify a Pareto


set.

3. Use and identify various methods for performing multi-


objective optimization and understand the pros and cons
of the methods.

9.1 Multiple Objectives

Before discussing how to solve multiobjective problems we must first


explore what it means to have more than one objective. In some sense,
there is no such thing as a multiobjective optimization problem. While
many metrics may be important to the design engineer, in practice
only one thing can be made best at a time. A common technique
when presented with multiple objectives, as we will discuss in more
detail, is to assign weights to the various objectives and combine them.
But in doing so we have defined a single objective. More generally,
multiobjective optimization is useful in exploring tradeoffs between
different metrics, but if we intend to select a design (or even multiple
designs) from the presented options, we have indirectly formulated an
objective. This new objective may be difficult to formalize, and we


should be careful that the imprecision in our selection is warranted.

Tip 9.1: Are you sure you have multiple objectives?

A common pitfall for beginner optimization practitioners is to categorize


a problem as multiobjective without critical evaluation. When considering
whether you should use more than one objective, you should ask whether
or not there is a more fundamental underlying objective, or if some of the
“objectives” are actually constraints. Solving a multiobjective problem is much
more costly than solving a single objective one, so you should make sure you
absolutely need multiple objectives.

Example 9.2: Selecting an objective.

Determining the appropriate objective is often a real challenge. For example,


in designing an aircraft, one may decide that minimizing drag and minimizing
weight are both important. However, these metrics are competing and cannot
be minimized simultaneously. Instead, we may conclude that maximizing
range (the distance the aircraft can fly) is the underlying metric that matters
most for our application and appropriately balances the tradeoffs between
weight and drag. Or, perhaps maximizing range isn’t the right metric. Range
may be important, but only insofar as we reach some threshold. Increasing the
range does not increase the value because range is a constraint. The underlying
objective in this scenario may be some other metric like operating costs.

Despite these considerations, there are still good reasons to pursue


a multiobjective problem. A few of the most common reasons include:

1. Multiobjective optimization allows us to quantify tradeoff sensi-


tivities between different objectives and constraints. The benefits
of this approach will become apparent when we discuss Pareto
surfaces and can lead to important design insights.

2. Multiobjective optimization provides a “family” of designs rather


than a single design. A family of options is desirable when
decision making needs to be deferred to a later stage as more
information is gathered. For example, an executive team or
higher-fidelity numerical simulations may be used to make later
design decisions.

3. For some problems, the underlying objective is either unknown


or too difficult to compute. For example, cost and environmental
impact may be two important metrics for a new design. While

the latter could arguably be turned into a cost, doing so may
be too difficult to quantify and may add an unacceptable amount of
uncertainty.

Mathematically, the only change to our optimization problem for-


mulation is that the objective statement,

minimize 𝑓 (𝑥) (9.1)

becomes
minimize 𝑓 (𝑥) = [ 𝑓1 (𝑥), 𝑓2 (𝑥), . . . , 𝑓𝑛𝑓 (𝑥) ]ᵀ , where 𝑛𝑓 ≥ 2    (9.2)
The constraints are unchanged, unless some of them have been refor-
mulated as objectives. This multiobjective formulation might require
tradeoffs when trying to minimize all functions simultaneously because
at some point, further reduction in one objective can only be achieved
by increasing one or more of the other objectives.
One exception occurs if the objectives are independent because they
depend on different sets of design variables. Then, the objectives are
said to be separable and they can be minimized independently. If there
are constraints, these need to be separable as well. However, separable
objectives and constraints are rare because in real engineering systems
all functions tend to be linked in some way.
Given that multiobjective optimization requires tradeoffs, we need
a new definition of optimality. In the next section, we explain how
there are an infinite number of points that are optimal, forming a
surface in the space of objective functions. After defining optimality
for multiple objectives, we present several possible methods for solving
multiobjective optimization problems.

9.2 Pareto Optimality

With multiple objectives, we have to reconsider what it means for a


point to be optimal. In multiobjective optimization we use the concept
of Pareto optimality.
Figure 9.1 shows three designs measured against two objectives that
we want to minimize: 𝑓1 and 𝑓2 . Let us first compare design A with
design B. From the figure, we see that design A is better than design B in
both objectives. In the language of multiobjective optimization we say
that design A dominates design B. One design is said to dominate another

design if it is superior in all of the objectives (design A dominates any


design in the shaded rectangle). Comparing design A with design C,
we note that design A is better in one objective ( 𝑓1 ), but worse in the
other objective ( 𝑓2 ). Neither design dominates the other.
A point is said to be nondominated if none of the other evaluated
points dominate it. If a point is not dominated by any point in the entire
domain, then that point is called Pareto optimal. This does not imply that
this point dominates all other points; it simply means no other point
dominates it. The set of all Pareto optimal points is called the Pareto set,
and it is visualized in Fig. 9.2. The Pareto set refers to the set of points
𝑥∗, whereas the Pareto front refers to the corresponding function values 𝑓 (𝑥∗).

Figure 9.1: The three designs 𝐴, 𝐵, and 𝐶 are plotted against two objectives, 𝑓1 and 𝑓2. The shaded rectangle highlights points that are dominated by design 𝐴.

Figure 9.2: A plot of all the evaluated points in the design space plotted against two objectives, 𝑓1 and 𝑓2. The set of red points are not dominated by any other points and thus form the Pareto set.

Example 9.3: A Pareto front in wind farm optimization.

The Pareto front is a useful tool to produce design insights. Figure 9.3
shows a notional Pareto front for a wind farm optimization. The two objectives
are maximizing power production (shown with a negative sign so that it is
minimized) and minimizing noise. The Pareto front is helpful for understanding
tradeoff sensitivities. For example, the left end point shows the maximum
power solution, and the right end point shows the minimum noise solution.
The nature of the curve on the left side tells us how much power we have to
sacrifice for a given reduction in noise. If the slope is steep, as is the case in the
figure, we can see that a small sacrifice in maximum power production can be
exchanged for greatly reduced noise. However, if even larger noise reductions
are sought, then large power reductions will be required. Conversely, if the
left side of the figure had a flatter slope, we would know that small reductions
in noise would require significant decreases in power. Understanding the
magnitude of these tradeoff sensitivities is helpful in making high-level design
decisions.
9.3 Solution Methods

Various solution methods exist for solving multiobjective problems. This
chapter does not cover all methods, but it highlights some of the more
commonly used ones. These include the weighted-sum method, the
𝜖-constraint method, the normal boundary intersection method, and
evolutionary algorithms.

Figure 9.3: A notional Pareto front representing power and noise tradeoffs for a wind farm optimization problem.

9.3.1 Weighted Sum

The weighted-sum method is easy to use, but it is not particularly
efficient. Other methods exist that are just as simple but have better
performance. It is only introduced because it is well known and is

frequently used. The idea is to combine all of the objectives into one
objective using a weighted sum, which can be written as:

𝑓¯(𝑥) = Σ_{𝑖=1}^{𝑁} 𝑤 𝑖 𝑓𝑖 (𝑥),    (9.3)

where 𝑁 is the number of objectives and the weights are usually


normalized such that
Σ_{𝑖=1}^{𝑁} 𝑤 𝑖 = 1    (9.4)

If we have two objectives, the objective reduces to

𝑓¯(𝑥) = 𝑤 𝑓1 (𝑥) + (1 − 𝑤) 𝑓2 (𝑥), (9.5)

where 𝑤 is a weight in [0, 1].


Consider a two-objective case. Points on the Pareto set are deter-
mined by choosing a weight 𝑤, completing the optimization for the
composite objective, and then repeating the process for a new value of
𝑤. It is straightforward to see that at the extremes 𝑤 = 0 and 𝑤 = 1,
the optimization returns the designs that optimize one objective while
ignoring the other. The weighted-sum objective forms an equation
for a line with the objectives as the ordinates. Conceptually we can
think of this method as choosing a slope for the line (by selecting 𝑤),
then pushing that line down and to the left as far as possible until it is
just tangent to the Pareto front (Fig. 9.4). With the above form of the
objective, the slope of the line would be:

d𝑓2 /d𝑓1 = −𝑤 / (1 − 𝑤)    (9.6)

Figure 9.4: The weighted-sum method defines a line for each value of 𝑤 and finds the point tangent to the Pareto front.

This procedure identifies one point in the Pareto set, and the procedure
must then be repeated with a new slope.
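As a brief illustration, the Python sketch below sweeps the weight 𝑤 and solves each composite problem with a gradient-based optimizer. The two objective functions are arbitrary smooth functions chosen for illustration, not functions from the text.

import numpy as np
from scipy.optimize import minimize

def f1(x):
    return x[0]**2 + x[1]**2

def f2(x):
    return (x[0] - 2.0)**2 + (x[1] - 1.0)**2

pareto_front = []
x0 = np.array([1.0, 1.0])
for w in np.linspace(0.0, 1.0, 11):
    res = minimize(lambda x: w * f1(x) + (1.0 - w) * f2(x), x0)   # composite objective
    pareto_front.append((f1(res.x), f2(res.x)))
    x0 = res.x                                                    # warm start the next solve
print(pareto_front)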

The main benefit of this method is that it is easy to use. However,
the drawbacks are that 1) uniform spacing in 𝑤 leads to nonuniform
spacing along the Pareto set, 2) it is not obvious which values of 𝑤
should be used to sweep out the Pareto set effectively, and 3) this
method can only return points on the convex portions of the Pareto front
(see Fig. 9.5).
Using the Pareto front shown in Fig. 9.4, Fig. 9.5 highlights the
convex portions of the Pareto front. Those are the only portions of the
Pareto front that can be found using the weighted-sum method.

Figure 9.5: The highlighted portions of this Pareto front are the convex portions.

9.3.2 Epsilon-constraint method


The 𝜖-constraint method works by minimizing one objective, while
setting all other objectives as additional constraints:112

minimize 𝑓𝑖
by varying 𝑥
subject to 𝑓𝑗 ≤ 𝜖 𝑗 for all 𝑗 ≠ 𝑖, (9.7)
𝑔(𝑥) ≤ 0
ℎ(𝑥) = 0

Then, we must repeat this procedure for different values of 𝜖 𝑗 .
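A minimal Python sketch of this sweep is shown below; the objective functions, the range of 𝜖 values, and the choice of SLSQP as the constrained optimizer are illustrative assumptions, not from the text.

import numpy as np
from scipy.optimize import minimize

def f1(x):
    return x[0]**2 + x[1]**2

def f2(x):
    return (x[0] - 2.0)**2 + (x[1] - 1.0)**2

pareto_front = []
x0 = np.array([1.0, 1.0])
for eps in np.linspace(0.1, 5.0, 11):
    # minimize f2 subject to f1(x) <= eps
    cons = [{"type": "ineq", "fun": lambda x, e=eps: e - f1(x)}]
    res = minimize(f2, x0, method="SLSQP", constraints=cons)
    if res.success:
        pareto_front.append((f1(res.x), f2(res.x)))
        x0 = res.x
print(pareto_front)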


This method is visualized in Fig. 9.6. In this example, we constrain 𝑓1
to be less than a certain value and minimize 𝑓2 to find the corresponding
point on the Pareto front. We then repeat this procedure for different
values of 𝜖.
One advantage of this method is that determining appropriate
values for 𝜖 is more intuitive than selecting the weights in the previous
method, although one must be careful to choose values that result in a 𝑓2
feasible problem. Another advantage is that this method reveals the
non-convex portions of the Pareto front. Its main limitation is that like
the weighted-sum method, a uniform spacing in 𝜖 does not in general
yield uniform spacing of the Pareto front and therefore it might may 𝑓1 𝜖
still be inefficient, particularly with more than two objectives.
Figure 9.6: The vertical line repre-
sents an upper bound constraint
9.3.3 Normal Boundary Intersection on 𝑓1 . The other objective 𝑓2 is min-
imized to reveal one point in the
The normal boundary intersection method is designed to address the Pareto set. This procedure is then
issue of nonuniform spacing along the Pareto front.113 The basic idea is repeated for different constraints on
to first find the extremes of the Pareto set; in other words, we minimize 𝑓1 to sweep out the Pareto set.
the objectives one at a time. These extreme points are referred to as 113. Das et al., Normal-Boundary Inter-
section: A New Method for Generating the
anchor points. Next, construct a plane that passes through the anchor Pareto Surface in Nonlinear Multicriteria
points. We space points along this plane (usually evenly) and starting Optimization Problems. 1998

from those points, solve optimization problems that search along lines
normal to this plane.
This procedure is shown in Fig. 9.7 for a two objective case. In this
case, the plane that passes through the anchor points is a line. We now
space points along this plane by choosing a vector of weights that we
will call 𝑏, and which are illustrated on the left hand figure. The weights
Í
are constrained such that 𝑏 𝑖 ∈ [0, 1], and 𝑖 𝑏 𝑖 = 1. If we make 𝑏 𝑖 = 1
and all other entires zero, then this equation returns one of the anchor
points, 𝑓 (𝑥 ∗𝑖 ). For two objectives, we would define 𝑏 as 𝑏 = [𝑤, 1 − 𝑤]𝑇
and vary 𝑤 in equal steps between 0 and 1.

Figure 9.7: A notional example of the normal boundary intersection method. A plane is created passing through the single-objective optima, and solutions are sought normal to that plane to allow for a more evenly spaced Pareto front.

Starting with a specific value of 𝑏, we search from the corresponding point
on the plane along a line perpendicular to it, represented by the line with
the arrow in the right-hand figure. We seek the point along this line that is
furthest away from the plane (a maximization problem), with the constraint
that the point is consistent with the objective functions. The resulting
optimal point found along this line is shown as a point on the Pareto
front. We then repeat for another set of weighting parameters in 𝑏.
We can see how this method is similar to the 𝜖-constraint
method, but instead of searching along lines parallel to one of the axes,
we search along lines normal to this plane. The idea is that even spacing
along this plane is more likely to lead to even spacing along the Pareto
front.
Mathematically, we start by determining the anchor points, which
are just single objective optimization problems. From the anchor points
we define what is called the utopia point. The utopia point is an
ideal point that cannot be obtained, where every objective reaches its
minimum simultaneously (shown in the lower left corner of Fig. 9.7):

𝑓 ∗ = [ 𝑓1 (𝑥1∗ ), 𝑓2 (𝑥2∗ ), . . . , 𝑓𝑛 (𝑥𝑛∗ ) ]ᵀ ,    (9.8)

where 𝑥 ∗𝑖 denotes the design variables that minimize objective 𝑓𝑖 . The


utopia point allows us to define the equation of a plane that passes
through all anchor points,
𝑃𝑏 + 𝑓 ∗ , (9.9)
where the 𝑖 th column of 𝑃 is 𝑓 (𝑥 ∗𝑖 ) − 𝑓 ∗ . A single vector 𝑏, whose length
is given by the number of objectives, defines a point on the plane.
We now define a vector 𝑛̂ that is normal to this plane, in the
direction toward the origin. We search along this vector using a step
length 𝛼, yielding

𝑃𝑏 + 𝑓 ∗ + 𝛼 𝑛̂ .    (9.10)

Computing the exact normal 𝑛̂ is involved, and it is not actually
necessary that the vector be exactly normal. As long as the vector points
toward the Pareto front, it will still yield well-spaced points. In
practice, a quasi-normal vector is often used, such as

𝑛̃ = −𝑃1    (9.11)

where 1 is a vector of ones.


We now solve the following optimization problem, for a given vector
𝑏, to yield a point on the Pareto front:

maximize 𝛼
by varying 𝑥, 𝛼
subject to 𝑃𝑏 + 𝑓 ∗ + 𝛼 𝑛ˆ = 𝑓 (𝑥) (9.12)
𝑔(𝑥) ≤ 0
ℎ(𝑥) = 0

This means that we are finding the point furthest away from the
anchor point plane, starting from a given value for 𝑏, while satisfying
the original problem constraints. The process is then repeated for
additional values of 𝑏 to sweep out the Pareto front.
In contrast to the previously mentioned methods, this method yields
a more uniformly spaced Pareto front, which is desirable for computa-
tional efficiency, albeit at the cost of a more complex methodology.

Example 9.4: A two-dimensional normal boundary intersection problem.

Figure 9.8: Search directions are normal to the line connecting anchor points.

First, we optimize the objectives one at a time, which in our example results
in the two anchor points shown in Fig. 9.8: 𝑓 (𝑥1∗ ) = (2, 3) and 𝑓 (𝑥2∗ ) = (5, 1).
The utopia point is then:

𝑓 ∗ = [2, 1]ᵀ    (9.13)

For the matrix 𝑃, recall that the 𝑖th column of 𝑃 is 𝑓 (𝑥 ∗𝑖 ) − 𝑓 ∗ :

𝑃 = [ 0  3
      2  0 ]    (9.14)

Our quasi-normal vector is given by −𝑃1 (note that the true normal is [−2, −3]):

𝑛̃ = [−3, −2]ᵀ    (9.15)
We now have all the parameters we need to solve Eq. 9.12.

For most multiobjective design problems, additional complexity beyond
the normal boundary intersection (NBI) method is unnecessary; however,
even this method can still have deficiencies for problems with unusual
Pareto fronts, and new methods continue to be developed. For example,
the normal constraint method uses a very similar approach,114 but with
inequality constraints, to address a deficiency in the NBI method that
occurs when the normal line does not cross the Pareto front. This
methodology has undergone various improvements, including better
scaling through normalization.115 A more recent improvement allows
for even more efficient generation of the Pareto frontier by avoiding
regions of the Pareto front where minimal tradeoffs occur.116

114. Ismail-Yahaya et al., Effective generation of the Pareto frontier using the Normal Constraint method. 2002
115. Messac et al., Normal Constraint Method with Guarantee of Even Representation of Complete Pareto Frontier. 2004
116. Hancock et al., The smart normal constraint method for directly generating a smart Pareto set. 2013
∗ The first application of an evolutionary algorithm for solving a multiobjective problem was by Schaffer.117
117. Schaffer, Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms. 1984

9.3.4 Evolutionary Algorithms


Gradient-free methods can, and occasionally do, use all of the above
methods. However, evolutionary algorithms also enable a fundamentally
different approach. Genetic algorithms, a specific type of evolutionary
algorithm, were introduced in Section 7.5.∗
A genetic algorithm (GA) is amenable to an extension that can handle
multiple objectives because it keeps track of a large population of designs
at each iteration. If we plot two objective functions for a given population
of a genetic algorithm iteration, we would get something like that shown
in Fig. 9.9. The points represent the current population, and the highlighted
points in the lower left are the current nondominated set. As the optimization
progresses, the nondominated set moves further down and to the left and
eventually converges toward the true Pareto set.
In the multiobjective version of the genetic algorithm, the reproduction
and mutation phases are unchanged from the single objective version. The
primary difference is in determining the fitness and the selection procedure.
Here, we provide an overview of one popular approach, the NSGA-II
algorithm.†

Figure 9.9: Population for a multiobjective genetic algorithm iteration plotted against two objectives. The nondominated set is highlighted at the bottom left and eventually converges toward the Pareto front.

† The NSGA-II algorithm was developed by Deb et al.103 Some key developments include using the concept of domination in the selection process, preserving diversity amongst the nondominated set, and using elitism.118
103. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II. 2002
118. Deb, Introduction to Evolutionary Multiobjective Optimization. 2008

A step in the algorithm is to find a nondominated set (i.e., the Pareto


set), and several algorithms exist to accomplish this. In the following
we use the algorithm by Kung,119 which has been shown to be one of the fastest.

119. Kung et al., On Finding the Maxima of a Set of Vectors. 1975

This procedure recursively divides the population in half,
finding the front for each half separately. Before calling the algorithm
the population should be sorted by the first objective. First, we split the
population into two halves, where the top half is superior to the bottom
half in the first objective. Both populations are recursively fed back
through the algorithm to find their fronts. We then initialize a merged
population with the members of the top half. All members in the bottom
half are checked, and any that are nondominated by any member of
the top half are added to the merged population. Finally, we return
the merged population as the nondominated set. The methodology is
summarized in Alg. 9.5.

Algorithm 9.5: Find nondominated set using Kung’s algorithm

Inputs:
𝑝: a population sorted by the first objective
Outputs:
𝑓 : the Pareto set for the population

procedure front(𝑝)
if length(𝑝) = 1 then    If there is only one point, it is the front
return 𝑝
end if
split population into two halves 𝑝 𝑡 and 𝑝 𝐵
⊲ because input was sorted, 𝑝 𝑡 will be superior to 𝑝 𝐵 in the first objective.
𝑡 = front(𝑝 𝑡 ) recursive call to find front for top half
𝑏 = front(𝑝 𝐵 ) recursive call to find front for bottom half
initialize 𝑓 with the members from 𝑡 merged population
for 𝑖 = 1 to length(𝑏) do
dominated = false track whether anything in 𝑡 dominates 𝑏 𝑖
for 𝑗 = 1 to length(𝑡) do
if 𝑡 𝑗 dominates 𝑏 𝑖 then
dominated = true
break no need to continue search through 𝑡
end if
end for
if not dominated then    𝑏 𝑖 was not dominated by anything in 𝑡
add 𝑏 𝑖 to 𝑓
end if
end for
return 𝑓
end procedure
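A Python sketch of Alg. 9.5 is shown below; representing each member as a tuple of objective values is our own choice.

def front(pop):
    """Return the nondominated members of pop (Kung's algorithm).
    pop must be sorted by the first objective."""
    if len(pop) == 1:
        return pop
    half = len(pop) // 2
    top = front(pop[:half])         # superior in the first objective (input is sorted)
    bottom = front(pop[half:])
    merged = list(top)
    for b in bottom:
        dominated = any(
            all(tj <= bj for tj, bj in zip(t, b)) and any(tj < bj for tj, bj in zip(t, b))
            for t in top
        )
        if not dominated:
            merged.append(b)        # b is not dominated by anything in the top half
    return merged

points = sorted([(5, 8), (1, 4), (6, 1), (6, 3), (9, 2), (3, 7)])   # sort by first objective
print(front(points))   # [(1, 4), (6, 1)]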

In NSGA-II, we are interested in not just the Pareto set, but rather
in ranking all members by their dominance depth, which is also called
nondominated sorting. In this approach, all points in the population
that are nondominated (i.e., the Pareto set) are given a rank of 1. Those
points are then removed from the set and the next set of nondominated
points is given a rank of 2, and so on (see Fig. 9.10). Note that there
are alternative procedures that can perform a nondominated sorting
directly, which can sometimes be more efficient, though we don’t
highlight them here. This algorithm is summarized in Alg. 9.6.
Figure 9.10: Points in the population highlighted by rank.

Algorithm 9.6: Perform nondominated sorting

Inputs:
𝑝: a population
Outputs:
rank: the rank for each member in the population

𝑟 = 1    initialize current rank
𝑠 = 𝑝    set sub-population as entire population
f = front(sort(𝑠)) identify the current front
set rank for every member of 𝑓 to 𝑟
r += 1 increment rank
remove all members of 𝑓 from 𝑠
end while

The new population is filled by placing all rank 1 points in the new
population, then all rank 2 points, and so on. At some point, an entire
group of constant rank will not fit within the new population. Points
with the same rank are all equivalent as far as Pareto optimality is
concerned, so an additional sorting mechanism is needed to determine
which members of this group to include.
The way that we perform selection within a group that can only
partially fit is to try to preserve diversity as much as possible. Points
in this last group are ordered by their crowding distance, which is a
measure of how spread apart the points are. The algorithm seeks to
preserve points that are well spread. For each point, a hypercube in
objective space is formed around it, which in NSGA-II is referred to as
a cuboid. Figure 9.11 shows an example cuboid considering the rank
3 points from Fig. 9.10. The hypercube extends to the function values
of its nearest neighbors in function space. That does not mean that it
necessarily touches its neighbors as the two closest neighbors can differ
for each objective.

Figure 9.11: A cuboid around one point demonstrating the definition of crowding distance (except that the distances are normalized).

The sum of the dimensions of this hypercube is the
crowding distance. When summing the dimensions, each dimension is
normalized by the maximum range of that objective value. For example,
considering only 𝑓1 for the moment, if the objectives were in ascending
order, then the contribution of point 𝑖 to the crowding distance would
be:
𝑑1,𝑖 = (𝑓1 𝑖+1 − 𝑓1 𝑖−1 ) / (𝑓1 𝑁 − 𝑓1 1 )    (9.16)
Sometimes, instead of using the first and last points in the current
objective set, user-supplied values are used for the min and max values
of 𝑓 that appear in that denominator. The anchor points (the single
objective optima) are assigned a crowding distance of infinity because we
want to prioritize their inclusion. The algorithm for crowding distance
is shown in Alg. 9.7.

Algorithm 9.7: Crowding distance

Inputs:
𝑝: a population
Outputs:
𝑑: crowding distances

initialize 𝑑 with zeros


for 𝑖 = 1 to number of objectives do
set 𝑓 as a vector containing the 𝑖 th objective for each member in 𝑝
𝑠 = 𝑠𝑜𝑟𝑡( 𝑓 ) and let 𝐼 contain the corresponding indices (𝑠 = 𝑓𝐼 )
𝑑 𝐼1 = ∞ anchor points receive an infinite crowding distance
𝑑𝐼𝑛 = ∞
for 𝑗 = 2 to length(𝑝) - 1 do add distance for interior points
𝑑𝐼 𝑗 + = (𝑠 𝑗+1 − 𝑠 𝑗−1 )/(𝑠 𝑁 − 𝑠 1 )
end for
end for
return d
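A NumPy-based sketch of Alg. 9.7 follows; the array layout is our own choice. Applied to the rank 3 members of Ex. 9.9 (later in this section), it reproduces the crowding distances listed there.

import numpy as np

def crowding_distance(F):
    """Crowding distance for an (n_points, n_objectives) array of objective values."""
    n, n_obj = F.shape
    d = np.zeros(n)
    for i in range(n_obj):
        I = np.argsort(F[:, i])             # indices that sort the i-th objective
        s = F[I, i]
        d[I[0]] = d[I[-1]] = np.inf         # boundary points get infinite distance
        denom = s[-1] - s[0]
        if denom > 0:
            d[I[1:-1]] += (s[2:] - s[:-2]) / denom   # interior points (Eq. 9.16)
    return d

F = np.array([[5, 8], [10, 4], [9, 5], [4, 10]], dtype=float)   # members A, C, I, L of Ex. 9.9
print(crowding_distance(F))   # approximately [1.67, inf, 1.5, inf]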

We can now put together the pieces in the overall algorithm. The
crossover and mutation operations remain the same. Tournament
selection (Fig. 7.17) is modified slightly to use the ranking and crowding
metrics of this algorithm. In the tournament, a member with a lower
rank is superior. If two members have the same rank, then the one
with the larger crowding distance is selected. This procedure is called
crowded tournament selection. After reproduction/mutation, instead of
replacing the parent generation with the offspring generation, both the

parent generation and offspring generation are saved as candidates for


the next generation. This preserves elitism, which means that the best
member in the population is guaranteed to survive. The population size
is now twice its original size (2𝑁) and the selection process must reduce
the population back down to size 𝑁. This is done using the procedure
explained previously. The new population is filled by including all rank
1 members, rank 2 members, etc., until an entire rank can no longer
fit. Inclusion of members from that last rank is done in order of largest
crowding distance until the population is filled. The general algorithm
is summarized in Alg. 9.8. Note that many variations are possible, so
while the algorithm is based on the concepts of NSGA-II, the details
may differ somewhat.

Algorithm 9.8: Elitist non-dominated sorting genetic algorithm

Inputs:
𝑥̄: Variable upper bounds
𝑥̲: Variable lower bounds
𝑓 (𝑥): function
Outputs:
𝑥 ∗ : Optimal point

Generate initial population


while Stopping criterion is not satisfied do
Using a parent population 𝑃, evaluate fitness, perform selection, crossover,
mutation as in a standard GA to produce an offspring population 𝑂, except
modify selection to use a crowded tournament selection
𝐶 =𝑃∪𝑂 combine populations
Compute rank𝑖 for 𝑖 = 1, 2, . . . of 𝐶 using Alg. 9.6
⊲ Fill new parent population with as many whole ranks as possible
𝑃=∅
𝑟=1
while true do
set 𝐹 as all 𝐶 𝑖 with 𝑟𝑎𝑛𝑘 𝑖 = 𝑟
if length(𝑃) + length(𝐹) > 𝑁𝑝 then
break
end if
add 𝐹 to 𝑃
𝑟 += 1
end while
⊲ For last rank that doesn’t fit, add by crowding distance
if length(𝑃) < 𝑁𝑝 then population isn’t full
d = crowding(𝐹) Alg. 9.7, using last 𝐹 from terminated loop above
𝑚 = 𝑁𝑝 − length(𝑃) determine how many members to add
sort 𝐹 by the crowding distance 𝑑 in descending order

add the first 𝑚 entries from 𝐹 to 𝑃


end if
end while

Example 9.9: Filling a new population in NSGA-II

After reproduction and mutation, we are left with a combined population of


parents and offspring. In this small example the combined population is of size
12, and so we must reduce it back to 6. This example has two objectives,
and the values for each member in the population are shown below. To refer to
these members, just for the purpose of this example, we have assigned each
member with a lettered label. The population is plotted in Fig. 9.12.

        A  B  C   D  E  F   G   H  I  J  K  L
𝑓1      5  7  10  1  3  10  5   6  9  6  9  4
𝑓2      8  9  4   4  7  6   10  3  5  1  2  10

First, we compute the ranks using Alg. 9.6, resulting in the following output:

        A  B  C  D  E  F  G  H  I  J  K  L
rank    3  4  3  1  2  4  4  2  3  1  2  3

We see that the current nondominated set consists of points D and J and that
there are four different ranks.

Figure 9.12: Population for Ex. 9.9.
Next, we start filling the new population in order of rank. Our max capacity
is 6, so all of rank 1 (D, J) and rank 2 (E, H, K) fit. We cannot add rank 3 (A,
C, I, L) because the population size would be 9. So far our new population
consists of [D, J, E, H, K]. To choose which items from rank 3 continue forward,
we compute the crowding distance for the members of rank 3:

A C I L
1.67 ∞ 1.5 ∞

We would then add these in order: C, L, A, I, but only have room for one, so we
add 𝐶 and complete this iteration with a new population of [D, J, E, H, K, C].
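For readers who want to check these numbers, the short Python snippet below reproduces the ranks with a brute-force nondominated sort (simpler, though slower, than Kung's algorithm); the dictionary layout is our own choice.

pop = {
    "A": (5, 8), "B": (7, 9), "C": (10, 4), "D": (1, 4), "E": (3, 7), "F": (10, 6),
    "G": (5, 10), "H": (6, 3), "I": (9, 5), "J": (6, 1), "K": (9, 2), "L": (4, 10),
}

def dominates(a, b):
    return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))

rank, remaining, r = {}, dict(pop), 1
while remaining:
    # current front: members not dominated by any other remaining member
    front = [k for k in remaining
             if not any(dominates(remaining[j], remaining[k]) for j in remaining if j != k)]
    for k in front:
        rank[k] = r
        del remaining[k]
    r += 1
print(rank)   # D, J -> 1; E, H, K -> 2; A, C, I, L -> 3; B, F, G -> 4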

The main advantage of this multiobjective approach is that if an


evolutionary algorithm is appropriate for solving a given single objective
problem, then the extra information needed for a multiobjective problem
is already there and therefore solving the multiobjective problem does
not incur an additional computational cost. The pros and cons of this
approach as compared to the previous approaches are basically the
pros and cons of gradient-based versus gradient-free methods with the
exception that the multiobjective gradient-based approaches require
solving multiple problems to generate the Pareto front. Still, solving

multiple gradient-based problems is often more efficient than solving


one gradient-free problem, especially for problems with a large number
of design variables.

9.4 Summary

Multiobjective optimization is particularly useful in quantifying tradeoff


sensitivities between critical metrics. It is also useful when a family
of potential solutions is sought, rather than a single solution. Some
scenarios where a family of solutions might be preferred are when the
models used in optimization are low fidelity and higher fidelity design
tools will be applied, or when more investigation is needed and only
candidate solutions are desired at this stage.
The presence of multiple objectives changes what it means for a
design to be optimal. The concept of Pareto optimality was introduced
where a design is nondominated by any other design. The weighted
sum method is perhaps the most well-known approach, but it is not
recommended because other methods are just as easy and much more
efficient. The 𝜖-constraint method is still simple, but it is almost
always preferable to the weighted-sum method. If one is willing to use a more
complex approach, the normal boundary intersection method is even
more efficient at capturing a Pareto front.
Some gradient-free methods are also effective at generating Pareto
fronts, particularly a multiobjective genetic algorithm. While gradient-
free methods are sometimes associated with multiobjective problems,
gradient-based algorithms may be the more effective approach for many
multiobjective problems.

Problems

9.1 Answer true or false and justify your answer.

a) The solution of multiobjective optimization problems is


usually an infinite number of points.
b) It is advisable to include as many objectives as you can in
your problem formulation to make sure you get the best
possible design.
c) Multiobjective optimization allows us to quantify tradeoffs
between objectives and constraints.
d) If the objectives are separable, that means that they can be
minimized independently and that there is no Pareto front.

e) A point 𝐴 dominates point 𝐵 if it is better than 𝐵 in at least


one objective.
f) The Pareto front is the set of all the points that dominate all
other points in the objective space.
g) When a point is Pareto optimal, you cannot make either of
the objectives better.
h) The weighted-sum method obtains the Pareto front by solv-
ing optimization problems with different objective functions.
i) The 𝜖-constraints method obtains the Pareto front by mini-
mizing each objective in turn while constraining the others.
j) The utopia point is the point where every objective has a
minimum value.
k) It is not possible to compute a Pareto front with a single-
objective optimizer.
l) Because GAs optimize by evolving a population of diverse
designs, they can be used for multiobjective optimization
without modification.

9.2 Which of the following function value pairs would be Pareto


optimal in a multiobjective minimization problem (may be more
than one)?

• (20, 4)
• (18, 5)
• (34, 2)
• (19, 6)

9.3 You seek to minimize the following two objectives:

𝑓1 (𝑥) = 𝑥1² + 𝑥2²

𝑓2 (𝑥) = (𝑥1 − 1)² + 20(𝑥2 − 2)²

Identify the Pareto front using the weighted sum method with
11 evenly spaced weights: 0, 0.1, 0.2, . . . , 1. If some parts of the
front are underresolved, discuss how you might select weights
for additional points.

9.4 Repeat Prob. 9.3 with the epsilon-constraint method. Constrain 𝑓1


with 11 evenly spaced points between the anchor points. Contrast
the Pareto front with that of the previous problem, and discuss
whether improving the front with additional points will be easier
with the previous method, or with this method.

9.5 Repeat Prob. 9.3 with the normal boundary intersection method
using the following 11 evenly spaced points: 𝑏 = [0, 1], [0.1, 0.9], [0.2, 0.8], . . . , [1, 0]

9.6 Consider a two-objective population with the following com-


bined parent/offspring population (objective values shown for
all sixteen members).
[6.0, 8.0]
[6.0, 4.0]
[5.0, 6.0]
[2.0, 8.0]
[10.0, 5.0]
[6.0, 0.5]
[8.0, 3.0]
[4.0, 9.0]
[9.0, 7.0]
[8.0, 6.0]
[3.0, 1.0]
[7.0, 9.0]
[1.0, 2.0]
[3.0, 7.0]
[1.5, 1.5]
[4.0, 6.5]
Develop code based on the NSGA-II procedure and determine
the new population at the end of this iteration. Detail the results
of each step during the process.
10  Surrogate-Based Optimization
A surrogate model, also known as a response surface model or meta-
model, is an approximate model of a functional output that represents
a “curve fit” to some underlying data. The goal is to create a surrogate
that is much faster to compute than the original function, but that still
retains sufficient accuracy away from known data points. The surrogate
model is also usually smoother than the original function.

By the end of this chapter you should be able to:

1. Identify and describe the steps in surrogate-based opti-


mization.

2. Understand and use Latin hypercube sampling.

3. Select optimized parameters for a given surrogate model.

4. Perform cross validation as part of model selection.

5. Describe two approaches to infill.

10.1 When to Use a Surrogate

There are various scenarios for which a surrogate model might be useful:

• When the simulation is expensive to evaluate and you expect to evaluate it many times.

• When the simulation is noisy.

• When the model is derived from experimental data (which may be both expensive and noisy).

• When you want to better understand functional relationships between the inputs and outputs.


• When multiple model fidelities are involved and a surrogate may be able to provide a correction between fidelities.
Our interest is not just in building surrogate models, but rather in
performing optimization using a surrogate model. This topic is called
surrogate-based optimization (SBO). SBO is a rich subfield and this chapter
presents only a brief introduction.
A broad overview of the steps in SBO is shown in Fig. 10.1. First,
sampling methods are used to choose the initial points to evaluate
the function, or conduct experiments at. These points are sometimes
referred to as training data. Next, a surrogate model is created from the
sampled points. We then perform an optimization using the surrogate
model. Based on the results of the optimization we can include
additional points in the sample and reconstruct the surrogate (infill).
This process is repeated until some convergence criteria or maximum
number of iterations is reached. The optimization step reuses the
techniques we have already discussed. The new steps we discuss in
this chapter are sampling, constructing a surrogate, and infill.∗ In some procedures infill is omitted; the surrogate is fully constructed upfront and not subsequently updated. Many of the concepts discussed in this chapter are of broader usefulness in optimization beyond just SBO.

∗ Forrester et al.120 provides a nice introduction to this topic with much more depth than can be provided in this chapter. 120. Forrester et al., Engineering Design via Surrogate Modelling: A Practical Guide. 2008
10.2 Sampling

Sampling methods select the evaluation points for constructing the initial surrogate. These evaluation points must be chosen carefully.

Example 10.1: Full grid sampling is not scalable.

Imagine a simulation model computing the endurance of an aircraft. Assume that one simulation takes a few hours on a supercomputer and so evaluating many points is prohibitive. If we only wanted to understand how endurance varied with one variable, like wing span, we could run the simulation 10 times and likely create a fairly useful curve that could predict endurance at wing spans we didn't directly evaluate. Now imagine that we have nine additional input variables that we want to use: wing area, taper ratio, wing root twist, wing tip twist, wing dihedral, propeller spanwise position, battery size, tail area, and tail longitudinal position. If we discretized all ten variables with ten intervals each, we would need to run 10^10 simulations in order to assess all combinations. With 3 hours per simulation, that would take almost 5 million years!

Figure 10.1: Overview of the surrogate-based optimization procedure (sample, construct surrogate, perform optimization, check convergence; if not converged, infill and repeat).

Ex. 10.1 highlights one of the major challenges of sampling methods: the curse of dimensionality. For SBO, using a large number of variables

is costly, and so we need to identify the most important, or most influential, variables. There are many methods that can help us do that,
but they are beyond the scope of this introductory text. Instead, we
assume that the most influential variables have already been determined
so that the dimensionality is reasonable. However, even with a modest
number of variables, a full grid search is highly inefficient. We are
interested in sampling methods that can efficiently characterize our
design space of interest. In this introduction we focus on a popular
sampling method called Latin hypercube sampling (LHS).
LHS is a random process, but it is much more effective and efficient than pure random sampling. In random sampling each sample is independent of past samples, but in LHS we choose all samples beforehand in order to represent the variability effectively. Consider two random variables with some bounds, whose design space we could represent as a square. Say we wanted only eight samples; we could divide the design space into eight intervals in each dimension as shown
in Fig. 10.2.
A full grid search would identify a point in every little box, but this
does not scale well. To be as efficient as possible, and still cover the
variation, we would want each row and each column to have one sample
in it. In other words, the projection of points onto each dimension
should be uniform. This is called a Latin square and the generalization
to higher dimensions is a Latin hypercube. There are a large number of
possible ways we could achieve this, and some choices will be better
than others. Consider the sampling plan shown in Fig. 10.3a. This plan achieves our criteria but clearly does not fill the space and likely will not capture the relationships between design parameters well.

Figure 10.2: A two-dimensional design space divided into 8 intervals in each dimension.
Alternatively Fig. 10.3b has a sample in each row and column while
also spanning the space much more effectively.
As it turns out, LHS is itself an optimization problem. It seeks to maximize the distance between the samples with the constraint that the projection on each axis follows a chosen probability distribution (usually uniform as in the above examples, but it could also be something else like Gaussian). There are many possible solutions, and so some randomness is involved. Rather than relying on the law of large numbers to fill out our chosen probability distributions, we enforce it as a constraint. This method still generally requires a large number of samples to accurately characterize the design space, but usually far fewer than pure random sampling.
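To make the idea concrete, the following short sketch generates a Latin hypercube sample on the unit square by placing one point in each interval of every dimension and shuffling the intervals independently. It enforces only the uniform-projection property and omits the distance-maximization step described above; the function name and implementation details are ours rather than from any particular package.

    import numpy as np

    def latin_hypercube(n_samples, n_dims, seed=0):
        # One point per interval in every dimension, shuffled independently per dimension
        rng = np.random.default_rng(seed)
        samples = np.empty((n_samples, n_dims))
        for j in range(n_dims):
            edges = np.arange(n_samples) / n_samples            # lower edge of each interval
            points = edges + rng.random(n_samples) / n_samples  # random location within each interval
            samples[:, j] = rng.permutation(points)             # shuffle which sample gets which interval
        return samples

    print(latin_hypercube(8, 2))  # 8 samples in 2D: one per row and per column, as in Fig. 10.2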
Most scientific packages include convenience methods for random
sampling from typical distributions (e.g., uniform or normal), but you
can also easily sample from an arbitrary distribution using a technique

Figure 10.3: Contrasting sampling strategies that both fulfill the uniform projection requirement. (a) A sampling strategy whose projection uniformly spans each dimension but does not fill the space well. (b) A sampling strategy whose projection uniformly spans each dimension and fills the space more effectively.

called inversion sampling. Assume that we want to generate samples 𝑥


from an arbitrary PDF 𝑝(𝑥) or equivalently from the corresponding CDF
𝑃(𝑥). The probability integral transform states that for any continuous
CDF: 𝑦 = 𝑃(𝑥) the variable 𝑦 is uniformly distributed (a simple proof,
but not shown here to avoid introducing additional notation). The
procedure is to randomly sample from a uniform distribution (e.g.,
generate 𝑦), then compute the corresponding 𝑥 such that 𝑃(𝑥) = 𝑦. This
latter step is known as an inverse CDF, a percent-point function, or a
quantile function and this process is depicted pictorially in Fig. 10.4
for a normal distribution. Note that this same procedure allows you to
use LHS with any distribution, simply by generating the samples on a
uniform distribution.

Figure 10.4: An example of inversion sampling with a normal distribution. A few uniform samples are shown on the y-axis. The points are evaluated by the inverse CDF, depicted by the arrows passing through the CDF for a normal distribution (blue curve). If enough samples are drawn, the resulting distribution will be the PDF of a normal distribution (red curve).
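A minimal sketch of this procedure is shown below, assuming SciPy's scipy.stats.norm.ppf as the inverse CDF; the specific calls are our choice and are not prescribed by the text.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    y = rng.random(1000)   # uniform samples on [0, 1]; these could also come from a Latin hypercube
    x = norm.ppf(y)        # inverse CDF (percent-point function) of the standard normal
    # The resulting x values follow a normal distribution, as in Fig. 10.4.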

In addition to use in SBO, LHS is also very useful in other applica-


tions discussed in this book including: initializing a Genetic Algorithm

(Section 7.5), initializing a Particle Swarm Method (Section 7.6), per-


forming restarts in a gradient-based method for exploration of multiple
minima (Tip 4.24), or choosing the points to run in a Monte Carlo
simulation (Section 12.5.2).

10.3 Constructing a Surrogate

Many types of surrogate models are possible, some physics-based,


others mathematically-based, and some that are a hybrid (particularly
with multi-fidelity models). We will discuss the fundamentals of
a simple and frequently used mathematically-based model: linear
regression. A linear regression model does not mean that the surrogate
is linear, but that the model is linear in its coefficients (i.e., linear in the
parameters we are estimating).
In this section, we discuss two linear regression models: polynomial
models and Kriging. Many others exist, but these are chosen because of
their popularity and because they illustrate many of the important
concepts in surrogate models.
A general linear regression model looks like:

$\hat{f} = w^T \psi$   (10.1)

where 𝑤 is a vector of weights and 𝜓 is a vector of basis functions. Another popular regression model is Kriging, where the basis functions are:

$\psi^{(i)} = \exp\left(-\sum_j \theta_j \left|x_j - x_j^{(i)}\right|^{p_j}\right)$   (10.2)
In general, the basis functions can be any set of functions that we choose,
though often we would like these functions to be orthogonal.

Example 10.2: Data fitting can be posed as a linear regression model.

Consider a simple quadratic fit: $\hat{f} = a x^2 + b x + c$. This is a linear regression model because it is linear in the coefficients we wish to estimate: $w = [a, b, c]$. The basis functions would be $\psi = [x^2, x, 1]$. For a more general n-dimensional polynomial model, the basis functions would be polynomials like:

$\psi \in \{1,\ x_1,\ x_2,\ x_3,\ x_1 x_2,\ x_1 x_3,\ x_2^2,\ x_1^2 x_3,\ \dots\}$   (10.3)

To construct a linear regression model there are two tasks: 1)


determine what terms to include in 𝜓, and 2) estimate 𝑤 to minimize
some error. For a given set of 𝜓 terms, the latter is straightforward.

From sampling, we already selected a bunch of evaluation points and


computed their corresponding function values. We call these values
the training data: (𝑥 (𝑖) , 𝑓 (𝑖) ). We want to choose the weights 𝑤 to
minimize the error between our predicted function values 𝑓ˆ and the
actual function values 𝑓 (𝑖) . Errors can be positive or negative but of
course any error is bad, so what we want to minimize is not the sum of
the errors, but rather the sum of the square of the errors†:

$\text{minimize} \quad \sum_i \left( \hat{f}(x^{(i)}) - f^{(i)} \right)^2$   (10.4)

The solution to this optimization problem is called a least squares solution as discussed in Section 11.2. Let us rewrite 𝑓̂ in matrix form so that it is compact for all 𝑥^(i):

$\hat{f} = \Psi w$   (10.5)

† It is not arbitrary that we minimize the sum of the squares rather than the sum of the absolute values or some other metric. If we assume that the error in our linear regression model is independently distributed according to a Gaussian distribution (which is a typical assumption in the absence of other information), and wish to find the 𝑤 that maximizes the probability that we would observe the data 𝑓̂, then the resulting optimization problem reduces to a sum of the square of the errors.

where Ψ is the matrix:

$\Psi = \begin{bmatrix} \psi(x^{(1)})^T \\ \psi(x^{(2)})^T \\ \vdots \\ \psi(x^{(n)})^T \end{bmatrix}$   (10.6)
Thus, the same minimization problem can be expressed as:

$\text{minimize} \quad \|\Psi w - f\|_2$   (10.7)

where

$f = \begin{bmatrix} f^{(1)} \\ f^{(2)} \\ \vdots \\ f^{(n)} \end{bmatrix}$   (10.8)
The matrix Ψ is of size (𝑚 × 𝑛) where 𝑚 > 𝑛. This means that there
should be more equations than unknowns, or that we have sampled
more points than the number of polynomial coefficients we need to
estimate. This should make sense because our polynomial function
is only an assumed form, and generally not an exact fit to the actual
underlying function. Thus, we need more data to create a reasonable
fit.
This is exactly the same problem as $y = Ax$ where $A \in \mathbb{R}^{m \times n}$. There are more equations than unknowns so generally there is not a solution (the problem is called overdetermined). Instead, we seek the solution that minimizes the error $\|Ax - y\|_2$.

Tip 10.3: Least squares is not the same as a linear system solution.

In Julia or Matlab you can solve this with x = A\b, but keep in mind that for $A \in \mathbb{R}^{m \times n}$ this syntax performs a least-squares solution, not a linear system solution as it would for a full-rank n × n system. This overloading is generally not used in other languages; for example, in Python rather than using numpy.linalg.solve you would use numpy.linalg.lstsq.
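As a brief illustration, the following sketch fits the quadratic basis ψ = [x², x, 1] of Ex. 10.2 with numpy.linalg.lstsq; the data values here are made up for illustration and are not from the text.

    import numpy as np

    # Illustrative training data (x^(i), f^(i))
    rng = np.random.default_rng(0)
    x = np.linspace(-2.0, 2.0, 10)
    f = 1.5 * x**2 - 0.3 * x + 0.8 + rng.normal(0.0, 0.1, x.size)

    # Psi matrix (Eq. 10.6) for the basis psi = [x^2, x, 1]; more rows than columns
    Psi = np.column_stack([x**2, x, np.ones_like(x)])

    # Least-squares solution for the weights w (Eq. 10.7)
    w, *_ = np.linalg.lstsq(Psi, f, rcond=None)
    print(w)  # approximately [1.5, -0.3, 0.8]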

The other important consideration for developing our surrogate


model is choosing the basis functions in 𝜓. Sometimes you may know
something about the model behavior and thus what type of basis
functions should be used, but generally the best way to determine the
basis functions is through cross validation. Cross validation is also
useful in characterizing error, even if we already have a chosen set of
basis functions.

Example 10.4: The dangers of overfitting.

Consider a simple underlying function with Gaussian noise added to simulate experimental or noisy computational data:

$y = 0.2 x^4 + 0.2 x^3 - 0.9 x^2 + \mathcal{N}(0, \sigma), \quad \text{where } \sigma = 0.75$   (10.9)

We will create the data at 20 points in the interval [−3, 2] (see Fig. 10.5).

Figure 10.5: Data from a numerical model or experiments.

For a real problem we do not know the underlying function and the dimensionality is often too high to visualize. Determining the right basis functions to use can be difficult. If we are using a polynomial basis we might try to determine the order by trying each case (e.g., quadratic, cubic, quartic, etc.) and measuring the error in our fit (Fig. 10.6).
It seems as if the higher the order of the polynomial, the lower the
error. For example, a 20th order polynomial reduces the error to almost zero.
The problem is that while the error may be low on this set of data, we expect
the predictive capability of such a model for future data points to be poor. For
example, Fig. 10.7 shows a 19th order fit to the data. The model passes right
through the points, but its predictive capability is poor.

Figure 10.6: Error in fitting data with different order polynomials.

Figure 10.7: A 19th order polynomial fit to the data. Low error, but poor predictive ability.

This is called overfitting. Of course, we also want to avoid the opposite


problem of underfitting where we do not have enough degrees of freedom to
create a useful model (e.g., think of a linear fit for the above example).

The solution to the overfitting problem highlighted in Ex. 10.4 is cross validation. Cross validation means that we use one set of data for training (creating the model), and a different set of data for assessing its predictive error. There are many different ways to perform cross-validation; we will describe two. Simple cross validation consists of the following steps:

1. Randomly split your data into a training set and a validation set
(e.g., a 70/30 split).

2. Train each candidate model (the different options for 𝜓) using


only the training set, but evaluate the error with the validation
set.

3. Choose the model with the lowest error on the validation set, and
optionally retrain that model using all of the data.

An alternative option may be useful if you have few data points and
so can’t afford to leave much out for validation. This method is called
k-fold cross validation and while it is more computationally intensive, it
makes better use of all of your data:

1. Randomly split your data into 𝑛 sets (e.g., 𝑛 = 10).

2. Train each candidate model using the data from all sets except
one (e.g., 9 of the 10 sets), and use the remaining set for valida-
tion. Repeat for all 𝑛 possible validation sets and average your
performance.

3. Choose the model with the lowest average error on all the 𝑛
validation sets.
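A compact sketch of k-fold cross validation for selecting a polynomial order is shown below. The use of numpy.polyfit and a mean-squared validation error are our own choices, and x_data and f_data stand for whatever training data is available.

    import numpy as np

    def kfold_error(x, f, order, k=10, seed=0):
        # Average validation error of a polynomial fit of the given order over k folds
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(x)), k)
        errors = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            coeffs = np.polyfit(x[train], f[train], order)   # train on the other k-1 folds
            pred = np.polyval(coeffs, x[val])                 # evaluate on the held-out fold
            errors.append(np.mean((pred - f[val]) ** 2))
        return np.mean(errors)

    # Choose the order with the lowest average validation error (x_data, f_data are the training data)
    # best_order = min(range(1, 21), key=lambda n: kfold_error(x_data, f_data, n))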

Example 10.5: Cross validation is used to avoid overfitting.

This example continues from Ex. 10.4. First, we perform k-fold cross-
validation using ten divisions. The average error across the divisions using the
training data is shown in Fig. 10.8.

Figure 10.8: Error from k-fold cross validation.

The error becomes extremely large as the polynomial order becomes large.
Zooming in on the flat region we see a range of options with similar errors.
Amongst similar solutions, one generally prefers the simplest model. In this
case, a fourth-order polynomial seems reasonable. A fourth-order polynomial
is compared against the data in Fig. 10.9. This model has much better predictive
capability.

Figure 10.9: A 4th order polynomial fit to the data.

An important class of linear regression models are those that use radial basis functions. A radial basis function is a basis function that depends on distance from some center point:

$\psi^{(i)} = \psi\left(\left\|x - c^{(i)}\right\|\right),$   (10.10)

of which one of the more popular models is Kriging:

$\psi^{(i)} = \exp\left(-\sum_j \theta_j \left|x_j - x_j^{(i)}\right|^{p_j}\right)$   (10.11)

A Kriging basis is a generalization of a Gaussian basis (which would have $\theta = 1/\sigma^2$ and $p = 2$). These types of models are useful because in
addition to creating a model they also predict the uncertainty in the
model through the surrogate. An example of this is shown in Fig. 10.10.
Notice how the uncertainty goes to zero at the known data points, and
becomes largest when far from known data points.

Figure 10.10: A Kriging fit to the input data (circles) and a shaded confidence interval.
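To illustrate Eq. 10.11, the following sketch evaluates the Kriging basis functions at a point for a given set of centers; the θ and p values are arbitrary illustrative choices (p = 2 with θ = 1/σ² recovers the Gaussian special case).

    import numpy as np

    def kriging_basis(x, centers, theta, p):
        # Eq. 10.11: one basis value per center x^(i)
        diff = np.abs(x - centers)                    # |x_j - x_j^(i)| for every center and dimension
        return np.exp(-np.sum(theta * diff**p, axis=1))

    # Arbitrary illustrative values
    centers = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
    theta = np.array([1.0, 0.5])
    p = np.array([2.0, 2.0])                          # p = 2 gives the Gaussian special case
    print(kriging_basis(np.array([0.5, 0.5]), centers, theta, p))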

The key advantage of a Gaussian process model is this ability to


predict not just a surrogate function but also its uncertainty. Another
advantage is its flexibility. Unlike the polynomial models, we make
fewer assumptions upfront. This flexibility is also a disadvantage in that
we generally need many more function calls to produce a reasonable
model. The other main disadvantage of Gaussian process models is
that they tend to introduce lots of local minima.

Tip 10.6: Surrogate modeling toolbox.

The surrogate modeling toolbox‡ is a useful package for surrogate modeling with a particular focus on providing derivatives for use in gradient-based optimization.

‡ https://smt.readthedocs.io/

10.4 Infill

There are two main approaches to infill: prediction-based exploitation


and error-based exploration. Typically, only one infill point is chosen at
a time with the assumption that evaluating the model is computationally
expensive, but recreating and evaluating the surrogate is cheap.
For the polynomial models discussed in the previous section the only
real option is exploitation. A prediction-based exploitation infill strategy
adds an infill point wherever the surrogate predicts the optimum. The
reasoning behind this method is that in SBO we do not necessarily care
about having a globally accurate surrogate, but rather only care about
having an accurate surrogate near the optimum. The most reasonable
point to sample at is the optimum predicted by the surrogate. Likely,
the location predicted by the surrogate will not be at the true optimum,
but it will add valuable information as we recreate the surrogate and
reoptimize, repeating the process until convergence. This approach
generally allows for the quickest convergence to an optimum. Its
downside is that for problems with multiple local optima we are likely
to converge prematurely to an inferior local optimum.
An alternative approach is called error-based exploration. This
approach requires the use of a Gaussian process model (mentioned in
the previous section) that not only predicts function values, but also
uncertainties. In exploration we may not want to just sample where the
mean is low (this is the same as exploitation), but we also do not want
to just sample where the error is high (this is essentially just a larger
sampling plan). Instead, we want to sample where the probability of
finding improvement is highest. One metric with this intent is called
expected improvement.
Let the best solution we have found so far be called 𝑥 ∗ , and 𝑓 (𝑥) is
our objective function. The improvement for any new test point 𝑥 is
then given by:

$I(x) = \max\left(f(x^*) - f(x),\ 0\right)$   (10.12)
In other words, if 𝑓 (𝑥) ≥ 𝑓 (𝑥 ∗ ) there is no improvement but if 𝑓 (𝑥) <
𝑓 (𝑥 ∗ ) then the improvement is just the magnitude of the objective
decrease. However, we need to keep in mind that 𝑓 (𝑥) is not a de-
terministic value in this model, but rather a probability distribution.
Thus, the expected improvement is the expected value (or mean) of the improvement:

$EI(x) = \mathbb{E}\left[\max\left(f(x^*) - f(x),\ 0\right)\right].$   (10.13)
For a Gaussian process model, the expected value can be determined
analytically. The selected infill point is the point with the highest

expected improvement. After sampling, we recreate the surrogate and repeat.
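The text does not write out the analytic expression; a commonly used closed form for a Gaussian posterior with mean μ and standard deviation σ at the candidate point (stated here as an assumption rather than quoted from the text) is sketched below.

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, f_best):
        # Closed-form expected improvement (for minimization) assuming a Gaussian
        # posterior with mean mu and standard deviation sigma; f_best = f(x*)
        sigma = np.maximum(sigma, 1e-12)   # guard against zero predicted uncertainty
        z = (f_best - mu) / sigma
        return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)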

Example 10.7: Expected improvement.

Consider the one-dimensional function with data points and fit shown
in Fig. 10.11.§ The best point we have found so far is denoted in the figure as (𝑥∗, 𝑓∗). For a Gaussian process model, the fit also provides an uncertainty distribution as shown in the shaded region.

§ This data is based on an example from Rajnarayan et al.121 121. Rajnarayan et al., A Multifidelity Gradient-Free Optimization Method and Application to Aerodynamic Design. 2008

Figure 10.11: A one-dimensional function with a Gaussian process model surrogate fit and uncertainty.

Now imagine we want to evaluate this function at some new test point
𝑥test = 0.5. In Fig. 10.12 the probability distribution for the objective at 𝑥test is
shown in red (imagine that the probability distribution was coming out of the
page). The shaded blue region is the probability of improvement over the best
point. The expected improvement is similar to the probability of improvement, but rather than returning a probability it returns the magnitude of improvement expected. That magnitude may be more helpful in defining a stopping criterion as opposed to a probability.

Figure 10.12: At a given test point (𝑥test = 0.5) we highlight the probability distribution and the expected improvement in the shaded blue region.

Now, let us evaluate the expected improvement not just at 𝑥 test = 0.5 but
across the domain. The result is shown by the red function in Fig. 10.13. The
spike on the right tells us that we expect improvement by sampling close to our
best known point, but the expected improvement is rather small. The spike on
the left tells us that there is a promising region where the surrogate suggests a
relatively high potential improvement. Notice that the metric does not simply

capture regions with high uncertainty, but rather regions with high uncertainty
in areas that are likely to lead to improvement. For our next sample, we would
choose the location with the highest expected improvement, recreate the fit
and repeat.

Figure 10.13: The process is repeated by evaluating expected improvement across the domain.

10.5 Deep Neural Networks

Neural networks loosely mimic the brain, which consists of a vast


network of neurons. In neural networks, each neuron is a node that
represents a simple function. A network defines chains of these simple
functions to obtain composite functions that are much more complex.
For example, three simple functions: 𝑓 (1) , 𝑓 (2) , 𝑓 (3) may be chained into
the composite function (or network):

𝑓 (𝑥) = 𝑓 (3) ( 𝑓 (2) ( 𝑓 (1) (𝑥))) (10.14)

Even though each individual function may be simple, the composite


function can exhibit complex behavior. Most deep neural networks
are feedforward networks, meaning information flows from inputs 𝑥 to
outputs 𝑓 . Recurrent neural networks include feedback connections.
Figure 10.14 shows a diagram of a neural network. Each node repre-
sents a neuron, and these neurons are connected between consecutive
layers forming a dense network. The first layer is called the input layer,
the last is called the output layer, and the middle layer(s) are called
hidden layers. The total number of layers is called the network’s depth.
The usage of the phrase deep neural networks instead of just neural
networks or artificial neural networks reflects recent advances that have
allowed researchers to train networks that are much deeper than was

possible before. The phrase neural is used because these models are
inspired by neurons in a brain.

Figure 10.14: A representation of a small neural net, with an input layer, hidden layers, and an output layer.

The first and last layers are the inputs and outputs of our surrogate
model. Each neuron in the hidden layer represents a function. This
means that the output from a neuron is a number, and thus the output
from a whole layer can be represented as a vector 𝑥. We call $x^{(k)}$ the vector of values for layer 𝑘, and $x_i^{(k)}$ is the value for the 𝑖th neuron
in layer 𝑘. Let us consider just one neuron in layer 𝑘. This neuron is
connected to many neurons from the previous layer 𝑘 −1 (see first part of
Fig. 10.15). We need to choose a functional form for this neuron taking
in the values from the previous layer as inputs. A linear function is too
simple. Chaining together linear functions will only result in a linear
composite function, so the function for this neuron must be nonlinear.
The most common choice for hidden layers is a linear function passed
through a second activation function that creates the nonlinearity. Let
us first focus on the linear portion, which produces an intermediate
variable we call 𝑧 (see Fig. 10.15):

$z = \sum_{j=1}^{n} w_j x_j^{(k-1)} + b$   (10.15)

or in vector form:

$z = w^T x^{(k-1)} + b$   (10.16)
Notice that the first term is just a weighted sum of the values from the
neurons in the previous layer. The 𝑤 vector contains the weights. The
𝑏 term is called the bias, which provides an offset allowing us to scale

the significance of the overall output. This process is summarized in


the second column of Fig. 10.15.

Figure 10.15: Typical functional form for a neuron in the neural net (inputs, weights, summation and bias, activation, output).

Next, we pass 𝑧 through an activation function, which we will call 𝑎(𝑧). Historically, a sigmoid function (top of Fig. 10.16) was almost always used as the activation function:

$a(z) = \frac{1}{1 + e^{-z}}$   (10.17)

Notice that this function produces values between zero and one, so large negative values would become insignificant (close to zero) and large positive values would produce results close to one. Most modern neural nets now use a rectified linear unit (ReLU) as the activation function (bottom of Fig. 10.16):

$a(z) = \max(0, z)$   (10.18)

The ReLU has been found to be far more effective in producing accurate neural nets. Notice that this activation function completely eliminates negative inputs. Thus, we see that the bias term can be thought of as a threshold establishing what constitutes a significant value. This last step is summarized in the final two columns of Fig. 10.15.

Figure 10.16: Activation functions: sigmoid (top) and ReLU (bottom).

Combining the linear function with the activation function produces


the output for this 𝑖th neuron:

$x_i^{(k)} = a\left(w^T x^{(k-1)} + b_i\right)$   (10.19)

To compute across all the neurons in this layer, the weights 𝑤 for this one neuron would form one row in a matrix of weights 𝑊:

$\begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_i^{(k)} \\ \vdots \\ x_{n_k}^{(k)} \end{bmatrix} = a\left( \begin{bmatrix} \vdots \\ W_{i1} \;\dots\; W_{ij} \;\dots\; W_{i,n_{k-1}} \\ \vdots \end{bmatrix} \begin{bmatrix} x_1^{(k-1)} \\ \vdots \\ x_j^{(k-1)} \\ \vdots \\ x_{n_{k-1}}^{(k-1)} \end{bmatrix} + \begin{bmatrix} b_1 \\ \vdots \\ b_i \\ \vdots \\ b_{n_k} \end{bmatrix} \right)$   (10.20)

or

$x^{(k)} = a\left(W x^{(k-1)} + b\right)$   (10.21)

The activation function is applied separately for each row. The below equation is more explicit (where $w_i$ is the 𝑖th row of 𝑊), though we generally use the above equation as shorthand:

$x_i^{(k)} = a\left(w_i^T x^{(k-1)} + b_i\right)$   (10.22)
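A minimal sketch of the forward pass described by Eqs. 10.14 to 10.21 is shown below. For simplicity it applies the ReLU activation to every layer (including the output layer, which one would normally leave linear), and the layer sizes and random weights are illustrative only.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)                  # Eq. 10.18

    def layer(x_prev, W, b):
        return relu(W @ x_prev + b)                # Eq. 10.21: x^(k) = a(W x^(k-1) + b)

    def forward(x, weights, biases):
        for W, b in zip(weights, biases):          # chain the layers as in Eq. 10.14
            x = layer(x, W, b)
        return x

    # Illustrative 5-7-7-4 network like Fig. 10.14, with random weights and biases
    rng = np.random.default_rng(0)
    sizes = [5, 7, 7, 4]
    weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.normal(size=m) for m in sizes[1:]]
    print(forward(rng.normal(size=sizes[0]), weights, biases))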

Our neural net is now parameterized by a bunch of weights and


biases, and we need to determine the optimal value for these parameters
(e.g., train the network) by supplying a large set of training data. In the
example of Fig. 10.14 there is a layer of 5 neurons, 7 neurons, 7 neurons,
then 4 neurons and so there would be 5 x 7 + 7 x 7 + 7 x 4 weights and 7
+ 7 + 4 bias terms giving a total of 130 design variables. This represents
a very small neural net as there are few inputs and few outputs. Large
neural nets can have millions of variables. We need to optimize those
design variables to minimize a cost function.
The cost function combines the results from the output layer into
a single objective. We will call the inputs to the neural net 𝑥, the
outputs from the neural net 𝑓 (𝑥), and use 𝑦 to represent the training
data. Ideally, we want to adjust the parameters such that 𝑓 (𝑥) closely
matches the data 𝑦. For example, 𝑥 could be parameters that set the
shape of a structure, and the outputs 𝑦 the stresses at various locations
in the structure. We would like our surrogate 𝑓 (𝑥) to accurately
predict these stresses. An objective used in many machine learning
problems is maximum likelihood estimation. In other words, we choose
the parameters 𝜃 (weights and biases in this case) to maximize the
probability of observing the output data conditioned on our inputs 𝑥.

$\max_{\theta} \; p(y \,|\, x;\, \theta)$   (10.23)

This is not conditioned on the parameters 𝜃 because we do not assume


that those are also random variables. If we assume that the inputs
are independently drawn, as is typical, then the probability can be
expressed as a product across all the samples in the training data:

$\max_{\theta} \; \prod_{i=1}^{n} p\left(y^{(i)} \,|\, x^{(i)};\, \theta\right)$   (10.24)

We now take the log of the objective, which does not change the solution,
but changes the products to a better numerically behaved summation.
We also add a negative sign up front so that the problem is one of
minimization:
$\min_{\theta} \; -\sum_{i=1}^{n} \log\left(p\left(y^{(i)} \,|\, x^{(i)};\, \theta\right)\right)$   (10.25)

This is a typical cost function, but if we further assume that the


probability distribution is normally distributed (Gaussian), then this
cost function simply reduces to a familiar sum of square errors:

$\min_{\theta} \; \sum_{i=1}^{n} \left( f(x^{(i)}) - y^{(i)} \right)^2$   (10.26)

We now have the objective and design variables in place to train


the neural net. Like the other models discussed in this chapter it is
critical to hold out some of the training data for cross validation. Two
unique considerations for neural nets involve derivative computation,
and the optimization algorithm used for training. We consider these
topics next.
Because neural networks have many inputs but generally only
have one output function, they are well suited to using reverse mode
algorithmic differentiation (Section 6.6) to compute gradients. The
reverse mode AD is so common that in the machine learning community
it is simply called back propagation. However, for machine learning
practitioners, back propagation is not usually performed at the code
level as it is in general reverse mode AD, but rather is defined for
larger sets of operations, often requiring the user to use specialized libraries or allowing them to define their own adjoints. While less general, this approach can enable increased efficiency and stability. The ReLU activation function (bottom of Fig. 10.16) is not differentiable right at 𝑧 = 0, but in practice this is generally not problematic, especially because these methods typically rely on inexact gradients anyway, as discussed next.

The form of the objective discussed in this section, and in many


other machine learning problems is of the form:
$f(x) = \sum_{i=1}^{n} \hat{f}(x^{(i)})$   (10.27)

where 𝑥 (𝑖) is the 𝑖 th sample from the training set and 𝑓ˆ is any function
that operates on one training sample. As seen in this section, the
objectives commonly used for many machine learning applications
fit this form (e.g., negative log likelihood, or a squared error). The
difficulty with these problems is that we often have large training sets,
sometimes with 𝑛 in the billions. That means that computing the
objective can be time consuming, but computing the gradient is even
more time consuming.
If we divide the objective by 𝑛 (which does not change the solution), we can see that the objective function is an expectation:

$f(x) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x^{(i)})$   (10.28)

We know that we can reasonably estimate an expected value from a smaller set of random samples. We call this subset of samples a minibatch $\mathcal{S} = \{x^{(1)}, \dots, x^{(m)}\}$ where 𝑚 is usually between one and a few hundred. The entries $1, \dots, m$ do not correspond to the first 𝑚 training samples but are drawn randomly from a uniform probability distribution (Fig. 10.17). Using the minibatch we can estimate the gradient as the average of the gradients for each training point:
$\nabla_x f(x) \approx \frac{1}{m} \sum_{i \in \mathcal{S}} \nabla_x \hat{f}(x^{(i)})$   (10.29)

Thus, we divide our training data into these minibatches and use a new
minibatch to estimate the gradients at each iteration in the optimization.

Figure 10.17: A simplified example of how training data is randomly assigned into minibatches (the data is split into training and testing sets, and the training set is divided into minibatches).

This approach works well for these specific problems because of the unique form for the objective. As an example, if there were one

million training samples then a single gradient evaluation would require


evaluating all one million training samples. Alternatively, for a similar
cost, a minibatch approach could update the design variables a million
times using the gradient estimated from one training sample at a time.
This latter process often converges much faster, especially because in
these problems we generally do not care if we arrive at the absolute
minimum.
Typically, this gradient is used with steepest descent (Section 4.4.1), also known as gradient descent. As discussed in Chapter 4, steepest descent is usually not very effective; however, for machine learning applications stochastic gradient descent has been found to work very
well. This suitability is primarily because many machine learning
optimizations are performed repeatedly, the true objective is often
difficult to formalize, and finding the absolute minimum is not as
important as finding a good enough solution quickly. One key difference
in stochastic gradient descent is that we do not perform a line search.
Rather the step size, called the learning rate in these applications, is
a preselected size (often chosen somewhat arbitrarily) that is usually
decreased between major iterations in the optimization.
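The following sketch shows one way to implement minibatch stochastic gradient descent with a decaying learning rate; the function names, batch size, and decay factor are illustrative choices, and grad_fhat is assumed to return the minibatch-averaged gradient of Eq. 10.29.

    import numpy as np

    def sgd(grad_fhat, x_train, y_train, theta0, lr=0.01, batch_size=32, epochs=10, seed=0):
        # grad_fhat(theta, xb, yb) is assumed to return the minibatch-averaged gradient (Eq. 10.29)
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        n = len(x_train)
        for epoch in range(epochs):
            idx = rng.permutation(n)                       # draw minibatches randomly, as in Fig. 10.17
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                g = grad_fhat(theta, x_train[batch], y_train[batch])
                theta -= lr * g                            # fixed step (learning rate); no line search
            lr *= 0.9                                      # decrease the learning rate between epochs
        return theta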
Stochastic minibatching is easily applied to first-order methods
and has thus driven innovation in alternative first-order methods
that improve on stochastic gradient descent like Momentum, Adam,
AMSGrad, etc.122 While some of these methods may seem rather ad hoc, there is mathematical rigor to many of them.123 Batching makes the gradients noisy and so second-order methods are generally not pursued. However, ongoing research is exploring stochastic batch approaches that might effectively leverage the benefits of second-order methods.

122. Ruder, An overview of gradient descent optimization algorithms. 2016
123. Goh, Why Momentum Really Works. 2017

10.6 Summary

Surrogate-based optimization can be a particularly effective way to


incorporate simulations that are expensive or noisy. The first step
in surrogate construction is sampling, in which we select evaluation
points for constructing an initial surrogate. Full grid searches are
too expensive for even a modest number of variables and so we need
techniques that provide good coverage with a small number of samples.
A popular technique for this application is LHS (which itself requires
solving an optimization problem).
The next step is surrogate selection and construction. Linear re-
gression models are popular approaches that cover a wide variety of
models. Despite the name linear, these surrogates are generally highly
nonlinear functions, but are linear in the coefficients (which is what

we are selecting in the model construction process). Some common


examples of linear regression models include polynomial models and
Kriging (which is a subset of Gaussian radial basis functions). Data is
used to train the model and select appropriate coefficients (an optimiza-
tion problem). Cross validation is a critical component of this process.
What we really want is good predictive capability, which means that
the models work well on data that the model has not been trained
against. Model selection often involves tradeoffs of more rigid models
that do not need as much training data, versus more flexible models
that require more training data.
The last step of the process is infill where points sampled during
the process of optimization are used to update the surrogate. Some ap-
proaches are exploitation based, where we perform optimization using
the surrogate, and use the optimal solution to update the model. For
other models where uncertainty estimates are provided, exploration-
based approaches can be used where we sample not just at the deter-
ministic optimum, but at points where the expected improvement is
high.
Finally, we discussed deep neural nets, which are another common surrogate model. While the general process is similar, there are some
unique considerations. Neural nets are extremely flexible, but the
downside of such flexibility is that large amounts of training data are
needed to produce useful models. Approaches like backpropagation
and stochastic gradients are needed to efficiently manage the large
amount of training data.

Problems

10.1 Answer true or false and justify your answer.

a) You should use surrogate-based optimization when a problem has an expensive simulation and many design variables because it is immune to the “curse of dimensionality”.
b) LHS is a random process that is more efficient than pure
random sampling.
c) LHS seeks to minimize the distance between the samples
with the constraint that the projection on each axis follow a
chosen probability distribution.
d) Polynomial regressions are not considered to be surrogate
models because they are too simple and do not consider any
of the model physics.

e) There can be some overlap between the training points and cross-validation points, as long as that overlap is small.
f) Cross-validation is a required step in surrogate-based opti-
mization.
g) The more points you use to train a surrogate model, the
more accurate it gets.
h) In addition to modeling the function values, Kriging surro-
gate models also provide an estimate of the uncertainty in
the values.
i) A prediction-based exploitation infill strategy adds an infill
point wherever the surrogate predicts the largest error.
j) Maximizing the expected improvement maximizes the prob-
ability of finding a better function value.
k) Neural networks require many nodes with a variety of
sophisticated activation functions to represent challenging
nonlinear models.
l) Back propagation is the computation of the derivatives of
the neural net output with respect to the activation function
weights using reverse mode AD.
10.2 Latin hypercube sampling. Use a LHS package to create and plot
20 points across two dimensions with uniform projection in both
dimensions.
10.3 Inversion sampling. Use inversion sampling with Latin hypercube
sampling to create and plot 100 points across two dimensions.
Each dimension should follow a normal distribution with zero
mean and a standard deviation of 1 (cross-terms in covariance
matrix are 0).
10.4 Linear regression. Use the following training data sampled at 𝑥
with resulting function value 𝑓 :
𝑥 = [ − 2.0000, −1.7895, −1.5789, −1.3684, −1.1579,
− 0.9474, −0.7368, −0.5263, −0.3158, −0.1053,
0.1053, 0.3158, 0.5263, 0.7368, 0.9474,
1.1579, 1.3684, 1.5789, 1.7895, 2.0000]

𝑓 = [7.7859, 5.9142, 5.3145, 5.4135, 1.9367,


2.1692, 0.9295, 1.8957, −0.4215, 0.8553,
1.7963, 3.0314, 4.4279, 4.1884, 4.0957,
6.5956, 8.2930, 13.9876, 13.5700, 17.7481]

Use linear regression to determine the coefficients for a polynomial basis of [x², x, 1] to predict 𝑓(𝑥). Plot your fit against the training data and report the coefficients for the polynomial bases.

10.5 Cross validation. Use the following training data sampled at 𝑥


with resulting function value 𝑓 :

𝑥 = [ − 3.0, −2.6053, −2.2105, −1.8158, −1.4211,


− 1.0263, −0.6316, −0.2368, 0.1579, 0.5526,
0.9474, 1.3421, 1.7368, 2.1316, 2.5263,
2.9211, 3.3158, 3.7105, 4.1053, 4.5]

𝑓 = [43.1611, 28.1231, 12.9397, 3.7628, −2.5457,


− 4.267, 2.8101, −0.6364, 1.1996, −0.9666,
− 2.7332, −6.7556, −9.4515, −7.0741, −7.6989,
− 8.4743, −7.9017, −2.0284, 11.9544, 33.7997]

a) Create a polynomial surrogate model using the set of polynomial basis functions $x^i$ for 𝑖 = 0 : 𝑛. Plot the error in the surrogate model while increasing 𝑛 (the maximum order of the polynomial model) from 1 to 20.
b) Plot the polynomial fit for 𝑛 = 16 against the data and
comment on its suitability.
c) Recreate the error plot versus polynomial order using k-fold
cross validation with ten divisions. Be sure to limit the
y-axes to the area of interest.
d) Plot the polynomial fit against the data for a polynomial order
that produces low error under cross validation, and report
the coefficients for the polynomial. Justify your selection.

10.6 Wave drag minimization using a surrogate. Minimize the drag


of a supersonic body of revolution using a global polynomial
surrogate model. The provided analysis code is somewhat noisy,
similar to what might exist with experimental data or with some
grid-based simulations, hence the use of a surrogate. The details
of this problem are here.
Present your methodology and discuss your results and lessons
learned.
11 Convex Optimization
General nonlinear optimization problems are difficult to solve. De-
pending on the particular optimization algorithm, they may require the selection of tuning parameters, derivatives, appropriate scaling,
and trying different starting points. Convex optimization problems
do not have any of those issues and are thus relatively easy to solve.
The difficulty is that some strict requirements must be met. Even for
candidate problems that have the potential to be convex, significant
experience is often needed to recognize and utilize techniques that
reformulate the problems into an appropriate form.

By the end of this chapter you should be able to:

1. Understand the benefits and limitations of convex optimization.

2. Identify and solve linear and quadratic optimization problems.

3. Formulate and solve convex optimization problems.

4. Identify and solve geometric programming problems.

11.1 Introduction

Convex optimization problems have desirable characteristics that make


them more predictable and easier to solve. Since a convex problem
has provably only one optimum, convex optimization methods always
converge to the global minimum. Solving convex problems is straight-
forward and does not require a starting point, parameter tuning, or
derivatives, and they can scale efficiently even for problems with mil-
lions of design variables.124 All we need to do to solve a convex problem is set it up properly; there is no need to worry about convergence, local optima, or noisy functions.

124. Diamond et al., Convex Optimization with Abstract Linear Operators. 2015

Some of the convex problems are so straightforward to solve that they are often not recognized as an


optimization problem and are just thought of as a function or operation.


A familiar example of convex optimization is the linear-least-squares
problem (described in a subsequent section).
While these are very desirable properties, the catch is that for
an optimization problem to be convex, it must satisfy some strict
requirements. Namely, the objective and all inequality constraints must
be convex functions, and the equality constraints must be affine.∗ A function 𝑓 is convex if:

$f\big((1 - \eta) x_1 + \eta x_2\big) \le (1 - \eta) f(x_1) + \eta f(x_2)$   (11.1)

for all $x_1$ and $x_2$ in the domain, where $0 \le \eta \le 1$.

∗ An affine function consists of a linear transformation and a translation. Informally, this type of function is often referred to as linear (including in this book), but strictly speaking these are distinct concepts. For example: 𝐴𝑥 is a linear function in 𝑥, whereas 𝐴𝑥 + 𝑏 is an affine function in 𝑥.

This requirement is
illustrated in Fig. 11.1 for the one-dimensional case. The right-hand
side of the inequality is just the equation of a line from 𝑓 (𝑥1 ) to 𝑓 (𝑥 2 )
(the blue line), whereas the left-hand side is the function 𝑓 (𝑥) evaluated
at all points between 𝑥 1 to 𝑥 2 (the black curve). The inequality says that
the function must always be below a line joining any two points in the
domain. Stated informally, a convex function looks something like a
bowl everywhere.
Unfortunately, even these strict requirements are not enough. In
general, we cannot identify a given problem as convex or take advantage
of its structure to solve it efficiently, and thus must treat it as a general
nonlinear problem. There are two approaches to take advantage of convexity. The first one is to directly formulate the problem in a known convex form, such as a linear program or a quadratic program (discussed later in this chapter). The second option is to use disciplined convex programming, which is a specific set of rules and mathematical functions that one can use to build up a convex problem. By following these rules, one can always translate the problem into an efficiently solvable form automatically.

Figure 11.1: Illustration of what it means for a function to be convex in the one-dimensional case. The function (black) must be below a line that connects any two points (blue) in the domain.

While both of these approaches are straightforward to apply, they also expose the main weakness of these methods: we need to be able
While both of these approaches are straightforward to apply, they line that connects any two points
also expose the main weakness of these methods: we need to be able (blue) in the domain.
to express the objective and inequality constraints using only these
elementary functions and operations. In most cases, this requirement
means that the model must be simplified. Often, a problem is not
directly expressed in a convex form and a combination of experience
and creativity is needed to reformulate the problem in an equivalent
manner that is convex.
Simplifying models usually results in a reduction in fidelity. This
is less problematic for optimization problems that are intended to be
solved repeatedly, such as in optimal control and machine learning,
domains in which convex optimization is heavily used. In these cases,
simplification by local linearization, for example, is less problematic

because the linearization can be updated in the next time step. However,
this reduction in fidelity is problematic for design applications. In
design scenarios, the optimization is performed once, and the design
cannot continue to be updated after it is created. For this reason, convex optimization is less frequently used for design applications, with the exception of some limited uses of geometric programming, a topic
discussed in more detail in Section 11.6.
This chapter is introductory in nature, focusing only on understanding what convex optimization is useful for and describing some of the most widely used forms.† The known categories of convex optimization problems include: linear programming, quadratic programming, second-order cone programming, semidefinite programming, cone programming, and graph form programming. Each of these categories is a subset of the next (Fig. 11.2). We will focus on the first three because they are the most widely used, including in other chapters in this book.‡ The latter three forms are less frequently formulated directly. Instead, the user applies elementary functions and operations, rules specified by disciplined convex programming, and a software tool transforms the problem into a suitable conic form that can be solved. This procedure is described in Section 11.5.

† Boyd et al.125 provides a good starting point for those seeking a more complete introduction into the field of convex optimization. 125. Boyd et al., Convex Optimization. 2004
‡ Many good references exist with examples for those categories we do not discuss in detail.126–127 126. Lobo et al., Applications of second-order cone programming. 1998 127. Vandenberghe et al., Applications of semidefinite programming. 1999
128. Vandenberghe et al., Semidefinite Programming. 1996
129. Parikh et al., Block splitting for distributed optimization. 2013

After covering those main categories of convex optimization we discuss geometric programming. Geometric programming problems are actually not convex, but with a change of variables, they can be transformed into an equivalent convex form extending the types of
problems that can be solved with convex optimization.

Figure 11.2: Relationship between various convex optimization problems: linear programming (LP) ⊂ quadratic programming (QP) ⊂ second-order cone programming (SOCP) ⊂ semidefinite programming (SDP) ⊂ cone programming (CP) ⊂ graph form programming (GFP).

11.2 Linear Programming

A linear program (LP) is an optimization problem with a linear objective and linear constraints and can be written as

$\begin{aligned} \text{minimize} \quad & f^T x \\ \text{subject to} \quad & A x + b = 0 \\ & C x + d \le 0, \end{aligned}$   (11.2)

where 𝑓, 𝑏, and 𝑑 are vectors and 𝐴 and 𝐶 are matrices. All LPs are convex.

Example 11.1: Formulating a linear programming problem.

Suppose we are going shopping and want to figure out how to best meet
our nutritional needs for the least amount of cost. We enumerate all the
food options, and use the variable 𝑥 𝑗 to represent how much of food 𝑗 we
will purchase. The parameter 𝑐 𝑗 is the cost of a unit amount of food 𝑗. The

parameter 𝑁𝑖𝑗 is the amount of nutrient 𝑖 contained in a unit amount of food


𝑗. We need to make sure we have at least 𝑟 𝑖 of nutrient 𝑖 to meet our dietary
requirements. We can now formulate this as an optimization problem. We
wish to minimize the cost of our food:
$\text{minimize} \quad \sum_j c_j x_j = c^T x$   (11.3)

To meet the nutritional requirement of nutrient 𝑖 we need to satisfy:

$\sum_j N_{ij} x_j \ge r_i \;\Rightarrow\; N x \ge r.$   (11.4)

Finally, we cannot purchase a negative amount of food so we require 𝑥 ≥ 0.


The objective and all of the constraints are linear in 𝑥, so this is an LP ( 𝑓 =
𝑐, 𝐶 = −𝑁 , 𝑑 = 𝑟). We do not need to artificially restrict what foods we include
in our initial list of possibilities. The formulation allows the optimizer to select
a given food item 𝑥 𝑖 to be zero (that is, do not purchase any of that food item),
according to what is optimal.
As a concrete example, let us consider a simplified version (and a reduc-
tionist view of nutrition) with 10 food options and three nutrients with the
amounts shown below.

Food Cost Nutrient 1 Nutrient 2 Nutrient 3


A 0.46 0.56 0.29 0.48
B 0.54 0.84 0.98 0.55
C 0.40 0.23 0.36 0.78
D 0.39 0.48 0.14 0.59
E 0.49 0.05 0.26 0.79
F 0.03 0.69 0.41 0.84
G 0.66 0.87 0.87 0.01
H 0.26 0.85 0.97 0.77
I 0.05 0.88 0.13 0.13
J 0.60 0.62 0.69 0.10

If we call the amount of each food 𝑥, the cost column 𝑐, and the nutrient
columns 𝑛1 , 𝑛2 , 𝑛3 then we can setup the following linear problem:

$\begin{aligned} \text{minimize} \quad & c^T x \\ \text{subject to} \quad & 5 \le n_1^T x \le 8 \\ & 7 \le n_2^T x \\ & 1 \le n_3^T x \le 10 \\ & x \le 4 \end{aligned}$   (11.5)

The last constraint was added to ensure we do not eat too much of any one
item and get tired of it. LP solvers are widely available. In fact, some solvers

operate independent of a programming language as the input is just a table of


numbers. The solution in this case is:

𝑥 = [0, 1.43, 0, 0, 0, 4.00, 0, 4.00, 0.73, 0]𝑇 (11.6)

suggesting that our optimal diet consists of items B, F, H, and I in the proportions
shown above. The solution hit the upper limit on nutrient 1 and the lower limit
on nutrient 2.
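As a sketch of how this problem could be solved programmatically, the following sets it up for scipy.optimize.linprog using the table above; the two-sided constraints are split into one-sided rows, and this particular tool choice is ours rather than the text's.

    import numpy as np
    from scipy.optimize import linprog

    c  = np.array([0.46, 0.54, 0.40, 0.39, 0.49, 0.03, 0.66, 0.26, 0.05, 0.60])  # cost
    n1 = np.array([0.56, 0.84, 0.23, 0.48, 0.05, 0.69, 0.87, 0.85, 0.88, 0.62])  # nutrient 1
    n2 = np.array([0.29, 0.98, 0.36, 0.14, 0.26, 0.41, 0.87, 0.97, 0.13, 0.69])  # nutrient 2
    n3 = np.array([0.48, 0.55, 0.78, 0.59, 0.79, 0.84, 0.01, 0.77, 0.13, 0.10])  # nutrient 3

    # Rewrite 5 <= n1.x <= 8, 7 <= n2.x, and 1 <= n3.x <= 10 as A_ub @ x <= b_ub
    A_ub = np.vstack([n1, -n1, -n2, n3, -n3])
    b_ub = np.array([8.0, -5.0, -7.0, 10.0, -1.0])

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 4)] * 10)
    print(res.x)  # compare with the solution quoted in Eq. 11.6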

LPs frequently occur with allocation or assignment problems, such


as choosing an optimal portfolio of stocks, deciding what mix of
products to build, deciding what tasks should be assigned to each
worker, determining which goods to ship to which locations. These
types of problems occur frequently in domains like operations research,
finance, supply chain management, and transportation.
A common consideration with LPs is whether or not the variables
should be discrete. In Ex. 11.1, 𝑥 𝑖 is a continuous variable and purchas-
ing fractional amounts of food may or may not make sense, depending
on the type of food. If we were performing an optimal stock allocation
then we can purchase fractional amounts of stock, but if we were opti-
mizing how much of each product to manufacture, it might not make
sense to build 32.4 products. In these cases, we may want to restrict
the variables to be integers, which are called integer constraints. These
types of problems require discrete optimization algorithms, which are
covered in Chapter 8.

11.3 Quadratic Programming

A quadratic program (QP) has a quadratic objective and linear constraints.


Quadratic programming was mentioned in Section 5.4 when discussing
sequential quadratic programming. A general QP can be expressed as:

$\begin{aligned} \text{minimize} \quad & \frac{1}{2} x^T Q x + f^T x \\ \text{subject to} \quad & A x + b = 0 \\ & C x + d \le 0 \end{aligned}$   (11.7)

A QP is only convex if the matrix 𝑄 is positive semidefinite. A QP is


reduced to an LP if 𝑄 = 0.
One of the most common QP examples (really a subset of QP) is
least-squares regression, which is used in many applications, such as
data fitting. As the name suggests, least squares seeks to minimize the

sum of squared residuals:


$\text{minimize} \quad \sum_i \left(\hat{b}_i - b_i\right)^2,$   (11.8)

where the vector 𝑏ˆ contains the estimated values (from a model, for
example), and 𝑏 contains the data points that we are trying to fit. If we
assume a linear model, then 𝑏ˆ = 𝐴𝑥, where 𝑥 are the model parameters
we want to optimize to fit the data. Here, “linear” means linear in the
coefficients, not in the data fit. For example, we could estimate the
coefficients 𝑐 𝑖 of a quadratic function that best fits some data:

$f(\zeta) = c_1 \zeta^2 + c_2 \zeta + c_3$   (11.9)

This equation is linear in the coefficients (which corresponds to 𝑥 =


[𝑐 1 , 𝑐2 , 𝑐3 ]). For this to be a least-squares problem, 𝐴 must have more
rows than columns, that is, more equations than unknowns, and is also
known as overdetermined.
We can rewrite the problem statement as:
$\text{minimize} \quad \sum_i \left(a_i^T x - b_i\right)^2,$   (11.10)

where 𝑎 𝑇𝑖 is the 𝑖 th row in the matrix 𝐴. Equivalently, we can express


the least-squares problem as minimizing the square of a 2-norm:

$\text{minimize} \quad \|A x - b\|_2^2$   (11.11)

which can be expressed equivalently as

$\|A x - b\|_2^2 = (A x - b)^T (A x - b) = x^T A^T A x - 2 b^T A x + b^T b$   (11.12)

This is the same as the general QP form, where $Q = 2 A^T A$, $f = -2 A^T b$, and $b^T b$ is just a constant and so does not affect the optimal solution. Thus, least squares is an unconstrained QP. Least squares actually has an analytic solution if 𝐴 has full rank ($x^* = (A^T A)^{-1} A^T b$), so the machinery
of a QP is not necessary. However, we can add constraints in QP form
to solve constrained least squares problems, which generally do not have
analytic solutions.

Example 11.2: A constrained least-squares QP.

The left pane of Fig. 11.3 shows some example data that is both noisy and
biased relative to the true (but unknown) underlying curve represented as a
dashed line. Given the data points we would like to estimate the underlying
functional relationship. We assume that the relationship is cubic:

$y(x) = a_1 x^3 + a_2 x^2 + a_3 x + a_4$   (11.13)

and need to estimate the coefficients $a_1, \dots, a_4$. As discussed above this can be posed as a QP problem, or even more simply as an analytic problem. The resulting least-squares fit is shown in the middle pane of Fig. 11.3.

Figure 11.3: Sample data and the true underlying curve (left), the unconstrained least-squares fit (middle), and the constrained least-squares fit (right).

Suppose from some careful measurements, or additional data, we know
an upper bound on the function value at a few places. For this example we
assume that we know that 𝑓 (−2) ≤ −2, 𝑓 (0) ≤ 4, 𝑓 (2) ≤ 26. These requirements
can be posed as linear constraints:

[ (−2)³  (−2)²  −2  1 ]   [a_1]      [ −2 ]
[   0      0     0  1 ]   [a_2]  ≤   [  4 ]        (11.14)
[   2³     2²    2  1 ]   [a_3]      [ 26 ]
                          [a_4]
We add these linear constraints to our quadratic objective (minimizing the sum of squared errors), and the problem is still a QP. The solution, shown in the right pane of Fig. 11.3, yields a much more accurate fit.
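As a sketch of how such a problem might be set up in practice (our illustration, not the book's code), the constrained fit can be posed with CVXPY; the synthetic data below are hypothetical placeholders that only mimic the structure of this example.

```python
# Minimal sketch: constrained least-squares QP with CVXPY (hypothetical data).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
x_data = np.linspace(-2, 2, 20)
y_data = 2 * x_data**3 + x_data + 3 + rng.normal(2.0, 1.5, x_data.size)  # noisy, biased samples (assumed)

A = np.vander(x_data, 4)                     # columns: x^3, x^2, x, 1
a = cp.Variable(4)                           # coefficients a1..a4

# Upper bounds f(-2) <= -2, f(0) <= 4, f(2) <= 26, written as C a <= d (Eq. 11.14)
C = np.vander(np.array([-2.0, 0.0, 2.0]), 4)
d = np.array([-2.0, 4.0, 26.0])

prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ a - y_data)), [C @ a <= d])
prob.solve()
print(a.value)                               # constrained polynomial coefficients
```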

Example 11.3: Linear-quadratic regulator (LQR) controller.

Another common example of a QP occurs in optimal control. Consider a discrete-time linear dynamic system:

x_{t+1} = A x_t + B u_t        (11.15)

where 𝑥 𝑡 is the deviation from a desired state at time 𝑡 (for example, the
positions and velocities of an aircraft), and 𝑢𝑡 are the control inputs that we
want to optimize (for example, control surface deflections). The above dynamic
equation can be used as a set of linear constraints in an optimization, but we
must decide on an objective.
One would like to have small 𝑥 𝑡 because that would mean reducing the error
in our desired state quickly, but we would also like to have small 𝑢𝑡 because
small control inputs require less energy. These are competing objectives, where
a small control input will take longer to minimize error in a state, and vice-versa.

One way to express this objective is as a quadratic function:

minimize  (1/2) Σ_{t=0}^{N} ( x_t^T Q x_t + u_t^T R u_t )        (11.16)

where the weights in Q and R reflect our preferences on how important it is to have small state error versus small control inputs. (This is an example of a multiobjective function, which we explained in Chapter 9.) The equation
has a form like kinetic energy, and the LQR problem could be thought of as
determining the control inputs that minimize the energy expended, subject
to the vehicle dynamics. This particular choice of objective was intentional
because it means that the problem is a convex QP (as long as we choose positive
weights). Because it is convex, this problem can be solved reliably and efficiently,
both necessary conditions for a robust control law.
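As an illustrative sketch (our own, not the book's), the finite-horizon LQR problem can be posed directly as a convex QP with CVXPY; the system matrices, weights, and horizon below are hypothetical placeholders.

```python
# Minimal sketch: finite-horizon LQR as a convex QP (hypothetical double integrator).
import numpy as np
import cvxpy as cp

A = np.array([[1.0, 0.1], [0.0, 1.0]])    # assumed discrete-time dynamics
B = np.array([[0.005], [0.1]])
N = 30                                     # horizon length
x0 = np.array([1.0, 0.0])                  # initial state deviation

x = cp.Variable((2, N + 1))                # states x_0 ... x_N
u = cp.Variable((1, N))                    # controls u_0 ... u_{N-1}
Q = np.eye(2)                              # state-error weight (positive definite)
R = 0.1 * np.eye(1)                        # control-effort weight

cost = 0
constraints = [x[:, 0] == x0]
for t in range(N):
    cost += 0.5 * (cp.quad_form(x[:, t], Q) + cp.quad_form(u[:, t], R))  # Eq. 11.16
    constraints += [x[:, t + 1] == A @ x[:, t] + B @ u[:, t]]            # Eq. 11.15
cost += 0.5 * cp.quad_form(x[:, N], Q)     # terminal state penalty

prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve()
print(prob.value, u.value[:, 0])           # optimal cost and first control input
```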

11.4 Second-Order Cone Programming

A second-order cone program (SOCP) has a linear objective and second-order cone constraints:

minimize    f^T x
subject to  ‖A_i x + b_i‖_2 ≤ c_i^T x + d_i        (11.17)
            G x + h = 0

If 𝐴 𝑖 = 0 then this form reduces to a linear programming problem.


One useful subset of SOCP is a quadratically-constrained quadratic
program (QCQP). A QCQP is the same as a QP but with the addition of
quadratic inequality constraints instead of linear ones, that is,

minimize    (1/2) x^T Q x + f^T x
subject to  A x + b = 0                         (11.18)
            (1/2) x^T R_i x + c_i^T x + d_i ≤ 0   for i = 1, …, m
Both Q and all R_i must be positive semidefinite for the QCQP to be convex. A QCQP reduces to a QP if R = 0. We saw QCQPs when solving trust-region problems in Section 4.5, although for trust-region problems only an approximate solution method is typically used.

Every QCQP can be expressed as an SOCP (though not vice versa). The QCQP in Eq. 11.18 can be written in this equivalent form:

minimize    y
subject to  ‖F x + g‖_2 ≤ y
            A x + b = 0                          (11.19)
            ‖G_i x + h_i‖_2 ≤ 0

If we square both sides of the first and last constraints, we see that this formulation is exactly equivalent to the QCQP where Q = 2 F^T F, f = 2 F^T g, R_i = 2 G_i^T G_i, c_i = 2 G_i^T h_i, and d_i = h_i^T h_i. The matrices F and G_i are the square roots of the matrices Q and R_i, respectively (divided by two), and would be computed from a factorization.
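As a small sketch of that last step (our illustration, using a hypothetical positive-definite matrix), a Cholesky factorization gives a factor F with Q = 2 F^T F:

```python
# Minimal sketch: compute F such that Q = 2 F^T F via Cholesky factorization.
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # hypothetical positive-definite matrix
L = np.linalg.cholesky(Q / 2.0)          # Q/2 = L L^T
F = L.T                                  # then F^T F = L L^T = Q/2, i.e., Q = 2 F^T F
print(np.allclose(2 * F.T @ F, Q))       # True
```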

11.5 Disciplined Convex Optimization

Disciplined convex optimization allows us to build convex problems using a specific set of rules and mathematical functions. By following this set of rules, the problem can be translated automatically into a form that can be efficiently solved using convex optimization algorithms.§

§ Grant et al.130 show how the disciplined convex problem rules allow for translating the problem into an efficiently solvable form automatically.
130. Grant et al., Disciplined Convex Programming. 2006

The following are examples of functions that are convex:

• Exponential functions: e^{ax}, where a is any real number

• Power functions: x^a for a ≥ 1 or a ≤ 0, or −x^a for 0 ≤ a ≤ 1

• Negative logarithms: − log(𝑥)

• Norms, including absolute value: ||𝑥||

• Maximum function: max(𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 )

• log-sum-exp: log(𝑒 𝑥1 + 𝑒 𝑥2 + . . . + 𝑒 𝑥 𝑛 )

The functions do not need to be continuously differentiable because


this is not a requirement of convexity.
A disciplined convex problem can be formulated using any of these
functions for our objective or inequality constraints. We can also use
various operations that preserve convexity to build up more complex
expressions. Some of the more common operations are:

• Multiplying a convex function by a positive constant

• Adding convex functions

• Composing a convex function with an affine function; i.e., if f(x) is convex, then f(Ax + b) is also convex

• Taking the maximum of two convex functions

While these functions and operations greatly expand the types of convex problems that we can solve beyond LPs and QPs, they are still restrictive within the broader scope of nonlinear programming. Still, for objectives and constraints that require only simple mathematical expressions, there is a possibility that the problem can be posed as a disciplined convex optimization problem. The original expression of a problem is often not convex but can be made convex through a transformation to a mathematically equivalent problem. These transformation techniques include performing a change of variables, adding slack variables, and expressing the objective in a different form. Successfully recognizing and applying these techniques is a skill that takes practice.

Tip 11.4: Software for disciplined convex programming.

CVX and its variants are free, popular tools for disciplined convex programming, with interfaces for multiple programming languages.¶

¶ https://stanford.edu/~boyd/software.html
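As a small illustrative sketch (ours, not the book's), a disciplined convex problem built from the functions and rules above can be stated with CVXPY, one member of the CVX family; the matrices below are hypothetical placeholders.

```python
# Minimal sketch: a disciplined convex problem in CVXPY (hypothetical data).
import numpy as np
import cvxpy as cp

A = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, -1.0]])
b = np.array([0.2, -0.3, 0.1])

x = cp.Variable(2)
# log-sum-exp composed with an affine function is convex; adding a norm keeps it convex
objective = cp.Minimize(cp.log_sum_exp(A @ x + b) + cp.norm(x, 1))
constraints = [cp.abs(x) <= 2]             # convex, norm-type constraints
prob = cp.Problem(objective, constraints)
prob.solve()
print(x.value, prob.value)
```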

11.6 Geometric Programming

A geometric program (GP) is not convex, but it can be transformed into an equivalent convex problem. GPs are defined using monomials and posynomials. A monomial is a function of the form:

f(x) = c x_1^{a_1} x_2^{a_2} ⋯ x_n^{a_n}        (11.20)

where c > 0 and all x_i > 0. A posynomial is a sum of monomials:

f(x) = Σ_{j=1}^{N} c_j x_1^{a_{1j}} x_2^{a_{2j}} ⋯ x_n^{a_{nj}}        (11.21)

where all 𝑐 𝑗 > 0.

Example 11.5: Monomials and posynomials in engineering.

Monomials and posynomials appear in many engineering expressions. For example, the calculation of lift from the definition of the lift coefficient is a monomial:

L = (1/2) C_L ρ V² S        (11.22)
Total incompressible drag, a sum of parasitic and induced drag, is a posynomial:

𝐶 𝐿2
𝐷 = 𝐶 𝐷 𝑝 𝑞𝑆 + 𝑞𝑆 (11.23)
𝜋𝐴𝑅𝑒
11 Convex Optimization 327

A GP in standard form is written as:

minimize 𝑓0 (𝑥)
subject to 𝑓𝑖 (𝑥) ≤ 1 (11.24)
ℎ 𝑖 (𝑥) = 1

where all of the 𝑓𝑖 are posynomials and the ℎ 𝑖 are monomials. This
problem does not fit into any of the convex optimization problems
defined in the previous section, and it is not convex. The reason why
this formulation is useful is that we can convert it into an equivalent
convex optimization problem.
First, we take the logarithm of the objective and of both sides of the
constraints:
minimize log 𝑓0 (𝑥)
subject to log 𝑓𝑖 (𝑥) ≤ 0 (11.25)
log ℎ 𝑖 (𝑥) = 0.
Let us further examine the equality constraints. Recall that ℎ 𝑖 is a
monomial, so writing one of the constraints explicitly results in the
form:
log( c x_1^{a_1} x_2^{a_2} ⋯ x_n^{a_n} ) = 0        (11.26)
Using the properties of logarithms, this can be expanded to an equiva-
lent expression:

log 𝑐 + 𝑎 1 log 𝑥1 + 𝑎2 log 𝑥2 + . . . + 𝑎 𝑛 log 𝑥 𝑛 = 0 (11.27)

Introducing a change of variables, 𝑦 𝑖 = log 𝑥 𝑖 , results in the following


equality constraint:

𝑎1 𝑦1 + 𝑎2 𝑦2 + . . . + 𝑎 𝑛 𝑦𝑛 + log 𝑐 = 0
(11.28)
𝑎 𝑇 𝑦 + log 𝑐 = 0

This is an affine constraint in 𝑦.


The objective and inequality constraints are more complex because
they are posynomials. The expression log 𝑓𝑖 written in terms of a
posynomial results in:

log( Σ_{j=1}^{N} c_j x_1^{a_{1j}} x_2^{a_{2j}} ⋯ x_n^{a_{nj}} )        (11.29)

Because this is a sum of products, we cannot use the logarithm to expand each term. However, we still introduce the same change of variables (expressed as x_i = e^{y_i}):

log f_i = log( Σ_{j=1}^{N} c_j e^{y_1 a_{1j}} e^{y_2 a_{2j}} ⋯ e^{y_n a_{nj}} )
        = log( Σ_{j=1}^{N} c_j e^{y_1 a_{1j} + y_2 a_{2j} + ⋯ + y_n a_{nj}} )        (11.30)
        = log( Σ_{j=1}^{N} e^{a_j^T y + b_j} ),   where b_j = log c_j.
This is a log-sum-exp of an affine function. As mentioned in the previous
section, log-sum-exp is convex, and a convex function composed with an
affine function is a convex function. Thus, the objective and inequality
constraints are convex in 𝑦. Because the equality constraints are affine,
we have a convex optimization problem obtained through a change of
variables.
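To make the transformation concrete, here is a minimal sketch (our own illustration, with a hypothetical two-variable GP) of solving the log-space problem with a general solver; a dedicated GP solver would normally be used instead.

```python
# Minimal sketch: solve a tiny GP in log space, y_i = log(x_i).
# Hypothetical problem: minimize 1/(x1*x2) subject to 0.5*x1 + 0.5*x2 <= 1.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def obj(y):
    # log f0 = log(x1^-1 x2^-1) = -y1 - y2 (affine in y)
    return -y[0] - y[1]

def con(y):
    # log f1 = log(0.5 e^{y1} + 0.5 e^{y2}) <= 0, a log-sum-exp of affine terms;
    # SLSQP expects inequality constraints of the form fun(y) >= 0
    return -logsumexp([y[0] + np.log(0.5), y[1] + np.log(0.5)])

res = minimize(obj, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": con}])
x_opt = np.exp(res.x)   # transform back: x = e^y
print(x_opt)            # approaches [1, 1] for this problem
```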
Geometric programming has been successfully used for aircraft design applications using relationships like the simple ones shown in Ex. 11.5.131

131. Hoburg et al., Geometric Programming for Aircraft Design Optimization. 2014

Unfortunately, many other functions do not fit this form (e.g., design variables that can be positive or negative, terms with negative coefficients, trigonometric functions, logarithms, exponents). GP modelers use various techniques to extend usability, including using a Taylor's series across a restricted domain, fitting functions to posynomials,132 and rearranging expressions to other equivalent forms, including implicit relationships. A good deal of creativity and some sacrifice in fidelity is usually needed to create a corresponding GP from a general nonlinear programming problem. Still, if the sacrifice in fidelity is not too great, there is a big upside, as it comes with all the benefits of convexity (guaranteed convergence, global optimality, efficiency, no parameter tuning, and limited scaling issues).

132. Hoburg et al., Data fitting with geometric-programming-compatible softmax functions. 2016
One extension to geometric programming is signomial program-
ming. A signomial program has the same form except that the coeffi-
cients 𝑐 𝑖 can be positive or negative (the design variables 𝑥 𝑖 must still
be strictly positive). Unfortunately, this problem cannot be transformed
to a convex one, so it can no longer guarantee a global optimum. Still, a
signomial program can usually be solved using a sequence of geometric
programs, so it is much more efficient than solving the general nonlinear
problem. Signomial programs have been used to extend the range of design problems that can be solved using geometric programming techniques.133,134

133. Kirschen et al., Application of Signomial Programming to Aircraft Design. 2018
134. York et al., Turbofan Engine Sizing and Tradeoff Analysis via Signomial Programming. 2018

Tip 11.6: Software for Geometric Programming
GPKit‖ is a useful, freely available software package for posing and solving
geometric programming (and signomial programming) models.

https://gpkit.readthedocs.io

11.7 Summary

Convex optimization problems are highly desirable: they require no parameter tuning, starting points, or derivatives, and they converge reliably and rapidly to the global optimum. The tradeoff is that the form of the objective and constraints must meet stringent requirements. These requirements often necessitate simplifying the physics models and often require clever reformulations. The reduction in model fidelity is still well suited to domains where optimizations are performed repeatedly (e.g., controls, machine learning), or for high-level conceptual design studies. Linear programming and quadratic programming in particular are widely used across many domains and form the basis of many of the gradient-based algorithms used to solve general non-convex problems.

Problems

11.1 Answer true or false and justify your answer.

a) The optimum found through convex optimization is guaran-


teed to be the global optimum.
b) Cone programming problems are a special case of quadratic
programming problems.
c) It is sometimes possible to obtain distinct feasible regions in
linear optimization.
d) A quadratic problem is a problem with quadratic objective
and quadratic constraints.
e) A quadratic problem is only convex if the Hessian of the
objective function is positive definite.
f) Solving a quadratic problem is easy because the solution can
be obtained analytically.

g) Least-square regression is a type of quadratic programming


problem.
h) Second-order cone programming problems feature a linear
objective and a second-order cone constraint.
i) Disciplined convex optimization builds convex problems by
using convex differentiable functions.
j) It is possible to transform some nonconvex problems into
convex ones by using a change of variables, adding slack
variables, or reformulating the objective function.
k) A geometric program is not convex but can be transformed
into an equivalent convex program.
l) Convex optimization algorithms work well as long as a good
starting point is provided.

11.2 Solve using a convex solver (not a general nonlinear solver).

minimize 𝑥12 + 3𝑥22


subject to 𝑥1 + 4𝑥 2 ≥ 2
3𝑥1 + 2𝑥2 ≥ 5
𝑥1 ≥ 0, 𝑥2 ≥ 0

11.3 The following foods are available to you at your nearest grocer.

Food Cost Nutrient 1 Nutrient 2 Nutrient 3


A 7.68 0.16 1.41 2.40
B 9.41 0.47 0.58 3.95
C 6.74 0.87 0.56 1.78
D 3.95 0.62 1.59 4.50
E 3.13 0.29 0.42 2.65
F 6.63 0.46 1.84 0.16
G 5.86 0.28 1.23 4.50
H 0.52 0.25 1.61 4.70
I 2.69 0.28 1.11 3.11
J 1.09 0.26 1.88 1.74

Minimize the amount you spend while making sure you get at
least 5 units of Nutrient 1, between 8 and 20 units of nutrient 2,
and between 5 and 30 units of nutrient 3. Also be sure not to buy
more than 4 units of any one food item, just for variety. Determine
the optimal amount of each item to purchase and the total cost.

11.4 Consider the following simplified aircraft wing design problem.


Our goal is primarily to size the wing area (S), aspect ratio (AR), and flight speed (V) in order to minimize drag:

D = (1/2) C_D ρ V² S        (11.31)
where the drag coefficient is a sum of parasitic drag, lift-dependent
drag, and drag of the rest of the aircraft.

C_D = k C_f (S_wet / S) + C_L² / (π AR e) + (1/S) (CDS)_other        (11.32)
The skin friction coefficient is a function of Reynolds number:

C_f = 0.074 / Re^{0.2}        (11.33)
where the Reynolds number is:

Re = ρ V √(S/AR) / μ        (11.34)

We need to add some constraints. One is that lift equals weight:

W = (1/2) C_L ρ V² S        (11.35)
where the weight is a sum of the wing weight and the fixed weight
of the rest of the aircraft:

𝑊 = 𝑊𝑤 + 𝑊other (11.36)

and the wing weight is estimated using a statistical fit:

W_w = 45.42 S + 8.71 × 10⁻⁵ ( N_load (S · AR)^{3/2} √(W_other W) ) / (S τ)        (11.37)
Another constraint is that we need enough wing area to fly at our desired stall speed:

2W / (ρ V_s² S) ≤ C_{L_max}        (11.38)
Variables that were not defined above are fixed parameters shown
in the following table.

Parameter Value Unit Description


𝜌 1.23 kg/m3 density of air
𝜇 1.78 × 10−5 kg/(m sec) viscosity of air
𝑘 1.2 form factor
𝑆wet /𝑆 2.05 ratio of wetted area to
reference area
𝑒 0.96 Oswald efficiency factor
(𝐶𝐷𝑆)other 0.0306 drag area for rest of
aircraft
𝑊other 4940 N weight of rest of aircraft
𝑁load 2.5 load factor
𝜏 0.12 airfoil thickness-to-chord
ratio
𝑉𝑠 22 m/s stall speed
𝐶 𝐿max 2 maximum lift coefficient
at landing

Formulate the above problem as a geometric program and solve it.∗∗

∗∗ GPkit, https://gpkit.readthedocs.io, is one good option to solve this. If you get stuck, this problem is an example in the paper Geometric Programming for Aircraft Design Optimization as well as an example in the documentation for GPkit (with slightly different values).
12 Optimization Under Uncertainty
Uncertainty is always present in engineering design. For example,
manufacturing processes create deviations from the specifications,
operating conditions vary from the ideal, and some parameters are
inherently variable. Optimization with deterministic inputs can lead
to poorly performing designs. To create robust and reliable designs,
we must treat the relevant parameters and design variables as random
variables. Optimization under uncertainty (OUU) is the optimization of
systems in the presence of uncertain parameters or design variables.

By the end of this chapter you should be able to:

1. Define robustness and reliability in the context of opti-


mization under uncertainty.

2. Describe and use several strategies for both robust opti-


mization and reliability.

3. Understand the pros and cons for the following forward


propagation methods: direct quadrature, Monte Carlo
methods, first-order perturbation methods, and polyno-
mial chaos.

4. Use some of the forward propagation methods within an


optimization.

12.1 Introduction

We call a design robust if its performance is less sensitive to inherent


variability. In other words, the objective function is less sensitive to
variations in the random design variables and parameters. Similarly,
we call a design reliable if it is less prone to failure under variability. In
other words, the constraints have a lower probability of being violated under variations in the random design variables and parameters.∗

∗ While we maintain a distinction in this book, much of the literature includes both of these concepts under the umbrella of robust optimization.


Example 12.1: Robust versus reliable designs.

A familiar example of robust design occurs when playing the board game
Monopoly. On a given turn, if you knew for certain where an opponent was
going to land next, it would make sense to put all of your funds into developing
that one property to its fullest extent. However, because their next position is
uncertain, a better strategy might be to develop multiple nearby properties (each
to a lesser extent because of a fixed monetary resource). This is an example of
robust design: the expected return is less sensitive to input variability. However,
because you develop multiple properties, a given property will have a lower
return than if you had only developed one property. This is a fundamental
tradeoff in robust design. An improvement in robustness generally is not
free; instead, it requires a tradeoff in peak performance. This is known as a
risk-reward tradeoff.
A familiar example of reliable design is when planning a trip to the airport.
Experience suggests that it is not a good idea to use average times to plan your
arrival down to the minute. Instead, if you want a high probability of making
your flight, you plan for variability in traffic and security lines and add a buffer
to your departure time. This is an example of a reliable design: it is less prone
to failure under variability. Reliability is also not free, and generally requires a
tradeoff in the objective (in this example, optimal use of time perhaps).

In this chapter, we first provide a brief review of some elements of


statistics and probability theory. Next, we discuss how uncertainty can
be used in the objective function allowing for robust designs, and how
it can be used in constraints allowing for reliable designs. Finally, we
discuss a few different methods for propagating input uncertainties
through a computational model to produce output statistics, a process
called forward propagation. This chapter only presents an introduction
to these topics. Uncertainty quantification, and OUU, is a large and
actively growing field.

12.2 Statistics Review

Imagine measuring the axial strength of a rod by performing a tensile test


with many rods, each designed to be identical. Even with “identical”
rods, every time you perform the test you get a different result (hopefully
with relatively small differences). This variation has many potential
sources including variation in the manufactured size and shape, in
the composition of the material, in the contact between the rod and
testing fixture. In this example, we would call the axial strength
a random variable, and the result from one test would be a random
sample. The random variable, axial strength, is a function of several

other random variables such as bar length, bar diameter, and material
Young’s modulus.
One measurement does not tell us anything about how variable the
axial strength is, but if we perform the test many times we can learn a
lot about its distribution. From this information we can infer various
statistical quantities like the mean value of the axial strength. The mean
of some variable x that is measured N times is estimated as:

μ_x = (1/N) Σ_{i=1}^{N} x_i        (12.1)

Note that this is actually a sample mean, which would differ from
the population mean (the true mean if you could measure every bar).
With enough samples the sample mean will approach the population
mean. In this brief introduction we won’t distinguish between sample
and population statistics.
Another important quantity is the variance or standard deviation.
This is a measure of spread, or how far away our samples are from the
mean. The unbiased† estimate of the variance is:

σ_x² = (1/(N−1)) Σ_{i=1}^{N} (x_i − μ_x)²        (12.2)

† Unbiased means that the expected value of the sample variance is the same as the true population variance. If N were used in the denominator rather than N − 1, the two quantities would differ by a constant factor.

and the standard deviation is just the square root of the variance. A
small variance implies that measurements are clustered tightly around
the mean, whereas a large variance means that measurements are
spread out far from the mean. The variance can also be written in the
mathematically equivalent, but more computationally friendly, format:

σ_x² = (1/(N−1)) ( Σ_{i=1}^{N} x_i² − N μ_x² )        (12.3)
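As a quick numerical sketch (our own, with hypothetical measurements), the two variance expressions give the same result and match NumPy's unbiased estimator:

```python
# Minimal sketch: sample mean and unbiased variance (Eqs. 12.1-12.3), hypothetical data.
import numpy as np

x = np.array([998.2, 1001.5, 999.8, 1000.9, 1002.3, 997.6])
N = x.size

mu = x.sum() / N                                           # Eq. 12.1
var_direct = ((x - mu) ** 2).sum() / (N - 1)               # Eq. 12.2
var_friendly = ((x ** 2).sum() - N * mu ** 2) / (N - 1)    # Eq. 12.3

print(np.isclose(var_direct, var_friendly))                # True
print(np.isclose(var_direct, np.var(x, ddof=1)))           # matches NumPy's unbiased estimate
```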

More generally, we might want to know what the probability is of


getting a bar with a specific axial strength. In our testing, we could
tabulate the frequency of each measurement in a histogram. If done
enough times, it would define a smooth curve as shown in Fig. 12.1a.
This curve is called the probability density function (PDF), 𝑝(𝑥), and it tells
us the relative probability of a certain value occurring. More specifically,
a PDF gives the probability of getting a value within a certain range:

Prob[a ≤ x ≤ b] = ∫_a^b p(x) dx        (12.4)
The total integral of the PDF must be one since it contains all possible outcomes (100%):

∫_{−∞}^{∞} p(x) dx = 1        (12.5)

From the PDF we can also measure various statistics like the mean:

μ_x = E[x] = ∫_{−∞}^{∞} x p(x) dx        (12.6)

This quantity is also referred to as the expected value of x (E[x]). We can also compute the variance from its definition:

σ_x² = ∫_{−∞}^{∞} (x − μ_x)² p(x) dx        (12.7)

or in a mathematically equivalent format:

σ_x² = ∫_{−∞}^{∞} x² p(x) dx − μ_x²        (12.8)

Figure 12.1: Comparison between probability density function and cumulative distribution function for a simple example. (a) Probability density function for the axial strength of a rod. (b) Cumulative distribution function for the axial strength of a rod.

A related concept is the cumulative distribution function (CDF), which is the cumulative integral of the PDF:

F(x) = ∫_{−∞}^{x} f(t) dt        (12.9)

The capital 𝐹 denotes the CDF and the lowercase 𝑓 the PDF. As
an example, the CDF for the axial strength distribution is shown in
Fig. 12.1b. The CDF always approaches 1 as 𝑥 → ∞.
We often fit a named distribution to the PDF of empirical data. One
of the most popular distributions is the Gaussian or Normal distribution.
Its PDF is:

p(x; μ, σ²) = ( 1 / (σ √(2π)) ) exp( −(x − μ)² / (2σ²) )        (12.10)

Figure 12.2: Two normal distributions (μ = 1, σ = 0.5 and μ = 3, σ = 1.0). Changing the mean causes a shift along the x-axis. Increasing the standard deviation causes the PDF to spread out.

For a Gaussian distribution the mean and variance are clearly visible
in the function, but keep in mind these quantities are defined for
any distribution. Figure 12.2 shows two normal distributions with
different means and standard deviations to illustrate the effect of those
parameters. A few other popular distributions, including a uniform,
Weibull, lognormal, and exponential distribution are shown in Fig. 12.3.
These only give a flavor of different named distributions, many others
exist.

Figure 12.3: A few other example probability distributions: (a) uniform, (b) Weibull, (c) lognormal, (d) exponential.

12.3 Robust Design

We illustrate some of the key concepts in robust design through exam-


ples.

Example 12.2: A robust airfoil optimization.

Consider a simple airfoil optimization. Figure 12.4 shows the drag coefficient
of an RAE 2822 airfoil, as a function of Mach number, evaluated by an inviscid
compressible flow solver.

Figure 12.4: Inviscid drag coefficient, in counts, of the RAE 2822 airfoil as a function of Mach number.

This is a typical drag rise curve, where increasing Mach number leads to
stronger shock waves and an associated increase in wave drag. Now let’s try to
change the shape of the airfoil to allow us to fly a little bit faster without large
increases in drag. We could perform an optimization to minimize the drag of
this airfoil at Mach 0.71. The resulting drag curve of this optimized airfoil is
shown in Fig. 12.5 in comparison to the baseline RAE 2822 airfoil.

Figure 12.5: The red curve shows the drag of the airfoil optimized to minimize drag at M = 0.71, corresponding to the dot. The drag is low at the requested point, but off-design performance is poor.

Note that the drag is low at Mach 0.71 (as requested!), but any deviation
from the target Mach number causes significant drag penalties. In other words,
the design is not robust.
One way to improve the design, is to use what is called a multi-point
optimization. We minimize a weighted sum of the drag coefficient evaluated at

three different Mach numbers (𝑀 = 0.68, 0.71, 0.725). The resulting drag curve
is shown in gray in Fig. 12.6.

Figure 12.6: The gray curve shows the drag of the airfoil optimized to minimize the average drag at the three denoted points. The robustness of the design is greatly improved.

A multipoint optimization is a simplified example of optimization under


uncertainty. Effectively, we have treated Mach number as a random parameter
with a given probability at three discrete values. We then minimized the
expected value of the drag. This simple change significantly increased the
robustness of the design. The drag at our desired speed of Mach 0.71 is not
quite as low as the single point case, but it is less sensitive to deviations from
this desired operating point. As noted in the introduction, a trade-off in peak
performance is required to achieve enhanced robustness.

Example 12.3: A robust wind farm layout optimization.

Wind farm layout optimization is another example of optimization under


uncertainty, but with a more complex probability distribution compared to the
highly simplified multipoint formulation. The positions of wind turbines in a
wind farm have a strong impact on overall performance because their wakes
interfere with one another. The primary goal of wind farm layout optimization
is to position the turbines to reduce interference and thus maximize power
production. In this example there are nine turbines, and the constraints on
this problem are purely geometric: the turbines must stay within a specified
boundary and must not be too close to any other turbine.
One of the primary challenges of wind farm layout optimization is that the
wind is highly variable. To keep the example simple, we will assume that wind
speed is constant, but that wind direction is an uncertain parameter. Figure 12.7
shows a PDF of wind direction for an actual wind farm, which is known as a
wind rose, and is more commonly visualized as shown on the right plot. We
see that the wind is predominately out of the west, with another peak coming
out of the south. Because of the variable nature of the wind it is difficult to
intuit an optimal layout.
We solve the problem two ways. The first way is to solve the problem
deterministically (i.e., ignore the variability). Commonly this is done by using
mean values for uncertain parameters, often with the assumption that the variability is Gaussian or at least symmetric. In this case, the wind direction is periodic and very asymmetric, so instead we optimize using the most probable wind direction (261°). The second way is to treat this as an OUU problem. Instead of maximizing the power for one direction, we maximize the expected value of the power across all directions. This is straightforward to compute from the definition of expected value because this is a one-dimensional function. Section 12.5 explains other ways to perform forward propagation.

Figure 12.7: Left: probability density function of wind direction. Right: same PDF but visualized as a wind rose.
Figure 12.8 shows the power as a function of wind direction for both cases.
Note that the deterministic approach does indeed allow for higher power
production when the wind comes from the west (and 180 degrees from that),
but that power drops considerably for other directions. In contrast, the OUU
result is much less sensitive to changes in wind direction. The expected value
of power is 58.6 MW for the deterministic case, and 66.1 MW for the OUU case,
which represents over a 12% improvement.‡

‡ The wind energy community does not use expected power directly, but rather annual energy production, which is just the expected power times utilization.

Figure 12.8: Wind farm power, as a function of wind direction, for two cases: optimized deterministically using the most probable direction, and optimized under uncertainty.

We can also see the tradeoff in the optimal layouts. The left side of Fig. 12.9
shows the optimal layout using the deterministic formulation, with the wind
coming from the predominant direction (the direction we optimized for).

The wakes are shown in blue and the boundaries with a dashed line. The
wind turbines have spaced themselves out so that there is very little wake
interference. However, when the wind changes direction the performance
degrades significantly. The right side of Fig. 12.9 shows the same layout, but
when the wind is in the second-most probable direction. In this direction many
of the turbines are operating in the wake of another turbine and produce much
less power.

Figure 12.9: Left: Deterministic case


with the primary wind direction.
Right: Deterministic case with the
secondary wind direction.

In contrast, the robust design is shown in Fig. 12.10 for the predominant
wind direction on the left and the second-most probable direction on the right.
In both cases the wake effects are relatively minor, though not quite as ideally
placed in the predominant direction. The tradeoff in performance for that one
direction, allows the design to be more robust as the wind changes direction.

Figure 12.10: Left: OUU case with


the primary wind direction. Right:
OUU case with the secondary wind
direction.

This example again highlights the classic risk-reward tradeoff. The maxi-
mum power achieved at the most probable wind speed is reduced, in exchange
for reduced power variability and thus higher energy production in the long
run.

Both Examples 12.2 and 12.3 used the expected value, or mean, as
the objective function. However, there are other useful forms of OUU

objective functions that may be more suitable. Consider the following


options:

1. Minimize the mean of the function: 𝜇 𝑓 (𝑥).

2. Minimize the standard deviation (or variance) of the function:


𝜎 𝑓 (𝑥).

3. Minimize the mean plus or minus some number of standard


deviations: 𝜇 𝑓 (𝑥) ± 𝑘𝜎 𝑓 (𝑥).

4. Perform a multiobjective optimization trading off the mean and


standard deviation. A Pareto front can be a useful tool to assess
this risk-reward tradeoff (see Chapter 9).

5. Minimize other statistical quantities like the 95% percentile of the


distribution.

6. Minimize a reliability metric (see next section for further discus-


sion on reliability): Prob[ 𝑓 (𝑥) > 𝑓𝑐𝑟𝑖𝑡 ], which means to minimize
the probability that the objective exceeds some critical value.

12.4 Reliability

In addition to affecting the objective function, uncertainty also affects


constraints. As with robustness, we begin with an example.

Example 12.4: Reliability with the Barnes function.

Consider the Barnes function shown on the left side of Fig. 12.11. The three
red lines are the three nonlinear constraints of the problem and the red regions
highlight regions of infeasibility. With deterministic inputs, the optimal value
sits right on the constraint. An uncertainty ellipse shown around the optimal
point highlights the fact that the solution is not reliable. Any variability in the
inputs will create a significant probability for one or more of the constraints to
be violated (just like the real life problem where you are likely to be late if you
plan your arrival assuming zero variability).
Conversely, the right side of Fig. 12.11 shows a reliable optimum, with the
same uncertainty ellipse. We see that it is highly probable that the design
will satisfy all constraints under the input variation. However, as noted in
the introduction, increased reliability presents a performance trade-off with a
corresponding increase in the objective function.
Figure 12.11: Left: The constrained deterministic optimum sits right on the constraint and, if there is any variability, is likely to violate a constraint. Right: The reliable optimum will still satisfy the constraints even with variability.

In some engineering disciplines, increasing the reliability is han-


dled in a basic way through safety factors. These safety factors are
deterministic, but are usually derived through statistical means.

Example 12.5: Connecting safety factors to reliability.

If we were constraining the stress (𝜎) in a structure to be less than the


material’s yield stress (𝜎 𝑦 ) we would not want to use a constraint of the form:

𝜎(𝑥) ≤ 𝜎 𝑦 . (12.11)

This would be dangerous because we know there is inherent variability in the


loads, and uncertainty in the yield stress of the material. Instead we often use
a simple safety factor.
𝜎(𝑥) ≤ 𝜂𝜎 𝑦 , (12.12)
where 𝜂 is a total safety factor that accounts for safety factors from loads,
materials, and failure modes. Of course, not all applications have standards-
driven safety factors already determined. The statistical approach discussed in
this chapter is useful in these situations to allow for reliable designs.

To create reliable designs, we change our deterministic inequality constraints:
𝑔(𝑥) ≤ 0 (12.13)
to the form:
Prob[𝑔(𝑥) ≤ 0] ≥ 𝑅 (12.14)
where 𝑅 is the reliability level. In words, we want the probability of
constraint satisfaction to exceed some pre-selected reliability level. For
example, if we set 𝑅 = 0.999 the solution must satisfy the constraints
with a probability of 99.9%. Thus, we can explicitly set the reliability
level that we wish to achieve, with associated trade-offs in the level of
performance for the objective function.

12.5 Forward Propagation

In the previous sections we have assumed that we know the statistics


(e.g., mean, standard deviation, etc.) of the outputs of interest (objectives
and constraints). However, we generally do not have that information.
Instead, we only know the probability density functions (or some of their statistics, e.g., mean and variance) of the inputs.§ Forward propagation methods propagate input uncertainties through a numerical model to compute output statistics.

§ Or at least we can make some assumptions that characterize the input uncertainties.
Uncertainty quantification is a large field unto itself, and we cannot
hope to do more than provide a broad introduction in this chapter. We
introduce four well-known methods for forward propagation: direct
quadrature, Monte Carlo methods, first-order perturbation methods,
and polynomial chaos.

12.5.1 Direct Quadrature


The mean and variance of an output function 𝑓 are defined as:

μ_f = ∫ f(x) p(x) dx        (12.15)

σ_f² = ∫ f(x)² p(x) dx − μ_f²        (12.16)

and one way to estimate them is to use direct numerical quadrature. In


other words, the estimation of each integral becomes a summation:
∫ f(x) dx ≈ Σ_i f(x_i) w_i        (12.17)

where 𝑤 𝑖 are specific weights. The nodes where the function is evalu-
ated, and the corresponding weights, are determined by the quadrature
strategy (e.g., rectangle rule, trapezoidal rule, Newton–Cotes, Clenshaw–
Curtis, Gaussian, Gauss–Kronrod).
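As a one-dimensional sketch (our own illustration, with a hypothetical function and input distribution), Gauss–Hermite quadrature estimates the mean and variance of an output under a normally distributed input:

```python
# Minimal sketch: direct quadrature of Eqs. 12.15-12.16 for x ~ N(mu, sigma^2).
# With the change of variable x = mu + sqrt(2)*sigma*t,
# E[f(x)] ≈ (1/sqrt(pi)) * sum_i w_i f(x_i) for Gauss-Hermite nodes t_i, weights w_i.
import numpy as np

def f(x):
    return np.sin(x) + 0.1 * x**2            # hypothetical output function

mu, sigma = 1.0, 0.3                          # assumed input statistics
t, w = np.polynomial.hermite.hermgauss(10)    # 10-point Gauss-Hermite rule
x_nodes = mu + np.sqrt(2.0) * sigma * t

mean_f = (w * f(x_nodes)).sum() / np.sqrt(np.pi)                  # Eq. 12.15
var_f = (w * f(x_nodes)**2).sum() / np.sqrt(np.pi) - mean_f**2    # Eq. 12.16
print(mean_f, var_f)
```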
The difficulty of numerical quadrature is extending to multiple
dimensions (also known as cubature), and, unfortunately, most of the
time there is more than one uncertain variable. The most obvious
extension for multidimensional quadrature is a full-grid tensor product.
This type of grid is created by discretizing the nodes in each dimension,
and then evaluating at every combination of nodes. Mathematically,
the quadrature formula can be written as
∫ f(x) dx_1 dx_2 ⋯ dx_n ≈ Σ_i Σ_j ⋯ Σ_n f(x_i, x_j, …, x_n) w_i w_j ⋯ w_n        (12.18)

While conceptually straightforward, this approach is subject to the curse


of dimensionality. The number of points we need to evaluate at grows
exponentially with the number of input dimensions.
One approach to deal with the exponential growth is to use a sparse grid method first proposed by Smolyak.135 The details are beyond our scope, but the basic idea is that through intelligently chosen points, we can maintain the level of accuracy of a full tensor grid while evaluating at far fewer points. Different quadrature rules (e.g., trapezoidal, Clenshaw–Curtis) produce different grids.

135. Smolyak, Quadrature and interpolation formulas for tensor products of certain classes of functions. 1963

Example 12.6: Sparse grid methods for quadrature.

Figure 12.12 shows a comparison between a two-dimensional full tensor grid using the Clenshaw–Curtis exponential rule (left) and a level 5 sparse grid (right) using the same quadrature strategy.

Figure 12.12: Comparison between a two-dimensional full tensor grid (left) and a level 5 sparse grid (right) using the Clenshaw–Curtis exponential rule.

For a problem with dimension 𝑑, and approximately 𝑁 sample


points in each dimension, the full tensor grid has a computational
complexity of 𝒪(𝑁 𝑑 ), whereas the sparse grid method has a complexity
of 𝒪(𝑁(log 𝑁)𝑑−1 ) with comparable accuracy. This scaling alleviates the
curse of dimensionality to some extent, but the number of evaluation
points is still strongly dependent on problem dimensionality—making
it intractable in high dimensions. The method introduced in the next
section is independent of the problem dimensionality, though is not
without its own drawbacks.

12.5.2 Monte Carlo simulation


As we saw in the previous section, direct numerical quadrature faces
the curse of dimensionality. Monte Carlo simulation can be used to
alleviate this issue. Monte Carlo methods attempt to approximate the
integrals of the previous section, by using the law of large numbers. The

basic idea is that output probability distributions can be approximated


by running the simulation many times with inputs randomly sampled
from the input probability distributions. There are three steps:

1. Random sampling. Sample N points 𝑥 𝑖 from the input probability


distributions using a random number generator.

2. Numerical experimentation. Evaluate the outputs at these points:


𝑓𝑖 = 𝑓 (𝑥 𝑖 ).

3. Statistical analysis. Compute statistics on the discrete output


distribution 𝑓𝑖 (e.g., using Eqs. 12.1 and 12.3).

We can also estimate Prob[𝑐(𝑥) ≤ 0] by counting how many times the


constraint was satisfied and dividing by 𝑁. If we evaluate enough
samples, our output statistics will converge to the true values by the law
of large numbers (though herein also lies its disadvantage, it requires a
large number of samples).
The Monte Carlo method has three main advantages. First, as noted
in the previous section, the convergence rate is independent of the
number of inputs. Whether we have 3 or 300 random input variables,
the convergence rate will be similar since we can sample from all inputs
at the same time. Second, the algorithm is easy to parallelize since
all of the function evaluations are completely independent. Third, in
addition to statistics like the mean and variance, we can also generate
the output probability distributions.
The major disadvantage of the Monte Carlo method is that even
though the convergence rate does not depend on the number of inputs, the convergence rate is slow: 𝒪(1/√N). This means that improving the
accuracy by one decimal place requires approximately 100 times more
samples. That means that if you need three more digits of accuracy
you would have to use about one million times as many samples. It is
also hard to know what value of 𝑁 to use a priori. Usually we need
to determine an appropriate value for 𝑁 through convergence testing
(trying larger values of 𝑁 until the statistics converge).
One approach to achieving converged statistics with fewer samples is to use Latin hypercube sampling (LHS) instead of pure random sampling. LHS allows one to better approximate the input distributions with fewer samples and is introduced in Section 10.2 in the context of surrogate modeling.
Various related sampling approaches exist, including importance sam-
pling, quasi-Monte Carlo, and Bayesian quadrature. Even with better
sampling methods, generally a large number of simulations are re-

quired, which can be particularly prohibitive if used as part of an OUU


problem.

Example 12.7: Forward propagation with Monte Carlo

Consider a problem with the following objective and constraint:

f(x) = x_1² + 2 x_2² + 3 x_3²        (12.19)
g(x) = x_1 + x_2 + x_3 ≤ 3.5
At the current iteration you are at the point 𝑥 = [1, 1, 1]. We assume that the
first variable is deterministic, whereas the latter two variables have uncertainty
under a normal distribution with the following standard deviations: 𝜎2 =
0.06, 𝜎3 = 0.2. We would like to compute the output statistics for 𝑓 (mean,
variance, and a histogram), and compute the reliability of the constraint at this
current iteration.
We don't know how many samples we need in order to get reasonably converged statistics, so we need to perform a convergence study. For a given number of samples, we generate random numbers normally distributed with mean x_i and standard deviation σ_i; we then evaluate the functions and compute the mean, variance, and reliability of the outputs. Figure 12.13 shows the convergence of the mean and standard deviation under the “Random sampling” curve.

Figure 12.13: Convergence of the mean and standard deviation versus the number of samples using Monte Carlo, for pure random sampling and for LHS.

From the data it appears that we may need about 10⁵ samples to confidently have well-converged statistics. Using that number of samples gives the following results: μ = 6.133, σ = 1.242, reliability = 99.187%. Note that because of the random sampling these results will vary somewhat between simulations. The corresponding histogram of the objective function is seen in Fig. 12.14. The output distribution does not appear to be normally distributed. Producing an output histogram is one of the unique benefits of this method.

Figure 12.14: Histogram of objective function for 100,000 samples.

As discussed, LHS can be used to converge statistics with fewer samples. The exact same process is repeated, except with LHS rather than pure random sampling. The convergence in the mean and standard deviation is shown under the “LHS” label in Fig. 12.13. We see that convergence is quicker, especially in the mean, requiring fewer samples by at least an order of magnitude.
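As an illustrative sketch (our own, mirroring the setup of this example with pure random sampling), the Monte Carlo estimates can be computed as follows:

```python
# Minimal sketch: Monte Carlo forward propagation for Ex. 12.7.
import numpy as np

def f(x1, x2, x3):
    return x1**2 + 2 * x2**2 + 3 * x3**2

def g(x1, x2, x3):
    return x1 + x2 + x3                      # constraint: g <= 3.5

rng = np.random.default_rng(0)
N = 100_000
x1 = 1.0                                     # deterministic variable
x2 = rng.normal(1.0, 0.06, N)                # sigma_2 = 0.06
x3 = rng.normal(1.0, 0.2, N)                 # sigma_3 = 0.2

f_samples = f(x1, x2, x3)
mu_f = f_samples.mean()
sigma_f = f_samples.std(ddof=1)
reliability = (g(x1, x2, x3) <= 3.5).mean()  # fraction of samples satisfying the constraint
print(mu_f, sigma_f, reliability)            # roughly 6.1, 1.2, 0.99
```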

12.5.3 First-order Perturbation Method


Perturbation methods are based on a local Taylor’s series expansion of
the functional output. If we use a first-order Taylor’s series, assume that
each uncertain input variable is mutually independent, and assume that
the PDFs of 𝑥 𝑖 are each symmetric, then we can derive the following
expressions for the mean and variance of some output f(x):

μ_f = f(μ_x)
σ_f² = Σ_{i=1}^{n} ( (∂f/∂x_i) σ_{x_i} )²        (12.20)

where 𝑥 contains all the uncertain variables (design variables and


parameters).¶

¶ Using the covariance matrix allows for a more general expression for first-order perturbation methods.136 Higher-order Taylor's series can also be used,137 but they are less common because of their increased complexity.
136. Smith, Uncertainty Quantification: Theory, Implementation, and Applications. 2013
137. Cacuci, Sensitivity & Uncertainty Analysis, Volume 1. 2003

The first equation just says that the mean value of the function is the function evaluated at the mean value of x. You may recognize the second equation, as it is commonly used in propagating errors from experimental measurements. Its major limitations are: (1) it relies on a linearization (a first-order Taylor's series) and so will not be as accurate if the local function space is highly nonlinear; (2) it assumes that all uncertain parameters are uncorrelated, which should be true for design variables but is not necessarily true for parameters; and (3) it assumes
uncertain parameters are uncorrelated, which should be true for design Analysis, Volume 1. 2003
variables, but is not necessarily true for parameters, and 3) it assumes
symmetry, and so might be too inaccurate for something like the wind
farm example (Ex. 12.3).
We have not assumed that the input or output distributions are
normal (i.e., Gaussian). However, with a first-order series we can only
estimate mean and variance and not any of the higher moments (e.g.,
skewness, kurtosis).
As compared to Monte Carlo or other sampling methods, a pertur-
bation method can be much more efficient, but the end result is not a
distribution, but rather just some of its statistics (e.g., mean and vari-
ance). Additionally, we see that derivatives appear in our analysis. This
is desirable in the sense that the derivatives are what make this method
effective and efficient, but it also means that extra information needs
to be supplied. If, for example, we were propagating the variance of a
constraint for a reliability-based optimization, we would need to supply
gradients of the constraints for purposes of Eq. 12.20. Additionally,
if using a gradient-based method, we would need derivatives of the
whole expression, meaning second derivatives of the constraint with
respect to the uncertain variables. If we have exact gradients using the
12 Optimization Under Uncertainty 349

methods discussed in Chapter 6 then this is usually not problematic.


If not, then propagating the uncertainty inside the analysis can be
numerically challenging.
As an alternative, Parkinson et al. proposed a simpler but more
approximate alternative for reliability-based optimization where the
uncertainty is computed outside of the optimization loop and an
additional assumption is made that each constraint is normally dis-
tributed.138 If 𝑔(𝑥) is normally distributed we can rewrite the constraint 138. Parkinson et al., A General Approach
for Robust Optimal Design. 1993
Prob[𝑔(𝑥) ≤ 0] ≥ 𝑅 as:
𝑔(𝑥) + 𝑘𝜎 𝑔 ≤ 0
where 𝑘 is chosen for a desired reliability level 𝑅. For example, 𝑘 =
2 implies a reliability level of 97.72% (one-sided tail of the normal
distribution). In many cases an output distribution is reasonably
approximated as normal, but for cases with nonnormal output this
method can introduce large error.
Keep in mind that with multiple active constraints, one must be
careful to appropriately choose the reliability level for each constraint
such that the overall reliability is in the desired range. Often the
simplifying assumption is made that the constraints are uncorrelated,
and thus the total reliability is the product of the reliabilities of each
constraint.
This simplified approach has the following steps:

1. Compute the deterministic optimum.

2. Estimate the standard deviation of each constraint 𝜎 𝑔 using


Eq. 12.20.

3. Adjust the constraints to 𝑔(𝑥)+ 𝑘𝜎 𝑔 ≤ 0 for some desired reliability


level and re-optimize.

4. Repeat as necessary.

While this simplification is approximate, it is very easy to use and the


magnitude of error is usually appropriate for the conceptual design
phase. If the errors are unacceptable, then the statistical quantities can
be computed inside the optimization. Keep in mind that this approach
only applies to reliability-based optimization and would not work if
there was uncertainty in the objective.

12.5.4 Polynomial Chaos


Polynomial chaos (also known as spectral expansions) is a class of forward propagation methods that take advantage of the inherent smoothness of the outputs of interest using polynomial approximations.‖ A general function that depends on uncertain variables x can be represented as a sum of basis functions ψ_i (usually polynomials) with weights α_i:

f(x) = Σ_{i=0}^{∞} α_i ψ_i(x)        (12.21)

although in practice we truncate the series after n + 1 terms:

f(x) ≈ Σ_{i=0}^{n} α_i ψ_i(x)        (12.22)

The required number of terms for a given input dimension 𝑑, and


polynomial order 𝑝 is:
n + 1 = (d + p)! / (d! p!)        (12.23)
Different types of polynomials are used, but they are always chosen
to form an orthogonal basis. Two vectors 𝑢 and 𝑣 are orthogonal if their
dot product is zero
u⃗ · v⃗ = 0        (12.24)
For functions, the definition is similar, but functions have an infinite
dimension and so an integral is required instead of a summation. Two
functions 𝑓 and 𝑔 are orthogonal over an interval 𝑎 to 𝑏 if their inner
product is zero. Different definitions for the inner product can be
defined. The simplest is:
∫_a^b f(x) g(x) dx = 0        (12.25)

For our purposes, the weighted inner product is more relevant:


⟨f, g⟩ = ∫_a^b f(x) g(x) p(x) dx = 0        (12.26)

where 𝑝(𝑥) is a weight function, and in our case is the probability density
function. The angle bracket notation, known as an inner product, will
be used in the remainder of this section.
The intuition is similar to that of vectors. Adding a non-orthogonal
vector to a set of vectors does not increase the span of the vector space.
In other words, the new vector could have been formed by a linear
combination of existing vectors in the set and so does not add any
new information. Similarly, we want to make sure that any new basis
functions we add are orthogonal to the existing set, so that the range of
functions that can be approximated is increased.

You may be familiar with this concept from its use in Fourier series.
In fact, this method is a truncated generalized Fourier series. Recall
that a Fourier series represents an arbitrary periodic function with a
series of sinusoidal functions. The basis functions in the Fourier series
are orthogonal.
By definition we choose the first basis function to be 𝜓0 = 1. This
just means the first term in the series is a constant (polynomial of order
0). Because the basis functions are orthogonal we know that

⟨ψ_i, ψ_j⟩ = 0   if i ≠ j        (12.27)

Three main steps are involved in using these methods:

1. Select a orthogonal polynomial basis.

2. Compute coefficients to fit the desired function.

3. Compute statistics on the function of interest.

These three steps are overviewed below, though we begin with the last
step because it provides insight for the first two.

Compute Statistics
Using Eq. 12.6 the mean of the function 𝑓 is given by:

μ_f = ∫ f(x) p(x) dx        (12.28)

where 𝑝(𝑥) is the PDF for 𝑥. Using our polynomial approximation we


can express this as:
μ_f = ∫ Σ_i α_i ψ_i(x) p(x) dx        (12.29)

The coefficients 𝛼 𝑖 are constants and so can be taken out of the integral:
μ_f = Σ_i α_i ∫ ψ_i(x) p(x) dx
    = α_0 ∫ ψ_0(x) p(x) dx + α_1 ∫ ψ_1(x) p(x) dx + α_2 ∫ ψ_2(x) p(x) dx + …

Recall that 𝜓0 = 1. Also, because we can multiply all terms by 1 without


changing anything, we can rewrite this expression in terms of our
defined inner product as:

μ_f = α_0 ∫ p(x) dx + α_1 ⟨ψ_0, ψ_1⟩ + α_2 ⟨ψ_0, ψ_2⟩ + …        (12.30)

Because the polynomials are orthogonal, all of the terms except the first
are zero (see Eq. 12.27), and by definition of a PDF, we know that the
integral of 𝑝(𝑥) over the domain must be one (see Eq. 12.5). Thus, we
have the simple result that the mean of the function is simply given by
the zeroth coefficient:
𝜇 𝑓 = 𝛼0 (12.31)
Using a similar approach we can derive a formula for the variance.
By definition, the variance is (Eq. 12.8):

σ_f² = ∫ f(x)² p(x) dx − μ_f²        (12.32)

Substituting our polynomial representation and using the same techniques used in deriving the mean:

σ_f² = ∫ ( Σ_i α_i ψ_i(x) )² p(x) dx − α_0²
     = Σ_i α_i² ∫ ψ_i(x)² p(x) dx − α_0²
     = α_0² ∫ ψ_0² p(x) dx + Σ_{i=1}^{n} α_i² ∫ ψ_i(x)² p(x) dx − α_0²
     = α_0² + Σ_{i=1}^{n} α_i² ∫ ψ_i(x)² p(x) dx − α_0²
     = Σ_{i=1}^{n} α_i² ∫ ψ_i(x)² p(x) dx
     = Σ_{i=1}^{n} α_i² ⟨ψ_i²⟩

The inner product ⟨ψ_i²⟩ = ⟨ψ_i, ψ_i⟩ can generally be computed analytically, and if not, it can be computed easily through quadrature because it only involves the basis functions. For multiple uncertain variables, the formulas are the same:

μ_f = α_0        (12.33)
σ_f² = Σ_{i=1}^{n} α_i² ⟨Ψ_i²⟩        (12.34)

except that Ψ𝑖 are multidimensional basis polynomials defined by


products of single dimensional polynomials as will be shown in the next

section. The main takeaway is that the polynomial chaos formulation


allows us to compute the mean and variance easily, using our definition
of the inner product, and other statistics can be estimated by sampling
the polynomial expansion.

Selecting an Orthogonal Polynomial Basis


Referring again to Eq. 12.26, we need to find polynomials that satisfy the orthogonality relationship for a particular probability density function. For some continuous probability distributions, corresponding orthogonal polynomials are already known.∗∗ Table 12.1 summarizes the polynomials that correspond to some common probability distributions.140 Referring again to Eq. 12.26, the polynomials correspond to f(x) and g(x), the probability distribution is p(x), and the support range forms the limits of integration a and b. For a general distribution, e.g., one that was empirically derived, the orthogonal polynomials should be generated numerically to preserve exponential convergence.140

∗∗ Actually other polynomials can be used, but these particular choices are optimal as they produce exponential convergence.
140. Eldred et al., Evaluation of Non-Intrusive Approaches for Wiener-Askey Generalized Polynomial Chaos. 2008

Table 12.1: Orthogonal polynomials that correspond to some common probability


distributions.
Prob. Distribution Polynomial Support Range
Normal Hermite [−∞, ∞]
Uniform Legendre [−1, 1]
Beta Jacobi [−1, 1]
Exponential Laguerre [0, ∞]
Gamma Generalized Laguerre [0, ∞]

Example 12.8: Legendre polynomials.

The first few Legendre polynomials are:

$$\psi_0 = 1, \quad \psi_1 = x, \quad \psi_2 = \tfrac{1}{2}\left(3x^2 - 1\right), \quad \psi_3 = \tfrac{1}{2}\left(5x^3 - 3x\right), \quad \ldots \qquad (12.35)$$

These polynomials are plotted in Fig. 12.15, and are orthogonal with respect to
a uniform probability distribution.

[Figure 12.15: The first few Legendre polynomials.]
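The orthogonality relationship is easy to verify numerically. The following sketch (an illustration, not part of the text) builds the Gram matrix of the first four Legendre polynomials under the uniform PDF on [−1, 1] using Gauss–Legendre quadrature; the off-diagonal entries come out as zero to machine precision.

```python
# Sketch: verify orthogonality of Legendre polynomials with respect to the
# uniform PDF p(x) = 1/2 on [-1, 1] via Gauss-Legendre quadrature.
import numpy as np
from numpy.polynomial import legendre as L

nodes, weights = L.leggauss(8)          # 8-point quadrature on [-1, 1]

def inner(i, j):
    # <psi_i, psi_j> = int psi_i(x) psi_j(x) p(x) dx, with p(x) = 1/2
    pi = L.legval(nodes, np.eye(4)[i])  # coefficient vector selecting psi_i
    pj = L.legval(nodes, np.eye(4)[j])
    return np.sum(weights * pi * pj) * 0.5

gram = np.array([[inner(i, j) for j in range(4)] for i in range(4)])
print(np.round(gram, 6))                # diagonal: 1/(2i+1); off-diagonal: ~0
```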

Multidimensional basis functions are simply defined by tensor


products. For example, if we had two variables from a uniform
probability distribution (and thus Legendre bases), then the first few
polynomials, up through second-order terms, are:

$$\begin{aligned}
\Psi_0(x) &= \psi_0(x_1)\,\psi_0(x_2) = 1 \\
\Psi_1(x) &= \psi_1(x_1)\,\psi_0(x_2) = x_1 \\
\Psi_2(x) &= \psi_0(x_1)\,\psi_1(x_2) = x_2 \\
\Psi_3(x) &= \psi_1(x_1)\,\psi_1(x_2) = x_1 x_2 \\
\Psi_4(x) &= \psi_2(x_1)\,\psi_0(x_2) = \tfrac{1}{2}\left(3x_1^2 - 1\right) \\
\Psi_5(x) &= \psi_0(x_1)\,\psi_2(x_2) = \tfrac{1}{2}\left(3x_2^2 - 1\right)
\end{aligned}$$
Note that 𝜓1 𝜓2 , for example, does not appear in the above list because
it is a third-order polynomial and we truncated the series at second-
order terms. This number of basis functions is expected: Eq. 12.23, with
𝑑 = 2 and 𝑝 = 2, gives six basis functions.
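A minimal sketch of this construction is shown below for two uniform variables, assuming Legendre bases; the multi-index enumeration (and therefore the ordering) is arbitrary, but the six basis functions are the same products listed above.

```python
# Sketch: two-dimensional Legendre basis functions of total order <= 2,
# built as tensor products of one-dimensional polynomials.
import itertools
import numpy as np
from numpy.polynomial import legendre as L

def psi(i, x):
    """1-D Legendre polynomial of degree i evaluated at x."""
    return L.legval(x, np.eye(i + 1)[i])

order = 2
# multi-indices (i1, i2) with total degree i1 + i2 <= 2  ->  6 basis functions
multi_indices = [m for m in itertools.product(range(order + 1), repeat=2)
                 if sum(m) <= order]

def Psi(k, x1, x2):
    i1, i2 = multi_indices[k]
    return psi(i1, x1) * psi(i2, x2)

print(multi_indices)        # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
print(Psi(4, 0.3, -0.5))    # e.g. psi_1(0.3) * psi_1(-0.5)
```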

Determine Coefficients
With the polynomial basis 𝜓 𝑖 fixed, we need to determine the appro-
priate coefficients 𝛼 𝑖 in Eq. 12.22. We discuss two ways to do this. The
first is with quadrature and is also known as non-intrusive spectral pro-
jection. The second is with regression and is also known as stochastic
collocation.
First, let’s use the quadrature approach. Beginning with the polyno-
mial approximation:
$$f(x) = \sum_i \alpha_i \psi_i(x) \qquad (12.36)$$

we take the inner product of both sides with respect to 𝜓 𝑗 .


$$\langle f(x), \psi_j \rangle = \sum_i \alpha_i \langle \psi_i, \psi_j \rangle \qquad (12.37)$$

Making use of the orthogonality of the basis functions (Eq. 12.27) we


see that all of the terms in the summation are zero except:

$$\langle f(x), \psi_i \rangle = \alpha_i \left\langle \psi_i^2 \right\rangle \qquad (12.38)$$

or
$$\alpha_i = \frac{1}{\left\langle \psi_i^2 \right\rangle} \int f(x)\, \psi_i(x)\, p(x)\,\mathrm{d}x \qquad (12.39)$$

Note that, as expected, the zeroth coefficient is simply the definition of


the mean:
$$\alpha_0 = \int f(x)\, p(x)\,\mathrm{d}x \qquad (12.40)$$

The remaining coefficients must be obtained through multidimen-


sional quadrature using the same types of approaches as discussed
in Section 12.5.1 (or even Section 12.5.2). This means that this ap-
proach inherits the same limitations of the chosen quadrature approach,
though the process can be more efficient if the distributions are well
approximated by the selected basis functions.
It may appear that to estimate 𝑓 (𝑥) (Eq. 12.22) we need to know 𝑓 (𝑥)
(Eq. 12.39). The distinction is that we only need to be able to evaluate
𝑓 (𝑥) at some predefined quadrature points, which in turn gives a
polynomial approximation for all of 𝑓 (𝑥).
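The sketch below illustrates the projection of Eq. 12.39 for a single uniform input on [−1, 1]; the function f is a made-up stand-in for the real model, and the quadrature rule and polynomial degree are arbitrary choices for this example.

```python
# Sketch of non-intrusive spectral projection (Eq. 12.39) for one uniform input.
import numpy as np
from numpy.polynomial import legendre as L

def f(x):
    return np.exp(0.7 * x) + 0.3 * x**2          # hypothetical model

n = 4                                            # highest polynomial degree
nodes, weights = L.leggauss(n + 1)               # quadrature points where f is evaluated
fvals = f(nodes)

alpha = np.zeros(n + 1)
for i in range(n + 1):
    psi_i = L.legval(nodes, np.eye(n + 1)[i])
    norm_i = 1.0 / (2 * i + 1)                   # <psi_i^2> under p(x) = 1/2
    # alpha_i = (1 / <psi_i^2>) * int f(x) psi_i(x) p(x) dx
    alpha[i] = np.sum(weights * fvals * psi_i) * 0.5 / norm_i

print(alpha)                                     # alpha[0] approximates the mean of f
```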
The second approach to determining the coefficients is regression.
Equation 12.22 is linear in the coefficients, so we can estimate them
using least squares (although an underdetermined system with
regularization can be used as well). This means that we evaluate the
function 𝑚 times, with 𝑥 (𝑖) denoting the 𝑖th sample, resulting in the
linear system
$$\begin{bmatrix} \psi_0\left(x^{(1)}\right) & \cdots & \psi_n\left(x^{(1)}\right) \\ \vdots & \ddots & \vdots \\ \psi_0\left(x^{(m)}\right) & \cdots & \psi_n\left(x^{(m)}\right) \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \vdots \\ \alpha_n \end{bmatrix} = \begin{bmatrix} f\left(x^{(1)}\right) \\ \vdots \\ f\left(x^{(m)}\right) \end{bmatrix} \qquad (12.41)$$

Generally, we would like 𝑚, the number of sample points, to be at least


twice as large as 𝑛 + 1, the number of unknowns. Determining where to
sample, also known as the collocation points, is an important element
for this procedure to be effective.††

†† Several packages exist to facilitate use of polynomial chaos methods.141, 142
141. Adams et al., Dakota, A Multilevel Parallel Object-Oriented Framework for Design Optimization, Parameter Estimation, Uncertainty Quantification, and Sensitivity Analysis: Version 6.0 User's Manual. 2015
142. Feinberg et al., Chaospy: An open source tool for designing methods of uncertainty quantification. 2015
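A minimal regression sketch for the same hypothetical one-dimensional problem is shown below; here the collocation points are simply drawn at random, whereas a careful implementation would choose them more deliberately.

```python
# Sketch of the regression (stochastic collocation) approach of Eq. 12.41.
import numpy as np
from numpy.polynomial import legendre as L

def f(x):
    return np.exp(0.7 * x) + 0.3 * x**2          # hypothetical model

n = 4                                            # highest polynomial degree
m = 2 * (n + 1)                                  # at least twice the number of unknowns
x_samples = np.random.uniform(-1.0, 1.0, m)      # collocation points (random here)

# Row k of the matrix is [psi_0(x^(k)), ..., psi_n(x^(k))]
Psi = np.column_stack([L.legval(x_samples, np.eye(n + 1)[i]) for i in range(n + 1)])
alpha, *_ = np.linalg.lstsq(Psi, f(x_samples), rcond=None)
print(alpha)
```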


12.6 Summary

Engineering problems are subject to variation under uncertainty. Op-
timizing when design variables and/or parameters have uncertain
variability is called optimization under uncertainty. Robust optimiza-
tion seeks designs that are less sensitive to inherent variability in the
objective function. Common OUU objectives include minimizing the
mean or standard deviation, or performing multiobjective tradeoffs
in mean and standard deviation. Reliable optimization seeks designs
that have a reduced probability of failure due to variability in the con-
straints. In both scenarios, we need methods that propagate probability
distributions of the inputs (e.g., design variables) to statistics and in

some cases probability distributions of the outputs (e.g., objective and


constraints). This procedure is called forward propagation.
Four classes of forward propagation were discussed in this chapter.‡‡

‡‡ This list is not exhaustive. For example, all of the methods discussed in this chapter are non-intrusive. Like intrusive methods for derivative computation, intrusive methods for forward propagation require more upfront work but are more accurate and efficient.

Direct quadrature uses numerical quadrature to evaluate the mean
and variance of outputs. This process is relatively straightforward and
effective. Its primary weakness is that it is limited to low-dimensional
problems (number of stochastic inputs). Sparse grids extend the
dimensional range, but not greatly.
Monte Carlo methods approximate the integrals (and output distri-
butions) using random sampling and the law of large numbers. These
methods are extremely easy to use and are independent of problem
dimension. Their major weakness is that they are inefficient, though
many of the alternatives are completely intractable at high dimen-
sions, making this an appropriate choice for many high-dimensional
problems.
Perturbation methods use a Taylor’s series expansion of the output
functions to estimate the mean and variance. These methods can be
efficient across a range of problem sizes, especially if accurate derivatives
are available. Their main weaknesses are that they require derivatives
(and hence second derivatives if using a gradient-based optimization
method), only work with symmetric input probability distributions,
and only provide the mean and variance (for first-order methods).
Polynomial chaos represents uncertain variables as a sum of or-
thogonal basis functions. This method is often a more efficient way to
characterize both statistical moments as well as output distributions.
However, the methodology is more complex and is generally limited to
a small number of dimensions because the number of required basis
functions grows exponentially.

Problems

12.1 Answer true or false and justify your answer.

a) Optimizing under uncertainty yields an objective value that


is no better than without considering uncertainty.
b) Optimization under uncertainty considers uncertainty in the
design variables as well as uncertainty in the models that
compute the objective and constraint functions.
c) The greater the reliability, the less likely the design is to have
a worse objective function value.

d) Reliability can be handled in a deterministic way using safety


factors, which ensure that the optimum has some margin
before the original constraint is violated.
e) Robust design can be obtained by simultaneously optimizing
a design for multiple conditions without using uncertainty
propagation.
f) Forward propagation computes the probability density func-
tions of the outputs and inputs for a given numerical model.
g) The computational cost of direct quadrature scales expo-
nentially with the number of random variables, while the
cost of Monte Carlo is independent of the number of random
variables.
h) Monte Carlo methods approximate probability density func-
tions using random sampling and converge slowly.
i) The first-order perturbation method computes the probabil-
ity density functions using local Taylor’s series expansions.
j) Since the first-order perturbation method requires first-order
derivatives to compute the uncertainty metrics, optimization
under uncertainty using the first-order perturbation method
requires second-order derivatives.
k) Polynomial chaos is a forward propagation technique that
uses polynomial approximations with random coefficients
to model the input uncertainties.
l) The number of basis functions required by polynomial chaos
grows exponentially with the number of uncertain input
variables.

12.2 Simplified reliability-based optimization. This problem uses the sim-


plified version of a first-order perturbation method (Section 12.5.3).
The optimization problem below is a QP for simplicity, but ignore
that structure and solve it as a general nonlinear problem so
that you could reuse the approach (i.e., use a general nonlinear
optimizer and a method for estimating gradients).
For the following problem:

$$\begin{aligned}
\text{minimize} \quad & f = x_1^2 + 2x_2^2 + 3x_3^2 \\
\text{subject to} \quad & g_1: \; 2x_1 + x_2 + 2x_3 \ge 6 \\
& g_2: \; -5x_1 + x_2 + 3x_3 \le -10
\end{aligned}$$

a) Find the deterministic optimum.



b) Find the worst-case, reliable optimum where Δ𝑥1 = Δ𝑥2 =


±0.1, Δ𝑥3 = ±0.05
c) Now, instead of the worst case tolerances, assume the vari-
ables are normally distributed with 𝜎𝑥1 = 𝜎𝑥2 = ±0.033,
𝜎𝑥3 = ±0.0167 (these values are 𝜎𝑖 = Δ𝑖 /3). Find the reliable
optimum where the target constraint reliability is 99.865%
for each constraint individually.
d) Compare the total target reliability with a Monte Carlo simu-
lation of reliability for all three approaches (using the normal
distributions for the input variations).
e) Briefly discuss any lessons learned.

12.3 Robust optimization of a wind farm.


We want to find the optimal turbine layout for a wind farm in
order to minimize cost of energy (COE). We will consider a very
simplified wind farm with only three wind turbines. The first
turbine will be fixed at (0, 0) and the 𝑥-positions of the back two
turbines will be fixed with 4 diameter spacing between them.
The only thing we can change is the 𝑦-position of the two back
turbines as shown in Fig. 12.16 (all dimensions in this problem
are in terms of rotor diameters). In other words, we just have two
design variables: 𝑦2 and 𝑦3 .

[Figure 12.16: Wind farm layout.]

We further simplify by assuming the wind always comes from


the west as shown in the figure, and is always at a constant speed.
The wake model has a few parameters that define things like its
spread angle and decay rate. We will refer to these parameters as
𝛼, 𝛽, and 𝛿 (knowing exactly what each parameter corresponds to
is not important for our purposes). The supplementary resources
repository contains code for this problem.

a) Run the optimization deterministically. In other words we


will assume that the three wake parameters are deterministic:

𝛼 = 0.1, 𝛽 = 9, 𝛿 = 5. Because there are several possible


similar solutions we will add the following constraints:

𝑦𝑖 > 0 (bound)
𝑦3 > 𝑦2 (linear)

Don’t use [0, 0] as the starting point for the optimization, as


that occurs right at a flat spot in the wake (a fixed point) and
so you might not make any progress. Report the optimal
spacing that you find.
b) Now the wake parameters are not deterministic, but are
rather uncertain variables under some probability distribu-
tion. Specifically we have the following information for the
three parameters:
• 𝛼 is governed by a Weibull distribution with a scale
parameter of 0.1, and a shape parameter of 1.
• 𝛽 is given by a Normal distribution with a mean and
standard deviation of 𝜇=9, 𝜎=1
• 𝛿 is given by a Normal distribution with a mean and
standard deviation of 𝜇=5, 𝜎=0.4
Note that the mean for all of these distributions corresponds
to the deterministic value we used previously.
Using a Monte Carlo method, run an optimization under
uncertainty minimizing the 95th percentile for COE.
c) Once you have completed both optimizations, you should
perform a cross analysis by filling out the four numbers in
the table below. You take the two optimal designs that you
found, and you compare each on the two objectives (deter-
ministic, and 95th percentile). The first row corresponds to
the performance of the optimal deterministic layout. Evalu-
ate the performance of this layout using the deterministic
value for COE, and the 95th percentile that accounts for
uncertainty. Repeat for the optimal solution for the OUU
case. Discuss your findings.

                         Deterministic COE    95th percentile COE
Deterministic Layout     [        ]           [        ]
OUU Layout               [        ]           [        ]
13 Multidisciplinary Design Optimization
As mentioned in Chapter 1, most engineering systems are multidis-
ciplinary, which means that the system is composed of different dis-
ciplines that are coupled. We prefer the term component instead of
“discipline” because it is more general, but we use both terms inter-
changeably depending on the context. When components in a system
represent different physics, the term “multiphysics” is commonly used.
All the optimization methods covered so far apply to multidisci-
plinary problems if we view the coupled multidisciplinary model as
just another model that computes the objective and constraint functions
for a given set of design variables. However, there are additional con-
siderations in the solution, derivative computation, and optimization
of coupled systems.
In this chapter, we build on Chapter 3 by introducing coupled
models and solvers for coupled systems. We also expand the derivative
computation methods of Chapter 6 to the coupled system case. Finally,
we introduce various MDO architectures, which are different options for
formulating and solving an MDO problem.

By the end of this chapter you should be able to:

1. Describe when and why one might want to use different


optimization architectures.

2. Understand the differences between monolithic and dis-


tributed architectures.

3. Read XDSM diagrams.

4. Understand how derivatives are computed in coupled


models.

13.1 Motivation∗

∗ Kroo143 describes many of the early challenges for large-scale MDO and some proposed solutions.
143. Kroo, MDO for Large-Scale Design. 1997

13.2 MDO Problem Representation

The MDO problem representation we use here is shown in Fig. 13.1


for a general three-component system. Here we use the functional
representation introduced in Section 13.3.2, where the states in each
component are hidden and we just see its output as a coupling variable
at the system level.

[Figure 13.1: MDO problem nomenclature and dependencies.]

In MDO problems, we make the distinction between local design


variables, which directly affect only one component, and global design
variables, which directly affect more than one component. We denote
the vector of design variables local to component 𝑖 by 𝑥 𝑖 and global
variables by 𝑥0 . The full vector of design variables is given by
$x = \left[x_0^T, x_1^T, \ldots, x_N^T\right]^T$.
The set of constraints is also split into global constraints and local
ones. Local constraints are computed by the corresponding component
and depend only on the variables available in that component. Global
constraints depend on more than one set of coupling variables. These
dependencies are also shown in Fig. 13.1.

13.3 Multidisciplinary Models

In general, the models that make up a multidisciplinary system are coupled as shown in Fig. 13.2 for


a three-component case. Here, the states of each component affect all
other components, but it is common for a component to depend only
on a subset of the other system components.

[Figure 13.2: Multidisciplinary model composed of three numerical models. Each model solves for its own state variable vector (𝑢1 , 𝑢2 , 𝑢3 ) but in general requires the other state vectors as inputs. This set of models would replace the single model in Fig. 3.17.]

Example 13.1: Defining a multidisciplinary optimization problem.

Consider a multidisciplinary numerical model for an aircraft wing, where


the aerodynamics and structures disciplines are coupled to solve an aerostruc-
tural analysis and design optimization problem. For a given flow condition,
the aerodynamic solver computes the forces on the wing for a given wing
shape, while the structural solver computes the wing displacement for a
given set of applied forces. Thus, these two disciplinary models are coupled
as shown in Fig. 13.3. For a steady flow condition, there is only one wing
shape and a corresponding set of forces that satisfies both disciplinary models
simultaneously.
One possible design optimization problem based on these models would
be to minimize the drag by changing the wing shape and the structural sizing,
while satisfying a lift constraint and structural stress constraints. In this case, the
structural sizing variables are local design variables because only the structural
solver is directly affected by those variables. However, the wing shape variables
are global design variables because they affect both the aerodynamics and the
structure.

[Figure 13.3: Multidisciplinary numerical model for an aircraft wing.]

Mathematically, a multidisciplinary model is no more than a larger



set of equations to be solved, where all the governing equation residuals


(𝑟), the corresponding state variables (𝑢), and all the design variables
(𝑥) are concatenated into single vectors. Then, we can still just write
the whole multidisciplinary model as 𝑟(𝑥, 𝑢) = 0.
However, it is often necessary or advantageous to partition the sys-
tem into smaller components for three main reasons. First, specialized
solvers are often already in place for a given set of governing equations,
which may be more efficient at solving their set of equations than a
general-purpose solver. In addition, some of these solvers might be
black boxes that do not provide an interface for using alternative solvers.
Second, there is an incentive for building the multidisciplinary system
in a modular way. For example, a component might be useful on its own
and should therefore be usable outside the multidisciplinary system. A
modular approach also facilitates the extension of the multidisciplinary
system and makes it easy to replace the model of a given discipline
with an alternative one. Finally, the overall system of equations may
be more efficiently solved if it is partitioned in a way that exploits the
system structure (see Section 13.5.4).

13.3.1 Components
In Section 3.3, we explained how all models can ultimately be written
as a system of residuals, 𝑟(𝑥, 𝑢) = 0 . When the system is large or
includes sub-models, it might be natural to partition the system into
components. We prefer to use the more general term components instead
of disciplines to refer to the sub-models resulting from the partitioning
because the partitioning of the overall model is not necessarily by
discipline (e.g., aerodynamics, structures). A system model might also
be partitioned by physical system component (e.g., wing, fuselage, or
an aircraft in a fleet) or by different conditions applied to the same
model (e.g., aerodynamic analyses at different flight conditions).
The partitioning can also be performed within a given discipline for
the same reasons cited above. In theory, the system model equations in
𝑟(𝑥, 𝑢) = 0 can be partitioned in any way, but only some partitions are
advantageous or make sense. The partitioning can also be hierarchical,
where a given component has one or multiple levels of sub-components.
Again, this might be motivated by efficiency, modularity, or both.

We denote a partitioning into 𝑛 components as






$$r(u) = 0 \;\;\Leftrightarrow\;\; \begin{cases} r_1(u_1, \ldots, u_i, \ldots, u_n) = 0 \\ \qquad\vdots \\ r_i(u_1, \ldots, u_i, \ldots, u_n) = 0 \\ \qquad\vdots \\ r_n(u_1, \ldots, u_i, \ldots, u_n) = 0 \end{cases} \qquad (13.1)$$

where each 𝑟 𝑖 and 𝑢𝑖 are vectors corresponding to the residuals and
states of component 𝑖. Here, we assume that each component can drive
its residuals to zero by varying only its states, although this is not
guaranteed in general. In the above, we have omitted the dependency
on 𝑥 because for now, we are just concerned with finding the state
variables that solve the governing equations for a fixed design. A
generic example with three components was illustrated in Fig. 13.2.
Components can be either implicit or explicit, a concept we intro-
duced in Section 3.3. To solve an implicit component, we need an
algorithm for driving the equation residuals, 𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑖 , . . . , 𝑢𝑛 ) = 0,
to zero by varying the states 𝑢𝑖 , while the other states remain fixed.
This algorithm could involve a matrix factorization in the case of a
linear system or a Newton solver for the case of a nonlinear system.
An explicit component is much easier to solve because the state of the
component is an explicit function of the states of the other components
𝑢𝑖 = 𝑔(𝑢 𝑗 ) for all 𝑗 ≠ 𝑖. This can be computed without factorization or
iteration. There is no loss of generality with the residual notation above
because the explicit component can be written as,

𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑛 ) = 𝑢𝑖 − 𝑔(𝑢 𝑗 ) = 0 ∀𝑗 ≠ 𝑖. (13.2)

Most disciplines involve a mix of implicit and explicit components


because, as mentioned in Section 3.3 and shown in Fig. 3.17, the
state variables are implicitly defined, while the objective function and
constraints are usually explicit functions of the state variables. In
addition, a discipline usually includes functions that translate inputs
and outputs, as discussed next.

Example 13.2: Residuals of the coupled aerostructural problem.

Let us formulate a model for the aerostructural problem described in Ex. 13.1.
A possible model for the aerodynamics is a lifting line model given by the
linear system,
𝐴Γ = 𝛾, (13.3)
where 𝐴 is the matrix of aerodynamic influence coefficients and 𝛾 is a vector,
both of which depend on the wing shape. The state Γ is a vector that represents

the circulation (vortex strength) at each spanwise position on the wing. The
lift and drag scalars can be computed explicitly for a given Γ, so we will write
these dependencies as 𝐿 = 𝐿(Γ) and 𝐷 = 𝐷(Γ), omitting the detailed explicit
expressions for conciseness.
A possible model for the structures is a cantilevered beam modeled with
Euler–Bernoulli elements,
𝐾𝑑 = 𝑓 , (13.4)
where 𝐾 is the stiffness matrix, which depends on the beam shape and sizing.
The right-hand-side vector represents the applied forces at each spanwise position
on the beam. The states 𝑑 are the displacements and rotations of each element.
The weight does not depend on the states and it is an explicit function of the
beam sizing and shape, so it does not involve the structural model (13.4). The
stresses are an explicit function of the displacements, so we can write 𝜎 = 𝜎(𝑑),
where 𝜎 is a vector whose size is the number of elements.
When we couple these two models, 𝐴 and 𝛾 depend on the wing displace-
ments 𝑑, and 𝑓 depends on the Γ. We can write all the implicit and explicit
equations as residuals:
𝑟1 = 𝐴(𝑑)Γ − 𝛾(𝑑) = 0
𝑟2 = 𝐿 − 𝐿(Γ) = 0
𝑟3 = 𝐷 − 𝐷(Γ) = 0 (13.5)
𝑟4 = 𝐾𝑑 − 𝑓 (Γ) = 0
𝑟5 = 𝜎 − 𝜎(𝑑) = 0.
We used Eq. (13.2) to transform the explicit equations into residuals for 𝑟2 , 𝑟3 ,
and 𝑟5 . The states of this system are
$$u = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{bmatrix} \equiv \begin{bmatrix} \Gamma \\ L \\ D \\ d \\ \sigma \end{bmatrix}. \qquad (13.6)$$
Because 𝑢2 , 𝑢3 , and 𝑢5 can be explicitly determined from 𝑢1 and 𝑢4 , we could
solve only for 𝑟1 and 𝑟4 and then use the explicit expressions. However, it
can be convenient to write all equations as a single vector 𝑟(𝑢) = 0.
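The sketch below illustrates this idea with made-up 2×2 matrices standing in for A and K: all implicit and explicit equations are stacked into one residual vector r(u) = 0, which a generic nonlinear solver can then drive to zero.

```python
# Minimal sketch of Ex. 13.2 with made-up 2x2 matrices standing in for A and K.
# State ordering: u = [Gamma (2), L, D, d (2), sigma (2)].
import numpy as np
from scipy.optimize import fsolve

A = np.array([[1.0, 0.2], [0.2, 1.0]])      # assumed aerodynamic influence matrix
K = np.array([[2.0, -1.0], [-1.0, 2.0]])    # assumed stiffness matrix

def residuals(u):
    Gamma, L, D, d, sigma = u[0:2], u[2], u[3], u[4:6], u[6:8]
    gamma = np.array([1.0, 0.8]) + 0.1 * d   # right-hand side depends on displacements
    f = 0.5 * Gamma                          # structural loads depend on circulation
    return np.concatenate([
        A @ Gamma - gamma,                   # r1: implicit aerodynamic model
        [L - Gamma.sum()],                   # r2: explicit lift
        [D - 0.05 * Gamma @ Gamma],          # r3: explicit drag
        K @ d - f,                           # r4: implicit structural model
        sigma - 10.0 * d,                    # r5: explicit stresses
    ])

u = fsolve(residuals, np.zeros(8))           # generic solver drives r(u) to zero
print(np.abs(residuals(u)).max())            # ~0 at the coupled solution
```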

13.3.2 System-level representations and coupling variables


In MDO, coupling variables (𝑦) are variables that need to be passed from
one component to the other due to interdependencies in the system.
Sometimes, the coupling variables are just the state variables of one
component (or a subset of these) that get passed to another compo-
nent. In the general case for a component 𝑖, there is an intermediate
explicit function (𝑃𝑖 ) that translates the inputs coming from the other

components (𝑦 𝑗≠𝑖 ) to the required inputs 𝑝 𝑖 , as shown in Fig. 13.4. After


the component solves for its state variables 𝑢𝑖 , there might be another
function (𝑄 𝑖 ) that converts these states to output variables for other
components. The function (𝑄 𝑖 ) typically reduces the number of output
variables relative to the number of internal states, sometimes by orders
of magnitude.

[Figure 13.4: In the general case, a solver includes a conversion of inputs and outputs distinct from its states.]

The system level representation of a coupled system is determined


by the variables that are “seen” and controlled at this level. These
variables are the system level variables that the system level solver is
responsible for. If the box shown in Fig. 13.4 is viewed as a black box,
then the internal states 𝑢𝑖 would be hidden at the system level, and the
relationship between its inputs and outputs can be represented by a
single function as 𝑦 𝑖 = 𝑌𝑖 (𝑦 𝑗≠𝑖 ). We call this the functional representation
of a coupled system. If a component is a black box and we have no
access to the residuals and the translation functions, this is the only
representation we get to see. This functional representation can be
written as
$$y = Y(y) \;\;\Leftrightarrow\;\; \begin{cases} y_1 = Y_1(y_2, \ldots, y_n) \\ \qquad\vdots \\ y_i = Y_i(y_{j \neq i}) \\ \qquad\vdots \\ y_n = Y_n(y_1, \ldots, y_{n-1}) \end{cases} \qquad (13.7)$$

In this representation, we do not use the component residuals
associated with the state variables, so we need some other residuals
associated with the coupling variables. Thus, the residuals of the
functional representation can be written as

𝑅 𝑖 = 𝑦 𝑖 − 𝑌𝑖 (𝑦 𝑗≠𝑖 ) = 0, (13.8)

where 𝑦 𝑖 are the guesses for the coupling variables and 𝑌𝑖 are the actual
computed values.
The residual representation of the coupled system is an alternative
system level representation, where a general component, including

the translation functions, is represented by a set of residuals and


corresponding states, i.e., 𝑅(𝑢) = 0, as written earlier in Eq. 13.1.
This residual form is desirable because as we will see later in this
chapter, this enables us to formulate an efficient general way of solving
coupled systems and computing their derivatives. A system can be
converted to residuals and states by converting the functions (which
are explicit) to residuals using Eq. 13.2. The result is a component
with three subcomponents as shown in Fig. 13.5. This hints at another
powerful concept that we will use later, which is the concept of hierarchy,
where components can contain sub-components and multiple levels
are present.

[Figure 13.5: The translation of inputs and outputs can be represented as components with their own state variables, so any coupled system can be written as 𝑅(𝑢) = 0.]

An example of a system with three solvers is shown in Fig. 13.6. On


the left figure, we show the three solvers written as a set of residuals and
states, including the translation functions. Each solver in this general
case requires three components to represent it. In the case where the
solver is a black box, the governing equations, residuals, and translation
functions are hidden, and all we see are the coupling variables. In an
even more general case, these two views can be mixed, where some
solvers have exposed residuals and states, while others do not.
Furthermore, there might be translation functions between black boxes
that are exposed.

13.3.3 Coupled System Representations


To show how multidisciplinary systems are coupled, we use a de-
sign structure matrix (DSM), sometimes referred to as a dependency
structure matrix or an 𝑁 2 -diagram. An example of the DSM for a
hypothetical system is shown in Fig. 13.7a. In this matrix, the diagonal
elements represent the components, while the off-diagonal entries
denote coupling variables. A given coupling variable is computed by
the component in its row and is passed to the component in its column.†
As shown in Fig. 13.7a, there are in general off-diagonal entries both
above and below the diagonal, where the entries above feed forward,
while entries below feed backward.

† In some of the DSM literature this definition is reversed, where “row” and “column” are interchanged, resulting in a transposed matrix.

[Figure 13.6: Two system-level views of a coupled system with three solvers: all components exposed and written as residuals and states (left), and a black-box representation where only the inputs and outputs of each solver are visible (right), with 𝑦1 = 𝑢3 , 𝑦2 = 𝑢6 , and 𝑦3 = 𝑢9 .]

The mathematical representation of these dependencies is given by
a graph (Fig. 13.7b), where the graph nodes are the components and the
edges represent the information dependency. This graph is a directed
graph because in general there are three possibilities for a coupling: a
single coupling one way or the other, or a two-way coupling. A directed
graph is said to be cyclic when there are edges that form a closed loop,
or cycles. In the example of Fig. 13.7b, there is a single cycle between
components B and C. When there are no closed loops, the graph is
acyclic. In this case, the whole system can be solved by solving each
component in turn, without having to iterate. A graph can also be
represented using an adjacency matrix (Fig. 13.7c), which has the same
structure as the transpose of the DSM.
The adjacency matrix for real-world systems is often a sparse ma-
trix, that is, it has many zeros in its entries. This means that in the
corresponding DSM, each component depends only on a subset of all
the other components. We can take advantage of the structure of this
sparsity in the solution of coupled systems.
The DSM shows only data dependencies. We now introduce an
extended version of the DSM, called XDSM, which we use later in this
chapter to show process in addition to the data dependencies. Fig. 13.8
shows the XDSM for the same four-component system. When showing
only the data dependencies, the only difference relative to the DSM is that
the coupling variables are labeled explicitly, and the data paths are
drawn. In the next section, we add process to the XDSM.

[Figure 13.8: XDSM showing data dependencies for the four-component coupled system of Fig. 13.7.]

[Figure 13.7: Different representations of the dependencies of a hypothetical system: (a) design structure matrix, (b) directed graph, (c) adjacency matrix.]

13.3.4 Solving Coupled Numerical Models


When considering the solution of coupled systems, also known as
multidisciplinary analysis (MDA), we usually assume that a solver
already exists that determines the states for each component and the
coupling variables it provides to the coupled system. Thus, in the
system-level view, we only deal with the coupling variables (denoted as
𝑦) and the internal states (𝑢) are hidden.
The most straightforward way to solve coupled numerical models
is through a fixed-point iteration, which is analogous to the fixed-point
iteration methods mentioned in Section 3.7 and detailed in Appendix B.
The difference here is that instead of updating one state at a time, we
update a vector of coupling variables at each iteration corresponding
to a subset of the coupling variables in the overall coupled system.
Obtaining this vector of coupling variables in general involves the
solution of a nonlinear system. Therefore, these are called nonlinear
block variants of the linear fixed-point iteration methods.
The nonlinear block Jacobi method requires a guess for all coupling
variables to start with and calls for the solution of all components given
those guesses. Once all components have been solved, the coupling
variables are updated based on the new values computed by the
components, and all components are solved again. This iterative process
continues until the coupling variables do not change in subsequent
iterations. Because each component takes the coupling variable values
from the previous iteration, which have already been computed, all
components can be solved in parallel without communication. This
algorithm is formalized in Alg. 13.3. When applied to a system of
components, we call it the block-Jacobi method, where “block” refers
to each component.
The nonlinear block Jacobi method is also illustrated using an XDSM
in Fig. 13.9 for three components. The only input is the guess for the
coupling variables, 𝑢 𝑡 . The MDA block (step 0) is responsible for
iterating the system-level analysis loop and for checking if the system

has converged. The process line is shown as the thin black line to
distinguish from the data dependency connections (thick gray lines),
and follows the sequence of numbered steps. The analyses for each
component are all numbered the same (step 1), because they can be
done in parallel. Each component returns the coupling variables it
computes to the MDA iterator, closing the loop between step 2 and
step 1 (denoted as 2 → 1).

[Figure 13.9: A nonlinear block Jacobi multidisciplinary analysis process to solve a three-component coupled system.]

Algorithm 13.3: Nonlinear block Jacobi algorithm

Inputs:
(0) (0)
𝑢 (0) = [𝑢1 , . . . , 𝑢𝑛 ]: Guesses for coupling variables
Outputs:
𝑢 = [𝑢1 , . . . , 𝑢𝑛 ]: System-level states

k = 1
while ‖u^(k) − u^(k−1)‖₂ > ε do
    for all i ∈ {1, . . . , n} do      ⊲ Can be done in parallel
        u_i^(k) ← solve R_i(u_i^(k), u_j^(k−1)) = 0, where j ≠ i
    end for
    k = k + 1
end while
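A minimal sketch of this algorithm for two scalar components with made-up coupling relations is shown below; each solve_i plays the role of a component solver that returns its state given the other component's state.

```python
# Sketch of nonlinear block Jacobi for two scalar "components" (made-up relations).
import numpy as np

def solve_1(u2):          # solves R_1(u_1, u_2) = 0 for u_1
    return 2.0 + 0.3 * u2

def solve_2(u1):          # solves R_2(u_1, u_2) = 0 for u_2
    return 1.0 - 0.5 * u1

u = np.array([0.0, 0.0])                  # initial guesses
for k in range(100):
    u_new = np.array([solve_1(u[1]),      # both use values from the previous iteration,
                      solve_2(u[0])])     # so they could run in parallel
    if np.linalg.norm(u_new - u) < 1e-10:
        break
    u = u_new
print(u_new, k)
```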

The nonlinear block Gauss–Seidel algorithm is similar to its Jacobi


counterpart. The only difference is that when solving each component,
we use the latest coupling variables available instead of just using

the coupling variables from the previous iteration. We cycle through


each component 𝑖 = 1, . . . , 𝑛 in order. When computing 𝑢𝑖 by solving
component 𝑖 at iteration 𝑘, for all 𝑢 𝑗 that are inputs to component 𝑖,
(𝑘) (𝑘−1)
we use 𝑢 𝑗 for all 𝑗 < 𝑖 and 𝑢 𝑗 for all 𝑗 > 𝑖. This results in a better
convergence rate in general, but the components cannot be solved in
parallel because now each component depends on the current iteration’s
coupling variables from all previous components in the sequence. This
algorithm is illustrated in Fig. 13.10 and formalized in Alg. 13.4.

[Figure 13.10: A block Gauss–Seidel multidisciplinary analysis (MDA) process to solve a three-discipline coupled system.]

Algorithm 13.4: Nonlinear block Gauss–Seidel algorithm

Inputs:
(0) (0)
𝑢 (0) = [𝑢1 , . . . , 𝑢𝑛 ]: Guesses for coupling variables
Outputs:
𝑢 = [𝑢1 , . . . , 𝑢𝑛 ]: System-level states

k = 1
while ‖u^(k) − u^(k−1)‖₂ > ε do
    for i = 1 to n do
        u_i^(k) ← solve R_i(u_1^(k), . . . , u_{i−1}^(k), u_i^(k), u_{i+1}^(k−1), . . . , u_n^(k−1)) = 0
    end for
    k = k + 1
end while
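The same made-up two-component example solved with block Gauss–Seidel is sketched below; the only change from the Jacobi version is that the second solve uses the u1 just computed in the current sweep.

```python
# Sketch of nonlinear block Gauss-Seidel for the same made-up two-component system.
import numpy as np

def solve_1(u2):
    return 2.0 + 0.3 * u2

def solve_2(u1):
    return 1.0 - 0.5 * u1

u = np.array([0.0, 0.0])
for k in range(100):
    u_old = u.copy()
    u[0] = solve_1(u[1])        # uses u_2 from the previous iteration
    u[1] = solve_2(u[0])        # uses the u_1 just computed in this sweep
    if np.linalg.norm(u - u_old) < 1e-10:
        break
print(u, k)
```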

Both the block Jacobi and Gauss–Seidel methods converge linearly,


but Gauss–Seidel converges more quickly because each equation uses

the latest information available.


The order in which the components are solved makes a big difference
in the efficiency of the Gauss–Seidel method. In the best possible
scenario, the components can be reordered such that there are no entries
in the lower diagonal of the DSM, which means that each component
depends only on previously solved components and there are therefore
no feedback dependencies. In this case the block Gauss–Seidel method
would converge to the solution in one forward sweep.
In the more general case, even though we might not be able to
completely eliminate the lower diagonal entries, minimizing these
entries by reordering results in better convergence. This reordering can
also make the difference between convergence and non-convergence.
Newton’s method can also be applied to the system level. For a
system of nonlinear residual equations, the Newton step in the coupling
variables, Δ𝑢 = 𝑢 (𝑘+1) − 𝑢 (𝑘) can be found by solving the linear system

$$\left. \frac{\partial R}{\partial u} \right|_{u = u^{(k)}} \Delta u = -R\left(u^{(k)}\right), \qquad (13.9)$$

where we need the partial derivatives of all the residuals with respect
to the coupling variables to form the Jacobian matrix 𝜕𝑅/𝜕𝑢.
Expanding the concatenated residual and coupling variable vectors,
we get
$$\begin{bmatrix} \dfrac{\partial R_1}{\partial u_1} & \cdots & \dfrac{\partial R_1}{\partial u_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial R_n}{\partial u_1} & \cdots & \dfrac{\partial R_n}{\partial u_n} \end{bmatrix} \begin{bmatrix} \Delta u_1 \\ \vdots \\ \Delta u_n \end{bmatrix} = -\begin{bmatrix} R_1 \\ \vdots \\ R_n \end{bmatrix}, \qquad (13.10)$$
where the derivatives in the block Jacobian matrix and the right hand
side are evaluated at the current iteration, 𝑢 (𝑘) . These derivatives can
be computed using any of the methods from Chapter 6. Note that this
Jacobian matrix has exactly the same structure of the DSM and is often
a sparse matrix. The full procedure is listed in Alg. 13.5.

Algorithm 13.5: Newton method for system-level convergence

Inputs: h i
(0) (0)
𝑢 (0) = 𝑢1 , . . . , 𝑢𝑛 : Guesses for coupling variables
Outputs:
𝑢 = [𝑢1 , . . . , 𝑢𝑛 ]: System-level states

k = 1
while ‖R‖₂ > ε do
    for all i ∈ {1, . . . , n} do      ⊲ Can be done in parallel
        Compute R_i
        Compute ∂R_i/∂u_j for j = 1, . . . , n
    end for
    Δu ← solve block Newton system (13.10)
    u^(k+1) = u^(k) + Δu
    k = k + 1
end while
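The following sketch applies the system-level Newton iteration to a made-up two-component system with scalar states; the partial derivatives are coded by hand here, but in practice they would come from one of the methods of Chapter 6.

```python
# Sketch of the system-level Newton iteration (Alg. 13.5) for two scalar residuals.
import numpy as np

def residuals(u):
    u1, u2 = u
    r1 = u1 - (2.0 + 0.3 * u2**2)          # R_1(u) = 0
    r2 = u2 - (1.0 - 0.5 * np.sin(u1))     # R_2(u) = 0
    return np.array([r1, r2])

def jacobian(u):
    u1, u2 = u
    # block Jacobian dR_i/du_j of Eq. 13.10
    return np.array([[1.0, -0.6 * u2],
                     [0.5 * np.cos(u1), 1.0]])

u = np.array([0.0, 0.0])
for k in range(20):
    r = residuals(u)
    if np.linalg.norm(r) < 1e-12:
        break
    du = np.linalg.solve(jacobian(u), -r)  # Newton step, Eq. 13.9
    u = u + du
print(u, k)
```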

Like the plain Newton method, this coupled-Newton method has


similar advantages and disadvantages. The main advantage is that it
converges quadratically once it is close enough to the solution (if the
problem is well conditioned). The main disadvantage is that it might
not converge at all, depending on the initial guess.

13.4 Coupled Derivative Computation

As we well know by now, gradient-based optimization requires the


derivatives of the objective and constraints with respect to the design
variables. Any of the methods for computing derivatives from Chapter 6
can be used, but some require modifications. The difference is that,
in MDO, the computation of the functions of interest (objective and
constraints) requires the solution of a coupled system of components.
The finite-difference method can be used with no modification, as
long as an MDA is converged well enough for each perturbation in the
design variables. As explained in Section 6.4, the cost of computing
derivatives with the finite-difference method is proportional to the
number of variables. The constant of proportionality can increase
significantly compared to that of a single discipline because the MDA
convergence might be slow (especially if using a Jacobi or Gauss–Seidel
iteration).
The precision of the derivatives depends directly on the precision
of the functions of interest. In previous sections, we only needed to
concern ourselves with the precision of a single component. Now that
the function of interest depends on a coupled system, the precision of
the MDA must be considered. This precision depends on the precision
of each component as well as the convergence of the MDA. Even if
each component provides precise function values, the precision of the
derivatives might be compromised if the MDA is not converged well
enough.
The complex-step method and forward mode AD can also be used for
a coupled system, but some modifications are required. The complex-

step method requires all components to be able to take complex input


and compute the corresponding complex outputs, and similarly, AD
requires inputs and outputs that include derivative information. For a
given MDA, if one of these methods is applied to each component and
the coupling includes the derivative information, we can compute the
derivatives of the coupled system. When using AD, manual coupling
will be required if the components and the coupling are programmed
in different languages. While both of these methods produce precise
derivatives for each component, the precision of the derivatives for the
coupled system could be compromised by a low level of convergence
of the MDA. The reverse mode of AD for coupled systems would be
more involved: After an initial MDA, a reverse MDA would be run to
compute the derivatives.
Analytic methods (both direct and adjoint) can also be extended to
compute the derivatives of coupled systems. All the equations derived
for a single component in Section 6.7 are valid for coupled system if we
concatenate the residuals and the state variables.
Thus, the coupled version of the linear system for the direct
method (6.45) is
$$\begin{bmatrix} \dfrac{\partial R_1}{\partial u_1} & \cdots & \dfrac{\partial R_1}{\partial u_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial R_n}{\partial u_1} & \cdots & \dfrac{\partial R_n}{\partial u_n} \end{bmatrix} \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix} = \begin{bmatrix} \dfrac{\partial R_1}{\partial x} \\ \vdots \\ \dfrac{\partial R_n}{\partial x} \end{bmatrix}, \qquad (13.11)$$
where 𝜙 𝑖 contains the derivatives of the states of component 𝑖 with respect
to the design variables. Once we have solved for 𝜙, we can use the
coupled equivalent of the total derivative equation (6.46) to compute
the derivatives:
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial F}{\partial x} - \begin{bmatrix} \dfrac{\partial F}{\partial u_1}, \ldots, \dfrac{\partial F}{\partial u_n} \end{bmatrix} \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix}. \qquad (13.12)$$
The coupled adjoint equations can be written as
$$\begin{bmatrix} \dfrac{\partial R_1}{\partial u_1}^T & \cdots & \dfrac{\partial R_n}{\partial u_1}^T \\ \vdots & \ddots & \vdots \\ \dfrac{\partial R_1}{\partial u_n}^T & \cdots & \dfrac{\partial R_n}{\partial u_n}^T \end{bmatrix} \begin{bmatrix} \psi_1 \\ \vdots \\ \psi_n \end{bmatrix} = \begin{bmatrix} \dfrac{\partial F}{\partial u_1}^T \\ \vdots \\ \dfrac{\partial F}{\partial u_n}^T \end{bmatrix}. \qquad (13.13)$$
After solving for the coupled-adjoint vector using the equation above, we

can use the total derivative equation to compute the desired derivatives:
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial F}{\partial x} - \begin{bmatrix} \psi_1^T, \ldots, \psi_n^T \end{bmatrix} \begin{bmatrix} \dfrac{\partial R_1}{\partial x} \\ \vdots \\ \dfrac{\partial R_n}{\partial x} \end{bmatrix}. \qquad (13.14)$$
There is an alternative form for the coupled direct and adjoint
methods that was not useful for single models. The coupled direct and
adjoint methods derived above use the residual form of the governing
equations and are a natural extension of the corresponding methods
applied to single models. In this form, the residuals for all the equations
and the corresponding states are exposed at the system level. As
previously mentioned in Section 13.3.2, there is an alternative system-
level representation—the functional representation—that views each
model as a function relating its inputs and outputs 𝑦 = 𝑌(𝑦), where the
coupling variables 𝑦 represent the system-level states. We can derive
direct and adjoint methods from the functional representation.
The functional versions of these methods can be derived by defining
the residuals as 𝑅(𝑦) ≜ 𝑦 − 𝑌(𝑦) = 0, where the states are now the
coupling variables. The linear system for the direct method (13.11) then
yields
$$\begin{bmatrix} I & -\dfrac{\partial Y_1}{\partial y_2} & \cdots & -\dfrac{\partial Y_1}{\partial y_n} \\ \vdots & \ddots & & \vdots \\ -\dfrac{\partial Y_n}{\partial y_1} & -\dfrac{\partial Y_n}{\partial y_2} & \cdots & I \end{bmatrix} \begin{bmatrix} \bar{\phi}_1 \\ \vdots \\ \bar{\phi}_n \end{bmatrix} = \begin{bmatrix} \dfrac{\partial Y_1}{\partial x} \\ \vdots \\ \dfrac{\partial Y_n}{\partial x} \end{bmatrix}, \qquad (13.15)$$

where $\bar{\phi}_i = \mathrm{d}y_i/\mathrm{d}x$. The total derivatives of the function of interest can
then be computed with
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial F}{\partial x} + \begin{bmatrix} \dfrac{\partial F}{\partial y_1}, \ldots, \dfrac{\partial F}{\partial y_n} \end{bmatrix} \begin{bmatrix} \bar{\phi}_1 \\ \vdots \\ \bar{\phi}_n \end{bmatrix}, \qquad (13.16)$$
where the plus sign follows from $\bar{\phi}_i = \mathrm{d}y_i/\mathrm{d}x$ and the chain rule.
The functional form of the coupled adjoint equations can be similarly
derived, yielding
$$\begin{bmatrix} I & -\dfrac{\partial Y_2}{\partial y_1}^T & \cdots & -\dfrac{\partial Y_n}{\partial y_1}^T \\ \vdots & \ddots & & \vdots \\ -\dfrac{\partial Y_1}{\partial y_n}^T & -\dfrac{\partial Y_2}{\partial y_n}^T & \cdots & I \end{bmatrix} \begin{bmatrix} \bar{\psi}_1 \\ \vdots \\ \bar{\psi}_n \end{bmatrix} = \begin{bmatrix} \dfrac{\partial F}{\partial y_1}^T \\ \vdots \\ \dfrac{\partial F}{\partial y_n}^T \end{bmatrix}. \qquad (13.17)$$

After solving for the coupled-adjoint vector using the equation above, we
can use the total derivative equation to compute the desired derivatives:
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial F}{\partial x} + \begin{bmatrix} \bar{\psi}_1^T, \ldots, \bar{\psi}_n^T \end{bmatrix} \begin{bmatrix} \dfrac{\partial Y_1}{\partial x} \\ \vdots \\ \dfrac{\partial Y_n}{\partial x} \end{bmatrix}. \qquad (13.18)$$
Finally, the unification of the methods for computing derivatives
introduced in Section 6.9 also applies to coupled systems and can be
used to derive the coupled direct and adjoint methods presented above.
Furthermore, the UDE (6.74) can also handle residual or functional
components, as long as they are ultimately expressed as residuals.
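As an illustration of the functional approach, the sketch below (with made-up component functions Y1, Y2 and objective F) solves the linear system of Eq. 13.15 for dy/dx, applies the chain rule for df/dx, and checks the result against a finite difference of the converged MDA; the adjoint version would solve the transposed system instead.

```python
# Sketch of the functional direct method for a made-up two-component system.
import numpy as np

def Y1(x, y2):                 # component 1 output
    return x**2 + 0.4 * y2

def Y2(x, y1):                 # component 2 output
    return np.sin(y1) + 2.0 * x

def F(x, y1, y2):              # function of interest
    return y1**2 + 3.0 * y2

def mda(x, iters=100):
    y1 = y2 = 0.0
    for _ in range(iters):     # Gauss-Seidel fixed-point iteration
        y1 = Y1(x, y2)
        y2 = Y2(x, y1)
    return y1, y2

x = 1.3
y1, y2 = mda(x)

# (I - dY/dy) dy/dx = dY/dx   (functional direct method, Eq. 13.15)
dYdy = np.array([[0.0, 0.4],
                 [np.cos(y1), 0.0]])
dYdx = np.array([2.0 * x, 2.0])
dydx = np.linalg.solve(np.eye(2) - dYdy, dYdx)

# df/dx = dF/dx + dF/dy . dy/dx  (Eq. 13.16, with dF/dx = 0 here)
dfdx = np.array([2.0 * y1, 3.0]) @ dydx

h = 1e-6                       # finite-difference check of the converged MDA
fd = (F(x + h, *mda(x + h)) - F(x, *mda(x))) / h
print(dfdx, fd)
```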

Tip 13.6: Coupled derivative computation.

Obtaining derivatives for each component of a multidisciplinary model and


propagating them to compute the coupled derivatives usually requires a high
implementation effort. OpenMDAO‡ was designed to help with this problem. ‡ http://openmdao.org

Even if exact derivatives can only be supplied for a subset of the models and
the rest are obtained by finite difference, the system derivatives will usually be
more accurate than applying finite difference at the system level.

13.5 Monolithic Architectures

Monolithic MDO architectures cast the design problem as a single


optimization. The only difference between the different monolithic
architectures is the set of design variables that the optimizer is respon-
sible for, which has repercussions on the set of constraints considered
and how the governing equations are solved.§

§ Martins et al.36 present a comprehensive description of all MDO architectures, including references to known applications of each architecture.

13.5.1 Multidisciplinary Feasible

The multidisciplinary design feasible (MDF) architecture is the closest


to a single discipline problem because the design variables, objective,
and constraints are the same as we would expect for a single discipline
problem. The only difference is that the computation of the objective
and constraints requires the solution of a coupled system instead of a
single system of governing equations. Therefore, all the optimization

algorithms covered in the previous chapters can be applied without


modification in MDF. The resulting optimization problem is

minimize      𝑓 (𝑥, 𝑦 ∗)
by varying    𝑥
subject to    𝑐 0 (𝑥, 𝑦 ∗) ≤ 0
              𝑐 𝑖 (𝑥0 , 𝑥 𝑖 , 𝑦 𝑖∗ ) ≤ 0,    𝑖 = 1, . . . , 𝑁            (13.19)
while solving 𝑅 𝑖 (𝑥, 𝑦) = 0,    𝑖 = 1, . . . , 𝑁
by varying    𝑦
where an MDA is performed for each 𝑥 to solve for the internal component
states and the coupling variables 𝑦 ∗ , using any of the methods from
Section 13.3.4. An MDA is converged when it finds a set of coupling
variables 𝑦 ∗ that makes all components consistent. That is, computing
the output coupling variables for each component using 𝑦 ∗ as an input,
 
$$y_i = Y_i\left(x_0, x_i, y^*_{j \neq i}\right), \quad i = 1, \ldots, N, \qquad (13.20)$$

yields 𝑦 𝑖 = 𝑦 𝑖∗ for all components. Then, the objective and constraints


can be computed based on the current design variables and coupling
variables. An XDSM for MDF with three components is shown in
Fig. 13.11. Here we use a Gauss–Seidel iteration to converge the MDA,
but any other method for converging the MDA could be used.
One advantage of MDF is that the system-level states are physically
compatible if an optimization stops prematurely. This is advantageous
in an engineering design context when time is limited and we are not
as concerned with finding an optimal design in the strict mathematical
sense as with finding an improved design. However, it is not guaranteed
that the design constraints are satisfied if the optimization is terminated
early; that depends on whether the optimization algorithm maintains a
feasible design point or not.
The main disadvantage of MDF is that it requires an MDA for each
optimization iteration, which requires its own algorithm outside of the
optimization. Implementing an MDA algorithm can be time consuming
if one is not already in place. One of the easiest to implement is the
block Gauss–Seidel algorithm, but as we have seen, it converges slowly.
When using a gradient-based optimizer, gradient calculations are
also challenging for MDF because it requires coupled derivatives.
Finite-difference derivative approximations are easy to implement, but
their poor scalability and precision are compounded by the MDA, as
explained in Section 13.4. Ideally, we would use one of the analytic
coupled derivative computation methods of Section 13.4, which require
a substantial implementation effort.

[Figure 13.11: The MDF architecture relies on an MDA to solve for the coupling and state variables for each optimization iteration. In this case, the MDA uses a Gauss–Seidel approach.]

13.5.2 Individual Discipline Feasible

The individual discipline feasible (IDF) architecture adds independent
copies of the coupling variables to allow component analyses to run
independently and possibly in parallel. These copies, known
as target variables, are controlled by the optimizer, while the actual
coupling variables are computed by the corresponding component.
Target variables are denoted by a superscript 𝑡, so the target for the coupling
variables produced by discipline 𝑖 is 𝑦 𝑖𝑡 . These variables represent the
current guesses for the coupling variables that are independent of the
corresponding actual coupling variables computed by each component.
To ensure the eventual consistency between the target coupling variables
and the actual coupling variables at the optimum, we define a set of
consistency constraints, 𝑐 𝑖𝑐 = 𝑦 𝑖𝑡 − 𝑦 𝑖 , which we add to the optimization
problem formulation.

The optimization problem for the IDF architecture is



minimize      𝑓 (𝑥, 𝑦)
by varying    𝑥, 𝑦 𝑡
subject to    𝑐 0 (𝑥, 𝑦) ≤ 0
              𝑐 𝑖 (𝑥0 , 𝑥 𝑖 , 𝑦 𝑖 ) ≤ 0,    𝑖 = 1, . . . , 𝑁            (13.21)
              𝑐 𝑖𝑐 = 𝑦 𝑖𝑡 − 𝑦 𝑖 = 0,    𝑖 = 1, . . . , 𝑁
while solving 𝑅 𝑖 (𝑥, 𝑦 𝑖 , 𝑦 𝑡𝑗≠𝑖 ) = 0,    𝑖 = 1, . . . , 𝑁
for 𝑦

where each component is solved independently to compute the corre-


sponding output coupling variables and constraints based on the target
coupling variables, that is
 
𝑦 𝑖 = 𝑌𝑖 (𝑥0 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 ),
𝑐 𝑖 = 𝑐 𝑖 (𝑥0 , 𝑥 𝑖 , 𝑦 𝑖 , 𝑦 𝑡𝑗≠𝑖 ),            (13.22)

where unlike MDF, we do not need to make the coupling variables con-
sistent using an MDA. Instead, each component is solved independently
for each optimization iteration. Then 𝑓 and 𝑐0 are computed using
the current design variables 𝑥 and the latest available set of coupling
variables 𝑦. The XDSM for IDF is shown in Fig. 13.12.

x (0) , y t,(0)

0, 3 → 1 :
x∗ 2 : x, y t 1 : x0 , x1 , y2t , y3t 1 : x0 , x2 , y1t , y3t 1 : x0 , x3 , y1t , y2t
Optimization

2:
y1∗ 3 : f , c, c c
Functions

1:
y2∗ 2 : y1
Analysis 1

1:
y3∗ 2 : y2
Analysis 2

1:
2 : y3
Analysis 3

Figure 13.12: The IDF architecture


breaks up the MDA by letting the
optimizer solve for the coupling
variables that satisfy interdisci-
plinary feasibility.

One advantage of IDF is that each component can be solved in


parallel because they do not depend on each other directly. Instead, the
coupling between the components is resolved by the optimizer, which
iterates the target coupling variables, 𝑦 𝑡 until it satisfies the consistency
constraints, 𝑐 𝑐 , such that 𝑦 𝑡 = 𝑦.
This leads to the main disadvantage of IDF, which is that the
optimizer must handle more design variables and constraints compared
to the MDF architecture. If the number of coupling variables is large,
the size of the resulting optimization problem might be too large to
solve efficiently. This problem can be mitigated by careful selection
of the components or by aggregating the coupling variables to reduce
their dimensionality.
Another advantage of IDF is that if a gradient-based optimization
algorithm is used to solve the optimization problem, the optimizer
is typically more robust and has better convergence rate than the
fixed-point iteration algorithms of Section 13.3.4.

13.5.3 Simultaneous Analysis and Design


Simultaneous analysis and design (SAND) extends the idea of IDF by
moving not only the coupling variables to the optimization problem, but
all component states as well. The SAND approach requires exposing all
the components in the form of the system-level view previously introduced
in Fig. 13.6. The residuals of the analysis become constraints that the
optimizer is responsible for.¶ ¶ When the residual equations arise from
discretized PDEs, we have what is called
This means that component solvers are no longer needed and PDE-constrained optimization. 144 .
the optimizer becomes responsible for simultaneously solving the 144. Biegler et al., Large-Scale PDE-
components for their states, the interdisciplinary compatibility for the Constrained Optimization. 2003
coupling variables, and the design optimization problem for the design
variables. Because the optimizer is controlling all these variables, SAND
is also known as a full-space approach. SAND can be stated as

minimize      𝑓0 (𝑥, 𝑦)
by varying    𝑥, 𝑦, 𝑢
subject to    𝑐0 (𝑥, 𝑦) ≤ 0                                    (13.23)
              𝑐 𝑖 (𝑥0 , 𝑥 𝑖 , 𝑦 𝑖 ) ≤ 0    for 𝑖 = 1, . . . , 𝑁
              ℛ 𝑖 (𝑥0 , 𝑥 𝑖 , 𝑦, 𝑢𝑖 ) = 0    for 𝑖 = 1, . . . , 𝑁
where we use the representation shown in Fig. 13.4, and therefore
there are two sets of explicit functions that translate the input coupling
variables of the component. The SAND architecture is also applicable
to single components, in which case there are no coupling variables.
The XDSM for SAND is shown in Fig. 13.13

[Figure 13.13: The SAND architecture lets the optimizer solve for all variables (design, coupling, and state variables), and component solvers are no longer needed.]

Because we are solving all variables simultaneously, the SAND
architecture has the potential for being the most efficient way to get
to the optimal solution. In practice, however, it is unlikely that this is
advantageous when efficient component solvers are available.
The resulting optimization problem is the largest of all MDO archi-
tectures and requires an optimizer that scales well with the number
of variables. Therefore, a gradient-based optimization algorithm is
likely required, in which case, the derivative computation must also
be considered. Fortunately, SAND does not require derivatives of the
coupled system or even total derivatives that account for the component
solution; only partial derivatives of residuals are needed.
SAND is an intrusive approach because it requires access to the
residuals. These might not be available if components are provided as
black boxes. Rather than computing the coupling variables 𝑦 𝑖 and state
variables 𝑢𝑖 by converging the residuals to zero, each component 𝑖 just
computes the current residuals ℛ 𝑖 for the current values of the coupling
variables 𝑦 and the component states 𝑢𝑖 .‖
The MAUD architecture was developed
by Hwang et al.39 , who realized that the
UDE provided the mathematical basis
13.5.4 Modular Analysis and Unified Derivatives for a new MDO framework that makes
sophisticated parallel solvers and cou-
pled derivative computations available
The modular analysis and unified derivatives (MAUD) architecture is
through a small set of user-defined func-
essentially MDF with built-in solvers and derivative computation that tions.
use the residual representation introduced in Section 13.3.2. ‖ There 39. Hwang et al., A computational architec-
ture for coupling heterogeneous numerical
are two main ideas in MAUD: 1) represent the coupled system as a models and computing coupled derivatives.
2018
13 Multidisciplinary Design Optimization 382

single nonlinear system and 2) linearize the coupled system using the
UDE (6.74) and solve it for the coupled derivatives.
To represent the coupled system as a single nonlinear system,
we view the MDA as a series of residuals and variables, 𝑅 𝑖 (𝑢) = 0,
corresponding to each component 𝑖 = 1, . . . , 𝑛, as previously written
in Eq. 13.1. Unlike the previous architectures, there is no distinction
between the coupling variables and state variables; they are all just
states, 𝑢. As previously shown in Fig. 13.5, the coupling variables can
be considered to be the states by defining explicit components that
translate the inputs and outputs.
In addition, both the design variables and functions of interest
(objective and constraints) are also concatenated in the state variable
vector. Denoting the original states for the coupled system (13.1) as 𝑢,
¯
the new state is,
𝑥 
 
𝑢 , 𝑢¯  . (13.24)
𝑓
 
We also need to augment the residuals to have a solvable system. The
residuals corresponding to the design variables and output functions
are formulated using the residual for explicit functions introduced in
Eq. 13.2. The complete set of residuals is then,

 𝑥 − 𝑥0 
 
𝑅(𝑢) , 𝑟 − 𝑅 𝑢¯ (𝑥, 𝑢)
¯  , (13.25)
 𝑓 − 𝐹(𝑥, 𝑢) ¯ 

where 𝑥 0 are fixed inputs, and 𝐹(𝑥, 𝑢)
¯ is the actual computed value
of the function. Formulating fixed inputs and explicit functions as
residuals in this way might seem unnecessarily complicated, but it
facilitates the formulation of the MAUD architecture, just like it did for
the formulation of the UDE.
The two main ideas in MAUD mentioned above are directly as-
sociated with two main tasks: 1) the solution of the coupled system
and 2) the computation of the coupled derivatives. The formulation
of the concatenated states (13.24) and residuals (13.25) simplifies the
implementation of the algorithms that perform the above tasks. To
perform these tasks, MAUD assembles and solves four types of systems:

1. Fundamental system: This represents the numerical model and


is in general a discretized nonlinear system of equations.

𝑅(𝑢) = 0
13 Multidisciplinary Design Optimization 383

2. Newton step: A linear system based on the numerical model


above whose solution yields an iteration toward the solution.

𝜕𝑅
Δ𝑢 = −𝑟
𝜕𝑢

3. Forward differentiation (left equality of UDE): A linear system


whose solution yields the derivative of all states with respect
to one selected state. The state is selected by the appropriate
choice of the column of the identity matrix. The selected states
are usually the ones associated with 𝑥.

𝜕𝑅 d𝑢
=ℐ
𝜕𝑢 d𝑟

4. Reverse differentiation (right equality of UDE): A linear system


whose solution yields the derivative of one selected state with
respect to all states. The state is selected by the appropriate choice
of the column of the identity matrix. The selected states are
usually the one associated with 𝑓 .

𝜕𝑅 𝑇 d𝑢 𝑇
=ℐ
𝜕𝑢 d𝑟

To efficiently solve the above systems of equations, MAUD provides


the option for grouping the components of the fundamental system
hierarchically. We show several examples of this grouping in Fig. 13.14.
Of the two-component system on the left column, the top one has
independent components that can be solved in parallel, while in the
bottom one the components are coupled and need a coupled solver. The
other systems have four components with different types dependencies
that can be solved using two levels: the first level consists of two groups
with two components each, and the higher level solves the two groups.
∗∗ MAUD
∗∗ was implemented in OpenM-
DAO V2 by Gray et al.91 . OpenMDAO
is an open-source framework developed
by NASA to facilitate the development
13.6 Distributed Architectures of multidisciplinary solvers. It includes
all the features of MAUD introduced in
this chapter and adds many other features,
The monolithic MDO architectures we have covered so far form and such as methods that take advantage of
solve a single optimization problem. Distributed architectures decom- sparsity in the system coupling.
pose this single optimization problem into a set of smaller optimization 91. Gray et al., OpenMDAO: An open-
source framework for multidisciplinary
problems, or disciplinary subproblems, which are then coordinated by a design, analysis, and optimization. 2019
system-level subproblem. One key requirement for these architectures is
that they must be mathematically equivalent to the original monolithic
problem so that they converge to the same solution.
13 Multidisciplinary Design Optimization 384

Serial Parallel Coupled

x x x

u1 u1
u1
u2 u2

u3 u3
u2
u4 u4

f f f

(a) (b) (c)


Figure 13.14: Example problem
x x x structures and corresponding
MAUD hierarchical decomposi-
u1 u1
u1 tions. The problem structure is
u2 u2 shown using DSM. The hierarchical
decompositions are shown above
u3 u3 the matrices, where the components
u2
u4 u4 are the solid blue squares and the
groups are the boxes (serial groups
f f f in red, parallel groups in blue, and
coupled groups in gray).
(d) (e) (f)

There are two main motivations for distributed architectures. The


first one is the possibility of decomposing the problem to reduce the
computational time. The second motivation is to mimic the structure
of large engineering design teams, where disciplinary groups have
the autonomy to design their subsystem, so that MDO is more readily
adopted in industry. Overall, distributed MDO architectures have fallen
short on both of these expectations. Unless a problem has a special
structure, there is no distributed architecture that converges as rapidly
as a monolithic one. In practice, distributed architectures have not been
used much recently.
There are two main types of distributed architectures: those that
enforce multidisciplinary feasibility via an MDA somewhere in the
process and those that enforce multidisciplinary feasibility in some
other way (using constraints or penalties at the system level). This
is analogous to MDF and IDF, respectively, so we name these types
“distributed MDF” and “distributed IDF”.

13.6.1 Sequential Optimization


The sequential optimization approach is not considered to be an MDO
architecture because in general, it does not converge to the optimum of
the MDO problem. However, this is an intuitive approach to distributing
the optimization of a system with multiple coupled components. This
13 Multidisciplinary Design Optimization 385

approach does not include a system-level subproblem. Instead, each


component is optimized in turn with respect to its local variable
while satisfying its constraints. This is an approach that is often
used in industry, where engineers are grouped by discipline, physical
subsystem, or both. This makes sense when the engineering system
being designed is too complex and the number of engineers too large
to coordinate a simultaneous design involving all groups.
The sequential optimization approach is analogous to a block-Gauss–
Seidel iteration, but in addition to solving for the component state
variables we also solve an optimization problem for the design variables
of that component. We can also view this approach as coordinate
descent, except that instead of optimizing one variable at the time, we
optimize a set of variables at the time.

13.6.2 Collaborative Optimization


The collaborative optimization (CO) MDO architecture is inspired on
how disciplinary teams work in the design of complex engineered
systems. This is a distributed IDF architecture, where the disciplinary
optimization problems are formulated to be independent of each other
by using target values of the coupling and global design variables.
These target values are then shared with all disciplines during every
iteration of the solution procedure. The complete independence of
disciplinary subproblems combined with the simplicity of the data-
sharing protocol makes this architecture attractive for problems with a
small amount of shared data.
The XDSM for CO is shown in Fig. 13.15. The system-level subprob-
lem is similar to the original optimization problem except that: (1) local
constraints are removed, (2) target coupling variables (𝑦 𝑡 ) are added
as design variables, and (3) a consistency constraint that quantifies the
difference between the target coupling variables and actual coupling
variables is added. This optimization problem can be written as

minimize 𝑓0 𝑥0 , 𝑥ˆ 1 , . . . , 𝑥ˆ 𝑁 , 𝑦 𝑡
by varying 𝑥0 , 𝑥ˆ 1 , . . . , 𝑥ˆ 𝑁 , 𝑦 𝑡

subject to 𝑐 0 𝑥0 , 𝑥ˆ 1 , . . . , 𝑥ˆ 𝑁 , 𝑦 𝑡 ≤ 0
𝐽𝑖∗ = || 𝑥ˆ 0𝑖 − 𝑥0 || 22 + || 𝑥ˆ 𝑖 − 𝑥 𝑖 || 22 +
 
||𝑦 𝑖𝑡 − 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 || 22 = 0 for 𝑖 = 1, . . . , 𝑁
(13.26)

where 𝑥ˆ 0𝑖 are copies of the global design variables that passed to


discipline 𝑖 and 𝑥ˆ 𝑖 are copies of the local design variables passed to
13 Multidisciplinary Design Optimization 386

the system subproblem. The constraint function 𝐽𝑖∗ is a measure of


the inconsistency between the values requested by the system-level
subproblem and the results from the discipline 𝑖 subproblem.

(0) (0) (0) (0)


x0 , x̂1...N , y t,(0) x̂0i , xi

0, 2 → 1 :
x0∗ System 1 : x0 , x̂1...N , y t 1.1 : yj6t=i 1.2 : x0 , x̂i , y t
Optimization

1:
2 : f0 , c0 System
Functions

1.0, 1.3 → 1.1 :


xi∗ 1.1 : x̂0i , xi 1.2 : x̂0i , xi
Optimization i

1.1 :
yi∗ 1.2 : yi
Analysis i

1.2 :
2 : Ji∗ 1.3 : fi , ci , Ji Discipline i
Functions

For each system-level iteration, the disciplinary subproblems do Figure 13.15: Diagram for the CO
architecture.
not include the original objective function. Instead the objective of
each subproblem is to minimize the inconsistency function. For each
discipline 𝑖 the subproblem is
  
minimize 𝐽𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖

by varying 𝑥ˆ 0𝑖 , 𝑥 𝑖 (13.27)
  
subject to 𝑐 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 ≤ 0.

These subproblems are independent of each other and can be solved


in parallel. Thus, the system-level subproblem is responsible for
minimizing the design objective, while the discipline subproblems
minimize system inconsistency while satisfying local constraints. The †† Braun145 showed that the CO problem

statement is mathematically equivalent to


CO procedure is detailed in Alg. 13.7.†† They formulated two versions the original MDO problem.
of the CO architecture: CO1 and CO2 . The version presented in 145. Braun, Collaborative Optimization:
An Architecture for Large-Scale Distributed
Design. 1996
13 Multidisciplinary Design Optimization 387

Section 13.6.2 is CO2 .

Algorithm 13.7: Collaborative optimization

Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Corresponding objective value
𝑐 ∗ : Corresponding constraint values

0: Initiate system optimization iteration


repeat
1: Compute system subproblem objectives and constraints
for Each discipline 𝑖 (in parallel) do
1.0: Initiate disciplinary subproblem optimization
repeat
1.1: Evaluate disciplinary analysis
1.2: Compute disciplinary subproblem objective and constraints
1.3: Compute new disciplinary subproblem design point and 𝐽𝑖
until 1.3 → 1.1: Optimization 𝑖 has converged
end for
2: Compute a new system subproblem design point
until 2 → 1: System optimization has converged

In spite of the organizational advantage of having fully separate


disciplinary subproblems, CO suffers from numerical ill-conditioning.
This is because the constraint gradients of the system problem at an
optimal solution are all zero vectors, which violates the constraint
qualification requirement for the KKT conditions. This slows down
convergence when using a gradient-based optimization algorithm or
prevents convergence all together.

13.6.3 Analytical Target Cascading


Analytical target cascading (ATC) is a distributed IDF architecture
that uses penalties in the objective function to minimize the difference
between the target variables requested by the system-level optimization
and the actual variables computed by each discipline. This is an idea
similar to the CO architecture in the previous section, except that ATC
13 Multidisciplinary Design Optimization 388

uses penalties instead of a constraint. The ATC system-level problem is

 Õ
𝑁

minimize 𝑓0 𝑥, 𝑦 𝑡 + Φ𝑖 𝑥ˆ 0𝑖 − 𝑥 0 , 𝑦 𝑖𝑡 − 𝑦 𝑖 𝑥 0 , 𝑥 𝑖 , 𝑦 𝑡 +

𝑖=1
(13.28)
Φ0 𝑐 0 𝑥, 𝑦 𝑡
by varying 𝑥0 , 𝑦 𝑡 ,

where Φ0 is a penalty relaxation of the global design constraints and Φ𝑖


is a penalty relaxation of the discipline 𝑖 consistency constraints. The
𝑖 𝑡 ℎ discipline subproblem is:
      
minimize 𝑓0 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 , 𝑦 𝑡𝑗≠𝑖 + 𝑓𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 +
   
Φ𝑖 𝑦 𝑖𝑡 − 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 , 𝑥ˆ 0𝑖 − 𝑥0 +
    
Φ0 𝑐 0 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 , 𝑦 𝑡𝑗≠𝑖

by varying 𝑥ˆ 0𝑖 , 𝑥 𝑖
  
subject to 𝑐 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑖 𝑥ˆ 0𝑖 , 𝑥 𝑖 , 𝑦 𝑡𝑗≠𝑖 ≤ 0.
(13.29)

While the most common penalty functions in ATC are quadratic


penalty functions, other penalty functions are possible. As mentioned
in Section 5.3, penalty methods require a good selection of the penalty
weight values to converge fast and accurately enough.
Fig. 13.16 shows the ATC architecture XDSM, where 𝑤 denotes the
penalty function weights used in the determination of Φ0 and Φ𝑖 . The
details of ATC are described in Alg. 13.8.

Algorithm 13.8: Analytical target cascading

Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Corresponding objective value
𝑐 ∗ : Corresponding constraint values

0: Initiate main ATC iteration


repeat
for Each discipline 𝑖 do
1: Initiate discipline optimizer
repeat
2: Evaluate disciplinary analysis
13 Multidisciplinary Design Optimization 389

(0) (0) (0)


w (0) x0 , y t,(0) x̂0i , xi

0, 8 → 1 :
(no data) 6:w 3 : wi
w update

5, 7 → 6 :
x0∗ System 6 : x0 , y t 3 : x0 , y t 2 : yj6t=i
Optimization

6:
System and
7 : f0 , Φ0...N
Penalty
Functions

1, 4 → 2 :
xi∗ 6 : x̂0i , xi 3 : x̂0i , xi 2 : x̂0i , xi
Optimization i

3:
Discipline i
4 : fi , ci , Φ0 , Φi
and Penalty
Functions

2:
yi∗ 6 : yi 3 : yi
Analysis i

3: Compute discipline objective and constraint functions and Figure 13.16: Diagram for the ATC
penalty function values architecture
4: Update discipline design variables
until 4 → 2: Discipline optimization has converged
end for
5: Initiate system optimizer
repeat
6: Compute system objective, constraints, and all penalty functions
7: Update system design variables and coupling targets.
until 7 → 6: System optimization has converged
8: Update penalty weights
until 8 → 1: Penalty weights are large enough
13 Multidisciplinary Design Optimization 390

13.6.4 Bilevel Integrated System Synthesis


Bilevel integrated system synthesis (BLISS) uses a series of linear
approximations to the original design problem, with bounds on the
design variable steps, to prevent the design point from moving so far
away that the approximations are too inaccurate. This is an idea similar
to that of trust-region methods in Section 4.5. These approximations are
constructed at each iteration using coupled derivatives (see Section 13.4).
The system level subproblem is formulated as
 
d 𝑓0∗
minimize ( 𝑓0∗ )0 + Δ𝑥0
d𝑥0
by varying Δ𝑥 0
 
d𝑐 0∗
subject to (𝑐 0∗ )0 + Δ𝑥0 ≤ 0 (13.30)
d𝑥0
 ∗
d𝑐 𝑖
(𝑐 ∗𝑖 )0 + Δ𝑥0 ≤ 0 for 𝑖 = 1, . . . , 𝑁
d𝑥0
Δ𝑥 0𝐿 ≤ Δ𝑥0 ≤ Δ𝑥0𝑈 .

The discipline 𝑖 subproblem is given by


 
d 𝑓0
minimize ( 𝑓0 )0 + Δ𝑥 𝑖
d𝑥 𝑖
by varying Δ𝑥 𝑖
 
d𝑐 0
subject to (𝑐0 )0 + Δ𝑥 𝑖 ≤ 0 (13.31)
d𝑥 𝑖
 
d𝑐 𝑖
(𝑐 𝑖 )0 + Δ𝑥 𝑖 ≤ 0
d𝑥 𝑖
Δ𝑥 𝑖𝐿 ≤ Δ𝑥 𝑖 ≤ Δ𝑥 𝑖𝑈 .

The extra set of constraints in both system-level and discipline sub-


problems denote the design variables bounds. To prevent violation of
the disciplinary constraints by changes in the shared design variables,
post-optimality derivatives (the change in the optimized disciplinary
constraints with respect to a change in the system design variables) are
required to solve the system-level subproblem.

Algorithm 13.9: Bilevel integrated system synthesis

Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
13 Multidisciplinary Design Optimization 391

(0) (0)
x (0) y t,(0) x0 xi

0, 11 → 1 :
(no data) Convergence
Check

1, 3 → 2 :
6 : yj6t=i 6, 9 : yj6t=i 6 : yj6t=i 2, 5 : yj6t=i
MDA

8, 10 :
x0∗ 11 : x0 System 6, 9 : x0 6, 9 : x0 9 : x0 6 : x0 2, 5 : x0
Optimization

4, 7 :
xi∗ 11 : x0 6, 9 : xi 6, 9 : xi 9 : xi 6 : xi 2, 5 : xi
Optimization i

6, 9 :
10 : f0 , c0 7 : f0 , c0 System
Functions

6, 9 :
10 : fi , ci 7 : fi , ci Discipline i
Functions

9:
Shared
10 : df /dx0 , dc/dx0
Variable
Derivatives

3:
Discipline i
7 : df0,i /dx0 , dc0,i /dx0
Variable
Derivatives

2:
yi∗ 3 : yi 6, 9 : yi 6, 9 : yi 9 : yi 6 : yi
Analysis i

𝑓 ∗ : Corresponding objective value Figure 13.17: Diagram for the BLISS


𝑐 ∗ : Corresponding constraint values architecture

0: Initiate system optimization


repeat
1: Initiate MDA
repeat
2: Evaluate discipline analyses
3: Update coupling variables
until 3 → 2: MDA has converged
4: Initiate parallel discipline optimizations
for Each discipline 𝑖 do
5: Evaluate discipline analysis
6: Compute objective and constraint function values and derivatives
with respect to local design variables
7: Compute the optimal solutions for the disciplinary subproblem
end for
13 Multidisciplinary Design Optimization 392

8: Initiate system optimization


9: Compute objective and constraint function values and derivatives with
respect to shared design variables using post-optimality analysis
10: Compute optimal solution to system subproblem
until 11 → 1: System optimization has converged

Figure 13.17 shows the XDSM for BLISS and the corresponding steps
are listed in Alg. 13.9. Since BLISS uses an MDA, it is a distributed MDF
architecture. Due to the linear nature of the optimization problems,
repeated interrogation of the objective and constraint functions is not
necessary once we have the gradients. If the underlying problem is
highly nonlinear, the algorithm may converge slowly. The variable
bounds may help the convergence if these bounds are properly chosen,
such as through a trust region framework.

13.6.5 Asymmetric Subspace Optimization


Asymmetric subspace optimization (ASO) is a distributed MDF archi-
tecture that is motivated by cases where there is a large discrepancy
between the cost of the disciplinary solvers. To reduce the number
of the more expensive disciplinary analysis, the cheaper disciplinary
analyses are replaced by disciplinary design optimizations inside the
overall MDA.
The system-level optimization subproblem is
 Õ 
minimize 𝑓0 𝑥, 𝑦 𝑥, 𝑦 + 𝑓 𝑘 𝑥0 , 𝑥 𝑘 , 𝑦 𝑘 𝑥0 , 𝑥 𝑘 , 𝑦 𝑗≠𝑘
𝑘
by varying 𝑥0 , 𝑥 𝑘

subject to 𝑐0 𝑥, 𝑦 𝑥, 𝑦 ≤0

𝑐 𝑘 𝑥0 , 𝑥 𝑘 , 𝑦 𝑘 𝑥0 , 𝑥 𝑘 , 𝑦 𝑗≠𝑘 ≤0 for all 𝑘,
(13.32)

where subscript 𝑘 denotes disciplinary information that remains outside


of the MDA. The disciplinary optimization subproblem for discipline 𝑖,
which is resolved inside the MDA, is
 
minimize 𝑓0 𝑥, 𝑦 𝑥, 𝑦 + 𝑓𝑖 𝑥 0 , 𝑥 𝑖 , 𝑦 𝑖 𝑥0 , 𝑥 𝑖 , 𝑦 𝑗≠𝑖
by varying 𝑥𝑖 (13.33)

subject to 𝑐 𝑖 𝑥0 , 𝑥 𝑖 , 𝑦 𝑖 𝑥0 , 𝑥 𝑖 , 𝑦 𝑗≠𝑖 ≤ 0.

Figure 13.18 shows a three-discipline case where the third discipline


is replace with a design optimization. The corresponding sequence of
operations in ASO is listed in Alg. 13.10.
13 Multidisciplinary Design Optimization 393

(0) (0)
x0,1,2 y t,(0) x3

0, 10 → 1 :

x0,1,2 System 9 : x0,1,2 2 : x0 , x1 3 : x0 , x2 6 : x0,1,2 5 : x0
Optimization

9:
Discipline 0, 1,
10 : f0,1,2 , c0,1,2
and 2
Functions

0, 8 → 2 :
2 : y2t , y3t 3 : y3t
MDA

2:
y1∗ 9 : y1 8 : y1 3 : y1 6 : y1 5 : y1
Analysis 1

3:
y2∗ 9 : y2 8 : y2 6 : y2 5 : y2
Analysis 2

4, 7 → 5 :
x3∗ 9 : x3 6 : x3 5 : x3
Optimization 3

6:
Discipline 0
7 : f0 , c0 , f3 , c3
and 3
Functions

5:
y3∗ 9 : y3 8 : y3 6 : y3
Analysis 3

Algorithm 13.10: ASO Figure 13.18: Diagram for the ASO


architecture
Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Corresponding objective value
𝑐 ∗ : Corresponding constraint values

0: Initiate system optimization


repeat
1: Initiate MDA
repeat
2: Evaluate Analysis 1
3: Evaluate Analysis 2
4: Initiate optimization of Discipline 3
repeat
5: Evaluate Analysis 3
13 Multidisciplinary Design Optimization 394

6: Compute discipline 3 objectives and constraints


7: Update local design variables
until 7 → 5: Discipline 3 optimization has converged
8: Update coupling variables
until 8 → 2 MDA has converged
9: Compute objective and constraint function values for all disciplines 1
and 2
10: Update design variables
until 10 → 1: System optimization has converged

For a gradient-based system-level optimizer, the gradients of the


objective and constraints must take into account the suboptimization.
This requires coupled post-optimality derivative computation, which
increases the cost of both computational and implementation time com-
pared a normal coupled derivative computation. The total optimization
cost is only competitive with MDF if the discrepancy between each
disciplinary solver is high enough.

13.6.6 Other Distributed Architectures


There are other distributed MDF architectures other than BLISS and
ASO: concurrent subspace optimization (CSSO) and MDO of indepen-
dent subspaces (MDOIS)
CSSO requires surrogate models for the analyses for all disciplines.
The system-level optimization subproblem is solved based on the
surrogate models and is therefore fast. The discipline-level optimization
subproblem uses the actual analysis from the corresponding discipline
and surrogate models for all other disciplines. The solutions for each
discipline subproblem is used to update the surrogate models.
MDOIS only applies when no global variables exist. In this case, dis-
cipline subproblems are solved independently assuming fixed coupling
variables, and then an MDA is performed to update the coupling.
There are also other distributed IDF architectures. Some of these
are like CO in that they use a multilevel approach to enforce multidis-
ciplinary feasibility: BLISS-2000 and quasi-separable decomposition
(QSD). Other architectures enforce multidisciplinary feasibility with
penalties, like ATC: inexact penalty decomposition (IPD), exact penalty
decomposition (EPD), and enhanced collaborative optimization (ECO).
BLISS-2000 is a variation of BLISS that uses surrogate models to
represent the coupling variables for all disciplines. Each discipline
subproblem minimizes the linearized objective with respect to local
variables subject to local constraints. The system-level subproblem
13 Multidisciplinary Design Optimization 395

minimizes the objective with respect to the global variables and coupling
variables while enforcing consistency constraints.
When using QSD, the objective and constraint functions are assumed
to be dependent only on the shared design variables and coupling
variables. Each discipline is assigned a “budget” for a local objective and
the discipline problems maximize the margin in their local constraints
and the budgeted objective. The system-level subproblem minimizes
the objective and budgets of each discipline while enforcing the global
constraints and a positive margin for each discipline.
IPD and EPD are applicable to MDO problems with no global
objectives or constraints. They are similar to ATC in that copies of
the share variables are used for every discipline subproblem and the
consistency constraints are relaxed with a penalty function. Unlike
ATC, however, the simpler structure of the discipline subproblems is
exploited to compute post-optimality derivatives to guide the system-
level optimization subproblem.
Like CO, ECO uses copies of the global variables. The discipline
subproblems minimize quadratic approximations of the objective while
enforcing local constraints and linear models of the nonlocal constraints.
The system-level subproblem minimizes the total violation of all con-
sistency constraints with respect to the global variables.

13.7 Summary

MDO architectures provide different options for solving MDO problems.


An acceptable MDO architecture must be mathematically equivalent to
the original problem and thus converge to the same optima. Sequential
optimization, while intuitive, is not mathematically equivalent to the
original problem and yields a design inferior to the MDO optimum.
MDO architectures are divided into two broad categories, as shown
in Fig. 13.19: monolithic architecture and distributed architectures.
Monolithic architectures solve a single optimization problem, while
distributed architecture solve optimization subproblems for each dis-
cipline and a system-level optimization problem. Overall, monolithic
architectures exhibit a much better convergence rate than distributed
architectures. 146 In the last few years, the vast majority of MDO appli- 146. Tedford et al., Benchmarking Multidis-
ciplinary Design Optimization Algorithms.
cations have used monolithic MDO architectures. 2010
The distributed architectures can be divided according to whether
they enforce multidisciplinary feasibility (through an MDA of the
whole system), or not. Distributed MDF architectures enforce multidis-
ciplinary feasibility through an MDA. The distributed IDF architectures
are like IDF in that no MDA is required. However, they must ensure
13 Multidisciplinary Design Optimization 396

MDF/MAUD

Monolithic IDF

SAND

BLISS
MDO
architecture CSSO
classification Distributed MDF
MDOIS

ASO
Distributed CO
Multilevel
QSD
Penalty

Distributed IDF ATC

IPD/EPD

ECO

multidisciplinary feasibility in some other way. Some do this by formu- Figure 13.19: Classification of MDO
lating an appropriate multilevel optimization (such as CO) and others architectures.
use penalties to ensure this (such as ATC). ‡‡ ‡‡ Martins et al.36 describes all these MDO

There are a number of commercial MDO frameworks that are architectures in detail.
36. Martins et al., Multidisciplinary Design
available, including Isight/SEE 147 by Dassault Systèmes, ModelCen- Optimization: A Survey of Architectures.
ter/CenterLink by Phoenix Integration, modeFRONTIER by Esteco, 2013
147. Golovidov et al., Flexible implementa-
AML Suite by TechnoSoft, Optimus by Noesis Solutions, TechnoSoft’s tion of approximation concepts in an MDO
AML suite, Noesis Solutions’ Optimus, and VisualDOC by Vander- framework. 1998
plaats Research and Development 148 . These frameworks focus on 148. Balabanov et al., VisualDOC: A Soft-
ware System for General Purpose Integration
making it easy for users to couple multiple disciplines and to use the and Design Optimization. 2002
optimization algorithms through graphical user interfaces. They also
provide convenient wrappers to popular commercial engineering tools.
While this focus has made it convenient for users to implement and
solve MDO problems, the numerical methods used to converge the
multidisciplinary analysis (MDA) and the optimization problem are
usually not as sophisticated as the methods presented in this book. For
example, these frameworks often use fixed-point iteration to converge
the MDA. When derivatives are needed for a gradient-based optimizer,
finite-difference approximations are used rather than more accurate
analytic derivatives.

Problems

13.1 Answer true or false and justify your answer.

a) We prefer to use the term “component” instead of “discipline”


because it is more general.
13 Multidisciplinary Design Optimization 397

b) Local design variables affect only one discipline in the MDO


problem, while global variables affect all disciplines.
c) All multidisciplinary models can be written in the functional
form, but not all can be written in the residual form.
d) The coupling variables are a subset of component state
variables.
e) Multidisciplinary models can be represented by directed
cyclic graphs where the nodes represent components and
edges represent coupling variables.
f) The nonlinear block Jacobi and Gauss–Seidel methods can
be used with any combination of component solvers.
g) All the derivative computation methods from Chapter 6 can
be implemented for coupled multidisciplinary systems.
h) Implicit analytic methods for derivative computation are
incompatible with the functional form of multidisciplinary
models.
i) The modular analysis and unified derivatives architecture is
based on the unified derivatives equation.
j) The MDF architecture has fewer design variables and more
constraints than IDF.
k) The main difference between monolithic and distributed
MDO architectures is that the distributed architectures per-
form optimization at multiple levels.
l) Sequential optimization is a valid MDO approach but the
main disadvantage is that it converges slowly.

13.2 Pick a multidisciplinary engineering system from the literature


or formulate one based on your experience.

a) Identify the different analyses and coupling variables.


b) List the design variables and classify them as local or global.
c) Identify the objective and constraint functions.
d) Draw a diagram similar to the one in Fig. 13.1 for your
system.
e) Exploration: Think about the objective that each discipline
would have if considered separately and discuss the trades
needed to optimize the multidisciplinary objective.
Mathematics Review
A
This chapter briefly reviews select mathematical concepts that are used
throughout the book.

By the end of this chapter you should be able to:

1. Compute and understand the difference between partial


and total derivatives

2. Identify and use vector norms.

3. Perform matrix multiplications and compute derivatives


of matrix functions.

4. Perform Taylor’s series expansions.

A.1 Chain Rule, Partial Derivatives, and Total Derivatives

The single variable chain rule is needed for differentiating composite


functions. Given a composite function, 𝑓 (𝑔(𝑥)), the derivative with
respect to the variable 𝑥 is:

d  d 𝑓 d𝑔
𝑓 (𝑔(𝑥)) = (A.1)
d𝑥 d𝑔 d𝑥

Example A.1: Single variable chain rule.

Let 𝑓 (𝑔(𝑥)) = sin(𝑥 2 ). In this case, 𝑓 (𝑔) = sin(𝑔), and 𝑔(𝑥) = 𝑥 2 . The
derivative with respect to 𝑥 is:

𝑑  𝑑 𝑑
𝑓 (𝑔(𝑥)) = (sin(𝑔)) (𝑥 2 ) = cos(𝑥 2 )(2𝑥) (A.2)
𝑑𝑥 𝑑𝑔 𝑑𝑥

If a function depends on more than one variable, then we need


to distinguish between partial and total derivatives. For example, if

398
A Mathematics Review 399

𝑓 (𝑔(𝑥), ℎ(𝑥)) then 𝑓 is a function of two variables: 𝑔 and ℎ. The


application of the chain rule for this function is:

𝑑  𝜕 𝑓 𝑑𝑔 𝜕 𝑓 𝑑ℎ
𝑓 (𝑔(𝑥), ℎ(𝑥)) = + (A.3)
𝑑𝑥 𝜕𝑔 𝑑𝑥 𝜕ℎ 𝑑𝑥

where 𝜕/𝜕𝑥 indicates a partial derivative and 𝑑/𝑑𝑥 is a total derivative.


When taking a partial derivative, we take the derivative with respect
to only that variable, treating all other variables as constants. More
generally,
𝑛 
Õ 
𝑑 𝜕 𝑓 𝑑𝑔𝑖
( 𝑓 (𝑔1 (𝑥), . . . , 𝑔𝑛 (𝑥))) = (A.4)
𝑑𝑥 𝜕𝑔𝑖 𝑑𝑥
𝑖=1

Example A.2: Partial versus total derivatives.

Consider 𝑓 (𝑥, 𝑦(𝑥)) = 𝑥 2 + 𝑦 2 where 𝑦(𝑥) = sin(𝑥). The partial derivative of


𝑓 with respect to 𝑥 is:
𝜕𝑓
= 2𝑥 (A.5)
𝜕𝑥
whereas the total derivative of 𝑓 with respect to 𝑥 is:

𝑑𝑓 𝜕𝑓 𝜕 𝑓 𝑑𝑦
= +
𝑑𝑥 𝜕𝑥 𝜕𝑦 𝑑𝑥
(A.6)
= 2𝑥 + 2𝑦 cos(𝑥)
= 2𝑥 + 2 sin(𝑥) cos(𝑥)

Notice that the partial derivative and total derivative are quite different. For this
simple case we could also find the total derivative by direct substitution and then
using an ordinary one-dimensional derivative. Substituting in 𝑦(𝑥) = sin(𝑥)
directly into the original expression for 𝑓 :

𝑓 (𝑥) = 𝑥 2 + sin2 (𝑥) (A.7)


𝑑𝑓
= 2𝑥 + 2 sin(𝑥) cos(𝑥) (A.8)
𝑑𝑥

Example A.3: Multivariable chain rule.

Expanding on our single variable example, let 𝑔(𝑥) = cos(𝑥) and ℎ(𝑥) =
sin(𝑥) and 𝑓 (𝑔, ℎ) = 𝑔 2 ℎ 3 . Then 𝑓 (𝑔(𝑥), ℎ(𝑥)) = cos2 (𝑥) sin3 (𝑥) Applying
Eq. A.3 we have:

𝑑 𝜕 𝑓 𝑑𝑔 𝜕 𝑓 𝑑ℎ
( 𝑓 (𝑔(𝑥), ℎ(𝑥))) = +
𝑑𝑥 𝜕𝑔 𝑑𝑥 𝜕ℎ 𝑑𝑥
𝑑𝑔 𝑑ℎ
= 2𝑔 ℎ 3 + 𝑔 2 3ℎ 2 (A.9)
𝑑𝑥 𝑑𝑥
= −2𝑔 ℎ 3 sin(𝑥) + 𝑔 2 3ℎ 2 cos(𝑥)
= −2 cos(𝑥) sin4 (𝑥) + 3 cos3 (𝑥) sin2 (𝑥)
A Mathematics Review 400

A.2 Vector and Matrix Norms

The most familiar norm for vectors is the 2-norm, which corresponds
to the Euclidean length of the vector:

k𝑥 k 2 = (𝑥12 + 𝑥22 + . . . + 𝑥 𝑛2 )1/2 , (A.10)

Because this norm is used so often, we often omit the subscript an


just write k𝑥k. More generally, we can refer to a class of norms called
𝑝-norms:
k𝑥 k 𝑝 = (|𝑥1 | 𝑝 + |𝑥2 | 𝑝 + . . . + |𝑥 𝑛 | 𝑝 )1/𝑝 (A.11)
Of all the 𝑝-norms, three that are most commonly used: the 2-norm we
just discussed, the 1-norm, and the ∞-norm. From the above definition,
||𝑥|| 1
we see that the 1-norm is the sum of the absolute values of all the entries
in x.
k𝑥k 1 = |𝑥1 | + |𝑥2 | + . . . + |𝑥 𝑛 | (A.12)
The application of ∞ in the 𝑝-norm definition is perhaps less obvious,
but as 𝑝 → ∞ the largest term in that sum dominates all of the others.
Raising that quantity to the power of 1/𝑝 causes the exponents to cancel,
||𝑥|| 2
leaving only the largest magnitude component of 𝑥. This is exactly how
the infinity norm is defined:

k𝑥 k ∞ = max |𝑥 𝑖 | (A.13)
𝑖

The infinity norm is commonly used in connection with optimization


convergence criteria.
||𝑥|| ∞
Several norms for matrices exist as well. One that is used in this
book is the Frobenius norm:
v
u
tÕ𝑚 Õ
𝑛
k𝐻 k 𝐹 = 𝐻𝑖𝑗2 , (A.14)
𝑖=1 𝑗=1

where 𝐻 is an 𝑚 × 𝑛 matrix. ||𝑥|| 𝑝

A.3 Matrix Multiplication

Consider a matrix 𝐴 ∈ R𝑚x𝑛 ∗ and a matrix 𝐵 ∈ R𝑛x𝑝 . The two matrices Figure A.1: Norms for two-
can be multiplied together (𝐶 = 𝐴𝐵) as follows: dimensional case.
∗ This
Õ
means that the matrix is comprised
𝑛
of real numbers and that it has 𝑚 rows
𝐶 𝑖𝑗 = 𝐴𝑖 𝑘 𝐵 𝑘 𝑗 (A.15) and 𝑛 columns.
𝑘=1
A Mathematics Review 401

where 𝐶 ∈ R𝑚𝑥𝑝 . Notice that two matrices can be multiplied only if there
inner dimensions are equal (𝑛 in this case). The remaining products
discussed in section are just special cases of of matrix multiplication,
but are common enough that we discuss them separately.

A.3.1 Vector-Vector Products


In this book, a vector 𝑢 is a column vector, thus the row vector would
be repented as 𝑢 𝑇 . The product of two vectors can be performed in
two ways. The more common is called an inner product (also known
as a dot product, or scalar product). The inner product, is a functional,
meaning it is an operator that acts on vectors and produces a scalar.
In the real vector space, R𝑛 , the inner product of two vectors, 𝑢 and 𝑣,
whose dimension is equal, is defined algebraically as:

 𝑣1 
 
   𝑣2  Õ
𝑛
𝑢 𝑇 𝑣 = 𝑢1 𝑢2 . . . 𝑢𝑛  .  = 𝑢𝑖 𝑣 𝑖 (A.16)
 .. 
  𝑖=1
𝑣 𝑛 
 
Notice that the order is irrelevant:

𝑢𝑇 𝑣 = 𝑣𝑇 𝑢 (A.17)

In Euclidean space, where vectors possess magnitude and direction,


the inner product is defined as

𝑢 𝑇 𝑣 = ||𝑢|| ||𝑣|| cos(𝜃) (A.18)

where || · || indicates a 2-norm, and 𝜃 is the angle between the two


vectors.
An alternative vector-vector product is the outer product, which takes
the two vectors and multiplies them element-wise to produce a matrix.
Note that the outer product, unlike the inner product, does not require
the vectors to be of the same length.

 𝑢1   𝑢1 𝑣 1 𝑢1 𝑣 𝑛 
   𝑢1 𝑣2 ···
 𝑢2     𝑢2 𝑣1 𝑢2 𝑣 𝑛 
  𝑢2 𝑣2 ···
𝑢𝑣 𝑇 =  .  𝑣1 𝑣2 . . . 𝑣𝑛 =  . ..  (A.19)
 ..   .. .. ..
. 
   . .
𝑢 𝑚  𝑢 𝑚 𝑣 1 𝑢𝑚 𝑣 𝑛 
   𝑢𝑚 𝑣 2 ···

or in index form:
(𝑢𝑣 𝑇 )𝑖𝑗 = 𝑢𝑖 𝑣 𝑗 (A.20)
A Mathematics Review 402

A.3.2 Matrix-Vector Products


Consider multiplying a matrix 𝐴 ∈ R𝑚𝑥𝑛 by a vector 𝑢 ∈ R𝑛 . The result
is a vector 𝑣 ∈ R𝑚 .
Õ
𝑛
𝑣 = 𝐴𝑢 ⇒ 𝑣 𝑖 = 𝐴 𝑖𝑗 𝑢 𝑗 (A.21)
𝑗=1

We can see that the entries in 𝑣 are dot products between the rows of 𝐴
and 𝑢:
 —— 𝑎 𝑇 —— 
 
 —— 𝑎 𝑇 —— 
1
 
𝑣= 𝑢
2
(A.22)
 ..

 . 
 —— 𝑎 𝑇 ——
 𝑚 
where 𝑎 𝑇𝑗 is the 𝑗 th row of the matrix 𝐴.
Alternatively, it could be thought of as a linear combination of the
columns of 𝐴 where the 𝑢 𝑗 are the weights:
| | |
     
𝑣 =  𝑎1  𝑢1 +  𝑎2  𝑢2 + . . . +
 
 𝑎 𝑛  𝑢𝑛
  (A.23)
| | |
     
where 𝑎 𝑖 are the columns of 𝐴.
We can also multiply by a vector on the left, instead of on the right:

𝑣𝑇 = 𝑢𝑇 𝐴 (A.24)

In this case a row vector is multiplied against a matrix producing a row


vector.

A.3.3 Quadratic Form (Vector-Matrix-Vector Product)


Another common product is a quadratic form. A quadratic form consists
of a row vector, times a matrix, times a column vector, producing a
scalar:
 𝐴11 𝐴1𝑛   𝑢1 
 𝐴12 ···  
   𝐴21 𝐴22 ··· 𝐴2𝑛   𝑢2 
 
𝛼 = 𝑢 𝑇 𝐴𝑢 = 𝑢1 𝑢2 . . . 𝑢𝑛  . ..   ..  (A.25)
 .. .. ..
.  .
 . .  
𝐴𝑛1 𝐴𝑛𝑛  𝑢 𝑛 
 𝐴𝑛2 ···  
or in index form:
Õ
𝑛 Õ
𝑛
𝛼= 𝑢𝑖 𝐴 𝑖𝑗 𝑢 𝑗 (A.26)
𝑖=1 𝑗=1

In general, a vector-matrix-vector product can have a non-square 𝐴


matrix, and the vectors would be two different sizes, but for a quadratic
A Mathematics Review 403

form the two vectors 𝑢 are identical and thus 𝐴 is square. Also in a
quadratic form we assume that 𝐴 is symmetric (even if it isn’t, only
the symmetric part of 𝐴 contributes anyway so effectively it acts like a
symmetric matrix).

A.4 Matrix Types

There are several common types of matrices that appear regularly


throughout this book. We review some terminology here.
A diagonal matrix is a matrix where all off-diagonal terms are zero.
In other words 𝐴 is diagonal if:

𝐴 𝑖𝑗 = 0 for all 𝑖 ≠ 𝑗 (A.27)

The identity matrix 𝐼 is a special diagonal matrix where all diagonal


components are one.
The transpose of a matrix is defined as:

[𝐴𝑇 ]𝑖𝑗 = 𝐴 𝑗𝑖 (A.28)

Note that:

(𝐴𝑇 )𝑇 = 𝐴 (A.29)
𝑇
(𝐴 + 𝐵) = 𝐴 + 𝐵 𝑇 𝑇
(A.30)
𝑇
(𝐴𝐵) = 𝐵 𝐴 𝑇 𝑇
(A.31)

A symmetric matrix is one where the matrix is equal to its transpose:

𝐴 𝑖𝑗 = 𝐴 𝑗𝑖 (A.32)

The inverse of a matrix satisfies:

𝐴𝐴−1 = 𝐼 = 𝐴−1 𝐴 (A.33)

Not all matrices are invertible. Some common properties for inverses
are:

(𝐴−1 )−1 = 𝐴 (A.34)


(𝐴𝐵) −1 −1
=𝐵 𝐴 −1
(A.35)
−1 𝑇
[𝐴 ] = [𝐴 ] 𝑇 −1
(A.36)

A symmetric matrix 𝐴 is positive definite if for all vectors x in the real


space:
𝑥 𝑇 𝐴𝑥 > 0 (A.37)
A Mathematics Review 404

Similarly, a positive semi-definite matrix satisfies the condition:

𝑥 𝑇 𝑀𝑥 ≥ 0 (A.38)

and a negative definite matrix satisfies the condition

𝑥 𝑇 𝑀𝑥 < 0 (A.39)

A.5 Matrix Derivatives

Let’s consider derivatives of a few common cases: linear and quadratic


functions. Combining the concept of partial derivatives and matrix
forms of equations allows us to find the gradients of matrix functions.
First, let us look at a linear function, 𝑓 , defined as

Õ
𝑛
𝑓 (𝑥) = 𝑎 𝑇 𝑥 + 𝑏 = 𝑎𝑖 𝑥𝑖 + 𝑏𝑖 (A.40)
𝑖=1

where 𝑎, 𝑥, and 𝑏 are vectors of length 𝑛, and 𝑎 𝑖 , 𝑥 𝑖 , and 𝑏 𝑖 are the ith
elements of 𝑎, 𝑥, and 𝑏, respectively. If we take the partial derivative
of each element with respect to an arbitrary element of 𝑥, namely 𝑥 𝑘 ,
we get " #
𝜕 Õ
𝑛
𝑎𝑖 𝑥𝑖 + 𝑏𝑖 = 𝑎𝑘 (A.41)
𝜕𝑥 𝑘
𝑖=1

Thus:
∇𝑥 (𝑎 𝑇 𝑥 + 𝑏) = 𝑎 (A.42)
Recall the quadratic form presented in Appendix A.3.3, we can
combine that with a linear term to form a general quadratic function:

𝑓 (𝑥) = 𝑥 𝑇 𝐴𝑥 + 𝑏 𝑇 𝑥 + 𝑐 (A.43)

where 𝑥, 𝑏 and 𝑐 are still vectors of length n, and 𝐴 is an 𝑛 by 𝑛


symmetric matrix. In index notation, 𝑓 looks like

Õ
𝑛 Õ
𝑛
𝑓 (𝑥) = 𝑥 𝑖 𝑎 𝑖𝑗 𝑥 𝑗 + 𝑏 𝑖 𝑥 𝑖 + 𝑐 𝑖 (A.44)
𝑖=1 𝑗=1

For convenience, we’ll separate the diagonal terms from the off
diagonal terms leaving us with

Õ
𝑛
  Õ
𝑓 (𝑥) = 𝑎 𝑖𝑖 𝑥 2𝑖 + 𝑏 𝑖 𝑥 𝑖 + 𝑐 𝑖 + 𝑥 𝑖 𝑎 𝑖𝑗 𝑥 𝑗 (A.45)
𝑖=1 𝑗≠𝑖
A Mathematics Review 405

Now we take the partial derivatives with respect to 𝑥 𝑘 as before yielding:

𝜕𝑓 Õ Õ
= 2𝑎 𝑘 𝑘 𝑥 𝑘 + 𝑏 𝑘 + 𝑥𝑗 𝑎𝑗𝑘 + 𝑎𝑘 𝑗 𝑥𝑗 (A.46)
𝜕𝑥 𝑘
𝑗≠𝑖 𝑗≠𝑖

We now move the diagonal terms back into the sums to get

𝜕𝑓 Õ 𝑛
= 𝑏𝑘 + (𝑥 𝑗 𝑎 𝑗 𝑘 + 𝑎 𝑘 𝑗 𝑥 𝑗 ), (A.47)
𝜕𝑥 𝑘
𝑗=1

which we can put back into matrix form as:

∇𝑥 𝑓 (𝑥) = 𝐴𝑇 𝑥 + 𝐴𝑥 + 𝑏 (A.48)

If 𝐴 is symmetric then 𝐴𝑇 = 𝐴 and thus:

∇𝑥 (𝑥 𝑇 𝐴𝑥 + 𝑏 𝑇 𝑥 + 𝑐) = 2𝐴𝑥 + 𝑏 (A.49)

A.6 Taylor Series Expansion

Series expansions are representations of given function in terms of


a series of other (usually simpler) functions. One common series
expansion is the Taylor series, which is expressed as a polynomial whose
coefficients are based on the derivatives of the original function at a
fixed point.
The Taylor series is a general tool that can be applied whenever
the function has derivatives. We can use this series to estimate the
value of the function near the given point, which is useful when the
function is difficult to evaluate directly. The Taylor series is used to
derive algorithms for finding zeroes of functions and algorithms for
minimizing functions.
To derive the Taylor series, we start with an infinite polynomial
series about an arbitrary point, 𝑥, to approximate the value of a function
at 𝑥 + Δ𝑥 using

𝑓 (𝑥 + Δ𝑥) = 𝑎0 + 𝑎1 Δ𝑥 + 𝑎 2 Δ𝑥 2 + . . . + 𝑎 𝑘 Δ𝑥 𝑘 + . . . (A.50)

We can make this approximation exact at Δ𝑥 = 0 by setting the first


coefficient to 𝑓 (𝑥). To find the appropriate value for 𝑎1 , we take the first
derivative to get

𝑓 0(𝑥 + Δ𝑥) = 𝑎 1 + 2𝑎2 Δ𝑥 + . . . + 𝑖𝑎 𝑘 Δ𝑥 𝑘−1 + . . . , (A.51)

which means that we need 𝑎 1 = 𝑓 0(𝑥) to obtain an exact derivative at 𝑥.


To derive the other coefficients, we systematically take the derivative of
A Mathematics Review 406

both sides and the appropriate value of the first nonzero term (which
is always constant). Identifying the pattern yields the general formula
for the 𝑛 th -order coefficient
𝑓 (𝑘) (𝑥)
𝑎𝑘 = . (A.52)
𝑘!
Substituting this into the polynomial (A.50) yields the Taylor series
Õ

Δ𝑥 𝑘
𝑓 (𝑥 + Δ𝑥) = 𝑓 (𝑘) (𝑥). (A.53)
𝑘!
𝑘=0

The series is typically truncated to use terms up to order 𝑚,


Õ
𝑚
Δ𝑥 𝑘
𝑓 (𝑥 + Δ𝑥) = 𝑓 (𝑘) (𝑥) + 𝒪(Δ𝑥 𝑚+1 ), (A.54)
𝑘!
𝑘=0

which yields an approximation with a truncation error of order 𝒪(Δ𝑥 𝑚+1 ).


In optimization, it is common use the first three terms (up to 𝑚 = 2) to
get a quadratic approximation.

Example A.4: Taylor series expansion for single variable.

Consider the scalar function of a single variable, 𝑓 (𝑥) = 𝑥 − 4 cos(𝑥). If we


use Taylor series expansions of this function about 𝑥 = 0, we get
1 1
𝑓 (Δ𝑥) = −4 + Δ𝑥 + 2Δ𝑥 2 − Δ𝑥 4 + Δ𝑥 6 − . . . . (A.55)
6 180
Three different truncations of this series are plotted and compared to the exact
function in Fig. A.2.

𝑛=2

6
=
The Taylor series in multiple dimensions is similar to the single
variable case, but more complicated. The first derivative of the function 𝑛
becomes a gradient vector and the second derivatives becomes a Hessian
matrix. Also, we need to define a direction along which we want to 𝑓
approximate the function, since that information is not inherent as it is 1
in a 1-D function. The Taylor series expansion in 𝑛-dimensions along a 𝑛=

direction 𝑝 can be written as 𝑛=4

Õ 1 ÕÕ
0
𝑛 𝑛 𝑛
𝜕𝑓 𝜕2 𝑓 𝑥
𝑓 (𝑥 + 𝛼𝑝) = 𝑓 (𝑥) + 𝛼 𝑝𝑘 + 𝛼2 𝑝𝑘 𝑝𝑙 + 𝒪(𝛼 3 ),
𝜕𝑥 𝑘 2 𝜕𝑥 𝑘 𝜕𝑥 𝑙 Figure A.2: Taylor series expan-
𝑘=1 𝑘=1 𝑙=1
(A.56) sions for 1-D example. The more
where 𝛼 is a scalar that determines how far to go in the direction 𝑝. In terms we consider from the Taylor
series, the better the approximation.
matrix form, we can write
1 
𝑓 (𝑥 + 𝛼𝑝) = 𝑓 (𝑥) + 𝛼∇ 𝑓 (𝑥)𝑇 𝑝 + 𝛼 2 𝑝 𝑇 𝐻(𝑥)𝑝 + 𝒪 𝛼3 , (A.57)
2
A Mathematics Review 407

where 𝐻 is the Hessian matrix.

Example A.5: Taylor series expansion for two variables.

Consider the following function of two variables,

1 2
𝑓 (𝑥1 , 𝑥2 ) = (1 − 𝑥1 )2 + (1 − 𝑥2 )2 + 2𝑥2 − 𝑥12 . (A.58)
2
Performing a Taylor series expansion about 𝑥 = [0, −2]𝑇 , we get,
 
 
1 10 0
𝑓 (𝛼𝑝) = 18 + 𝛼 −2 − 14 𝑝 + 𝛼 2 𝑝 𝑇 𝑝 (A.59)
2 0 6

The original function, the linear approximation, and the quadratic approxima-
tion are compared in Fig. A.3.

Original function Linear approximation (𝑛 = 1) Quadratic approximation (𝑛 = 2)

Figure A.3: Taylor series approxi-


mations for 2-D example.
Linear Solvers
B
In Section 3.7 we present an overview of solution methods for discretized
systems of equations, followed by an introduction to Newton-based
methods for solving nonlinear equations. Here, we review solvers for
linear systems, which are required to solve for each step of Newton-
based methods.

By the end of this chapter you should be able to:

1. Understand the differences between direct and indirect


methods for linear problems.

2. Understand how to apply these methods to nonlinear


problems.

If the equations are linear, they can be written as

𝑟(𝑢) = 𝑏 − 𝐴𝑢 = 0, (B.1)

where 𝐴 is a square (𝑛 × 𝑛) matrix and 𝑏 is a vector, and neither of these


depend on 𝑢. To solve this linear system, we can use either a direct
method or an iterative method.

B.1 Direct Methods

The standard way to solve linear systems of equations in a computer


is Gaussian elimination, which in matrix form is equivalent to 𝐿𝑈
decomposition. This is a decomposition (or factorization) of 𝐴, such as
𝐴 = 𝐿𝑈, where 𝐿 is a unit lower triangular matrix, and 𝑈 is an upper
triangular matrix, as shown in Fig. B.1.
The decomposition transforms the matrix 𝐴 into an upper-triangular
1
1
𝑈
= 1
matrix 𝑈 by introducing zeros below the diagonal, one column at the 𝐴
𝐿
1
1

time, starting with the first one and progressing from left to right. This 1

is done by subtracting multiples of each row from subsequent rows. Figure B.1: 𝐿𝑈 decomposition.
These operations can be expressed as sequence of multiplications with

408
B Linear Solvers 409

lower triangular matrices 𝐿 𝑖 ,


𝐿𝑛−1 · · · 𝐿2 𝐿1 𝐴 = 𝑈. (B.2)
| {z }
𝐿−1

After completing these operations, we have 𝑈 and we can find 𝐿 by


computing 𝐿 = 𝐿−11
𝐿−1
2
· · · 𝐿−1
𝑛−1
.
Once we have this decomposition, we have 𝐿𝑈𝑢 = 𝑏. Setting 𝑈𝑢
to 𝑦, we can solve 𝐿𝑦 = 𝑏 for 𝑦 by forward substitution. Now we have
𝑈𝑢 = 𝑦, which we can solve by back substitution for 𝑢.

Algorithm B.1: Solving 𝐴𝑢 = 𝑏 by 𝐿𝑈 decomposition

Inputs:
𝐴: Nonsingular square matrix
𝑏: A vector
Outputs:
𝑢: Solution to 𝐴𝑢 = 𝑏

Perform forward substitution to solve 𝐿𝑦 = 𝑏 for 𝑦:

𝑏1 1 © Õ𝑖−1
ª
𝑦1 = , 𝑦𝑖 = ­𝑏 𝑖 − 𝐿 𝑖𝑗 𝑦 𝑗 ® for 𝑖 = 2, . . . , 𝑛
𝐿11 𝐿 𝑖𝑖
« 𝑗=1 ¬
Perform backward substitution to solve the following 𝑈𝑢 = 𝑦 for 𝑢:

𝑦𝑛 1 © Õ
𝑛
ª
𝑢𝑛 = , 𝑢𝑖 = ­ 𝑦𝑖 − 𝑈 𝑖𝑗 𝑢 𝑗 ® for 𝑖 = 𝑛 − 1, . . . , 1
𝑈𝑛𝑛 𝑈 𝑖𝑖
« 𝑗=𝑖+1 ¬

The process described above is not stable in general and needs to be


modified. In particular, roundoff errors are magnified in the backward
substitution when diagonal elements of 𝐴 have a small magnitude.
This issue is resolved by using partial pivoting, which interchanges rows
to obtain more favorable diagonal elements.
Cholesky decomposition is an 𝐿𝑈 decomposition specialized for the
case where the matrix 𝐴 is symmetric and positive definite. In this case,
pivoting is not necessary because the Gaussian elimination is always
stable for symmetric positive definite matrices. The decomposition can
be written as
𝐴 = 𝐿𝐷𝐿𝑇 , (B.3)
where 𝐷 = diag[𝑈11 , . . . , 𝑈𝑛𝑛 ]. This can be expressed as the matrix
product
𝐴 = 𝐺𝐺𝑇 , (B.4)
where 𝐺 = 𝐿𝐷 1/2 .
B Linear Solvers 410

B.2 Iterative Methods

While direct methods are usually more efficient and robust, iterative
methods have several advantages:

• Iterative methods make it possible to trade between computational


cost and precision because they can be stopped at any point and
still yield an approximation of 𝑢. Direct methods, on the other
hand, only get the solution at the end of the process with the final
precision.

• Iterative methods have the advantage when a good guess for 𝑢


exists. This is often the case in optimization, where the 𝑢 from
the previous optimization iteration can be used as the guess for
the new evaluations (called a warm start).

• Iterative methods do not require forming and manipulating the


matrix 𝐴, which can be computational costly in terms of both time
and memory. Instead, iterative methods require the computation
of the residuals 𝑟(𝑢) = 𝐴𝑢 − 𝑏 and in the case of Krylov subspace
methods, products of 𝐴 with a given vector. Therefore, iterative
methods can be more efficient than direct methods for cases where
𝐴 is large and sparse. All that is needed is an efficient process to
get the product of 𝐴 with a given vector, as shown in Fig. B.2

Iterative methods are divided into fixed-point iteration methods 𝑣 𝐴𝑣


(also known as stationary iterative methods) and Krylov subspace
methods. Figure B.2: Iterative methods just
require a process to compute prod-
ucts of 𝐴 with an arbitrary vector
Fixed-point Iteration Methods
𝑣.
Fixed-point methods generate a sequence of iterates 𝑢 (1) , . . . , 𝑢 (𝑘) , . . .
using a function
 
𝑢 (𝑘+1) = 𝐺 𝑢 (𝑘) , 𝑘 = 0, 1, . . . , (B.5)

starting from an initial guess 𝑢0 . The function 𝐺(𝑢) is devised such that
the iterates converge to the solution 𝑢 ∗ , which satisfies 𝑟(𝑢 ∗ ) = 0. Many
fixed-point methods can be derived by splitting the matrix such that
𝐴 = 𝑀 − 𝑁. Then, 𝐴𝑢 = 𝑏 leads to 𝑀𝑢 = 𝑁𝑢 + 𝑏, and substituting this
into the linear system yields

𝑢 = 𝑀 −1 (𝑁𝑢 + 𝑏). (B.6)


B Linear Solvers 411

Since 𝑁𝑢 = 𝑀𝑢 − 𝐴𝑢, substituting this in to the above equation results


in the iteration
   
𝑢 (𝑘+1) = 𝑢 (𝑘) + 𝑀 −1 𝑏 − 𝐴𝑢 (𝑘) = 𝑢 (𝑘) + 𝑀 −1 𝑟 𝑢 (𝑘) . (B.7)

The splitting matrix 𝑀 is fixed and constructed so that it is easy to


invert. The closer 𝑀 −1 is to the inverse of 𝐴, the better the iterations
work. We now introduce three fixed-point methods corresponding to
three different splitting matrices.
The Jacobi method consists of setting 𝑀 to be a diagonal matrix 𝐷
where the diagonal entries are those of 𝐴. Then,
 
𝑢 (𝑘+1) = 𝑢 (𝑘) + 𝐷 −1 𝑟 𝑢 (𝑘) . (B.8)

In component form, this can be written as


 Õ 
 
(𝑘+1) 1 𝑏 𝑖 − (𝑘) 
𝑢𝑖 =  𝐴 𝑖𝑗 𝑢 𝑗  , 𝑖 = 1, . . . , 𝑛𝑢 (B.9)
𝐴 𝑖𝑖  
 𝑗=1,𝑗≠𝑖

Using this method, each component in 𝑢 𝑘+1 is independent of each
other at a given iteration; they only depend on the previous iteration
values, 𝑢 𝑘 , and can therefore be done in parallel.
The Gauss–Seidel method is obtained by setting 𝑀 to be the lower
triangular portion of 𝐴, and can be written as

𝑢 𝑘+1 = 𝑢 𝑘 + 𝐸 −1 𝑅(𝑢 𝑘 ), (B.10)

where 𝐸 is the lower triangular matrix. Because of the triangular


matrix structure, each component in 𝑢 𝑘+1 is dependent on the previous
components in the vector, but the iteration can be performed in a single
forward sweep. Writing this in component form yields
 Õ Õ 
 
(𝑘+1) 1 𝑏 𝑖 − (𝑘+1) (𝑘) 
𝑢𝑖 =  𝐴 𝑢 − 𝐴 𝑖𝑗 𝑗  ,
𝑢 𝑖 = 1, . . . , 𝑛𝑢 . (B.11)
 
𝑖𝑗 𝑗
𝐴 𝑖𝑖
 𝑗<𝑖 𝑗>𝑖

Unlike the Jacobi iterations, a Gauss–Seidel iteration cannot be per-
formed in parallel because of the terms where 𝑗 < 𝑖 above, which
require the latest values. Instead, the states must be updated sequen-
tially. However, the advantage of Gauss–Seidel if that it generally
converges faster than Jacobi iterations.
The successive over-relaxation (SOR) method uses an update that
is a weighted average of the Gauss–Seidel update and the previous
iteration,
𝑢 𝑘+1 = 𝑢 𝑘 + 𝜔 [(1 − 𝜔) 𝐷 + 𝜔𝐸]−1 𝑅(𝑢 𝑘 ), (B.12)
B Linear Solvers 412

where 𝜔 is a scalar between one and two. Setting 𝜔 = 1 above yields


the Gauss–Seidel method. SOR in component form is
 Õ Õ 
 
(𝑘+1) (𝑘) 𝜔 𝑏 𝑖 − (𝑘+1) (𝑘) 
𝑢𝑖 = (1−𝜔)𝑢𝑖 +  𝐴 𝑖𝑗 𝑢 𝑗 − 𝐴 𝑖𝑗 𝑢 𝑗  , 𝑖 = 1, . . . , 𝑛𝑢 .
𝐴 𝑖𝑖  
 𝑗<𝑖 𝑗>𝑖

(B.13)
With the right value of 𝜔, SOR converges faster than Gauss–Seidel.

Example B.2: Iterative methods applied to a simple linear system. 𝑢2


2
Suppose we have a linear system of two equations
     1.5
2 −1 𝑢1 0
= . (B.14)
−2 3 𝑢2 1 1
𝑢 (0)

This corresponds to the two lines shown in Fig. B.3, where the solution is at 𝑢∗
0.5
their intersection.
Applying the Jacobian iteration (B.9), 0
0 1 2
(𝑘+1) 1 (𝑘) 𝑢1
𝑢1 = 𝑢2
2
1  (B.15) (a) Jacobi
(𝑘+1) (𝑘)
𝑢2 = 1 + 2𝑢1 .
3 2
𝑢2

Starting with the guess 𝑢 (0) = (2, 1), we get the iterations shown in Fig. B.3.
1.5
The Gauss–Seidel iteration (B.11) is similar, where the only change is that the
second equation use the latest state from the first one: 𝑢 (0)
1

(𝑘+1) 1 (𝑘) 𝑢∗
𝑢1 =𝑢
2 2 0.5

1  (B.16)
(𝑘+1) (𝑘+1)
𝑢2 = 1 + 2𝑢1 . 0
3 0 1 2
𝑢1
As expected, Gauss–Seidel converges faster than the Jacobi iteration, taking a
more direct path. The SOR iteration is (b) Gauss-Seidel

𝜔 (𝑘) 𝑢2
(𝑘+1) (𝑘)
𝑢1 = (1 − 𝜔)𝑢1 +𝑢 2
2 2
(𝑘+1) (𝑘) 𝜔  (𝑘)
 (B.17)
𝑢2 = (1 − 𝜔)𝑢2 + 1 + 2𝑢1 . 1.5
3
𝑢 (0)
SOR converges even faster for the right values of 𝜔. The result shown here is 1

for 𝜔 = 1.2. 𝑢∗
0.5

0
0 1 2

Krylov Subspace Methods 𝑢1

(c) SOR
Krylov subspace methods are another class of iterative methods that
include the conjugate gradient method and the generalized mini- Figure B.3: Jacobi, Gauss–Seidel,
mum residual (GMRES) method. Compared to stationary methods, and SOR iterations.
B Linear Solvers 413

Krylov methods have the advantage that they use information gathered
throughout the iterations. Instead of using a fixed splitting matrix,
Krylov methods effectively vary the splitting so that 𝑀 is changed
at each iteration according to some criteria that uses the information
gathered so far. For this reason, Krylov methods are usually more
efficient than fixed-point iterations.
Like fixed-point iteration methods, Krylov methods do not require
forming or storing 𝐴. Instead, the iterations require only matrix-vector
products of the form 𝐴𝑣, where 𝑣 is some vector given by the Krylov
algorithm. To be efficient, Krylov subspace methods require a good
preconditioner.
Test Problems
C
C.1 Unconstrained Problems

C.1.1 Slanted Quadratic Function


This is a smooth two-dimensional function suitable for a first test of a 8
𝑥2

gradient-based optimizer:
4
𝑓 (𝑥1 , 𝑥2 ) = 𝑥12 + 𝑥 22 − 𝛽𝑥 1 𝑥2 , (C.1) 𝑥∗
0

where 𝛽 ∈ [0, 2). A 𝛽 value of zero corresponds to perfectly circular


contours. As 𝛽 increases, the contours become increasingly slanted. For
−4

𝛽 = 2, the quadratic becomes semidefinite and there is a line of weak −8


minima. For 𝛽 > 2, the quadratic is indefinite and there is no minimum. −10 −5 0
𝑥1
5 10

An intermediate value of 𝛽 = 3/2 is suitable for first tests and yields the
contours shown in Fig. C.1. Figure C.1: Slanted quadratic func-
tion for 𝛽 = 3/2
Global minimum: 𝑓 (𝑥 ∗ ) = 0.0 at 𝑥 ∗ = (0, 0)

C.1.2 Rosenbrock Function


The two-dimensional Rosenbrock function, shown in Fig. C.2, is: 𝑥2

2 2
𝑓 (𝑥1 , 𝑥2 ) = (1 − 𝑥 1 )2 + 100 𝑥2 − 𝑥 12 . (C.2)
𝑥∗
1
This is a classic benchmarking function because of its narrow turning
valley. The large difference between the maximum and minimum 0
curvatures, and the fact that the principal curvature directions change
along the valley makes it a good test for quasi-Newton methods. −1
−1 0 1
The Rosenbrock function can be extended to 𝑛-dimensions by 𝑥1
defining the sum,
Figure C.2: Rosenbrock function
𝑛−1 
Õ 
2
𝑓 (𝑥) = 100 𝑥 𝑖+1 − 𝑥 𝑖 2 + (1 − 𝑥 𝑖 )2 . (C.3)
𝑖=1

Global minimum: 𝑓 (𝑥 ∗ ) = 0.0 at 𝑥 ∗ = (1, 1, . . . , 1).


Local minimum: For 𝑛 ≥ 4, a local minimum exists near 𝑥 = (−1, 1, . . . , 1).

414
C Test Problems 415

C.1.3 Bean Function


The “bean” function was developed in this book as a milder version
of the Rosenbrock function: it has the same curved valley as the
Rosenbrock function without the extreme variations in curvature. The
function, shown in Fig. C.3, is

1 2
𝑓 (𝑥1 , 𝑥2 ) = (1 − 𝑥1 )2 + (1 − 𝑥2 )2 + 2𝑥2 − 𝑥12 . (C.4)
2
Global minimum: 𝑓 (𝑥 ∗ ) = 0.09194 at 𝑥 ∗ = (1.21314, 0.82414) 𝑥2
3

C.1.4 Jones Function 2

This is a fourth-order smooth multimodal function that is useful to test 1 𝑥∗


global search algorithms and also gradient-based algorithms starting
from different points. There are saddle points, maxima, and minima, 0

with one global minimum. This function, shown in Fig. C.4 along with
−1
the local and global minima, is −2 0 2
𝑥1

𝑓 (𝑥 1 , 𝑥2 ) = 𝑥 14 + 𝑥24 − 4𝑥13 − 3𝑥23 + 2𝑥 12 + 2𝑥1 𝑥2 . (C.5) Figure C.3: Bean function

Global minimum: = −13.5320 at = (2.6732, −0.6759))


𝑓 (𝑥 ∗ ) 𝑥∗ 3
𝑥2

Local minimum: 𝑓 (𝑥) = −9.7770 at 𝑥 = (−0.4495, 2.2928)


2
𝑓 (𝑥) = −9.0312 at 𝑥 = (2.4239, 1.9219)
1

C.1.5 Hartmann Function


0

The Hartmann function is a three-dimensional smooth function with 𝑥∗


−1
multiple local minima.
−1 0 1 2 3

Õ © Õ
𝑥1

ª
4 3
𝑓 (𝑥) = − 𝛼 𝑖 exp ­− 𝐴 𝑖𝑗 (𝑥 𝑗 − 𝑃𝑖𝑗 )2 ® (C.6) Figure C.4: Jones multimodal func-
𝑖=1 « 𝑗=1 ¬ tion
𝑥3
where 1
𝛼 = [1.0, 1.2, 3.0, 3.2] , 𝑇

3 10 30
0.8 𝑥∗

0.1 10 35
0.6

𝐴 = 
10 30
,
3 0.4
0.1 10 35
 (C.7) 0.2

3689 1170 2673


 0
4699 4387 7470 0 0.5 1
𝑃 = 10−4 
8732 5547
. 𝑥2

1091
 381 5743 8828
 Figure C.5: An 𝑥2 − 𝑥3 slice of
Hartmann function at 𝑥 1 = 0.1148
C Test Problems 416

A slice of the function, at the optimal value of 𝑥1 = 0.1148, is shown


in Fig. C.5.
Global minimum: 𝑓 (𝑥 ∗ ) = −3.86278 at 𝑥 ∗ = (0.11480, 0.55566, 0.85254)

C.1.6 Aircraft Wing Design


We want to optimize the wing of a general aviation sized aircraft by
changing its wing span and chord. In general, we would add many more
design variables to a problem like this, but we are intentionally limiting
it to a simple two-dimensional problem so we can easily visualize the
results. Instead of minimizing drag, we want to minimize the required
power, thereby taking into account drag as well as propulsive efficiency
that is speed dependent.
The following section describes a basic performance estimation
methodology for a low-speed aircraft. Implementing it may not seem
like it has much to do with optimization. The physics are important
for our purposes, but practice translating equations and concepts into
code is an important element of formulating optimization problems in
general.
At level flight the aircraft must generate enough lift to equal the
required weight
𝐿 = 𝑊, (C.8)
and we will assume here that the total weight consists of a fixed aircraft
and payload weight 𝑊0 , and a component of the weight that depends
on the wing area 𝑆
𝑊 = 𝑊0 + 𝑊𝑠 𝑆. (C.9)
Our wing can produce a certain lift coefficient, 𝐶 𝐿 , and so we must
make the wing area, 𝑆, big enough to produce sufficient lift. Using the
definition of lift coefficient, the total lift can be computed as

𝐿 = 𝑞𝐶 𝐿 𝑆, (C.10)

where 𝑞 is the dynamic pressure


    q = \frac{1}{2} \rho v^2 .    (C.11)
If we use a rectangular wing, then the wing area can be computed from
the wing span, 𝑏, and the chord, 𝑐, as

𝑆 = 𝑏𝑐. (C.12)

The drag of our aircraft consists of two components: viscous drag


and induced drag. The viscous drag can be approximated as

𝐷 𝑓 = 𝑘𝐶 𝑓 𝑞𝑆wet . (C.13)

For a fully turbulent boundary layer, the skin friction coefficient, 𝐶 𝑓 ,


can be approximated as

    C_f = \frac{0.074}{Re^{0.2}} .    (C.14)
In this equation the Reynolds number is based on the wing chord,
𝑅𝑒 = 𝜌𝑣𝑐/𝜇. The form factor, 𝑘, accounts for the effects of pressure
drag. The wetted area, 𝑆wet , is the area over which the skin friction drag
acts, which is a little more than twice the planform area. We will use

𝑆wet = 2.05𝑆. (C.15)

The induced drag is defined as

    D_i = \frac{L^2}{q \pi b^2 e} ,    (C.16)

where e is the Oswald efficiency factor. Total drag is the sum of induced
and viscous drag, D = D_i + D_f.
Our objective function, the power required by the motor for level
flight, is
    P(b, c) = \frac{D v}{\eta} ,    (C.17)

where \eta is the propulsive efficiency. We assume that our electric
propellers have a Gaussian efficiency curve (real efficiency curves aren't
very Gaussian, but this is simple and will be sufficient for our purposes):

    \eta = \eta_{\max} \exp\left( \frac{-\left( v - \bar{v} \right)^2}{2 \sigma^2} \right) .    (C.18)

In this problem, the lift coefficient is provided. Therefore, to satisfy
the lift requirement (C.8), we need to compute the velocity using
Eqs. C.10 and C.11 as

    v = \sqrt{ \frac{2 L}{\rho C_L S} } .    (C.19)

The parameters for this problem are given as follows:

Parameter   Value          Unit        Description
ρ           1.2            kg/m³       density of air
μ           1.8 × 10⁻⁵     kg/(m·s)    viscosity of air
k           1.2                        form factor
C_L         0.4                        lift coefficient
e           0.80                       Oswald efficiency factor
W_0         1,000          N           fixed aircraft weight
W_s         8.0            N/m²        wing area dependent weight
η_max       0.8                        peak propulsive efficiency
v̄           20.0           m/s         flight speed at peak propulsive efficiency
σ           5.0            m/s         standard deviation of efficiency function
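As a minimal sketch (not the reference implementation from the book code
repository), the required power can be evaluated by chaining Eqs. C.8 to C.19
with the parameters above; all names here are arbitrary:

    import numpy as np

    # parameters from the table above
    rho, mu = 1.2, 1.8e-5                   # air density and viscosity
    k, CL, e = 1.2, 0.4, 0.80               # form factor, lift coefficient, Oswald factor
    W0, Ws = 1000.0, 8.0                    # fixed weight [N], area-dependent weight [N/m^2]
    eta_max, vbar, sigma = 0.8, 20.0, 5.0   # propulsive efficiency model

    def power(b, c):
        """Required power for level flight given span b [m] and chord c [m] (Eq. C.17)."""
        S = b * c                               # wing area (Eq. C.12)
        W = W0 + Ws * S                         # total weight (Eq. C.9)
        L = W                                   # level flight: lift equals weight (Eq. C.8)
        v = np.sqrt(2 * L / (rho * CL * S))     # speed satisfying the lift requirement (Eq. C.19)
        q = 0.5 * rho * v**2                    # dynamic pressure (Eq. C.11)
        Re = rho * v * c / mu                   # Reynolds number based on chord
        Cf = 0.074 / Re**0.2                    # skin friction coefficient (Eq. C.14)
        Df = k * Cf * q * 2.05 * S              # viscous drag (Eqs. C.13 and C.15)
        Di = L**2 / (q * np.pi * b**2 * e)      # induced drag (Eq. C.16)
        eta = eta_max * np.exp(-(v - vbar)**2 / (2 * sigma**2))  # efficiency (Eq. C.18)
        return (Df + Di) * v / eta

    print(power(25.48, 0.50))  # power at the reported optimal span and chord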

This is the same problem that was presented in Ex. 1.2 of Chapter 1.
Given these parameters, the optimal wing span and chord are b = 25.48 m
and c = 0.50 m, respectively. The contour and the optimal wing shape
are shown in Fig. C.6.

[Figure C.6: Wing design problem with power requirement contour]

Note that there are no structural considerations in this problem, so
the resulting aircraft has a higher-aspect-ratio wing than is realistic.
This emphasizes the importance of carefully selecting the objective and
including all relevant constraints.

C.1.7 Brachistochrone Problem

This is the classic problem proposed by Johann Bernoulli (see Section 2.2
for the historical background). Although it was originally solved
analytically, we discretize the model and solve the problem using
numerical optimization. This is a useful problem for benchmarking
because you can change the number of dimensions.
A bead is set on a wire that defines a path whose shape we can control. The
bead starts at some y-position h with zero velocity. For convenience,
we define the starting point at x = 0. From conservation of energy, we
can then find the velocity of the bead at any other location. The initial
potential energy is converted to kinetic energy, potential energy, and
dissipative work from friction acting along the path length, yielding

    m g h = \frac{1}{2} m v^2 + m g y + \int_0^x \mu_k m g \cos\theta \, \mathrm{d}s ,

    0 = \frac{1}{2} v^2 + g (y - h) + \mu_k g x ,    (C.20)

    v = \sqrt{ 2 g (h - y - \mu_k x) } .

Now that we know the speed of the bead as a function of x, we can
compute the time it takes to traverse an infinitesimal element of length
ds:

    \Delta t = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\mathrm{d}s}{v(x)}
             = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\sqrt{\mathrm{d}x^2 + \mathrm{d}y^2}}{\sqrt{2 g (h - y(x) - \mu_k x)}}
             = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\sqrt{1 + \left( \frac{\mathrm{d}y}{\mathrm{d}x} \right)^2} \, \mathrm{d}x}{\sqrt{2 g (h - y(x) - \mu_k x)}} .    (C.21)

To discretize this problem, we can divide the path into linear
segments. As an example, Fig. C.7 shows the wire divided into 4 linear
segments (5 nodes) as an approximation of a continuous wire. The
slope s_i = (\Delta y / \Delta x)_i is then constant along a given segment, and
y(x) = y_i + s_i (x - x_i). Making these substitutions results in

    \Delta t_i = \frac{\sqrt{1 + s_i^2}}{\sqrt{2 g}} \int_{x_i}^{x_{i+1}} \frac{\mathrm{d}x}{\sqrt{h - y_i - s_i (x - x_i) - \mu_k x}} .    (C.22)

[Figure C.7: A discretized representation of the brachistochrone problem]

Performing the integration and simplifying (many steps omitted here)
results in

    \Delta t_i = \sqrt{\frac{2}{g}} \, \frac{\sqrt{\Delta x_i^2 + \Delta y_i^2}}{\sqrt{h - y_{i+1} - \mu_k x_{i+1}} + \sqrt{h - y_i - \mu_k x_i}} ,    (C.23)

where \Delta x_i = (x_{i+1} - x_i) and \Delta y_i = (y_{i+1} - y_i). The objective of the
optimization is to minimize the total travel time, so we need to sum the
travel time across all of our linear segments,

    T = \sum_{i=1}^{n-1} \Delta t_i .    (C.24)

Minimization is unaffected by multiplying the objective by a constant, so we
can remove the multiplicative constant for simplicity (from this we see that
the magnitude of the acceleration of gravity has no effect on the optimal
path):

    minimize   f = \sum_{i=1}^{n-1} \frac{\sqrt{\Delta x_i^2 + \Delta y_i^2}}{\sqrt{h - y_{i+1} - \mu_k x_{i+1}} + \sqrt{h - y_i - \mu_k x_i}}    (C.25)
    by varying   y_i ,   i = 2, \ldots, n-1

The design variables are the 𝑛−2 positions of the path parameterized
by 𝑦 𝑖 . The end points must be fixed, otherwise the problem is ill-defined,
which is why there are 𝑛 − 2 design variables instead of 𝑛. Note that
𝑥 is a parameter, meaning that it is fixed. You could space the 𝑥 𝑖 any
reasonable way and still find the same underlying optimal curve, but
it is easiest to just use uniform spacing. As the dimensionality of the
problem increases, the solution becomes more challenging. We will
use the following specifications:

• starting point: (𝑥, 𝑦) = (0, 1) m

• ending point: (𝑥, 𝑦) = (1, 0) m

• kinetic coefficient of friction 𝜇 𝑘 = 0.3
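A minimal Python sketch of the objective in Eq. C.25, assuming uniform spacing
in x and the specifications above (all names are arbitrary):

    import numpy as np

    mu_k = 0.3              # kinetic coefficient of friction
    x0, y0 = 0.0, 1.0       # starting point [m]
    xn, yn = 1.0, 0.0       # ending point [m]
    h = y0                  # initial height used in Eq. C.25

    def travel_time(y_interior):
        """Scaled travel time (Eq. C.25) for the interior y-positions of the wire."""
        y = np.concatenate(([y0], y_interior, [yn]))
        x = np.linspace(x0, xn, len(y))          # uniform spacing in x
        dx, dy = np.diff(x), np.diff(y)
        root = np.sqrt(h - y - mu_k * x)         # sqrt(h - y_i - mu_k x_i) at each node
        return np.sum(np.sqrt(dx**2 + dy**2) / (root[1:] + root[:-1]))

    # example: a straight wire from (0, 1) to (1, 0) with 10 interior points
    print(travel_time(np.linspace(y0, yn, 12)[1:-1]))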

The analytic solution for the case with friction is more difficult to
derive, but the analytic solution for the frictionless case (𝜇 𝑘 = 0) with
our starting and ending points is:

    x = a (\theta - \sin\theta) ,
    y = -a (1 - \cos\theta) + 1 ,    (C.26)

where a = 0.572917 and θ ∈ [0, 2.412].
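For reference, the frictionless solution in Eq. C.26 can be sampled as follows
(a minimal sketch; the number of sample points is arbitrary):

    import numpy as np

    a = 0.572917
    theta = np.linspace(0.0, 2.412, 50)
    x = a * (theta - np.sin(theta))          # runs from 0 to 1
    y = -a * (1.0 - np.cos(theta)) + 1.0     # runs from 1 to 0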

C.2 Constrained Problems

C.2.1 Barnes Problem


The Barnes problem was devised in an M.S. thesis¹⁴⁹ and has been
used in various optimization demonstration studies. It is a nice starter
problem because it has only two dimensions for easy visualization
while also including constraints. The objective function contains the
following coefficients:

    a_1  = 75.196             a_2  = -3.8112
    a_3  = 0.12694            a_4  = -2.0567 × 10⁻³
    a_5  = 1.0345 × 10⁻⁵      a_6  = -6.8306
    a_7  = 0.030234           a_8  = -1.28134 × 10⁻³
    a_9  = 3.5256 × 10⁻⁵      a_10 = -2.266 × 10⁻⁷
    a_11 = 0.25645            a_12 = -3.4604 × 10⁻³
    a_13 = 1.3514 × 10⁻⁵      a_14 = -28.106
    a_15 = -5.2375 × 10⁻⁶     a_16 = -6.3 × 10⁻⁸
    a_17 = 7.0 × 10⁻¹⁰        a_18 = 3.4054 × 10⁻⁴
    a_19 = -1.6638 × 10⁻⁶     a_20 = -2.8673
    a_21 = 0.0005

And for convenience we define the following quantities:

    y_1 = x_1 x_2 , \quad y_2 = y_1 x_1 , \quad y_3 = x_2^2 , \quad y_4 = x_1^2 .    (C.27)

The objective function is then

    f(x_1, x_2) = a_1 + a_2 x_1 + a_3 y_4 + a_4 y_4 x_1 + a_5 y_4^2 + a_6 x_2 + a_7 y_1
                + a_8 x_1 y_1 + a_9 y_1 y_4 + a_{10} y_2 y_4 + a_{11} y_3 + a_{12} x_2 y_3 + a_{13} y_3^2
                + \frac{a_{14}}{x_2 + 1} + a_{15} y_3 y_4 + a_{16} y_1 y_4 x_2 + a_{17} y_1 y_3 y_4 + a_{18} x_1 y_3
                + a_{19} y_1 y_3 + a_{20} \exp(a_{21} y_1) .    (C.28)

There are three constraints of the form g(x) ≤ 0:

    g_1 = 1 - \frac{y_1}{700} ,

    g_2 = \frac{y_4}{25^2} - \frac{x_2}{5} ,    (C.29)

    g_3 = \frac{x_1}{500} - 0.11 - \left( \frac{x_2}{50} - 1 \right)^2 .
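A minimal Python sketch of the objective (Eq. C.28) and constraints (Eq. C.29),
using the coefficients tabulated above (array indexing and names are arbitrary):

    import numpy as np

    # coefficients a_1, ..., a_21 from the table above
    a = np.array([75.196, -3.8112, 0.12694, -2.0567e-3, 1.0345e-5, -6.8306,
                  0.030234, -1.28134e-3, 3.5256e-5, -2.266e-7, 0.25645,
                  -3.4604e-3, 1.3514e-5, -28.106, -5.2375e-6, -6.3e-8,
                  7.0e-10, 3.4054e-4, -1.6638e-6, -2.8673, 0.0005])

    def barnes(x1, x2):
        """Barnes objective (Eq. C.28) and constraints g <= 0 (Eq. C.29)."""
        y1, y3, y4 = x1 * x2, x2**2, x1**2
        y2 = y1 * x1
        f = (a[0] + a[1]*x1 + a[2]*y4 + a[3]*y4*x1 + a[4]*y4**2 + a[5]*x2
             + a[6]*y1 + a[7]*x1*y1 + a[8]*y1*y4 + a[9]*y2*y4 + a[10]*y3
             + a[11]*x2*y3 + a[12]*y3**2 + a[13]/(x2 + 1) + a[14]*y3*y4
             + a[15]*y1*y4*x2 + a[16]*y1*y3*y4 + a[17]*x1*y3 + a[18]*y1*y3
             + a[19]*np.exp(a[20]*y1))
        g = np.array([1 - y1/700,
                      y4/25**2 - x2/5,
                      x1/500 - 0.11 - (x2/50 - 1)**2])
        return f, g

    print(barnes(49.5263, 19.6228))  # evaluate near the reported global minimum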
The problem also has bound constraints. The original formulation
is bounded by [0, 80] in both dimensions, in which case the global
optimum occurs in the corner at x* = [80, 80], with a local minimum in
the middle. However, in our usage we preferred the global optimum not
to be in the corner, so we set the bounds to [0, 65] for x_1 and [0, 70] for x_2.
The contour of this function is plotted in Fig. C.8.

[Figure C.8: Barnes function]

Global minimum: f(x*) = −31.6368 at x* = (49.5263, 19.6228)

Local minimum: f(x) = −17.7067 at x = (65, 70)

C.2.2 Ten-bar Truss

The ten-bar truss is a classic optimization problem.¹⁵⁰ In this problem,
we want to find the optimal cross-sectional areas for the ten-bar truss
shown in Fig. C.9. A simple truss finite element code set up for this
particular configuration is available in the book code repository. The
function takes in an array of cross-sectional areas and returns the total
mass and an array of stresses for each truss member.

[Figure C.9: Ten-bar truss and element numbers]

The objective of the optimization is to minimize the mass of the
structure, subject to the constraint that no member yields in compression
or tension. The yield stress of all elements is 25 × 10³ psi, except for
member 9, which uses a stronger alloy with a yield stress of 75 × 10³ psi.
Mathematically, the constraint is

    |\sigma_i| \le \sigma_{y_i} \quad \text{for } i = 1, \ldots, 10 ,    (C.30)



where the absolute value is needed to handle both tension and compression
(with the same yield strength in tension and compression). Absolute
values are not differentiable at 0 and should be avoided in gradient-based
optimization if possible, so reformulate this constraint in a mathematically
equivalent form that avoids the absolute value. Each element should have
a cross-sectional area of at least 0.1 in² for manufacturing reasons (bound
constraint). In solving this optimization problem, you may need to
consider scaling the objective and constraints.
While not needed to solve the problem, an overview of the equations
is discussed below. A truss element is the simplest type of finite element
and only has an axial degree of freedom. The theory and derivation
for truss elements are simple, but for our purposes we will jump to the
result. Given a 2D element oriented arbitrarily in space (Fig. C.10), we
can relate the displacements at the nodes to the forces at the nodes
through a stiffness relationship.

[Figure C.10: A truss element oriented at some angle φ, where φ is measured from a horizontal line emanating from the first node, oriented in the +x direction]

In matrix form, the equation for a given element is f = K_e x. In detail,
the equation is

    \begin{bmatrix} X_1 \\ Y_1 \\ X_2 \\ Y_2 \end{bmatrix}
    = \frac{E A}{L}
    \begin{bmatrix}
       c^2 &  c s & -c^2 & -c s \\
       c s &  s^2 & -c s & -s^2 \\
      -c^2 & -c s &  c^2 &  c s \\
      -c s & -s^2 &  c s &  s^2
    \end{bmatrix}
    \begin{bmatrix} u_1 \\ v_1 \\ u_2 \\ v_2 \end{bmatrix} ,    (C.31)

where the displacement vector is x = [u_1, v_1, u_2, v_2]^T. The meanings
of the variables in the equation are described in Table C.1.

The stress in the truss element can be computed from the equation
\sigma = S_e x, where \sigma is a scalar, x is the same vector as before, and the
element S_e matrix (really a row vector, because stress is one-dimensional
for truss elements) is

    S_e = \frac{E}{L} \begin{bmatrix} -c & -s & c & s \end{bmatrix} .    (C.32)
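A minimal sketch of forming these element matrices (a stand-alone helper written
for illustration, not the truss code from the book repository):

    import numpy as np

    def element_matrices(E, A, L, phi):
        """Element stiffness K_e (Eq. C.31) and stress matrix S_e (Eq. C.32)
        for a 2D truss element oriented at angle phi (radians)."""
        c, s = np.cos(phi), np.sin(phi)
        Ke = (E * A / L) * np.array([[ c*c,  c*s, -c*c, -c*s],
                                     [ c*s,  s*s, -c*s, -s*s],
                                     [-c*c, -c*s,  c*c,  c*s],
                                     [-c*s, -s*s,  c*s,  s*s]])
        Se = (E / L) * np.array([-c, -s, c, s])
        return Ke, Se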
The global structure (an assembly of multiple finite elements) has the
same equations, F = Kx and σ = Sx, but now x contains displacements
for all of the nodes in the structure, x = [x_1, x_2, ..., x_n]^T. If we have
n nodes and m elements, then F ∈ ℛ^{2n}, x ∈ ℛ^{2n}, K ∈ ℛ^{2n×2n},
S ∈ ℛ^{m×2n}, and σ ∈ ℛ^{m}. The elemental stiffness and stress matrices are
first computed and then assembled into the global matrices. This is
straightforward because the displacements and forces of the individual
elements add linearly.

Table C.1: The variables used in the stiffness equation.

Variable   Description
X_i        force in the x-direction at node i
Y_i        force in the y-direction at node i
E          modulus of elasticity of truss element material
A          area of truss element cross-section
L          length of truss element
c          cos φ
s          sin φ
u_i        displacement in the x-direction at node i
v_i        displacement in the y-direction at node i
After we assemble the global matrices we must remove any degrees
of freedom that are structurally rigid (already known to have zero
displacement). Otherwise, the problem is ill-defined and the stiffness
matrix will be ill-conditioned.
Given the geometry, materials, and external loading we can populate
the stiffness matrix and force vector. We can then solve for the unknown
displacements from
𝐹 = 𝐾𝑥 (C.33)
With the solved displacements we can compute the stress in each
element using
𝜎 = 𝑆𝑥 (C.34)
Bibliography

1 He, X., Li, J., Mader, C. A., Yildirim, A., and Martins, J. R. R. A., cited on p. 19
“Robust aerodynamic shape optimization—from a circle to an
airfoil,” Aerospace Science and Technology, Vol. 87, April 2019, pp. 48–
61.
doi: 10.1016/j.ast.2019.01.051
2 Betts, J. T., “Survey of numerical methods for trajectory optimiza- cited on p. 25
tion,” Journal of guidance, control, and dynamics, Vol. 21, No. 2, 1998,
pp. 193–207.
3 Bryson, A. E. and Ho, Y. C., Applied Optimal Control; Optimization, cited on p. 25
Estimation, and Control. Blaisdell Publishing, 1969.
4 Bertsekas, D. P., Dynamic programming and optimal control. Belmont, cited on p. 25
MA: Athena Scientific, 1995.
5 Kepler, J., Nova stereometria doliorum vinariorum (New solid geometry cited on p. 32
of wine barrels). Linz: Johannes Planck, 1615.
6 Ferguson, T. S., “Who Solved the Secretary Problem?” Statistical cited on p. 32
Science, Vol. 4, No. 3, August 1989, pp. 282–289.
doi: 10.1214/ss/1177012493
7 Fermat, P. de, Methodus ad disquirendam maximam et minimam cited on p. 33
(Method for the study of maxima and minima). 1636, Translated by
Jason Ross.
8 Lagrange, J.-L., Mécanique analytique. Paris, France, 1788, Vol. 1. cited on p. 34

9 Cauchy, A.-L., “Méthode générale pour la résolution des systèmes cited on p. 34


d’équations simultanées,” Comptes rendus hebdomadaires des séances
de l’Académie des sciences, Vol. 25, October 1847, pp. 536–538.
10 Hancock, H., Theory of Minima and Maxima. Boston, MA: Ginn and cited on p. 34
Company, 1917.
11 Menger, K., “Das botenproblem,” Ergebnisse eines Mathematischen cited on p. 34
Kolloquiums, Leipzig: Teubner, 1932, pp. 11–12.
12 Karush, W., “Minima of Functions of Several Variables with Inequal- cited on p. 35
ities as Side Constraints,” Master’s thesis, University of Chicago,
Chicago, IL, 1939.


13 Krige, D. G., “A statistical approach to some mine valuation and cited on p. 35


allied problems on the Witwatersrand,” Master’s thesis, University
of the Witwatersrand, Johannesburg, South Africa, 1951.
14 Markowitz, H., “Portfolio selection,” Journal of Finance, Vol. 7, March cited on p. 35
1952, pp. 77–91.
15 Davidon, W. C., “Variable Metric Method for Minimization,” SIAM cited on pp. 36, 103
Journal on Optimization, Vol. 1, No. 1, February 1991, pp. 1–17, issn:
1095-7189.
doi: 10.1137/0801001
16 Fletcher, R. and Powell, M. J. D., “A Rapidly Convergent Descent cited on pp. 36, 103
Method for Minimization,” The Computer Journal, Vol. 6, No. 2,
August 1963, pp. 163–168, issn: 1460-2067.
doi: 10.1093/comjnl/6.2.163
17 Wolfe, P., “Convergence Conditions for Ascent Methods,” SIAM cited on p. 36
Review, Vol. 11, No. 2, 1969, pp. 226–235.
doi: 10.1137/1011036
18 Wilson, R. B., “A simplicial algorithm for concave programming,” cited on p. 36
Ph.D. Dissertation, Harvard University, Cambridge, MA, June
1963.
19 Han, S.-P., “Superlinearly convergent variable metric algorithms cited on p. 36
for general nonlinear programming problems,” Mathematical Pro-
gramming, Vol. 11, No. 1, 1976, pp. 263–282.
doi: 10.1007/BF01580395
20 Powell, M. J. D., “Algorithms for nonlinear constraints that use cited on pp. 36, 154
Lagrangian functions,” Mathematical Programming, Vol. 14, No. 1,
December 1978, pp. 224–248.
doi: 10.1007/bf01588967
21 Holland, J. H., Adaptation in Natural and Artificial Systems. Ann cited on p. 36
Arbor, MI: University of Michigan Press, 1975.
22 Hooke, R. and Jeeves, T. A., ““Direct Search” Solution of Numerical cited on p. 37
and Statistical Problems,” Journal of the ACM, Vol. 8, No. 2, 1961,
pp. 212–229.
23 Nelder, J. A. and Mead, R., “A Simplex Method for Function cited on pp. 37, 218
Minimization,” Computer Journal, Vol. 7, 1965, pp. 308–313.
24 Karmarkar, N., “A New Polynomial-Time Algorithm for Linear cited on p. 37
Programming,” Proceedings of the Sixteenth Annual ACM Symposium
on Theory of Computing, ser. STOC ’84, New York, NY, USA: Associa-
tion for Computing Machinery, 1984, pp. 302–311, isbn: 0897911334
doi: 10.1145/800057.808695

25 Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., and cited on p. 37


Mishchenko, E. F., The Mathematical Theory of Optimal Processes.
Moscow, 1961. Translated by K. N. Trirogoff, edited by L. W. Neustadt.
26 Bryson Jr, A. E., “Optimal Control—1950 to 1985,” IEEE Control cited on p. 37
Systems Magazine, Vol. 16, No. 3, June 1996, pp. 26–33.
doi: 10.1109/37.506395
27 Schmit, L. A., “Structural Design by Systematic Synthesis,” Proceed- cited on pp. 37, 170
ings of the 2nd National Conference on Electronic Computation, ASCE,
New York, NY, September 1960, pp. 105–132.
28 Schmit, L. A. and Thornton, W. A., “Synthesis of an Airfoil at cited on p. 38
Supersonic Mach Number,” CR 144, NASA, January 1965.
29 Fox, R. L., “Constraint Surface Normals for Structural Synthesis cited on p. 38
Techniques,” AIAA Journal, Vol. 3, No. 8, August 1965, pp. 1517–
1518.
doi: 10.2514/3.3182
30 Arora, J. and Haug, E. J., “Methods of Design Sensitivity Analysis cited on p. 38
in Structural Optimization,” AIAA Journal, Vol. 17, No. 9, 1979,
pp. 970–974.
doi: 10.2514/3.61260
31 Haftka, R. T. and Grandhi, R. V., “Structural shape optimization—A cited on p. 38
survey,” Computer Methods in Applied Mechanics and Engineering,
Vol. 57, No. 1, 1986, pp. 91–106, issn: 0045-7825.
doi: 10.1016/0045-7825(86)90072-1
32 Eschenauer, H. A. and Olhoff, N., “Topology optimization of cited on p. 38
continuum structures: A review,” Applied Mechanics Reviews, Vol.
54, No. 4, July 2001, pp. 331–390.
doi: 10.1115/1.1388075
33 Pironneau, O., “On optimum design in fluid mechanics,” Journal of cited on p. 38
Fluid Mechanics, Vol. 64, No. 01, 1974, p. 97, issn: 0022-1120.
doi: 10.1017/S0022112074002023
34 Jameson, A., “Aerodynamic Design via Control Theory,” Journal of cited on p. 38
Scientific Computing, Vol. 3, No. 3, September 1988, pp. 233–260.
doi: 10.1007/BF01061285
35 Sobieszczanski–Sobieski, J. and Haftka, R. T., “Multidisciplinary cited on p. 38
Aerospace Design Optimization: Survey of Recent Developments,”
Structural Optimization, Vol. 14, No. 1, 1997, pp. 1–23.
doi: 10.1007/BF011

36 Martins, J. R. R. A. and Lambe, A. B., “Multidisciplinary Design cited on pp. 38, 376, 396
Optimization: A Survey of Architectures,” AIAA Journal, Vol. 51,
No. 9, September 2013, pp. 2049–2075.
doi: 10.2514/1.J051895
37 Sobieszczanski–Sobieski, J., “Sensitivity of Complex, Internally cited on p. 38
Coupled Systems,” AIAA Journal, Vol. 28, No. 1, 1990, pp. 153–160.
doi: 10.2514/3.10366
38 Martins, J. R. R. A., Alonso, J. J., and Reuther, J. J., “A Coupled- cited on p. 38
Adjoint Sensitivity Analysis Method for High-Fidelity Aero-Structural
Design,” Optimization and Engineering, Vol. 6, No. 1, March 2005,
pp. 33–62.
doi: 10.1023/B:OPTE.0000048536.47956.62
39 Hwang, J. T. and Martins, J. R. R. A., “A computational architecture cited on pp. 38, 208, 381
for coupling heterogeneous numerical models and computing
coupled derivatives,” ACM Transactions on Mathematical Software,
Vol. 44, No. 4, June 2018, Article 37.
doi: 10.1145/3182393
40 Wright, M. H., “The interior-point revolution in optimization: cited on p. 39
History, recent developments, and lasting consequences,” Bulletin
of the American Mathematical Society, Vol. 42, 2005, pp. 39–56.
41 Grant, M., Boyd, S., and Ye, Y., “Global optimization—from theory cited on p. 39
to implementation,” Liberti, L. and Maculan, N., Eds. Springer,
2006, ch. Disciplined Convex Programming, pp. 155–210.
42 Wengert, R. E., “A Simple Automatic Derivative Evaluation Pro- cited on p. 39
gram,” Commun. ACM, Vol. 7, No. 8, August 1964, pp. 463–464,
issn: 0001-0782.
doi: 10.1145/355586.364791
43 Speelpenning, B., “Compiling fast partial derivatives of functions cited on p. 39
given by algorithms,” Ph.D. Dissertation, University of Illinois at
Urbana–Champaign, January 1980.
doi: 10.2172/5254402
44 Squire, W. and Trapp, G., “Using Complex Variables to Estimate cited on p. 39
Derivatives of Real Functions,” SIAM Review, Vol. 40, No. 1, 1998,
pp. 110–112, issn: 0036-1445 (print), 1095-7200 (electronic).
45 Martins, J. R. R. A., Sturdza, P., and Alonso, J. J., “The Complex- cited on pp. 39, 186
Step Derivative Approximation,” ACM Transactions on Mathematical
Software, Vol. 29, No. 3, 2003, pp. 245–262, September.
doi: 10.1145/838250.838251
46 Torczon, V., “On the Convergence of Pattern Search Algorithms,” cited on p. 40
SIAM Journal on Optimization, Vol. 7, No. 1, February 1997, pp. 1–25.

47 Jones, D., Perttunen, C., and Stuckman, B., “Lipschitzian optimiza- cited on pp. 40, 222, 223
tion without the Lipschitz constant,” Journal of Optimization Theory
and Application, Vol. 79, No. 1, October 1993, pp. 157–181.
48 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization by cited on p. 40
Simulated Annealing,” Science, Vol. 220, No. 4598, 1983, pp. 671–
680.
doi: 10.1126/science.220.4598.671
49 Kennedy, J. and Eberhart, R. C., “Particle Swarm Optimization,” cited on p. 40
IEEE International Conference on Neural Networks, Vol. IV, Piscataway,
NJ, 1995, pp. 1942–1948.
50 Forrester, A. I. and Keane, A. J., “Recent advances in surrogate- cited on p. 40
based optimization,” Progress in Aerospace Sciences, Vol. 45, No. 1,
2009, pp. 50–79, issn: 0376-0421.
doi: 10.1016/j.paerosci.2008.11.001
51 Bottou, L., Curtis, F. E., and Nocedal, J., “Optimization Methods cited on p. 41
for Large-Scale Machine Learning,” SIAM Review, Vol. 60, No. 2,
2018, pp. 223–311.
doi: 10.1137/16M1080173
52 Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M., cited on p. 41
“Automatic Differentiation in Machine Learning: A Survey,” Journal
of Machine Learning Research, Vol. 18, No. 1, January 2018, pp. 5595–
5637.
53 Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., cited on p. 53
Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley,
M. D., Waugh, B., White, E. P., and Wilson, P., “Best Practices for
Scientific Computing,” PLoS Biology, Vol. 12, No. 1, 2014, e1001745.
doi: 10.1371/journal.pbio.1001745
54 Grotker, T., Holtmann, U., Keding, H., and Wloka, M., The Devel- cited on p. 54
oper’s Guide to Debugging, 2nd. 2012.
55 Ascher, U. M. and Greif, C., A first course in numerical methods. SIAM, cited on p. 58
2011.
56 Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed. SIAM, 2003. cited on p. 59
57 Hager, W. W. and Zhang, H., “A New Conjugate Gradient Method cited on p. 88
with Guaranteed Descent and an Efficient Line Search,” SIAM
Journal on Optimization, Vol. 16, No. 1, January 2005, pp. 170–192,
issn: 1095-7189.
doi: 10.1137/030601880
58 Nocedal, J. and Wright, S. J., Numerical Optimization, 2nd. Springer- cited on pp. 92, 113, 114, 152, 159
Verlag, 2006.

59 Broyden, C. G., “The Convergence of a Class of Double-rank cited on p. 103


Minimization Algorithms 1. General Considerations,” IMA Journal
of Applied Mathematics, Vol. 6, No. 1, 1970, pp. 76–90, issn: 1464-3634.
doi: 10.1093/imamat/6.1.76
60 Fletcher, R., “A new approach to variable metric algorithms,” The cited on p. 103
Computer Journal, Vol. 13, No. 3, March 1970, pp. 317–322, issn:
1460-2067.
doi: 10.1093/comjnl/13.3.317
61 Goldfarb, D., “A family of variable-metric methods derived by cited on p. 103
variational means,” Mathematics of Computation, Vol. 24, No. 109,
January 1970, pp. 23–23, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0258249-6
62 Shanno, D. F., “Conditioning of quasi-Newton methods for func- cited on p. 103
tion minimization,” Mathematics of Computation, Vol. 24, No. 111,
September 1970, pp. 647–647, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0274029-x
63 Fletcher, R., Practical Methods of Optimization, 2nd. Wiley, 1987. cited on pp. 104, 154

64 Conn, A. R., Gould, N. I. M., and Toint, P. L., Trust Region Methods. cited on pp. 112, 113, 114
SIAM, January 2000.
isbn: 0898714605
65 Steihaug, T., “The Conjugate Gradient Method and Trust Regions cited on p. 113
in Large Scale Optimization,” SIAM Journal on Numerical Analysis,
Vol. 20, No. 3, June 1983, pp. 626–637, issn: 1095-7170.
doi: 10.1137/0720042
66 Murray, W., “Analytical expressions for the eigenvalues and eigen- cited on p. 147
vectors of the Hessian matrices of barrier and penalty functions,”
Journal of Optimization Theory and Applications, Vol. 7, No. 3, March
1971, pp. 189–196.
doi: 10.1007/bf00932477
67 Forsgren, A., Gill, P. E., and Wright, M. H., “Interior Methods for cited on p. 147
Nonlinear Optimization,” SIAM Review, Vol. 44, No. 4, January
2002, pp. 525–597.
doi: 10.1137/s0036144502414942
68 Gill, P. E., Murray, W., and Saunders, M. A., “SNOPT: An SQP Al- cited on p. 152
gorithm for Large-Scale Constrained Optimization,” SIAM Review,
Vol. 47, No. 1, 2005, pp. 99–131.
doi: 10.1137/S0036144504446096

69 Liu, D. C. and Nocedal, J., “On the limited memory BFGS method cited on p. 154
for large scale optimization,” Mathematical Programming, Vol. 45,
No. 1-3, August 1989, pp. 503–528.
doi: 10.1007/bf01589116
70 Wächter, A. and Biegler, L. T., “On the implementation of an cited on p. 158
interior-point filter line-search algorithm for large-scale nonlinear
programming,” Mathematical Programming, Vol. 106, No. 1, April
2005, pp. 25–57.
doi: 10.1007/s10107-004-0559-y
71 Byrd, R. H., Hribar, M. E., and Nocedal, J., “An Interior Point cited on p. 158
Algorithm for Large-Scale Nonlinear Programming,” SIAM Journal
on Optimization, Vol. 9, No. 4, January 1999, pp. 877–900.
doi: 10.1137/s1052623497325107
72 Fletcher, R. and Leyffer, S., “Nonlinear programming without a cited on p. 161
penalty function,” Mathematical Programming, Vol. 91, No. 2, January
2002, pp. 239–269.
doi: 10.1007/s101070100244
73 Fletcher, R., Leyffer, S., and Toint, P., “A Brief History of Filter cited on p. 161
Methods,” ANL/MCS-P1372-0906, Argonne National Laboratory,
September 2006.
74 Benson, H. Y., Vanderbei, R. J., and Shanno, D. F., “Interior-Point cited on p. 161
Methods for Nonconvex Nonlinear Programming: Filter Methods
and Merit Functions,” Computational Optimization and Applications,
Vol. 23, No. 2, 2002, pp. 257–272.
doi: 10.1023/a:1020533003783
75 Kreisselmeier, G. and Steinhauser, R., “Systematic Control Design cited on p. 164
by Optimizing a Vector Performance Index,” IFAC Proceedings
Volumes, Vol. 12, No. 7, September 1979, pp. 113–117, issn: 1474-
6670.
doi: 10.1016/s1474-6670(17)65584-8
76 Hoerner, S. F., Fluid-Dynamic Drag. 1965. cited on p. 169

77 Lyness, J. N., “Numerical Algorithms Based on the Theory of Com- cited on p. 182
plex Variable,” Proceedings — ACM National Meeting, Washington
DC: Thompson Book Co., 1967, pp. 125–133.
78 Lyness, J. N. and Moler, C. B., “Numerical Differentiation of Ana- cited on p. 182
lytic Functions,” SIAM Journal on Numerical Analysis, Vol. 4, No. 2,
1967, pp. 202–210, issn: 0036-1429 (print), 1095-7170 (electronic).

79 Lantoine, G., Russell, R. P., and Dargent, T., “Using Multicomplex cited on p. 183
Variables for Automatic Computation of High-Order Derivatives,”
ACM Transactions on Mathematical Software, Vol. 38, No. 3, April
2012, pp. 1–21, issn: 0098-3500.
doi: 10.1145/2168773.2168774
80 Fike, J. and Alonso, J., “The Development of Hyper-Dual Numbers cited on p. 183
for Exact Second-Derivative Calculations,” 49th AIAA Aerospace
Sciences Meeting including the New Horizons Forum and Aerospace
Exposition, January 2011
doi: 10.2514/6.2011-886
81 Griewank, A., Evaluating Derivatives. Philadelphia: SIAM, 2000. cited on p. 187

82 Naumann, U., “The art of differentiating computer programs—an cited on p. 187


introduction to algorithmic differentiation.” SIAM, 2011.
83 Hascoët, L. and Pascual, V., “TAPENADE 2.1 User’s Guide,” Tech- cited on p. 194
nical report 300, INRIA, 2004.
84 Griewank, A., Juedes, D., and Utke, J., “Algorithm 755: ADOL-C: cited on p. 194
A Package for the Automatic Differentiation of Algorithms Written
in C/C++,” ACM Transactions on Mathematical Software, Vol. 22, No.
2, June 1996, pp. 131–167, issn: 0098-3500.
doi: 10.1145/229473.229474
85 Wiltschko, A. B., Merriënboer, B. van, and Moldovan, D., “Tangent: cited on p. 194
Automatic differentiation using source code transformation in
Python,” 2017.
86 Revels, J., Lubin, M., and Papamarkou, T., “Forward-Mode Auto- cited on p. 194
matic Differentiation in Julia,” arXiv:1607.07892, July 2016.
87 Neidinger, R. D., “Introduction to Automatic Differentiation and cited on p. 194
MATLAB Object-Oriented Programming,” SIAM Review, Vol. 52,
No. 3, January 2010, pp. 545–563.
doi: 10.1137/080743627
88 Betancourt, M., “A geometric theory of higher-order automatic cited on p. 194
differentiation,” arXiv:1812.11592 [stat.CO], December 2018.
89 Curtis, A. R., Powell, M. J. D., and Reid, J. K., “On the Estimation cited on p. 203
of Sparse Jacobian Matrices,” IMA Journal of Applied Mathematics,
Vol. 13, No. 1, February 1974, pp. 117–119, issn: 1464-3634.
doi: 10.1093/imamat/13.1.117
90 Gebremedhin, A. H., Manne, F., and Pothen, A., “What Color Is cited on p. 204
Your Jacobian? Graph Coloring for Computing Derivatives,” SIAM
Review, Vol. 47, No. 4, January 2005, pp. 629–705, issn: 1095-7200.
doi: 10.1137/s0036144504444711

91 Gray, J. S., Hwang, J. T., Martins, J. R. R. A., Moore, K. T., and cited on pp. 204, 208, 383
Naylor, B. A., “OpenMDAO: An open-source framework for mul-
tidisciplinary design, analysis, and optimization,” Structural and
Multidisciplinary Optimization, Vol. 59, No. 4, April 2019, pp. 1075–
1104.
doi: 10.1007/s00158-019-02211-z
92 Ning, A., “Using Blade Element Momentum Methods with Gradient- cited on p. 205
Based Design Optimization,” 2020 (in review).
93 Martins, J. R. R. A. and Hwang, J. T., “Review and Unification of cited on p. 205
Methods for Computing Derivatives of Multidisciplinary Compu-
tational Models,” AIAA Journal, Vol. 51, No. 11, November 2013,
pp. 2582–2599.
doi: 10.2514/1.J052184
94 Yu, Y., Lyu, Z., Xu, Z., and Martins, J. R. R. A., “On the Influence cited on p. 214
of Optimization Algorithm and Starting Design on Wing Aerody-
namic Shape Optimization,” Aerospace Science and Technology, Vol.
75, April 2018, pp. 183–199.
doi: 10.1016/j.ast.2018.01.016
95 Rios, L. M. and Sahinidis, N. V., “Derivative-free optimization: A cited on pp. 214, 215
review of algorithms and comparison of software implementations,”
Journal of Global Optimization, Vol. 56, 2013, pp. 1247–1293.
doi: 10.1007/s10898-012-9951-y
96 Conn, A. R., Scheinberg, K., and Vicente, L. N., Introduction to cited on p. 216
Derivative-Free Optimization. SIAM, 2009.
97 Audet, C. and Hare, W., Derivative-Free and Blackbox Optimization. cited on p. 216
Springer, 2017.
doi: 10.1007/978-3-319-68913-5
98 Le Digabel, S., “Algorithm 909: NOMAD: Nonlinear Optimization cited on p. 216
with the MADS algorithm,” ACM Transactions on Mathematical
Software, Vol. 37, No. 4, 2011, pp. 1–15.
99 Jones, D. R., “Direct global optimization algorithm,” Encyclopedia cited on pp. 216, 228
of Optimization, Floudas, C. A. and Pardalos, P. M., Eds. Boston,
MA: Springer US, 2009, pp. 725–735, isbn: 978-0-387-74759-0
. doi: 10.1007/978-0-387-74759-0_128
100 Simon, D., Evolutionary Optimization Algorithms. John Wiley & Sons, cited on pp. 217, 237
June 2013.
isbn: 1118659503
101 Barricelli, N., “Esempi numerici di processi di evoluzione,” Metho- cited on p. 231
dos, 1954, pp. 45–68.

102 Jong, K. A. D., “An analysis of the behavior of a class of genetic cited on p. 231
adaptive systems,” Ph.D. Dissertation, University of Michigan,
Ann Arbor, MI, 1975.
103 Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., “A fast and cited on pp. 233, 286
elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions
on Evolutionary Computation, Vol. 6, No. 2, April 2002, pp. 182–197.
doi: 10.1109/4235.996017
104 Deb, K., Multi-Objective Optimization Using Evolutionary Algorithms. cited on p. 239
John Wiley & Sons, July 2001.
isbn: 047187339X
105 Eberhart, R. and Kennedy, J. A., “New Optimizer Using Particle cited on p. 241
Swarm Theory,” Sixth International Symposium on Micro Machine
and Human Science, Nagoya, Japan, 1995, pp. 39–43.
106 Gutin, G., Yeo, A., and Zverovich, A., “Traveling salesman should cited on p. 261
not be greedy: Domination analysis of greedy-type heuristics for
the TSP,” Discrete Applied Mathematics, Vol. 117, No. 1-3, March
2002, pp. 81–86, issn: 0166-218X.
doi: 10.1016/s0166-218x(01)00195-0
107 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization cited on p. 270
by Simulated Annealing,” Science, Vol. 220, No. 4598, May 1983,
pp. 671–680, issn: 1095-9203.
doi: 10.1126/science.220.4598.671
108 Černý, V., “Thermodynamical approach to the traveling salesman cited on p. 270
problem: An efficient simulation algorithm,” Journal of Optimization
Theory and Applications, Vol. 45, No. 1, January 1985, pp. 41–51, issn:
1573-2878.
doi: 10.1007/bf00940812
109 Andresen, B. and Gordon, J. M., “Constant thermodynamic speed cited on p. 271
for minimizing entropy production in thermodynamic processes
and simulated annealing,” Physical Review E, Vol. 50, No. 6, Decem-
ber 1994, pp. 4346–4351, issn: 1095-3787.
doi: 10.1103/physreve.50.4346
110 Lin, S., “Computer Solutions of the Traveling Salesman Prob- cited on p. 272
lem,” Bell System Technical Journal, Vol. 44, No. 10, December 1965,
pp. 2245–2269, issn: 0005-8580.
doi: 10.1002/j.1538-7305.1965.tb04146.x

111 Press, W. H., Wevers, J., Flannery, B. P., Teukolsky, S. A., Vetterling, cited on p. 273
W. T., Flannery, B. P., and Vetterling, W. T., Numerical Recipes in
C: The Art of Scientific Computing. Cambridge University Press,
October 1992.
isbn: 0521431085
112 Haimes, Y. Y., Lasdon, L. S., and Wismer, D. A., “On a Bicriterion cited on p. 283
Formulation of the Problems of Integrated System Identification
and System Optimization,” IEEE Transactions on Systems, Man, and
Cybernetics, Vol. SMC-1, No. 3, July 1971, pp. 296–297.
doi: 10.1109/tsmc.1971.4308298
113 Das, I. and Dennis, J. E., “Normal-Boundary Intersection: A New cited on p. 283
Method for Generating the Pareto Surface in Nonlinear Multicrite-
ria Optimization Problems,” SIAM Journal on Optimization, Vol. 8,
No. 3, August 1998, pp. 631–657.
doi: 10.1137/s1052623496307510
114 Ismail-Yahaya, A. and Messac, A., “Effective generation of the Pareto cited on p. 286
frontier using the Normal Constraint method,” 40th AIAA Aerospace
Sciences Meeting & Exhibit, American Institute of Aeronautics and
Astronautics, January 2002.
doi: 10.2514/6.2002-178
115 Messac, A. and Mattson, C. A., “Normal Constraint Method with cited on p. 286
Guarantee of Even Representation of Complete Pareto Frontier,”
AIAA Journal, Vol. 42, No. 10, October 2004, pp. 2101–2111.
doi: 10.2514/1.8977
116 Hancock, B. J. and Mattson, C. A., “The smart normal constraint cited on p. 286
method for directly generating a smart Pareto set,” Structural and
Multidisciplinary Optimization, Vol. 48, No. 4, June 2013, pp. 763–775.
doi: 10.1007/s00158-013-0925-6
117 Schaffer, J. D., “Some Experiments in Machine Learning Using Vec- cited on p. 286
tor Evaluated Genetic Algorithms.” Ph.D. Dissertation, Vanderbilt
University, Nashville, TN, 1984.
118 Deb, K., Introduction to evolutionary multiobjective optimization, Mul- cited on p. 286
tiobjective Optimization, Springer Berlin Heidelberg, 2008, pp. 59–96.
doi: 10.1007/978-3-540-88908-3_3
119 Kung, H. T., Luccio, F., and Preparata, F. P., “On Finding the Maxima cited on p. 287
of a Set of Vectors,” Journal of the ACM, Vol. 22, No. 4, October 1975,
pp. 469–476.
doi: 10.1145/321906.321910

120 Forrester, A., Sobester, A., and Keane, A., Engineering Design via cited on p. 296
Surrogate Modelling: A Practical Guide. John Wiley & Sons, September
2008.
isbn: 0470770791
121 Rajnarayan, D., Haas, A., and Kroo, I., “A Multifidelity Gradient- cited on p. 306
Free Optimization Method and Application to Aerodynamic De-
sign,” 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization
Conference, September 2008
doi: 10.2514/6.2008-6020
122 Ruder, S., “An overview of gradient descent optimization algo- cited on p. 313
rithms,” arXiv:1609.04747, 2016.
123 Goh, G., “Why Momentum Really Works,” Distill, 2017. cited on p. 313
doi: 10.23915/distill.00006
124 Diamond, S. and Boyd, S., “Convex Optimization with Abstract cited on p. 317
Linear Operators,” 2015 IEEE International Conference on Computer
Vision (ICCV), IEEE, December 2015.
doi: 10.1109/iccv.2015.84
125 Boyd, S. P. and Vandenberghe, L., Convex Optimization. Cambridge cited on p. 319
University Press, March 2004.
isbn: 0521833787
126 Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H., “Applica- cited on p. 319
tions of second-order cone programming,” Linear Algebra and its
Applications, Vol. 284, No. 1-3, November 1998, pp. 193–228.
doi: 10.1016/s0024-3795(98)10032-0
127 Vandenberghe, L. and Boyd, S., “Applications of semidefinite cited on p. 319
programming,” Applied Numerical Mathematics, Vol. 29, No. 3, March
1999, pp. 283–299.
doi: 10.1016/s0168-9274(98)00098-1
128 Vandenberghe, L. and Boyd, S., “Semidefinite Programming,” cited on p. 319
SIAM Review, Vol. 38, No. 1, March 1996, pp. 49–95.
doi: 10.1137/1038003
129 Parikh, N. and Boyd, S., “Block splitting for distributed optimiza- cited on p. 319
tion,” Mathematical Programming Computation, Vol. 6, No. 1, October
2013, pp. 77–102.
doi: 10.1007/s12532-013-0061-8
130 Grant, M., Boyd, S., and Ye, Y., Disciplined convex programming, cited on p. 325
Global Optimization, Kluwer Academic Publishers, 2006, pp. 155–
210.
doi: 10.1007/0-387-30528-9_7

131 Hoburg, W. and Abbeel, P., “Geometric Programming for Aircraft cited on p. 328
Design Optimization,” AIAA Journal, Vol. 52, No. 11, November
2014, pp. 2414–2426.
doi: 10.2514/1.j052732
132 Hoburg, W., Kirschen, P., and Abbeel, P., “Data fitting with geometric- cited on p. 328
programming-compatible softmax functions,” Optimization and
Engineering, Vol. 17, No. 4, August 2016, pp. 897–918.
doi: 10.1007/s11081-016-9332-3
133 Kirschen, P. G., York, M. A., Ozturk, B., and Hoburg, W. W., “Ap- cited on p. 329
plication of Signomial Programming to Aircraft Design,” Journal of
Aircraft, Vol. 55, No. 3, May 2018, pp. 965–987.
doi: 10.2514/1.c034378
134 York, M. A., Hoburg, W. W., and Drela, M., “Turbofan Engine cited on p. 329
Sizing and Tradeoff Analysis via Signomial Programming,” Journal
of Aircraft, Vol. 55, No. 3, May 2018, pp. 988–1003.
doi: 10.2514/1.c034463
135 Smolyak, S. A., “Quadrature and interpolation formulas for tensor cited on p. 345
products of certain classes of functions,” Dokl. Akad. Nauk SSSR,
Vol. 148, No. 5, 1963, pp. 1042–1045.
136 Smith, R. C., Uncertainty Quantification: Theory, Implementation, and cited on p. 348
Applications. SIAM, December 2013.
isbn: 1611973228
137 Cacuci, D., Sensitivity & Uncertainty Analysis, Volume 1. Chapman cited on p. 348
and Hall/CRC, May 2003.
doi: 10.1201/9780203498798
138 Parkinson, A., Sorensen, C., and Pourhassan, N., “A General Ap- cited on p. 349
proach for Robust Optimal Design,” Journal of Mechanical Design,
Vol. 115, No. 1, 1993, p. 74.
doi: 10.1115/1.2919328
139 Wiener, N., “The Homogeneous Chaos,” American Journal of Mathe- cited on p. 350
matics, Vol. 60, No. 4, October 1938, p. 897.
doi: 10.2307/2371268
140 Eldred, M., Webster, C., and Constantine, P., “Evaluation of Non- cited on p. 353
Intrusive Approaches for Wiener-Askey Generalized Polynomial
Chaos,” 49th AIAA Structures, Structural Dynamics, and Materials
Conference, American Institute of Aeronautics and Astronautics,
April 2008.
doi: 10.2514/6.2008-1892

141 Adams, B., Bauman, L., Bohnhoff, W., Dalbey, K., Ebeida, M.,
Eddy, J., Eldred, M., Hough, P., Hu, K., Jakeman, J., Stephens, J.,
Swiler, L., Vigil, D., and Wildey, T., “Dakota, A Multilevel Parallel cited on p. 355
Object-Oriented Framework for Design Optimization, Parameter
Estimation, Uncertainty Quantification, and Sensitivity Analysis:
Version 6.0 User’s Manual,” Sandia Technical Report SAND2014-
4633, Sandia National Laboratories, November 2015.
142 Feinberg, J. and Langtangen, H. P., “Chaospy: An open source tool cited on p. 355
for designing methods of uncertainty quantification,” Journal of
Computational Science, Vol. 11, November 2015, pp. 46–57.
doi: 10.1016/j.jocs.2015.08.008
143 Kroo, I. M., MDO for large-scale design, Multidisciplinary Design cited on p. 360
Optimization: State-of-the-Art, Alexandrov, N. and Hussaini, M. Y.,
Eds., SIAM, 1997, pp. 22–44.
144 Biegler, L. T., Ghattas, O., Heinkenschloss, M., and Bloemen cited on p. 380
Waanders, B. van, Eds., Large-Scale PDE-Constrained Optimization.
Springer–Verlag, 2003.
145 Braun, R. D., “Collaborative Optimization: An Architecture for cited on p. 386
Large-Scale Distributed Design,” Ph.D. Dissertation, Stanford Uni-
versity, Stanford, CA 94305, 1996.
146 Tedford, N. P. and Martins, J. R. R. A., “Benchmarking Multi- cited on p. 395
disciplinary Design Optimization Algorithms,” Optimization and
Engineering, Vol. 11, No. 1, February 2010, pp. 159–183.
doi: 10.1007/s11081-009-9082-6
147 Golovidov, O., Kodiyalam, S., Marineau, P., Wang, L., and Rohl, cited on p. 396
P., “Flexible implementation of approximation concepts in an
MDO framework,” 7th AIAA/USAF/NASA/ISSMO Symposium on
Multidisciplinary Analysis and Optimization, American Institute of
Aeronautics and Astronautics, 1998.
doi: 10.2514/6.1998-4959
148 Balabanov, V., Charpentier, C., Ghosh, D. K., Quinn, G., Van- cited on p. 396
derplaats, G., and Venter, G., “VisualDOC: A Software System
for General Purpose Integration and Design Optimization,” 9th
AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimiza-
tion, Atlanta, GA, 2002.
149 Barnes, G. K., “A Comparative Study of Nonlinear Optimization cited on p. 420
Codes,” Master’s thesis, The University of Texas at Austin, 1967.
150 Venkayya, V., “Design of optimum structures,” Computers & Struc- cited on p. 421
tures, Vol. 1, No. 1-2, August 1971, pp. 265–309, issn: 0045-7949.
doi: 10.1016/0045-7949(71)90013-7