Notes On Adjoint Methods MIT

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6
At a glance
Powered by AI
The notes discuss adjoint methods, which provide an efficient way to compute the gradient dg/dp of some function g with respect to parameters p. This gradient is useful for sensitivity analysis, optimization, and inverse problems.

The gradient dg/dp gives a measure of the sensitivity of the solution to the parameters p. It also indicates a useful search direction for optimization of g.

When the number P of parameters is large, adjoint methods allow the gradient dg/dp to be computed with cost independent of P and usually comparable to solving for x once. This makes large-scale inverse problems and optimization possible.

Notes on Adjoint Methods for 18.

335
Steven G. Johnson
Created Spring 2006, updated December 17, 2012.

Introduction

can be found in [2]. Subsequently, Gil Strang wrote a


nice introduction to adjoint methods in his book [3],
Given the solution x of a discretized PDE or some including a discussion of the important topic of autoother set of M equations parameterized by P vari- matic differentiation (for which adjoint or reverse
ables p (design parameters, a.k.a. control variables differentiation is a key idea).
or decision parameters), we often wish to compute
some function g(x, p) based on the parameters and
the solution. For example, if the PDE is a wave equa- 2 Linear equations
tion, we might want to know the scattered power in
some direction. Or, for a mechanical simulation, we Suppose that the column-vector x solves the M M
might want to know the load-bearing capacity of the linear equation Ax = b where we take b and A to be
1
structure. Or for a fluid, we might wish to know the real and to depend in some way on p. To evaluate
flow rate somewhere. Often, however, we want to the gradient directly, we would do
know more than just the value of gwe also want to
dg
dg
. Adjoint methods give an effiknow its gradient dp
= gp + gx xp
dp
dg
cient way to evaluate dp , with a cost independent of
P and usually comparable to the cost of solving for x where the subscripts indicate partial derivatives (gx
is a row vector, xp is an M P matrix, etc.). Since g
once.
The gradient of g with respect to p is extremely is a given function, gp and gx are presumably easy to
useful. It gives a measure of the sensitivity of our an- compute. On the other hand, computing xp is hard:
swer to the parameters p (which may, for example, evaluating it directly by differentiating Ax = b by a
come from some experimental measurements with parameter pi gives x pi = A1 (b pi A pi x). That is,
some associated uncertainties). Or, we may want we would have to solve an M M linear equation for
to perform an optimization of g, picking the p that P right-hand sides, once for every compont of p; this
produce some desired result; in this case the gradi- is impractical if P and M are large.
More explicitly, the problematic term is:
ent indicates a useful search direction (e.g. for nonlinear conjugate-gradient optimization). For largegx [ |{z}
A1 (bp Ap x)] = [gx A1 ] (bp Ap x),
scale optimization problems, the number P of design gx xp = |{z}
| {z } | {z } | {z }
parameters can be hundreds, thousands, or more
1M MM
1M
MP
MP
this is common in shape or topology optimization,
in which p controls the placement and shape of ar- where Ap x denotes the M P matrix with columns
2
bitrary blobs of different materials constituting a A pi x for i = 1, . . . P. One way of looking at the diffigiven structure/design. Sometimes, this process is culty is that in the first equation we multiply a M M
2
called inverse design: finding the problem that yields matrix by a M P matrix, which costs O(M P)
1
a given solution instead of the other way around. work, or equivalently we have multiplications of A
1 This involves no loss of generality, since complex linear equaWhen P  1, the amazing efficiency of adjoint methods makes inverse design possible.
tions can always be written as real linear equations of twice the
I hadnt found any textbook description of adjoint size2 by taking the real and imaginary parts as separate variables.
Ap is a rank-3 tensor or three-dimensional mamethods that I particularly like, which is part of my trix,Technically,
although it almost certainly isnt stored this way. For exammotivation for writing up these notes. One introduc- ple, A pi x could be computed for each i separately without saving
tion can be found in [1], and a more general treatment A pi . Often, A pi will be very sparse.
1

by P vectors (i.e., solves of P right-hand sides, which fp = 0 and thus x p = f1


x fp . Hence, we write
in practice would likely use a factorization of A or
an iterative solver rather than explicitly computing dg = gp +gx xp = gp gx [ f1 fp ] = gp [gx f1 ] fp .
x
|{z} |{z}
| {zx } |{z}
|{z}
A1 ).3 However, this can be ameliorated simply by dp
1M MM MP
1M MP
parenthesizing in a different way [3],4 as shown in
T
1
the last expression. If we multiply = gx A first, We solve for x by whatever method, then solve for
that corresponds to only a single solution of an ad- from
joint equation5
fT = gT ,
(2)
AT = gTx .

(1)

and finally obtain


and then we multiply a single vector T by our M P
dg
= gp T fp .
(3)
matrix for only (MP) work. Putting it all together,
dp f=0
we obtain:
The only difference is that the adjoint equation (2) is

dg
T
T
not simply the adjoint of the equation for x. Still, it
= gp fp = gp (Ap x bp ).
dp f=0
is a single M M linear equation for that should
be of comparable (or lesser) difficulty to solving for
Again, A(p) and b(p) are presumably specified an- x.
alytically and thus Ap and bp can easily be computed (in some cases automatically, by automatic
program differentiators such as ADIFOR). Note that 4 Eigenproblems
the adjoint problem is of the same size as the original Ax = b system, can use the same factorization As a more complicated example illustrating the use
(e.g. LU factorization A = LU immediately gives of equations (2) and (3) from the previous sections,
AT = U T LT ), has the same condition number, and let us suppose that we are solving a linear eigenhas the same spectrum of eigenvalues (the eigenval- problem Ax = x and looking at some function
ues of A and AT are identical) so iterative algorithms g(x, , p). For simplicity, assume that A is reali.e.,
will have similar performance (and can use similar symmetric and that is simple (non-degenerate;
6 In this case, we
x
is
the
only
eigenvector
for
).
preconditioners)in every sense, solving the adjoint
problem should be no harder than solving the origi- now have M + 1 unknowns described by the column
vector:


nal problem.
x
x =
.

The eigenequation f = Ax x only gives us M


equations and doesnt completely determine x , for
two reasons. First, of course, there are many possiIf x satisfies some general, possibly nonlinear, equa- ble eigenvalues, but lets assume that we have picked
tions f(x, p) = 0, the process is almost exactly the one in some fashion (e.g. the smallest , or the
same. Differentiating the f equation, we find fx xp + closest to , or the third largest ||, or ...). Second,
the eigenequation does not determine the length |x|;
3 If M is sparse, then the cost might be significantly less than
lets arbitrarily pick |x| = 1 or xT x = 1. This gives us
this O(M 2 P) upper bound, but in any case solving P right-hand

sides will be significantly more costly than solving a single right- M + 1 equations f = 0 where:
hand side for the adjoint formulation.


4 Another way of looking at this, and the source of the nof
f =
.
tation, is to think of sort of a Lagrange multiplier process: rexT x 1

Nonlinear equations

place g with g = g T f by adding a multiple of f = 0, and


then choose is a clever way to cancel the annoying derivative
term. This gives the same result, and may be easier to generalize to some more complicated circumstances, however, such as
differential-algebraic equations [2].
5 For complex-valued x and A and real g, instead of the transpose AT one typically obtains the adjoint A = AT (the conjugatetranspose).

6 Problems involving degenerate eigenvalues occur surprisingly often in optimization of eigenvalues (e.g. when maximizing the minimum eigenvalue of some system), and must be treated
with special care. In that case, a generalization of the gradient
is required to determine sensitivities or the steepest-descent direction [4], a more elaborate version of what is called degenerate
perturbation theory in quantum mechanics [?].

Well need M + 1 adjoint variables :





=
.

0.2

0.15

0.1

The adjoint equations (2) then give:


0.05

= gTx 2 x,
(A )

(4)

xT = g .

(5)

0.05

The first equation, at first glance, seems to be prob


V / 1000
0.1
lematic: A is singular, with a null space of x. Its,
okay, though! First, we have to choose so that so0.15
1
0.8
0.6
0.4
0.2
0
0.2
0.4
0.6
0.8
1
lutions of equation (4) exist: the right-hand side must
x
be orthogonal to x so that it is not in the null space
of A . That is, we must have xT (gTx 2 x) = 0, Figure 1: Optimized V (x) (scaled by 1/1000) and
and thus = xT gTx /2 (since xT x = 1), and therefore (x) for (x) = 1 + sin[x + cos(3x)] after 500
0
satisfies:
cg iterations.
= (1 xxT )gTx = PgTx
(A )

(6)
Now, however, we will specify a particular 0 (x) and
find the V (x) that gives (x) 0 (x) for the groundstate eigenfunction (i.e. for the smallest eigenvalue
E). In particular, we will find the V (x) that minimizes
Z

where P = 1 xxT is the projection operator into the


space orthogonal to x. This equation then has a solution, and in fact it has infinitely many solutions: we
can add any multiple of x to and still have a solution. Equivalently, we can write = 0 + x for
xT 0 = 0 and some . Fortunately, is determined
by (5): = g . Finally, with 0 determined by
(6),7 we can find the desired gradient via (3):

dg
= gp T A p x = gp T0 A p x + g xT A p x.
dp f=0
(7)
dg
If we compare with dp
= gp + gx xp + g p , we immediately see that p = xT A p x. This is a well-known
result from quantum physics and perturbation theory,
where it is known as the Hellman-Feynman theorem.

g=
1

|(x) 0 (x)|2 dx.

To solve this numerically, we will discretize the interval x [1, 1) with M equally-spaced points xn =
2
), and solve for the solution (xn )
nx (x = M+1
at these points, denoted by the vector . That is,
to compare with the notation of the previous sections, we have the eigenvector x = , the eigenvalue
= E, and the parameters V (xn ) or p = V. If we
discretize the eigenoperator with the usual center = E
for:
difference scheme, we get A

2 1 0
0 1
1 2 1 0


1 0 1 2 1 0

A= 2 .
+diag(V).
..

x ..
.

1 2 1
1 0
0 1 2

Example inverse design

As a more concrete example of an inverse-design


problem, lets consider the Schrodinger eigenequation in one dimension,


d2
2 +V (x) (x) = E(x),
dx

As before, we normalize (and 0 ) to T = 1,8


giving a projection operator P = 1 T (or P =
with periodic boundaries (x + 2) = (x). Nor1 |i h|, in Dirac notation). The discrete version
mally, we take a given V (x) and solve for and E.
of g is now

7 Since P commutes with A , we can solve for easily by


0
an iterative method such as conjugate gradient: if we start with an
initial guess orthogonal to x, all subsequent iterates will also be
orthogonal to x and will thus converge to 0 (except for roundoff,
which can be corrected by multiplying the final result by P).

, V) = (
0 )T (
0 )x
g(
8 We

also
R have an arbitrary choice of sign, which we fix by
choosing dx > 0.

where 0 is 0 (xn ), our target eigenfunction. There 0 )T x and thus, by eq. (6), we
fore, g = 2(
find via:
= 2P(
0 )x,
(A E)

(8)
0.3

= 0 (
= 0 since gE = 0). gV and gE are
with P
both 0. Moreover, AVn is simply the matrix with 1 at
(n, n) and 0s elsewhere, and thus from (7):

0.25

0.2

0.15

0.1

dg
= n n
dVn

0
0.05

V / 10000

dg
where is the point=
or equivalently dV
0.05
wise product (.* in Matlab).
0.1
Whew! Now how do we solve these equations nu0.15
1
0.8
0.6
0.4
0.2
0
0.2
0.4
0.6
0.8
1
merically? This is illustrated by the Matlab function
x
schrodinger_fd_adj given below. We set up A as
a sparse matrix, then find the smallest eigenvalue and
Figure 2: Optimized V (x) (scaled by 1/10000) and
eigenvector via the eigs function (which uses an it(x) for 0 (x) = 1 |x| for |x| < 0.5, after 5000 cg
erative Arnoldi method). Then we solve (8) for via
iterations.
the Matlab pcg function (preconditioned conjugategradient, although we dont bother with a preconditioner).
dg
Then, given g and dV
, we then just plug it into
some optimization algorithm. In particular, nonlinear conjugate gradient seems to work well for this
problem.9

5.1

Optimization results
0.2

(x)

10 cg iterations
In this section, we give a few example results from
20
0.18
40
80
running the above procedure (nonlinear cg optimiza0.16
160
320
tion) for M = 100. As the starting guess for our opti5000
0.14
mization, well just use V (x) = 0. That is, we are do0.12
ing a local optimization in a 100-dimensional space,
0.1
using the adjoint method to get the gradient. It is
0.08
somewhat remarkable that this worksin a few sec0.06
onds on a PC, it converges to a very good solution!
0.04
Well try a couple of example 0 (x) functions. To
0.02
start with, lets do 0 (x) = 1 + sin[x + cos(3x)].
0
(Note that the ground-state will never have any
1
0.8
0.6
0.4
0.2
0
0.2
0.4
0.6
0.8
1
x
nodes, so we require 0 0 everywhere.) This
0 (x), along with the resulting (x) and V (x) after
500 cg iterations, are shown in figure 1. The solution Figure 3: Optimized (x) for 0 (x) = 1 |x|
(x) matches 0 (x) very well except for a couple for |x| < 0.5, after various numbers of nonlinear
of small ripples, and V (x) is quite complicatednot conjugate-gradient iterations (from 10 to 10000).
something you could easily guess!

9 I used the nonlinear conjugate-gradient Matlab conj_grad


routine from:
http://www2.imm.dtu.dk/~hbn/Software/

Oh, but that 0 was too easy! Lets try one with 5.2 Matlab code
discontinuities: 0 (x) = 1 |x| for |x| < 0.5 and 0
dg
, not to
The following code solves for g and dV
otherwise (which looks a bit like a house). This
mention the eigenfunction and the corresponding
0 (x), along with the resulting (x) and V (x) after
eigenvalue E, for a given V and 0 .
500 cg iterations, are shown in figure 2. Amazingly,
it still captures 0 pretty well, although it has a bit
% Usage: [g,gp,E,psi] = schrodinger_fd_adj(x, V, psi0)
more trouble with the discontinuities than with the
%
slope discontinuity. This time, we let it converge for
% Given a column-vector x(:) of N equally spaced x points an
5000 cg iterations to give it a bit more time. Was this
% V of the potential V(x) at those points, solves Schrodinge
really necessary? In figure 3, we plot (x) for 10,
%
[ -d^2/dx^2 + V(x) ] psi(x) = E psi(x)
20, 40, 80, 160, 320, and 5000 cg iterations. It gets
% with periodic boundaries for the lowest "ground state" eig
the rough shape pretty quickly, but the discontinuous
% wavefunction psi.
features are converging fairly slowly. (Presumably
%
this could be improved if we found a good precondi% Furthermore, it computes the function g = integral |psi tioner, or perhaps by a different optimization method
% the gradient gp = dg/dV (at each point x).
or objective function.)
function [g,gp,E,psi] = schrodinger_fd_adj(x, V, psi0)
dx = x(2) - x(1);
N = length(x);
A = spdiags([ones(N,1), -2 * ones(N,1), ones(N,1)], -1:1,
A(1,N) = 1;
A(N,1) = 1;
A = - A / dx^2 + spdiags(V, 0, N,N);

opts.disp = 0;
[psi,E] = eigs(A, 1, sa, opts);
E = E(1,1);
if sum(psi) < 0
psi = -psi; % pick sign; note that psi * psi = 1 from e
end
gpsi = psi - psi0;
g = gpsi * gpsi * dx;
gpsi = gpsi * 2*dx;

P = @(x) x - psi * (psi * x); % projection onto direction

[lambda,flag] = pcg(A - spdiags(E*ones(N,1), 0, N,N), P(gp


lambda = P(lambda);
gp = -real(conj(lambda) .* psi);
disp(g);

Initial-value problems

matrix exponential
turns out to be more complicated:
R
0
0
Ap = 0t eBt Bp eB(tt ) dt 0 , and so,

So far, we have looked at x that are determined by


Z t
dg
simple algebraic equations (which may come from
= gp + T (t t 0 )Bp x(t 0 )dt 0 + T bp .
dp
a PDE, etcetera). What if, instead, we are determin0
ing x by integrating a set of equations in time? The
simplest example of this is an initial-value problem This is especially unfortunate0 because it usually
0
for a linear, time-independent, homogeneous set of means that we have to store x(t ) at all times 0 t
t
in
order
to
compute
the
integral.
Adjoint
methods
ODEs:
are storage-intensive for time-dependent problems!
x = Bx
More generally, of course, one might wish to inwhose solution after a time t for x(0) = b is formally: clude time-varying A, nonlinearities, inhomogeneous
(source) terms, etcetera, into the equations to intex = x(t) = eBt b.
grate. A very general formulation of the problem,
for differential-algebraic equations (DAEs), can be
This, however, is exactly a linear equation Ax = b found in [2]. A similar general principle remains,
with A = eBt , so we can just quote our results from however: the adjoint variable is determined by
earlier! That is, suppose we are optimizing (or eval- integrating a similar (adjoint) DAE, using the final
uating the sensitivity) of some function g(x, p) based value of x(t) to compute the initial condition of (0).
on the solution x at time t. Then we find the adjoint In fact, the (t) equation is actually often interpreted
vector via (1):
as being integrated backwards in time from t to 0.
Alternatively, one can consider a discrete-time sitT
eB t = gTx .
uation of recurrence equations, in which case the adjoint problem is a recurrence backward in time
Equivalently, is the exactly the solution (t) after
see my online notes on adjoint methods for recura time t of its own adjoint ODE:
rences.
T

=B

References

with initial condition (0) = gTx . We should have


expected this by now: solving for always involves
a task of similar complexity to finding x, so if we
found x by integrating an ODE then we find by
an ODE too! Of course, we need not solve these
ODEs by matrix exponentials; we can use RungeKutta, forward Euler, or (if B comes from a PDE)
whatever scheme we deem appropriate (e.g. CrankNicolson).
One important property to worry about is stability,
and here we are in luck. The eigenvalues of B and
BT are complex-conjugates, and so if one is stable
(eigenvalues with absolute values 1) then the other
is!
dg
via
Finally, we can write down the gradient dp
equation (3):

[1] R. M. Errico, What is an adjoint model?, Bulletin Am. Meteorological Soc., vol. 78, pp. 2577
2591, 1997.
[2] Y. Cao, S. Li, L. Petzold, and R. Serban,
Adjoint sensitivity analysis for differentialalgebraic equations: The adjoint DAE system
and its numerical solution, SIAM J. Sci. Comput., vol. 24, no. 3, pp. 10761089, 2003.
[3] G. Strang, Computational Science and Engineering. Wellesley, MA: Wellesley-Cambridge
Press, 2007.
[4] A. P. Seyranian, E. Lund, and N. Olhoff, Multiple eigenvalues in structural optimization problems, Structural Optimization, vol. 8, pp. 207
227, 1994.

dg
= gp T (Ap x bp ).
dp
Now, since A = eBt , one might be tempted to write
Ap = Bpt A, but this is not true except in the very
special case where Bp commutes with B! Unfortunately, the general expression for differentiating a
6

You might also like