PDE-constrained optimization and the adjoint method

Andrew M. Bradley
October 15, 2019 (original November 16, 2010)

(This document is licensed under CC BY 4.0.)

PDE-constrained optimization and the adjoint method for solving these and related problems appear in a wide range of application domains. Often the adjoint method is used in an application without explanation. The purpose of this tutorial is to explain the method in detail in a general setting that is kept as simple as possible.

We use the following notation: the total derivative (gradient) is denoted dx (usually denoted d(·)/dx or ∇x); the partial derivative, ∂x (usually, ∂(·)/∂x); the differential, d. We also use the notation fx for both partial and total derivatives when we think the meaning is clear from context. Recall that a gradient is a row vector, and this convention induces sizing conventions for the other operators. We use only real numbers in this presentation.

1 The adjoint method

Let x ∈ R^nx and p ∈ R^np. Suppose we have the function f(x, p) : R^nx × R^np → R and the relationship g(x, p) = 0 for a function g : R^nx × R^np → R^nx whose partial derivative gx is everywhere nonsingular. What is dp f?

1.1 Motivation

The equation g(x, p) = 0 is often solved by a complicated software program that implements what is sometimes called a simulation or the forward problem. Given values for the parameters p, the program computes the values x. For example, p could parameterize boundary and initial conditions and material properties for a discretized PDE, and x are the resulting field values. f(x, p) is often a measure of merit: for example, fit of x to data at a set of locations, the smoothness of x or p, or the degree to which p attains a particular objective. Minimizing f is sometimes called the inverse problem.

Later we shall discuss seismic tomography. In this application, x are the field values in the wave (also referred to as the acoustic, second-order linear hyperbolic, second-order wave, etc.) equation, p parameterizes the earth model and initial and boundary conditions, f measures the difference between measured and synthetic waveforms, and g encodes the wave equation, initial and boundary conditions, and generation of synthetic waveforms.

The gradient dp f is useful in many contexts: for example, to solve the optimization problem min_p f or to assess the sensitivity of f to the elements of p.

One method to approximate dp f is to compute np finite differences over the elements of p. Each finite difference computation requires solving g(x, p) = 0. For moderate to large np, this can be quite costly.

In the program to solve g(x, p) = 0, it is likely that the Jacobian matrix ∂x g is calculated (see Sections 1.3 and 1.5 for further details). The adjoint method uses the transpose of this matrix, gx^T, to compute the gradient dp f. The computational cost is usually no greater than that of solving g(x, p) = 0 once, and it is sometimes even less.

1.2 Derivation

In this section, we consider the slightly simpler function f(x); see below for the full case.

First,

    dp f = dp f(x(p)) = ∂x f dp x (= fx xp).    (1)

Second,

    g(x, p) = 0 everywhere implies dp g = 0.

(Note carefully that dp g = 0 everywhere only because g = 0 everywhere. It is certainly not the case that a function that happens to be 0 at a point also has a 0 gradient there.) Expanding the total derivative,

    gx xp + gp = 0.

As gx is everywhere nonsingular, the final equality implies xp = −gx^{-1} gp. Substituting this latter relationship into (1) yields

    dp f = −fx gx^{-1} gp.

The expression −fx gx^{-1} is a row vector times an nx × nx matrix and may be understood in terms of linear algebra as the solution to the linear equation

    gx^T λ = −fx^T,    (2)

where T is the matrix transpose. The matrix conjugate transpose (just the transpose when working with reals) is also called the matrix adjoint, and for this reason, the vector λ is called the vector of adjoint variables and the linear equation (2) is called the adjoint equation. In terms of λ, dp f = λ^T gp.

A second derivation is useful. Define the Lagrangian

    L(x, p, λ) ≡ f(x) + λ^T g(x, p),
where in this context λ is the vector of Lagrange multipliers. As g(x, p) is everywhere zero by construction, we may choose λ freely, f(x) = L(x, p, λ), and

    dp f(x) = dp L = ∂x f dp x + dp λ^T g + λ^T (∂x g dp x + ∂p g)
            = fx xp + λ^T (gx xp + gp)    because g = 0 everywhere
            = (fx + λ^T gx) xp + λ^T gp.

If we choose λ so that gx^T λ = −fx^T, then the first term is zero and we can avoid calculating xp. This condition is the adjoint equation (2). What remains, as in the first derivation, is dp f = λ^T gp.

1.3 The relationship between the constraint and adjoint equations

Suppose g(x, p) = 0 is the linear (in x) equation A(p)x − b(p) = 0. As ∂x g = A(p), the adjoint equation is A(p)^T λ = −fx^T. The two equations differ in form only by the adjoint.

If g(x, p) = 0 is a nonlinear equation, then software that solves the system for x given a particular value for p quite likely solves, at least approximately, a sequence of linear equations of the form

    ∂x g(x, p) ∆x = −g(x, p).    (3)

∂x g = gx is the Jacobian matrix for the function g(x, p), and (3) is the linear system that gives the step to update x in Newton's method. The adjoint equation gx^T λ = −fx^T is a linear system that differs in form from (3) only by the adjoint operation.

1.4 f is a function of both x and p

Suppose our function is f(x, p) and we still have g(x, p) = 0. How does this change the calculations? As

    dp f = fx xp + fp = λ^T gp + fp,

the calculation changes only by the term fp, which usually is no harder to compute in terms of computational complexity than fx.

f depends directly on p when, for example, the modeler wishes to weight or penalize certain parameters. For example, suppose f originally measures the misfit between simulated and measured data; then f depends directly only on x. But suppose the model parameters p vary over space and the modeler prefers smooth distributions of p. Then a term can be added to f that penalizes nonsmooth p values.

1.5 Partial derivatives

We have seen that ∂x g is the Jacobian matrix for the nonlinear function g(x, p) for fixed p. To obtain the gradient dp f, ∂p g is also needed. This quantity generally is no harder to calculate than gx. But it will almost certainly require writing additional code, as the original software to solve just g(x, p) = 0 does not require it.

2 PDE-constrained optimization problems

Partial differential equations are used to model physical processes. Optimization over a PDE arises in at least two broad contexts: determining parameters of a PDE-based model so that the field values match observations (an inverse problem); and design optimization: for example, of an airplane wing.

A common, straightforward, and very successful approach to solving PDE-constrained optimization problems is to solve the numerical optimization problem resulting from discretizing the PDE. Such problems take the form

    minimize_p f(x, p)
    subject to g(x, p) = 0.

An alternative is to discretize the first-order optimality conditions corresponding to the original problem; this approach has been explored in various contexts for theoretical reasons but generally is much harder and is not as practically useful a method.

Two broad approaches solve the numerical optimization problem. The first approach is that of modern, cutting-edge optimization packages: converge to a feasible solution (g(x, p) = 0) only as f converges to a minimizer. The second approach is to require that x be feasible at every step in p (g(x, p) = 0).

The first approach is almost certainly the better approach for almost all problems. However, practical considerations turn out to make the second approach the better one in many applications. For example, a research effort may have produced a complicated program to solve g(x, p) = 0 (the PDE or forward problem), and one now wants to solve an optimization problem (inverse problem) using this existing code. Additionally, other properties, particularly of time-dependent problems, can make the first approach very difficult to implement.

In the second approach, the problem solver must evaluate f(x, p), solve g(x, p) = 0, and provide the gradient dp f. Section 1 provides the necessary tools at a high level of generality to perform the final step. But at least one class of problems deserves some additional discussion.
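The steady-state machinery of Section 1 can be sketched in a few lines of NumPy. The test problem below is invented purely for illustration (the names g, f, B, and c are assumptions, not anything from the text): g(x, p) = 0 is a small nonlinear system solved by Newton's method with the step equation (3), the adjoint solve reuses the transposed Jacobian as in (2), and the resulting gradient dp f = λ^T gp + fp is checked against finite differences over p.

```python
import numpy as np

# Invented test problem:
#   g(x, p) = x + 0.1 x^3 - B p - c = 0 defines x(p);  f(x, p) = 0.5 x.x + p.p.
rng = np.random.default_rng(0)
nx, npar = 3, 2
B = rng.standard_normal((nx, npar))
c = rng.standard_normal(nx)

def g(x, p):  return x + 0.1 * x**3 - B @ p - c
def gx(x):    return np.diag(1.0 + 0.3 * x**2)   # Jacobian dg/dx (here diagonal)
def gp():     return -B                          # dg/dp
def f(x, p):  return 0.5 * x @ x + p @ p
def fx(x):    return x                           # row vector df/dx
def fp(p):    return 2.0 * p                     # row vector df/dp

def solve_forward(p, iters=50):
    # Newton's method: each step solves gx dx = -g, i.e., equation (3).
    x = np.zeros(nx)
    for _ in range(iters):
        x = x + np.linalg.solve(gx(x), -g(x, p))
    return x

def gradient_adjoint(p):
    x = solve_forward(p)
    # Adjoint equation (2): gx^T lam = -fx^T -- one extra linear solve.
    lam = np.linalg.solve(gx(x).T, -fx(x))
    # dp f = lam^T gp + fp (Section 1.4).
    return lam @ gp() + fp(p)

p0 = np.array([0.7, -0.3])
grad = gradient_adjoint(p0)

# The costly alternative: central finite differences over p, one pair of
# forward solves per parameter.
eps = 1e-6
fd = np.array([(f(solve_forward(p0 + eps * e), p0 + eps * e)
                - f(solve_forward(p0 - eps * e), p0 - eps * e)) / (2 * eps)
               for e in np.eye(npar)])
print(grad, fd)  # the two gradients should agree closely
```

Note that the adjoint gradient costs one forward solve plus one linear solve, while the finite-difference gradient costs 2 np forward solves, which is the cost comparison made above.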
2.1 Time-dependent problems

Time-dependent problems have special structure for two reasons. First, the matrices of partial derivatives have very strong block structure; we shall not discuss this low-level topic here. Second, and the subject of this section, time-dependent problems are often treated by semi-discretization: the spatial derivatives are made explicit in the various operators, but the time integration is treated as being continuous; this method of lines induces a system of ODE. The method-of-lines treatment has two implications. First, the adjoint equation for the problem is also an ODE induced by the method of lines, and the derivation of the adjoint equation must reflect that. Second, the forward and adjoint ODE can be solved by standard adaptive ODE integrators.

2.1.1 The adjoint method

Consider the problem

    minimize_p F(x, p), where F(x, p) ≡ ∫_0^T f(x, p, t) dt,
    subject to h(x, ẋ, p, t) = 0    (4)
               g(x(0), p) = 0,

where p is a vector of unknown parameters; x is a (possibly vector-valued) function of time; h(x, ẋ, p, t) = 0 is an ODE in implicit form; and g(x(0), p) = 0 is the initial condition, which is a function of some of the unknown parameters. The ODE h may be the result of semi-discretizing a PDE, which means that the PDE has been discretized in space but not time. An ODE in explicit form appears as ẋ = h̄(x, p, t), and so the implicit form is h(x, ẋ, p, t) = ẋ − h̄(x, p, t).

A gradient-based optimization algorithm requires the user to calculate the total derivative (gradient)

    dp F(x, p) = ∫_0^T [∂x f dp x + ∂p f] dt.

Calculating dp x is difficult in most cases. As in Section 1, two common approaches simply do away with having to calculate it. One approach is to approximate the gradient dp F(x, p) by finite differences over p. Generally, this requires integrating np additional ODE. The second method is to develop a second ODE, this one in the adjoint vector λ, that is instrumental in calculating the gradient. The benefit of the second approach is that the total work of computing F and its gradient is approximately equivalent to integrating only two ODE.

The first step is to introduce the Lagrangian corresponding to the optimization problem:

    L ≡ ∫_0^T [f(x, p, t) + λ^T h(x, ẋ, p, t)] dt + µ^T g(x(0), p).

The vector of Lagrange multipliers λ is a function of time, and µ is another vector of multipliers that are associated with the initial conditions. Because the two constraints h = 0 and g = 0 are always satisfied by construction, we are free to set the values of λ and µ, and dp L = dp F. Taking this total derivative,

    dp L = ∫_0^T [∂x f dp x + ∂p f + λ^T (∂x h dp x + ∂ẋ h dp ẋ + ∂p h)] dt
           + µ^T (∂x(0) g dp x(0) + ∂p g).    (5)

The integrand contains terms in dp x and dp ẋ. The next step is to integrate by parts to eliminate the second one:

    ∫_0^T λ^T ∂ẋ h dp ẋ dt = λ^T ∂ẋ h dp x |_0^T − ∫_0^T [λ̇^T ∂ẋ h + λ^T d_t ∂ẋ h] dp x dt.    (6)

Substituting this result into (5) and collecting terms in dp x and dp x(0) yield

    dp L = ∫_0^T [(∂x f + λ^T (∂x h − d_t ∂ẋ h) − λ̇^T ∂ẋ h) dp x + fp + λ^T ∂p h] dt
           + λ^T ∂ẋ h dp x |_T + (−λ^T ∂ẋ h + µ^T g_x(0)) |_0 dp x(0) + µ^T gp.

As we have already discussed, dp x(T) is difficult to calculate. Therefore, we set λ(T) = 0 so that the whole term is zero. Similarly, we set µ^T = λ^T ∂ẋ h|_0 g_x(0)^{-1} to cancel the second-to-last term. Finally, we can avoid computing dp x at all other times t > 0 by setting

    ∂x f + λ^T (∂x h − d_t ∂ẋ h) − λ̇^T ∂ẋ h = 0.

The algorithm for computing dp F follows:

1. Integrate h(x, ẋ, p, t) = 0 for x from t = 0 to T with initial conditions g(x(0), p) = 0.

2. Integrate ∂x f + λ^T (∂x h − d_t ∂ẋ h) − λ̇^T ∂ẋ h = 0 for λ from t = T to 0 with initial conditions λ(T) = 0.

3. Set

    dp F = ∫_0^T [fp + λ^T ∂p h] dt + λ^T ∂ẋ h |_0 g_x(0)^{-1} gp.
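The three steps above can be sketched numerically with a standard adaptive integrator. The scalar problem below is an assumption made up for illustration: ẋ = −p1 x + p2 with fixed x(0) = 1 (so gp = 0 and the boundary term in step 3 vanishes) and F = ∫_0^T x²/2 dt. For the explicit form h = ẋ − h̄, the adjoint equation of step 2 reduces to λ̇ = fx^T − h̄x^T λ, and the gradient integrand of step 3 is accumulated during the backward sweep.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Invented scalar problem: xdot = hbar(x, p) = -p1 x + p2, x(0) = 1,
# F = int_0^T x^2/2 dt. Then dx h = -hbar_x and dxdot h = I.
T = 2.0
p = np.array([1.5, 0.3])

def hbar(x, p):  return -p[0] * x + p[1]
def hbar_x(p):   return -p[0]
def hbar_p(x):   return np.array([-x, 1.0])  # d hbar / d p
def f_x(x):      return x                    # f = x^2/2

# Step 1: integrate the forward ODE from t = 0 to T; keep a dense interpolant.
fwd = solve_ivp(lambda t, x: hbar(x, p), (0.0, T), [1.0],
                dense_output=True, rtol=1e-10, atol=1e-12)

# Step 2: integrate the adjoint ODE backward from T to 0 with lam(T) = 0:
# lamdot = f_x^T - hbar_x^T lam. The gradient integrand fp + lam^T dp h
# = -lam^T hbar_p (here fp = 0) is accumulated as extra quadrature states,
# with a sign flip because the integration runs in reverse time.
def backward(t, z):
    lam = z[0]
    x = fwd.sol(t)[0]
    dlam = f_x(x) - hbar_x(p) * lam
    dq = lam * hbar_p(x)
    return np.concatenate(([dlam], dq))

bwd = solve_ivp(backward, (T, 0.0), [0.0, 0.0, 0.0], rtol=1e-10, atol=1e-12)

# Step 3: with gp = 0, dp F is just the accumulated integral.
grad = bwd.y[1:, -1]

# Finite-difference check (np extra forward integrations, as in the text).
def F(p):
    rhs = lambda t, z: [hbar(z[0], p), 0.5 * z[0]**2]
    s = solve_ivp(rhs, (0.0, T), [1.0, 0.0], rtol=1e-10, atol=1e-12)
    return s.y[1, -1]

eps = 1e-5
fd = np.array([(F(p + eps * e) - F(p - eps * e)) / (2 * eps) for e in np.eye(2)])
print(grad, fd)  # the two should agree to several digits
```

The point of the sketch is the cost structure promised above: F and dp F together require integrating only the forward and adjoint ODE once each, independent of np.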
2.1.2 The relationship between the constraint and adjoint equations

Suppose h(x, ẋ, p, t) is the first-order explicit linear ODE h = ẋ − A(p)x − b(p). Then hx = −A(p) and hẋ = I, and so the adjoint equation is fx − λ^T A(p) − λ̇^T = 0. The adjoint equation is solved backward in time from T to 0. Let τ ≡ T − t; hence dt = −dτ. Denote the total derivative with respect to τ by a prime. Rearranging terms in the two equations,

    ẋ = A(p)x + b(p)
    λ′ = A(p)^T λ − fx^T.

The equations differ in form only by an adjoint.

2.1.3 A simple closed-form example

As an example, let's calculate the gradient of

    ∫_0^T x dt
    subject to ẋ = bx
               x(0) − a = 0.

Here, p = [a b]^T and g(x(0), p) = x(0) − a. We follow each step:

1. Integrating the ODE yields x(t) = a e^{bt}.

2. f(x, p, t) ≡ x and so ∂x f = 1. Similarly, h(x, ẋ, p, t) ≡ ẋ − bx, and so ∂x h = −b and ∂ẋ h = 1. Therefore, we must integrate

    1 − bλ − λ̇ = 0
    λ(T) = 0,

which yields λ(t) = b^{-1}(1 − e^{b(T−t)}).

3. ∂p f = [0 0], ∂p h = [0 −x], g_x(0) = 1, and gp = [−1 0]. Therefore, the first component of the gradient is

    λ^T ∂ẋ h |_0 g_x(0)^{-1} gp = λ(0) · 1 · 1^{-1} · (−1) = b^{-1}(−1 + e^{bT});

and as ∂b g = 0,

    ∫_0^T −λx dt = ∫_0^T b^{-1}(e^{b(T−t)} − 1) a e^{bt} dt = (a/b) T e^{bT} − (a/b^2)(e^{bT} − 1)

is the second component.

As a check, let us calculate the total derivative directly. The objective is

    ∫_0^T x dt = ∫_0^T a e^{bt} dt = (a/b)(e^{bT} − 1).

Taking the derivative of this expression with respect to a and, separately, b yields the same results we obtained by the adjoint method.

2.2 Seismic tomography and the adjoint method

Seismic tomography images the earth by solving an inverse problem. One formulation of the problem, following [Tromp, Tape, & Liu, 2005] and hiding many details in the linear operator A(m), is

    minimize_m χ(s), where χ(s) ≡ (1/2) Σ_{r=1}^N ∫_0^T ‖s(y_r, t) − d(y_r, t)‖^2 dt,
    subject to s̈ = A(m)s + b(m)
               g(s(0), m) = 0
               k(ṡ(0), m) = 0,

where s and d are synthetic and recorded three-component waveform data, the synthetic data are recorded at N stations located at y_r, m is a vector of model components that parameterize the earth model and initial and boundary conditions, A(m) is the spatial discretization of the PDE, b(m) contains the discretized boundary conditions and source terms, and g and k give initial conditions.

First, let us identify the components of the generic problem (4). The fields x are now s, m is p, the integrand is

    f(s, m, t) ≡ (1/2) Σ_{r=1}^N ‖s(y_r, t) − d(y_r, t)‖^2,

and the differential equation in implicit form is

    h(s, s̈, m, t) ≡ s̈ − A(m)s − b(m).

The only difference between the models is that (4) has a differential equation that is first order in time, whereas in the tomography problem, the differential equation (the second-order linear wave equation) is second order in time. Hence initial conditions must be specified for both s and ṡ, and the integration by parts in (6) must be done twice. Following the same methodology as in Section 2.1.1 (see Section 2.2.1 for further details), the adjoint equation is

    λ̈ = A(m)^T λ − fs^T    (7)
    λ(T) = λ̇(T) = 0.
The term

    fs(y, t) = Σ_{r=1}^N (s(y_r, t) − d(y_r, t)) δ(y − y_r),    (8)

where δ is the delta function, is what [Tromp, Tape, & Liu, 2005] call the waveform adjoint source. Observe again that the forward and adjoint equations differ in form only by the adjoint. The gradient is

    dm χ = ∫_0^T [χm + λ^T hm] dt + λ^T hs̈ |_0 k_ṡ(0)^{-1} km − λ̇^T hs̈ |_0 g_s(0)^{-1} gm
         = ∫_0^T [0 − λ^T ∂m(A(m)s + b(m))] dt + λ(0)^T k_ṡ(0)^{-1} km − λ̇(0)^T g_s(0)^{-1} gm.    (9)

We slightly abuse notation by writing ∂m A(m)s; in terms of the discretization of the wave equation and using Matlab notation to extract the columns of A(m), this expression is more correctly written

    ∂m A(m)s = Σ_i ∂m A(m)(:, i) s_i.

χm = 0 because the model variables m do not enter the objective. Additionally, the initial conditions are the earthquake and so typically are also independent of m. Hence (9) can be simplified to

    dm χ = − ∫_0^T λ^T ∂m(A(m)s + b(m)) dt.

2.2.1 The adjoint method for the second-order problem

The derivation in this section follows that in Section 2.1.1 for the first-order problem. For simplicity, we assume the ODE can be written in explicit form. The general problem is

    minimize_m F(s, m), where F(s, m) ≡ ∫_0^T f(s, m, t) dt,
    subject to s̈ = h̄(s, m, t)
               g(s(0), m) = 0
               k(ṡ(0), m) = 0.

The corresponding Lagrangian is

    L ≡ ∫_0^T [f(s, m, t) + λ^T (s̈ − h̄(s, m, t))] dt + µ^T g(s(0), m) + η^T k(ṡ(0), m),    (10)

which differs from the Lagrangian for the first-order problem by the term for the additional initial condition and the simplified form of h. The total derivative is

    dm L = ∫_0^T [∂s f dm s + ∂m f + λ^T (dm s̈ − ∂s h̄ dm s − ∂m h̄)] dt
           + µ^T (∂s(0) g dm s(0) + ∂m g) + η^T (∂ṡ(0) k dm ṡ(0) + ∂m k).

Integrating by parts twice,

    ∫_0^T λ^T dm s̈ dt = λ^T dm ṡ |_0^T − λ̇^T dm s |_0^T + ∫_0^T λ̈^T dm s dt.

Substituting this and grouping terms,

    dm L = ∫_0^T [(fs − λ^T h̄s + λ̈^T) dm s + fm − λ^T h̄m] dt    ⇒ fs − λ^T h̄s + λ̈^T = 0
           + (µ^T g_s(0) + λ̇^T)|_0 dm s(0)                       ⇒ µ^T = −λ̇(0)^T g_s(0)^{-1}
           + (η^T k_ṡ(0) − λ^T)|_0 dm ṡ(0)                       ⇒ η^T = λ(0)^T k_ṡ(0)^{-1}
           + λ^T dm ṡ |_T                                        ⇒ λ(T) = 0
           − λ̇^T dm s |_T                                        ⇒ λ̇(T) = 0
           + µ^T gm + η^T km.

We have indicated the suitable multiplier values to the right of each term. Putting everything together, the adjoint equation is

    λ̈ = h̄s^T λ − fs^T
    λ(T) = λ̇(T) = 0,

and the total derivative of F is

    dm F = dm L = ∫_0^T [fm − λ^T h̄m] dt − λ̇(0)^T g_s(0)^{-1} gm + λ(0)^T k_ṡ(0)^{-1} km.

2.2.2 The continuous, rather than discretized, second-order problem

So far we have viewed A(m) as being a matrix that results from discretizing a PDE. This is the proper way to view the adjoint problem in practice: the gradient of interest is that of the problem on the computer, which in general must be a discretized representation of the original continuous problem. However, it is still helpful to see how the adjoint method is applied to the fully continuous problem. We shall continue to use the notation A(m) to denote the spatial operator, but now we view it as something like A(x; m) = α(x; m)∇ · (β(x; m)∇), where α and
β are functions of space x and parameterized by the model parameters m, and ∇
is the gradient operator.
The key difference between the discretized and continuous problems is the inner product between the Lagrange multiplier λ and the fields. In the discretized problem, we write λ^T A(m)s; in the continuous problem, we write ∫_Ω λ(x) A(x; m) s(x) dx, where Ω is the domain over which the fields are defined. Here we assume s(x) is a scalar field for clarity. Then the Lagrangian like (10) is

    L ≡ ∫_0^T ∫_Ω [f(s, m, t) + λ(s̈ − h̄(s, m, t))] dx dt + ∫_Ω [µ g(s(0), m) + η k(ṡ(0), m)] dx.
In general, the derivations we have seen so far can be carried out with spatial
integrals replacing the discrete inner products.
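Returning to the discretized setting, the second-order recipe of Section 2.2.1 can be sketched numerically on a toy problem. The one-degree-of-freedom oscillator below is an assumption made up for illustration: s̈ = h̄(s, m) = −m s with initial conditions independent of m, so gm = km = 0 and the boundary terms in dm F drop out. The sketch integrates the forward equation, integrates λ̈ = h̄s^T λ − fs^T backward from λ(T) = λ̇(T) = 0, accumulates dm F = ∫ [fm − λ^T h̄m] dt, and checks the result against a finite difference.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Invented second-order toy problem:
#   minimize F = int_0^T s^2/2 dt  subject to  sddot = hbar(s, m) = -m s,
#   s(0) = 1, sdot(0) = 0 (both independent of m).
T, m = 1.5, 2.0

def F(m):
    # Forward problem, written first order in [s, sdot], with the objective
    # accumulated as an extra quadrature state.
    rhs = lambda t, z: [z[1], -m * z[0], 0.5 * z[0]**2]
    s = solve_ivp(rhs, (0.0, T), [1.0, 0.0, 0.0], rtol=1e-10, atol=1e-12)
    return s.y[2, -1]

# Forward solve, kept as a dense interpolant for use in the adjoint equation.
fwd = solve_ivp(lambda t, z: [z[1], -m * z[0]], (0.0, T), [1.0, 0.0],
                dense_output=True, rtol=1e-10, atol=1e-12)

# Adjoint equation: lamddot = hbar_s^T lam - f_s^T = -m lam - s,
# lam(T) = lamdot(T) = 0, integrated backward from T to 0. The last state
# accumulates dm F = int [f_m - lam^T hbar_m] dt = int lam s dt (f_m = 0,
# hbar_m = -s); the sign flip accounts for reverse-time integration.
def backward(t, z):
    lam, lamdot, _ = z
    s = fwd.sol(t)[0]
    return [lamdot, -m * lam - s, -lam * s]

bwd = solve_ivp(backward, (T, 0.0), [0.0, 0.0, 0.0], rtol=1e-10, atol=1e-12)
grad = bwd.y[2, -1]

# Finite-difference check of dF/dm.
eps = 1e-5
fd = (F(m + eps) - F(m - eps)) / (2 * eps)
print(grad, fd)  # should agree to several digits
```

As in the first-order case, the cost is one forward and one backward integration regardless of the number of model parameters, which is why the adjoint method scales to seismic-tomography-sized m.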
References
J. Tromp, C. Tape, Q. Liu, “Seismic tomography, adjoint methods, time reversal
and banana-doughnut kernels”, Geophys. J. Int. (2005) 160, 195–216.
Y. Cao, S. Li, L. Petzold, R. Serban, “Adjoint sensitivity analysis for differential-
algebraic equations: The adjoint DAE system and its numerical solution”, SIAM
J. Sci. Comput. (2003) 23(3), 1076–1089.