Adjoint Tutorial
Andrew M. Bradley
July 7, 2024 (original November 16, 2010)

This document is licensed under CC BY 4.0. Revisions: (2019) CC license, fix some typos. (2024) Reorganize Sec. 2; remove seismic tomography example.

PDE-constrained optimization and the adjoint method for solving these and related problems appear in a wide range of application domains. Often the adjoint method is used in an application without explanation. The purpose of this tutorial is to explain the method in detail in a general setting that is kept as simple as possible.

We use the following notation: the total derivative (gradient) is denoted dx (usually denoted d(·)/dx or ∇x); the partial derivative, ∂x (usually, ∂(·)/∂x); the differential, d. We also use the notation fx for both partial and total derivatives when we think the meaning is clear from context. Recall that a gradient is a row vector, and this convention induces sizing conventions for the other operators. We use only real numbers in this presentation.

1 The adjoint method

Let x ∈ R^nx and p ∈ R^np. Suppose we have the function f(x, p) : R^nx × R^np → R and the relationship g(x, p) = 0 for a function g : R^nx × R^np → R^nx whose partial derivative gx is everywhere nonsingular. What is dp f?

1.1 Motivation

The equation g(x, p) = 0 is often solved by a complicated software program that implements what is sometimes called a simulation or the forward problem. Given values for the parameters p, the program computes the values x. For example, p could parameterize boundary and initial conditions and material properties for a discretized PDE, and x are the resulting field values. f(x, p) is often a measure of merit: for example, fit of x to data at a set of locations, the smoothness of x or p, the degree to which p attains a particular objective. Minimizing f is sometimes called the inverse problem.

The gradient dp f is useful in many contexts: for example, to solve the optimization problem minp f or to assess the sensitivity of f to the elements of p.

One method to approximate dp f is to compute np finite differences over the elements of p. Each finite difference computation requires solving g(x, p) = 0. For moderate to large np, this can be quite costly. In the course of solving g(x, p) = 0, the Jacobian matrix gx often must be calculated (see Sections 1.3 and 1.5 for further details). The adjoint method uses the transpose of this matrix, gxT, to compute the gradient dp f. The computational cost is usually no greater than solving g(x, p) = 0 once and sometimes even less costly.

1.2 Derivation

In this section, we consider the slightly simpler function f(x); see below for the full case.

First,

    dp f = dp f(x(p)) = ∂x f dp x (= fx xp).    (1)

Second, g(x, p) = 0 everywhere implies dp g = 0. (Note carefully that dp g = 0 everywhere only because g = 0 everywhere. It is certainly not the case that a function that happens to be 0 at a point also has a 0 gradient there.) Expanding the total derivative,

    gx xp + gp = 0.

As gx is everywhere nonsingular, this equality implies xp = −gx⁻¹ gp. Substituting this latter relationship into (1) yields

    dp f = −fx gx⁻¹ gp.

The expression −fx gx⁻¹ is a row vector times an nx × nx matrix and may be understood in terms of linear algebra as the solution to the linear equation

    gxT λ = −fxT,    (2)

where T is the matrix transpose. The matrix conjugate transpose (just the transpose when working with reals) is also called the matrix adjoint, and for this reason, the vector λ is called the vector of adjoint variables and the linear equation (2) is called the adjoint equation. In terms of λ, dp f = λT gp.
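As a concrete illustration, the following NumPy sketch applies this recipe to a small made-up system; the model g, the matrix M, and all numerical values are hypothetical, chosen only so that the script is self-contained. It solves g(x, p) = 0 by Newton's method, solves the adjoint equation (2), forms dp f = λT gp, and checks the result against finite differences over the elements of p.

    import numpy as np

    # Hypothetical model: g(x, p) = M x + p[1] x^3 - [p[0], 1] = 0, with f(x) = x^T x.
    M = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

    def g(x, p):
        return M @ x + p[1] * x**3 - np.array([p[0], 1.0])

    def g_x(x, p):   # Jacobian of g with respect to x
        return M + 3.0 * p[1] * np.diag(x**2)

    def g_p(x, p):   # Jacobian of g with respect to p
        return np.column_stack(([-1.0, 0.0], x**3))

    def f(x):
        return float(x @ x)

    def f_x(x):      # gradient of f, a row vector
        return 2.0 * x

    def solve_forward(p):
        x = np.zeros(2)                 # Newton's method for g(x, p) = 0
        for _ in range(50):
            dx = np.linalg.solve(g_x(x, p), -g(x, p))
            x += dx
            if np.linalg.norm(dx) < 1e-12:
                break
        return x

    p = np.array([1.0, 0.1])
    x = solve_forward(p)

    lam = np.linalg.solve(g_x(x, p).T, -f_x(x))   # adjoint equation (2)
    dpf = lam @ g_p(x, p)                         # dp f = lam^T g_p

    eps = 1e-6                                    # finite-difference check
    fd = [(f(solve_forward(p + eps * e)) - f(solve_forward(p - eps * e))) / (2 * eps)
          for e in np.eye(2)]
    print(dpf, fd)   # the two gradients should agree to ~1e-8

Note that the full gradient comes from a single additional linear solve, whereas the finite-difference check costs two extra forward solves per element of p.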
A second derivation is useful. Define the Lagrangian

    L(x, p, λ) ≡ f(x) + λT g(x, p),

where in this context λ is the vector of Lagrange multipliers. As g(x, p) is everywhere zero by construction, we may choose λ freely, f(x) = L(x, p, λ), and

    dp f(x) = dp L = ∂x f dp x + dp λT g + λT (∂x g dp x + ∂p g)
            = fx xp + λT (gx xp + gp)    because g = 0 everywhere
            = (fx + λT gx) xp + λT gp.

If we choose λ so that gxT λ = −fxT, then the first term is zero and we can avoid calculating xp. This condition is the adjoint equation (2). What remains, as in the first derivation, is dp f = λT gp.
1.3 The relationship between the constraint and adjoint equations

Suppose g(x, p) = 0 is the linear (in x) equation A(p)x − b(p) = 0. As ∂x g = A(p), the adjoint equation is A(p)T λ = −fxT. The two equations differ in form only by the adjoint.

If g(x, p) = 0 is a nonlinear equation, then software that solves the system for x given a particular value for p quite likely solves, at least approximately, a sequence of linear equations of the form

    gx(x, p) Δx = −g(x, p).    (3)

∂x g = gx is the Jacobian matrix for the function g(x, p), and (3) is the linear system that gives the step Δx used to update x in Newton's method. The adjoint equation gxT λ = −fxT is a linear system that differs in form from (3) only by the adjoint operation.
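A practical consequence: whatever factorization of gx the forward solver computes for its final Newton step can be reused, transposed, for the adjoint solve, so the gradient costs little beyond the forward solve itself. A minimal SciPy sketch, assuming g, g_x, f_x, x, and p are defined as in the earlier example:

    from scipy.linalg import lu_factor, lu_solve

    lu, piv = lu_factor(g_x(x, p))                 # factor the Jacobian once
    dx = lu_solve((lu, piv), -g(x, p))             # Newton step (3): g_x dx = -g
    lam = lu_solve((lu, piv), -f_x(x), trans=1)    # adjoint solve (2): g_x^T lam = -f_x^T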
1.4 f is a function of both x and p

Suppose our function is f(x, p) and we still have g(x, p) = 0. How does this change the calculations? As

    dp f = fx xp + fp = λT gp + fp,

the calculation changes only by the term fp, which usually is no harder to compute in terms of computational complexity than fx.

f directly depends on p, for example, when the modeler wishes to weight or penalize certain parameters. For example, suppose f originally measures the misfit between simulated and measured data; then f depends directly only on x. But suppose the model parameters p vary over space and the modeler prefers smooth distributions of p. Then a term can be added to f that penalizes nonsmooth p values.
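In the sketch of Section 1.2, the only change needed is the extra term; f_p here stands for a hypothetical user-supplied routine returning ∂p f as a row vector:

    dpf = lam @ g_p(x, p) + f_p(x, p)   # dp f = lam^T g_p + f_p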
1.5 Partial derivatives

We have seen that ∂x g is the Jacobian matrix for the nonlinear function g(x, p) for fixed p. To obtain the gradient dp f, ∂p g is also needed. This quantity generally is no harder to calculate than gx. But it will almost certainly require writing additional code, as the original software to solve just g(x, p) = 0 does not require it.
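Because ∂p g is typically new, hand-written code, it is prudent to verify it against finite differences before trusting any adjoint gradient. A generic sketch (g and g_p stand for the user's own routines):

    import numpy as np

    def check_g_p(g, g_p, x, p, eps=1e-6):
        # Compare the hand-coded Jacobian dg/dp against central differences.
        fd = np.column_stack([(g(x, p + eps * e) - g(x, p - eps * e)) / (2 * eps)
                              for e in np.eye(len(p))])
        return np.max(np.abs(fd - g_p(x, p)))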
2 PDE-constrained optimization problems

Partial differential equations are used to model physical processes. Optimization over a PDE arises in at least two broad contexts: determining parameters of a PDE-based model so that the field values match observations (an inverse problem); and design optimization: for example, of an airplane wing.

A common, straightforward, and very successful approach to solving PDE-constrained optimization problems is to solve the numerical optimization problem resulting from discretizing the PDE. Such problems take the form

    minimize_p  f(x, p)
    subject to  g(x, p) = 0.

An alternative is to discretize the first-order optimality conditions corresponding to the original problem; this approach has been explored in various contexts for theoretical reasons but generally is much harder and is not as practically useful.

Two broad approaches solve the numerical optimization problem. The first approach is that of modern, cutting-edge optimization packages: converge to a feasible solution (g(x, p) = 0) only as f converges to a minimizer. The second approach is to require that x be feasible at every step in p (g(x, p) = 0).

In principle, the first approach is the better one for almost all problems. However, practical considerations turn out to make the second approach the better one in many applications. For example, a research effort may have produced a complicated program to solve g(x, p) = 0 (the PDE or forward problem), and one now wants to solve an optimization problem (inverse problem) using this existing code. Additionally, other properties, particularly of time-dependent problems, can make the first approach very difficult to implement.

In the second approach, the problem solver must evaluate f(x, p), solve g(x, p) = 0, and provide the gradient dp f. Section 1 provides the necessary tools, at a high level of generality, to perform the final step. But at least one class of problems deserves some additional discussion.
2.1 Time-dependent problems

Time-dependent problems have special structure for two reasons. First, the matrices of partial derivatives have very strong block structure; we shall not discuss this low-level topic here. Second, and the subject of this section, time-dependent problems are often treated by semi-discretization: the spatial derivatives are made explicit in the various operators, but the time integration is treated as being continuous; this method of lines induces a system of ODE. The method-of-lines treatment has two implications. First, the adjoint equation for the problem is also an ODE induced by the method of lines, and the derivation of the adjoint equation must reflect that. Second, the forward and adjoint ODE can be solved by standard adaptive ODE integrators.

2.1.1 The adjoint method for the first-order problem

Consider the problem

    minimize_p F(x, p), where F(x, p) ≡ ∫₀T f(x, p, t) dt,
    subject to h(x, ẋ, p, t) = 0    (4)
               g(x(0), p) = 0,

where p is a vector of unknown parameters; x is a (possibly vector-valued) function of time; h(x, ẋ, p, t) = 0 is an ODE in implicit form; and g(x(0), p) = 0 is the initial condition, which is a function of some of the unknown parameters. The ODE h may be the result of semi-discretizing a PDE, which means that the PDE has been discretized in space but not time. An ODE in explicit form appears as ẋ = h̄(x, p, t), and so the implicit form is h(x, ẋ, p, t) = ẋ − h̄(x, p, t).

A gradient-based optimization algorithm requires the user to calculate the total derivative (gradient)

    dp F(x, p) = ∫₀T [∂x f dp x + ∂p f] dt.

Calculating dp x is difficult in most cases. As in Section 1, two common approaches simply do away with having to calculate it. One approach is to approximate the gradient dp F(x, p) by finite differences over p. Generally, this requires integrating np additional ODE. The second method is to develop a second ODE, this one in the adjoint vector λ, that is instrumental in calculating the gradient. The benefit of the second approach is that the total work of computing F and its gradient is approximately equivalent to integrating only two ODE.

The first step is to introduce the Lagrangian corresponding to the optimization problem:

    L ≡ ∫₀T [f(x, p, t) + λT h(x, ẋ, p, t)] dt + µT g(x(0), p).

The vector of Lagrange multipliers λ is a function of time, and µ is another vector of multipliers associated with the initial conditions. Because the two constraints h = 0 and g = 0 are always satisfied by construction, we are free to set the values of λ and µ, and dp L = dp F. Taking this total derivative,

    dp L = ∫₀T [∂x f dp x + ∂p f + λT (∂x h dp x + ∂ẋ h dp ẋ + ∂p h)] dt
           + µT (∂x(0) g dp x(0) + ∂p g).    (5)

The integrand contains terms in dp x and dp ẋ. The next step is to integrate by parts to eliminate the second one:

    ∫₀T λT ∂ẋ h dp ẋ dt = λT ∂ẋ h dp x |₀T − ∫₀T [λ̇T ∂ẋ h + λT dt ∂ẋ h] dp x dt.    (6)

Substituting this result into (5) and collecting terms in dp x and dp x(0) yield

    dp L = ∫₀T [(∂x f + λT (∂x h − dt ∂ẋ h) − λ̇T ∂ẋ h) dp x + fp + λT ∂p h] dt
           + λT ∂ẋ h dp x |t=T + (−λT ∂ẋ h + µT gx(0)) |t=0 dp x(0) + µT gp.

As we have already discussed, dp x(T) is difficult to calculate. Therefore, we set λ(T) = 0 so that the whole term is zero. Similarly, we set µT = λT ∂ẋ h|t=0 gx(0)⁻¹ to cancel the second-to-last term. Finally, we can avoid computing dp x at all other times t > 0 by setting

    ∂x f + λT (∂x h − dt ∂ẋ h) − λ̇T ∂ẋ h = 0.

The algorithm for computing dp F follows:

1. Integrate h(x, ẋ, p, t) = 0 for x from t = 0 to T with initial conditions g(x(0), p) = 0.

2. Integrate ∂x f + λT (∂x h − dt ∂ẋ h) − λ̇T ∂ẋ h = 0 for λ from t = T to 0 with initial conditions λ(T) = 0.

3. Set

    dp F = ∫₀T [fp + λT ∂p h] dt + λT ∂ẋ h |t=0 gx(0)⁻¹ gp.
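Here is a minimal SciPy sketch of steps 1 to 3 on a hypothetical scalar test problem: h(x, ẋ, p, t) = ẋ + p₀x (so h̄ = −p₀x), g(x(0), p) = x(0) − p₁, and f = x²; the problem, tolerances, and parameter values are invented for illustration. For this choice the adjoint ODE of step 2 is λ̇ = 2x + p₀λ, and step 3 reduces to dp F = (∫₀T λx dt, −λ(0)).

    import numpy as np
    from scipy.integrate import solve_ivp, trapezoid

    T = 2.0
    p = np.array([0.7, 1.3])

    def gradient(p):
        # Step 1: forward ODE xdot = -p0 x with x(0) = p1.
        fwd = solve_ivp(lambda t, x: -p[0] * x, (0.0, T), [p[1]],
                        dense_output=True, rtol=1e-10, atol=1e-12)
        # Step 2: adjoint ODE lamdot = 2x + p0 lam, integrated from lam(T) = 0 to t = 0.
        adj = solve_ivp(lambda t, lam: 2.0 * fwd.sol(t)[0] + p[0] * lam, (T, 0.0), [0.0],
                        dense_output=True, rtol=1e-10, atol=1e-12)
        # Step 3: dp F = int [fp + lam hp] dt + lam(0) g_x(0)^{-1} gp,
        # with fp = 0, hp = (x, 0), and gp = (0, -1).
        ts = np.linspace(0.0, T, 2001)
        lam, x = adj.sol(ts)[0], fwd.sol(ts)[0]
        return np.array([trapezoid(lam * x, ts), -adj.sol(0.0)[0]])

    def F(p):
        # The objective itself, for a finite-difference check.
        sol = solve_ivp(lambda t, x: -p[0] * x, (0.0, T), [p[1]],
                        dense_output=True, rtol=1e-10, atol=1e-12)
        ts = np.linspace(0.0, T, 2001)
        return trapezoid(sol.sol(ts)[0] ** 2, ts)

    eps = 1e-6
    fd = [(F(p + eps * e) - F(p - eps * e)) / (2 * eps) for e in np.eye(2)]
    print(gradient(p), fd)   # adjoint and finite-difference gradients should agree

The finite-difference check costs two extra forward integrations per parameter, whereas the adjoint gradient costs one backward integration regardless of np.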
2.1.2 The relationship between the constraint and adjoint equations

Suppose h(x, ẋ, p, t) is the first-order explicit linear ODE h = ẋ − A(p)x − b(p). Then hx = −A(p) and hẋ = I, and so the adjoint equation is fx − λT A(p) − λ̇T = 0. The adjoint equation is solved backward in time from T to 0. Let τ ≡ T − t; hence dt = −dτ. Denote the total derivative with respect to τ by a prime. Rearranging terms in the two equations,

    ẋ = A(p)x + b(p)
    λ′ = A(p)T λ − fxT.

The equations differ in form only by an adjoint.
2.1.3 A simple closed-form example

Let f = x, h(x, ẋ, p, t) = ẋ − bx, and g(x(0), p) = x(0) − a, with the parameter vector p = (a, b). The ODE is ẋ = bx with x(0) = a, and so x(t) = a e^{bt}. The adjoint equation of step 2 is 1 − bλ − λ̇ = 0 with λ(T) = 0, whose solution is λ(t) = b⁻¹(1 − e^{b(T−t)}). With fp = 0, ∂ẋ h = 1, ∂p h = (0, −x), gx(0) = 1, and gp = (−1, 0), step 3 gives

    da F = −λ(0) = b⁻¹(e^{bT} − 1)
    db F = −∫₀T λx dt = a b⁻¹ (T e^{bT} − b⁻¹(e^{bT} − 1)).

As a check, let us calculate the total derivative directly. The objective is

    ∫₀T x dt = ∫₀T a e^{bt} dt = (a/b)(e^{bT} − 1).

Taking the derivative of this expression with respect to a and, separately, b yields the same results we obtained by the adjoint method.
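The closed-form expressions are easy to sanity-check numerically as well; a few throwaway lines (the values of a, b, and T are arbitrary):

    import numpy as np

    a, b, T, eps = 2.0, 0.5, 1.0, 1e-7
    F = lambda a, b: a / b * (np.exp(b * T) - 1.0)
    daF = (np.exp(b * T) - 1.0) / b                                  # adjoint result
    dbF = a / b * (T * np.exp(b * T) - (np.exp(b * T) - 1.0) / b)    # adjoint result
    print(daF, (F(a + eps, b) - F(a - eps, b)) / (2 * eps))
    print(dbF, (F(a, b + eps) - F(a, b - eps)) / (2 * eps))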
2.1.4 The adjoint method for the second-order problem

The derivation in this section follows that in Section 2.1.1 for the first-order problem. For simplicity, we assume the ODE can be written in explicit form. The general problem is

    minimize_m F(s, m), where F(s, m) ≡ ∫₀T f(s, m, t) dt,
    subject to s̈ = h̄(s, m, t),

together with initial conditions on s(0) and ṡ(0) that may depend on m. The Lagrangian is

    L ≡ ∫₀T [f(s, m, t) + λT (s̈ − h̄(s, m, t))] dt

plus multiplier terms for the initial conditions. Taking dm L and integrating the term in λT dm s̈ by parts twice gives

    ∫₀T λT dm s̈ dt = [λT dm ṡ − λ̇T dm s]₀T + ∫₀T λ̈T dm s dt.

Choosing λ(T) = λ̇(T) = 0 eliminates the boundary terms at t = T; the terms at t = 0 are canceled by suitable choices of the initial-condition multipliers, just as µ cancels the dp x(0) term in Section 2.1.1. Setting the coefficient of dm s in the time integrand to zero, fs + λ̈T − λT h̄s = 0, and transposing, the adjoint equation is

    λ̈ = h̄sT λ − fsT
    λ(T) = λ̇(T) = 0.
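An alternative route to the same equation, using only the first-order machinery of Section 2.1.1, is to rewrite the second-order ODE as a first-order system. With y ≡ (s, ṡ) and

    h(y, ẏ, m, t) ≡ ẏ − (y₂, h̄(y₁, m, t)),

the first-order adjoint equation for the multiplier pair (λ₁, λ₂) has the components λ₁ = −λ̇₂ and fs − λ₂T h̄s − λ̇₁T = 0. Eliminating λ₁ gives λ̈₂ = h̄sT λ₂ − fsT, and the terminal condition λ₁(T) = λ₂(T) = 0 gives λ₂(T) = λ̇₂(T) = 0; that is, λ₂ satisfies exactly the second-order adjoint equation above.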
For simplicity, suppose h̄(s, m, t) = A(m)s + b(m). So far we have viewed A(m) as being a matrix that results from discretizing a PDE. This is often the best way to view the adjoint problem in practice: the gradient of interest is that of the problem on the computer, which in general must be a discretized representation of the original continuous problem. However, it is still helpful to see how the adjoint method is applied to the fully continuous problem.

We shall continue to use the notation A(m) to denote the spatial operator, but now we view it as something like A(x; m) = α(x; m)∇ · (β(x; m)∇), where α and β are functions of space x and parameterized by the model parameters m, and ∇ is the gradient operator.

The key difference between the discretized and continuous problems is the inner product between the Lagrange multiplier λ and the fields. In the discretized problem, we write λT A(m)s; in the continuous problem, we write ∫_Ω λ(x) A(x; m) s(x) dx, where Ω is the domain over which the fields are defined. Here we assume s(x) is a scalar field for clarity. Then the Lagrangian analogous to the one above is

    L ≡ ∫₀T ∫_Ω [f(s, m, t) + λ(s̈ − h̄(s, m, t))] dx dt + ⋯,

where the omitted terms handle the initial conditions as before. In general, the derivations we have seen so far can be carried out with spatial integrals replacing the discrete inner products.