Adjoint Tutorial PDF
Adjoint Tutorial PDF
Adjoint Tutorial PDF
Bradley June 4, 2013 (original November 16, 2010) PDE-constrained optimization and the adjoint method for solving these and related problems appear in a wide range of application domains. Often the adjoint method is used in an application without explanation. The purpose of this tutorial is to explain the method in detail in a general setting that is kept as simple as possible. We use the following notation: the total derivative (gradient) is denoted dx (usually denoted d()/dx or x ); the partial derivative, x (usually, ()/x ); the dierential, d. We also use the notation fx for both partial and total derivatives when we think the meaning is clear from context. Recall that a gradient is a row vector, and this convention induces sizing conventions for the other operators. We use only real numbers in this presentation.
One method to approximate dp f is to compute np nite dierences over the elements of p. Each nite dierence computation requires solving g (x, p) = 0. For moderate to large np , this can be quite costly. In the program to solve g (x, p) = 0, it is likely that the Jacobian matrix x g is calculated (see Sections 1.3 and 1.5 for further details). The adjoint method uses T , to compute the gradient dp f . The computational the transpose of this matrix, gx cost is usually no greater than solving g (x, p) = 0 once and sometimes even less costly.
1.2
Derivation
In this section, we consider the slightly simpler function f (x); see below for the full case. First, dp f = dp f (x(p)) = x f dp x (= fx xp ). (1)
Second, g (x, p) = 0 everywhere implies dp g = 0. (Note carefully that dp g = 0 everywhere only because g = 0 everywhere. It is certainly not the case that a function that happens to be 0 at a point also has a 0 gradient there.) Expanding the total derivative, gx xp + gp = 0.
1 As gx is everywhere nonsingular, the nal equality implies xp = gx gp . Substituting this latter relationship into (1) yields 1 dp f = fx gx gp . 1 The expression fx gx is a row vector times an nx nx matrix and may be understood in terms of linear algebra as the solution to the linear equation T T gx = fx ,
Let x Rnx and p Rnp . Suppose we have the function f (x, p) : Rnx Rnp R and the relationship g (x, p) = 0 for a function g : Rnx Rnp Rnx whose partial derivative gx is everywhere nonsingular. What is dp f ?
1.1
Motivation
The equation g (x, p) = 0 is often solved by a complicated software program that implements what is sometimes called a simulation or the forward problem. Given values for the parameters p, the program computes the values x. For example, p could parameterize boundary and initial conditions and material properties for a discretized PDE, and x are the resulting eld values. f (x, p) is often a measure of merit: for example, t of x to data at a set of locations, the smoothness of x or p, the degree to which p attains a particular objective. Minimizing f is sometimes called the inverse problem. Later we shall discuss seismic tomography. In this application, x are the eld values in the wave (also referred to as the acoustic, second-order linear hyperbolic, second-order wave, etc.) equation, p parameterizes the earth model and initial and boundary conditions, f measures the dierence between measured and synthetic waveforms, and g encodes the wave equation, initial and boundary conditions, and generation of synthetic waveforms. The gradient dp f is useful in many contexts: for example, to solve the optimization problem minp f or to assess the sensitivity of f to the elements of p. 1
(2)
where T is the matrix transpose. The matrix conjugate transpose (just the transpose when working with reals) is also called the matrix adjoint, and for this reason, the vector is called the vector of adjoint variables and the linear equation (2) is called the adjoint equation. In terms of , dp f = T gp . A second derivation is useful and in practice tidier to use. Dene the Lagrangian L(x, p, ) f (x) + T g (x, p),
where in this context is the vector of Lagrange multipliers. As g (x, p) is everywhere zero by construction, we may choose freely, f (x) = L(x, p, ), and dp f (x) = dp L = x f dp x + dp T g + T (x g dp x + p g ) = fx xp + T (gx xp + gp ) = (fx + gx )xp + gp .
T T , then the rst term is zero and we can avoid = f x If we choose so that gx calculating xp . This condition is the adjoint equation (2). What remains, as in the rst derivation, is dp f = T gp . T T
1.5
Partial derivatives
because g = 0 everywhere
We have seen that x g is the Jacobian matrix for the nonlinear function g (x, p) for xed p. To obtain the gradient dp f , p g is also needed. This quantity generally is no harder to calculate than gx . But it will almost certainly require writing additional code, as the original software to solve just g (x, p) = 0 does not require it.
1.3
Suppose g (x, p) = 0 is the linear (in x) equation A(p)x b(p) = 0. As x g = A(p), T . The two equations dier in form only by the adjoint equation is A(p)T = fx the adjoint. If g (x, p) = 0 is a nonlinear equation, then software that solves the system for x given a particular value for p quite likely solves, at least approximately, a sequence of linear equations of the form x g (x, p)x = g (x, p). (3)
Partial dierential equations are used to model physical processes. Optimization over a PDE arises in at least two broad contexts: determining parameters of a PDE-based model so that the eld values match observations (an inverse problem); and design optimization: for example, of an airplane wing. A common, straightforward, and very successful approach to solving PDEconstrained optimization problems is to solve the numerical optimization problem resulting from discretizing the PDE. Such problems take the form minimize f (x, p)
p
subject to g (x, p) = 0. An alternative is to discretize the rst-order optimality conditions corresponding to the original problem; this approach has been explored in various contexts for theoretical reasons but generally is much harder and is not as practically useful a method. Two broad approaches solve the numerical optimization problem. The rst approach is that of modern, cutting-edge optimization packages: converge to a feasible solution (g (x, p) = 0) only as f converges to a minimizer. The second approach is to require that x be feasible at every step in p (g (x, p) = 0). The rst approach is almost certainly the better approach for almost all problems. However, practical considerations turn out to make the second approach the better one in many applications. For example, a research eort may have produced a complicated program to solve g (x, p) = 0 (the PDE or forward problem), and one now wants to solve an optimization problem (inverse problem) using this existing code. Additionally, other properties of particularly time-dependent problems can make the rst approach very dicult to implement. In the second approach, the problem solver must evaluate f (x, p), solve g (x, p) = 0, and provide the gradient dp f . Section 1 provides the necessary tools at a high level of generality to perform the nal step. But at least one class of problems deserves some additional discussion. 2
x g = gx is the Jacobian matrix for the function g (x, p), and (3) is the linear system that gives the step to update x in Newtons method. The adjoint equation T T gx = f x solves a linear system that diers in form from (3) only by the adjoint operation.
1.4
Suppose our function is f (x, p) and we still have g (x, p) = 0. How does this change the calculations? As dp f = fx xp + fp = gp + fp , the calculation changes only by the term fp , which usually is no harder to compute in terms of computational complexity than fx . f directly depends on p, for example, when the modeler wishes to weight or penalize certain parameters. For example, suppose f originally measures the mist between simulated and measured data; then f depends directly only on x. But suppose the model parameters p vary over space and the modeler prefers smooth distributions of p. Then a term can be added to f that penalizes nonsmooth p values.
T
2.1
Time-dependent problems
problem:
T
Time-dependent problems have special structure for two reasons. First, the matrices of partial derivatives have very strong block structure; we shall not discuss this low-level topic here. Second, and the subject of this section, time-dependent problems are often treated by semi-discretization: the spatial derivatives are made explicit in the various operators, but the time integration is treated as being continuous; this method of lines induces a system of ODE. The method-of-lines treatment has two implications. First, the adjoint equation for the problem is also an ODE induced by the method of lines, and the derivation of the adjoint equation must reect that. Second, the forward and adjoint ODE can be solved by standard adaptive ODE integrators. 2.1.1 The adjoint method
L
0
The vector of Lagrangian multipliers is a function of time, and is another vector of multipliers that are associated with the initial conditions. Because the two constraints h = 0 and g = 0 are always satised by construction, we are free to set the values of and , and dp L = dp F . Taking this total derivative,
T
dp L =
0
The integrand contains terms in dp x and dp x . The next step is to integrate by parts to eliminate the second one:
T T
T x dt = T x h dp x h dp x f (x, p, t) dt,
0
T 0
T T x [ h + dt x h] dp x dt.
(6)
where
F (x, p)
0
(4)
Substituting this result into (5) and collecting terms in dp x and dp x(0) yield
T
dp L =
0
T h)dp x + fp + T p h] dt [(x f + T (x h dt x h) x + T x h dp x
T T T + (T x h + gx(0) ) 0 dp x(0) + gp .
where p is a vector of unknown parameters; x is a (possibly vector-valued) function of time; h(x, x, p, t) = 0 is an ODE in implicit form; and g (x(0), p) = 0 is the initial condition, which is a function of some of the unknown parameters. The ODE h may be the result of semi-discretizing a PDE, which means that the PDE has been discretized in space but not time. An ODE in explicit form appears as (x, p, t), and so the implicit form is h(x, x, (x, p, t). x =h p, t) = x h A gradient-based optimization algorithm requires the user to calculate the total derivative (gradient)
T
As we have already discussed, dp x(T ) is dicult to calculate. Therefore, we set 1 (T ) = 0 so that the whole term is zero. Similarly, we set T = T x h|0 gx(0) to cancel the second-to-last term. Finally, we can avoid computing dp x at all other times t > 0 by setting T h = 0. x f + T (x h dt x h) x The algorithm for computing dp F follows: 1. Integrate h(x, x, p, t) = 0 for x from t = 0 to T with initial conditions g (x(0), p) = 0. T h = 0 for from t = T to 0 with 2. Integrate x f + T (x h dt x h) x initial conditions (T ) = 0. 3. Set
T
dp F (x, p) =
0
[x f dp x + p f ] dt.
Calculating dp x is dicult in most cases. As in Section 1, two common approaches simply do away with having to calculate it. One approach is to approximate the gradient dp F (x, p) by nite dierences over p. Generally, this requires integrating np additional ODE. The second method is to develop a second ODE, this one in the adjoint vector , that is instrumental in calculating the gradient. The benet of the second approach is that the total work of computing F and its gradient is approximately equivalent to integrating only two ODE. The rst step is to introduce the Lagrangian corresponding to the optimization 3
dp F =
0
1 [fp + T p h] dt + T x h 0 gx(0) gp .
2.1.2
Suppose h(x, x, p, t) is the rst-order explicit linear ODE h = x A(p)x b(p). T T Then hx = A(p) and hx = I , and so the adjoint equation is fx A(p) = 0. The adjoint equation is solved backward in time from T to 0. Let T t; hence dt = d . Denote the total derivative with respect to by a prime. Rearranging terms in the two equations, x = A(p)x + b(p)
T = A(p)T fx .
x dt =
0 0
aebt dt =
a bT (e 1). b
Taking the derivative of this expression with respect to a and, separately, b yields the same results we obtained by the adjoint method.
2.2
The equations dier in form only by an adjoint. 2.1.3 A simple closed-form example
Seismic tomography images the earth by solving an inverse problem. One formulation of the problem, following [Tromp, Tape, & Liu, 2005] and hiding many details in the linear operator A(m), is minimize (s), where (s) 1 2
N 0 T
s(yr , t) d(yr , t)
r =1
dt,
subject to s = A(m)s + b(m) x dt g (s(0), m) = 0 k (s (0), m) = 0. where s and d are synthetic and recorded three-component waveform data, the synthetic data are recorded at N stations located at yr , m is a vector of model components that parameterize the earth model and initial and boundary conditions, A(m) is the spatial discretization of the PDE, b(m) contains the discretized boundary conditions and source terms, and g and k give initial conditions. First, let us identify the components of the generic problem (4). The elds x are now s, m is p, the integrand f (s, m, t) 1 2
N
subject to x = bx x(0) a = 0. Here, p = [a b]T and g (x(0), p) = x(0) a. We follow each step: 1. Integrating the ODE yields x(t) = aebt . 2. f (x, p, t) x and so x f = 1. Similarly, h(x, x, p, t) x bx, and so x h = b and x h = 1. Therefore, we must integrate =0 1 b (T ) = 0, which yields (t) = b1 (1 eb(T t) ). 3. p f = [0 0], p h = [0 x], gx(0) = 1, and gp = [1 0]. Therefore, the rst component of the gradient is
1 1 T x (1) = b1 (1 + ebT ); h 0 gx(0) gp = (0) 1 1
s(yr , t) d(yr , t) 2 ,
r =1
and the dierential equation in implicit form is h(s, s , m, t) s A(m)s b(m). The only dierence between the models is (4) has a dierential equation that is rst order in time, whereas in the tomography problem, the dierential equation (the second-order linear wave equation) is second order in time. Hence initial conditions must be specied for both s and s , and the integration by parts in (6) must be done twice. Following the same methodology as in Section 2.1.1 (see Section 2.2.1 for further details), the adjoint equation is = A(m)T fs (T ) = 0. (T ) = 4 (7)
and as b g = 0,
T T
x dt =
0 0
b1 (eb(T t) 1)aebt dt =
a a bT T e 2 (ebT 1) b b
The term
N
which diers from the Lagrangian for the rst-order problem by the term for the additional initial condition and the simplied form of h. The total derivative is (s(yr , t) d(yr , t)) (y yr ), (8) dm L =
0 T
fs (y, t) =
r =1
where is the delta function. [Tromp, Tape, & Liu, 2005] call this equation waveform adjoint source. Observe again that the forward and adjoint equations dier in form only by the adjoint. The gradient
T
dm =
0 T
1 T |0 g 1 gm [m + T hm ] dt + T hs |0 ks (0) km hs s(0) 0
T dm s dt = T dm s (9)
T 0
T dm s
+
0 0
T dm s dt.
=
0
[0 + m A(m)s] dt + (0)
1 ks (0) km
(0)T g 1 gm . s(0)
We slightly abuse notation by writing m A(m)s; in terms of the discretization of the wave equation and using Matlab notation to extract the columns of A(m), this expression is more correctly written m A(m)s =
i
dm L =
0
(0)T g 1 T = s(0)
1 T = (0)T ks (0)
m A(m):,i si .
m = 0 because the model variables m do not enter the objective. Additionally, the initial conditions are the earthquake and so typically are also independent of m. Hence (9) can be simplied to
T
+ T dm s |T T dm s|T + T gm + T km .
(T ) = 0 (T ) = 0
dm =
0
m A(m)s dt.
We have indicated the suitable multiplier values to the right of each term. Putting everything together, the adjoint equation is =h T f T s s (T ) = 0 (T ) = and the total derivative of F is
T
2.2.1
The derivation in this section follows that in Section 2.1.1 for the rst-order problem. For simplicity, we assume the ODE can be written in explicit form. The general problem is
T
dm F = dm L =
0
where
F (s, m)
0
f (s, m, t) dt,
2.2.2
So far we have viewed A(m) as being a matrix that results from discretizing a PDE. This is the proper way to view the adjoint problem in practice: the gradient of interest is that of the problem on the computer, which in general must be a discretized representation of the original continuous problem. However, it is still helpful to see how the adjoint method is applied to the fully continuous problem. We shall continue to use the notation A(m) to donote the spatial operator, but now we view it as something like A(x; m) = (x; m) ( (x; m)), where and 5
L
0
are functions of space x and parameterized by the model parameters m, and is the gradient operator. The key dierence between the discretized and continuous problems is the inner product between the Lagrange multiplier and the elds. In the discretized problem, we write T A(m)s; in the continuous problem, we write (x)A(x; m)s(x) dx, where is the domain over which the elds are dened. Here we assume s(x) is a scalar eld for clarity. Then the Lagrangian like (10) is
T
L
0
f (s, m, t) +
(s, m, t)) dx ( sh
dt +
In general, the derivations we have seen so far can be carried out with spatial integrals replacing the discrete inner products.
References
J. Tromp, C. Tape, Q. Liu, Seismic tomography, adjoint methods, time reversal and banana-doughnut kernels, Geophys. J. Int. (2005) 160, 195216. Y. Cao, S. Li, L. Petzold, R. Serban, Adjoint sensitivity analysis for dierentialalgebraic equations: The adjoint DAE system and its numerical solution, SIAM J. Sci. Comput. (2003) 23(3), 10761089.