You Only Linearize Once: Tangents Transpose to Gradients
Alexey Radul, Adam Paszke, Roy Frostig, Matthew J. Johnson, Dougal Maclaurin
1 INTRODUCTION
Automatic differentiation (AD) powers not only deep machine learning, but also applications
in broader numerical computing, from robotics to molecular dynamics to weather simulation.
For example, in deep learning the task of choosing how to set the billions of parameters of a
neural network is almost always attacked with AD: a programmer writes a prediction function
representing how the network maps parameters and example inputs to predicted outputs, as well as
a scalar-valued loss function penalizing deviations of the predicted outputs from the expected ones. Then
to incrementally improve a given setting of the parameters, the programmer need only ask an AD
system for the gradient of the loss with respect to the parameters on some example inputs and
outputs, then update the parameters by adding a small negative multiple of that gradient. Repeat a
million times, et voilà, la singularité est proche!
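In JAX, for instance, the whole loop fits in a few lines. The sketch below is purely illustrative: the prediction function, loss, data, and step size are placeholders, not anything defined in this paper.

import jax
import jax.numpy as jnp

# Illustrative placeholders: a tiny prediction function, a squared-error loss,
# and one gradient-descent step of the kind described above.
def predict(params, x):
    return jnp.tanh(x @ params["w"] + params["b"])

def loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

def update(params, x, y, step_size=1e-2):
    grads = jax.grad(loss)(params, x, y)      # ask the AD system for the gradient
    # add a small negative multiple of the gradient to every parameter
    return jax.tree_util.tree_map(lambda p, g: p - step_size * g, params, grads)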
AD is traditionally organized into two modes, forward-mode and reverse-mode. Each mode
has its advantages and applications. But most systems only implement one mode, or if both are
implemented, the implementations are almost entirely separate.
Do they need to be? After all, reverse-mode AD is similar to forward-mode AD, except one
computes the derivative part backward. We take this description literally: Our goal is to separate
differentiation proper from direction reversal, and identify a clear boundary between them. The
“direction reversal” is transposition, an interesting and potentially useful code transformation in
its own right. The boundary between differentiation and transposition consists of provably linear
functions: linearity (proven by our type system) is a sufficient condition for transposability, and we
show that automatic derivatives are provably linear.
This simplifies implementations of reverse-mode AD (see Figure 1). The actual differentiation is
the province of the widely known and relatively simple (and covariant!) forward-mode transforma-
tion. Reverse mode is derived therefrom by a separate transformation that now must only confront
linearity, not derivatives. The notion of linearity we formalize here serves as a simpler and clearer
(and mechanically verifiable) input invariant for transposition than one of the form “this program
was generated by forward-mode AD, so it involves no unacceptable thing X.”
This same architecture also makes AD systems easier for a user to extend. A user wishing to
supply an efficient custom derivative for some function need only supply the forward (or reverse)
derivative; as long as it is provably linear, the reverse (resp., forward) derivative can be derived
automatically.
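JAX works this way today: a user supplies only a forward (JVP) rule via jax.custom_jvp, and the reverse-mode rule follows by transposing it. A minimal sketch, assuming an illustrative function log1pexp that is not an example from this paper:

import jax
import jax.numpy as jnp

@jax.custom_jvp
def log1pexp(x):                      # illustrative user-defined function
    return jnp.log1p(jnp.exp(x))

@log1pexp.defjvp
def log1pexp_jvp(primals, tangents):
    (x,), (x_dot,) = primals, tangents
    y = log1pexp(x)
    # The tangent expression is linear in x_dot, so it can be transposed
    # mechanically; no separate reverse-mode rule needs to be written.
    return y, (jnp.exp(x) / (1.0 + jnp.exp(x))) * x_dot

print(jax.grad(log1pexp)(3.0))        # reverse mode, derived from the forward rule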
Reverse-mode automatic differentiation is already implemented in the proposed way in two
systems we know of: JAX1 [Bradbury et al. 2018; Frostig et al. 2018] and Dex [Paszke et al. 2021]. Our
hope is to make the technique easier to maintain within these systems, by formalizing the provable
linearity invariant; and to make it more widely accessible outside of them by describing it on a
simple, stand-alone object language. We also hope to pave the way for exposing transposition as a
user-accessible higher-order function in its own right, by clarifying its input and output invariants.
More concretely, our contributions are that:
• We explicitly decompose reverse-mode automatic differentiation into forward mode (Sec-
tion 5), a new unzipping transformation (Section 6), and transposition (Section 7).
• We introduce a new linearly-typed intermediate language for linear computations (Section 4),
whose type system captures the provable linearity invariant. We call the language “Linear A.”
• We define and prove correctness for unzipping and transposition, and we prove that they
preserve work, code size, and provable linearity. Together with the known results on correct-
ness and work- and size-preservation for forward mode, this implies correctness and work-
and size-preservation of reverse mode when implemented this way.2
2 PRELIMINARIES
2.1 Mathematical Differentiation Operations Corresponding to AD’s Two Modes
At its core, AD brings two particular differentiation operations from mathematical functions to
programming language functions. These two operations correspond to the forward and reverse
modes. The first operation, the Jacobian-vector product (JVP), underlies forward-mode and answers
1 The jax.grad function is exactly a composition of three functions that implement the steps we outline in this work.
2 In our cost model, transposition and unzipping preserve work exactly, and program size up to a constant factor. Forward mode, as usual, only preserves work up to a constant factor.
[Figure 1 panels: (a) 𝑓; (b) forward derivative; (c) forward derivative, unzipped; (d) derivative, transposed; (e) reverse derivative.]
the question: if I perturb my input 𝑥 a bit in some direction 𝑣, how does my output change? More
precisely, given a sufficiently nice mathematical function 𝑓 : R𝑛 → R𝑚 , we can define the Jacobian
at 𝑥 ∈ R𝑛 as the linear map 𝜕𝑓 (𝑥) : R𝑛 → R𝑚 such that
𝑓 (𝑥 + 𝑣) = 𝑓 (𝑥) + 𝜕𝑓 (𝑥)(𝑣) + 𝑜 (∥𝑣 ∥), ∀𝑣 ∈ R𝑛 .
We then say a Jacobian-vector product for 𝑓 is the mapping
(𝑥, 𝑣) ↦→ (𝑓 (𝑥), 𝜕𝑓 (𝑥)(𝑣)).
This definition is compositional, in that the JVP of 𝑓 ◦ 𝑔 can be expressed by applying the JVP of 𝑓
to the result of the JVP of 𝑔. It also allows for efficient evaluation, as we make precise in the sequel.
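In JAX this pair is what jax.jvp computes. A minimal sketch, with an illustrative 𝑓 and inputs standing in for the "sufficiently nice" data:

import jax
import jax.numpy as jnp

def f(x):                       # an illustrative function R^n -> R^m
    return jnp.sin(x) * x

x = jnp.array([1.0, 2.0])       # point of linearization
v = jnp.array([0.1, 0.0])       # direction of perturbation

# jax.jvp returns exactly the pair (f(x), ∂f(x)(v)).
y, dy = jax.jvp(f, (x,), (v,))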
The second core mathematical operation represented in automatic differentiation is the vector-
Jacobian product (VJP), which underlies reverse-mode. It answers a subtler question: given a linear
function on small perturbations to a function’s output, what is the corresponding linear function on
small perturbations to the input? That is, for some fixed 𝑥, given a vector 𝑢 ∈ R𝑚 , find the vector
𝑤 ∈ R𝑛 such that
⟨𝑢, 𝜕𝑓 (𝑥)(𝑣)⟩ = ⟨𝑤, 𝑣⟩, ∀𝑣 ∈ R𝑛 ,
where ⟨·, ·⟩ denotes the standard inner product. This mapping from 𝑢 to 𝑤 is in fact linear, so we
can define a new linear map 𝜕𝑓 (𝑥)𝑇 : R𝑚 → R𝑛 by
⟨𝑢, 𝜕𝑓 (𝑥)(𝑣)⟩ = ⟨𝜕𝑓 (𝑥)𝑇 (𝑢), 𝑣⟩.
We then say the vector-Jacobian product for 𝑓 is the mapping
𝑥 ↦→ (𝑓 (𝑥), 𝑢 ↦→ 𝜕𝑓 (𝑥)𝑇 (𝑢)).
This choice of definition is also compositional: the VJP of 𝑓 ◦ 𝑔 can be expressed by composing the
VJPs of 𝑓 and 𝑔, though the linear functions must be composed in reverse order as 𝜕𝑔(𝑥)𝑇 ◦ 𝜕𝑓 (𝑥)𝑇 .
Hence the name "reverse-mode"!
The VJP is the more useful operation for gradient-based optimization, like in training neural
networks. That’s because with one VJP we compute the direction in parameter space orthogonal
to the loss’s local level sets. That is, for a function 𝑓 : R𝑛 → R, the gradient is simply ∇𝑓 (𝑥) =
𝜕𝑓 (𝑥)𝑇 (1) ∈ R𝑛 . The gradient can also be computed using forward-mode, but then it requires 𝑛
evaluations, losing the asymptotic efficiency of AD.
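A minimal JAX sketch of the VJP and of the gradient computed from it; the loss function and the point x below are illustrative placeholders:

import jax
import jax.numpy as jnp

def loss(x):                          # illustrative f : R^n -> R
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.array([1.0, 2.0, 3.0])

# jax.vjp returns f(x) together with the linear map u -> ∂f(x)^T(u).
y, loss_vjp = jax.vjp(loss, x)
(grad_x,) = loss_vjp(1.0)             # the gradient is ∂f(x)^T applied to 1

assert jnp.allclose(grad_x, jax.grad(loss)(x))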
The implementation of reverse-mode is often considered much harder than that of forward-mode,
with little shared code between the two. Yet, the “real work” of differentiation is embedded in the
linear functions 𝜕𝑓 (𝑥) and 𝜕𝑓 (𝑥)𝑇 , which are just transposes of each other.
2.2 Linearity
Linearity is the key to relating forward and reverse modes. The word “linear” actually has two
relevant meanings when discussing a programming language for numerical computations:
1. From substructural logic and linear type systems, a subprogram 𝑃 (𝑥) is substructurally linear
if it uses its argument 𝑥 exactly once.
2. From linear algebra, a (mathematical) function 𝑓 (𝑥) is algebraically linear if it is a vector
space homomorphism, i.e., if 𝑓 (𝑥 + 𝑦) = 𝑓 (𝑥) + 𝑓 (𝑦) and 𝑓 (𝑐𝑥) = 𝑐 𝑓 (𝑥) for scalar 𝑐.
These two notions are not named the same by accident. Indeed, when is a polynomial, written as
a sum of products of variables (no exponentiation), algebraically linear in 𝑥? Exactly when every
nonzero monomial term uses 𝑥 exactly once. This is how, when we get to Theorem 4.1, we will be
able to prove algebraic linearity using an appropriately designed substructural type system.
The Jacobian-vector product (JVP) map at a point is, by mathematical definition, linear alge-
braically, but it also turns out (as we will also show in Section 5) that its instantiation in AD is
linear substructurally. That substructural linearity will, in turn, allow us to mechanically transpose
(Section 7) the program for computing 𝑓 into another (also substructurally linear) program that com-
putes 𝑓 T with the same performance. This then recovers reverse-mode AD, which computes the
vector-Jacobian-product (VJP) map.
Part of the point of this exercise is that the transposition transformation we will define in
Section 7 applies to all substructurally linear functions. Substructurally linear functions thus
form an abstraction boundary between differentiation and transposition; and transposition (of
user-defined functions) could then be exposed as a code transformation directly.
Note that there are algebraically linear functions that our type system will not accept as sub-
structurally linear, and on which our transposition transformation would fail. For example, the
function 𝑓 (𝑥) = (𝑥 ∗ 𝑥)/𝑥 is linear in 𝑥, but it cannot be typed as linear in Linear A as written.
Moreover, attempting to transpose such a function as we do, by recurring on the subexpressions,
is doomed to failure—the subexpression . . . /𝑥 is not linear in 𝑥 in isolation, so has no transpose.
Fortunately, as we prove in Section 5, automatic differentiation never produces such functions.
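JAX already exposes transposition of provably linear functions directly, as jax.linear_transpose (the realization of the T transform described in Section 7). A minimal sketch, with an illustrative matrix a:

import jax
import jax.numpy as jnp

a = jnp.array([[1.0, 2.0], [3.0, 4.0]])     # illustrative non-linear data

def lin(v):                 # substructurally linear in v: v is consumed exactly once
    return a @ v

# Transpose the linear map; the second argument only supplies input shape/dtype.
lin_t = jax.linear_transpose(lin, jnp.zeros(2))
(u,) = lin_t(jnp.array([1.0, 0.0]))         # equals a.T @ [1., 0.]

# By contrast, lambda x: (x * x) / x is algebraically but not provably linear,
# and asking for its transpose is reported as an error rather than transposed.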
3 NOTATION REFERENCE
By convention, we use an over-dot to name linear terms: 𝑣¤. The goal is to evoke the conventional
physics notation for derivatives, since automatic differentiation is our main topic. Formally, however,
we give the dot itself no structured meaning. So, the variable 𝑣¤ is just a different variable from 𝑣;
any relationship it may have to 𝑣 emerges from our program transformations, rather than from
how we write it.
The double-dot 𝑣¥ connotes co-tangents (not double derivatives), i.e., linear terms that appear in
a transposed function. The nomenclature is from the reverse phase of reverse-mode AD.
The underline 𝑣 means “zero or more of these elements”; we put it below rather than above to
avoid clashing with dots: 𝑣¤. When the same thing appears in different underlined expressions in
the same context, they are parallel. For instance, when a rule mentions both 𝑣 and 𝑣 : 𝜏, those lists
are the same length.
The semicolon (𝑥; 𝑦) separates non-linear entities (on the left) from linear ones (on the right).
The angle brackets ⟨𝑥, 𝑦⟩ mean dot product when defining the meaning of transposition, and
delimit transposition environments when explaining its implementation.
The square brackets 𝑒 [𝑥; 𝑦] mean evaluating an expression with values 𝑥 and 𝑦 bound to its
free non-linear and linear variables respectively.
Fig. 2. Dataflow diagrams of (a) a general Linear A expression 𝑒, and (b) sequential composition of such
expressions. The non-linear results of an expression (left) depend only on its non-linear free variables,
whereas the linear results (right, dotted) depend non-linearly on the non-linear free variables and linearly on
the linear free variables. Composition preserves this (non-)dependence pattern.
4 LANGUAGE
We introduce Linear A, a model language of indexed linear computations. The main idea in Linear
A is that the syntax marks which values are supposed to be linear (substructurally and therefore
algebraically) and which are not. Non-linear values can be computed arbitrarily (and may happen to
be algebraically linear), but may not depend on linear values. Linear values, on the other hand, must
be computed linearly from linear inputs, but may depend on non-linear values through scaling.
This leads to the indexed-linear pattern of data flow shown in Figure 2.
For example, the function 𝑔(𝑥; 𝑦) = (𝑥²; 𝑥²𝑦) is indexed linear. The first result of 𝑔 depends
non-linearly on 𝑥 and does not depend on 𝑦 at all; whereas the second result depends linearly on 𝑦,
but the specific linear function of 𝑦 is indexed by 𝑥. If we tag 𝑥 and the first result as “non-linear”,
and 𝑦 and the second result as “linear”, 𝑔 will type-check as a valid function of Linear A.3
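A quick numeric check of these two claims, using an ordinary Python version of 𝑔; the input values below are arbitrary and purely illustrative:

import jax.numpy as jnp

def g(x, y):
    return x * x, x * x * y     # (non-linear result; linear result)

x, y1, y2, c = 2.0, 3.0, 5.0, 7.0

# The second result is linear in y for each fixed x ...
assert jnp.allclose(g(x, y1 + y2)[1], g(x, y1)[1] + g(x, y2)[1])
assert jnp.allclose(g(x, c * y1)[1], c * g(x, y1)[1])
# ... and the first result does not depend on y at all.
assert g(x, y1)[0] == g(x, y2)[0]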
The Linear A language is not designed as a user-facing language, but rather as a sublanguage
that a given instance of automatic differentiation operates on. To wit, if one seeks to differentiate
some 𝐹 at some point 𝑥, the value of 𝑥 determines the control flow choices in the implementation of
𝐹 , and the program corresponding to the path actually taken (sometimes called a “trace”) becomes
straight-line. Differentiating such straight-line trace programs is the core task of AD, and we
restrict Linear A to be total, functional, and first-order in order to focus our attention on it. While a
more expressive object language would certainly be preferable, the technique of AD by explicit
transposition has not been previously described even in this restricted setting; in Section 10, we
will touch on some obstacles to formalizing transposition on a Turing-complete object language.
Many differentiable array languages are DSLs of the same expressive power as Linear A, embed-
ded in a host language that provides Turing-complete metaprogramming facilities. Furthermore,
reverse-mode automatic differentiation of control flow constructs (including higher-order functions)
cannot be performed without forming a dynamic trace of the computation, and the languages
conventionally used in those traces have expressive power similar to Linear A. For example, Pearl-
mutter and Siskind [2008a] and Wang et al. [2019] represent the trace as a sequence of applications.
Similarly, Krawiec et al. [2022] propose an AD system capable of differentiating a rich higher-order
functional language, but the differentiation involves tracing an expression in a language almost
identical to the linear fragment of Linear A. Finally, the language used internally in the JAX project
Bradbury et al. [2018], which implements our presented method, is also very close to Linear A.
3 It will also type-check if we tag both inputs and both outputs “non-linear,” but not with any other tagging.
The main differences are that JAX has a significantly larger number of mathematical primitives,
as well as some limited control-flow operators (conditionals, and loops with a statically known
trip-count). None of these are fundamental to the presented method and for that reason we elide
them for simplicity. JAX additionally supports array types with statically known shapes, which
are isomorphic to product types in Linear A. On the other hand, Linear A is slightly richer than
JAX’s internal language, as it allows nested let-bindings to have an arbitrary expression on the
right hand side, not only immediate primitive applications.
Many details of Linear A are unconventional, so we include point-by-point rationales. We focus
on the syntax in Section 4.1, then present and discuss the type system in Section 4.2, state our cost
model in Section 4.3, and formally define our notion of indexed linearity in Section 4.4, where we
also prove that the type system enforces it. Finally, Section 4.5 defines the natural linearity erasure
transformation on Linear A, which we will occasionally need later.
4.1 Syntax
The syntax of Linear A appears in Figure 3. We make Linear A a first-order language by only
permitting function definitions at the top level. We make Linear A total by disallowing recursion:
function names are bound only after their definition, as given by the program order. We also
syntactically mark which values are linear vs not. In our notation, this distinction shows up as
returns, binders, and function application forms taking two lists of arguments (or formal parameters):
the non-linear arguments before the semicolon and the linear ones after.
Our function 𝑔(𝑥; 𝑦) = (𝑥²; 𝑥²𝑦) from the previous section would be written:
def g^{1,1}_{1,1}( 𝑥 : R; 𝑦¤ : R ) → (R; R) =
  let( 𝑎 : R; ) = 𝑥 ∗ 𝑥 in
  let( ; 𝑏¤ : R ) = 𝑎 ∗ 𝑦¤ in
  (𝑎; 𝑏¤)
Note that ∗ (as well as +) are available as primitives to both linear and non-linear computations in
Linear A, just with different typing rules.
4.1.1 Tuples and Multiple Returns. We distinguish between product types (written ⊗), which are
first-class in Linear A, and multiple-value returns (written (𝑣; 𝑣¤)), which are the only construct
permitted to mention both linear and non-linear values together. An expression returning multiple
values cannot be bound to one variable, but must instead be bound componentwise. We enforce this
in the syntax by indexing every expression 𝑒𝑚,𝑛 by the number 𝑚 of non-linear results it returns
and the number 𝑛 of linear results it returns. As a short-hand, we write 𝑒 for a one-value non-linear
expression, and 𝑒¤ for a one-value linear expression. Products, in contrast, may be bound to variables,
but must contain only linear or only non-linear values.
Multiple-value returns are a somewhat unusual syntactic choice. In our case, they are convenient,
because our transposition transformation of Section 7 will be running expressions “backwards”.
Since an expression is free to reference more than one variable from its environment, its transpose
must be able to return more than one result; and it’s more elegant to do so directly in the syntax.
4.1.2 A-Normal Form. The Linear A syntax enforces administrative normal form [Sabry and
Felleisen 1992] on expressions. Where needed, A-normal form can be enforced by standard tech-
niques; we ask the reader to indulge us when we write some transformation results without
explicitly introducing all the intermediate variables that our syntax technically requires. We do
this for clarity and brevity. Under the same aegis, we also only allow primitives to be unary or
binary, and all primitives except dup and drop return exactly one result. We likewise occasionally
abuse notation and write (; 𝑒, 𝑣) where 𝑒 may return multiple results. The intended meaning is
Types
  𝜏, 𝜎, 𝜋 ::= R                                        real scalar
            | ⊗𝜏                                       tuple type

Fixed-arity Expressions
  𝑒^{1,0} ::= 𝑢, 𝑣, 𝑤, 𝑧                                non-linear variable
            | 𝑙                                        literal; 𝑙 ∈ R
            | ⊗(𝑣)                                     tuple constructor
            | sin(𝑣) | cos(𝑣) | exp(𝑣) | 𝑣1 + 𝑣2 | ...  primitives for non-linear fragment
  𝑒^{0,1} ::= 𝑢¤, 𝑣¤, 𝑤¤, 𝑧¤                            linear variable
            | 0¤𝜏¤                                     linear zero of type 𝜏¤
            | ⊗(𝑣¤)                                    tuple constructor
            | 𝑣¤1 + 𝑣¤2                                linear addition
            | 𝑣 ∗ 𝑣¤                                   right-linear multiplication
  𝑒^{0,2} ::= dup(𝑣¤)                                   explicit fan-out for linear values
  𝑒^{0,0} ::= drop(𝑒^{𝑚,𝑛})                             explicit drop

Definitions
  𝑑 ::= def 𝑓^{𝑚,𝑛}_{𝑜,𝑝}( 𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ) → (𝜏; 𝜏¤) = 𝑒^{𝑜,𝑝}    𝑚,𝑛-in, 𝑜,𝑝-out function; |𝑣| = 𝑚, |𝑣¤| = 𝑛, |𝜏| = 𝑜, |𝜏¤| = 𝑝

Programs
  𝑃 ::= 𝜖 | 𝑑, 𝑃                                        list of function definitions

Environments
  Γ ::= 𝜖 | 𝑣 : 𝜏, Γ                                    unordered type environment
  Σ ::= 𝜖 | 𝑣 → 𝑤, Σ                                    variable association map (for J in Section 5)
  Δ ::= ⟨⟩ | ⟨𝑣 : 𝜏, Δ⟩                                 ordered type environment (for T in Section 7)

Fig. 3. Syntax of Linear A. Expressions are syntactically indexed by how many non-linear and linear results
they return. A function name enters scope after its definition—no recursive function calls. The underline 𝑣
means “zero or more of these elements”; we put it below rather than above to avoid clashing with dots: 𝑣¤.
an expression that executes 𝑒 and returns all of its results, and also the variables 𝑣. This can be
arranged within the syntax of Linear A by allocating fresh names to briefly hold the results of 𝑒.
A-normal form does make our work preservation results easier. To wit, it’s possible and reasonable
to implement a differentiate-unzip-transpose AD system that operates directly on compound
expressions. Such a system would, however, need to be careful to introduce temporaries to avoid
duplicating code expressions (and therefore work); for example in rule JPrimMul in Figure 7.
A-normal form saves us from that because the intermediate names are already present.
4.1.3 Asymmetric Linear Multiply. Linear computations in Linear A model real multiplication as
linear in the right-hand argument and non-linear in the left-hand argument. (This materializes in
rule TypeLinMul in Figure 4 checking the left argument in the non-linear environment and the right
argument in the linear environment.) Which is of course immaterial, because real multiplication
commutes, but we intentionally pick one to keep the linearity of variables syntactically apparent.
TypeRet
  𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ⊢ (𝑣; 𝑣¤) : (𝜏; 𝜏¤)

TypeLet
  Γ1; Γ¤1 ⊢ 𝑒1 : (𝜏; 𝜏¤)        Γ2, 𝑣 : 𝜏; Γ¤2, 𝑣¤ : 𝜏¤ ⊢ 𝑒2 : (𝜎; 𝜎¤)
  ------------------------------------------------------------------
  Γ1 ∪ Γ2; Γ¤1 ⊎ Γ¤2 ⊢ let( 𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ) = 𝑒1 in 𝑒2 : (𝜎; 𝜎¤)

TypeUnpack
  Γ, 𝑤 : 𝜏; Γ¤ ⊢ 𝑒 : (𝜎; 𝜎¤)
  ------------------------------------------------------------------
  Γ ∪ {𝑣 : ⊗𝜏}; Γ¤ ⊢ let( ⊗ 𝑤 : ⊗𝜏; ) = 𝑣 in 𝑒 : (𝜎; 𝜎¤)

TypeLinUnpack
  Γ; Γ¤, 𝑤¤ : 𝜏¤ ⊢ 𝑒 : (𝜎; 𝜎¤)
  ------------------------------------------------------------------
  Γ; Γ¤, 𝑣¤ : ⊗𝜏¤ ⊢ let( ; ⊗ 𝑤¤ : ⊗𝜏¤ ) = 𝑣¤ in 𝑒 : (𝜎; 𝜎¤)

TypeApp
  def 𝑓 ( 𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ) → (𝜎; 𝜎¤) = 𝑒
  ------------------------------------------------------------------
  𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ⊢ 𝑓 (𝑣; 𝑣¤) : (𝜎; 𝜎¤)

TypeVar
  𝑣 : 𝜏; 𝜀 ⊢ 𝑣 : (𝜏;)

TypeLit
  𝜀; 𝜀 ⊢ 𝑙 : (R;)

TypePrim1
  𝑣 : R; 𝜀 ⊢ sin(𝑣) : (R;)

TypePrim2
  {𝑣1 : R} ∪ {𝑣2 : R}; 𝜀 ⊢ 𝑣1 + 𝑣2 : (R;)

TypeTup
  ∪{𝑣 : 𝜏}; 𝜀 ⊢ ⊗𝑣 : (⊗𝜏;)

TypeLinTup
  𝜀; 𝑣¤ : 𝜏¤ ⊢ ⊗𝑣¤ : (; ⊗𝜏¤)

TypeLinVar
  𝜀; 𝑣¤ : 𝜏¤ ⊢ 𝑣¤ : (; 𝜏¤)

TypeLinZero
  𝜀; 𝜀 ⊢ 0¤𝜏¤ : (; 𝜏¤)

TypeLinPlus
  𝜀; 𝑣¤1 : 𝜏¤, 𝑣¤2 : 𝜏¤ ⊢ 𝑣¤1 + 𝑣¤2 : (; 𝜏¤)

TypeLinMul
  𝑣1 : R; 𝑣¤2 : 𝜏¤ ⊢ 𝑣1 ∗ 𝑣¤2 : (; 𝜏¤)

TypeDup
  𝜀; 𝑣¤ : 𝜏¤ ⊢ dup(𝑣¤) : (; 𝜏¤, 𝜏¤)

TypeDrop
  Γ; Γ¤ ⊢ 𝑒 : (𝜏; 𝜏¤)
  ------------------------------------------------------------------
  Γ; Γ¤ ⊢ drop(𝑒) : (;)

⊢ 𝑓

TypeDef
  𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ⊢ 𝑒 : (𝜎; 𝜎¤)
  ------------------------------------------------------------------
  ⊢ def 𝑓 ( 𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ) → (𝜎; 𝜎¤) = 𝑒
Fig. 4. Linear A typing rules. This is a standard simply-typed system, except that it enforces that all non-linear
variables are used at least once, and all linear variables are used exactly once, up to the explicit drop and dup
operations. The TypePrim rules can be extended to a larger set of primitive differentiable maps.
Of course, multiplication is actually multi-linear, but we leave explicitly modeling this as an open
avenue for future research.
4.2 Typing
The type system for Linear A appears in Figure 4. It is a conventional substructural type system
that enforces the indexed linearity that Linear A is trying to capture.
4.2.1 Substructural Typing. The type system is linear on the linear fragment of Linear A, enforcing
that every linear variable is used exactly once, except where duplicated or dropped by explicit
dup or drop operations. We track linear duplications and deletions because they transpose to
addition and zero, respectively. Our earlier analogy between algebraic and substructural linearity
of a polynomial appears here in the rules TypeLinPlus, which requires both arguments to be linear,
and TypeLinMul, which requires exactly the right-hand argument to be linear.
Less conventionally, we also enforce that every non-linear variable is used at least once (by an
explicit drop if needed). Thus even the non-linear fragment of Linear A requires a substructural (if
not linear, per se) type system. We track drop of non-linear variables because we have to charge for
it in our cost model (see Section 4.3.1 for a more complete rationale); we do not track non-linear
duplication because we do not charge for dup.
Since our requirement for non-linear variables is to use each at least once, we often pattern-match
a non-linear type environment as Γ1 ∪ Γ2 (e.g., in TypeLet) or ∪Γ𝑖 , implying “Γ𝑖 contains exactly
the free variables of 𝑒𝑖 ; the incoming environment Γ must be the (not necessarily disjoint) union of
the Γ𝑖 .” Similarly, when speaking of the linear environment, we write Γ¤1 ⊎ Γ¤2 to imply that in this
case the union must be disjoint. In the end, the actual use of variables is enforced by each leaf rule
requiring that the incoming environment contain exactly the variables referenced by that leaf and
no others. In practice, one would implement such a type system by inspecting the free variables of
subexpressions rather than by guessing and checking a partition of the environment, but we elide
that consideration from the rules for the sake of brevity.
4.2.2 Polymorphism. We define linear zero, linear addition, and linear multiplication to be type-
indexed—the same operation will type-check with different types in different places as needed
(rules TypeLinZero, TypeLinPlus, TypeLinMul). This is justified because every type that a linear
variable can have defines a unique vector space isomorphic to R𝑛 for some 𝑛. Operationally, the zero
(respectively, summation and scaling) just happens elementwise, recurring into tuples as needed.
The reason to make linear operations type-indexed like this is that drop and dup are naturally
type-indexed (i.e., drop can drop a value of any type), but they transpose to zero and addition,
respectively. We could also type-index non-linear operations, but we don’t need to, and it would
cease to be justified if Linear A were extended to include sum types. Thus, TypeLit, TypePrim1
and TypePrim2 call for the R type.
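Operationally, then, the type-indexed linear operations are nothing more than elementwise recursions over the tuple structure. A minimal Python sketch using JAX's pytree utilities; the nesting shown is an arbitrary illustrative example:

import jax

# Type-indexed linear zero and linear addition: both simply recur elementwise
# over the (nested-tuple) structure of their operands.
def lin_zero(example):                 # an example value stands in for the type
    return jax.tree_util.tree_map(lambda _: 0.0, example)

def lin_add(a, b):
    return jax.tree_util.tree_map(lambda u, v: u + v, a, b)

v = (1.0, (2.0, 3.0))                  # an arbitrary nested-tuple value
assert lin_add(v, lin_zero(v)) == v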
problem by charging for dup of linear variables, but then forward differentiation wouldn’t be
(locally) work-preserving unless we charged for dup of non-linear variables as well. Instead, we
follow the observation that charging for dup should have no asymptotic effect, because usually
both duplicates would participate in some other costly operation. In order to make it so, we must
charge for drop (of both linear and non-linear variables), to make it into a costly operation; and
doing that indeed suffices for the work preservation proofs to go through. Another reason why
charging for drop instead of dup is more natural is that drop should be uncommon, or even absent,
in non-pathological programs, since it is only necessary in the presence of dead code.
4.3.2 Costless Tuples. Why do we not charge for constructing or unpacking tuples, or binding or
referencing variables, which would correspond to costly allocation and access of memory in a real
implementation? The main reason is that reverse-mode AD actually does increase the asymptotic
memory footprint of a program. In our system, this is visible in the ULet and UDef rules in
Section 6, which drastically increase the lifetimes of intermediate non-linear variables (including
across function calls).
4.4 Linearity
Every well-typed Linear A expression is indexed linear when viewed as a function from its environ-
ment to its return value(s). We illustrated the dependence and linearity structure in Figure 2; now
we formalize and prove it in Theorem 4.1. Specifically, a Linear A expression 𝑒 defines a collection of
algebraically linear functions—from its free linear variables to its linear outputs—indexed by values
of its free non-linear variables. The indexing is non-trivial in general because linear multiplication
takes a non-linear argument on the left.
Theorem 4.1. Linear A expressions are indexed linear. Consider an expression 𝑒 of Linear A.
Consider also any set of well-typed values x for the non-linear variables 𝑣 free in 𝑒, and any two sets of
well-typed values x¤ and y¤ for the linear variables 𝑣¤ free in 𝑒; and any scalar 𝑐. We write 𝑒 [x; x¤ ] for
the result of evaluating 𝑒 with 𝑣 bound to x and 𝑣¤ bound to x¤ . We use subscripts as indexing here, so
𝑒 [x; x¤ ]𝑘 is the 𝑘-th result of evaluating 𝑒 on x and x¤ . Then:
1. The total work of evaluating 𝑒 [x; x¤ ] is independent of the values x and x¤ ;
2. The non-linear results are independent of x¤ , i.e., 𝑒 [x; x¤ ]𝑘 = 𝑒 [x; y¤ ]𝑘 for each non-linear result 𝑘;
3. The linear results are linear in x¤ , i.e., 𝑒 [x; x¤ + y¤ ]𝑙 = 𝑒 [x; x¤ ]𝑙 + 𝑒 [x; y¤ ]𝑙 and 𝑒 [x; 𝑐 · x¤ ]𝑙 = 𝑐 · 𝑒 [x; x¤ ]𝑙
for each linear result 𝑙.
The equality, addition and scaling here are mathematical operations over reals, and we assume the
denotation of arithmetic in Linear A to be carried out on infinite-precision reals. Given approximate
(e.g., floating-point) semantics, the equalities might not be exact.
Note that Claim 1 of this theorem, while true for Linear A, is not future-proof. A variant of the
language that supported (non-linear) conditionals or recursion would need to weaken Claim 1 to
allow work (including termination) to depend on the non-linear values x, though work should
remain independent of the linear values x¤ .
Proof. By structural induction, first on the program and then on the syntax of expressions and
function bodies. Work-independence (Claim 1) follows immediately from the lack of conditional
control flow in the language.
The proofs of Claims 2 and 3 are similar enough that we discuss them jointly. The interesting
case is multi-value let, so we consider it first. The inductive hypothesis asserts that the non-linear
values returned by the bound expression are independent of the linear free variables thereof, and
the returned linear values satisfy the additive and scaling laws. Since the set of free variables of the
body expression is a union of free variables of the entire expression and the let bound values, we
can again invoke the inductive hypothesis to arrive at the desired conclusion.
Tuple unpacking is entirely analogous to the multi-value let. Function application follows
directly from induction on the function body.4 The drop form is trivial, and all other forms can be
verified directly because they do not have subexpressions. In particular, indexed linearity depends
upon the only primitive operations available to the linear fragment of Linear A being zero, plus,
and scaling by a non-linear constant; and the presence of actually non-linear primitives in the
non-linear fragment of Linear A is what prevents us from strengthening the linearity claims to the
non-linear variables. □
Corollary 4.2. Every closed term in Linear A returns 0 for all its linear results.
Proof. If the term 𝑒 is closed, we can vacuously scale all of the free linear variables of 𝑒 without
changing the evaluation of 𝑒. All the linear results must therefore be 0. Formally, Claim 3 gives us
𝑒 [; ]𝑙 = 2𝑒 [; ]𝑙 = 0 for every linear result position 𝑙 of 𝑒. □
In particular, for the statement of Theorem 4.1 not to be vacuous, one must allow x¤ and y¤ to be
non-zero without asking where those non-zero values came from. This is ok, though, because we
are generally interested in open terms of Linear A, such as the bodies of Linear A functions. In that
case, we can assume the non-zero linear values are provided externally through the substitution of
free variables.
Lemma 4.3. The work a Linear A function or expression does is at least the number of real scalars
in its linear inputs, less the number of real scalars in its linear outputs.
Proof. The only linear operations in Linear A that return fewer scalar results than they consume
are + and drop, each of which costs 1 per net scalar consumed. Since all linear variables must be
either consumed or returned, the result follows. □
We state this fact here because it’s a curious general property of our cost model. We will use
it only to argue that eliminating useless code (e.g., by short-circuiting in the rule for transposing
drop(𝑒)) does not increase work (!).
Fig. 5. Dataflow diagrams of (a) a purely non-linear expression in Linear A and (b) its derivative, plus (c) a
derivative made of a sequential composition stacked vertically. The primal (non-linear) results of a derivative
depend only on its primal (non-linear) inputs, whereas the tangent results depend non-linearly on the
primal (non-linear) inputs and linearly on the tangent (linear) inputs. Note the distinction between J , the
code transform we define, and 𝐽 , the Jacobian of an expression 𝑒.
Lemma 4.4. Linearity erasure preserves the semantics of an expression, and preserves work in our
cost model.
Proof. By inspection. □
5 AUTOMATIC DIFFERENTIATION
Our first step is a code transformation J performing forward-mode automatic differentiation5 .
Specifically, J transforms the non-linear fragment of Linear A into the full language as follows. For
any purely non-linear 𝑒 in Linear A, and input x, J (𝑒) [x; x¤ ] has non-linear results computing the
same output as 𝑒 [x; ], and linear results computing the directional derivative of 𝑒 at the point x in
the direction x¤ . We illustrate the data flow produced by J in Figure 5. Here we view the expression
𝑒 as a function from its free variables 𝑣 to its results. A code example appears in Figure 6.
To compute a forward derivative end-to-end, we need to supply the initial direction of differ-
entiation (1 for an R → R𝑚 primal function). Since we cannot write a non-zero linear literal in a
well-typed Linear A program, we have to use the linearity erasure transform L from Section 4.5.
For 𝐹 : R → R𝑚,
    ( 𝐹 (𝑥), (𝑑/𝑑𝑥) 𝐹 (𝑥) ) = L (J (𝐹 ))(𝑥, 1; ).
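The same end-to-end computation in JAX seeds jax.jvp with the tangent 1. A minimal sketch with an illustrative 𝐹:

import jax
import jax.numpy as jnp

def F(x):                                   # illustrative F : R -> R^m
    return jnp.array([jnp.sin(x), x * x])

x = 2.0
Fx, dFdx = jax.jvp(F, (x,), (1.0,))         # (F(x), d/dx F(x))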
Part of the point of this whole exercise is that J will be very familiar to students of automatic
differentiation; it’s just a forward-mode transformation that computes tangents in tandem with
primals. The rules in Figure 7 are completely standard, with the one wrinkle that the derivative is
emitted into the linear fragment of Linear A. Ergo, the result being well-typed (substructurally) in
Linear A means that we have a proof that the derivative of a function is algebraically linear (with
respect to the direction in which the derivative is taken).
This is the first step in our separation of concerns: differentiation proper is confined to the
relatively simple J , whereas arranging to run the derivative backward to obtain reverse-mode AD
is the province of U and T (in Section 6 and Section 7, respectively). The latter two know only
about linearity, not about differentiation, nor about any special relationship between the non-linear
(primal) and the linear (tangent) computations. The type system of Linear A is the abstraction
boundary that lets differentiation and transposition be implemented independently.
5 J corresponds to the jax.jvp transform from JAX.
def f^{1,0}_{1,0}( 𝑢 : R; ) → (R; ) =
  let( 𝑣 : R; ) = sin(𝑢) in
  let( 𝑤 : R; ) = − 𝑣 in
  (𝑤;)

        −−J−−→

def fJ^{1,1}_{1,1}( 𝑢 : R; 𝑢¤ : R ) → (R; R) =
  let( 𝑣 : R; ) = sin(𝑢) in
  let( 𝑑𝑢 : R; ) = cos(𝑢) in
  let( ; 𝑣¤ : R ) = 𝑑𝑢 ∗ 𝑢¤ in
  let( 𝑤 : R; ) = − 𝑣 in
  let( ; 𝑤¤ : R ) = − 𝑣¤ in
  (𝑤; 𝑤¤)
Fig. 6. Example of forward differentiation with J , transforming 𝑓 (𝑢) = − sin(𝑢) on the left into J (𝑓 ), which
computes both the original 𝑓 and its directional derivative. The input function 𝑓 must be coded in Linear A
as all non-linear. The result uses both the linear and non-linear fragments of Linear A.
5.1 Details
5.1.1 Non-linear Input. We define J to operate only on the purely non-linear subset of Linear
A. This avoids perturbation confusion problems: all linear variables in Linear A refer to the same
perturbation. We can still get nested AD by alternating J and the linearity erasure transform L.
5.1.2 Introducing dup. The non-linear fragment of Linear A allows variables to be referenced
multiple times, but the linear fragment does not. Ergo, whenever a non-linear variable is re-used,
we must introduce dup operations on the corresponding linear variables. The JLet rule in Figure 7
spells this out: the rule detects which variables 𝑢 are repeated; constructs the needed fresh variables
𝑤¤, 𝑧¤; introduces the duplication operations 𝑤¤, 𝑧¤ = dup(𝑢¤); and takes care that the tangent of 𝑢 in 𝑒1
is 𝑤¤, whereas in 𝑒2 it’s 𝑧¤.
is 𝑤, ¤
One instance seeming sufficient, Figure 7 omits similar manipulations for the other rules. The
full system has a rule JUnpackB for the case where the tuple variable 𝑣 does appear free in the body
𝑒; it differs from JUnpackA only in that the binding for 𝑣 is propagated to the recursive call, and
the tangent 𝑣¤ is duplicated for its appearance in the transformed body. Likewise, JRet, JApp, JTup,
and JPrimMul have variants that add the necessary dup calls if any of the non-linear 𝑣 repeat.
5.1.3 Other Omissions. We also omit rules for the other non-linear primitives. They differ from
JPrimSin and JPrimMul only in the derivatives they must insert into the result; we trust that
a unary and a binary example are adequately clear. Finally, there are no rules for linear syntax
because J operates on the non-linear fragment of Linear A.
5.1.4 Extra Non-linear Computations. Note that J in general introduces non-linear computations
that are not needed for the original non-linear program—these are the actual partial derivatives.
For example, in JPrimSin we see a new cos(𝑣). This is bound to a non-linear variable, but used only
by the linear result of J (sin). Figure 1 elided these extra non-linear computations, hiding them
in the red arrows between the blue primal blocks and orange derivative blocks. Our downstream
unzipping transformation U treats them as part of the blue blocks, since they are non-linear.
5.1.5 Tangent Types. The rules in Figure 7 depend on assuming that any Linear A type 𝜏 is also
a suitable type for its tangent. As we have defined Linear A to have only real scalars and nested
tuples thereof, this proposition is costless; but the correct choice of tangent type becomes more
interesting in a language with sum types. One conventional choice is to still use 𝜏 itself as the type
of tangents to 𝜏, though that loses the information that both the primal and the tangent value are
necessarily the same variant. Retaining that information requires a dependent type system; we
look forward to the community working out a satisfying one for this purpose.
J (Σ | 𝑒) ⇝ 𝑒′

JRet
  𝑣 distinct
  ----------------------------------------
  J (𝑣 → 𝑣¤ | (𝑣;)) ⇝ (𝑣; 𝑣¤)

JLet
  J (Σ1, 𝑢 → 𝑤¤ | 𝑒1) ⇝ 𝑒1′        J (Σ2, 𝑢 → 𝑧¤, 𝑣 → 𝑣¤ | 𝑒2) ⇝ 𝑒2′
  𝑣¤, 𝑤¤, 𝑧¤ fresh        𝑢 : 𝜏 free in both 𝑒1 and 𝑒2
  ------------------------------------------------------------------
  J (Σ1 ⊎ Σ2 ⊎ {𝑢 → 𝑢¤} | let( 𝑣 : 𝜎; ) = 𝑒1 in 𝑒2)
    ⇝ let( ; 𝑤¤ : 𝜏, 𝑧¤ : 𝜏 ) = (; dup(𝑢¤)) in let( 𝑣 : 𝜎; 𝑣¤ : 𝜎 ) = 𝑒1′ in 𝑒2′

JApp
  𝑣 distinct
  ----------------------------------------
  J (𝑣 → 𝑣¤ | 𝑓 (𝑣;)) ⇝ 𝑓 J (𝑣; 𝑣¤)

JVar
  J (𝑣 → 𝑣¤ | 𝑣) ⇝ (𝑣; 𝑣¤)

JTup
  𝑣 distinct
  ----------------------------------------
  J (𝑣 → 𝑣¤ | ⊗𝑣) ⇝ (⊗𝑣; ⊗𝑣¤)

JLit
  J (𝜀 | 𝑙) ⇝ (𝑙; 0¤R)

JPrimSin
  J (𝑣 → 𝑣¤ | sin(𝑣)) ⇝ (sin(𝑣); cos(𝑣) ∗ 𝑣¤)

JPrimMul
  𝑣1, 𝑣2 distinct
  ----------------------------------------
  J (𝑣1 → 𝑣¤1, 𝑣2 → 𝑣¤2 | 𝑣1 ∗ 𝑣2) ⇝ (𝑣1 ∗ 𝑣2; 𝑣1 ∗ 𝑣¤2 + 𝑣2 ∗ 𝑣¤1)

JDrop
  J (Σ | 𝑒) ⇝ 𝑒′
  ----------------------------------------
  J (Σ | drop(𝑒)) ⇝ drop(𝑒′)

J (𝑓 ) ⇝ 𝑓 J

JDef
  J (𝑣 → 𝑣¤ | 𝑒) ⇝ 𝑒′        𝑣¤ fresh
  ------------------------------------------------------------------
  J (def 𝑓 ( 𝑣 : 𝜏; ) → (𝜎; ) = 𝑒) ⇝ def 𝑓 J ( 𝑣 : 𝜏; 𝑣¤ : 𝜏 ) → (𝜎; 𝜎) = 𝑒′
Fig. 7. Forward-mode automatic differentiation in Linear A. The constructed expression 𝑒 ′ emits primal
and tangent results together as a multi-value return. The argument Σ maps primal variable names to the
corresponding tangent variable names. Note in the result of rule JPrimMul, the multiplication on the left is
non-linear, whereas the two multiplications on the right are right-linear. Several near-redundant rules are
omitted; see text.
5.2 Formalization
Our theorem about forward differentiation is conventional. The interesting thing about it is that
the J transformation puts the derivative into the linear fragment of Linear A. We therefore have
a proof obligation that automated derivatives are substructurally linear; but otherwise this is a
repetition of known results that forward differentiation is correct, and work-preserving up to a
constant factor.
Theorem 5.1. Automatic differentiation is total, correct, and work-preserving. Consider an
expression 𝑒, well-typed in Γ; {} in the non-linear fragment of Linear A. Let x be well-typed values for
the (necessarily non-linear) free variables 𝑣 : 𝜏 of 𝑒. Allocate fresh names 𝑣¤ of types 𝜏 for the tangents,
and let x¤ be well-typed values for them. Then:
1. J (𝑣 → 𝑣¤ | 𝑒) ⇝ 𝑒′ exists, and 𝑒′ is well-typed in Γ; 𝑣¤ : 𝜏 in Linear A;
2. Evaluating 𝑒′ takes at most a constant factor more work than evaluating 𝑒;
3. The non-linear results of 𝑒′ [x; x¤ ] equal the results of 𝑒 [x; ]; and
4. The linear results of 𝑒′ [x; x¤ ] equal 𝐽𝑒 | x ( x¤ ), the directional derivative of 𝑒 at x in the direction x¤.
Proof. The proof proceeds by structural induction on the program. We work out the JLet rule
as a template for the others. The first subtlety is that there may be some non-linear variables, 𝑢
in the notation, that are free in both the bound expression 𝑒 1 and the body 𝑒 2 . To construct the
parallel linear expression, we must insert explicit dups for their corresponding linear variables
𝑢¤. This is notated in JLet as the binding let( ; 𝑤¤, 𝑧¤ ) = (; dup(𝑢¤)) in, implying one let and one dup per
duplicated variable among the 𝑢¤. This duplication is also why the map from primal variables to
their corresponding tangent variables may be non-trivial and needs to be maintained explicitly.
Work preservation (Claim 2) follows because the dups introduced by JLet cost 0 in our cost model.
Non-linear correctness (Claim 3) follows from the inductive hypothesis because we reconstruct
the non-linear part of the computation exactly in the result J (𝑒). Linear correctness (Claim 4) is a
calculation:
J (𝑒) [x; x¤ ] = J (𝑒 2 ) [x2, 𝑒 1 [x; ]; x¤ 2, J (𝑒 1 ) [x1 ; x¤ 1 ]] definition of J (𝑒)
= 𝐽𝑒 2 | x2,𝑒1 [x;] ( x¤ 2, J (𝑒 1 ) [x1 ; x¤ 1 ]) induction on 𝑒 2
= 𝐽𝑒 2 | x2,𝑒1 [x;] ( x¤ 2, 𝐽𝑒 1 | x1 ( x¤ 1 )) induction on 𝑒 1
= 𝐽𝑒 | x ( x¤ ). chain rule on 𝑒 2 ◦ 𝑒 1
The chain rule is for the composition of multivariate functions 𝑒 2 ◦𝑒 1 augmented with side inputs x2
and their tangents x¤ 2 . Here we somewhat abuse our own notation, writing ( 𝐽𝑒 | · ) for the Jacobian
of 𝑒 at a point, J (𝑒) [·; ·] for the linear results of evaluating J (𝑒), and 𝑒 [·; ] with no linear inputs
for the non-linear results of evaluating 𝑒. The reader must also understand the index 1 to refer to
variables free in 𝑒 1 and 2 to refer to those free in 𝑒 2 , with any overlap among linear ones taken care
of by the above-mentioned chain of let and dup.
Induction on the program (including previously processed definitions) comes in when checking
JApp, because we rely on correctness of the transformed callee for correctness of the transformed
call site. The handling of JUnpack is analogous to JLet, and the other rules are straightforward. □
6 UNZIPPING
We cannot directly transpose arbitrary Linear A programs. Why not? In the transposed result we
are trying to run the non-linear part forward and the linear part backward. However, the first
U (𝑒) ⇝ 𝐸; 𝑒′; 𝑒¤′

URet
  U ((𝑣; 𝑣¤)) ⇝ □; (𝑣;); (; 𝑣¤)

ULet
  U (𝑒1) ⇝ 𝐸1; 𝑒1′; 𝑒¤1′        U (𝑒2) ⇝ 𝐸2; 𝑒2′; 𝑒¤2′
  ------------------------------------------------------------------
  U (let( 𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ) = 𝑒1 in 𝑒2)
    ⇝ 𝐸1, let( 𝑣 : 𝜏; ) = 𝑒1′ in 𝐸2; 𝑒2′; let( ; 𝑣¤ : 𝜏¤ ) = 𝑒¤1′ in 𝑒¤2′

UUnpack
  U (𝑒) ⇝ 𝐸; 𝑒′; 𝑒¤′
  ------------------------------------------------------------------
  U (let( ⊗ 𝑤 : ⊗𝜏; ) = 𝑣 in 𝑒) ⇝ let( ⊗ 𝑤 : ⊗𝜏 ) = 𝑣 in 𝐸; 𝑒′; 𝑒¤′

ULinUnpack
  U (𝑒) ⇝ 𝐸; 𝑒′; 𝑒¤′
  ------------------------------------------------------------------
  U (let( ; ⊗ 𝑤¤ : ⊗𝜏¤ ) = 𝑣¤ in 𝑒) ⇝ 𝐸; 𝑒′; let( ⊗ 𝑤¤ : ⊗𝜏¤ ) = 𝑣¤ in 𝑒¤′

UDrop
  U (𝑒) ⇝ 𝐸; 𝑒′; 𝑒¤′
  ------------------------------------------------------------------
  U (drop(𝑒)) ⇝ 𝐸; drop(𝑒′); drop(𝑒¤′)

U (𝑓 ) ⇝ 𝑓 .nonlin, 𝑓 .lin

UDef
  U (𝑒) ⇝ 𝐸; 𝑒′; 𝑒¤′        𝑤 : 𝜋 are all the names bound by 𝑣 and 𝐸 and read by 𝑒¤′        𝑢 fresh
  ------------------------------------------------------------------
  U (def 𝑓 ( 𝑣 : 𝜏; 𝑣¤ : 𝜏¤ ) → (𝜎; 𝜎¤) = 𝑒)
    ⇝ def 𝑓 .nonlin( 𝑣 : 𝜏; ) → (𝜎, ⊗𝜋; ) = 𝐸(𝑒′, ⊗(𝑤)),
      def 𝑓 .lin( 𝑢 : ⊗𝜋; 𝑣¤ : 𝜏¤ ) → (; 𝜎¤) = let( ⊗ 𝑤 : ⊗𝜋 ) = 𝑢 in 𝑒¤′
Fig. 8. Unzipping converting Linear A to Linear B. Unzipping transforms a mixed linear and non-linear
expression 𝑒 into a shared non-linear context 𝐸, a non-linear result expression 𝑒 ′ , and a linear result expression
𝑒¤′ . Note that we always unzip recursively, because Linear B does not allow non-linear expressions to read
linear variables, even to drop them. Most language primitives are handled by the generic rules ULeaf and
ULinLeaf.
sub-expression of a general Linear A program may produce both linear and non-linear results. Do
we run that subexpression before or after the remainder?
Fortunately, the non-linear part of a Linear A expression cannot depend on any of its linear
inputs. Ergo, it should be possible to hoist it above the linear part (this section) to just transpose
the latter (Section 7).
We begin by defining what we want:
Definition 6.1 (Linear B). Linear B is a sublanguage of Linear A with two restrictions:
• All functions and expressions of Linear B must return either all-linear or all-nonlinear
results—no mixing.
• Expressions of Linear B that produce non-linear results must not read linear variables, not
even to drop them.
def fJ^{1,1}_{1,1}( 𝑢 : R; 𝑢¤ : R ) → (R; R) =
  let( 𝑣 : R; ) = sin(𝑢) in
  let( 𝑑𝑢 : R; ) = cos(𝑢) in
  let( ; 𝑣¤ : R ) = 𝑑𝑢 ∗ 𝑢¤ in
  let( 𝑤 : R; ) = − 𝑣 in
  let( ; 𝑤¤ : R ) = − 𝑣¤ in
  (𝑤; 𝑤¤)

        −−U−−→

def fJ.nonlin^{1,0}_{2,0}( 𝑢 : R; ) → (R, ⊗R; ) =
  let( 𝑣 : R; ) = sin(𝑢) in
  let( 𝑑𝑢 : R; ) = cos(𝑢) in
  let( 𝑤 : R; ) = − 𝑣 in
  let( 𝑡 : ⊗R; ) = ⊗(𝑑𝑢) in
  (𝑤, 𝑡;)

def fJ.lin^{1,1}_{0,1}( 𝑡 : ⊗R; 𝑢¤ : R ) → (; R) =
  let( ⊗ 𝑑𝑢; ) = 𝑡 in
  let( ; 𝑣¤ : R ) = 𝑑𝑢 ∗ 𝑢¤ in
  let( ; 𝑤¤ : R ) = − 𝑣¤ in
  (; 𝑤¤)
Fig. 9. Example of unzipping. On the left we have the derivative J (𝑓 ) of 𝑓 (𝑢) = − sin(𝑢) from Figure 6, and
on the right the non-linear and linear functions it unzips to, J (𝑓 ).nonlin and J (𝑓 ).lin. In AD terminology,
J (𝑓 ).nonlin is the forward phase, which computes the original function 𝑓 and returns all the intermediates,
or the tape, (in this case the variable 𝑑𝑢) needed for the derivative. The function J (𝑓 ).lin is called the
linearization in Frostig et al. [2021]. It consumes the tape produced by the forward phase and computes (in
substructurally and therefore algebraically linear fashion) the forward derivative of 𝑓 ; in Figure 10 (Section 7)
we will transpose it to obtain the reverse phase for computing the derivative of 𝑓 in reverse mode.
Figure 8 gives the transformation U 6 that “unzips” any Linear A expression or function into
a pair of Linear B expressions or functions—one that computes only linear results and one that
computes only non-linear results. Figure 9 shows the unzip of an example function.
The result of U on an expression is a bit subtle: we wish for U to split the input expression 𝑒
into a non-linear fragment 𝑒 ′ and a linear fragment 𝑒¤′. However, 𝑒 may compute a (non-linear)
intermediate value that will be used for both linear and non-linear results. To allow 𝑒 ′ and 𝑒¤′ to both
use that value without having to recompute it, U (𝑒) also produces a context 𝐸 of let bindings, which
represent all the intermediate (non-linear) computations that 𝑒 ′ and 𝑒¤′ may share. All the actual
work of the non-linear fragment ends up in the context 𝐸. The rule UDef exposes the variables
bound by 𝐸 as outputs of the emitted non-linear fragment 𝑓 .nonlin, so they can be read without
recomputation by the linear fragment 𝑓 .lin.
The context 𝐸 is the “tape” (as it is often called [Griewank and Walther 2008]) that gives reverse-
mode AD its work-preservation. In the example of Figure 9 it consists of the single variable 𝑑𝑢. One
of our contributions is disentangling the construction of the tape from AD proper. The time-space
tradeoff the tape represents is handled by the unzipping transformation, and neither (forward)
differentiation nor transposition need concern themselves with it.
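In JAX, the composite of differentiation and unzipping is roughly what jax.linearize provides: the returned closure plays the role of 𝑓 .lin, with the tape held in its closure, so the linear part can be re-evaluated without redoing the non-linear work. A minimal sketch with an illustrative 𝑓:

import jax
import jax.numpy as jnp

def f(x):                     # illustrative function, as in the running example
    return -jnp.sin(x)

x = 0.7
# Forward phase: runs the non-linear work once and records what the
# linear part will need (the tape lives in the closure of f_lin).
y, f_lin = jax.linearize(f, x)
# Linear phase: can be applied repeatedly without redoing the non-linear work.
dy1 = f_lin(1.0)
dy2 = f_lin(-2.5)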
6.2 Formalization
Given an expression 𝑒 well-typed in Γ; Γ¤ in Linear A, our transformation U (𝑒) produces a context
𝐸 (syntax in Figure 3), a non-linear expression 𝑒 ′, and a linear residual expression 𝑒¤′. We need the
free non-linear variables 𝑤 of the expression 𝑒¤′, with their types 𝜋. Then we define
• The tape 𝑤. These are variables carrying the needed non-linear information from Γ and 𝐸 to
𝑒¤′.
• The non-linear partial 𝐸(𝑒 ′, 𝑤). This is the expression formed by allocating fresh variables 𝑢
for the results of 𝑒 ′ and plugging let( 𝑢; ) = 𝑒 ′ in (𝑢, 𝑤;) into the hole in 𝐸. The non-linear
partial is a purely non-linear expression that returns the results of 𝑒 ′ together with the tape.
• The reconstruction 𝐸(𝑒 ′, 𝑤; 𝑒¤′). Given fresh variables 𝑢 and 𝑢¤ for all the results of 𝑒, this is the
expression let( 𝑢, 𝑤; ) = 𝐸(𝑒′, 𝑤) in let( ; 𝑢¤ ) = 𝑒¤′ in (𝑢; 𝑢¤) that should be equivalent to 𝑒
if the non-linear partial and the residual are correct. Note that here we bind 𝑤 to themselves:
the point is that we arrange for the context 𝐸 not to be in scope in 𝑒¤′, and carry all the needed
information explicitly in the tape 𝑤.
Theorem 6.2. Unzipping is total, correct, and work-preserving. Consider any Linear A ex-
pression 𝑒 well-typed in Γ; Γ¤ with 𝑤 and 𝜋 as above. Then
1. 𝐸; 𝑒 ′; 𝑒¤′ = U (𝑒) exist;
2. The non-linear partial 𝐸(𝑒 ′, 𝑤) is well-typed in Γ; {} in Linear B;
3. The residual 𝑒¤′ is well-typed in 𝑤 : 𝜋; Γ¤ in Linear B; and
4. The reconstruction 𝐸(𝑒 ′, 𝑤; 𝑒¤′) is well-typed in Γ; Γ¤ in Linear A and computes the same result
with the same work as 𝑒.
Proof. The proof consists of our usual structural induction. One subtlety is well-typedness of
the non-linear partial 𝐸(𝑒 ′, 𝑤) when 𝑒 is a mixed let expression (rule ULet). We can hoist 𝐸 2
and 𝑒 2′ past the linear variables 𝑣¤ because recursive unzipping will ensure that 𝐸 2 and 𝑒 2′ do not
read any linear variables. We also need to check that the non-linear partial uses every non-linear
variable at least once. By assumption, any non-linear variable 𝑣 from Γ appears in either 𝑒 1 or 𝑒 2 .
By induction, 𝑣 therefore appears in either the non-linear partial of 𝑒 1 or of 𝑒 2 . If 𝑣 appears in 𝐸 1 , 𝑒 1′ ,
𝐸 2 , or 𝑒 2′ directly, then 𝑣 appears in either 𝐸 or 𝑒 ′. Otherwise, 𝑣 must either appear in the tape 𝑤 1
due to being free in 𝑒¤1′ , or in 𝑤 2 due to being free in 𝑒¤2′ . Therefore, 𝑣 is free in 𝑒¤′, because the latter
only binds 𝑣¤, which cannot shadow 𝑣, being linear variables. Hence we will add 𝑣 to the 𝑤, and it
will again appear in the non-linear partial 𝐸(𝑒, 𝑤) of 𝑒. A similar argument covers the non-linear
variables 𝑣 bound by the let. Checking variable consumption for the residual is a simpler variation
of the same argument.
Some notes: The UDef rule coordinates with the UApp rule to transport the tape in a single
variable of tuple type rather than increasing the size of every call site by the number of intermediate
values of the function. The UApp rule relies on program ordering, because we read the type of the
unzipped callee 𝑓 .nonlin in order to annotate the type of the call site.7 To enforce the invariant
that non-linear expressions of Linear B do not consume linear variables, we explicitly unzip drop
expressions in rule UDrop. The proof of claim 4 requires induction on the program to assert that
any functions called by 𝑒 compute the correct values, but is otherwise straightforward. □
7 TRANSPOSITION
To JVP, or VJP? That is the question:
Whether ’tis nobler in the mind to suffer
The slings and arrows of outrageous re-implementation,
Or to take up Arms against a Sea of troubles,
And by transposing, end them.
We turn to our main focus: transposing linear functions. Recall that our ultimate goal is to
derive reverse-mode automatic differentiation from forward mode, without having to duplicate
derivative rule implementations. We do this through the notion of transposition. Every linear
function 𝑓 : R𝑛 → R𝑚 has a unique transpose 𝑓 T : R𝑚 → R𝑛 defined by consistency with respect
to dot product:
⟨𝑥, 𝑓 T (𝑦)⟩ = ⟨𝑓 (𝑥), 𝑦⟩. (1)
For a given nonlinear function 𝐹 : R𝑛 → R𝑚 and input 𝑥 ∈ R𝑛, this is exactly the relationship
between the forward derivative 𝑑𝐹 : (𝑦 ∈ R𝑛) ↦→ (𝐽 𝐹 |𝑥 )𝑦 and the reverse derivative 𝑑𝐹 T : (𝑧 ∈ R𝑚) ↦→ 𝑧 (𝐽 𝐹 |𝑥 ),
where 𝐽 𝐹 |𝑥 is the Jacobian of 𝐹 at 𝑥.
Concretely, to compute a gradient of 𝐹 : R𝑛 → R at x ∈ R𝑛, we differentiate 𝐹 with J to produce
𝐹 J, unzip it into the linear and non-linear components 𝐹 J .nonlin and 𝐹 J .lin, transpose the latter
to obtain T (𝐹 J .lin), and apply that transpose (at the tape produced by 𝐹 J .nonlin(x)) to the output
cotangent 1, yielding ∇𝐹 (x).
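Spelled out in JAX, the same pipeline reads as follows (a minimal sketch; 𝐹 and x are illustrative placeholders):

import jax
import jax.numpy as jnp

def F(x):                                    # illustrative F : R^n -> R
    return jnp.sum(jnp.sin(x) * x)

x = jnp.array([1.0, 2.0, 3.0])

y, F_lin = jax.linearize(F, x)               # differentiate and unzip
F_lin_T = jax.linear_transpose(F_lin, x)     # transpose the linear part
(grad_x,) = F_lin_T(1.0)                     # apply it to the output cotangent 1

assert jnp.allclose(grad_x, jax.grad(F)(x))  # recovers reverse mode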
An example transposition is given in Figure 10. We use the double-dot notation 𝑣¥ as a mnemonic
for cotangent variables (not second-order derivatives). Each cotangent variable holds a partial
derivative of the final result of 𝐹 with respect to something; these are propagated in the opposite
order from the primal computation. We assign no formal meaning to the dots themselves: 𝑣¥ is just
a different variable than 𝑣¤, even though we mean for it to carry the cotangent corresponding to 𝑣¤.
7.1 Details
7.1.1 Input to Transposition. We transpose a function that’s well-typed in Linear B with the T
transformation8 given in Figure 11. Recall from Definition 6.1 that Linear B is the sub-language of
Linear A where all expressions must return either all-linear or all-non-linear results. We restrict
7 This is trivial for Linear A as presented, because it does not admit recursion; but this point becomes more subtle if recursion
is permitted. Unzipping in a recursive language must be able to construct recursive types to represent the tapes of recursive
computations. This can be accomplished by indirection: for each function 𝑓 , allocate a fresh name 𝜋 𝑓 for the type of the
tuple carrying the tape. Then unzipping can use that name to fill in type annotations of call sites of 𝑓 before 𝑓 is unzipped,
and unzipping 𝑓 provides the definition of 𝜋 𝑓 (which may refer to 𝜋 𝑓 if 𝑓 is recursive).
8 T is available in JAX as the jax.linear_transpose function. Note that JAX does not distinguish between linear and
non-linear expressions in its language, so the relevant typing judgements are reconstructed on the fly during transposition.
The transform assumes that the given function can be typed according to the type system presented here and any failure to
infer a valid type is reported to the user as an error.
def fJ.nonlin^{1,0}_{2,0}( 𝑢 : R; ) → (R, ⊗R; ) =       (unchanged by T)
  let( 𝑣 : R; ) = sin(𝑢) in
  let( 𝑑𝑢 : R; ) = cos(𝑢) in
  let( 𝑤 : R; ) = − 𝑣 in
  let( 𝑡 : ⊗R; ) = ⊗(𝑑𝑢) in
  (𝑤, 𝑡;)

def fJ.lin^{1,1}_{0,1}( 𝑡 : ⊗R; 𝑢¤ : R ) → (; R) =
  let( ⊗ 𝑑𝑢; ) = 𝑡 in
  let( ; 𝑣¤ : R ) = 𝑑𝑢 ∗ 𝑢¤ in
  let( ; 𝑤¤ : R ) = − 𝑣¤ in
  (; 𝑤¤)

        −−T−−→

def fJ.linT^{1,1}_{0,1}( 𝑡 : ⊗R; 𝑤¥ : R ) → (; R) =
  let( ⊗ 𝑑𝑢; ) = 𝑡 in
  let( ; 𝑣¥ : R ) = − 𝑤¥ in
  let( ; 𝑢¥ : R ) = 𝑑𝑢 ∗ 𝑣¥ in
  (; 𝑢¥)
Fig. 10. Completing our running example by transposing the derivative. On the left, we have the forward
phase J (𝑓 ).nonlin and residual J (𝑓 ).lin of 𝑓 (𝑢) = − sin(𝑢) from Figure 9. On the right, we transpose
just the linear residual J (𝑓 ).lin to obtain the reverse phase T (J (𝑓 ).lin), which computes the derivative
of 𝑓 in reverse mode.
We restrict to Linear B because general Linear A functions are free to interleave linear with non-linear
computations, but we transpose only the strictly linear fragment. Converting an arbitrary Linear A
expression into Linear B was the purpose of the unzipping transformation U of Section 6.
7.1.2 Transposition Environments. To read the transposition rules, it helps to view an expression
as a function from its environment to its results, and to think of its environment as ordered. So, if 𝑒¤
consumes the linear variables of an ordered type environment Δ¤, the transpose 𝑒¥ must produce
one result corresponding to each variable in Δ¤. Likewise, for every result of 𝑒¤, 𝑒¥ must consume a
corresponding free variable. The transform T carries names for these (cotangent) variables in a
second ordered type environment Δ¥.
We use the notation Δ¤, Δ¥ and angle brackets to emphasize that they are ordered vectors instead
of the unordered mappings Γ¤, Γ¥, and match the order of the relevant expression results. When
we compose and decompose expressions and their corresponding ordered environments, it may
be necessary to insert shuffling operations to maintain this ordering invariant. These shuffles are
uninteresting and expressible using only let and multi-value return, both of which cost 0 in our
model, so we leave them implicit. The ordering is irrelevant for type-checking, so we freely use Δ¤
and Δ¥ as typing environments as well.
7.1.3 Omitted Rules. Figure 11 omits the rules TUnpack and TLinUnpack for non-linear and linear
tuple unpacking, because they are identical to the corresponding multiple-value let rules, mutatis
mutandis. There are no specific rules for transposing an expression that produces non-linear results
because it transposes to itself (rule TNonLin). Linear B requires non-linear expressions to have no
linear sub-expressions (not even through drop), so transposition need not traverse them.
7.1.4 Benefit from Explicit drop and dup. Now we have the payoff from requiring all linear variables
to be explicitly consumed: without that, T (𝑒¤) would have to return zeros for variables in Δ¤ that were
not free variables of 𝑒¤, and would later have to do work to add those zeros to the real cotangents.
The payoff to requiring linear variables to be consumed exactly once is that rules do not need
to introduce additions to merge the results of transposed sub-expressions—those cotangents are
already guaranteed to correspond to different variables, and the summations are produced by the
TDup rule.
T(Δ¤, Δ¥ | 𝑒¤) ⇝ 𝑒¥

TRet:     T(⟨𝑣¤ : 𝜏¤⟩, ⟨𝑣¥ : 𝜏¤⟩ | (; 𝑣¤)) ⇝ (; 𝑣¥)
TApp:     T(⟨𝑣¤ : 𝜏¤⟩, ⟨𝑣¥ : 𝜏¥⟩ | 𝑓.lin(𝑣; 𝑣¤)) ⇝ 𝑓.linT(𝑣; 𝑣¥)
TNonLin:  T(𝜀, 𝜀 | 𝑒) ⇝ 𝑒    (𝑒 non-linear)
TLinVar:  T(⟨𝑣¤ : 𝜏¤⟩, ⟨𝑣¥ : 𝜏¤⟩ | 𝑣¤) ⇝ 𝑣¥
TLinZero: T(𝜀, ⟨𝑣¥ : 𝜏¤⟩ | 0𝜏¤) ⇝ drop(𝑣¥)
TLinTup:  T(⟨𝑣¤ : 𝜏¤⟩, ⟨𝑣¥ : ⊗𝜏¤⟩ | ⊗(𝑣¤)) ⇝ let(⊗ 𝑤¥ : ⊗𝜏¥) = 𝑣¥ in (; 𝑤¥)    (𝑤¥ fresh)
TLinAdd:  T(⟨𝑣¤1 : 𝜏¤, 𝑣¤2 : 𝜏¤⟩, ⟨𝑣¥ : 𝜏¤⟩ | 𝑣¤1 + 𝑣¤2) ⇝ dup(𝑣¥)
TLinMul:  T(⟨𝑣¤2 : 𝜏¤⟩, ⟨𝑣¥2 : 𝜏¤⟩ | 𝑣1 ∗ 𝑣¤2) ⇝ 𝑣1 ∗ 𝑣¥2
Function definitions: T(𝑓.lin) ⇝ 𝑓.linT
(The remaining rules of the figure, including the TLetLin, TLetNonLin, TDup, TDrop, TAdd, and TLit rules discussed in the proofs below, are omitted here.)

Fig. 11. Linear B transposition. Linear B is the subset of Linear A where all functions and expressions return
only linear or only non-linear results, and non-linear expressions do not read linear variables. Transposition
leaves non-linear expressions as they are, but order-reverses linear expressions: a linear expression 𝑒¤ with
𝑛 free linear variables and 𝑚 linear results becomes a linear expression 𝑒¥ with 𝑚 free linear variables and 𝑛
results computing the (linear-algebraic) transpose of 𝑒¤.
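To give an executable flavor of a few of these rules, here is a toy sketch of our own (the IR, the names Bind, run_forward, and run_transposed, and the example program are invented for illustration; this is not the formal T): a transposer for straight-line linear programs in which, as in Linear B, every linear variable is consumed exactly once, so no zero-initialization or cotangent accumulation is needed.

from collections import namedtuple

# One binding per line of a straight-line, ANF-style linear program:
#   ("scale", c): out = c * in      (c is a non-linear scalar closed over)
#   ("add",):     out = in1 + in2
#   ("dup",):     out1, out2 = in, in
Bind = namedtuple("Bind", ["outs", "op", "ins"])

def run_forward(binds, in_vars, out_vars, in_vals):
    env = dict(zip(in_vars, in_vals))
    for b in binds:
        if b.op[0] == "scale":
            env[b.outs[0]] = b.op[1] * env[b.ins[0]]
        elif b.op[0] == "add":
            env[b.outs[0]] = env[b.ins[0]] + env[b.ins[1]]
        elif b.op[0] == "dup":
            env[b.outs[0]] = env[b.outs[1]] = env[b.ins[0]]
    return [env[v] for v in out_vars]

def run_transposed(binds, in_vars, out_vars, cotangents):
    # Every linear variable is consumed exactly once, so each variable receives
    # exactly one cotangent: no zero-initialization or += accumulation appears.
    ct = dict(zip(out_vars, cotangents))
    for b in reversed(binds):                       # order-reverse the program
        if b.op[0] == "scale":                      # cf. TLinMul: scaling transposes to scaling
            ct[b.ins[0]] = b.op[1] * ct[b.outs[0]]
        elif b.op[0] == "add":                      # cf. TLinAdd: addition transposes to dup
            ct[b.ins[0]] = ct[b.ins[1]] = ct[b.outs[0]]
        elif b.op[0] == "dup":                      # cf. TDup: dup transposes to addition
            ct[b.ins[0]] = ct[b.outs[0]] + ct[b.outs[1]]
    return [ct[v] for v in in_vars]

# f(x) = (3x, 5x): dup the input, then scale each copy.
prog = [Bind(("x1", "x2"), ("dup",), ("x",)),
        Bind(("y1",), ("scale", 3.0), ("x1",)),
        Bind(("y2",), ("scale", 5.0), ("x2",))]

ys = run_forward(prog, ["x"], ["y1", "y2"], [2.0])                 # [6.0, 10.0]
xs_ct = run_transposed(prog, ["x"], ["y1", "y2"], [1.0, 1.0])      # [8.0]
print(ys, xs_ct)

Reading the transposed pass, scaling transposes to scaling, addition to dup, and dup to addition, and the dot-product identity holds on the example: ⟨f(x), (1, 1)⟩ = 6 + 10 = 16 = 2 · 8 = ⟨x, fᵀ(1, 1)⟩.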
7.2 Formalization
The goal of transposition is to transform a linear function 𝑓 into its transpose 𝑓 T , as defined by
pulling back the dot product (1). We now prove that it does so, and that 𝑓 T isn’t more expensive
than 𝑓 .
Theorem 7.1. Transposition is total, correct, and amortized work-preserving. Consider a
linear Linear B expression 𝑒¤, well typed in non-linear and linear environments Γ, Γ¤. Fix an ordering
of variables in Γ¤ to form Δ¤. Let x and x¤ be well-typed values corresponding to the free non-linear
and linear variables in 𝑒¤, respectively. Let x¥ be values matching the types of the (linear) results from
𝑒¤, and let 𝐿in and 𝐿out be the number of scalars present in the linear inputs x¤ and (linear) results x¥,
respectively. Let W[𝑒¤] denote the amount of work done to execute 𝑒¤. Let Δ¥ be an ordered environment
giving variable names and types for the cotangents x¥ of the (linear) results of 𝑒¤. Then
1. 𝑒¥ = T(Δ¤, Δ¥ | 𝑒¤) exists and is well-typed in Linear B, in environments Γ; Δ¥, and emits linear results
corresponding to Δ¤;
2. W[𝑒¥] + 𝐿in ≤ W[𝑒¤] + 𝐿out; and
3. ⟨x¤, 𝑒¥[x; x¥]⟩ = ⟨𝑒¤[x; x¤], x¥⟩.
Claim 2 is the amortized work preservation criterion for transposition. The amortization correc-
tions 𝐿in and 𝐿out are the number of scalar linear inputs consumed by the original and transposed
expressions, respectively. To see intuitively why a correction is needed, consider transposing a
function that takes a single argument and fans it out (with dup) 𝑛 times. In our cost model, this
function does no work, because dup is free; but its transpose must perform 𝑛 summations, and thus
pay 𝑛. We restore work preservation by crediting an extra (𝑛 + 1) − 1 for the 𝑛 excess inputs the
transposition consumes.
Another intuition for the amortization argument is that producing 𝐿out results obliges the caller
to eventually pay 𝐿out to drop or use them. A transposition that reduces that burden on the caller
should get credit it can use to amortize the additional summations it has to do internally, and
vice versa. These corrections cancel when Linear A expressions are composed, so transposition is
exactly work-preserving internally, and the discrepancy only shows up for the end-to-end Linear A
computation.
Claim 3 is the correctness criterion for transposition: 𝑒¥ pulls the dot product on the output space of 𝑒¤
back(ward) along 𝑒¤ to its input space, which is exactly the mathematical definition of transposition
of linear operators.⁹ Note that on the left-hand side of Claim 3, the x¤ are treated as extra-linguistic
values and x¥ are fed in as linear inputs to 𝑒¥, whereas on the right-hand side, the x¥ are extra-linguistic
and the x¤ are fed in as linear inputs to 𝑒¤.
While we state our results on transposition in terms of vector spaces over the familiar field R of
real numbers, one can straightforwardly extend them to a variant of Linear A modeling modules
over any ring R. If R is not commutative, one needs to introduce a second linear scaling primitive
𝑒¤ · 𝑒, corresponding to scalar right-multiplication in R-modules. Scalar left- and right-multiplication
then transpose to each other. Our dot-product notation also becomes non-commutative, and Claim
3 becomes two claims: ⟨x¤, 𝑒¥[x; x¥]⟩ = ⟨𝑒¤[x; x¤], x¥⟩ and ⟨𝑒¥[x; x¥], x¤⟩ = ⟨x¥, 𝑒¤[x; x¤]⟩. Augmenting R with
sufficient structure to define differentiation, and adjusting forward differentiation to correctly
compute derivatives, is left as an exercise for the enterprising reader.
Proof of Claim 1. Well-typing of transposition follows by structural induction on the syntax of
Linear B. The most interesting rule is TLetLin. The expression 𝑒¤ is well-typed in Γ; Δ¤ by assumption,
so we can split the incoming environment Δ¤ into a distinct union of the variables Δ¤ 1 free in 𝑒¤1
and Δ¤ 2 free in 𝑒¤2 (except for the 𝑣¤ bound in this let form). The results of the overall expression 𝑒¤
are the same as those of the body 𝑒¤2 , so we can transpose 𝑒¤2 with respect to the augmented linear
environment Δ¤2 ⊎ ⟨𝑣 : 𝜏⟩ (up to a permutation) and the same expected results Δ¥. This gives us the
transposed body 𝑒¥2, which by induction is well-typed in Γ; Δ¥, and produces results corresponding
to Δ¤2 ⊎ ⟨𝑣 : 𝜏⟩ (up to a permutation). We now need to split them into the 𝑣¥ corresponding to the 𝑣¤,
which will be read by 𝑒¥1, and the 𝑤¥ corresponding to Δ¤2, which we will return from the result 𝑒¥.
We now have the arguments needed to transpose the bound expression 𝑒¤1 : its linear environment
Δ¤ 1 and variables 𝑣¥ corresponding to its results. They line up correctly because the original expression
𝑒¤ was well-typed as a let. Transposing 𝑒¤1 gives us 𝑒¥1 , which reads the 𝑣¥ from its environment and
9 In the language of algebraic geometry, if 𝑒¤ implements a pushforward of 𝑒, then 𝑒¥ implements a pullback of 𝑒.
produces results corresponding to Δ¤ 1 . Our result expression 𝑒¥ binds the 𝑣¥ from 𝑒¥2 so 𝑒¥1 can read
them, and splices the results corresponding to Δ¤1 and Δ¤2 into results corresponding to Δ¤. The result
𝑒¥ uses all the variables from Δ¥ exactly once because the body 𝑒¥2 does.
The other rules proceed similarly. One subtlety worth noting is that in TLetNonLin, we assume
that all free linear variables of 𝑒¤ are in fact free in the body 𝑒¤2 , and none occur in the bound
expression 𝑒 1 . This rule is why Linear B expressly forbids non-linear expressions from consuming
linear variables, which they might otherwise do with drop. □
Proof of Claim 2. Work preservation needs induction on the program as well as the syntax
of expressions. The proof amounts to checking that the work corrections do, indeed, cancel. We
again consider TLetLin first. If we denote by | · | the number of linear scalars present in some set
of variables, we compute
W[𝑒¤] = W[𝑒¤1] + W[𝑒¤2]                                              cost of let
      ≥ W[𝑒¤1] + W[𝑒¥2] + (|Δ¤2| + |𝑣¤|) − |Δ¥|                       induction on 𝑒¤2
      ≥ W[𝑒¥1] + |Δ¤1| − |𝑣¥| + W[𝑒¥2] + (|Δ¤2| + |𝑣¤|) − |Δ¥|        induction on 𝑒¤1
      = W[𝑒¥] + |Δ¤| − |Δ¥|,                                          the 𝑣¥𝑖 correspond to the 𝑣¤𝑖
as desired.
Handling TApp is where we need to use induction on the program being transposed. To wit, in
this case, W[𝑒¤] equals the work 𝑤 done to evaluate (the body of) the function 𝑓.lin being called,
and W[𝑒¥] equals the work 𝑤T of 𝑓.linT, and the corrections correspond as well. An induction
only on expression syntax would not give us the needed premise that 𝑤 ≥ 𝑤T + correction, but
induction on the program does, because Linear B has no recursion.
We also find it instructive to attend to work preservation of TDup. The term dup(𝑣¤) itself does
no work in our cost model, but the addition it transposes to does work equal to |𝑣¤|, the number
of real scalars in the variable. This is what we needed those corrections for: dup(𝑣¤) produces 2|𝑣¤|
results, while the addition only produces |𝑣¤| results, so the corrected work is conserved. Similar
reasoning applies to TAdd and TLit.
The rule TDrop is where the inequality in work preservation may be strict: we short-circuit
the subexpression 𝑒¤, just emitting zero cotangents for its free linear variables. The resulting work
W[0T(𝜏¤𝑖)] is zero, corrected to |𝑣¤|. To argue work preservation, we must invoke Lemma 4.3 to
conclude that W[drop(𝑒¤)] ≥ |𝑣¤|, and thus lower-bound the corrected work of the pre-transposition
expression at |𝑣¤| as well. □
Proof of Claim 3. Correctness is also an induction on the program and the syntax of expressions.
To give a flavor for the leaf-node behavior of transposition, let’s prove correctness of TLinAdd and
TDup. For type-compatible 𝑣¤1, 𝑣¤2, and 𝑣¥,
⟨(𝑣¤1, 𝑣¤2), dup(𝑣¥)⟩ = ⟨𝑣¤1, 𝑣¥⟩ + ⟨𝑣¤2, 𝑣¥⟩ = ⟨𝑣¤1 + 𝑣¤2, 𝑣¥⟩.
Read from left to right, this is exactly the correctness equation for TDup, and from right to left for
TLinAdd. Note that this simple proof critically depends on the distributivity of multiplication over
addition, also known as the linear factoring rule, which is crucial for the efficiency of reverse-mode
autodiff (see e.g. Brunel et al. [2019]).
Transposition on interior AST nodes is exemplified by TLetLin. To begin the calculation, we
stipulate the disjoint partition of x¤ into x¤ 1 corresponding to Δ¤ 1 and x¤ 2 corresponding to Δ¤ 2 ; and a
similar (but not necessarily disjoint) split of x into x1 and x2 . To simplify notation, we name the
intermediate quantities z¤ , z¥, and x¥ 2 , as, respectively, the forward evaluation of the bound expression
𝑒¤1 , the corresponding backward evaluation of the transposed body 𝑒¥2 , and the residual results of 𝑒¥2
which correspond to the free linear values x¤ 2 :
z¤ = 𝑒¤1 [x1 ; x¤ 1 ]
z¥ ⊎ x¥ 2 = 𝑒¥2 [x2 ; x¥ ].
Then we conclude the correctness proof for TLetLin by calculation:
⟨𝑒¤[x; x¤], x¥⟩ = ⟨𝑒¤2[x2; z¤ ⊎ x¤2], x¥⟩           1-step evaluation of 𝑒¤
               = ⟨z¤ ⊎ x¤2, 𝑒¥2[x2; x¥]⟩            induction on 𝑒¤2
               = ⟨z¤, z¥⟩ + ⟨x¤2, x¥2⟩              partitioning dot product
               = ⟨x¤1, 𝑒¥1[x1; z¥]⟩ + ⟨x¤2, x¥2⟩    induction on 𝑒¤1
               = ⟨x¤, 𝑒¥1[x1; z¥] ⊎ x¥2⟩            unpartitioning dot product
               = ⟨x¤, 𝑒¥[x; x¥]⟩.                   1-step (un)evaluation of 𝑒¥
Other syntactic forms proceed similarly, and more simply. TApp again requires induction on the
whole program being transposed. □
Corollary 7.2. If we transpose an expression 𝑒¤ twice, the result is equivalent to 𝑒¤ and does at most
the same amount of work.
8 CODE SIZE
When implementing practical compilers, the size of the (intermediate) program representation is
also an important consideration alongside its ultimate runtime, because it limits the runtime and
memory cost of the compiler itself. We thus emphasize that our transformations increase code size
only linearly, because they turn function definitions and function calls into new definitions and
calls, without duplicating or inlining function bodies.
We can argue this formally by modeling the size |𝑃 | of a Linear A program 𝑃 simply as “every
node in the grammar from Figure 3 has size 1”. We then deduce
Theorem 8.1. Erasure, differentiation, unzipping, and transposition are size-preserving.
There exist constants 𝑐 L , 𝑐 T , 𝑐 U , 𝑐 J such that for any Linear A program 𝑃,
|L (𝑃)| ≤ 𝑐 L |𝑃 |,
|J (𝑃)| ≤ 𝑐 J |𝑃 |,
|U (𝑃)| ≤ 𝑐 U |𝑃 |, and
|T (𝑃)| ≤ 𝑐 T |𝑃 |,
where L, J , U, and T are the erasure, differentiation, unzipping, and transposition transformations
defined in Sections 4.5, 5, 6, and 7, respectively.
Proof. Mostly by inspection, but there is one subtlety. In rule UApp (Section 6), the type 𝜎 of the
tape of the applied function 𝑓 appears in the emitted let binding. In Linear A as presented, 𝜎 will
have size proportional to the number of non-linear intermediates in 𝑓 , which may be asymptotically
larger than the call site being transformed. This is, however, easy to fix: just augment the type
system with synonyms, and annotate that binding with a synonym for the tape instead. When
function calls are nested, this amounts to retaining structure sharing in the tape types.
The same phenomenon is also why linearity erasure L (Section 4.5) has to introduce dedicated
functions for vector space operations on arbitrary types, instead of inlining them. □
The constant 𝑐 J depends on the set of primitives and their derivatives defined for Linear A, but
𝑐 L , 𝑐 U and 𝑐 T are absolute, given by the size of the expressions constructed by our rules for L, U
and T , respectively. It is necessary to count variable references in program size (even though we
model them with zero runtime cost), because for instance J introduces calls to dup for repeated
references to (non-linear) variables.
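As a rough, informal way to eyeball this in JAX (our own example, and only suggestive: jaxprs inline function bodies, so they do not preserve the definition/call structure the theorem relies on), one can count equations in the jaxpr of a function and of its reverse-mode derivative:

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x

# Count equations in the jaxprs of f and of its reverse-mode derivative.
n_primal = len(jax.make_jaxpr(f)(1.0).jaxpr.eqns)
n_grad = len(jax.make_jaxpr(jax.grad(f))(1.0).jaxpr.eqns)
print(n_primal, n_grad)  # the derivative program is only a small constant factor larger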
9 RELATED WORK
Most of the material presented here can be found in one form or another in existing work, but
the composition of individual pieces is novel. The AD architecture we describe was sketched out
briefly in Frostig et al. [2021] and again in Paszke et al. [2021], and is in use in the JAX and Dex
projects [Bradbury et al. 2018; Frostig et al. 2018]. Transposition as a program transformation
has been previously explored by Piponi [2009]. Unzipping is an instance of partial evaluation—a
technique widely known and explored [Futamura 1983; Jones et al. 1993]—with the difference that
we are interested not only in the residual program, but also in the program formed by operations
that are considered “statically computable”. Automatic differentiation itself has seen rapid growth
in recent years, especially for functional languages, which we attempt to briefly summarize below.
Forward-mode differentiation is significantly simpler to embed in existing programming lan-
guages, for example by means of templates [Piponi 2004], operator overloading [Walther et al.
2003] and Haskell-style typeclasses or ML-style functors [Elliott 2009; Karczmarczuk 1998]. The
true difficulty, both conceptually and implementation-wise, arises for reverse mode.
Initially, reverse-mode autodiff development was focused on imperative languages. In particular,
the implicit reuse of values in the original computation was consistently translated to destructive
mutation. This made higher-order differentiation more difficult, and meant that it generally did
not have a natural non-monadic translation into pure functional languages. This challenge was
overcome by Pearlmutter and Siskind [2008b], with the introduction of reverse mode as an aug-
mentation of the original program with “backpropagator” closures replacing the tape. As shown by
Wang et al. [2019], a similar transformation can be performed by expressing the non-local control
using delimited continuations, manifested by the shift and reset operators. Furthermore, Brunel
et al. [2019] use a similar method to perform provably correct reverse-mode differentiation over
lambda calculus extended with linear negation. Notably, they already make the connection
between linear logic and algebraic linearity that underlies the work presented here.
Finally, Krawiec et al. [2022] outline a provably correct and efficient implementation of AD in a
higher-order language, at a cost of requiring an extra-linguistic monadic translation.
The purpose of the monadic translation of Krawiec et al. [2022] is to stage out a program,
expressing the linearization of primal evaluation at a given point, as the primal is running. But,
since control flow is dependent only on the non-linear values, the staged program never uses
control flow or higher-order functions itself (only the linearization of the branches actually taken
gets inlined), and in fact uses a language almost identical to the linear part of Linear A.
That approach can be factorized through the lens taken in our work. Krawiec et al. [2022]
describe reverse-mode AD in two phases. The first phase simultaneously: traces the metalanguage
into a representation without control flow and higher-order functions (our Linear A); performs
forward mode AD (our J ); and partially evaluates (like our unzipping U, but specializing the
linear function to exact values, instead of separating non-linear compute). In the second phase,
transposition is implemented as an evaluator function that traverses the staged linear program.
Complementing the previous implementation-focused work, multiple explorations of autodiff
semantics have been developed. Many of those give up the asymptotic efficiency guarantees,
which are critical for practical implementations of autodiff, but they do help shed light on the
underlying theory. Elliott [2018] outlines a minimal autodiff system for a language following
category-theoretical foundations. Similarly to our work, he also reasons extensively about the
linearity of derivatives, including the introduction of a linear function type. Abadi and Plotkin
[2019] describe consistent operational and denotational semantics for autodiff in a first-order
language, while Huot et al. [2020] extend it to a higher-order language using diffeology semantics.
Mazza and Pagani [2021] outline semantics for a higher-order language with conditionals that are
consistent with the mathematical notion of differentiation everywhere except on a set of measure zero.
can either store the result on the tape (to minimize work), or emit code in the linear residual
that recomputes it (to reduce memory pressure by the emitted program). The AD community has
devoted considerable attention to this tradeoff under the heading of checkpointing AD; perhaps
there are ideas there that could be ported to general partial evaluation in memory-limited settings.
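In JAX, for example, this store-versus-recompute choice is surfaced to the user as jax.checkpoint (also spelled jax.remat); the three-layer toy function below is a made-up example:

import jax
import jax.numpy as jnp

def layer(x):
    return jnp.tanh(x) * x

# Without checkpointing, each layer's intermediates are saved for the backward
# pass; wrapping a layer in jax.checkpoint asks for them to be recomputed
# instead, trading extra work in the linear part for lower memory pressure.
f_store = lambda x: layer(layer(layer(x)))
f_remat = lambda x: jax.checkpoint(layer)(jax.checkpoint(layer)(jax.checkpoint(layer)(x)))

print(jax.grad(f_store)(0.5), jax.grad(f_remat)(0.5))  # identical gradients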
Array size polymorphism. One of the limitations of Linear A as defined is that the size of
any tuple is syntactically apparent, so it is not possible to write size-polymorphic (i.e., “array
processing”) functions in Linear A. Array-size-polymorphic functions are of course critical in
practice, but are mostly orthogonal to the transformations we introduced here. We leave a full
formalization to future work, with the comment that JAX resolves array size polymorphism at trace
time (in our terms, during metaprogramming of Linear A), whereas Dex handles it by extending
the internal representation with constructs that deal with it directly.
Sparse matrix algorithms. In a different direction, indexed linear functions are a representation
for matrices. Indeed, they are a maximally expressive representation for matrices, one that can encode
any structured (or unstructured) sparsity pattern whatsoever. This statement is not very deep: the
sine qua non of a sparse matrix representation is to be able to compute matrix-vector products 𝑀𝑥;
the function that closes over the matrix 𝑀 and computes that product with a vector 𝑥 it accepts as
argument is algebraically linear; and it seems obvious that any such function can be written to
type-check as linear in 𝑥 in some reasonable variant of Linear A (perhaps extended with control
flow as needed).
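A small JAX sketch of this view (the bidiagonal operator below is a made-up example): the matrix is represented only by code for its matvec, and jax.linear_transpose derives the transposed matvec from that code, without ever materializing the matrix or its transpose:

import jax
import jax.numpy as jnp

# A bidiagonal "matrix" M represented only by code for its matvec:
#   (M x)_0 = x_0,   (M x)_i = x_i - x_{i-1}  for i >= 1.
def matvec(x):
    return x - jnp.pad(x[:-1], (1, 0))

x = jnp.array([1.0, 2.0, 3.0, 4.0])
z = jnp.array([1.0, 1.0, 1.0, 1.0])

# Transpose the matvec code itself; no dense or sparse array for M^T is ever built.
rmatvec = jax.linear_transpose(matvec, x)
(MT_z,) = rmatvec(z)

# <M x, z> == <x, M^T z>
print(jnp.vdot(matvec(x), z), jnp.vdot(x, MT_z))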
From this lens, our contribution is a universal sparsity-preserving transposition operation
for sparse matrix representations (as well as a formalization of the intuition that an automatic
derivative is a sparse representation of the Jacobian of the differentiated function). Are there any
other sparse-matrix operations that can be rendered universal as code transformations on Linear A?
Can sparse-matrix algorithms for, say, matrix multiplication be fruitfully recovered from a known
code transformation (e.g., simplification or partial evaluation) applied to the composition of the
corresponding indexed linear functions? Can new sparse-matrix algorithms or representations be
derived more simply or robustly from this direction?
11 CONCLUSION
We presented a decomposition of reverse-mode automatic differentiation into three distinct program
transformations: forward-mode differentiation, unzipping, and transposition. The interface between
these transformations is a (substructural) linear type system for checking (algebraic) linearity.
Automatic derivatives type-check as linear, unzipping separates the linear and non-linear parts of a
computation, and transposition runs the linear part backward to efficiently compute its transpose.
This decomposition clarifies and simplifies automatic differentiation systems, by separating the
mind-bending direction-reversal needed for reverse-mode AD from the actual differentiating. This
decomposition also sheds light on checkpointing strategies—the decision of whether to save an
intermediate value or recompute it is entirely the province of the unzipping transformation, which
amounts to performing partial evaluation in a way that trades off run-time computation for lower
memory usage. Transposition handles any set of choices mechanically.
12 ACKNOWLEDGEMENTS
The authors would like to thank Pavel Sountsov for the observation that Linear A does not handle
shape polymorphism, Colin Carroll for the use case of simplifying custom gradients, and Gordon
Plotkin for introducing us to linearity, especially the connection between linear algebra and linear
types. We also thank our anonymous reviewers for their remarkably detailed and insightful feedback.
REFERENCES
Martín Abadi and Gordon D. Plotkin. 2019. A Simple Differentiable Programming Language. Proc. ACM Program. Lang. 4,
POPL, Article 38 (dec 2019), 28 pages. https://doi.org/10.1145/3371106
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula,
Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of
Python+NumPy programs. Google. http://github.com/google/jax
Aloïs Brunel, Damiano Mazza, and Michele Pagani. 2019. Backpropagation in the Simply Typed Lambda-Calculus with
Linear Negation. Proc. ACM Program. Lang. 4, POPL, Article 64 (dec 2019), 27 pages. https://doi.org/10.1145/3371132
Conal Elliott. 2018. The Simple Essence of Automatic Differentiation. Proc. ACM Program. Lang. 2, ICFP, Article 70 (jul
2018), 29 pages. https://doi.org/10.1145/3236765
Conal M. Elliott. 2009. Beautiful Differentiation. In Proceedings of the 14th ACM SIGPLAN International Conference on
Functional Programming (Edinburgh, Scotland) (ICFP ’09). Association for Computing Machinery, New York, NY, USA,
191–202. https://doi.org/10.1145/1596550.1596579
Roy Frostig, Matthew Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. In
Machine Learning and Systems (MLSys). https://mlsys.org/Conferences/doc/2018/146.pdf
Roy Frostig, Matthew J. Johnson, Dougal Maclaurin, Adam Paszke, and Alexey Radul. 2021. Decomposing reverse-mode
automatic differentiation. In LAFI: POPL workshop on Languages for Inference. Association for Computing Machinery,
New York, NY, USA. https://arxiv.org/abs/2105.09469
Yoshihiko Futamura. 1983. Partial computation of programs. In RIMS Symposia on Software Science and Engineering, Eiichi
Goto, Koichi Furukawa, Reiji Nakajima, Ikuo Nakata, and Akinori Yonezawa (Eds.). Springer Berlin Heidelberg, Berlin,
Heidelberg, 1–35. https://doi.org/10.1007/3-540-11980-9_13
Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation
(second ed.). Society for Industrial and Applied Mathematics, USA. https://doi.org/10.1137/1.9780898717761
Mathieu Huot, Sam Staton, and Matthijs Vákár. 2020. Correctness of Automatic Differentiation via Diffeologies and
Categorical Gluing. In Foundations of Software Science and Computation Structures - 23rd International Conference,
FOSSACS 2020, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2020, Dublin,
Ireland, April 25-30, 2020, Proceedings (Lecture Notes in Computer Science, Vol. 12077), Jean Goubault-Larrecq and Barbara
König (Eds.). Springer, 319–338. https://doi.org/10.1007/978-3-030-45231-5_17
Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. 1993. Partial Evaluation and Automatic Program Generation. Prentice-Hall,
Inc., USA.
Jerzy Karczmarczuk. 1998. Functional Differentiation of Computer Programs. In Proceedings of the Third ACM SIGPLAN
International Conference on Functional Programming (Baltimore, Maryland, USA) (ICFP ’98). Association for Computing
Machinery, New York, NY, USA, 195–203. https://doi.org/10.1145/289423.289442
Faustyna Krawiec, Simon Peyton Jones, Neel Krishnaswami, Tom Ellis, Richard A. Eisenberg, and Andrew Fitzgibbon. 2022.
Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation. Proc. ACM Program.
Lang. 6, POPL, Article 48 (jan 2022), 30 pages. https://doi.org/10.1145/3498710
Damiano Mazza and Michele Pagani. 2021. Automatic Differentiation in PCF. Proc. ACM Program. Lang. 5, POPL, Article 28
(jan 2021), 27 pages. https://doi.org/10.1145/3434309
Adam Paszke, Daniel D. Johnson, David Duvenaud, Dimitrios Vytiniotis, Alexey Radul, Matthew J. Johnson, Jonathan Ragan-
Kelley, and Dougal Maclaurin. 2021. Getting to the Point: Index Sets and Parallelism-Preserving Autodiff for Pointful
Array Programming. Proc. ACM Program. Lang. 5, ICFP, Article 88 (aug 2021), 29 pages. https://doi.org/10.1145/3473593
Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008a. Reverse-mode AD in a functional framework: Lambda the ultimate
backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS) 30, 2 (2008), 1–36. https://doi.org/10.1145/1330017.1330018
Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008b. Reverse-Mode AD in a Functional Framework: Lambda the Ultimate
Backpropagator. ACM Trans. Program. Lang. Syst. 30, 2, Article 7 (mar 2008), 36 pages. https://doi.org/10.1145/1330017.1330018
Dan Piponi. 2004. Automatic Differentiation, C++ Templates, and Photogrammetry. Journal of Graphics Tools 9, 4 (2004),
41–55. https://doi.org/10.1080/10867651.2004.10504901
Dan Piponi. 2009. Two Tricks for the Price of One: Linear Filters and Their Transposes. J. Graphics, GPU, & Game Tools 14,
1 (2009), 63–72. https://doi.org/10.1080/2151237X.2009.10129275
Amr Sabry and Matthias Felleisen. 1992. Reasoning about Programs in Continuation-Passing Style. SIGPLAN Lisp Pointers
V, 1 (jan 1992), 288–298. https://doi.org/10.1145/141478.141563
Andrea Walther, Andreas Griewank, and Olaf Vogel. 2003. ADOL-C: Automatic Differentiation Using Operator Overloading
in C++. PAMM 2, 1 (2003), 41–44. https://doi.org/10.1002/pamm.200310011
Fei Wang, Daniel Zheng, James Decker, Xilun Wu, Grégory M. Essertel, and Tiark Rompf. 2019. Demystifying Differentiable
Programming: Shift/Reset the Penultimate Backpropagator. Proc. ACM Program. Lang. 3, ICFP, Article 96 (jul 2019),
31 pages. https://doi.org/10.1145/3341700