Derivatives, Backpropagation, and Vectorization
Justin Johnson
September 6, 2017
1 Derivatives
1.1 Scalar Case
You are probably familiar with the concept of a derivative in the scalar case:
given a function f : R → R, the derivative of f at a point x ∈ R is defined as:
\[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \]
Derivatives are a way to measure change. In the scalar case, the derivative
of the function f at the point x tells us how much the function f changes as the
input x changes by a small amount ε:
\[ f(x + \epsilon) \approx f(x) + \epsilon f'(x) \]
For ease of notation we will commonly assign a name to the output of f, say y = f(x), and write ∂y/∂x for the derivative of y with respect to x. This notation emphasizes that ∂y/∂x is the rate of change between the variables x and y; concretely, if x were to change by ε then y would change by approximately ε ∂y/∂x.
We can write this relationship as
\[ x \to x + \Delta x \;\Longrightarrow\; y \to \approx y + \frac{\partial y}{\partial x} \Delta x \]
You should read this as saying “changing x to x + ∆x implies that y will change to approximately y + (∂y/∂x) ∆x”. This notation is nonstandard, but I like it since it emphasizes the relationship between changes in x and changes in y.
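To make this concrete, here is a minimal numerical sketch (my own example, not from the notes) using f(x) = x², for which f′(x) = 2x:

# Hypothetical example: check f(x + dx) ~= f(x) + dx * f'(x) for f(x) = x**2.
def f(x):
    return x ** 2

def df(x):
    return 2 * x          # derivative of f

x, dx = 3.0, 1e-4
exact = f(x + dx)                 # true value after the small change
approx = f(x) + dx * df(x)        # first-order approximation from the derivative
print(exact, approx)              # both ~9.0006; they agree to first order in dx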
The chain rule tells us how to compute the derivative of the composition of functions. In the scalar case suppose that f, g : R → R and y = f(x), z = g(y); then we can also write z = (g ◦ f)(x), or draw the following computational graph:
\[ x \xrightarrow{\;f\;} y \xrightarrow{\;g\;} z \]
The (scalar) chain rule tells us that
\[ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \]
This equation makes intuitive sense. The derivatives ∂z/∂y and ∂y/∂x give:
\[ x \to x + \Delta x \;\Longrightarrow\; y \to \approx y + \frac{\partial y}{\partial x} \Delta x \]
\[ y \to y + \Delta y \;\Longrightarrow\; z \to \approx z + \frac{\partial z}{\partial y} \Delta y \]
Combining these two rules lets us compute the effect of x on z: if x changes by ∆x then y will change by (∂y/∂x) ∆x, so we have ∆y = (∂y/∂x) ∆x. If y changes by ∆y then z will change by (∂z/∂y) ∆y = (∂z/∂y)(∂y/∂x) ∆x, which is exactly what the chain rule tells us.
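Here is a small sketch (again my own example) that checks this factorization numerically, using the made-up functions f(x) = x² and g(y) = sin(y):

import math

# Hypothetical sketch: verify dz/dx = (dz/dy)(dy/dx) for y = x**2, z = sin(y).
def f(x): return x ** 2
def g(y): return math.sin(y)

x = 0.7
y = f(x)

dy_dx = 2 * x            # derivative of f
dz_dy = math.cos(y)      # derivative of g, evaluated at y = f(x)
chain = dz_dy * dy_dx    # chain rule: dz/dx

h = 1e-6                 # central finite difference for comparison
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)
print(chain, numeric)    # the two values agree very closely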
1.2 Gradient: Vector in, Scalar out
Now suppose that f : R^N → R takes a vector as input and produces a scalar. The derivative of f at a point x ∈ R^N is now called the gradient, and is defined as:
\[ \nabla_x f(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{\|h\|} \]
Now the gradient ∇_x f(x) ∈ R^N is a vector, with the same intuition as the scalar case. If we set y = f(x) then we have the relationship
\[ x \to x + \Delta x \;\Longrightarrow\; y \to \approx y + \frac{\partial y}{\partial x} \cdot \Delta x \]
The formula changes a bit from the scalar case to account for the fact that x, ∆x, and ∂y/∂x are now vectors in R^N while y is a scalar. In particular, when multiplying ∂y/∂x by ∆x we use the dot product, which combines two vectors to give a scalar.
One nice outcome of this formula is that it gives meaning to the individual elements of the gradient ∂y/∂x. Suppose that ∆x is the ith basis vector, so that the ith coordinate of ∆x is 1 and all other coordinates of ∆x are 0. Then the dot product (∂y/∂x) · ∆x is simply the ith coordinate of ∂y/∂x; thus the ith coordinate of ∂y/∂x tells us the approximate amount by which y will change if we move x along the ith coordinate axis.
This means that we can also view the gradient ∂y/∂x as a vector of partial derivatives:
\[ \frac{\partial y}{\partial x} = \left( \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_N} \right) \]
where x_i is the ith coordinate of the vector x, which is a scalar, so each partial derivative ∂y/∂x_i is also a scalar.
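The dot-product relationship above is easy to check numerically. Below is a minimal sketch assuming the made-up function f(x) = Σᵢ xᵢ², whose gradient is 2x:

import numpy as np

# Hypothetical sketch: for f : R^N -> R, check y(x + dx) ~= y(x) + grad . dx.
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2 * x                         # gradient of sum of squares

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
dx = 1e-4 * rng.standard_normal(5)       # a small perturbation of x

exact = f(x + dx)
approx = f(x) + grad_f(x).dot(dx)        # dot product of the gradient with dx
print(exact - approx)                    # difference is O(|dx|^2), i.e. tiny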
1.3 Jacobian: Vector in, Vector out
Now suppose that f : R^N → R^M takes a vector as input and produces a vector
as output. Then the derivative of f at a point x, also called the Jacobian, is
the M × N matrix of partial derivatives. If we again set y = f (x) then we can
write:
\[
\frac{\partial y}{\partial x} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_M}{\partial x_1} & \cdots & \frac{\partial y_M}{\partial x_N}
\end{pmatrix}
\]
The Jacobian tells us the relationship between each element of x and each element of y: the (i, j)-th element of ∂y/∂x is equal to ∂y_i/∂x_j, so it tells us the amount by which y_i will change if x_j is changed by a small amount.
Just as in the previous cases, the Jacobian tells us the relationship between
changes in the input and changes in the output:
\[ x \to x + \Delta x \;\Longrightarrow\; y \to \approx y + \frac{\partial y}{\partial x} \Delta x \]
Here ∂y/∂x is an M × N matrix and ∆x is an N-dimensional vector, so the product (∂y/∂x) ∆x is a matrix-vector multiplication resulting in an M-dimensional vector.
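As a concrete check of this matrix-vector relationship, here is a small sketch with a made-up function f : R² → R² whose Jacobian we can write down by hand:

import numpy as np

# Hypothetical sketch: build the Jacobian of f(x) = (x0 * x1, x0 + x1) by hand
# and check that y(x + dx) ~= y(x) + J @ dx.
def f(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def jacobian(x):
    # Row i holds the partial derivatives of y_i with respect to x_0, x_1.
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

x = np.array([1.5, -0.3])
dx = np.array([1e-4, 2e-4])

exact = f(x + dx)
approx = f(x) + jacobian(x) @ dx     # matrix-vector product with the Jacobian
print(exact - approx)                # each entry is O(|dx|^2)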
The chain rule can be extended to the vector case using Jacobian matrices.
Suppose that f : R^N → R^M and g : R^M → R^K. Let x ∈ R^N, y ∈ R^M, and z ∈ R^K with y = f(x) and z = g(y), so we have the same computational graph as the scalar case:
\[ x \xrightarrow{\;f\;} y \xrightarrow{\;g\;} z \]
The chain rule also has the same form as the scalar case:
\[ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \]
However now each of these terms is a matrix: ∂z/∂y is a K × M matrix, ∂y/∂x is an M × N matrix, and ∂z/∂x is a K × N matrix; the multiplication of ∂z/∂y and ∂y/∂x is matrix multiplication.
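The sketch below (my own example, with a made-up linear f and elementwise g) forms both Jacobians explicitly and checks that their matrix product matches a finite-difference Jacobian of the composition:

import numpy as np

# Hypothetical sketch: verify dz/dx = (dz/dy) @ (dy/dx) for y = A @ x and
# z = y**2 (elementwise), using explicit Jacobian matrices.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))          # f : R^4 -> R^3
x = rng.standard_normal(4)
y = A @ x
z = y ** 2                               # g : R^3 -> R^3

J_yx = A                                 # Jacobian of f: shape (3, 4)
J_zy = np.diag(2 * y)                    # Jacobian of g: shape (3, 3)
J_zx = J_zy @ J_yx                       # chain rule: shape (3, 4)

# Compare against a finite-difference Jacobian of the composition g(f(x)).
h = 1e-6
J_num = np.zeros((3, 4))
for j in range(4):
    e = np.zeros(4); e[j] = h
    J_num[:, j] = ((A @ (x + e)) ** 2 - (A @ (x - e)) ** 2) / (2 * h)
print(np.max(np.abs(J_zx - J_num)))      # ~1e-9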
1.4 Generalized Jacobian: Tensor in, Tensor out
Just as a vector is a one-dimensional list of numbers and a matrix is a two-dimensional grid of numbers, a tensor is a D-dimensional grid of numbers.
Many operations in deep learning accept tensors as inputs and produce
tensors as outputs. For example an image is usually represented as a three-
dimensional grid of numbers, where the three dimensions correspond to the
height, width, and color channels (red, green, blue) of the image. We must
therefore develop a derivative that is compatible with functions operating on
general tensors.
Suppose now that f : R^{N_1 × ··· × N_{D_x}} → R^{M_1 × ··· × M_{D_y}}. Then the input to f is a D_x-dimensional tensor of shape N_1 × · · · × N_{D_x}, and the output of f is a D_y-dimensional tensor of shape M_1 × · · · × M_{D_y}. If y = f(x) then the derivative ∂y/∂x is a generalized Jacobian, which is an object with shape
\[ (M_1 \times \cdots \times M_{D_y}) \times (N_1 \times \cdots \times N_{D_x}) \]
Just as before, the generalized Jacobian gives the relationship between changes in the input and changes in the output: if x changes by ∆x then y changes by approximately (∂y/∂x) ∆x, where the product is now a generalized matrix-vector multiply. It follows the same algebraic rules as an ordinary matrix-vector multiply:
\[ \left( \frac{\partial y}{\partial x} \Delta x \right)_j = \sum_i \left( \frac{\partial y}{\partial x} \right)_{j,i} (\Delta x)_i = \left( \frac{\partial y}{\partial x} \right)_{j,:} \cdot \Delta x \]
Here i and j range over index tuples into x and y respectively, and (∂y/∂x)_{j,:} is the slice of the generalized Jacobian indexed by j, which has the same shape as x.
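One convenient way to picture this is to store the generalized Jacobian as an array whose shape is the output shape followed by the input shape; the generalized matrix-vector multiply is then a sum over the input axes. The sketch below is my own illustration with a made-up linear map (W, f, and all shapes are assumptions, not from the notes):

import numpy as np

# Hypothetical sketch: a generalized Jacobian stored as an ndarray whose shape
# is (output shape) x (input shape), and the generalized matrix-vector
# multiply implemented with tensordot.
rng = np.random.default_rng(2)
W = rng.standard_normal((4, 6))

def f(x):                      # x has shape (2, 3); output has shape (4,)
    return W @ x.reshape(6)

x = rng.standard_normal((2, 3))
J = W.reshape(4, 2, 3)         # generalized Jacobian: shape (4) x (2 x 3)

dx = 1e-4 * rng.standard_normal((2, 3))
# Generalized matrix-vector multiply: sum over the input axes of J against dx.
dy = np.tensordot(J, dx, axes=([1, 2], [0, 1]))   # shape (4,)
print(f(x + dx) - (f(x) + dy))                     # ~0, since f is linear here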
2 Backpropagation with Tensors
Suppose now that a layer of a neural network computes y = f(x, w), where x is the input to the layer, w is the layer's weights, and y is passed on to the rest of the network, which eventually produces a scalar loss L. During backpropagation we receive the upstream gradient ∂L/∂y, and we need to compute the downstream gradients ∂L/∂x and ∂L/∂w. By the chain rule, ∂L/∂x = (∂L/∂y)(∂y/∂x) and ∂L/∂w = (∂L/∂y)(∂y/∂w), where each product is a generalized matrix multiply of the appropriate Jacobians.
However, there's a problem with this approach: the Jacobian matrices ∂y/∂x and ∂y/∂w are typically far too large to fit in memory.
As a concrete example, let’s suppose that f is a linear layer that takes as
input a minibatch of N vectors, each of dimension D, and produces a minibatch
of N vectors, each of dimension M . Then x is a matrix of shape N × D, w is a
matrix of shape D × M , and y = f (x, w) = xw is a matrix of shape N × M .
The Jacobian ∂y/∂x then has shape (N × M) × (N × D). In a typical neural network we might have N = 64 and M = D = 4096; then ∂y/∂x consists of 64 · 4096 · 64 · 4096 scalar values; this is more than 68 billion numbers; using 32-bit floating point, this Jacobian matrix will take 256 GB of memory to store. Therefore it is completely hopeless to try and explicitly store and manipulate the Jacobian matrix.
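The quoted numbers are easy to verify (simple arithmetic, not library code):

# Sanity check of the figures above.
N, M, D = 64, 4096, 4096
entries = (N * M) * (N * D)           # elements in the (N x M) x (N x D) Jacobian
bytes_fp32 = entries * 4              # 4 bytes per 32-bit float
print(entries)                        # 68,719,476,736  (~68.7 billion)
print(bytes_fp32 / 2**30)             # 256.0 GiB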
However it turns out that for most common neural network layers, we can derive expressions that compute the product (∂L/∂y)(∂y/∂x) without explicitly forming the Jacobian ∂y/∂x. Even better, we can typically derive this expression without even computing an explicit expression for the Jacobian ∂y/∂x; in many cases we can work out a small case on paper and then infer the general formula.
Let’s see how this works out for the case of the linear layer f (x, w) = xw.
Set N = 1, D = 2, M = 3. Then we can explicitly write
\[ y = \begin{pmatrix} y_{1,1} & y_{1,2} & y_{1,3} \end{pmatrix} = xw \tag{1} \]
\[ = \begin{pmatrix} x_{1,1} & x_{1,2} \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \tag{2} \]
\[ = \begin{pmatrix} x_{1,1} w_{1,1} + x_{1,2} w_{2,1} & x_{1,1} w_{1,2} + x_{1,2} w_{2,2} & x_{1,1} w_{1,3} + x_{1,2} w_{2,3} \end{pmatrix} \tag{3} \]
By the chain rule, we can compute the derivative of the loss with respect to each element of x:
\[ \frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_{1,1}} \tag{4} \]
\[ \frac{\partial L}{\partial x_{1,2}} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x_{1,2}} \tag{5} \]
Viewing these derivatives as generalized matrices, ∂L/∂y has shape (1) × (N × M) and ∂y/∂x_{1,1} has shape (N × M) × (1); their product ∂L/∂x_{1,1} then has shape (1) × (1). If we instead view ∂L/∂y and ∂y/∂x_{1,1} as matrices of shape N × M, then their generalized matrix product is simply the dot product ∂L/∂y · ∂y/∂x_{1,1}.
Now we compute
\[ \frac{\partial y}{\partial x_{1,1}} = \begin{pmatrix} \frac{\partial y_{1,1}}{\partial x_{1,1}} & \frac{\partial y_{1,2}}{\partial x_{1,1}} & \frac{\partial y_{1,3}}{\partial x_{1,1}} \end{pmatrix} = \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \end{pmatrix} \tag{6} \]
\[ \frac{\partial y}{\partial x_{1,2}} = \begin{pmatrix} \frac{\partial y_{1,1}}{\partial x_{1,2}} & \frac{\partial y_{1,2}}{\partial x_{1,2}} & \frac{\partial y_{1,3}}{\partial x_{1,2}} \end{pmatrix} = \begin{pmatrix} w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \tag{7} \]
where the final equalities come from taking the derivatives of Equation 3 with respect to x_{1,1} and x_{1,2}.
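Equations 6 and 7 are easy to confirm numerically for this tiny case; the sketch below (my own check, with random x and w) builds ∂y/∂x_{1,1} and ∂y/∂x_{1,2} by finite differences and compares them to the rows of w:

import numpy as np

# Hypothetical sketch for N=1, D=2, M=3: dy/dx_{1,i} should equal row i of w.
rng = np.random.default_rng(3)
x = rng.standard_normal((1, 2))
w = rng.standard_normal((2, 3))

h = 1e-6
for i in range(2):                    # perturb x[0, i]
    e = np.zeros_like(x); e[0, i] = h
    dy_dxi = ((x + e) @ w - (x - e) @ w) / (2 * h)   # shape (1, 3)
    print(dy_dxi - w[i])              # ~0: matches row i of w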
We can now combine these results and write (using dy_{1,j} as shorthand for ∂L/∂y_{1,j}):
\[ \frac{\partial L}{\partial x_{1,1}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,1}} = dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3} \tag{8} \]
\[ \frac{\partial L}{\partial x_{1,2}} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x_{1,2}} = dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3} \tag{9} \]
This gives us our final expression for ∂L/∂x:
\[ \frac{\partial L}{\partial x} = \begin{pmatrix} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \end{pmatrix} \tag{10} \]
\[ = \begin{pmatrix} dy_{1,1} w_{1,1} + dy_{1,2} w_{1,2} + dy_{1,3} w_{1,3} & dy_{1,1} w_{2,1} + dy_{1,2} w_{2,2} + dy_{1,3} w_{2,3} \end{pmatrix} \tag{11} \]
\[ = \frac{\partial L}{\partial y} \, w^T \tag{12} \]
This final result ∂L/∂x = (∂L/∂y) w^T is very interesting because it allows us to efficiently compute ∂L/∂x without explicitly forming the Jacobian ∂y/∂x. We have only derived this formula for the specific case of N = 1, D = 2, M = 3, but it in fact holds in general.
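To build some confidence in the general formula, here is a sketch that checks ∂L/∂x = (∂L/∂y) wᵀ for a full-sized minibatch against a numerical gradient; the scalar loss L = Σ y² and all shapes are made up for the test:

import numpy as np

# Hypothetical sketch: check dL/dx = (dL/dy) @ w.T against a numerical gradient.
rng = np.random.default_rng(4)
N, D, M = 4, 5, 3
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))

def loss(x):
    y = x @ w
    return np.sum(y ** 2)

dy = 2 * (x @ w)                      # upstream gradient dL/dy for this loss
dx = dy @ w.T                         # the formula derived above: shape (N, D)

# Numerical gradient of L with respect to each element of x.
h = 1e-6
dx_num = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    e = np.zeros_like(x); e[idx] = h
    dx_num[idx] = (loss(x + e) - loss(x - e)) / (2 * h)
print(np.max(np.abs(dx - dx_num)))    # ~1e-8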
By a similar thought process we can derive an expression for ∂L/∂w without explicitly forming the Jacobian ∂y/∂w. You should try to work through this as an exercise.
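If you want to check your answer, the harness below (my own setup, reusing the made-up loss L = Σ y² from the previous sketch so that ∂L/∂y = 2y) computes a numerical ∂L/∂w you can compare against whatever expression you derive:

import numpy as np

# Hypothetical gradient-check harness for your derived dL/dw expression.
rng = np.random.default_rng(5)
N, D, M = 4, 5, 3
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))

def loss(w):
    return np.sum((x @ w) ** 2)       # made-up scalar loss

def numeric_dw(h=1e-6):
    # Central finite differences with respect to each element of w.
    dw = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        e = np.zeros_like(w); e[idx] = h
        dw[idx] = (loss(w + e) - loss(w - e)) / (2 * h)
    return dw

dy = 2 * (x @ w)                      # upstream gradient dL/dy
# Compare whatever expression you derive from x and dy against numeric_dw():
print(numeric_dw().shape)             # (D, M): your dL/dw must match the shape of w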