Derivatives, Backpropagation, and Vectorization

Justin Johnson
September 6, 2017

1 Derivatives
1.1 Scalar Case
You are probably familiar with the concept of a derivative in the scalar case:
given a function f : R → R, the derivative of f at a point x ∈ R is defined as:

    f'(x) = lim_{h→0} (f(x + h) − f(x)) / h

Derivatives are a way to measure change. In the scalar case, the derivative
of the function f at the point x tells us how much the function f changes as the
input x changes by a small amount ε:

    f(x + ε) ≈ f(x) + ε f'(x)

For ease of notation we will commonly assign a name to the output of f,
say y = f(x), and write ∂y/∂x for the derivative of y with respect to x. This
notation emphasizes that ∂y/∂x is the rate of change between the variables x and
y; concretely if x were to change by ε then y will change by approximately ε ∂y/∂x.
We can write this relationship as

    x → x + Δx   =⇒   y → ≈ y + (∂y/∂x) Δx

You should read this as saying “changing x to x + Δx implies that y will
change to approximately y + (∂y/∂x) Δx”. This notation is nonstandard, but I like
it since it emphasizes the relationship between changes in x and changes in y.
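If you want to see this relationship numerically, here is a minimal Python sketch; the function f(x) = x² and the specific numbers are arbitrary choices for illustration, not part of the notes:

    def f(x):
        return x ** 2              # an arbitrary example function; f'(x) = 2x

    x, dx = 3.0, 1e-4              # a point and a small change to it
    dydx = 2 * x                   # analytic derivative of f at x

    exact_change = f(x + dx) - f(x)
    approx_change = dydx * dx      # first-order prediction: Δy ≈ (∂y/∂x) Δx
    print(exact_change, approx_change)   # 0.000600010... vs 0.0006

As dx shrinks, the exact change and the first-order prediction agree to more and more digits.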
The chain rule tells us how to compute the derivative of the composition of
functions. In the scalar case suppose that f, g : R → R and y = f(x), z = g(y);
then we can also write z = (g ∘ f)(x), or draw the following computational graph:

    x  --f-->  y  --g-->  z

The (scalar) chain rule tells us that

    ∂z/∂x = (∂z/∂y) (∂y/∂x)

This equation makes intuitive sense. The derivatives ∂z/∂y and ∂y/∂x give:

    x → x + Δx   =⇒   y → ≈ y + (∂y/∂x) Δx
    y → y + Δy   =⇒   z → ≈ z + (∂z/∂y) Δy

Combining these two rules lets us compute the effect of x on z: if x changes
by Δx then y will change by (∂y/∂x) Δx, so we have Δy = (∂y/∂x) Δx. If y changes by
Δy then z will change by (∂z/∂y) Δy = (∂z/∂y)(∂y/∂x) Δx which is exactly what the chain rule
tells us.
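A small Python sketch makes this bookkeeping concrete; the functions f(x) = x² and g(y) = sin(y) are arbitrary choices used only for illustration:

    import math

    def f(x): return x ** 2            # y = f(x), so ∂y/∂x = 2x
    def g(y): return math.sin(y)       # z = g(y), so ∂z/∂y = cos(y)

    x = 1.5
    y = f(x)

    # Chain rule: ∂z/∂x = (∂z/∂y) (∂y/∂x)
    dzdx_chain = math.cos(y) * (2 * x)

    # Centered finite-difference estimate of ∂z/∂x for comparison
    h = 1e-5
    dzdx_numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)
    print(dzdx_chain, dzdx_numeric)    # the two values agree closely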

1.2 Gradient: Vector in, Scalar out


This same intuition carries over into the vector case. Now suppose that f :
R^N → R takes a vector as input and produces a scalar. The derivative of f at
the point x ∈ R^N is now called the gradient, and it is defined as:

    ∇_x f(x) = lim_{h→0} (f(x + h) − f(x)) / ‖h‖

Now the gradient ∇_x f(x) ∈ R^N is a vector, with the same intuition as the
scalar case. If we set y = f(x) then we have the relationship

    x → x + Δx   =⇒   y → ≈ y + (∂y/∂x) · Δx

The formula changes a bit from the scalar case to account for the fact that
x, Δx, and ∂y/∂x are now vectors in R^N while y is a scalar. In particular when
multiplying ∂y/∂x by Δx we use the dot product, which combines two vectors to
give a scalar.
One nice outcome of this formula is that it gives meaning to the individual
elements of the gradient ∂y/∂x. Suppose that Δx is the ith basis vector, so that
the ith coordinate of Δx is 1 and all other coordinates of Δx are 0. Then the dot
product (∂y/∂x) · Δx is simply the ith coordinate of ∂y/∂x; thus the ith coordinate
of ∂y/∂x tells us the approximate amount by which y will change if we move x
along the ith coordinate axis.
This means that we can also view the gradient ∂y/∂x as a vector of partial
derivatives:

    ∂y/∂x = ( ∂y/∂x_1, ∂y/∂x_2, . . . , ∂y/∂x_N )

where x_i is the ith coordinate of the vector x, which is a scalar, so each
partial derivative ∂y/∂x_i is also a scalar.
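A minimal numerical sketch shows both interpretations; the choice f(x) = Σ_i x_i², whose gradient is 2x, is arbitrary and used only for illustration:

    import numpy as np

    def f(x):
        return np.sum(x ** 2)          # f : R^N → R, with gradient 2x

    x = np.array([1.0, -2.0, 3.0])
    grad = 2 * x                       # analytic gradient ∂y/∂x

    # Moving along the ith coordinate axis changes y by about grad[i] * eps
    eps = 1e-4
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        print(f(x + dx) - f(x), grad[i] * eps)   # nearly equal

    # For a general small step, the change in y is approximately grad · Δx
    dx = np.array([1e-4, 2e-4, -1e-4])
    print(f(x + dx) - f(x), grad.dot(dx))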

1.3 Jacobian: Vector in, Vector out
Now suppose that f : R^N → R^M takes a vector as input and produces a vector
as output. Then the derivative of f at a point x, also called the Jacobian, is
the M × N matrix of partial derivatives. If we again set y = f(x) then we can
write:

    ∂y/∂x = [ ∂y_1/∂x_1  · · ·  ∂y_1/∂x_N ]
            [     ⋮        ⋱        ⋮     ]
            [ ∂y_M/∂x_1  · · ·  ∂y_M/∂x_N ]

The Jacobian tells us the relationship between each element of x and each
element of y: the (i, j)-th element of ∂y/∂x is equal to ∂y_i/∂x_j, so it tells us the
amount by which y_i will change if x_j is changed by a small amount.
Just as in the previous cases, the Jacobian tells us the relationship between
changes in the input and changes in the output:

    x → x + Δx   =⇒   y → ≈ y + (∂y/∂x) Δx

Here ∂y/∂x is an M × N matrix and Δx is an N-dimensional vector, so the
product (∂y/∂x) Δx is a matrix-vector multiplication resulting in an M-dimensional
vector.
The chain rule can be extended to the vector case using Jacobian matrices.
Suppose that f : R^N → R^M and g : R^M → R^K. Let x ∈ R^N, y ∈ R^M, and
z ∈ R^K with y = f(x) and z = g(y), so we have the same computational graph
as the scalar case:

    x  --f-->  y  --g-->  z

The chain rule also has the same form as the scalar case:

    ∂z/∂x = (∂z/∂y) (∂y/∂x)

However now each of these terms is a matrix: ∂z/∂y is a K × M matrix, ∂y/∂x is
an M × N matrix, and ∂z/∂x is a K × N matrix; the multiplication of ∂z/∂y and
∂y/∂x is matrix multiplication.
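The sketch below checks this numerically for a pair of arbitrary example maps f : R^2 → R^3 and g : R^3 → R^2 with hand-computed Jacobians (none of these particular functions come from the notes); the matrix product of the two Jacobians matches a finite-difference Jacobian of the composition:

    import numpy as np

    # Arbitrary example maps:
    #   f : R^2 → R^3,  f(x) = (x0*x1, x0 + x1, x0**2)
    #   g : R^3 → R^2,  g(y) = (y0 + 2*y1, y1*y2)
    def f(x): return np.array([x[0] * x[1], x[0] + x[1], x[0] ** 2])
    def g(y): return np.array([y[0] + 2 * y[1], y[1] * y[2]])

    def jac_f(x):                      # 3 x 2 Jacobian of f
        return np.array([[x[1], x[0]],
                         [1.0, 1.0],
                         [2 * x[0], 0.0]])

    def jac_g(y):                      # 2 x 3 Jacobian of g
        return np.array([[1.0, 2.0, 0.0],
                         [0.0, y[2], y[1]]])

    x = np.array([1.5, -0.5])
    y = f(x)

    # Chain rule: ∂z/∂x = (∂z/∂y)(∂y/∂x), a (2x3)(3x2) = 2x2 matrix product
    dzdx = jac_g(y) @ jac_f(x)

    # Finite-difference Jacobian of the composition g(f(x)) for comparison
    h = 1e-6
    numeric = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = h
        numeric[:, j] = (g(f(x + e)) - g(f(x - e))) / (2 * h)

    print(np.max(np.abs(dzdx - numeric)))    # tiny, on the order of 1e-9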

1.4 Generalized Jacobian: Tensor in, Tensor out
Just as a vector is a one-dimensional list of numbers and a matrix is a two-
dimensional grid of numbers, a tensor is a D-dimensional grid of numbers.¹
Many operations in deep learning accept tensors as inputs and produce
tensors as outputs. For example an image is usually represented as a three-
dimensional grid of numbers, where the three dimensions correspond to the
height, width, and color channels (red, green, blue) of the image. We must
therefore develop a derivative that is compatible with functions operating on
general tensors.

¹ The word tensor is used in different ways in different fields; you may have seen the
term before in physics or abstract algebra. The machine learning definition of a tensor
as a D-dimensional grid of numbers is closely related to the definitions of tensors in
these other fields.
Suppose now that f : R^{N_1 × · · · × N_{Dx}} → R^{M_1 × · · · × M_{Dy}}. Then the input to f
is a Dx-dimensional tensor of shape N_1 × · · · × N_{Dx}, and the output of f is a
Dy-dimensional tensor of shape M_1 × · · · × M_{Dy}. If y = f(x) then the derivative
∂y/∂x is a generalized Jacobian, which is an object with shape

    (M_1 × · · · × M_{Dy}) × (N_1 × · · · × N_{Dx})

Note that we have separated the dimensions of ∂y/∂x into two groups: the
first group matches the dimensions of y and the second group matches the
dimensions of x. With this grouping, we can think of the generalized Jacobian
as a generalization of a matrix, where the “rows” are indexed by the dimensions
of y and the “columns” are indexed by the dimensions of x.
Now if we let i ∈ Z^{Dy} and j ∈ Z^{Dx} be vectors of integer indices, then we
can write

    (∂y/∂x)_{i,j} = ∂y_i / ∂x_j

In this equation note that y_i and x_j are scalars, so the derivative ∂y_i/∂x_j is
also a scalar. Using this notation we see that like the standard Jacobian, the
generalized Jacobian tells us the relative rates of change between all elements
of x and all elements of y.
The generalized Jacobian gives the same relationship between inputs and
outputs as before:

    x → x + Δx   =⇒   y → ≈ y + (∂y/∂x) Δx

The difference is that now Δx is a tensor of shape N_1 × · · · × N_{Dx} and ∂y/∂x
is a generalized matrix of shape (M_1 × · · · × M_{Dy}) × (N_1 × · · · × N_{Dx}). The
product (∂y/∂x) Δx is therefore a generalized matrix-vector multiply, which results
in a tensor of shape M_1 × · · · × M_{Dy}.
The generalized matrix-vector multiply follows the same algebraic rules as a
traditional matrix-vector multiply:

    ( (∂y/∂x) Δx )_i = Σ_j (∂y/∂x)_{i,j} (Δx)_j = (∂y/∂x)_{i,:} · Δx

The only difference is that the indices i and j are not scalars; instead they
are vectors of indices. In the equation above the term (∂y/∂x)_{i,:} is the ith “row”
of the generalized matrix ∂y/∂x, which is a tensor with the same shape as x. We
have also used the convention that the dot product between two tensors of the
same shape is an elementwise product followed by a sum, identical to the dot
product between vectors.
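In NumPy this generalized matrix-vector multiply is a contraction over the dimensions of x, which np.tensordot expresses directly. The sketch below is only an illustration of the bookkeeping: the shapes are arbitrary and the generalized Jacobian is filled with random numbers rather than computed from any actual function:

    import numpy as np

    # Arbitrary shapes for x and y (illustration only)
    x_shape = (2, 3)        # N_1 x N_2
    y_shape = (4, 5)        # M_1 x M_2

    # A generalized Jacobian ∂y/∂x with shape (M_1 x M_2) x (N_1 x N_2)
    dydx = np.random.randn(*y_shape, *x_shape)
    dx = np.random.randn(*x_shape)

    # Generalized matrix-vector multiply: sum over the dimensions of x.
    # tensordot contracts the last len(x_shape) axes of dydx with dx.
    dy = np.tensordot(dydx, dx, axes=len(x_shape))
    print(dy.shape)         # (4, 5), i.e. the shape of y

    # The same thing written as an explicit loop over the "row" indices i:
    # each row dydx[i] is an x-shaped tensor, dotted with Δx.
    dy_loop = np.empty(y_shape)
    for i in np.ndindex(*y_shape):
        dy_loop[i] = np.sum(dydx[i] * dx)
    print(np.allclose(dy, dy_loop))   # True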
The chain rule also looks the same in the case of tensor-valued functions.
Suppose that y = f(x) and z = g(y), where x and y have the same shapes as
above and z has shape K_1 × · · · × K_{Dz}. Now the chain rule looks the same as
before:

    ∂z/∂x = (∂z/∂y) (∂y/∂x)

The difference is that now ∂z/∂y is a generalized matrix of shape (K_1 × · · · × K_{Dz}) ×
(M_1 × · · · × M_{Dy}), and ∂y/∂x is a generalized matrix of shape (M_1 × · · · × M_{Dy}) ×
(N_1 × · · · × N_{Dx}); the product (∂z/∂y)(∂y/∂x) is a generalized matrix-matrix multiply,
resulting in an object of shape (K_1 × · · · × K_{Dz}) × (N_1 × · · · × N_{Dx}). Like
the generalized matrix-vector multiply defined above, the generalized matrix-
matrix multiply follows the same algebraic rules as the traditional matrix-matrix
multiply:

    (∂z/∂x)_{i,j} = Σ_k (∂z/∂y)_{i,k} (∂y/∂x)_{k,j} = (∂z/∂y)_{i,:} · (∂y/∂x)_{:,j}

In this equation the indices i, j, k are vectors of indices, and the terms (∂z/∂y)_{i,:}
and (∂y/∂x)_{:,j} are the ith “row” of ∂z/∂y and the jth “column” of ∂y/∂x respectively.
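As a small illustration (again with arbitrary shapes and random values, not derived from any particular f or g), the generalized matrix-matrix multiply is a contraction over the dimensions of y, which np.tensordot also expresses directly:

    import numpy as np

    # Arbitrary tensor shapes for x, y, z (illustration only)
    x_shape, y_shape, z_shape = (2, 3), (4,), (5, 6)

    dzdy = np.random.randn(*z_shape, *y_shape)   # shape (5, 6) x (4,)
    dydx = np.random.randn(*y_shape, *x_shape)   # shape (4,) x (2, 3)

    # Generalized matrix-matrix multiply: contract over the dimensions of y
    dzdx = np.tensordot(dzdy, dydx, axes=len(y_shape))
    print(dzdx.shape)       # (5, 6, 2, 3) = (shape of z) x (shape of x)

    # Sanity check: applying ∂z/∂x to a change in x agrees with applying
    # ∂y/∂x first and then ∂z/∂y.
    dx = np.random.randn(*x_shape)
    lhs = np.tensordot(dzdx, dx, axes=len(x_shape))
    dy = np.tensordot(dydx, dx, axes=len(x_shape))
    rhs = np.tensordot(dzdy, dy, axes=len(y_shape))
    print(np.allclose(lhs, rhs))                 # True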

2 Backpropagation with Tensors


In the context of neural networks, a layer f is typically a function of (tensor)
inputs x and weights w; the (tensor) output of the layer is then y = f(x, w).
The layer f is typically embedded in some large neural network with scalar loss
L.
During backpropagation, we assume that we are given ∂L/∂y and our goal is
to compute ∂L/∂x and ∂L/∂w. By the chain rule we know that

    ∂L/∂x = (∂L/∂y) (∂y/∂x)          ∂L/∂w = (∂L/∂y) (∂y/∂w)

Therefore one way to proceed would be to form the (generalized) Jacobians
∂y/∂x and ∂y/∂w and use (generalized) matrix multiplication to compute ∂L/∂x
and ∂L/∂w.

However, there's a problem with this approach: the Jacobian matrices ∂y/∂x and
∂y/∂w are typically far too large to fit in memory.
As a concrete example, let's suppose that f is a linear layer that takes as
input a minibatch of N vectors, each of dimension D, and produces a minibatch
of N vectors, each of dimension M. Then x is a matrix of shape N × D, w is a
matrix of shape D × M, and y = f(x, w) = xw is a matrix of shape N × M.
The Jacobian ∂y/∂x then has shape (N × M) × (N × D). In a typical neural
network we might have N = 64 and M = D = 4096; then ∂y/∂x consists of
64 · 4096 · 64 · 4096 scalar values; this is more than 68 billion numbers; using
32-bit floating point, this Jacobian matrix will take 256 GB of memory to store.
Therefore it is completely hopeless to try and explicitly store and manipulate
the Jacobian matrix.
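You can check this arithmetic with a few lines of Python:

    N, M, D = 64, 4096, 4096
    entries = N * M * N * D                 # elements of the (N x M) x (N x D) Jacobian
    print(entries)                          # 68719476736, about 68.7 billion
    print(entries * 4 / 2 ** 30, "GiB")     # 256.0 GiB at 4 bytes per float32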
However it turns out that for most common neural network layers, we can
derive expressions that compute the product (∂L/∂y) (∂y/∂x) without explicitly
forming the Jacobian ∂y/∂x. Even better, we can typically derive this expression
without even computing an explicit expression for the Jacobian ∂y/∂x; in many
cases we can work out a small case on paper and then infer the general formula.
Let’s see how this works out for the case of the linear layer f(x, w) = xw.
Set N = 1, D = 2, M = 3. Then we can explicitly write

    y = [ y1,1  y1,2  y1,3 ] = xw                                                      (1)

      = [ x1,1  x1,2 ] [ w1,1  w1,2  w1,3 ]                                            (2)
                       [ w2,1  w2,2  w2,3 ]

      = [ x1,1 w1,1 + x1,2 w2,1    x1,1 w1,2 + x1,2 w2,2    x1,1 w1,3 + x1,2 w2,3 ]    (3)

During backpropagation we assume that we have access to ∂L/∂y, which tech-
nically has shape (1) × (N × M); however for notational convenience we will
instead think of it as a matrix of shape N × M. Then we can write

    ∂L/∂y = [ dy1,1  dy1,2  dy1,3 ]

where dy1,1 is shorthand for the scalar ∂L/∂y1,1, and similarly for the other entries.
Our goal now is to derive an expression for ∂L/∂x in terms of x, w, and ∂L/∂y,
without explicitly forming the entire Jacobian ∂y/∂x. We know that ∂L/∂x will have
shape (1) × (N × D), but as is typical for representing gradients we instead view
∂L/∂x as a matrix of shape N × D. We know that each element of ∂L/∂x is a scalar
giving the partial derivative of L with respect to one element of x:

    ∂L/∂x = [ ∂L/∂x1,1   ∂L/∂x1,2 ]

Thinking one element at a time, the chain rule tells us that

    ∂L/∂x1,1 = (∂L/∂y) (∂y/∂x1,1)                                                      (4)
    ∂L/∂x1,2 = (∂L/∂y) (∂y/∂x1,2)                                                      (5)

Viewing these derivatives as generalized matrices, ∂L/∂y has shape (1) × (N × M)
and ∂y/∂x1,1 has shape (N × M) × (1); their product ∂L/∂x1,1 then has shape (1) × (1).
If we instead view ∂L/∂y and ∂y/∂x1,1 as matrices of shape N × M, then their
generalized matrix product is simply the dot product (∂L/∂y) · (∂y/∂x1,1).
Now we compute

    ∂y/∂x1,1 = [ ∂y1,1/∂x1,1   ∂y1,2/∂x1,1   ∂y1,3/∂x1,1 ] = [ w1,1  w1,2  w1,3 ]      (6)
    ∂y/∂x1,2 = [ ∂y1,1/∂x1,2   ∂y1,2/∂x1,2   ∂y1,3/∂x1,2 ] = [ w2,1  w2,2  w2,3 ]      (7)

where the final equalities come from taking the derivatives of Equation (3) with
respect to x1,1 and x1,2.
We can now combine these results and write

    ∂L/∂x1,1 = (∂L/∂y) · (∂y/∂x1,1) = dy1,1 w1,1 + dy1,2 w1,2 + dy1,3 w1,3             (8)
    ∂L/∂x1,2 = (∂L/∂y) · (∂y/∂x1,2) = dy1,1 w2,1 + dy1,2 w2,2 + dy1,3 w2,3             (9)

This gives us our final expression for ∂L/∂x:

    ∂L/∂x = [ ∂L/∂x1,1   ∂L/∂x1,2 ]                                                    (10)
          = [ dy1,1 w1,1 + dy1,2 w1,2 + dy1,3 w1,3    dy1,1 w2,1 + dy1,2 w2,2 + dy1,3 w2,3 ]   (11)
          = (∂L/∂y) w^T                                                                (12)

This final result ∂L/∂x = (∂L/∂y) w^T is very interesting because it allows us to
efficiently compute ∂L/∂x without explicitly forming the Jacobian ∂y/∂x. We have
only derived this formula for the specific case of N = 1, D = 2, M = 3, but it in
fact holds in general.
By a similar thought process we can derive a similar expression for ∂L/∂w with-
out explicitly forming the Jacobian ∂y/∂w. You should try and work through this
as an exercise.
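To close, here is a minimal NumPy sketch (not part of the derivation above) that applies the formula ∂L/∂x = (∂L/∂y) w^T to a whole minibatch and checks it against numerical gradients; the made-up loss and the grad_w line are there only so you can verify the result, and the latter gives one way to check your answer to the exercise:

    import numpy as np

    def linear_backward(x, w, grad_y):
        """Backward pass for y = x @ w, given the upstream gradient ∂L/∂y."""
        grad_x = grad_y @ w.T      # the result derived above: ∂L/∂x = (∂L/∂y) w^T
        grad_w = x.T @ grad_y      # one way to check your answer to the exercise
        return grad_x, grad_w

    # Numerical check, using a made-up loss L = sum(y * c) with constant c,
    # so that ∂L/∂y is exactly c.
    N, D, M = 4, 5, 3
    x = np.random.randn(N, D)
    w = np.random.randn(D, M)
    c = np.random.randn(N, M)      # plays the role of ∂L/∂y

    grad_x, grad_w = linear_backward(x, w, c)

    def loss(x, w):
        return np.sum((x @ w) * c)

    h = 1e-6
    num_grad_x = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp, xm = x.copy(), x.copy()
        xp[idx] += h
        xm[idx] -= h
        num_grad_x[idx] = (loss(xp, w) - loss(xm, w)) / (2 * h)

    print(np.max(np.abs(grad_x - num_grad_x)))    # small, around 1e-9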
