Deep Learning: CS229 Lecture Notes
Andrew Ng
Scribe: Albert Haque
Deep Learning
We now begin our study of deep learning. In this set of notes, we give an
overview of neural networks, discuss vectorization, and describe how to train
neural networks with backpropagation.
1 Neural Networks
We will start small and slowly build up a neural network, step by step. Recall
the housing price prediction problem from before: given the size of the house,
we want to predict the price.
Previously, we fitted a straight line to the graph. Now, instead of fitting a
straight line, we wish to prevent negative housing prices by setting the absolute
minimum price to zero. This produces a “kink” in the graph, as shown in
Figure 1.
Our goal is to feed some input $x$ into a function $f(x)$ that outputs the
price of the house $y$. Formally, $f : x \to y$. One of the simplest possible
neural networks is to define $f(x)$ as a single “neuron” in the network where
$f(x) = \max(ax + b, 0)$, for some coefficients $a, b$. What $f(x)$ does is return
a single value: $(ax + b)$ or zero, whichever is greater. In the context of neural
networks, this function is called a ReLU (pronounced “ray-lu”), or rectified
linear unit. A more complex neural network may take the single neuron
described above and “stack” copies of it together such that one neuron passes its
output as input into the next neuron, resulting in a more complex function.
Let us now deepen the housing prediction example. In addition to the size
of the house, suppose that you know the number of bedrooms, the zip code,
and the wealth of the neighborhood.
[Figure 1: housing prices — price (in $1000) plotted against square feet.]
We have described this neural network as if you (the reader) already have
the insight to determine that these three factors ultimately affect the housing
price. Part of the magic of a neural network is that all you need are the
input features x and the output y while the neural network will figure out
everything in the middle by itself. The process of a neural network learning
the intermediate features is called end-to-end learning.
Following the housing example, formally, the input to a neural network is
a set of input features x1 , x2 , x3 , x4 . We connect these four features to three
neurons. These three “internal” neurons are called hidden units. The goal for
the neural network is to automatically determine three relevant features such
that the three features predict the price of a house. The only thing we must
provide to the neural network is a sufficient number of training examples
$(x^{(i)}, y^{(i)})$. Oftentimes, the neural network will discover complex features
which are very useful for predicting the output but may be difficult for a
human to understand, since they do not have a “common” meaning. This is
why some people refer to neural networks as a black box, as it can be difficult
to understand the features it has invented.
Let us formalize this neural network representation. Suppose we have
three input features x1 , x2 , x3 which are collectively called the input layer,
four hidden units which are collectively called the hidden layer and one out-
put neuron called the output layer. The hidden layer is called “hidden”
because we do not have ground truth/training values for the hidden units.
This is in contrast to the input and output layers, whose ground truth values
we know from $(x^{(i)}, y^{(i)})$.
The first hidden unit requires the input $x_1, x_2, x_3$ and outputs a value
denoted by $a_1$. We use the letter $a$ since it refers to the neuron's “activation”
value. In this particular example, we have a single hidden layer, but it is
possible to have multiple hidden layers. Let $a_1^{[1]}$ denote the output value of
the first hidden unit in the first hidden layer. We use zero-indexing to refer
to the layer numbers. That is, the input layer is layer 0, the first hidden
layer is layer 1, and the output layer is layer 2. Again, more complex neural
networks may have more hidden layers. Given this mathematical notation,
the output of layer 2 is $a_1^{[2]}$. We can unify our notation:
$x_1 = a_1^{[0]}$ (1.1)
$x_2 = a_2^{[0]}$ (1.2)
$x_3 = a_3^{[0]}$ (1.3)
To clarify, $\text{foo}^{[1]}$ with brackets denotes anything associated with layer 1, $x^{(i)}$
with parentheses refers to the $i$th training example, and $a_j^{[\ell]}$ refers to the
activation of the $j$th unit in layer $\ell$.
[Figure: A neural network with inputs $x_1, x_2, x_3$, a hidden layer, and an output giving the estimated value of $y$.]
Returning to our neural network from before, the first hidden unit in the first
hidden layer will perform the following computation:
$z_1^{[1]} = W_1^{[1]T} x + b_1^{[1]}$ and $a_1^{[1]} = g(z_1^{[1]})$ (1.7)
where $W^{[1]}$ is a matrix of parameters and $W_1^{[1]}$ refers to the first row of this
matrix. The parameters associated with the first hidden unit are the vector
$W_1^{[1]} \in \mathbb{R}^3$ and the scalar $b_1^{[1]} \in \mathbb{R}$. For the second and third hidden units in
the first hidden layer, the computation is defined as:
$z_2^{[1]} = W_2^{[1]T} x + b_2^{[1]}$ and $a_2^{[1]} = g(z_2^{[1]})$
$z_3^{[1]} = W_3^{[1]T} x + b_3^{[1]}$ and $a_3^{[1]} = g(z_3^{[1]})$
where each hidden unit has its corresponding parameters W and b. Moving
on, the output layer performs the computation:
$z_1^{[2]} = W_1^{[2]T} a^{[1]} + b_1^{[2]}$ and $a_1^{[2]} = g(z_1^{[2]})$ (1.8)
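To make the notation concrete, here is a minimal NumPy sketch of these per-unit computations for three inputs and three hidden units, using the sigmoid for $g$ (the variable names are ours, and the parameter values are random stand-ins):

    import numpy as np

    def g(z):
        # Sigmoid activation, applied element-wise.
        return 1.0 / (1.0 + np.exp(-z))

    x = np.random.randn(3)             # input features x1, x2, x3

    W1 = np.random.randn(3, 3) * 0.01  # row i holds the parameters W_i^{[1]}
    b1 = np.zeros(3)
    W2 = np.random.randn(1, 3) * 0.01  # output-layer parameters W_1^{[2]}
    b2 = np.zeros(1)

    # Equation (1.7) and its siblings: one z and one a per hidden unit.
    a1 = np.empty(3)
    for i in range(3):
        z = W1[i].dot(x) + b1[i]
        a1[i] = g(z)

    # Equation (1.8): the output layer consumes the hidden activations.
    z2 = W2.dot(a1) + b2
    a2 = g(z2)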
2 Vectorization
In order to implement a neural network at a reasonable speed, one must be
careful with for loops. To compute the hidden unit activations
in the first layer, we must compute $z_1^{[1]}, \ldots, z_4^{[1]}$ and $a_1^{[1]}, \ldots, a_4^{[1]}$:

$z_1^{[1]} = W_1^{[1]T} x + b_1^{[1]}$ and $a_1^{[1]} = g(z_1^{[1]})$ (2.1)
$\vdots$ (2.2)
$z_4^{[1]} = W_4^{[1]T} x + b_4^{[1]}$ and $a_4^{[1]} = g(z_4^{[1]})$ (2.3)
The most natural way to implement this in code is to use a for loop. One of
the treasures that deep learning has given to the field of machine learning is
that deep learning algorithms have high computational requirements. As a
result, code will run very slowly if you use for loops.
$\underbrace{z^{[1]}}_{4\times1} = \underbrace{W^{[1]}}_{4\times3}\,\underbrace{x}_{3\times1} + \underbrace{b^{[1]}}_{4\times1}$ (2.4)

where the $\mathbb{R}^{n\times m}$ beneath each matrix indicates its dimensions. Expressing
this in matrix notation: $z^{[1]} = W^{[1]}x + b^{[1]}$. To compute $a^{[1]}$ without a
for loop, we can leverage vectorized libraries in Matlab, Octave, or Python,
which compute $a^{[1]} = g(z^{[1]})$ very fast by performing parallel element-wise
operations. Mathematically, we defined the sigmoid function $g(z)$ as:
$g(z) = \dfrac{1}{1 + e^{-z}}$ where $z \in \mathbb{R}$ (2.5)
However, the sigmoid function can be defined not only for scalars but also
for vectors, by applying it element-wise; in Matlab/Octave-like pseudocode,
sigmoid(z) = 1 ./ (1 + exp(-z)). The output layer then performs the
computation below, with the dimensions of each term annotated beneath it:
$\underbrace{z^{[2]}}_{1\times1} = \underbrace{W^{[2]}}_{1\times4}\,\underbrace{a^{[1]}}_{4\times1} + \underbrace{b^{[2]}}_{1\times1}$ and $\underbrace{a^{[2]}}_{1\times1} = g(\underbrace{z^{[2]}}_{1\times1})$ (2.7)
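As a sketch of what vectorization buys us, the two layers above can be computed with no explicit loop over hidden units (shapes follow the annotations in Equation (2.7); the parameter values are random stand-ins):

    import numpy as np

    def g(z):
        # Element-wise sigmoid; works on scalars, vectors, and matrices.
        return 1.0 / (1.0 + np.exp(-z))

    x = np.random.randn(3, 1)                      # 3x1 input
    W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
    W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))

    z1 = W1 @ x + b1   # Equation (2.4): 4x1, all hidden units at once
    a1 = g(z1)         # element-wise, no for loop
    z2 = W2 @ a1 + b2  # Equation (2.7): 1x1
    a2 = g(z2)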
Why do we not use the identity function for $g(z)$? That is, why not use
$g(z) = z$? Assume for the sake of argument that $b^{[1]}$ and $b^{[2]}$ are zero. Using
Equation (2.7), we have:
$z^{[2]} = W^{[2]} a^{[1]}$ (2.8)
$= W^{[2]} g(z^{[1]})$ by definition (2.9)
$= W^{[2]} z^{[1]}$ since $g(z) = z$ (2.10)
$= W^{[2]} W^{[1]} x$ from Equation (2.4) (2.11)
$= \tilde{W} x$ where $\tilde{W} = W^{[2]} W^{[1]}$ (2.12)
Notice how $W^{[2]} W^{[1]}$ collapsed into $\tilde{W}$. This is because applying a linear
function to another linear function results in a linear function over the
original input (i.e., you can construct a $\tilde{W}$ such that $\tilde{W}x = W^{[2]}W^{[1]}x$).
This loses much of the representational power of the neural network, as often-
times the output we are trying to predict has a non-linear relationship with
the inputs. Without non-linear activation functions, the neural network would
simply perform linear regression.
Putting it together: Suppose we have a training set $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$
where $x^{(i)}$ is a picture and $y^{(i)}$ is a binary label for whether the picture
contains a cat or not (i.e., 1 = contains a cat).
First, we initialize the parameters $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$ to small random
numbers. For each example, we compute the output “probability” from the
sigmoid function, $a^{[2](i)}$. Using the logistic regression log likelihood:
$\sum_{i=1}^{m} \left( y^{(i)} \log a^{[2](i)} + (1 - y^{(i)}) \log\big(1 - a^{[2](i)}\big) \right)$ (2.15)
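A short sketch of evaluating Equation (2.15) in NumPy, assuming A2 is a length-$m$ array of outputs $a^{[2](i)}$ and Y a length-$m$ array of binary labels (both names are ours):

    import numpy as np

    def log_likelihood(A2, Y):
        # Equation (2.15): sum over examples of
        # y log(a) + (1 - y) log(1 - a).
        return np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

Training then maximizes this quantity (equivalently, minimizes its negation) with respect to the parameters.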
3 Backpropagation
Instead of the housing example, we now have a new problem. Suppose we
wish to detect whether there is a soccer ball in an image or not. Given an
input image x(i) , we wish to output a binary prediction 1 if there is a ball in
the image and 0 otherwise.
Aside: Grayscale images can be represented as a matrix with a number of elements
equal to the number of pixels. However, color images are digitally represented
as a volume (i.e., three channels; or three matrices stacked on each other).
The number three is used because colors are represented as red-green-blue
(RGB) values. In the diagram below, we have a 64 × 64 × 3 image containing
a soccer ball. It is flattened into a single vector containing 12,288 elements.
[Figure: A 64 × 64 × 3 image of a soccer ball is flattened into a 12,288-element vector $x^{(i)}$, which the neural network model maps to an output prediction of 0 or 1.]
The next step is to compute how many parameters are in this network. One
way of doing this is to compute the forward propagation by hand.
We know that $z^{[1]}, a^{[1]} \in \mathbb{R}^{3\times1}$, $z^{[2]}, a^{[2]} \in \mathbb{R}^{2\times1}$, and $z^{[3]}, a^{[3]} \in \mathbb{R}^{1\times1}$. As
of now, we do not know the size of $W^{[1]}$. However, we can compute its size.
We know that $x \in \mathbb{R}^{n\times1}$. This leads us to the following:

$z^{[1]} = W^{[1]} x^{(i)}$. Written as sizes: $\mathbb{R}^{3\times1} = \mathbb{R}^{?\times?} \times \mathbb{R}^{n\times1}$ (3.7)

Matching dimensions, $W^{[1]}$ must be in $\mathbb{R}^{3\times n}$, with $b^{[1]} \in \mathbb{R}^{3\times1}$. By the same reasoning:

$W^{[2]} \in \mathbb{R}^{2\times3}$, $b^{[2]} \in \mathbb{R}^{2\times1}$ and $W^{[3]} \in \mathbb{R}^{1\times2}$, $b^{[3]} \in \mathbb{R}^{1\times1}$ (3.8)

In total, the network has $3n + 3 + 6 + 2 + 2 + 1 = 3n + 14$ parameters.
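One way to sanity-check this bookkeeping is in code (a sketch; n = 12288 matches the flattened 64 × 64 × 3 image):

    import numpy as np

    n = 12288  # flattened 64 x 64 x 3 image
    shapes = {
        "W1": (3, n), "b1": (3, 1),
        "W2": (2, 3), "b2": (2, 1),
        "W3": (1, 2), "b3": (1, 1),
    }
    params = {name: np.zeros(shape) for name, shape in shapes.items()}
    total = sum(p.size for p in params.values())
    print(total)  # 3n + 14 = 36878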
The loss function $L(\hat{y}, y)$ produces a single scalar value. For short, we will
refer to the loss value as $L$. Given this value, we must now update all
parameters in all layers of the neural network. For any given layer index $\ell$, we
update them as follows:
$W^{[\ell]} = W^{[\ell]} - \alpha \dfrac{\partial L}{\partial W^{[\ell]}}$ (3.10)

$b^{[\ell]} = b^{[\ell]} - \alpha \dfrac{\partial L}{\partial b^{[\ell]}}$ (3.11)
where $\alpha$ is the learning rate. To proceed, we must compute the gradients with
respect to the parameters: $\partial L/\partial W^{[\ell]}$ and $\partial L/\partial b^{[\ell]}$.
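In code, one such update step might look like the following sketch (the gradients here are random stand-ins for the quantities derived below):

    import numpy as np

    alpha = 0.01
    params = {"W1": np.random.randn(3, 5), "b1": np.zeros((3, 1))}
    grads  = {"W1": np.random.randn(3, 5), "b1": np.random.randn(3, 1)}

    # Equations (3.10) and (3.11): step each parameter against its gradient.
    for name in params:
        params[name] -= alpha * grads[name]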
Remember, we made a decision to not set all parameters to zero. What if
we had initialized all parameters to be zero? We know that z [3] = W [3] a[2] +b[3]
will evaluate to zero, because W [3] and b[3] are all zero. However, the output
of the neural network is defined as a[3] = g(z [3] ). Recall that g(·) is defined as
the sigmoid function. This means a[3] = g(0) = 0.5. Thus, no matter what
value of x(i) we provide, the network will output ŷ = 0.5.
What if we had initialized all parameters to be the same non-zero value?
In this case, consider the activations of the first layer: each element of the
activation vector $a^{[1]}$ will be the same (because $W^{[1]}$ contains all the same
values). This behavior will occur at all layers of the
neural network. As a result, when we compute the gradient, all neurons in
a layer will be equally responsible for anything contributed to the final loss.
We call this property symmetry. This means each neuron (within a layer)
will receive the exact same gradient update value (i.e., all neurons will learn
the same thing).
In practice, it turns out there is something better than random initializa-
tion. It is called Xavier/He initialization, which initializes the weights as:

$w^{[\ell]} \sim \mathcal{N}\left(0, \sqrt{\dfrac{2}{n^{[\ell]} + n^{[\ell-1]}}}\right)$ (3.13)

where $n^{[\ell]}$ is the number of neurons in layer $\ell$.
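A sketch of Equation (3.13) in NumPy, where n_out and n_in stand for $n^{[\ell]}$ and $n^{[\ell-1]}$ (the function name is ours):

    import numpy as np

    def xavier_he_init(n_out, n_in):
        # Equation (3.13): weights drawn from N(0, sqrt(2 / (n_out + n_in))).
        std = np.sqrt(2.0 / (n_out + n_in))
        return np.random.randn(n_out, n_in) * std

    W1 = xavier_he_init(3, 12288)  # first-layer weights for the image network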
3.2 Optimization
Recall our neural network parameters: W [1] , b[1] , W [2] , b[2] , W [3] , b[3] . To up-
date them, we use stochastic gradient descent (SGD) using the update rules
in Equations (3.10) and (3.11). We will first compute the gradient with re-
spect to W [3] . The reason for this is that the influence of W [1] on the loss
is more complex than that of W [3] . This is because W [3] is “closer” to the
output ŷ, in terms of number of computations.
$\dfrac{\partial L}{\partial W^{[3]}} = -\dfrac{\partial}{\partial W^{[3]}}\Big[(1-y)\log(1-\hat{y}) + y\log \hat{y}\Big]$ (3.14)
$= -(1-y)\,\dfrac{\partial}{\partial W^{[3]}}\log\big(1 - g(W^{[3]}a^{[2]} + b^{[3]})\big)$ (3.15)
$\quad - y\,\dfrac{\partial}{\partial W^{[3]}}\log\big(g(W^{[3]}a^{[2]} + b^{[3]})\big)$ (3.16)
$= -(1-y)\,\dfrac{1}{1 - g(W^{[3]}a^{[2]} + b^{[3]})}\,(-1)\,g'(W^{[3]}a^{[2]} + b^{[3]})\,a^{[2]T}$ (3.17)
$\quad - y\,\dfrac{1}{g(W^{[3]}a^{[2]} + b^{[3]})}\,g'(W^{[3]}a^{[2]} + b^{[3]})\,a^{[2]T}$ (3.18)
$= (1-y)\,\sigma(W^{[3]}a^{[2]} + b^{[3]})\,a^{[2]T} - y\,\big(1 - \sigma(W^{[3]}a^{[2]} + b^{[3]})\big)\,a^{[2]T}$ (3.19)
$= (1-y)\,a^{[3]}a^{[2]T} - y\,(1 - a^{[3]})\,a^{[2]T}$ (3.20)
$= (a^{[3]} - y)\,a^{[2]T}$ (3.21)
Note that we are using the sigmoid for $g(\cdot)$. The derivative of the sigmoid
function is $g' = \sigma' = \sigma(1 - \sigma)$. Additionally, $a^{[3]} = \sigma(W^{[3]}a^{[2]} + b^{[3]})$. At this point,
we have finished computing the gradient for one parameter, $W^{[3]}$.
We will now compute the gradient for $W^{[2]}$. Instead of deriving $\partial L/\partial W^{[2]}$
from scratch, we can use the chain rule of calculus. We know that $L$ depends on $\hat{y} = a^{[3]}$.
$\dfrac{\partial L}{\partial W^{[2]}} = \dfrac{\partial L}{\partial\,?}\,\dfrac{\partial\,?}{\partial W^{[2]}}$ (3.22)
If we look at the forward propagation, we know that the loss $L$ depends on
$\hat{y} = a^{[3]}$. Using the chain rule, we can substitute $a^{[3]}$ for the first unknown:
$\dfrac{\partial L}{\partial W^{[2]}} = \dfrac{\partial L}{\partial a^{[3]}}\,\dfrac{\partial a^{[3]}}{\partial\,?}\,\dfrac{\partial\,?}{\partial W^{[2]}}$ (3.23)
We know that a[3] is directly related to z [3] .
$\dfrac{\partial L}{\partial W^{[2]}} = \dfrac{\partial L}{\partial a^{[3]}}\,\dfrac{\partial a^{[3]}}{\partial z^{[3]}}\,\dfrac{\partial z^{[3]}}{\partial\,?}\,\dfrac{\partial\,?}{\partial W^{[2]}}$ (3.24)
Furthermore, we know that $z^{[3]}$ is directly related to $a^{[2]}$. Note that we cannot
use $W^{[2]}$ or $b^{[2]}$ here, because $a^{[2]}$ is the only element common to Equations
(3.5) and (3.6); a common element is required for backpropagation.
While we have greatly simplified the process, we are not done yet. Because
we are computing derivatives in higher dimensions, the exact order of matrix
multiplication required to compute Equation (3.28) is not clear. We must
reorder the terms in Equation (3.28) such that the dimensions align. First,
we note the dimensions of all the terms:
$\underbrace{\dfrac{\partial L}{\partial W^{[2]}}}_{2\times3} = \underbrace{(a^{[3]} - y)}_{1\times1}\,\underbrace{W^{[3]}}_{1\times2}\,\underbrace{g'(z^{[2]})}_{2\times1}\,\underbrace{a^{[1]}}_{3\times1}$ (3.29)
Notice how the terms do not align their shapes properly. We must rearrange
the terms by using properties of matrix algebra such that the matrix opera-
tions produce a result with the correct output shape. The correct ordering
is below:
$\underbrace{\dfrac{\partial L}{\partial W^{[2]}}}_{2\times3} = \underbrace{W^{[3]T}}_{2\times1} \circ \underbrace{g'(z^{[2]})}_{2\times1}\;\underbrace{(a^{[3]} - y)}_{1\times1}\;\underbrace{a^{[1]T}}_{1\times3}$ (3.30)
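With the shapes above, Equation (3.30) can be checked in a few lines of NumPy (toy values; the variable names are ours):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    a1 = np.random.randn(3, 1)  # a^{[1]}, 3x1
    z2 = np.random.randn(2, 1)  # z^{[2]}, 2x1
    W3 = np.random.randn(1, 2)  # W^{[3]}, 1x2
    a3 = np.array([[0.7]])      # a^{[3]} = y-hat, 1x1
    y  = np.array([[1.0]])      # label, 1x1

    # g'(z) = sigmoid(z) * (1 - sigmoid(z)) for the sigmoid.
    g_prime = sigmoid(z2) * (1 - sigmoid(z2))

    # Equation (3.30): (W3^T elementwise g'(z2)) (a3 - y) a1^T, shape 2x3.
    dW2 = (W3.T * g_prime) @ (a3 - y) @ a1.T
    assert dW2.shape == (2, 3)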
Momentum: the momentum technique augments gradient descent with a
“velocity” term, updating each weight matrix in two stages:

$v_{dW^{[\ell]}} = \beta\, v_{dW^{[\ell]}} + (1 - \beta)\,\dfrac{\partial J}{\partial W^{[\ell]}}$

$W^{[\ell]} = W^{[\ell]} - \alpha\, v_{dW^{[\ell]}}$

Notice how there are now two stages instead of a single stage. The weight
update now depends on the cost $J$ at this update step and the velocity $v_{dW^{[\ell]}}$.
The relative importance is controlled by β. Consider the analogy to a human
driving a car. While in motion, the car has momentum. If the car were to use
the brakes (or not push accelerator throttle), the car would continue moving
due to its momentum. Returning to optimization, the velocity vdW [`] will
keep track of the gradient over time. This technique has significantly helped
neural networks during the training phase.
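A minimal sketch of momentum for a single weight matrix (β, α, and the shapes are illustrative; the gradient is a random stand-in):

    import numpy as np

    alpha, beta = 0.01, 0.9
    W  = np.random.randn(3, 5)
    vW = np.zeros_like(W)                  # velocity, initialized to zero

    for step in range(100):
        dW = np.random.randn(3, 5)         # stand-in for dJ/dW from backprop
        vW = beta * vW + (1 - beta) * dW   # velocity tracks the gradient over time
        W -= alpha * vW                    # the update uses the velocity, not raw dW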
3.3.1 L2 Regularization
Let $W$ below denote all the parameters in a model. In the case of neural
networks, you may think of applying the second (regularization) term to all layer
weights $W^{[\ell]}$. For convenience, we simply write $W$. L2 regularization adds
another term to the cost function:
$J_{L2} = J + \dfrac{\lambda}{2}\|W\|^2$ (3.34)

$\phantom{J_{L2}} = J + \dfrac{\lambda}{2}\sum_{ij}|W_{ij}|^2$ (3.35)

$\phantom{J_{L2}} = J + \dfrac{\lambda}{2}W^T W$ (3.36)
where $J$ is the standard cost function from before, $\lambda$ is a hyperparameter
(a larger value indicates more regularization), $W$ contains all the weight
matrices, and Equations (3.34), (3.35), and (3.36) are equivalent. The
update rule with L2 regularization becomes:
$W = W - \alpha\dfrac{\partial J}{\partial W} - \alpha\dfrac{\lambda}{2}\dfrac{\partial\,W^T W}{\partial W}$ (3.37)

$\phantom{W} = (1 - \alpha\lambda)W - \alpha\dfrac{\partial J}{\partial W}$ (3.38)
When we were updating our parameters using gradient descent, we did not
have the $(1 - \alpha\lambda)W$ term. This means that with L2 regularization, every
update also shrinks the weights slightly, depending on $W$. Because large weights
increase the cost $J_{L2}$, the regularizer encourages individual parameters to be
small in magnitude, which is one way to reduce overfitting.
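In code, Equation (3.38) amounts to shrinking the weights before the usual gradient step, a form often called weight decay (a sketch with a stand-in gradient):

    import numpy as np

    alpha, lam = 0.01, 0.1
    W  = np.random.randn(3, 5)
    dJ = np.random.randn(3, 5)  # stand-in for dJ/dW

    # Equation (3.38): shrink W toward zero, then take the usual step.
    W = (1 - alpha * lam) * W - alpha * dJ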
Parameter sharing: Recall the soccer-ball image from before, which is 64 × 64
dimensional and contains 3 channels. We now take our matrix of parameters
$\theta$ and slide it over the image; picture a thick 4 × 4 square over the upper
left corner of the image. To compute the activation $a$, we compute
the element-wise product between $\theta$ and $x_{1:4,1:4}$, where the subscripts for $x$
indicate we are taking the top left 4 × 4 region in the image $x$. We then
collapse the matrix into a single scalar by summing all the elements resulting
from the element-wise product. Formally:
$a = \sum_{i=1}^{4}\sum_{j=1}^{4} \theta_{ij}\, x_{ij}$ (3.39)
We then move this window slightly to the right in the image and repeat this
process. Once we have reached the end of the row, we start at the beginning
of the second row.
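A sketch of this sliding window on a single channel (a 4 × 4 window moved one pixel at a time; for the full 3-channel image, θ would be 4 × 4 × 3):

    import numpy as np

    x = np.random.randn(64, 64)    # one channel of the image
    theta = np.random.randn(4, 4)  # the shared parameters

    out = 64 - 4 + 1               # number of valid window positions per row
    a = np.empty((out, out))
    for i in range(out):           # row of the window's top-left corner
        for j in range(out):       # column of the window's top-left corner
            # Equation (3.39): element-wise product with the window, then sum.
            a[i, j] = np.sum(theta * x[i:i+4, j:j+4])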
Once we have reached the end of the image, the parameters $\theta$ have “seen”
all pixels of the image: $\theta_1$ is no longer related to only the top left pixel. As a
result, whether the soccer ball appears in the bottom right or top left of the
image, the neural network will successfully detect the soccer ball.