CHAPTER 7
Neural Networks and Neural Language Models
All rights reserved. Draft of January 7, 2023.
7.1 Units
The building block of a neural network is a single computational unit. A unit takes
a set of real valued numbers as input, performs some computation on them, and
produces an output.
At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term. Given a set of inputs x1...xn, a unit has a set of corresponding weights w1...wn and a bias b, so the weighted sum z can be represented as:

z = b + Σ_i wi xi    (7.1)
Often it’s more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is, at heart, just a list or array of numbers. Thus we’ll talk about z in terms of a weight vector w, a scalar bias b, and an input vector x, and we’ll replace the sum with the convenient dot product:
z = w·x+b (7.2)
As defined in Eq. 7.2, z is just a real valued number.
Finally, instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit, the activation for the node is in fact the final output of the network, which we’ll generally call y. So the value y is defined as:

y = a = f(z)
We’ll discuss three popular non-linear functions f() below (the sigmoid, the tanh, and the rectified linear unit or ReLU) but it’s pedagogically convenient to start with the sigmoid function since we saw it in Chapter 5:

y = σ(z) = 1 / (1 + e^−z)    (7.3)
The sigmoid (shown in Fig. 7.1) has a number of advantages; it maps the output
into the range (0, 1), which is useful in squashing outliers toward 0 or 1. And it’s
differentiable, which as we saw in Section ?? will be handy for learning.
Figure 7.1 The sigmoid function takes a real value and maps it to the range (0, 1). It is
nearly linear around 0 but outlier values get squashed toward 0 or 1.
Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:

y = σ(w · x + b) = 1 / (1 + exp(−(w · x + b)))    (7.4)
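To make Eq. 7.4 concrete, here is a small sketch in Python (the particular weights and input values are just for illustration):

```python
import numpy as np

def sigmoid(z):
    # squash a real value into the range (0, 1), per Eq. 7.3
    return 1 / (1 + np.exp(-z))

def unit_output(w, x, b):
    # a single neural unit: weighted sum plus bias, then sigmoid (Eq. 7.4)
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.2, 0.3, 0.9])
x = np.array([0.5, 0.6, 0.1])
b = 0.5
y = unit_output(w, x, b)   # z = 0.87, so y = sigmoid(0.87) ≈ 0.70
```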
Fig. 7.2 shows a final schematic of a basic neural unit. In this example the unit
takes 3 input values x1 , x2 , and x3 , and computes a weighted sum, multiplying each
value by a weight (w1 , w2 , and w3 , respectively), adds them to a bias term b, and then
passes the resulting sum through a sigmoid function to result in a number between 0
and 1.
Figure 7.2 A neural unit, taking 3 inputs x1 , x2 , and x3 (and a bias b that we represent as a
weight for an input clamped at +1) and producing an output y. We include some convenient
intermediate variables: the output of the summation, z, and the output of the sigmoid, a. In
this case the output of the unit y is the same as a, but in deeper networks we’ll reserve y to
mean the final output of the entire network, leaving a as the activation of an individual node.
Let’s walk through an example just to get an intuition. Let’s suppose we have a
unit with the following weight vector and bias:
Another popular non-linear function is the tanh function, a variant of the sigmoid that ranges from −1 to +1:

y = tanh(z) = (e^z − e^−z) / (e^z + e^−z)    (7.5)
The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It’s just the same as z when z is positive, and 0 otherwise:

y = ReLU(z) = max(z, 0)    (7.6)
These activation functions have different properties that make them useful for differ-
ent language applications or network architectures. For example, the tanh function
has the nice properties of being smoothly differentiable and mapping outlier values
toward the mean. The rectifier function, on the other hand, has nice properties that
Figure 7.3 The tanh and ReLU activation functions.
result from it being very close to linear. In the sigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, and have derivatives very close to 0. Zero derivatives cause problems for learning, because as we’ll see in Section 7.6, we’ll train networks by propagating an error signal backwards, multiplying gradients (partial derivatives) from each layer of the network; gradients that are almost 0 cause the error signal to get smaller and smaller until it is too small to be used for training, a problem called the vanishing gradient problem. Rectifiers don’t have this problem, since the derivative of ReLU for high values of z is 1 rather than very close to 0.
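The saturation contrast is easy to verify numerically; the sketch below (our own illustration) evaluates the derivative of each activation at a large input:

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

# derivatives of the three activation functions
def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def d_relu(z):
    # derivative is 1 where z > 0, else 0
    return np.where(np.asarray(z) > 0, 1.0, 0.0)

z = 10.0  # a large (saturated) input
print(d_sigmoid(z))  # ≈ 4.5e-05: nearly zero, so the gradient vanishes
print(d_tanh(z))     # ≈ 8.2e-09: nearly zero as well
print(d_relu(z))     # 1.0: the rectifier still passes the gradient through
```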
7.2 The XOR Problem

Consider the truth tables for the elementary logical functions AND, OR, and XOR:

        AND             OR              XOR
    x1  x2  y       x1  x2  y       x1  x2  y
    0   0   0       0   0   0       0   0   0
    0   1   0       0   1   1       0   1   1
    1   0   0       1   0   1       1   0   1
    1   1   1       1   1   1       1   1   0
This example was first shown for the perceptron, which is a very simple neural unit that has a binary output and does not have a non-linear activation function. The output y of a perceptron is 0 or 1, and is computed as follows (using the same weight w, input x, and bias b as in Eq. 7.2):

y = 0, if w · x + b ≤ 0
y = 1, if w · x + b > 0    (7.7)
It’s very easy to build a perceptron that can compute the logical AND and OR
functions of its binary inputs; Fig. 7.4 shows the necessary weights.
Figure 7.4 The weights w and bias b for perceptrons for computing logical functions. The
inputs are shown as x1 and x2 and the bias as a special node with value +1 which is multiplied
with the bias weight b. (a) logical AND, with weights w1 = 1 and w2 = 1 and bias weight
b = −1. (b) logical OR, with weights w1 = 1 and w2 = 1 and bias weight b = 0. These
weights/biases are just one of an infinite number of possible sets of weights and biases that
would implement the functions.
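Using the weights from Fig. 7.4, the two perceptrons can be sketched directly (a minimal illustration):

```python
import numpy as np

def perceptron(w, b, x):
    # binary threshold unit from Eq. 7.7: output 1 iff w.x + b > 0
    return 1 if np.dot(w, x) + b > 0 else 0

AND = dict(w=np.array([1, 1]), b=-1)  # weights/bias from Fig. 7.4a
OR  = dict(w=np.array([1, 1]), b=0)   # weights/bias from Fig. 7.4b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(**AND, x=np.array(x)), perceptron(**OR, x=np.array(x)))
```

The printed columns reproduce the AND and OR truth tables above.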
It turns out, however, that it’s not possible to build a perceptron to compute
logical XOR! (It’s worth spending a moment to give it a try!)
The intuition behind this important result relies on understanding that a perceptron is a linear classifier. For a two-dimensional input x1 and x2, the perceptron equation w1x1 + w2x2 + b = 0 is the equation of a line. (We can see this by putting it in the standard linear format: x2 = (−w1/w2)x1 + (−b/w2).) This line acts as a decision boundary in two-dimensional space, in which the output 0 is assigned to all inputs lying on one side of the line, and the output 1 to all input points lying on the other side of the line. If we had more than 2 inputs, the decision boundary becomes a hyperplane instead of a line, but the idea is the same, separating the space into two categories.
Fig. 7.5 shows the possible logical inputs (00, 01, 10, and 11) and the line drawn by one possible set of parameters for an AND and an OR classifier. Notice that there is simply no way to draw a line that separates the positive cases of XOR (01 and 10) from the negative cases (00 and 11). We say that XOR is not a linearly separable function. Of course we could draw a boundary with a curve, or some other function, but not a single line.
Figure 7.5 The functions AND, OR, and XOR, represented with input x1 on the x-axis and input x2 on the
y-axis. Filled circles represent perceptron outputs of 1, and white circles perceptron outputs of 0. There is no
way to draw a line that correctly separates the two categories for XOR. Figure styled after Russell and Norvig
(2002).
Figure 7.6 XOR solution after Goodfellow et al. (2016). There are three ReLU units, in
two layers; we’ve called them h1 , h2 (h for “hidden layer”) and y1 . As before, the numbers
on the arrows represent the weights w for each unit, and we represent the bias b as a weight
on a unit clamped to +1, with the bias weights/units in gray.
It’s also instructive to look at the intermediate results, the outputs of the two
hidden nodes h1 and h2 . We showed in the previous paragraph that the h vector for
the inputs x = [0, 0] was [0, 0]. Fig. 7.7b shows the values of the h layer for all
4 inputs. Notice that hidden representations of the two input points x = [0, 1] and
x = [1, 0] (the two cases with XOR output = 1) are merged to the single point h =
[1, 0]. The merger makes it easy to linearly separate the positive and negative cases
of XOR. In other words, we can view the hidden layer of the network as forming a
representation of the input.
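The Fig. 7.6 network and its hidden representations can be checked in a few lines (a sketch using the weights from the figure):

```python
import numpy as np

def relu(z): return np.maximum(z, 0)

# weights and biases from Fig. 7.6 (the Goodfellow et al. XOR solution)
W = np.array([[1, 1],   # weights into h1 (bias 0)
              [1, 1]])  # weights into h2 (bias -1)
c = np.array([0, -1])   # hidden-layer biases
u = np.array([1, -2])   # weights into y1 (bias 0)

def xor_net(x):
    h = relu(W @ np.array(x) + c)   # hidden representation
    return int(u @ h), h

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y, h = xor_net(x)
    print(x, h, y)   # (0,1) and (1,0) both map to h = [1, 0]
```

The outputs reproduce the XOR truth table, and the two positive cases indeed merge to the same hidden point.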
In this example we just stipulated the weights in Fig. 7.6. But for real examples
the weights for neural networks are learned automatically using the error backprop-
agation algorithm to be introduced in Section 7.6. That means the hidden layers will
learn to form useful representations. This intuition, that neural networks can auto-
matically learn useful representations of the input, is one of their key advantages,
and one that we will return to again and again in later chapters.
7.3 Feedforward Neural Networks
Figure 7.8 A simple 2-layer feedforward network, with one hidden layer, one output layer,
and one input layer (the input layer is usually not counted when enumerating layers).
The hidden layer computes its output in three steps: multiplying the weight matrix by the input vector x, adding the bias vector b, and applying the activation function g (such as the sigmoid, tanh, or ReLU activation function defined above).
The output of the hidden layer, the vector h, is thus the following (for this exam-
ple we’ll use the sigmoid function σ as our activation function):
h = σ (Wx + b) (7.8)
Notice that we’re applying the σ function here to a vector, while in Eq. 7.3 it was
applied to a scalar. We’re thus allowing σ (·), and indeed any activation function
g(·), to apply to a vector element-wise, so g[z1 , z2 , z3 ] = [g(z1 ), g(z2 ), g(z3 )].
Let’s introduce some constants to represent the dimensionalities of these vectors
and matrices. We’ll refer to the input layer as layer 0 of the network, and have n0
represent the number of inputs, so x is a vector of real numbers of dimension n0 ,
or more formally x ∈ Rn0 , a column vector of dimensionality [n0 , 1]. Let’s call the
hidden layer layer 1 and the output layer layer 2. The hidden layer has dimensional-
ity n1 , so h ∈ Rn1 and also b ∈ Rn1 (since each hidden unit can take a different bias
value). And the weight matrix W has dimensionality W ∈ Rn1 ×n0 , i.e. [n1 , n0 ].
Take a moment to convince yourself that the matrix multiplication in Eq. 7.8 will compute the value of each hj as σ(Σ_{i=1}^{n0} Wji xi + bj).
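That equivalence can also be checked numerically; the sketch below (with arbitrary random weights) compares the vectorized Eq. 7.8 against the per-unit sums:

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1 = 4, 3                      # input and hidden dimensionalities
W = rng.normal(size=(n1, n0))
b = rng.normal(size=n1)
x = rng.normal(size=n0)

h = sigmoid(W @ x + b)             # Eq. 7.8, vectorized

# the same computation written out unit by unit
h_loops = np.array([sigmoid(sum(W[j, i] * x[i] for i in range(n0)) + b[j])
                    for j in range(n1)])
assert np.allclose(h, h_loops)
```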
As we saw in Section 7.2, the resulting value h (for hidden but also for hypoth-
esis) forms a representation of the input. The role of the output layer is to take
this new representation h and compute a final output. This output could be a real-
valued number, but in many cases the goal of the network is to make some sort of
classification decision, and so we will focus on the case of classification.
If we are doing a binary task like sentiment classification, we might have a sin-
gle output node, and its scalar value y is the probability of positive versus negative
sentiment. If we are doing multinomial classification, such as assigning a part-of-
speech tag, we might have one output node for each potential part-of-speech, whose
output value is the probability of that part-of-speech, and the values of all the output
nodes must sum to one. The output layer is thus a vector y that gives a probability
distribution across the output nodes.
Let’s see how this happens. Like the hidden layer, the output layer has a weight
matrix (let’s call it U), but some models don’t include a bias vector b in the output
layer, so we’ll simplify by eliminating the bias vector in this example. The weight
matrix is multiplied by its input vector (h) to produce the intermediate output z:
z = Uh
There are n2 output nodes, so z ∈ Rn2 , weight matrix U has dimensionality U ∈
Rn2 ×n1 , and element Ui j is the weight from unit j in the hidden layer to unit i in the
output layer.
However, z can’t be the output of the classifier, since it’s a vector of real-valued numbers, while what we need for classification is a vector of probabilities. There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution (all the numbers lie between 0 and 1 and sum to 1): the softmax function that we saw on page ?? of Chapter 5. More generally for any vector z of dimensionality d, the softmax is defined as:

softmax(zi) = exp(zi) / Σ_{j=1}^{d} exp(zj)    1 ≤ i ≤ d    (7.9)
You may recall that we used softmax to create a probability distribution from a
vector of real-valued numbers (computed from summing weights times features) in
the multinomial version of logistic regression in Chapter 5.
That means we can think of a neural network classifier with one hidden layer
as building a vector h which is a hidden layer representation of the input, and then
running standard multinomial logistic regression on the features that the network
develops in h. By contrast, in Chapter 5 the features were mainly designed by hand
via feature templates. So a neural network is like multinomial logistic regression,
but (a) with many layers, since a deep neural network is like layer after layer of lo-
gistic regression classifiers; (b) with those intermediate layers having many possible
activation functions (tanh, ReLU, sigmoid) instead of just sigmoid (although we’ll
continue to use σ for convenience to mean any activation function); (c) rather than
forming the features by feature templates, the prior layers of the network induce the
feature representations themselves.
Here are the final equations for a feedforward network with a single hidden layer,
which takes an input vector x, outputs a probability distribution y, and is parameter-
ized by weight matrices W and U and a bias vector b:
h = σ (Wx + b)
z = Uh
y = softmax(z) (7.12)
And just to remember the shapes of all our variables, x ∈ Rn0 , h ∈ Rn1 , b ∈ Rn1 ,
W ∈ Rn1 ×n0 , U ∈ Rn2 ×n1 , and the output vector y ∈ Rn2 . We’ll call this network a 2-
layer network (we traditionally don’t count the input layer when numbering layers,
but do count the output layer). So by this terminology logistic regression is a 1-layer
network.
Note that with this notation, the equations for the computation done at each layer are
the same. The algorithm for computing the forward step in an n-layer feedforward
network, given the input vector a[0] is thus simply:
for i in 1,...,n
z[i] = W[i] a[i−1] + b[i]
a[i] = g[i] (z[i] )
ŷ = a[n]
The activation functions g(·) are generally different at the final layer. Thus g[2]
might be softmax for multinomial classification or sigmoid for binary classification,
while ReLU or tanh might be the activation function g(·) at the internal layers.
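The forward-pass loop above can be sketched directly in Python (a minimal illustration with arbitrary random weights; the `layers` representation is our own):

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

def softmax(z):
    ez = np.exp(z - np.max(z))
    return ez / ez.sum()

def forward(x, layers):
    """Forward pass for an n-layer feedforward net.
    `layers` is a list of (W, b, g) triples, one per layer."""
    a = x                          # a[0] is the input
    for W, b, g in layers:
        z = W @ a + b              # z[i] = W[i] a[i-1] + b[i]
        a = g(z)                   # a[i] = g[i](z[i])
    return a                       # y_hat = a[n]

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(3, 4)), np.zeros(3), np.tanh),   # hidden: tanh
          (rng.normal(size=(2, 3)), np.zeros(2), softmax)]   # output: softmax
y_hat = forward(rng.normal(size=4), layers)
assert np.isclose(y_hat.sum(), 1.0)
```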
The need for non-linear activation functions One of the reasons we use non-linear activation functions for each layer in a neural network is that if we did not, the resulting network would be exactly equivalent to a single-layer network. Let’s see why this is true. Imagine the first two layers of such a network of purely linear layers:
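The collapse is easy to check numerically: composing two purely linear layers W2(W1x) is the same as one linear layer with W′ = W2W1 (a sketch with arbitrary matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(5, 4))   # first layer, no activation
W2 = rng.normal(size=(3, 5))   # second layer, no activation
x = rng.normal(size=4)

# two linear layers collapse into one: W' = W2 W1
W_prime = W2 @ W1
assert np.allclose(W2 @ (W1 @ x), W_prime @ x)
```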
Replacing the bias unit In describing networks, we will often use a slightly simplified notation that represents exactly the same function without referring to an explicit bias node b. Instead, we add a dummy node a0 to each layer whose value will always be 1. Thus layer 0, the input layer, will have a dummy node a0[0] = 1, layer 1 will have a0[1] = 1, and so on. This dummy node still has an associated weight, and that weight represents the bias value b. For example instead of an equation like

h = σ(Wx + b)    (7.15)

we’ll use:

h = σ(Wx)    (7.16)
But now instead of our vector x having n0 values: x = x1, ..., xn0, it will have n0 + 1 values, with a new 0th dummy value x0 = 1: x = x0, ..., xn0. And instead of computing each hj as follows:

hj = σ(Σ_{i=1}^{n0} Wji xi + bj),    (7.17)

we’ll instead use:

hj = σ(Σ_{i=0}^{n0} Wji xi),    (7.18)

where the value Wj0 replaces what had been bj. Fig. 7.9 shows a visualization.
Figure 7.9 Replacing the bias node (shown in a) with x0 (b).
We’ll continue showing the bias as b when we go over the learning algorithm
in Section 7.6, but then we’ll switch to this simplified notation without explicit bias
terms for the rest of the book.
Let’s begin with a simple 2-layer sentiment classifier. You might imagine taking our logistic regression classifier from Chapter 5, which corresponds to a 1-layer network, and just adding a hidden layer. The input element xi could be scalar features like those in Fig. ??, e.g., x1 = count(words ∈ doc), x2 = count(positive lexicon words ∈ doc), x3 = 1 if “no” ∈ doc, and so on. And the output layer ŷ could have two nodes (one each for positive and negative), or 3 nodes (positive, negative, neutral), in which case ŷ1 would be the estimated probability of positive sentiment, ŷ2 the probability of negative and ŷ3 the probability of neutral. The resulting equations would be just what we saw above for a 2-layer network (as always, we’ll continue to use the σ to stand for any non-linearity, whether sigmoid, ReLU or other).
x = [x1 , x2 , ...xN ] (each xi is a hand-designed feature)
h = σ (Wx + b)
z = Uh
ŷ = softmax(z) (7.19)
Fig. 7.10 shows a sketch of this architecture. As we mentioned earlier, adding this
hidden layer to our logistic regression classifier allows the network to represent the
non-linear interactions between features. This alone might give us a better sentiment
classifier.
There are many other options, like taking the element-wise max. The element-wise
max of a set of n vectors is a new vector whose kth element is the max of the kth
elements of all the n vectors. Here are the equations for this classifier assuming
mean pooling; the architecture is sketched in Fig. 7.11:
x = mean(e(w1), e(w2), ..., e(wn))    (7.20)
h = σ(Wx + b)
z = Uh
ŷ = softmax(z)    (7.21)

Figure 7.11 A classifier with input word embeddings pooled (here by taking the mean) into a single input vector x of shape [d×1]; W is [dh×d], h is [dh×1], U is [3×dh], and the output ŷ [3×1] gives p(+), p(−), and p(neutral).
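A minimal sketch of this pooled-embedding classifier (toy dimensionalities and random weights; both mean and element-wise max pooling are shown):

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

def softmax(z):
    ez = np.exp(z - np.max(z))
    return ez / ez.sum()

rng = np.random.default_rng(3)
d, dh = 6, 4                       # embedding and hidden dimensionalities
W = rng.normal(size=(dh, d))
b = np.zeros(dh)
U = rng.normal(size=(3, dh))       # 3 output classes: +, -, neutral

# toy embeddings for three input words
embeddings = rng.normal(size=(3, d))

x_mean = embeddings.mean(axis=0)   # mean pooling
x_max = embeddings.max(axis=0)     # element-wise max pooling alternative

h = sigmoid(W @ x_mean + b)
y_hat = softmax(U @ h)
assert np.isclose(y_hat.sum(), 1.0)
```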
While Eq. 7.21 shows how to classify a single example x, in practice we want to efficiently classify an entire test set of m examples. We do this by vectorizing the process, just as we saw with logistic regression; instead of using for-loops to go through each example, we’ll use matrix multiplication to do the entire computation of an entire test set at once. First, we pack all the input feature vectors for each input x into a single input matrix X, with each row i a row vector consisting of the pooled embedding for input example x(i) (i.e., the vector x(i)). If the dimensionality of our pooled input embedding is d, X will be a matrix of shape [m × d].

We will then need to slightly modify Eq. 7.21. X is of shape [m × d] and W is of shape [dh × d], so we’ll have to reorder how we multiply X and W and transpose W so they correctly multiply to yield a matrix H of shape [m × dh]. The bias vector b from Eq. 7.21 of shape [1 × dh] will now have to be replicated into a matrix of shape [m × dh]. We’ll need to similarly reorder the next step and transpose U. Finally, our output matrix Ŷ will be of shape [m × 3] (or more generally [m × do], where do is the number of output classes), with each row i of our output matrix Ŷ consisting of the output vector ŷ(i). Here are the final equations for computing the output class distribution for an entire test set:

H = σ(XWᵀ + b)
Z = HUᵀ
Ŷ = softmax(Z)    (7.22)
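The batched computation can be sketched as follows (toy sizes, random weights; numpy’s broadcasting handles the replication of b across the m rows):

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

def softmax(z):
    ez = np.exp(z - z.max())
    return ez / ez.sum()

def softmax_rows(Z):
    # row-wise softmax: one distribution per example
    eZ = np.exp(Z - Z.max(axis=1, keepdims=True))
    return eZ / eZ.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
m, d, dh, do = 5, 6, 4, 3
X = rng.normal(size=(m, d))        # one pooled input per row
W = rng.normal(size=(dh, d))
b = np.zeros(dh)                   # broadcast across the m rows
U = rng.normal(size=(do, dh))

H = sigmoid(X @ W.T + b)           # [m x dh], first line of Eq. 7.22
Y_hat = softmax_rows(H @ U.T)      # [m x do]

# same result as classifying one example at a time
y0 = softmax(U @ sigmoid(W @ X[0] + b))
assert np.allclose(Y_hat[0], y0)
```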
In the following examples we’ll use a 4-gram example, so we’ll show a neural net to
estimate the probability P(wt = i|wt−3 , wt−2 , wt−1 ).
Neural language models represent words in this prior context by their embed-
dings, rather than just by their word identity as used in n-gram language models.
Using embeddings allows neural language models to generalize better to unseen
data. For example, suppose we’ve seen this sentence in training:
I have to make sure that the cat gets fed.
but have never seen the words “gets fed” after the word “dog”. Our test set has the
prefix “I forgot to make sure that the dog gets”. What’s the next word? An n-gram
language model will predict “fed” after “that the cat gets”, but not after “that the dog
gets”. But a neural LM, knowing that “cat” and “dog” have similar embeddings, will
be able to generalize from the “cat” context to assign a high enough probability to
“fed” even after seeing “dog”.
Figure 7.12 Selecting the embedding vector for word V5 by multiplying the embedding
matrix E with a one-hot vector with a 1 in index 5.
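The selection in Fig. 7.12 is just a matrix-vector product; multiplying E by a one-hot vector picks out one column (a small sketch with random values):

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 10, 4                      # vocabulary size and embedding dimension
E = rng.normal(size=(d, V))       # embedding matrix, one column per word

one_hot = np.zeros(V)
one_hot[5] = 1                    # one-hot vector for word 5

e5 = E @ one_hot                  # the product selects column 5 of E
assert np.allclose(e5, E[:, 5])
```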
Figure 7.13 The feedforward neural language model. Each of the 3 context words is a one-hot vector of length |V| that selects one column of the shared embedding matrix E ([d×|V|]); the concatenated embeddings e ([3d×1]) feed the hidden layer h ([dh×1]) via W ([dh×3d]), and the output layer U ([|V|×dh]) followed by a softmax produces the distribution ŷ ([|V|×1]) over the vocabulary.
In summary, the equations for a neural language model with a window size of 3, given one-hot input vectors xt−3, xt−2, xt−1 for each input context word, are:

e = [Ext−3; Ext−2; Ext−1]
h = σ(We + b)
z = Uh
ŷ = softmax(z)
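These four steps can be sketched end to end (toy sizes, random weights; for brevity we index E by the context-word positions directly rather than multiplying by explicit one-hot vectors, which selects the same columns):

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

def softmax(z):
    ez = np.exp(z - np.max(z))
    return ez / ez.sum()

rng = np.random.default_rng(6)
V, d, dh = 20, 5, 8               # toy |V|, embedding dim, hidden dim
E = rng.normal(size=(d, V))       # shared embedding matrix
W = rng.normal(size=(dh, 3 * d))
b = np.zeros(dh)
U = rng.normal(size=(V, dh))

context = [3, 7, 11]              # indices of w_{t-3}, w_{t-2}, w_{t-1}
e = np.concatenate([E[:, i] for i in context])   # [3d x 1]
h = sigmoid(W @ e + b)                           # [dh x 1]
y_hat = softmax(U @ h)                           # distribution over |V| words
assert np.isclose(y_hat.sum(), 1.0)
```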
This makes it more obvious that the terms in the sum in Eq. 7.26 will be 0 except for the term corresponding to the true class for which yk = 1:

LCE(ŷ, y) = − Σ_{k=1}^{K} 1{yk = 1} log ŷk
In other words, the cross-entropy loss is simply the negative log of the output probability corresponding to the correct class, and we therefore also call this the negative log likelihood loss:

LCE(ŷ, y) = − log ŷc    (where c is the correct class)    (7.27)

Plugging in the softmax formula from Eq. 7.9, and with K the number of classes:

LCE(ŷ, y) = − log ( exp(zc) / Σ_{j=1}^{K} exp(zj) )    (where c is the correct class)    (7.28)
∂LCE(ŷ, y)/∂wj = (ŷ − y) xj
               = (σ(w · x + b) − y) xj    (7.29)

Or for a network with one weight layer and softmax output (= multinomial logistic regression), we could use the derivative of the softmax loss from Eq. ??, shown for a particular weight wk and input xi:

∂LCE(ŷ, y)/∂wk,i = −(yk − ŷk) xi
                 = −(yk − p(yk = 1|x)) xi
                 = −( yk − exp(wk · x + bk) / Σ_{j=1}^{K} exp(wj · x + bj) ) xi    (7.30)
But these derivatives only give correct updates for one weight layer: the last one!
For deep networks, computing the gradients for each weight is much more complex,
since we are computing the derivative with respect to weight parameters that appear
all the way back in the very early layers of the network, even though the loss is
computed only at the very end of the network.
The solution to computing this gradient is an algorithm called error backpropagation or backprop (Rumelhart et al., 1986). While backprop was invented specially for neural networks, it turns out to be the same as a more general procedure called backward differentiation, which depends on the notion of computation graphs. Let’s see how that works in the next subsection.
Figure 7.14 Computation graph for the function L(a, b, c) = c(a+2b), with values for input
nodes a = 3, b = 1, c = −2, showing the forward pass computation of L.
The intuition of backward differentiation is to pass gradients back from the final
node to all the nodes in the graph. Fig. 7.15 shows part of the backward computation
at one node e. Each node takes an upstream gradient that is passed in from its parent
node to the right, and for each of its inputs computes a local gradient (the gradient
of its output with respect to its input), and uses the chain rule to multiply these two
to compute a downstream gradient to be passed on to the next earlier node.
∂L/∂d = (∂e/∂d) × (∂L/∂e)
downstream gradient = local gradient × upstream gradient
Figure 7.15 Each node (like e here) takes an upstream gradient, multiplies it by the local
gradient (the gradient of its output with respect to its input), and uses the chain rule to compute
a downstream gradient to be passed on to a prior node. A node may have multiple local
gradients if it has multiple inputs.
Let’s now compute the 3 derivatives we need. Since in the computation graph L = ce, we can directly compute the derivative ∂L/∂c:

∂L/∂c = e    (7.33)

For the other two, we’ll need to use the chain rule:

∂L/∂a = (∂L/∂e)(∂e/∂a)
∂L/∂b = (∂L/∂e)(∂e/∂d)(∂d/∂b)    (7.34)

Eq. 7.34 and Eq. 7.33 thus require five intermediate derivatives: ∂L/∂e, ∂L/∂c, ∂e/∂a, ∂e/∂d, and ∂d/∂b, which are as follows (making use of the fact that the derivative of a sum is the sum of the derivatives):

L = ce:      ∂L/∂e = c,    ∂L/∂c = e
e = a + d:   ∂e/∂a = 1,    ∂e/∂d = 1
d = 2b:      ∂d/∂b = 2
In the backward pass, we compute each of these partials along each edge of the graph from right to left, using the chain rule just as we did above. Thus we begin by computing the downstream gradients from node L, which are ∂L/∂e and ∂L/∂c. For node e, we then multiply this upstream gradient ∂L/∂e by the local gradient (the gradient of the output with respect to the input), ∂e/∂d, to get the output we send back to node d: ∂L/∂d. And so on, until we have annotated the graph all the way to all the input variables. The forward pass conveniently already will have computed the values of the forward intermediate variables we need (like d and e) to compute these derivatives. Fig. 7.16 shows the backward pass.
Figure 7.16 Computation graph for the function L(a, b, c) = c(a + 2b), showing the backward pass computation of ∂L/∂a, ∂L/∂b, and ∂L/∂c (which come out to −2, −4, and 5, respectively).
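The whole forward and backward pass of Fig. 7.14 and Fig. 7.16 fits in a few lines (the variable names mirror the graph nodes):

```python
# forward pass for L(a, b, c) = c(a + 2b), as in Fig. 7.14
a, b, c = 3.0, 1.0, -2.0
d = 2 * b          # d = 2
e = a + d          # e = 5
L = c * e          # L = -10

# backward pass: multiply each local gradient by its upstream gradient
dL_de = c                  # L = ce, so dL/de = c
dL_dc = e                  # and dL/dc = e
dL_da = dL_de * 1          # e = a + d, so de/da = 1
dL_dd = dL_de * 1          # de/dd = 1
dL_db = dL_dd * 2          # d = 2b, so dd/db = 2

print(L, dL_da, dL_db, dL_dc)   # -10.0 -2.0 -4.0 5.0
```

These match the values annotated on Fig. 7.16.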
For the backward pass we’ll also need to compute the loss L. The loss function for binary sigmoid output from Eq. 7.25 is:

LCE(a[2], y) = −[ y log a[2] + (1 − y) log(1 − a[2]) ]    (7.37)
The weights that need updating (those for which we need to know the partial
derivative of the loss function) are shown in teal. In order to do the backward pass,
we’ll need to know the derivatives of all the functions in the graph. We already saw
in Section ?? the derivative of the sigmoid σ :
dσ(z)/dz = σ(z)(1 − σ(z))    (7.38)

We’ll also need the derivatives of each of the other activation functions. The derivative of tanh is:

d tanh(z)/dz = 1 − tanh²(z)    (7.39)
Figure 7.17 Sample computation graph for a simple 2-layer neural net (= 1 hidden layer) with two input units and 2 hidden units. We’ve adjusted the notation a bit to avoid long equations in the nodes by just mentioning the function that is being computed, and the resulting variable name. Thus the * to the right of node w11[1] means that w11[1] is to be multiplied by x1, and the node z[1] = + means that the value of z[1] is computed by summing the three nodes that feed into it (the two products, and the bias term b1[1]).
We’ll give the start of the computation, computing the derivative of the loss function L with respect to z, or ∂L/∂z (and leaving the rest of the computation as an exercise for the reader). By the chain rule:

∂L/∂z = (∂L/∂a[2]) (∂a[2]/∂z)    (7.41)

So let’s first compute ∂L/∂a[2], taking the derivative of Eq. 7.37, repeated here:

LCE(a[2], y) = −[ y log a[2] + (1 − y) log(1 − a[2]) ]

∂L/∂a[2] = −( y · ∂log(a[2])/∂a[2] + (1 − y) · ∂log(1 − a[2])/∂a[2] )
         = −( y · (1/a[2]) + (1 − y) · (1/(1 − a[2])) · (−1) )
         = −( y/a[2] + (y − 1)/(1 − a[2]) )    (7.42)

∂a[2]/∂z = a[2](1 − a[2])
∂L/∂z = (∂L/∂a[2]) (∂a[2]/∂z)
      = −( y/a[2] + (y − 1)/(1 − a[2]) ) · a[2](1 − a[2])
      = a[2] − y    (7.43)
Continuing the backward computation of the gradients (next by passing the gradients over b1[2] and the two product nodes, and so on, back to all the teal nodes) is left as an exercise for the reader.
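The tidy result ∂L/∂z = a[2] − y in Eq. 7.43 can be sanity-checked against a numerical finite-difference gradient (a small sketch; the particular z and y values are arbitrary):

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

def loss(z, y):
    # binary cross-entropy of Eq. 7.37 applied to a = sigmoid(z)
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # central difference
analytic = sigmoid(z) - y                                    # Eq. 7.43
assert abs(numeric - analytic) < 1e-5
```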
useful when the task the network is designed for (like sentiment classification, trans-
lation, or parsing) places strong constraints on what makes a good representation for
words.
Let’s see how to train the entire model including E, i.e. to set all the parameters
θ = E, W, U, b. We’ll do this via gradient descent (Fig. ??), using error backpropa-
gation on the computation graph to compute the gradient. Training thus not only sets
the weights W and U of the network, but also as we’re predicting upcoming words,
we’re learning the embeddings E for each word that best predict upcoming words.
Figure 7.18 Learning all the way back to embeddings. Again, the embedding matrix E is
shared among the 3 context words.
Fig. 7.18 shows the setup for a window size of N=3 context words. The input x
consists of 3 one-hot vectors, fully connected to the embedding layer via 3 instanti-
ations of the embedding matrix E. We don’t want to learn separate weight matrices
for mapping each of the 3 previous words to the projection layer. We want one single
embedding dictionary E that’s shared among these three. That’s because over time,
many different words will appear as wt−2 or wt−1 , and we’d like to just represent
each word with one vector, whichever context position it appears in. Recall that the
embedding weight matrix E has a column for each word, each a column vector of d
dimensions, and hence has dimensionality d × |V |.
Generally training proceeds by taking as input a very long text, concatenating all
the sentences, starting with random weights, and then iteratively moving through the
text predicting each word wt . At each word wt , we use the cross-entropy (negative
log likelihood) loss. Recall that the general form for this (repeated from Eq. 7.27) is:
LCE (ŷ, y) = − log ŷi , (where i is the correct class) (7.44)
For language modeling, the classes are the words in the vocabulary, so ŷi here means
the probability that the model assigns to the correct next word wt :
LCE = − log p(wt |wt−1 , ..., wt−n+1 ) (7.45)
The parameter update for stochastic gradient descent for this loss from step s to s + 1 is then:

θs+1 = θs − η · ∂[− log p(wt|wt−1, ..., wt−n+1)] / ∂θ    (7.46)
This gradient can be computed in any standard neural network framework which
will then backpropagate through θ = E, W, U, b.
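The update rule of Eq. 7.46 itself is a one-liner; here is a minimal sketch (the gradient values are made up, standing in for what backprop would compute):

```python
import numpy as np

def sgd_step(theta, grad, eta=0.1):
    # Eq. 7.46: move the parameters against the gradient of the loss
    return theta - eta * grad

theta = np.array([0.5, -0.3])
grad = np.array([0.2, -0.4])   # gradient of -log p(w_t | context) w.r.t. theta
theta_new = sgd_step(theta, grad)
assert np.allclose(theta_new, [0.48, -0.26])
```

In practice θ bundles all of E, W, U, and b, and the gradient comes from backpropagation as described above.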
Training the parameters to minimize loss will result both in an algorithm for
language modeling (a word predictor) but also a new set of embeddings E that can
be used as word representations for other tasks.
7.8 Summary
• Neural networks are built out of neural units, originally inspired by human
neurons but now simply an abstract computational device.
• Each neural unit multiplies input values by a weight vector, adds a bias, and
then applies a non-linear activation function like sigmoid, tanh, or rectified
linear unit.
• In a fully-connected, feedforward network, each unit in layer i is connected
to each unit in layer i + 1, and there are no cycles.
• The power of neural networks comes from the ability of early layers to learn
representations that can be utilized by later layers in the network.
• Neural networks are trained by optimization algorithms like gradient de-
scent.
• Error backpropagation, backward differentiation on a computation graph,
is used to compute the gradients of the loss function for a network.
• Neural language models use a neural network as a probabilistic classifier, to
compute the probability of the next word given the previous n words.
• Neural language models can use pretrained embeddings, or can learn embed-
dings from scratch in the process of language modeling.
1986). During the 1980s a wide variety of neural network and related architec-
tures were developed, particularly for applications in psychology and cognitive sci-
ence (Rumelhart and McClelland 1986b, McClelland and Elman 1986, Rumelhart
and McClelland 1986a, Elman 1990), for which the term connectionist or parallel
distributed processing was often used (Feldman and Ballard 1982, Smolensky
1988). Many of the principles and techniques developed in this period are foun-
dational to modern work, including the ideas of distributed representations (Hinton,
1986), recurrent networks (Elman, 1990), and the use of tensors for compositionality
(Smolensky, 1990).
By the 1990s larger neural networks began to be applied to many practical lan-
guage processing tasks as well, like handwriting recognition (LeCun et al. 1989) and
speech recognition (Morgan and Bourlard 1990). By the early 2000s, improvements
in computer hardware and advances in optimization and training techniques made it
possible to train even larger and deeper networks, leading to the modern term deep
learning (Hinton et al. 2006, Bengio et al. 2007). We cover more related history in
Chapter 9 and Chapter 16.
There are a number of excellent books on the subject. Goldberg (2017) has
superb coverage of neural networks for natural language processing. For neural
networks in general see Goodfellow et al. (2016) and Nielsen (2015).
Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Bengio, Y., R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. JMLR, 3:1137–1155.
Bengio, Y., P. Lamblin, D. Popovici, and H. Larochelle. 2007. Greedy layer-wise training of deep networks. NeurIPS.
Elman, J. L. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Feldman, J. A. and D. H. Ballard. 1982. Connectionist models and their properties. Cognitive Science, 6:205–254.
Goldberg, Y. 2017. Neural Network Methods for Natural Language Processing, volume 10 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool.
Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.
Hinton, G. E. 1986. Learning distributed representations of concepts. COGSCI.
Hinton, G. E., S. Osindero, and Y.-W. Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
Hinton, G. E., N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv preprint arXiv:1207.0580.
Kingma, D. and J. Ba. 2015. Adam: A method for stochastic optimization. ICLR 2015.
LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.
McClelland, J. L. and J. L. Elman. 1986. The TRACE model of speech perception. Cognitive Psychology, 18:1–86.
McCulloch, W. S. and W. Pitts. 1943. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133.
Minsky, M. and S. Papert. 1969. Perceptrons. MIT Press.
Morgan, N. and H. Bourlard. 1990. Continuous speech recognition using multilayer perceptrons with hidden Markov models. ICASSP.
Nielsen, M. A. 2015. Neural Networks and Deep Learning. Determination Press.
Paszke, A., S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. Automatic differentiation in PyTorch. NIPS-W.
Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 2, pages 318–362. MIT Press.
Rumelhart, D. E. and J. L. McClelland. 1986a. On learning the past tense of English verbs. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 2, pages 216–271. MIT Press.
Rumelhart, D. E. and J. L. McClelland, editors. 1986b. Parallel Distributed Processing. MIT Press.
Russell, S. and P. Norvig. 2002. Artificial Intelligence: A Modern Approach, 2nd edition. Prentice Hall.
Smolensky, P. 1988. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11(1):1–23.
Smolensky, P. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216.
Srivastava, N., G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.
Widrow, B. and M. E. Hoff. 1960. Adaptive switching circuits. IRE WESCON Convention Record, volume 4.