CS224n 2023 Lecture 05: RNN LM
Christopher Manning
Lecture 5: Language Models and Recurrent Neural Networks
Lecture Plan
1. A bit more about neural networks (10 mins)
Language modeling + RNNs
• 2. A new NLP task: Language Modeling (20 mins) — this is the most important concept in the class! It leads to GPT-3 and ChatGPT, and it motivates …
• 3. A new family of neural networks: Recurrent Neural Networks (RNNs) (25 mins)
Important and used in Assignment 4, but not the only way to build LMs
• 4. Problems with RNNs (15 mins)
• 5. Recap on RNNs/LMs (10 mins)
Reminders:
You should have handed in Assignment 2 by the start of class today
In Assignment 3, out today, you build a neural dependency parser using PyTorch
2
We have models with many parameters! Regularization!
• A full loss function includes regularization over all parameters θ, e.g., L2 regularization:  J_reg(θ) = J(θ) + λ Σ_k θ_k²
• Classic view: Regularization works to prevent overfitting when we have a lot of features
(or later a very powerful/deep model, etc.)
• Now: Regularization produces models that generalize well when we have a “big” model
• We do not care that our models overfit on the training data, even though they are hugely overfit
[Figure: classic picture of training error and test error as a function of model “power”: training error keeps decreasing toward 0, while test error eventually rises again — overfitting.]
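As a concrete illustration of the L2 term above (a minimal PyTorch sketch of my own, not from the slides — the model and λ value are placeholders): in practice the penalty is usually applied via the optimizer's weight_decay argument, which is equivalent to adding (λ/2)·‖θ‖² to the loss.

```python
import torch

# Hypothetical tiny model; the point is only how L2 regularization is wired in.
model = torch.nn.Linear(100, 5)

# weight_decay adds lambda * theta to every gradient, i.e. it implements
# L2 regularization without writing the penalty into the loss by hand.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# Equivalently, the penalty can be written into the loss explicitly:
lam = 1e-4
def regularized_loss(data_loss, parameters):
    l2 = sum((p ** 2).sum() for p in parameters)  # sum of theta_k^2 over all parameters
    return data_loss + lam * l2

data_loss = model(torch.randn(4, 100)).pow(2).mean()   # placeholder data loss
loss = regularized_loss(data_loss, model.parameters())
```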
Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov 2012/JMLR 2014)
Preventing Feature Co-adaptation = Good Regularization Method! Use it everywhere!
• Training time: at each instance of evaluation (in online SGD-training), randomly set
~50% (p%) of the inputs to each neuron to 0
• Test time: don’t drop anything, but halve the model weights (since roughly twice as many inputs are now active)
• Exception: usually the first-layer inputs are only dropped a little (~15%) or not at all
• This prevents feature co-adaptation: A feature cannot only be useful in the presence
of particular other features
• In a single layer: A kind of middle-ground between Naïve Bayes (where all feature
weights are set independently) and logistic regression models (where weights are
set in the context of all others)
• Can be thought of as a form of model bagging (i.e., like an ensemble model)
• Nowadays usually thought of as strong, feature-dependent regularizer
[Wager, Wang, & Liang 2013]
4
Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov 2012/JMLR 2014)
• During training, for each data point, each time:
  • Randomly set each input to 0 with probability p, the “dropout ratio”, via a dropout mask (often p = 0.5, except p = 0.15 for the input layer)
  • Train pass 1 (mask drops x2 and x4):  y = w1·x1 + w3·x3 + b
  • Train pass 2 (a different random mask, dropping x4):  y = w1·x1 + w2·x2 + w3·x3 + b
• During testing:
  • Multiply all weights by 1 − p; no other dropout
  • Test:  y = (1 − p)(w1·x1 + w2·x2 + w3·x3 + w4·x4) + b
5
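A minimal NumPy sketch of the idea on this slide (my own code, not the lecture’s). It uses “inverted” dropout, which rescales the surviving units by 1/(1−p) at training time instead of multiplying the weights by 1−p at test time; the two are equivalent in expectation, and the inverted form is what libraries such as torch.nn.Dropout implement.

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True, rng=np.random.default_rng(0)):
    """Apply dropout to activations x with drop probability p.

    Training: zero each unit with probability p and rescale survivors by
    1/(1-p) ("inverted dropout"), so no weight rescaling is needed at test time.
    Test: return x unchanged (equivalent to the slide's "multiply weights by 1-p").
    """
    if not train or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)  # dropout mask with rescaling baked in
    return x * mask

# Example: drop ~50% of a hidden layer's activations during training.
h = np.ones(8)
print(dropout_forward(h, p=0.5, train=True))   # some zeros, survivors scaled to 2.0
print(dropout_forward(h, p=0.5, train=False))  # unchanged at test time
```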
“Vectorization”
• E.g., looping over word vectors versus concatenating them all into one large matrix
and then multiplying the softmax weights with that matrix:
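The slide compares two code snippets; here is a NumPy version of that comparison (my own sketch, with made-up sizes). Both compute the same softmax logits, but the batched matrix multiply avoids a Python-level loop and is much faster, especially on a GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_words, n_classes = 300, 10_000, 5
X = rng.standard_normal((n_words, d))    # each row is one word vector
W = rng.standard_normal((d, n_classes))  # softmax weights

# Non-vectorized: loop over word vectors one at a time.
logits_loop = np.stack([x @ W for x in X])

# Vectorized: concatenate all vectors into one matrix and do a single multiply.
logits_batch = X @ W

assert np.allclose(logits_loop, logits_batch)  # same result, far fewer Python-level operations
```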
Language Modeling is the task of predicting what word comes next:
  the students opened their ______   (exams? minds? …)
• More formally: given a sequence of words x^(1), x^(2), …, x^(t), compute the probability distribution of the next word x^(t+1):
  P(x^(t+1) | x^(t), …, x^(1)),
  where x^(t+1) can be any word in the vocabulary V
9
Language Modeling
• You can also think of a Language Model as a system that assigns a probability to a piece of text
• For example, if we have some text x^(1), …, x^(T), then the probability of this text (according to the Language Model) is:
  P(x^(1), …, x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × ⋯ × P(x^(T) | x^(T−1), …, x^(1)) = ∏_{t=1}^{T} P(x^(t) | x^(t−1), …, x^(1))
10
You use Language Models every day!
11
You use Language Models every day!
12
n-gram Language Models
the students opened their ______
• An n-gram is a chunk of n consecutive words: unigrams (“the”), bigrams (“the students”), trigrams (“the students opened”), four-grams (“the students opened their”), …
• Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.
13
n-gram Language Models
• First we make a Markov assumption: x^(t+1) depends only on the preceding n−1 words:

  P(x^(t+1) | x^(t), …, x^(1)) = P(x^(t+1) | x^(t), …, x^(t−n+2))    (assumption)

• How do we get these probabilities? By the definition of conditional probability, this is the probability of an n-gram divided by the probability of the corresponding (n−1)-gram, and we estimate both by counting in a large corpus:

  = P(x^(t+1), x^(t), …, x^(t−n+2)) / P(x^(t), …, x^(t−n+2))           (definition of conditional prob)
  ≈ count(x^(t+1), x^(t), …, x^(t−n+2)) / count(x^(t), …, x^(t−n+2))   (statistical approximation)
14
n-gram Language Models: Example
Suppose we are learning a 4-gram Language Model.
as the proctor started the clock, the students opened their _____
(discard everything except the last n−1 = 3 words; condition only on “students opened their”)
  P(w | students opened their) = count(students opened their w) / count(students opened their)
Sparsity Problem 1
Problem: What if “students opened their w” never occurred in the data? Then w has probability 0!
(Partial) Solution: Add a small δ to the count for every w in the vocabulary. This is called smoothing.

Sparsity Problem 2
Problem: What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w!
(Partial) Solution: Just condition on “opened their” instead. This is called backoff.

Storage: Increasing n or increasing the corpus increases model size!
17
n-gram Language Models in practice
• You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters*) in a few seconds on your laptop
  *business and financial news
today the _______
Condition on the last two words (“today the”) and get the probability distribution over the next word:
  company 0.153
  bank    0.153
  price   0.077   ← sample
  italian 0.039
  emirate 0.039
  …
19
Generating text with an n-gram Language Model
You can also use a Language Model to generate text: condition on the two most recent words, get the probability distribution, and sample the next word:
  of  0.308   ← sample
  for 0.050
  it  0.046
  to  0.046
  is  0.031
  …
20
Generating text with an n-gram Language Model
You can also use a Language Model to generate text: condition on the two most recent words, get the distribution, and sample again:
  the  0.072
  18   0.043
  oil  0.043
  its  0.036
  gold 0.018   ← sample
  …
21
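A toy sketch of how such an n-gram generator works (my own code and example corpus, not the lecture’s): count trigrams, turn the counts into conditional probabilities, and repeatedly sample the next word given the last two.

```python
import random
from collections import Counter, defaultdict

def train_trigram_lm(tokens):
    """Map each bigram context (w1, w2) to a Counter over possible next words."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def sample_next(counts, w1, w2, rng=random.Random(0)):
    """Sample the next word from P(w | w1 w2) = count(w1 w2 w) / count(w1 w2)."""
    dist = counts[(w1, w2)]
    words, freqs = zip(*dist.items())
    return rng.choices(words, weights=freqs, k=1)[0]

corpus = "today the price of gold rose while the price of oil fell".split()
lm = train_trigram_lm(corpus)

w1, w2 = "today", "the"
generated = [w1, w2]
for _ in range(6):
    if (w1, w2) not in lm:       # stop if we reach an unseen context (no backoff in this toy)
        break
    w1, w2 = w2, sample_next(lm, w1, w2)
    generated.append(w2)
print(" ".join(generated))
```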
Generating text with an n-gram Language Model
You can also use a Language Model to generate text.
Surprisingly grammatical, but incoherent! To model language well we need to consider more than three words at a time, but increasing n worsens the sparsity problem and increases model size.
as the proctor started the clock, the students opened their ______
(discard the context outside a fixed window; a windowed neural LM conditions only on a fixed window of the preceding words)
24
A fixed-window neural Language Model
[Diagram: input words “the students opened their” → concatenated word embeddings e = [e^(1); e^(2); e^(3); e^(4)] → hidden layer h = f(W e + b1) → output distribution ŷ = softmax(U h + b2) over the vocabulary, e.g., books, laptops, …, a zoo]
25
A fixed-window neural Language Model
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model
Improvements over n-gram LM:
• No sparsity problem
• Don’t need to store all observed n-grams
Remaining problems:
• Fixed window is too small
• Enlarging window enlarges 𝑊
• Window can never be large enough!
• x^(1) and x^(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed.
We need a neural architecture that can process any length input
26
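A minimal NumPy sketch of a fixed-window neural LM as drawn above (the vocabulary, dimensions, and names are my own choices): look up embeddings for the window words, concatenate them, apply one hidden layer, then a softmax over the vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["the", "students", "opened", "their", "books", "laptops", "exams", "minds"]
V, d, window, hidden = len(vocab), 16, 4, 32

E  = rng.standard_normal((V, d)) * 0.1            # word embedding matrix
W  = rng.standard_normal((hidden, window * d)) * 0.1
b1 = np.zeros(hidden)
U  = rng.standard_normal((V, hidden)) * 0.1
b2 = np.zeros(V)

def fixed_window_lm(words):
    """P(next word | last `window` words) for a fixed-window neural LM."""
    e = np.concatenate([E[vocab.index(w)] for w in words])  # e = [e1; e2; e3; e4]
    h = np.tanh(W @ e + b1)                                  # hidden layer
    return softmax(U @ h + b2)                               # output distribution over vocab

probs = fixed_window_lm(["the", "students", "opened", "their"])
print(vocab[int(np.argmax(probs))])   # untrained, so the prediction is essentially arbitrary
```

Note how enlarging the window enlarges W: its width is window × d, which is exactly the problem the slide points out.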
3. Recurrent Neural Networks (RNN)
A family of neural architectures
Core idea: Apply the same weights W repeatedly
[Diagram: input sequence (any length) → hidden states, each computed from the previous hidden state and the current input with the same weights W → outputs (optional) at each step]
27
A Simple RNN Language Model
[Diagram: words “the students opened their” → word embeddings e^(t) = E x^(t) → hidden states h^(t) = σ(W_h h^(t−1) + W_e e^(t) + b1), with h^(0) the initial hidden state → output distribution ŷ^(t) = softmax(U h^(t) + b2), e.g., books, laptops, …, a zoo]
RNN Advantages:
• Can process any length input
• Computation for step t can (in theory) use information from many steps back
• Model size doesn’t increase for longer input context
• Same weights applied on every timestep, so there is symmetry in how inputs are processed
RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
More on these later
29
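A minimal NumPy sketch of the RNN-LM equations above (parameter names and sizes are my own): the same W_h, W_e, and U are reused at every timestep, and each step produces a distribution over the vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["the", "students", "opened", "their", "books", "laptops", "exams", "minds"]
V, d, h_dim = len(vocab), 16, 32

E   = rng.standard_normal((V, d)) * 0.1        # embedding matrix
W_h = rng.standard_normal((h_dim, h_dim)) * 0.1
W_e = rng.standard_normal((h_dim, d)) * 0.1
b1  = np.zeros(h_dim)
U   = rng.standard_normal((V, h_dim)) * 0.1
b2  = np.zeros(V)

def rnn_lm_forward(words):
    """h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b1);  y_hat^(t) = softmax(U h^(t) + b2)."""
    h = np.zeros(h_dim)                          # initial hidden state h^(0)
    y_hats = []
    for w in words:
        e = E[vocab.index(w)]                    # word embedding e^(t)
        h = np.tanh(W_h @ h + W_e @ e + b1)      # same weights applied at every step
        y_hats.append(softmax(U @ h + b2))       # output distribution y_hat^(t)
    return y_hats

dists = rnn_lm_forward(["the", "students", "opened", "their"])
print(vocab[int(np.argmax(dists[-1]))])          # untrained prediction for the next word
```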
Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), …, x^(T)
• Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
  • i.e., predict the probability distribution of every word, given the words so far
• The loss on step t is the cross-entropy between the predicted distribution ŷ^(t) and the true next word x^(t+1):  J^(t)(θ) = −log ŷ^(t)_{x^(t+1)}
• Average this over all steps to get the overall loss for the training set:  J(θ) = (1/T) Σ_{t=1}^{T} J^(t)(θ)
30
Training an RNN Language Model
Example corpus fragment: “the students opened their exams …”. At each step, the loss is the negative log probability the predicted distribution ŷ^(t) assigns to the true next word:
• J^(1)(θ) = −log ŷ^(1)_students   (negative log prob of “students”)
• J^(2)(θ) = −log ŷ^(2)_opened    (negative log prob of “opened”)
• J^(3)(θ) = −log ŷ^(3)_their     (negative log prob of “their”)
• J^(4)(θ) = −log ŷ^(4)_exams     (negative log prob of “exams”)
31–34
Training an RNN Language Model
“Teacher forcing”: at each step the model is fed the actual next word from the corpus as input, regardless of what it predicted at earlier steps.
Total loss:  J^(1)(θ) + J^(2)(θ) + J^(3)(θ) + J^(4)(θ) + ⋯, averaged over the sequence:  J(θ) = (1/T) Σ_{t=1}^{T} J^(t)(θ)
35
Training an RNN Language Model
• However: Computing the loss and gradients across the entire corpus at once is too expensive (memory-wise)!
• Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data and update. In practice, compute J(θ) for a batch of sentences, update the weights, and repeat.
36
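A minimal PyTorch sketch of this training setup (module and variable names are mine, not the lecture’s): run an RNN over a batch of token IDs, score every next-word prediction with cross-entropy against the shifted sequence (teacher forcing), and take one SGD step.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 64, 128   # made-up sizes

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        h, _ = self.rnn(self.emb(tokens))        # hidden states for every timestep
        return self.out(h)                       # logits: (batch, seq_len, vocab)

model = RNNLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab_size, (8, 21))    # a small batch of token ID sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # teacher forcing: predict the next gold token

logits = model(inputs)
loss = nn.functional.cross_entropy(               # average of -log p(x^(t+1)) over all steps
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```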
Backpropagation for RNNs
37
Multivariable Chain Rule: for a function f(x, y) where x = x(t) and y = y(t),
  d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)
Source:
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
38
Training the parameters of RNNs: Backpropagation for RNNs
Question: What’s the derivative of J^(t)(θ) with respect to the repeated weight matrix W_h?
Answer: The gradient with respect to a repeated weight is the sum of the gradient with respect to each time it appears (apply the multivariable chain rule, noting that the W_h used at every timestep equals the same shared matrix):
  ∂J^(t)/∂W_h = Σ_{i=1}^{t} ∂J^(t)/∂W_h |_(i)
In practice, backpropagation through time is often “truncated” after ~20 timesteps, for training efficiency reasons.
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:
Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
41
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Harry Potter:
Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
42
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on recipes:
Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
43
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on paint color names:
Evaluating Language Models: Perplexity
The standard evaluation metric for Language Models is perplexity:
  perplexity = ∏_{t=1}^{T} ( 1 / P_LM(x^(t+1) | x^(t), …, x^(1)) )^(1/T)    (normalized by number of words T)
This is equal to the exponential of the cross-entropy loss J(θ). Lower perplexity is better!
[Table: an n-gram model vs. increasingly complex RNNs; perplexity improves (lower is better)]
Source: https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/
46
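A quick numeric check of the relationship stated above (the per-word probabilities are my own example numbers): perplexity is just the exponential of the average per-word negative log probability.

```python
import math

# Assumed per-word probabilities assigned by some LM to a 4-word text.
p_next = [0.153, 0.308, 0.018, 0.072]

avg_nll = -sum(math.log(p) for p in p_next) / len(p_next)   # cross-entropy loss J(theta)
perplexity = math.exp(avg_nll)                              # = exp(J(theta))

# Equivalent direct definition: geometric mean of 1/p, normalized by the number of words.
perplexity_direct = math.prod(1.0 / p for p in p_next) ** (1.0 / len(p_next))

print(round(perplexity, 3), round(perplexity_direct, 3))    # the two agree
```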
4. Problems with RNNs: Vanishing and Exploding Gradients
47
Vanishing gradient intuition
Apply the chain rule along the sequence of hidden states. For example, the gradient of the loss at step 4 with respect to the hidden state at step 1 is a product of intermediate terms:
  ∂J^(4)/∂h^(1) = (∂h^(2)/∂h^(1)) × (∂h^(3)/∂h^(2)) × (∂h^(4)/∂h^(3)) × (∂J^(4)/∂h^(4))    (chain rule!)
When these intermediate factors are small, the gradient signal (the value of ∂J^(4)/∂h^(1)) gets smaller and smaller as it backpropagates further: the vanishing gradient problem.
Source: “On the difficulty of training recurrent neural networks”, Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
(and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
Vanishing gradient proof sketch (linear case)
• Suppose the RNN’s nonlinearity were the identity (the linear case). Then ∂h^(t)/∂h^(t−1) = W_h, so the gradient of the loss with respect to a hidden state ℓ steps back involves the matrix power W_h^ℓ.
• What’s wrong with W_h^ℓ? Consider the case where the eigenvalues of W_h all have magnitude less than 1 (a sufficient but not necessary condition for vanishing).
• We can write the gradient using the eigenvectors of W_h as a basis: each component shrinks by a factor of its eigenvalue at every step, so the whole gradient decays exponentially with ℓ.
Gradient signal from far away is lost because it’s much smaller than the gradient signal from close by.
So, model weights are updated only with respect to near effects, not long-term effects.
55
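A tiny NumPy illustration of this proof sketch (my own construction, not from the slides): when W_h has all eigenvalues with magnitude below 1, repeated multiplication by W_h — which is what the backpropagated gradient sees in the linear case — shrinks exponentially with the number of timesteps.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim = 20

# Build a random W_h, then rescale it so its largest eigenvalue magnitude is 0.9 (< 1).
W_h = rng.standard_normal((h_dim, h_dim))
W_h *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_h)))

g = rng.standard_normal(h_dim)     # stand-in for the gradient dJ/dh at the loss step
for steps_back in [1, 5, 10, 20, 50]:
    backprop = g @ np.linalg.matrix_power(W_h, steps_back)  # gradient after `steps_back` linear steps
    print(steps_back, np.linalg.norm(backprop))             # norm decays roughly like 0.9**steps_back
```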
Effect of vanishing gradient on RNN-LM
• LM task: When she tried to print her tickets, she found that the printer was out of toner.
She went to the stationery store to buy more toner. It was very overpriced. After
installing the toner into the printer, she finally printed her ________
• To learn from this training example, the RNN-LM needs to model the dependency
between “tickets” on the 7th step and the target word “tickets” at the end.
• But if the gradient is small, the model can’t learn this dependency
• So, the model is unable to predict similar long-distance dependencies at test time
56
Why is exploding gradient a problem?
• If the gradient becomes too big, then the SGD update step becomes too big:
  θ^new = θ^old − α ∇_θ J(θ)    (α: learning rate; ∇_θ J(θ): gradient)
• This can cause bad updates: we take too large a step and reach a weird and bad
parameter configuration (with large loss)
• You think you’ve found a hill to climb, but suddenly you’re in Iowa
• In the worst case, this will result in Inf or NaN in your network
(then you have to restart training from an earlier checkpoint)
57
Gradient clipping: solution for exploding gradient
• Gradient clipping: if the norm of the gradient g is greater than some threshold, scale it down before applying the SGD update:
  if ‖g‖ > threshold:  g ← (threshold / ‖g‖) · g
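In code, this clipping rule is one line in PyTorch; a minimal self-contained sketch (the model and loss are placeholders of my own):

```python
import torch

model = torch.nn.Linear(10, 10)                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).pow(2).mean()     # placeholder loss
loss.backward()

# If the total gradient norm exceeds the threshold, rescale the gradients in place
# to g <- (threshold / ||g||) * g before the SGD update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
optimizer.zero_grad()
```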
• First off next time: How about an RNN with separate memory which is added to?
• LSTMs
• And then: Creating more direct and linear pass-through connections in model
• Attention, residual connections, etc.
59
5. Recap
• Language Model: A system that predicts the next word
• We’ve shown that RNNs are a great way to build a LM (despite some problems)
• Everything else in NLP has now been rebuilt upon Language Modeling: GPT-3 is an LM!
61
Other RNN uses: RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition
Example output: one tag per input word, e.g., DT JJ NN VBN IN DT NN
62
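A minimal PyTorch sketch of this setup (sizes and names are mine): run the RNN over the sentence and put a linear layer over the tag set on top of every hidden state, producing one tag per word, like the seven tags shown above.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim, n_tags = 1000, 64, 128, 17   # made-up sizes; n_tags = size of the tag set

emb = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
tag_scores = nn.Linear(hid_dim, n_tags)            # one tag distribution per timestep

tokens = torch.randint(0, vocab_size, (1, 7))      # a 7-word sentence, as token IDs
h, _ = rnn(emb(tokens))                            # hidden state for every word
tags = tag_scores(h).argmax(-1)                    # predicted tag index at each position
print(tags.shape)                                  # (1, 7): one tag per input word
```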
RNNs can be used for sentence classification
e.g., sentiment classification
How to compute the sentence encoding?
[Diagram: the RNN reads the sentence, its hidden states are summarized into a sentence encoding, and a classifier predicts e.g. “positive”]
63
• Basic way: use the final hidden state as the sentence encoding
• Usually better: take an element-wise max or mean of all the hidden states
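A small NumPy sketch (mine) of these two choices for the sentence encoding: take the final hidden state, or pool (e.g., element-wise mean) over all hidden states, then classify.

```python
import numpy as np

rng = np.random.default_rng(0)
T, h_dim, n_classes = 6, 32, 2                      # 6 words, 2 sentiment classes (made-up sizes)

H = rng.standard_normal((T, h_dim))                 # stand-in for the RNN hidden states h^(1..T)
W = rng.standard_normal((n_classes, h_dim)) * 0.1   # sentence classifier weights

enc_final = H[-1]                                   # basic way: use the final hidden state
enc_mean  = H.mean(axis=0)                          # usually better: element-wise mean (or max) of all states

print((W @ enc_final).argmax(), (W @ enc_mean).argmax())  # predicted class under each encoding
```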
RNNs can be used as an encoder module
e.g., question answering, machine translation, many other tasks!
Answer: German
Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system: lots of further neural architecture, conditioned on the question encoding, produces the answer.
68