CS224n 2023 Lecture 05: RNN LM
Christopher Manning
Lecture 5: Language Models and Recurrent Neural Networks
Lecture Plan
1. A bit more about neural networks (10 mins)
Language modeling + RNNs
• 2. A new NLP task: Language Modeling (20 mins) — this is the most important concept in the class! It leads to GPT-3 and ChatGPT, and it motivates …
• 3. A new family of neural networks: Recurrent Neural Networks (RNNs) (25 mins)
Important and used in Assignment 4, but not the only way to build LMs
• 4. Problems with RNNs (15 mins)
• 5. Recap on RNNs/LMs (10 mins)
Reminders:
You should have handed in Assignment 2 by the start of class today
In Assignment 3, out today, you build a neural dependency parser using PyTorch
2
We have models with many parameters! Regularization!
• A full loss function includes regularization over all parameters θ, e.g., L2 regularization:  J_reg(θ) = J(θ) + λ Σ_k θ_k²
• Classic view: Regularization works to prevent overfitting when we have a lot of features
(or later a very powerful/deep model, etc.)
• Now: Regularization produces models that generalize well when we have a “big” model
• We do not care that our models overfit on the training data, even though they are hugely overfit
[Figure: classic picture of training error and test error as a function of model “power”: training error keeps decreasing toward 0, while test error eventually rises again — overfitting.]
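As a concrete illustration of the L2 term above (a minimal PyTorch sketch of my own, not from the slides — the model and λ value are placeholders): in practice the penalty is usually applied via the optimizer's weight_decay argument, which is equivalent to adding (λ/2)·‖θ‖² to the loss.

```python
import torch

# Hypothetical tiny model; the point is only how L2 regularization is wired in.
model = torch.nn.Linear(100, 5)

# weight_decay adds lambda * theta to every gradient, i.e. it implements
# L2 regularization without writing the penalty into the loss by hand.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# Equivalently, the penalty can be written into the loss explicitly:
lam = 1e-4
def regularized_loss(data_loss, parameters):
    l2 = sum((p ** 2).sum() for p in parameters)  # sum of theta_k^2 over all parameters
    return data_loss + lam * l2

data_loss = model(torch.randn(4, 100)).pow(2).mean()   # placeholder data loss
loss = regularized_loss(data_loss, model.parameters())
```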
Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov 2012/JMLR 2014)
Preventing Feature Co-adaptation = Good Regularization Method! Use it everywhere!
• Training time: at each instance of evaluation (in online SGD-training), randomly set
~50% (p%) of the inputs to each neuron to 0
• Test time: don’t drop anything, but halve the model weights (since roughly twice as many inputs are now active)
• Exception: usually the first-layer inputs are only dropped a little (~15%) or not at all
• This prevents feature co-adaptation: A feature cannot only be useful in the presence
of particular other features
• In a single layer: A kind of middle-ground between Naïve Bayes (where all feature
weights are set independently) and logistic regression models (where weights are
set in the context of all others)
• Can be thought of as a form of model bagging (i.e., like an ensemble model)
• Nowadays usually thought of as strong, feature-dependent regularizer
[Wager, Wang, & Liang 2013]
4
Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov 2012/JMLR 2014)
• During training, for each data point, each time:
  • Randomly set each input to 0 with probability p, the “dropout ratio”, via a dropout mask (often p = 0.5, except p = 0.15 for the input layer)
  • Train pass 1 (mask drops x2 and x4):  y = w1·x1 + w3·x3 + b
  • Train pass 2 (a different random mask, dropping x4):  y = w1·x1 + w2·x2 + w3·x3 + b
• During testing:
  • Multiply all weights by 1 − p; no other dropout
  • Test:  y = (1 − p)(w1·x1 + w2·x2 + w3·x3 + w4·x4) + b
5
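A minimal NumPy sketch of the idea on this slide (my own code, not the lecture’s). It uses “inverted” dropout, which rescales the surviving units by 1/(1−p) at training time instead of multiplying the weights by 1−p at test time; the two are equivalent in expectation, and the inverted form is what libraries such as torch.nn.Dropout implement.

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True, rng=np.random.default_rng(0)):
    """Apply dropout to activations x with drop probability p.

    Training: zero each unit with probability p and rescale survivors by
    1/(1-p) ("inverted dropout"), so no weight rescaling is needed at test time.
    Test: return x unchanged (equivalent to the slide's "multiply weights by 1-p").
    """
    if not train or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)  # dropout mask with rescaling baked in
    return x * mask

# Example: drop ~50% of a hidden layer's activations during training.
h = np.ones(8)
print(dropout_forward(h, p=0.5, train=True))   # some zeros, survivors scaled to 2.0
print(dropout_forward(h, p=0.5, train=False))  # unchanged at test time
```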
“Vectorization”
• E.g., looping over word vectors versus concatenating them all into one large matrix
and then multiplying the softmax weights with that matrix:
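The slide compares two code snippets; here is a NumPy version of that comparison (my own sketch, with made-up sizes). Both compute the same softmax logits, but the batched matrix multiply avoids a Python-level loop and is much faster, especially on a GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_words, n_classes = 300, 10_000, 5
X = rng.standard_normal((n_words, d))    # each row is one word vector
W = rng.standard_normal((d, n_classes))  # softmax weights

# Non-vectorized: loop over word vectors one at a time.
logits_loop = np.stack([x @ W for x in X])

# Vectorized: concatenate all vectors into one matrix and do a single multiply.
logits_batch = X @ W

assert np.allclose(logits_loop, logits_batch)  # same result, far fewer Python-level operations
```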
Language Modeling is the task of predicting what word comes next:
  the students opened their ______   (exams? minds? …)
• More formally: given a sequence of words x^(1), x^(2), …, x^(t), compute the probability distribution of the next word x^(t+1):
  P(x^(t+1) | x^(t), …, x^(1)),
  where x^(t+1) can be any word in the vocabulary V
9
Language Modeling
• You can also think of a Language Model as a system that assigns a probability to a piece of text
• For example, if we have some text x^(1), …, x^(T), then the probability of this text (according to the Language Model) is:
  P(x^(1), …, x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × ⋯ × P(x^(T) | x^(T−1), …, x^(1)) = ∏_{t=1}^{T} P(x^(t) | x^(t−1), …, x^(1))
10
You use Language Models every day!
11
You use Language Models every day!
12
n-gram Language Models
the students opened their ______
• An n-gram is a chunk of n consecutive words: unigrams (“the”), bigrams (“the students”), trigrams (“the students opened”), four-grams (“the students opened their”), …
• Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.
13
n-gram Language Models
• First we make a Markov assumption: x^(t+1) depends only on the preceding n−1 words:

  P(x^(t+1) | x^(t), …, x^(1)) = P(x^(t+1) | x^(t), …, x^(t−n+2))    (assumption)

• How do we get these probabilities? By the definition of conditional probability, this is the probability of an n-gram divided by the probability of the corresponding (n−1)-gram, and we estimate both by counting in a large corpus:

  = P(x^(t+1), x^(t), …, x^(t−n+2)) / P(x^(t), …, x^(t−n+2))           (definition of conditional prob)
  ≈ count(x^(t+1), x^(t), …, x^(t−n+2)) / count(x^(t), …, x^(t−n+2))   (statistical approximation)
14
n-gram Language Models: Example
Suppose we are learning a 4-gram Language Model.
as the proctor started the clock, the students opened their _____
(discard everything except the last n−1 = 3 words; condition only on “students opened their”)
  P(w | students opened their) = count(students opened their w) / count(students opened their)
Sparsity Problem 1
Problem: What if “students opened their w” never occurred in the data? Then w has probability 0!
(Partial) Solution: Add a small δ to the count for every w in the vocabulary. This is called smoothing.

Sparsity Problem 2
Problem: What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w!
(Partial) Solution: Just condition on “opened their” instead. This is called backoff.

Storage: Increasing n or increasing the corpus increases model size!
17
n-gram Language Models in practice
• You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters*) in a few seconds on your laptop
  *business and financial news
today the _______
Condition on the last two words (“today the”) and get the probability distribution over the next word:
  company 0.153
  bank    0.153
  price   0.077   ← sample
  italian 0.039
  emirate 0.039
  …
19
Generating text with an n-gram Language Model
You can also use a Language Model to generate text: condition on the two most recent words, get the probability distribution, and sample the next word:
  of  0.308   ← sample
  for 0.050
  it  0.046
  to  0.046
  is  0.031
  …
20
Generating text with an n-gram Language Model
You can also use a Language Model to generate text: condition on the two most recent words, get the distribution, and sample again:
  the  0.072
  18   0.043
  oil  0.043
  its  0.036
  gold 0.018   ← sample
  …
21
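A toy sketch of how such an n-gram generator works (my own code and example corpus, not the lecture’s): count trigrams, turn the counts into conditional probabilities, and repeatedly sample the next word given the last two.

```python
import random
from collections import Counter, defaultdict

def train_trigram_lm(tokens):
    """Map each bigram context (w1, w2) to a Counter over possible next words."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def sample_next(counts, w1, w2, rng=random.Random(0)):
    """Sample the next word from P(w | w1 w2) = count(w1 w2 w) / count(w1 w2)."""
    dist = counts[(w1, w2)]
    words, freqs = zip(*dist.items())
    return rng.choices(words, weights=freqs, k=1)[0]

corpus = "today the price of gold rose while the price of oil fell".split()
lm = train_trigram_lm(corpus)

w1, w2 = "today", "the"
generated = [w1, w2]
for _ in range(6):
    if (w1, w2) not in lm:       # stop if we reach an unseen context (no backoff in this toy)
        break
    w1, w2 = w2, sample_next(lm, w1, w2)
    generated.append(w2)
print(" ".join(generated))
```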
Generating text with an n-gram Language Model
You can also use a Language Model to generate text.
Surprisingly grammatical, but incoherent! To model language well we need to consider more than three words at a time, but increasing n worsens the sparsity problem and increases model size.
as the proctor started the clock, the students opened their ______
(discard the context outside a fixed window; a windowed neural LM conditions only on a fixed window of the preceding words)
24
A fixed-window neural Language Model
[Diagram: input words “the students opened their” → concatenated word embeddings e = [e^(1); e^(2); e^(3); e^(4)] → hidden layer h = f(W e + b1) → output distribution ŷ = softmax(U h + b2) over the vocabulary, e.g., books, laptops, …, a zoo]
25
A fixed-window neural Language Model
Approximately: Y. Bengio, et al. (2000/2003): A Neural Probabilistic Language Model
Improvements over n-gram LM:
• No sparsity problem
• Don’t need to store all observed n-grams
Remaining problems:
• Fixed window is too small
• Enlarging window enlarges 𝑊
• Window can never be large enough!
• x^(1) and x^(2) are multiplied by completely different weights in W. No symmetry in how the inputs are processed.
We need a neural architecture that can process any length input
26
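A minimal NumPy sketch of a fixed-window neural LM as drawn above (the vocabulary, dimensions, and names are my own choices): look up embeddings for the window words, concatenate them, apply one hidden layer, then a softmax over the vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["the", "students", "opened", "their", "books", "laptops", "exams", "minds"]
V, d, window, hidden = len(vocab), 16, 4, 32

E  = rng.standard_normal((V, d)) * 0.1            # word embedding matrix
W  = rng.standard_normal((hidden, window * d)) * 0.1
b1 = np.zeros(hidden)
U  = rng.standard_normal((V, hidden)) * 0.1
b2 = np.zeros(V)

def fixed_window_lm(words):
    """P(next word | last `window` words) for a fixed-window neural LM."""
    e = np.concatenate([E[vocab.index(w)] for w in words])  # e = [e1; e2; e3; e4]
    h = np.tanh(W @ e + b1)                                  # hidden layer
    return softmax(U @ h + b2)                               # output distribution over vocab

probs = fixed_window_lm(["the", "students", "opened", "their"])
print(vocab[int(np.argmax(probs))])   # untrained, so the prediction is essentially arbitrary
```

Note how enlarging the window enlarges W: its width is window × d, which is exactly the problem the slide points out.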
3. Recurrent Neural Networks (RNN)
A family of neural architectures
Core idea: Apply the same weights W repeatedly
[Diagram: input sequence (any length) → hidden states, each computed from the previous hidden state and the current input with the same weights W → outputs (optional) at each step]
27
A Simple RNN Language Model
[Diagram: words “the students opened their” → word embeddings e^(t) = E x^(t) → hidden states h^(t) = σ(W_h h^(t−1) + W_e e^(t) + b1), with h^(0) the initial hidden state → output distribution ŷ^(t) = softmax(U h^(t) + b2), e.g., books, laptops, …, a zoo]
RNN Advantages:
• Can process any length input
• Computation for step t can (in theory) use information from many steps back
• Model size doesn’t increase for longer input context
• Same weights applied on every timestep, so there is symmetry in how inputs are processed
RNN Disadvantages:
• Recurrent computation is slow
• In practice, difficult to access information from many steps back
More on these later
29
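A minimal NumPy sketch of the RNN-LM equations above (parameter names and sizes are my own): the same W_h, W_e, and U are reused at every timestep, and each step produces a distribution over the vocabulary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["the", "students", "opened", "their", "books", "laptops", "exams", "minds"]
V, d, h_dim = len(vocab), 16, 32

E   = rng.standard_normal((V, d)) * 0.1        # embedding matrix
W_h = rng.standard_normal((h_dim, h_dim)) * 0.1
W_e = rng.standard_normal((h_dim, d)) * 0.1
b1  = np.zeros(h_dim)
U   = rng.standard_normal((V, h_dim)) * 0.1
b2  = np.zeros(V)

def rnn_lm_forward(words):
    """h^(t) = tanh(W_h h^(t-1) + W_e e^(t) + b1);  y_hat^(t) = softmax(U h^(t) + b2)."""
    h = np.zeros(h_dim)                          # initial hidden state h^(0)
    y_hats = []
    for w in words:
        e = E[vocab.index(w)]                    # word embedding e^(t)
        h = np.tanh(W_h @ h + W_e @ e + b1)      # same weights applied at every step
        y_hats.append(softmax(U @ h + b2))       # output distribution y_hat^(t)
    return y_hats

dists = rnn_lm_forward(["the", "students", "opened", "their"])
print(vocab[int(np.argmax(dists[-1]))])          # untrained prediction for the next word
```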
Training an RNN Language Model
• Get a big corpus of text, which is a sequence of words x^(1), …, x^(T)
• Feed it into the RNN-LM; compute the output distribution ŷ^(t) for every step t
  • i.e., predict the probability distribution of every word, given the words so far
• The loss on step t is the cross-entropy between the predicted distribution ŷ^(t) and the true next word x^(t+1):  J^(t)(θ) = −log ŷ^(t)_{x^(t+1)}
• Average this over all steps to get the overall loss for the training set:  J(θ) = (1/T) Σ_{t=1}^{T} J^(t)(θ)
30
Training an RNN Language Model
Example corpus fragment: “the students opened their exams …”. At each step, the loss is the negative log probability the predicted distribution ŷ^(t) assigns to the true next word:
• J^(1)(θ) = −log ŷ^(1)_students   (negative log prob of “students”)
• J^(2)(θ) = −log ŷ^(2)_opened    (negative log prob of “opened”)
• J^(3)(θ) = −log ŷ^(3)_their     (negative log prob of “their”)
• J^(4)(θ) = −log ŷ^(4)_exams     (negative log prob of “exams”)
31–34
Training an RNN Language Model
“Teacher forcing”: at each step the model is fed the actual next word from the corpus as input, regardless of what it predicted at earlier steps.
Total loss:  J^(1)(θ) + J^(2)(θ) + J^(3)(θ) + J^(4)(θ) + ⋯, averaged over the sequence:  J(θ) = (1/T) Σ_{t=1}^{T} J^(t)(θ)
35
Training an RNN Language Model
• However: Computing the loss and gradients across the entire corpus at once is too expensive (memory-wise)!
• Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data and update. In practice, compute J(θ) for a batch of sentences, update the weights, and repeat.
36
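A minimal PyTorch sketch of this training setup (module and variable names are mine, not the lecture’s): run an RNN over a batch of token IDs, score every next-word prediction with cross-entropy against the shifted sequence (teacher forcing), and take one SGD step.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 64, 128   # made-up sizes

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        h, _ = self.rnn(self.emb(tokens))        # hidden states for every timestep
        return self.out(h)                       # logits: (batch, seq_len, vocab)

model = RNNLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.randint(0, vocab_size, (8, 21))    # a small batch of token ID sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # teacher forcing: predict the next gold token

logits = model(inputs)
loss = nn.functional.cross_entropy(               # average of -log p(x^(t+1)) over all steps
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```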
Backpropagation for RNNs
37
Multivariable Chain Rule: for a function f(x, y) where x = x(t) and y = y(t),
  d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)
Source:
https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version
38
Training the parameters of RNNs: Backpropagation for RNNs
Question: What’s the derivative of J^(t)(θ) with respect to the repeated weight matrix W_h?
Answer: The gradient with respect to a repeated weight is the sum of the gradient with respect to each time it appears (apply the multivariable chain rule, noting that the W_h used at every timestep equals the same shared matrix):
  ∂J^(t)/∂W_h = Σ_{i=1}^{t} ∂J^(t)/∂W_h |_(i)
In practice, backpropagation through time is often “truncated” after ~20 timesteps, for training efficiency reasons.
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:
Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
41
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Harry Potter:
Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
42
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on recipes:
Source: https://gist.github.com/nylki/1efbaa36635956d35bcc
43
Generating text with an RNN Language Model
Let’s have some fun!
• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on paint color names:
Evaluating Language Models: Perplexity
The standard evaluation metric for Language Models is perplexity:
  perplexity = ∏_{t=1}^{T} ( 1 / P_LM(x^(t+1) | x^(t), …, x^(1)) )^(1/T)    (normalized by number of words T)
This is equal to the exponential of the cross-entropy loss J(θ). Lower perplexity is better!
[Table: an n-gram model vs. increasingly complex RNNs; perplexity improves (lower is better)]
Source: https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/
46
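A quick numeric check of the relationship stated above (the per-word probabilities are my own example numbers): perplexity is just the exponential of the average per-word negative log probability.

```python
import math

# Assumed per-word probabilities assigned by some LM to a 4-word text.
p_next = [0.153, 0.308, 0.018, 0.072]

avg_nll = -sum(math.log(p) for p in p_next) / len(p_next)   # cross-entropy loss J(theta)
perplexity = math.exp(avg_nll)                              # = exp(J(theta))

# Equivalent direct definition: geometric mean of 1/p, normalized by the number of words.
perplexity_direct = math.prod(1.0 / p for p in p_next) ** (1.0 / len(p_next))

print(round(perplexity, 3), round(perplexity_direct, 3))    # the two agree
```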
4. Problems with RNNs: Vanishing and Exploding Gradients
47
Vanishing gradient intuition
Apply the chain rule along the sequence of hidden states. For example, the gradient of the loss at step 4 with respect to the hidden state at step 1 is a product of intermediate terms:
  ∂J^(4)/∂h^(1) = (∂h^(2)/∂h^(1)) × (∂h^(3)/∂h^(2)) × (∂h^(4)/∂h^(3)) × (∂J^(4)/∂h^(4))    (chain rule!)
When these intermediate factors are small, the gradient signal (the value of ∂J^(4)/∂h^(1)) gets smaller and smaller as it backpropagates further: the vanishing gradient problem.
Source: “On the difficulty of training recurrent neural networks”, Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
(and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
Vanishing gradient proof sketch (linear case)
• Suppose the RNN’s nonlinearity were the identity (the linear case). Then ∂h^(t)/∂h^(t−1) = W_h, so the gradient of the loss with respect to a hidden state ℓ steps back involves the matrix power W_h^ℓ.
• What’s wrong with W_h^ℓ? Consider the case where the eigenvalues of W_h all have magnitude less than 1 (a sufficient but not necessary condition for vanishing).
• We can write the gradient using the eigenvectors of W_h as a basis: each component shrinks by a factor of its eigenvalue at every step, so the whole gradient decays exponentially with ℓ.
Gradient signal from far away is lost because it’s much smaller than the gradient signal from close by.
So, model weights are updated only with respect to near effects, not long-term effects.
55
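A tiny NumPy illustration of this proof sketch (my own construction, not from the slides): when W_h has all eigenvalues with magnitude below 1, repeated multiplication by W_h — which is what the backpropagated gradient sees in the linear case — shrinks exponentially with the number of timesteps.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim = 20

# Build a random W_h, then rescale it so its largest eigenvalue magnitude is 0.9 (< 1).
W_h = rng.standard_normal((h_dim, h_dim))
W_h *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_h)))

g = rng.standard_normal(h_dim)     # stand-in for the gradient dJ/dh at the loss step
for steps_back in [1, 5, 10, 20, 50]:
    backprop = g @ np.linalg.matrix_power(W_h, steps_back)  # gradient after `steps_back` linear steps
    print(steps_back, np.linalg.norm(backprop))             # norm decays roughly like 0.9**steps_back
```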
Effect of vanishing gradient on RNN-LM
• LM task: When she tried to print her tickets, she found that the printer was out of toner.
She went to the stationery store to buy more toner. It was very overpriced. After
installing the toner into the printer, she finally printed her ________
• To learn from this training example, the RNN-LM needs to model the dependency
between “tickets” on the 7th step and the target word “tickets” at the end.
• But if the gradient is small, the model can’t learn this dependency
• So, the model is unable to predict similar long-distance dependencies at test time
56
Why is exploding gradient a problem?
• If the gradient becomes too big, then the SGD update step becomes too big:
  θ^new = θ^old − α ∇_θ J(θ)    (α: learning rate; ∇_θ J(θ): gradient)
• This can cause bad updates: we take too large a step and reach a weird and bad
parameter configuration (with large loss)
• You think you’ve found a hill to climb, but suddenly you’re in Iowa
• In the worst case, this will result in Inf or NaN in your network
(then you have to restart training from an earlier checkpoint)
57
Gradient clipping: solution for exploding gradient
• Gradient clipping: if the norm of the gradient g is greater than some threshold, scale it down before applying the SGD update:
  if ‖g‖ > threshold:  g ← (threshold / ‖g‖) · g
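In code, this clipping rule is one line in PyTorch; a minimal self-contained sketch (the model and loss are placeholders of my own):

```python
import torch

model = torch.nn.Linear(10, 10)                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).pow(2).mean()     # placeholder loss
loss.backward()

# If the total gradient norm exceeds the threshold, rescale the gradients in place
# to g <- (threshold / ||g||) * g before the SGD update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
optimizer.zero_grad()
```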
• First off next time: How about an RNN with separate memory which is added to?
• LSTMs
• And then: Creating more direct and linear pass-through connections in model
• Attention, residual connections, etc.
59
5. Recap
• Language Model: A system that predicts the next word
• We’ve shown that RNNs are a great way to build a LM (despite some problems)
• Everything else in NLP has now been rebuilt upon Language Modeling: GPT-3 is an LM!
61
Other RNN uses: RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition
Example output: one tag per input word, e.g., DT JJ NN VBN IN DT NN
62
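A minimal PyTorch sketch of this setup (sizes and names are mine): run the RNN over the sentence and put a linear layer over the tag set on top of every hidden state, producing one tag per word, like the seven tags shown above.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim, n_tags = 1000, 64, 128, 17   # made-up sizes; n_tags = size of the tag set

emb = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)
tag_scores = nn.Linear(hid_dim, n_tags)            # one tag distribution per timestep

tokens = torch.randint(0, vocab_size, (1, 7))      # a 7-word sentence, as token IDs
h, _ = rnn(emb(tokens))                            # hidden state for every word
tags = tag_scores(h).argmax(-1)                    # predicted tag index at each position
print(tags.shape)                                  # (1, 7): one tag per input word
```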
RNNs can be used for sentence classification
e.g., sentiment classification
How to compute the sentence encoding?
[Diagram: the RNN reads the sentence, its hidden states are summarized into a sentence encoding, and a classifier predicts e.g. “positive”]
63
• Basic way: use the final hidden state as the sentence encoding
• Usually better: take an element-wise max or mean of all the hidden states
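A small NumPy sketch (mine) of these two choices for the sentence encoding: take the final hidden state, or pool (e.g., element-wise mean) over all hidden states, then classify.

```python
import numpy as np

rng = np.random.default_rng(0)
T, h_dim, n_classes = 6, 32, 2                      # 6 words, 2 sentiment classes (made-up sizes)

H = rng.standard_normal((T, h_dim))                 # stand-in for the RNN hidden states h^(1..T)
W = rng.standard_normal((n_classes, h_dim)) * 0.1   # sentence classifier weights

enc_final = H[-1]                                   # basic way: use the final hidden state
enc_mean  = H.mean(axis=0)                          # usually better: element-wise mean (or max) of all states

print((W @ enc_final).argmax(), (W @ enc_mean).argmax())  # predicted class under each encoding
```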
RNNs can be used as an encoder module
e.g., question answering, machine translation, many other tasks!
Answer: German
Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system: lots of further neural architecture, conditioned on the question encoding, produces the answer.
68