Sequence Modeling RNN-LSTM-APPL-Anand Kumar JUNE2021
h', y = f(h, x)
h' = σ( Wh h + Wi x )
y = softmax( Wo h' )    (note: y is computed from h', not from h)
[Figure: the same f unrolled over time: h1 = f(h0, x1), h2 = f(h1, x2), h3 = f(h2, x3), ..., with outputs y1, y2, y3, ...]
No matter how long the input/output sequence is, we only need one function f. If the f's were different at each step, the model would just be a feedforward NN. Sharing f may be seen as another form of compression compared with a fully connected network.
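A minimal NumPy sketch of this idea (the weight names Wh, Wi, Wo and the sizes are illustrative assumptions, not taken from the slides): the same f, i.e. the same weights, is applied at every time step.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_step(h, x, W_h, W_i, W_o):
    # one application of the shared f: h' = sigmoid(Wh h + Wi x), y = softmax(Wo h')
    h_new = 1.0 / (1.0 + np.exp(-(W_h @ h + W_i @ x)))
    y = softmax(W_o @ h_new)
    return h_new, y

rng = np.random.default_rng(0)
hidden, n_in, n_out = 4, 3, 2                   # illustrative sizes
W_h = rng.normal(size=(hidden, hidden))
W_i = rng.normal(size=(hidden, n_in))
W_o = rng.normal(size=(n_out, hidden))

h = np.zeros(hidden)                            # h0
for x_t in rng.normal(size=(5, n_in)):          # a sequence of 5 inputs
    h, y = rnn_step(h, x_t, W_h, W_i, W_o)      # the same weights are reused at every step
```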
Recurrent Neural Networks
• Some examples of important design patterns for recurrent neural networks include the following.
Bi-RNN
• In many applications, however, we want to output a prediction y(t) that may depend on the whole input sequence.
• Bidirectional RNNs have been extremely successful in applications where this need arises, such as handwriting recognition, speech recognition, and bioinformatics.
Bi-RNN
• As the name suggests, bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.
Bidirectional RNN
Forward RNN f1: y_t, h_t = f1(x_t, h_{t-1}), reading x1, x2, x3, ... from the start of the sequence.
Backward RNN f2: z_t, g_t = f2(x_t, g_{t+1}), reading the same inputs from the end of the sequence.
Combined output: p_t = f3(y_t, z_t), so every p_t depends on the whole input sequence.
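A rough sketch of this bidirectional structure, reusing rnn_step from the earlier vanilla-RNN sketch (f1 and f2 get separate weight tuples; the combining function f3 is assumed here to be simple concatenation):

```python
import numpy as np  # rnn_step is the function defined in the earlier sketch

def bidirectional(xs, h0, g0, f1_params, f2_params):
    hs, h = [], h0
    for x in xs:                          # forward RNN f1, from the start of the sequence
        h, _ = rnn_step(h, x, *f1_params)
        hs.append(h)
    gs, g = [], g0
    for x in reversed(xs):                # backward RNN f2, from the end of the sequence
        g, _ = rnn_step(g, x, *f2_params)
        gs.append(g)
    gs = gs[::-1]                         # align g_t with x_t
    # f3: each p_t sees the whole input sequence through (h_t, g_t)
    return [np.concatenate([h_t, g_t]) for h_t, g_t in zip(hs, gs)]
```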
Encoder-Decoder Sequence-to-Sequence Architectures
[Figure: an encoder RNN f1 reads x1, x2, x3, ... into hidden states h1, h2, h3, ...; a decoder RNN f2 then produces the output sequence z1, z2, z3, ... from its own states g1, g2, g3, ...]
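A minimal sketch of the encoder-decoder idea, again reusing rnn_step from the earlier sketch (the shapes are an assumption: the decoder feeds its own prediction back in, so its input size must equal its output size):

```python
def seq2seq(xs, h0, enc_params, dec_params, start_token, n_steps):
    h = h0
    for x in xs:                              # encoder f1 compresses x1..xT into one state
        h, _ = rnn_step(h, x, *enc_params)
    g, y, outputs = h, start_token, []
    for _ in range(n_steps):                  # decoder f2 generates from the final encoder state
        g, y = rnn_step(g, y, *dec_params)    # the previous output becomes the next input
        outputs.append(y)
    return outputs
```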
Recursive NN
Problems with naive RNN
• When dealing with a time series, it tends to forget old information. When there is a distant relationship of unknown length, we would like the network to keep a “memory” of it.
• Vanishing gradient problem (a toy illustration follows below).
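A toy illustration (not from the slides) of the vanishing gradient: backpropagation through T steps multiplies the gradient by the recurrent Jacobian T times, so a per-step factor below 1 wipes out the signal from distant inputs.

```python
grad = 1.0
for t in range(50):
    grad *= 0.5          # stands in for a per-step factor |dh_t/dh_{t-1}| < 1
print(grad)              # ~8.9e-16: the contribution of the oldest input has vanished
```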
LSTM
The sigmoid layers output numbers between 0 and 1 that determine how much of each component should be let through; the pink × operations are point-wise multiplications.
• Forget gate: this sigmoid gate determines how much of the old information goes through.
• Input gate: it decides which components are to be updated; C’t provides the change contents to add to the cell state.
• Output gate: controls what goes into the output.
The core idea is the cell state Ct: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.
Why sigmoid or tanh? Sigmoid gives 0–1 gating that acts as a switch. The vanishing gradient problem is already handled in the LSTM, so is it OK for a ReLU to replace the tanh?
Naïve RNN: h_t, y_t = f(h_{t-1}, x_t)
LSTM: c_t, h_t, y_t = LSTM(c_{t-1}, h_{t-1}, x_t)
The four signals are computed from x_t and h_{t-1}:
z = tanh( W [x_t, h_{t-1}] )    (updating information)
zi = σ( Wi [x_t, h_{t-1}] )    (controls the input gate)
zf = σ( Wf [x_t, h_{t-1}] )    (controls the forget gate)
zo = σ( Wo [x_t, h_{t-1}] )    (controls the output gate)
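A minimal sketch of one LSTM step implementing the equations above (the weight names and the concatenation [x_t, h_{t-1}] follow the slide's notation; sizes and the output projection for y_t are left out):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(c_prev, h_prev, x, W, W_i, W_f, W_o):
    v = np.concatenate([x, h_prev])      # [x_t, h_{t-1}]
    z   = np.tanh(W @ v)                 # updating information (the change contents C't)
    z_i = sigmoid(W_i @ v)               # input gate
    z_f = sigmoid(W_f @ v)               # forget gate
    z_o = sigmoid(W_o @ v)               # output gate
    c = z_f * c_prev + z_i * z           # cell state: changed slowly, only linear interactions
    h = z_o * np.tanh(c)                 # what is let through to the output
    return c, h
```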
Sequence to sequence chat model
Chat with context
[Figure: example dialogue turns (M: Hi, M: Hello, U: Hi, ...) showing that the model's reply is conditioned on the previous turns of the conversation]
Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models”, 2015.
Baidu’s speech recognition using RNN
Attention
Image Caption Generation
[Figure: a CNN produces a vector for each image region; attention weights over the regions (e.g. 0.0, 0.8, 0.2, 0.0, 0.0, 0.0) form a weighted sum of the region vectors, which is fed to the decoder (states z0, z1, z2, ...) as it generates Word 1, Word 2, ...]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015.
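A small sketch of the soft attention step the figure describes (an assumed formulation, not the paper's exact one): scores computed from the current decoder state are turned into weights over the region vectors, and their weighted sum conditions the next word.

```python
import numpy as np

def attend(region_vectors, decoder_state, W_score):
    # region_vectors: one CNN feature vector per image region, shape (regions, dim)
    scores = region_vectors @ (W_score @ decoder_state)   # one score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # e.g. 0.0, 0.8, 0.2, ...
    z = weights @ region_vectors                           # weighted sum of region vectors
    return z, weights
```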
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron
Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015
Demo