

Sequence Modeling: Recurrent and Recursive Nets


Dr. Anand Kumar M
Dept. of IT
National Institute of Technology Karnataka (NITK)
m_anandkumar@nitk.edu.in
Outline
• Motivation
• Unfolding Computational Graph
• RNN variants
• BPTT
• Bi-RNN
• Encoder-Decoder
• Deep RNN and Recursive NN
• LSTM
Motivation
• Recurrent neural networks, or RNNs, are a family of neural networks for processing sequential data.
• Much as a convolutional network is a neural network specialized for processing a grid of values X, such as an image, a recurrent neural network is a neural network specialized for processing a sequence of values x^(1), . . . , x^(τ).
Motivation
• Recurrent networks can also process sequences of variable length.
• Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them.
• If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training.
Motivation
• The convolution operation allows a network to share parameters across time, but it is shallow.
• The output of convolution is a sequence where each member of the output is a function of a small number of neighboring members of the input.
• In an RNN, each member of the output is a function of the previous members of the output.
Unfolding Computational Graphs
• As another example, let us consider a dynamical system driven by an external signal x^(t), written below.
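For reference, this is the standard textbook form of such a system (following Goodfellow et al.), with state s^(t), external input x^(t) and shared parameters θ:

s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta)

When the state is the hidden layer of a recurrent network, the same recurrence is written with h for the state:

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)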
Unfolding Computational Graphs
• The unfolding process thus introduces two major advantages:
– Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another, rather than in terms of a variable-length history of states.
– It is possible to use the same transition function f with the same parameters at every time step.
Recurrent Neural Network
RNN
Naïve RNN

Given the previous hidden state h and the current input x, the naïve RNN computes a new state h' and an output y:

h' = σ(W_h h + W_i x)
y = softmax(W_o h')

Note that y is computed from h'. We have ignored the bias terms.
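A minimal NumPy sketch of this cell; the weight names Wh, Wi, Wo mirror the slide, while the activation choice (tanh) and the dimensions are illustrative assumptions:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def naive_rnn_cell(h, x, Wh, Wi, Wo):
    # h' = sigma(Wh h + Wi x); sigma is taken to be tanh here
    h_new = np.tanh(Wh @ h + Wi @ x)
    # y = softmax(Wo h'); note y is computed from the new state h'
    y = softmax(Wo @ h_new)
    return h_new, y

# illustrative sizes: hidden dim 4, input dim 3, output dim 2
rng = np.random.default_rng(0)
Wh, Wi, Wo = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
h0, x1 = np.zeros(4), rng.normal(size=3)
h1, y1 = naive_rnn_cell(h0, x1, Wh, Wi, Wo)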


How does RNN reduce complexity?
• Given a function f: (h, x) → (h', y), where h and h' are vectors of the same dimension.
[Figure: the RNN unrolled in time, h0 → f → h1 → f → h2 → f → h3 …, reading inputs x1, x2, x3 and emitting outputs y1, y2, y3.]
• No matter how long the input/output sequence is, we only need one function f. If the f's were different at each step, the model would become a feedforward NN. This may be treated as another form of compression relative to a fully connected network.
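A short sketch of this unrolled computation, reusing naive_rnn_cell and the weights from the sketch above (an assumption for illustration, not slide code), to emphasize that the same f with the same parameters is applied at every step:

h = np.zeros(4)
outputs = []
for x in [rng.normal(size=3) for _ in range(5)]:  # a length-5 input sequence
    h, y = naive_rnn_cell(h, x, Wh, Wi, Wo)       # the same f, same parameters, at every step
    outputs.append(y)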
Recurrent Neural Networks
• Some examples of important design patterns
for recurrent neural networks include the
following.
Bi-RNN
• In many applications, however, we want to
output a prediction of y(t) that may depend
on the whole input sequence.
• They have been extremely successful in applications where that need arises, such as handwriting recognition, speech recognition and bioinformatics.
Bi-RNN
• As the name suggests, bidirectional RNNs
combine an RNN that moves forward through
time, beginning from the start of the
sequence, with another RNN that moves
backward through time, beginning from the
end of the sequence.
Bidirectional RNN
A forward RNN, (y, h) = f1(x, h), reads the input x1, x2, x3 from left to right, producing states h1, h2, h3 and outputs y1, y2, y3. A backward RNN, (z, g) = f2(x, g), reads the same input from right to left, producing states g1, g2, g3 and outputs z1, z2, z3. At each position a third function combines the two directions, p_t = f3(y_t, z_t), giving the final outputs p1, p2, p3.
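A minimal sketch of this combination, reusing naive_rnn_cell from the earlier sketch for both directions; the parameter tuples and the use of concatenation for f3 are illustrative assumptions:

def bidirectional_rnn(xs, params_fwd, params_bwd, dim=4):
    # forward RNN: read x1..xT left to right
    h, ys = np.zeros(dim), []
    for x in xs:
        h, y = naive_rnn_cell(h, x, *params_fwd)
        ys.append(y)
    # backward RNN: read the same inputs right to left
    g, zs = np.zeros(dim), []
    for x in reversed(xs):
        g, z = naive_rnn_cell(g, x, *params_bwd)
        zs.append(z)
    zs.reverse()  # re-align z_t with y_t
    # f3: here simply concatenate the two directions at each position
    return [np.concatenate([y, z]) for y, z in zip(ys, zs)]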
Encoder-Decoder Sequence-to-Sequence Architectures

• An RNN can map an input sequence to a fixed-size vector.
• An RNN can map a fixed-size vector to a sequence.
• An RNN can map an input sequence to an output sequence of the same length.
• An RNN can also be trained to map an input sequence to an output sequence that is not necessarily of the same length.
Example of an encoder-decoder or sequence-to-sequence RNN architecture, for learning to generate an output sequence y given an input sequence x. It is composed of an encoder RNN that reads the input sequence and a decoder RNN that generates the output sequence (or computes the probability of a given output sequence). The final hidden state of the encoder RNN is used to compute a generally fixed-size context variable C, which represents a semantic summary of the input sequence and is given as input to the decoder RNN.
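A compact sketch of the encode/decode split, again reusing naive_rnn_cell; the fixed number of decoding steps and feeding the previous output back in are simplifying assumptions made for brevity:

def encode(xs, params_enc, dim=4):
    # encoder RNN: read the whole input; the final state is the context C
    h = np.zeros(dim)
    for x in xs:
        h, _ = naive_rnn_cell(h, x, *params_enc)
    return h  # context vector C

def decode(C, params_dec, n_steps, out_dim=2):
    # decoder RNN: start from the context and generate an output sequence
    h, y, outputs = C, np.zeros(out_dim), []
    for _ in range(n_steps):
        h, y = naive_rnn_cell(h, y, *params_dec)  # params_dec's Wi must accept out_dim-sized inputs
        outputs.append(y)
    return outputs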
Deep Recurrent Net
Deep RNN
Two recurrent layers are stacked: the first computes (h', y) = f1(h, x) from the input, and the second computes (g', z) = f2(g, y) from the first layer's output. Unrolled in time, the chain h0 → f1 → h1 → f1 → h2 → … reads x1, x2, x3 and produces y1, y2, y3, which in turn feed the chain g0 → f2 → g1 → f2 → g2 → … that produces the final outputs z1, z2, z3.
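A minimal sketch of this stacking, once more reusing naive_rnn_cell; the layer weight shapes are illustrative assumptions, and the second layer's Wi must accept the first layer's output dimension:

def deep_rnn(xs, params_layer1, params_layer2, dim=4):
    h, g, zs = np.zeros(dim), np.zeros(dim), []
    for x in xs:
        h, y = naive_rnn_cell(h, x, *params_layer1)  # (h', y) = f1(h, x)
        g, z = naive_rnn_cell(g, y, *params_layer2)  # (g', z) = f2(g, y)
        zs.append(z)
    return zs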
Recursive NN
Problems with naive RNN
• When dealing with a time series, a naive RNN tends to forget old information. When there is a distant relationship of unknown length, we would like the network to have a "memory" of it.
• Vanishing gradient problem.
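A tiny toy illustration of the vanishing-gradient point (not from the slides): backpropagation through time multiplies the gradient by the recurrent Jacobian once per step, so when its norm is below 1 the gradient shrinks exponentially with sequence length.

import numpy as np

W = 0.9 * np.eye(4)          # recurrent weights with spectral norm 0.9 < 1
grad = np.ones(4)            # gradient arriving at the last time step
for t in range(50):          # backpropagate through 50 time steps
    grad = W.T @ grad        # (the tanh derivative <= 1 would shrink it further)
print(np.linalg.norm(grad))  # about 0.9**50 * 2 ≈ 0.01: effectively vanished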
LSTM
The sigmoid layers output numbers between 0 and 1 that determine how much of each component should be let through; the pink × in the figure is point-wise multiplication.

The cell takes the previous cell state Ct-1 and hidden state ht-1 together with the current input:
• Forget gate: a sigmoid gate that determines how much of the old cell state information goes through.
• Input gate: decides which components are to be updated; the candidate C't provides the change contents to add to the cell state.
• Updating the cell state: the retained old contents and the new contents are combined into Ct.
• Output gate: a sigmoid gate that controls what part of the cell state goes into the output.

The core idea is the cell state Ct: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.

Why sigmoid or tanh? The sigmoid's 0–1 output acts as a gating switch. Since the vanishing gradient problem is already handled by the LSTM's cell state, is it OK to replace tanh with ReLU?
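For reference, the standard LSTM equations (the common textbook formulation, not copied from the slides), with \sigma the sigmoid and \odot point-wise multiplication:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     (candidate change contents)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (updating the cell state)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                        (what part of the cell state to output)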
RNN vs LSTM
Naïve RNN vs LSTM
The naïve RNN maps (ht-1, xt) to (ht, yt). The LSTM additionally carries a cell state: it maps (ct-1, ht-1, xt) to (ct, ht, yt).
• c changes slowly: ct is ct-1 with something added to it.
• h changes faster: ht and ht-1 can be very different.


From xt and ht-1, stacked into a single vector, the LSTM computes four quantities:
z   = tanh(W  [xt ; ht-1])  : the updating information
z^i = σ(W^i [xt ; ht-1])   : controls the input gate
z^f = σ(W^f [xt ; ht-1])   : controls the forget gate
z^o = σ(W^o [xt ; ht-1])   : controls the output gate
These four matrix computations should be done concurrently (as one large matrix multiplication).
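A minimal NumPy sketch of one LSTM step in this notation; the weight shapes, the random initialization and the tanh on the cell output are illustrative assumptions:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(c_prev, h_prev, x, W, Wi, Wf, Wo):
    v = np.concatenate([x, h_prev])   # stack xt and ht-1
    z  = np.tanh(W  @ v)              # updating information
    zi = sigmoid(Wi @ v)              # input gate
    zf = sigmoid(Wf @ v)              # forget gate
    zo = sigmoid(Wo @ v)              # output gate
    c = zf * c_prev + zi * z          # the cell state changes slowly
    h = zo * np.tanh(c)               # the hidden state can change quickly
    return c, h

# illustrative sizes: input dim 3, hidden dim 4
d_in, d_h = 3, 4
rng = np.random.default_rng(1)
W, Wi, Wf, Wo = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(4))
c, h = np.zeros(d_h), np.zeros(d_h)
c, h = lstm_step(c, h, rng.normal(size=d_in), W, Wi, Wf, Wo)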

Information flow of LSTM


Applications
End to End Memory Networks
Neural machine translation

LSTM
Sequence-to-sequence chat model
Chat with context
[Figure: example exchanges between a user (U) and the model (M), e.g. "U: Hi" answered by "M: Hi" or "M: Hello".]
Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models", 2015.
Baidu’s speech recognition using RNN
Attention
Image Caption Generation
[Figure: a CNN produces a vector for each image region; attention weights over the regions (e.g. 0.0, 0.8, 0.2) form a weighted sum that conditions the decoder states z0, z1, z2, which generate Word 1, Word 2, ….]
Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural
Image Caption Generation with Visual Attention”, ICML, 2015
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron
Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015
Demo
