5a. Recurrent Neural Networks


COMP9444: Neural Networks and Deep Learning
Week 5a. Recurrent Neural Networks (RNNs)

Sonit Singh
School of Computer Science and Engineering
June 25, 2024
Outline

• Processing Temporal Sequences


• Sliding Window
• Recurrent Network Architectures
• Hidden Unit Dynamics
• Long Short Term Memory (LSTM)
• Gated Recurrent Unit (GRU)

2
Processing Temporal Sequences

There are many tasks which require a sequence of inputs to be processed rather
than a single input.
• speech recognition
• time series prediction
• machine translation
• handwriting recognition

How can neural network models be adapted for these tasks?

3
Sliding Window

The simplest way to feed temporal input to a neural network is the
“sliding window” approach, first used in the NetTalk system
(Sejnowski & Rosenberg, 1987).

4
NetTalk Task

Given a sequence of 7 characters, predict the phonetic pronunciation of the middle
character.
For this task, we need to know the characters on both sides.
For example, how are the vowels in these words pronounced?

pa pat pate paternal

mo mod mode modern

5
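As a concrete illustration of the sliding-window encoding, here is a minimal sketch that builds 7-character windows over a word and one-hot encodes each character. The character set, padding and all names are assumptions for illustration only, not the exact NETtalk encoding (which mapped the centre character to a phoneme code).

import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "        # assumed character set (26 letters + space)
CHAR_TO_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(ch):
    v = np.zeros(len(ALPHABET))
    v[CHAR_TO_INDEX[ch]] = 1.0
    return v

def sliding_windows(text, width=7):
    """Yield (input vector, centre character) pairs, padding the ends with spaces."""
    pad = " " * (width // 2)
    padded = pad + text + pad
    for i in range(len(text)):
        window = padded[i:i + width]
        # the network input is the concatenated one-hot codes of the window;
        # the target would be the phoneme of the centre character
        x = np.concatenate([one_hot(c) for c in window])
        yield x, text[i]

for x, centre in sliding_windows("paternal"):
    print(centre, x.shape)                      # each input is a fixed-size vector: 7 x 27 = 189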
NetTalk Architecture

6
NetTalk Test

https://www.youtube.com/watch?v=gakJlr3GecE

7
NetTalk

• NETtalk gained a lot of media attention at the time.
• Hooking it up to a speech synthesizer was very cute. In the early stages of
training, it sounded like a babbling baby. When fully trained, it pronounced the
words mostly correctly (but sounded somewhat robotic).
• Later studies on similar tasks have often found that a decision tree could
produce equally good or better accuracy.
• This kind of approach can only learn short term dependencies, not the
medium or long term dependencies that are required for some tasks.

8
Simple Recurrent Network (Elman, 1990)

• at each time step, hidden layer activations are copied to “context” layer
• hidden layer receives connections from input and context layers
• the inputs are fed to the network one at a time; the network uses the context layer to
“remember” whatever information is required to produce the correct output
(a minimal sketch of this update is shown below)
9
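A minimal sketch of an Elman-style SRN, written here in PyTorch; the class name, layer sizes and toy data are illustrative assumptions, not the network from the slide. The concatenation of the current input with the previous hidden activations plays the role of the context layer.

import torch
import torch.nn as nn

class SimpleRecurrentNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.in2hidden = nn.Linear(input_size + hidden_size, hidden_size)
        self.hidden2out = nn.Linear(hidden_size, output_size)

    def forward(self, inputs):
        # inputs: (seq_len, batch, input_size)
        h = torch.zeros(inputs.size(1), self.in2hidden.out_features)
        outputs = []
        for x_t in inputs:                       # one time step at a time
            # the "context" layer is simply last step's hidden activations
            h = torch.tanh(self.in2hidden(torch.cat([x_t, h], dim=1)))
            outputs.append(self.hidden2out(h))
        return torch.stack(outputs), h

srn = SimpleRecurrentNetwork(input_size=4, hidden_size=8, output_size=4)
y, h_final = srn(torch.randn(10, 1, 4))          # a length-10 toy sequence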
Back Propagation Through Time

• we can “unroll” a recurrent architecture into an equivalent feedforward


architecture, with shared weights
• applying backpropagation to the unrolled architecture is referred to as
“backpropagation through time”
• we can backpropagate just one timestep, or a fixed number of timesteps, or all
the way back to the beginning of the sequence (a truncated-BPTT sketch follows below)

10
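A rough sketch of truncated backpropagation through time, assuming PyTorch's built-in nn.RNN and entirely toy data and sizes. Detaching the hidden state at each chunk boundary is what limits how far the gradients flow back.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8)        # an Elman-style recurrent layer
readout = nn.Linear(8, 4)
optimiser = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)
loss_fn = nn.MSELoss()

inputs = torch.randn(100, 1, 4)                  # toy sequence: (seq_len, batch, features)
targets = torch.randn(100, 1, 4)

k = 5                                            # backpropagate at most k time steps
h = torch.zeros(1, 1, 8)
for start in range(0, 100, k):
    optimiser.zero_grad()
    out, h = rnn(inputs[start:start + k], h)     # unroll k steps of the shared weights
    loss = loss_fn(readout(out), targets[start:start + k])
    loss.backward()                              # gradients stop at the chunk boundary
    optimiser.step()
    h = h.detach()                               # cut the graph so earlier steps are not revisited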
Other Recurrent Neural Architectures

• it is sometimes beneficial to add “shortcut” connections directly from input to
output
• connections from output back to hidden have also been explored (sometimes
called “Jordan Networks”)
11
Second Order (or Gated) Networks

x_j^t = \tanh\Big( W^{\sigma_t}_{j0} + \sum_{k=1}^{d} W^{\sigma_t}_{jk}\, x_k^{t-1} \Big)

z = \tanh\Big( P_0 + \sum_{j=1}^{d} P_j\, x_j^{n} \Big)

The weights used to update the hidden units are selected (“gated”) by the current input symbol \sigma_t; after the final symbol (time n), the output z is computed from the hidden units.

12
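A small NumPy sketch of the second-order update above, with illustrative sizes: the current input symbol indexes which weight matrix drives the hidden-state transition, and the output z is read from the final hidden state.

import numpy as np

d, num_symbols = 5, 2                            # hidden units, alphabet size
rng = np.random.default_rng(0)
W = rng.normal(size=(num_symbols, d, d + 1))     # one set of (bias + d) weight rows per symbol
P = rng.normal(size=d + 1)                       # output weights (bias + d)

def run(sequence):
    x = np.zeros(d)                              # hidden state x^0
    for symbol in sequence:                      # symbol sigma_t selects the weights
        x = np.tanh(W[symbol] @ np.concatenate(([1.0], x)))
    return np.tanh(P @ np.concatenate(([1.0], x)))   # z from the final hidden state

print(run([1, 1, 0, 1]))                         # e.g. the string "1101"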
Task: Formal Language Recognition

Accept Reject
1 0
11 10
111 01
1111 00
11111 011
111111 110
1111111 11111110
11111111 10111111

Scan a sequence of characters one at a time,
then classify the sequence as Accept or Reject.

13
Dynamical Recognizers

• gated network trained by BPTT
• emulates exactly the behaviour of a Finite State Automaton

14
Task: Formal Language Recognition

Accept Reject
1 000
0 11000
10 0001
01 000000000
00 11111000011
100100 1101010000010111
001111110100 1010010001
0100100100 0000
11100 00000
0010

Scan a sequence of characters one at a time,
then classify the sequence as Accept or Reject.

15
Dynamical Recognizers

• the trained network emulates the behaviour of a Finite State Automaton
• the training set must include short, medium and long examples

16
Phase Transition

17
Chomsky Hierarchy

Language                 Machine                     Example
Regular                  Finite State Automaton      a^n (n odd)
Context Free             Push Down Automaton         a^n b^n
Context Sensitive        Linear Bounded Automaton    a^n b^n c^n
Recursively Enumerable   Turing Machine              true QBF

18
Task: Formal Language Prediction

abaabbabaaabbbaaaabbbbabaabbaaaaabbbbb . . .

• Scan a sequence of characters one at a time, and try at each step to predict
the next character in the sequence.
• In some cases, the prediction is probabilistic.
• For the a^n b^n task, the first b is not predictable, but subsequent b’s and the
initial a in the next subsequence are predictable (a small data-setup sketch
follows below).

19
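A small sketch of how the prediction data can be set up (purely illustrative): concatenate a^n b^n blocks with random n, then pair each character with its successor as the training target.

import random

def anbn_stream(num_blocks=5, max_n=10):
    """Concatenate blocks a^n b^n with random n, as in the prediction task."""
    return "".join("a" * n + "b" * n
                   for n in (random.randint(1, max_n) for _ in range(num_blocks)))

stream = anbn_stream()
# training pairs: current character -> next character to be predicted
pairs = [(stream[i], stream[i + 1]) for i in range(len(stream) - 1)]
print(stream)
print(pairs[:6])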
Elman Network for predicting a^n b^n

[Network diagram: inputs a, b and outputs a, b]
20
Oscillating Solution for a^n b^n

21
Learning to Predict a^n b^n

• the network does not implement a Finite State Automaton but instead uses
two fixed points in activation space – one attracting, the other repelling (Wiles
& Elman, 1995)
• networks trained only up to a^10 b^10 could generalize up to a^12 b^12
• training the weights by evolution is more stable than by backpropagation
• networks trained by evolution were sometimes monotonic rather than
oscillating

22
Monotonic Solution for a^n b^n

23
Hidden Unit Analysis for a^n b^n

[Plots: hidden unit trajectory; fixed points and eigenvectors]

24
Counting by Spiralling

• for this task, sequence is accepted if the number of a’s and b’s are equal
• network counts up by spiralling inwards, down by spiralling outwards

25
Hidden Unit Dynamics for a^n b^n c^n

• SRN with 3 hidden units can learn to predict a^n b^n c^n by counting up and
down simultaneously in different directions, thus producing a star shape.

26
Partly Monotonic Solution for a^n b^n c^n

27
Long Range Dependencies

• Simple Recurrent Networks (SRNs) can learn medium-range dependencies
but have difficulty learning long-range dependencies
• Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks can
learn long-range dependencies better than SRNs

28
Long Short Term Memory

Two excellent Web resources for LSTM:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

christianherta.de/lehre/dataScience/machineLearning/neuralNetworks/LSTM.php

29
Reber Grammar

30
Embedded Reber Grammar

31
Simple Recurrent Network

• SRN – context layer is combined directly with the input to produce the next
hidden layer.
• SRN can learn Reber Grammar, but not Embedded Reber Grammar.

32
Long Short Term Memory

• LSTM – context layer is modulated by three gating mechanisms:
the forget gate, input gate and output gate (a cell sketch is shown below).

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

33
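The sketch below spells out the three gates of an LSTM cell in PyTorch. It follows the standard equations in spirit (compare torch.nn.LSTMCell), but the class name, sizes and toy data are illustrative assumptions.

import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.forget_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.output_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h, c):
        z = torch.cat([x, h], dim=1)
        f = torch.sigmoid(self.forget_gate(z))   # what to erase from the cell state
        i = torch.sigmoid(self.input_gate(z))    # what new information to write
        o = torch.sigmoid(self.output_gate(z))   # what to expose as the hidden state
        c = f * c + i * torch.tanh(self.candidate(z))
        h = o * torch.tanh(c)
        return h, c

cell = LSTMCellSketch(4, 8)
h = c = torch.zeros(1, 8)
for x_t in torch.randn(10, 1, 4):                # process a toy sequence step by step
    h, c = cell(x_t, h, c)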
Long Short Term Memory

34
Gated Recurrent Unit

GRU is similar to LSTM but has only two gates instead of three.

35
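For contrast, a matching GRU-cell sketch with only a reset gate and an update gate, and no separate cell state. Gate naming conventions differ between presentations, so this is an illustrative variant rather than the exact torch.nn.GRUCell formulation.

import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h):
        z_in = torch.cat([x, h], dim=1)
        r = torch.sigmoid(self.reset_gate(z_in))          # how much of the old state to use
        u = torch.sigmoid(self.update_gate(z_in))         # how much to overwrite
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - u) * h + u * h_tilde                  # blend old state and candidate

cell = GRUCellSketch(4, 8)
h = torch.zeros(1, 8)
for x_t in torch.randn(10, 1, 4):
    h = cell(x_t, h)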
End-to-End Text Classification

Image Credit: https://towardsdatascience.com/using-deep-learning-for-end-to-end-multiclass-text-classification-39b46aecac81

36
Seq2Seq model
Encoder-Decoder Framework

37
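A bare-bones sketch of the encoder-decoder idea (illustrative sizes, vocabularies and greedy decoding, not a production translation model): the encoder's final hidden state summarises the source sequence and initialises the decoder.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden_size=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden_size)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size)
        self.decoder = nn.GRU(hidden_size, hidden_size)
        self.readout = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src, max_len=10, start_token=0):
        _, h = self.encoder(self.src_embed(src))          # summary of the source sequence
        token = torch.full((1, src.size(1)), start_token, dtype=torch.long)
        outputs = []
        for _ in range(max_len):                          # greedy decoding, one token at a time
            out, h = self.decoder(self.tgt_embed(token), h)
            logits = self.readout(out)
            token = logits.argmax(dim=-1)                 # feed the prediction back in
            outputs.append(logits)
        return torch.cat(outputs, dim=0)

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.randint(0, 100, (7, 1)))             # a length-7 toy source sentence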
Seq2Seq model applications

Machine Translation

38
Seq2Seq model applications

Automatic Email Reply

39
Seq2Seq model applications
Google’s Neural Machine Translation

40
CNNs + LSTM : Image Captioning

Show and Tell: Neural Image Caption Generator (Vinyals et al. 2015)

41
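Schematically, the Show-and-Tell setup can be sketched as a CNN feature vector fed to an LSTM that generates the caption word by word. The tiny CNN, sizes and vocabulary below are placeholders, not the architecture of the paper (which used a pretrained CNN encoder).

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, feat_size=256):
        super().__init__()
        self.cnn = nn.Sequential(                     # stand-in for a pretrained CNN encoder
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_size))
        self.embed = nn.Embedding(vocab_size, feat_size)
        self.lstm = nn.LSTM(feat_size, feat_size, batch_first=True)
        self.readout = nn.Linear(feat_size, vocab_size)

    def forward(self, image, caption_tokens):
        feat = self.cnn(image).unsqueeze(1)           # image feature as the first "word"
        words = self.embed(caption_tokens)
        out, _ = self.lstm(torch.cat([feat, words], dim=1))
        return self.readout(out)                      # next-word logits at every position

model = CaptionModel(vocab_size=1000)
logits = model(torch.randn(1, 3, 64, 64), torch.randint(0, 1000, (1, 5)))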
CNNs + LSTM : Image Captioning
Show and Tell: Neural Image Caption Generator (Vinyals et al. 2015)

42
CNNs + LSTM : Image Captioning

Show, Attend and Tell (Xu et al. 2015)

43
CNNs + LSTM : Image Captioning
Show, Attend and Tell (Xu et al. 2015) Examples

44
Summary
• Recurrent Neural Networks (RNNs) are specialised neural networks suitable
for modelling sequential or time-series data.
• RNNs have a looping mechanism that acts as a highway to allow information
to flow from one step to the next. This information is the hidden state, which is
a representation of previous inputs.
• Simple RNNs suffer from the vanishing gradient problem
◦ As the RNN processes more steps, it has trouble retaining information from
earlier steps.
◦ During back-propagation, the earlier layers learn very little because their
weights are barely adjusted by the extremely small gradients.
◦ As a result, simple RNNs do not learn long-range dependencies across time steps.
• LSTMs and GRUs are two special RNNs, capable of learning long-term
dependencies using mechanisms called gates.
• These gates are different tensor operations that can learn what information to
add to or remove from the hidden state.
45
