5a. Recurrent Neural Networks


COMP9444: Neural Networks and Deep Learning
Week 5a. Recurrent Neural Networks (RNNs)

Sonit Singh
School of Computer Science and Engineering
June 25, 2024
Outline

• Processing Temporal Sequences


• Sliding Window
• Recurrent Network Architectures
• Hidden Unit Dynamics
• Long Short Term Memory (LSTM)
• Gated Recurrent Unit (GRU)

2
Processing Temporal Sequences

There are many tasks which require a sequence of inputs to be processed rather
than a single input.
• speech recognition
• time series prediction
• machine translation
• handwriting recognition

How can neural network models be adapted for these tasks?

3
Sliding Window

The simplest way to feed temporal input to a neural network is the
“sliding window” approach, first used in the NetTalk system
(Sejnowski & Rosenberg, 1987).

4
NetTalk Task

Given a sequence of 7 characters, predict the phonetic pronunciation of the middle
character.
For this task, we need to know the characters on both sides.
For example, how are the vowels in these words pronounced?

pa pat pate paternal

mo mod mode modern

5
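As a concrete illustration of the sliding-window encoding, here is a minimal sketch that builds 7-character windows over a word and one-hot encodes each character. The character set, padding and all names are assumptions for illustration only, not the exact NETtalk encoding (which mapped the centre character to a phoneme code).

import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "        # assumed character set (26 letters + space)
CHAR_TO_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(ch):
    v = np.zeros(len(ALPHABET))
    v[CHAR_TO_INDEX[ch]] = 1.0
    return v

def sliding_windows(text, width=7):
    """Yield (input vector, centre character) pairs, padding the ends with spaces."""
    pad = " " * (width // 2)
    padded = pad + text + pad
    for i in range(len(text)):
        window = padded[i:i + width]
        # the network input is the concatenated one-hot codes of the window;
        # the target would be the phoneme of the centre character
        x = np.concatenate([one_hot(c) for c in window])
        yield x, text[i]

for x, centre in sliding_windows("paternal"):
    print(centre, x.shape)                      # each input is a fixed-size vector: 7 x 27 = 189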
NetTalk Architecture

6
NetTalk Test

https://www.youtube.com/watch?v=gakJlr3GecE

7
NetTalk

• NETtalk gained a lot of media attention at the time.
• Hooking it up to a speech synthesizer was very cute. In the early stages of
training, it sounded like a babbling baby. When fully trained, it pronounced the
words mostly correctly (but sounded somewhat robotic).
• Later studies on similar tasks have often found that a decision tree could
produce equally good or better accuracy.
• This kind of approach can only learn short term dependencies, not the
medium or long term dependencies that are required for some tasks.

8
Simple Recurrent Network (Elman, 1990)

• at each time step, hidden layer activations are copied to “context” layer
• hidden layer receives connections from input and context layers
• the inputs are fed to the network one at a time; the network uses the context layer to
“remember” whatever information is required to produce the correct output
(a minimal sketch of this update is shown below)
9
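A minimal sketch of an Elman-style SRN, written here in PyTorch; the class name, layer sizes and toy data are illustrative assumptions, not the network from the slide. The concatenation of the current input with the previous hidden activations plays the role of the context layer.

import torch
import torch.nn as nn

class SimpleRecurrentNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.in2hidden = nn.Linear(input_size + hidden_size, hidden_size)
        self.hidden2out = nn.Linear(hidden_size, output_size)

    def forward(self, inputs):
        # inputs: (seq_len, batch, input_size)
        h = torch.zeros(inputs.size(1), self.in2hidden.out_features)
        outputs = []
        for x_t in inputs:                       # one time step at a time
            # the "context" layer is simply last step's hidden activations
            h = torch.tanh(self.in2hidden(torch.cat([x_t, h], dim=1)))
            outputs.append(self.hidden2out(h))
        return torch.stack(outputs), h

srn = SimpleRecurrentNetwork(input_size=4, hidden_size=8, output_size=4)
y, h_final = srn(torch.randn(10, 1, 4))          # a length-10 toy sequence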
Back Propagation Through Time

• we can “unroll” a recurrent architecture into an equivalent feedforward


architecture, with shared weights
• applying backpropagation to the unrolled architecture is referred to as
“backpropagation through time”
• we can backpropagate just one timestep, or a fixed number of timesteps, or all
the way back to the beginning of the sequence (a truncated-BPTT sketch follows below)

10
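A rough sketch of truncated backpropagation through time, assuming PyTorch's built-in nn.RNN and entirely toy data and sizes. Detaching the hidden state at each chunk boundary is what limits how far the gradients flow back.

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8)        # an Elman-style recurrent layer
readout = nn.Linear(8, 4)
optimiser = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)
loss_fn = nn.MSELoss()

inputs = torch.randn(100, 1, 4)                  # toy sequence: (seq_len, batch, features)
targets = torch.randn(100, 1, 4)

k = 5                                            # backpropagate at most k time steps
h = torch.zeros(1, 1, 8)
for start in range(0, 100, k):
    optimiser.zero_grad()
    out, h = rnn(inputs[start:start + k], h)     # unroll k steps of the shared weights
    loss = loss_fn(readout(out), targets[start:start + k])
    loss.backward()                              # gradients stop at the chunk boundary
    optimiser.step()
    h = h.detach()                               # cut the graph so earlier steps are not revisited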
Other Recurrent Neural Architectures

• it is sometimes beneficial to add “shortcut” connections directly from input to
output
• connections from output back to hidden have also been explored (sometimes
called “Jordan Networks”)
11
Second Order (or Gated) Networks

x_j^t = \tanh\Big( W^{\sigma_t}_{j0} + \sum_{k=1}^{d} W^{\sigma_t}_{jk}\, x_k^{t-1} \Big)

z = \tanh\Big( P_0 + \sum_{j=1}^{d} P_j\, x_j^{n} \Big)

The weights used to update the hidden units are selected (“gated”) by the current input symbol \sigma_t; after the final symbol (time n), the output z is computed from the hidden units.

12
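A small NumPy sketch of the second-order update above, with illustrative sizes: the current input symbol indexes which weight matrix drives the hidden-state transition, and the output z is read from the final hidden state.

import numpy as np

d, num_symbols = 5, 2                            # hidden units, alphabet size
rng = np.random.default_rng(0)
W = rng.normal(size=(num_symbols, d, d + 1))     # one set of (bias + d) weight rows per symbol
P = rng.normal(size=d + 1)                       # output weights (bias + d)

def run(sequence):
    x = np.zeros(d)                              # hidden state x^0
    for symbol in sequence:                      # symbol sigma_t selects the weights
        x = np.tanh(W[symbol] @ np.concatenate(([1.0], x)))
    return np.tanh(P @ np.concatenate(([1.0], x)))   # z from the final hidden state

print(run([1, 1, 0, 1]))                         # e.g. the string "1101"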
Task: Formal Language Recognition

Accept Reject
1 0
11 10
111 01
1111 00
11111 011
111111 110
1111111 11111110
11111111 10111111

Scan a sequence of characters one at a time,
then classify the sequence as Accept or Reject.

13
Dynamical Recognizers

• gated network trained by BPTT
• emulates exactly the behaviour of a Finite State Automaton

14
Task: Formal Language Recognition

Accept Reject
1 000
0 11000
10 0001
01 000000000
00 11111000011
100100 1101010000010111
001111110100 1010010001
0100100100 0000
11100 00000
0010

Scan a sequence of characters one at a time,
then classify the sequence as Accept or Reject.

15
Dynamical Recognizers

• the trained network emulates the behaviour of a Finite State Automaton
• the training set must include short, medium and long examples

16
Phase Transition

17
Chomsky Hierarchy

Language                 Machine                     Example
Regular                  Finite State Automaton      a^n (n odd)
Context Free             Push Down Automaton         a^n b^n
Context Sensitive        Linear Bounded Automaton    a^n b^n c^n
Recursively Enumerable   Turing Machine              true QBF

18
Task: Formal Language Prediction

abaabbabaaabbbaaaabbbbabaabbaaaaabbbbb . . .

• Scan a sequence of characters one at a time, and try at each step to predict
the next character in the sequence.
• In some cases, the prediction is probabilistic.
• For the a^n b^n task, the first b is not predictable, but subsequent b’s and the
initial a in the next subsequence are predictable (a small data-setup sketch
follows below).

19
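A small sketch of how the prediction data can be set up (purely illustrative): concatenate a^n b^n blocks with random n, then pair each character with its successor as the training target.

import random

def anbn_stream(num_blocks=5, max_n=10):
    """Concatenate blocks a^n b^n with random n, as in the prediction task."""
    return "".join("a" * n + "b" * n
                   for n in (random.randint(1, max_n) for _ in range(num_blocks)))

stream = anbn_stream()
# training pairs: current character -> next character to be predicted
pairs = [(stream[i], stream[i + 1]) for i in range(len(stream) - 1)]
print(stream)
print(pairs[:6])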
Elman Network for predicting a^n b^n

[Network diagram: inputs a, b and outputs a, b]
20
Oscillating Solution for a^n b^n

21
Learning to Predict a^n b^n

• the network does not implement a Finite State Automaton but instead uses
two fixed points in activation space – one attracting, the other repelling (Wiles
& Elman, 1995)
• networks trained only up to a^10 b^10 could generalize up to a^12 b^12
• training the weights by evolution is more stable than by backpropagation
• networks trained by evolution were sometimes monotonic rather than
oscillating

22
Monotonic Solution for a^n b^n

23
Hidden Unit Analysis for a^n b^n

[Plots: hidden unit trajectory; fixed points and eigenvectors]

24
Counting by Spiralling

• for this task, sequence is accepted if the number of a’s and b’s are equal
• network counts up by spiralling inwards, down by spiralling outwards

25
Hidden Unit Dynamics for a^n b^n c^n

• SRN with 3 hidden units can learn to predict a^n b^n c^n by counting up and
down simultaneously in different directions, thus producing a star shape.

26
Partly Monotonic Solution for a^n b^n c^n

27
Long Range Dependencies

• Simple Recurrent Networks (SRNs) can learn medium-range dependencies
but have difficulty learning long-range dependencies
• Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks can
learn long-range dependencies better than SRNs

28
Long Short Term Memory

Two excellent Web resources for LSTM:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

christianherta.de/lehre/dataScience/machineLearning/neuralNetworks/LSTM.php

29
Reber Grammar

30
Embedded Reber Grammar

31
Simple Recurrent Network

• SRN – context layer is combined directly with the input to produce the next
hidden layer.
• SRN can learn Reber Grammar, but not Embedded Reber Grammar.

32
Long Short Term Memory

• LSTM – context layer is modulated by three gating mechanisms:
the forget gate, input gate and output gate (a cell sketch is shown below).

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

33
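The sketch below spells out the three gates of an LSTM cell in PyTorch. It follows the standard equations in spirit (compare torch.nn.LSTMCell), but the class name, sizes and toy data are illustrative assumptions.

import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.forget_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.output_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h, c):
        z = torch.cat([x, h], dim=1)
        f = torch.sigmoid(self.forget_gate(z))   # what to erase from the cell state
        i = torch.sigmoid(self.input_gate(z))    # what new information to write
        o = torch.sigmoid(self.output_gate(z))   # what to expose as the hidden state
        c = f * c + i * torch.tanh(self.candidate(z))
        h = o * torch.tanh(c)
        return h, c

cell = LSTMCellSketch(4, 8)
h = c = torch.zeros(1, 8)
for x_t in torch.randn(10, 1, 4):                # process a toy sequence step by step
    h, c = cell(x_t, h, c)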
Long Short Term Memory

34
Gated Recurrent Unit

GRU is similar to LSTM but has only two gates instead of three.

35
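For contrast, a matching GRU-cell sketch with only a reset gate and an update gate, and no separate cell state. Gate naming conventions differ between presentations, so this is an illustrative variant rather than the exact torch.nn.GRUCell formulation.

import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h):
        z_in = torch.cat([x, h], dim=1)
        r = torch.sigmoid(self.reset_gate(z_in))          # how much of the old state to use
        u = torch.sigmoid(self.update_gate(z_in))         # how much to overwrite
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - u) * h + u * h_tilde                  # blend old state and candidate

cell = GRUCellSketch(4, 8)
h = torch.zeros(1, 8)
for x_t in torch.randn(10, 1, 4):
    h = cell(x_t, h)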
End-to-End Text Classification

Image Credit: https://towardsdatascience.com/using-deep-learning-for-end-to-end-multiclass-text-classification-39b46aecac81

36
Seq2Seq model
Encoder-Decoder Framework

37
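A bare-bones sketch of the encoder-decoder idea (illustrative sizes, vocabularies and greedy decoding, not a production translation model): the encoder's final hidden state summarises the source sequence and initialises the decoder.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden_size=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden_size)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size)
        self.decoder = nn.GRU(hidden_size, hidden_size)
        self.readout = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src, max_len=10, start_token=0):
        _, h = self.encoder(self.src_embed(src))          # summary of the source sequence
        token = torch.full((1, src.size(1)), start_token, dtype=torch.long)
        outputs = []
        for _ in range(max_len):                          # greedy decoding, one token at a time
            out, h = self.decoder(self.tgt_embed(token), h)
            logits = self.readout(out)
            token = logits.argmax(dim=-1)                 # feed the prediction back in
            outputs.append(logits)
        return torch.cat(outputs, dim=0)

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.randint(0, 100, (7, 1)))             # a length-7 toy source sentence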
Seq2Seq model applications

Machine Translation

38
Seq2Seq model applications

Automatic Email Reply

39
Seq2Seq model applications
Google’s Neural Machine Translation

40
CNNs + LSTM : Image Captioning

Show and Tell: Neural Image Caption Generator (Vinyals et al. 2015)

41
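Schematically, the Show-and-Tell setup can be sketched as a CNN feature vector fed to an LSTM that generates the caption word by word. The tiny CNN, sizes and vocabulary below are placeholders, not the architecture of the paper (which used a pretrained CNN encoder).

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, feat_size=256):
        super().__init__()
        self.cnn = nn.Sequential(                     # stand-in for a pretrained CNN encoder
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_size))
        self.embed = nn.Embedding(vocab_size, feat_size)
        self.lstm = nn.LSTM(feat_size, feat_size, batch_first=True)
        self.readout = nn.Linear(feat_size, vocab_size)

    def forward(self, image, caption_tokens):
        feat = self.cnn(image).unsqueeze(1)           # image feature as the first "word"
        words = self.embed(caption_tokens)
        out, _ = self.lstm(torch.cat([feat, words], dim=1))
        return self.readout(out)                      # next-word logits at every position

model = CaptionModel(vocab_size=1000)
logits = model(torch.randn(1, 3, 64, 64), torch.randint(0, 1000, (1, 5)))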
CNNs + LSTM : Image Captioning
Show and Tell: Neural Image Caption Generator (Vinyals et al. 2015)

42
CNNs + LSTM : Image Captioning

Show, Attend and Tell (Xu et al. 2015)

43
CNNs + LSTM : Image Captioning
Show, Attend and Tell (Xu et al. 2015) Examples

44
Summary
• Recurrent Neural Networks (RNNs) are specialised neural networks suitable
for modelling sequential or time-series data.
• RNNs have a looping mechanism that acts as a highway to allow information
to flow from one step to the next. This information is the hidden state, which is
a representation of previous inputs.
• Simple RNNs suffer from the vanishing gradient problem
◦ As the RNN processes more steps, it has trouble retaining information from
earlier steps.
◦ During back-propagation, the earlier layers learn very little because their
weights are barely adjusted by the extremely small gradients.
◦ As a result, simple RNNs do not learn long-range dependencies across time steps.
• LSTMs and GRUs are two special RNNs, capable of learning long-term
dependencies using mechanisms called gates.
• These gates are different tensor operations that can learn what information to
add to or remove from the hidden state.
45
