Unit 11-LSTM-CNN


Unit 11: Inference with Neural Networks

Institute of Communications Engineering, EE, NCTU


Unit 11
Outline
Deep Learning Algorithm
– Artificial Neural Networks (ANNs)
– Recurrent Neural Network (LSTM)
– Convolutional Neural Network
Generative Deep Learning Algorithm
– Auto-encoder
• Variational Auto-Encoder (VAE)
– Generative Adversarial Networks (GAN)
– Reinforcement learning
– Transfer learning
Reference :
– Aymeric Damien GitHub Project :
https://github.com/aymericdamien/TensorFlow-Examples/
– “Generative Deep Learning – Teaching Machines to Paint, Write,
Compose and Play”, David Foster, O’Reilly 2019.

2
Unit 11
Artificial Neural Networks (ANNs)
A single artificial neuron
– Inputs: x_1, x_2, …, x_n
– Weights: w_1, w_2, …, w_n
– Bias: b
– Net input: z = Σ_i x_i w_i + b
– Activation or transfer function f to
• Predict the output: y = f(z)
Canonical form of a NN (one hidden layer)
– Inputs: x
– Output: y
– Weights and biases: w_xh, b_h, w_hy, b_y
• z_h = x w_xh + b_h
• a = f(z_h)
• z_y = a w_hy + b_y
• y = f(z_y)
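As a concrete illustration of the canonical form above, here is a minimal NumPy sketch of the forward pass (the shapes and the sigmoid choice for f are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 3 inputs, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))          # input row vector
w_xh = rng.normal(size=(3, 4))       # input-to-hidden weights
b_h = np.zeros((1, 4))               # hidden bias
w_hy = rng.normal(size=(4, 1))       # hidden-to-output weights
b_y = np.zeros((1, 1))               # output bias

z_h = x @ w_xh + b_h                 # z_h = x w_xh + b_h
a = sigmoid(z_h)                     # a = f(z_h)
z_y = a @ w_hy + b_y                 # z_y = a w_hy + b_y
y = sigmoid(z_y)                     # y = f(z_y)
print(y.shape)                       # (1, 1)
```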

3
Unit 11
Approximation with Neural Networks
For the general regression case, any reasonable function can be approximated to
any degree of precision by a three-layer network with
– Input layer: x_1, x_2, …, x_d
– A hidden layer of m sigmoid units: h_k = σ(Σ_i x_i w_ik + b_k), if m is arbitrarily large
– One layer of linear output units: y = Σ_k h_k w_k
A simple proof
– Consider a continuous function y = f(x) with x ∈ [0, 1], W.L.O.G.
– By uniform continuity, ∃n such that |x_1 − x_2| ≤ 1/n ⇒ |f(x_1) − f(x_2)| ≤ ε
– It suffices to approximate f(x) with a g(x) such that g(0) = f(0) and g(x) = f(k/n)
for any x ∈ ((k−1)/n, k/n] and any k = 1, …, n
– g(x) can be realized with a NN with one input x and n+1 hidden threshold-gate units
– Hidden units h_k are numbered from 0 to n, with h_k having a threshold (bias) of (k−1)/n
– For x ∈ ((k−1)/n, k/n], all h_j are zero except h_0 = h_1 = … = h_k = 1
– Let the weight w_k for h_k be w_k = Δ_k f = f(k/n) − f((k−1)/n), with Δ_0 f = f(0)
– Since g(0) = f(0), g(x) = f(0) + Σ_{j=1}^{k} [f(j/n) − f((j−1)/n)] = f(k/n), Q.E.D.
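The construction in the proof can be checked numerically. The following sketch (the choice f = sin and n = 50 are arbitrary, for illustration only) builds the n+1 threshold units and the increment weights w_k = f(k/n) − f((k−1)/n) and measures the approximation error:

```python
import numpy as np

f = np.sin                     # any continuous target function on [0, 1]
n = 50                         # number of steps; larger n gives a finer approximation

def g(x):
    """Staircase approximation built from n+1 hidden threshold units."""
    k = np.arange(n + 1)
    h = (x > (k - 1) / n).astype(float)          # h_k = 1 when x > (k-1)/n
    fk = f(k / n)
    w = np.concatenate(([fk[0]], np.diff(fk)))   # w_k = f(k/n) - f((k-1)/n), w_0 = f(0)
    return h @ w                                  # linear output layer

xs = np.linspace(0.0, 1.0, 1000)
err = max(abs(g(x) - f(x)) for x in xs)
print(f"max |g - f| on [0,1]: {err:.4f}")         # shrinks as n grows
```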

4
Unit 11
Learning with a Neural Network (NN)
Learning with a NN
– Cost function (loss function, J)
• Measures how well the neural network is performing
• May be defined as the (mean) squared difference between the actual value (y)
and the predicted value (ŷ)
• J = ½ (y − ŷ)²
– Back propagation:
• Adjust the weight matrices (w_xh and w_hy) → gradient descent
• Objective: minimize the cost function → reach the lowest point
• After calculating the gradients, update the old weights by the weight-update rule:
• w_new = w − α ∂J/∂w, with α the learning rate
• Following the chain rule, the equations become:
• ∂J/∂w_hy = (∂J/∂ŷ)(∂ŷ/∂z_y)(∂z_y/∂w_hy)
• ∂J/∂ŷ = (ŷ − y), ∂ŷ/∂z_y = σ′(z_y), ∂z_y/∂w_hy = a
• ∂J/∂w_xh = (∂J/∂ŷ)(∂ŷ/∂z_y)(∂z_y/∂a)(∂a/∂z_h)(∂z_h/∂w_xh)
• = (ŷ − y) · σ′(z_y) · w_hy · σ′(z_h) · x
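A minimal sketch of a training loop using the weight-update rule and the chain-rule gradients above, for scalar weights and a single training pair (all numeric values are illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # sigma'(z)

x, y_true = 0.7, 0.3            # one training pair (illustrative)
w_xh, b_h, w_hy, b_y = 0.5, 0.0, -0.4, 0.0
alpha = 0.1                     # learning rate

for step in range(100):
    # forward pass
    z_h = x * w_xh + b_h
    a = sigmoid(z_h)
    z_y = a * w_hy + b_y
    y_hat = sigmoid(z_y)
    J = 0.5 * (y_true - y_hat) ** 2

    # backward pass (chain rule from the slide)
    dJ_dwhy = (y_hat - y_true) * d_sigmoid(z_y) * a
    dJ_dwxh = (y_hat - y_true) * d_sigmoid(z_y) * w_hy * d_sigmoid(z_h) * x

    # weight update rule: w_new = w - alpha * dJ/dw
    w_hy -= alpha * dJ_dwhy
    w_xh -= alpha * dJ_dwxh

print(round(J, 5))              # loss decreases toward 0
```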

5
Unit 11

RECURRENT NEURAL
NETWORK (RNN)

A. See, P. J. Liu, and C. Manning, “Get To The Point: Summarization with Pointer-Generator Networks,”
in Proc. of Annual Meeting of the Association for Computational Linguistics (ACL) 2017

6
Unit 11
Recurrent Neural Network (RNN)
Introduction
– Recurrent Neural Networks (RNNs) are a family of neural networks for
processing sequential data
– The neuron (hidden unit) can be seen as an “internal memory” that captures
and maintains information about the previous inputs.
Applications:
– Speech recognition, language modeling, translation, image captioning, …
Overview – Architecture (RNN)
– U: the input-to-hidden weight matrix
– W: the hidden-to-hidden weight matrix
– V: the hidden-to-output weight matrix
For the t-th step:
• Inputs: x^(t) ∈ ℝ^n
• Hidden units: h^(t) ∈ ℝ^d
• Outputs: y^(t) ∈ ℝ^m
• Weights: W ∈ ℝ^{d×d}, U ∈ ℝ^{d×n}, V ∈ ℝ^{m×d}
• a^(t) = W h^(t−1) + U x^(t)
• h^(t) = f(a^(t))
• ŷ^(t) = σ(V h^(t))
• f: activation function, e.g., tanh(·)
[Figure: the RNN cell unrolled through time; the same weights U, W, V are shared across all time steps.]
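A minimal NumPy sketch of the recurrence above, unrolled over a short sequence (the dimensions and random data are placeholders):

```python
import numpy as np

def rnn_forward(xs, W, U, V, h0):
    """Unroll a vanilla RNN: a_t = W h_{t-1} + U x_t, h_t = tanh(a_t), y_t = sigmoid(V h_t)."""
    h = h0
    ys = []
    for x_t in xs:                      # xs: sequence of input vectors
        a_t = W @ h + U @ x_t
        h = np.tanh(a_t)
        y_t = 1.0 / (1.0 + np.exp(-(V @ h)))
        ys.append(y_t)
    return np.array(ys), h

n, d, m, T = 3, 5, 2, 4                 # input, hidden, output sizes and sequence length
rng = np.random.default_rng(0)
W, U, V = rng.normal(size=(d, d)), rng.normal(size=(d, n)), rng.normal(size=(m, d))
xs = rng.normal(size=(T, n))
ys, h_T = rnn_forward(xs, W, U, V, np.zeros(d))
print(ys.shape, h_T.shape)              # (4, 2) (5,)
```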

7
Unit 11
Backpropagation through Time (BPTT)
Adjust the weight matrices (U, V, and W)
– Update the weight matrices with the sum of the gradients at each time step

• L_1, L_2, and L_3: the losses at each time step

– It is difficult to learn long-term dependencies with gradient descent (Bengio et al.,
1994)

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

8
Unit 11

From RNN to Long Short Term Memory


– Information is gated and limited to the range [−1, 1]

– Add a state (conveyor belt) variable to directly carry old information but control
its gain

– Then decide what new information to store in the state and what to output with
it

http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 9
Unit 11
A Long Short Term Memory (LSTM) Cell
Each LSTM block consists of a forget gate, an input gate, and an output gate
– Forget gate: f(t) = σ(W_f h(t−1) + U_f x(t) + b_f)
– External input gate: i(t) = σ(W_i h(t−1) + U_i x(t) + b_i)
– Internal state: c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ g(W_c h(t−1) + U_c x(t) + b_c)
– Output gate: o(t) = σ(W_o h(t−1) + U_o x(t) + b_o)
– The output of the LSTM: h(t) = o(t) ⊙ g(c(t))
– Output layer: y(t) = σ(V h(t))
(σ: sigmoid, g: squashing function such as tanh, ⊙: element-wise product)

[Figure: LSTM cell with the forget gate, input gate, and output gate acting on the internal state.]
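A minimal NumPy sketch of one LSTM step following the gate equations above (the parameter-dictionary layout and the sizes are illustrative assumptions):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following the gate equations above; p holds the weight matrices."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])            # forget gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])            # input gate
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x + p["bc"])      # candidate state
    c = f * c_prev + i * c_tilde                                     # internal state
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])            # output gate
    h = o * np.tanh(c)                                               # cell output
    return h, c

n, d = 3, 4                                                          # input and hidden sizes
rng = np.random.default_rng(0)
p = {f"W{g}": rng.normal(size=(d, d)) for g in "fico"}
p.update({f"U{g}": rng.normal(size=(d, n)) for g in "fico"})
p.update({f"b{g}": np.zeros(d) for g in "fico"})
h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, n)):                                    # run over a short sequence
    h, c = lstm_step(x, h, c, p)
print(h.shape, c.shape)                                              # (4,) (4,)
```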

10
Unit 11
Different Structures of RNNs
Different kinds of sequential data
– Example : sentences, speech, stock market, signal data
– Perform the same task, with shared parameters, for every element of a sequence,
with the output depending on the previous computations

Examples:
– Image captioning: label an image with a sentence
– Handle a sequence of images: label each frame of a video sequence
– Language translation, speech to text, text classification

11
Unit 11
A Simplified LSTM Cell Architecture
• net_in(t) = w_in,x x(t) + w_in,h h(t−1),   in(t) = σ(net_in(t))

• net_out(t) = w_out,x x(t) + w_out,h h(t−1),   out(t) = σ(net_out(t))

• net_c(t) = w_c,x x(t) + w_c,h h(t−1)

• c(t) = c(t−1) + in(t) ⊙ g(net_c(t))

• h(t) = out(t) ⊙ g(c(t))

• net_y(t) = w_y,h h(t),   y(t) = σ(net_y(t))

⊙: element-wise product
σ: sigmoid function
g: squashing function (e.g., tanh)

12
Unit 11
Architecture of the LSTM
⊙: element-wise product, σ: sigmoid function, g: squashing function

[Figure: LSTM cell architecture. The input x(t) and the recurrent input h(t−1) feed net_c(t), net_in(t), and net_out(t); the gated internal state c(t) produces the cell output h(t) = out(t) g(c(t)), which drives the output y(t).]
Unit 11
Forward Pass
The truncated derivatives of the output unit are:

∂y(t)/∂w_{l,m} = (∂y(t)/∂net_y(t)) (∂net_y(t)/∂w_{l,m})
             = σ′(net_y(t)) [ (∂w_{y,h}/∂w_{l,m}) h(t) + w_{y,h} (∂h(t)/∂w_{l,m}) ]

= σ′(net_y(t)) ·
   h(t)                                                                  ; l = y, m = h
   w_{y,h} ∂h(t)/∂w_{l,m}                                                ; l = c, in, or out, m = h
   w_{y,h} [ (∂o(t)/∂w_{l,m}) g(c(t)) + o(t) g′(c(t)) ∂c(t)/∂w_{l,m} ]   ; l = c, in, or out, m = x

Expanding the m = x cases:

= σ′(net_y(t)) ·
   w_{y,h} σ′(net_out(t)) g(c(t)) x(t)                                          ; l = out, m = x
   w_{y,h} o(t) g′(c(t)) [ ∂c(t−1)/∂w_{in,x} + σ′(net_in(t)) g(net_c(t)) x(t) ] ; l = in, m = x
   w_{y,h} o(t) g′(c(t)) [ ∂c(t−1)/∂w_{c,x} + i(t) g′(net_c(t)) x(t) ]          ; l = c, m = x

14
Unit 11
Forward Pass
The truncated derivatives of the cell block are:

∂o(t)/∂w_{l,m} = (∂o(t)/∂net_out(t)) (∂net_out(t)/∂w_{l,m}) = σ′(net_out(t)) δ_{l,out} ·
   x(t)      ; m = x
   h(t−1)    ; m = h

∂c(t)/∂w_{l,m} = ∂c(t−1)/∂w_{l,m} + (∂i(t)/∂w_{l,m}) g(net_c(t)) + i(t) g′(net_c(t)) (∂net_c(t)/∂w_{l,m})
   = ∂c(t−1)/∂w_{l,m} +
   δ_{l,in} σ′(net_in(t)) g(net_c(t)) x(t) + δ_{l,c} i(t) g′(net_c(t)) x(t)       ; m = x
   δ_{l,in} σ′(net_in(t)) g(net_c(t)) h(t−1) + δ_{l,c} i(t) g′(net_c(t)) h(t−1)   ; m = h

∂h(t)/∂w_{l,m} = (∂o(t)/∂w_{l,m}) g(c(t)) + o(t) g′(c(t)) (∂c(t)/∂w_{l,m})
   = δ_{l,out} (∂o(t)/∂w_{l,m}) g(c(t)) + (δ_{l,in} + δ_{l,c}) o(t) g′(c(t)) (∂c(t)/∂w_{l,m})

15
Unit 11
Backward Pass
The squared error at time t is given by
   E(t) = Σ_{k∈𝒦} (z_k(t) − y_k(t))²
where z_k(t) is the target of output unit y_k at time t.

Time t's contribution to w_{l,m}'s gradient-based update with learning rate α is
   Δw_{l,m}(t) = −α ∂E(t)/∂w_{l,m}

We define unit l's error at time step t by
   e_l(t) ≜ −∂E(t)/∂net_l(t) = 2 (z(t) − y(t)) ∂y(t)/∂net_l(t)

   = 2 (z(t) − y(t)) σ′(net_y(t))                                         ; l = y
     2 (z(t) − y(t)) σ′(net_y(t)) (∂net_y(t)/∂h(t)) (∂h(t)/∂net_out(t))   ; l = out

   = e_y(t)                                  ; l = y
     σ′(net_out(t)) g(c(t)) w_{y,h} e_y(t)   ; l = out

16
Unit 11
Backward Pass
We define the internal-state error at time step t by

   e_c(t) ≜ −∂E(t)/∂c(t)
        = 2 (z(t) − y(t)) σ′(net_y(t)) (∂net_y(t)/∂h(t)) (∂h(t)/∂c(t))
        = o(t) g′(c(t)) w_{y,h} e_y(t)

We also define

   ∂c(t)/∂w_{c,m}  = ∂c(t−1)/∂w_{c,m}  + i(t) g′(net_c(t)) · { x(t) ; m = x,   h(t−1) ; m = h }

   ∂c(t)/∂w_{in,m} = ∂c(t−1)/∂w_{in,m} + σ′(net_in(t)) g(net_c(t)) · { x(t) ; m = x,   h(t−1) ; m = h }

17
Unit 11
Weight Update
Weight updates :
   Δw_{y,h}(t) = α e_y(t) h(t)
   Δw_{out,m}(t) = α e_out(t) · { x(t) ; m = x,   h(t−1) ; m = h }
   where e_y(t) and e_out(t) were obtained on p.16

   Δw_{in,m}(t) = α e_c(t) ∂c(t)/∂w_{in,m}
   Δw_{c,m}(t) = α e_c(t) ∂c(t)/∂w_{c,m}
   where e_c(t), ∂c(t)/∂w_{in,m}, and ∂c(t)/∂w_{c,m} were obtained on p.17

18
Unit 11
Example 1
An RNN example for digit classification
– Build an RNN to classify digits with TensorFlow.
– Sample code :
– https://github.com/aymericdamien/TensorFlow-
Examples/blob/master/notebooks/3_NeuralNetworks/recurrent_network.ipynb

MNIST Dataset Overview

More info: http://yann.lecun.com/exdb/mnist/

19
Unit 11
Example 2
An RNN example for regression
– Build an RNN and use a sine wave to predict a cosine wave with TensorFlow.
– Sample code :
– https://github.com/MorvanZhou/tutorials/blob/master/tensorflowTUT/tf20_RNN2.2/fu
ll_code.py

Training process

20
Unit 11

VARIATIONAL
AUTOENCODER (VAE)

D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in Proc. of ICLR 2014

21
Unit 11
Autoencoder
Overview – Architecture
– A neural network made up of two parts:
– An encoder network that compresses high-dimensional input data into a lower-
dimensional representation vector
– A decoder network that decompresses a given representation vector back to the
original domain

[Figures: more examples of reconstructed paintings; representing images A, B, and C in the latent space; some generated images.]

22
Unit 11
Variational Auto-Encoder (VAE)
Compare with Autoencoder
– There are two parts that we need to change : the encoder and the loss function.
• The difference between the encoder in an autoencoder and a variational autoencoder
Overview – Architecture
• Observable Data : 𝒙
• Latent Variable : 𝒛
• θ = {W_i, b_i}: the weights and biases of the probabilistic decoder p_θ(x|z)
• φ = {W_j, b_j}: the weights and biases of the probabilistic encoder for the approximation q_φ(z|x)

[Figure: VAE architecture. The input x is mapped by the probabilistic encoder q_φ(z|x) to a mean μ and standard deviation σ; a latent vector is sampled as z = μ + σ ⊙ ε with ε ~ N(0, I); the probabilistic decoder p_θ(x|z) maps z back to a reconstruction x′, and ideally x ≈ x′. The latent vector z is a compressed low-dimensional representation of the input.]

23
Unit 11
Variational Inference
Definition
• Observable data: x
• Latent variable: z
• Likelihood: p_θ(x|z)
• Posterior distribution of the latent variable:
   p_θ(z|x) = p_θ(x, z) / p_θ(x) = p_θ(x|z) p_θ(z) / ∫ p_θ(x|z) p_θ(z) dz
• q_φ(z|x) for the approximation of the posterior p_θ(z|x)

• The KL divergence of q_φ(z|x) and p_θ(z|x):
   KL( q_φ(z|x) ‖ p_θ(z|x) ) = E_{q_φ(z|x)}[ log q_φ(z|x) − log p_θ(z|x) ]
24
Unit 11
Variational Inference
• The KL divergence of q_φ(z|x) and p_θ(z|x):
   KL( q_φ(z|x) ‖ p_θ(z|x) ) = E_{q_φ(z|x)}[ log q_φ(z|x) ] − E_{q_φ(z|x)}[ log p_θ(x, z) ] + log p_θ(x)

• Because the KL divergence is always greater than or equal to zero,
   log p_θ(x) ≥ E_{q_φ(z|x)}[ log p_θ(x, z) ] − E_{q_φ(z|x)}[ log q_φ(z|x) ]

• Let ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x, z) − log q_φ(z|x) ]  (the variational lower bound)

• We want to maximize the marginal likelihood log p_θ(x)

• This is equivalent to maximizing the lower bound ℒ(θ, φ; x)
25
Unit 11
Lower Bound
• Differentiate and optimize the lower bound
   ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p_θ(z) )

• The first term (reconstruction term)
   • We have z ~ q_φ(z|x)
   • Reparameterize z as z = g_φ(ε, x) with a set of samples ε^(l) ~ p(ε), e.g., ε ~ N(0, I)
   • q_φ(z|x) dz = p(ε) dε ⇒ ∫ q_φ(z|x) log p_θ(x|z) dz = ∫ p(ε) log p_θ(x|z) dε
• Monte Carlo estimates of the expectation of log p_θ(x|z) w.r.t. q_φ(z|x):
   • E_{q_φ(z|x)}[ log p_θ(x|z) ] = E_{p(ε)}[ log p_θ(x|z) ] ≅ (1/L) Σ_{l=1}^{L} log p_θ(x|z^(l))
• When p_θ(x|z) is a multivariate Bernoulli of dimension N:
   • log p_θ(x|z^(l)) = Σ_{i=1}^{N} [ x_i log x̂_i + (1 − x_i) log(1 − x̂_i) ]
   • where x̂ = tanh(W_2 (W_1 z + b_1) + b_2), θ = {W_1, b_1, W_2, b_2}
   • E_{q_φ(z|x)}[ log p_θ(x|z) ] ≅ (1/L) Σ_l Σ_i [ x_i log x̂_i + (1 − x_i) log(1 − x̂_i) ]
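A small NumPy sketch of the reparameterization trick and the Monte Carlo estimate of the reconstruction term above (the one-layer decoder and all sizes are illustrative stand-ins, and a sigmoid output is used here so the Bernoulli mean stays in (0, 1), whereas the slides use a tanh decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
J, N, L = 2, 6, 10                    # latent dim, data dim, number of MC samples (illustrative)

x = rng.integers(0, 2, size=N).astype(float)            # one binary datapoint
mu, log_var = rng.normal(size=J), rng.normal(size=J)    # encoder outputs: q(z|x) = N(mu, sigma^2)
W, b = rng.normal(size=(N, J)) * 0.1, np.zeros(N)       # a toy one-layer decoder (illustrative)

def log_bernoulli(x, x_hat, eps=1e-7):
    """log p(x|z) for a multivariate Bernoulli with mean x_hat."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

# Reparameterization: z = mu + sigma * eps, eps ~ N(0, I), so the expectation over
# q(z|x) becomes an expectation over p(eps) that can be estimated by Monte Carlo.
sigma = np.exp(0.5 * log_var)
estimate = 0.0
for _ in range(L):
    eps = rng.normal(size=J)
    z = mu + sigma * eps
    x_hat = 1.0 / (1.0 + np.exp(-(W @ z + b)))          # decoder mean
    estimate += log_bernoulli(x, x_hat)
estimate /= L                                            # (1/L) sum_l log p(x | z^(l))
print(round(estimate, 3))
```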

26
Unit 11
Lower Bound
• Differentiate and optimize the lower bound
   ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p_θ(z) )

• The second term (KL term)
   • We give the solution when the prior p_θ(z) = N(0, I) and the posterior
     approximation q_φ(z|x^(i)) = N(μ, σ²) are Gaussian
   • Let J be the dimensionality of z
   • At datapoint i, let μ_j and σ_j denote the j-th element of these vectors
   • −KL( q_φ(z|x^(i)) ‖ p_θ(z) ) = −½ Σ_{j=1}^{J} ( σ_j² + μ_j² − log(σ_j²) − 1 )

• Lower bound (per datapoint x^(i), combining the two terms)
   ℒ(θ, φ; x^(i)) ≅ (1/L) Σ_{l=1}^{L} Σ_{i=1}^{N} [ x_i log x̂_i + (1 − x_i) log(1 − x̂_i) ] − ½ Σ_{j=1}^{J} ( σ_j² + μ_j² − log(σ_j²) − 1 )
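The closed-form −KL term above can be verified numerically; the following sketch (illustrative μ, log σ², and dimensionality) compares it against a Monte Carlo estimate of E_q[log p(z) − log q(z)]:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 4                                           # dimensionality of z (illustrative)
mu, log_var = rng.normal(size=J), rng.normal(size=J) * 0.3

# Closed form: -KL( N(mu, sigma^2 I) || N(0, I) ) = -1/2 * sum_j (sigma_j^2 + mu_j^2 - log sigma_j^2 - 1)
neg_kl_closed = -0.5 * np.sum(np.exp(log_var) + mu**2 - log_var - 1.0)

# Monte Carlo check of the same quantity: E_q[ log p(z) - log q(z) ]
sigma = np.exp(0.5 * log_var)
z = mu + sigma * rng.normal(size=(200_000, J))
log_q = -0.5 * np.sum(((z - mu) / sigma)**2 + np.log(2 * np.pi) + log_var, axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
neg_kl_mc = np.mean(log_p - log_q)

print(round(neg_kl_closed, 3), round(neg_kl_mc, 3))   # the two values should agree closely
```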

27
Unit 11
Update Parameters Using Gradients
 Lower Bound

() () ()

where
• 𝑥 = tanh 𝑊 𝑊 ×𝑧+𝑏 +𝑏
()
• 𝜇 : 𝑧 = 𝑡𝑎𝑛ℎ(𝑡𝑎𝑛ℎ 𝑥 × 𝑊 +𝑏 ×𝑊 +𝑏 )
()
• 𝑙𝑜𝑔 ( 𝜎 ):𝑧 ( ) = 𝑡𝑎𝑛ℎ(𝑡𝑎𝑛ℎ 𝑥 × 𝑊 +𝑏 ×𝑊 +𝑏 )
• 𝜃= 𝑊 , 𝑏 , 𝑊 , 𝑏 , 𝜙= 𝑊 , 𝑏 ,W , 𝑏 ,𝑊 ,𝑏

 Gradient for update the parameters


• ,
= Σ Σ 𝑥 log 𝑥 + 1 − 𝑥 log 1 − 𝑥
() () ()
+ − Σ Σ 𝜎 + 𝜇 − 𝑙𝑜𝑔 ( 𝜎 )−1

28
Unit 11
Example
A Variational Auto-Encoder Example
– Build a variational auto-encoder (VAE) to generate digit images from a noise
distribution with TensorFlow.
– Sample Code :
– https://github.com/aymericdamien/TensorFlow-
Examples/blob/master/notebooks/3_NeuralNetworks/variational_autoencoder.ipynb

Build a manifold of generated digits


MNIST Dataset Overview

More info: http://yann.lecun.com/exdb/mnist/

29
Unit 11

GENERATIVE ADVERSARIAL
NETWORK (GAN)

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, “Generative Adversarial Networks,”
in Proc. of NIPS 2014

30
Unit 11
Generative Adversarial Network (GAN)
A new framework for estimating generative models
– A novel adversarial process to train two models:
Generative model, G, and discriminative model, D

The difference between Generative Model and GAN


– Generative Model
• Assume a distribution for the data (a posterior distribution)
• Design a loss function to decrease the divergence between the assumed
distribution and the real data distribution (difficult)
– Generative Adversarial Network (GAN)
• Generate samples from noise variables
• A discriminative model learns to determine whether a sample is
from the model distribution or the data distribution

31
Unit 11
Generative Adversarial Network (GAN)
Basic Idea of Generative Adversarial Networks (GAN)
– Generator : a team of counterfeiters, trying to produce fake
currency and use it without detection.

[Figure: noise vector → Generator → fake sample]

– Discriminator : police, trying to detect the counterfeit currency.


• Supervised learning, dividing inputs into two classes (real or fake)

[Figure: sample → Discriminator → scalar: 1 (“It’s true!”) or 0 (“It’s fake!”)]

32
Unit 11
Proposed Method
Architecture

[Figure: GAN architecture. Real samples (label 1) and generated samples (label 0, produced by the Generator from a noise vector) are fed to the Discriminator, whose scalar output (0 or 1) drives the loss.]

33
Unit 11
Training Process
1. Update the Discriminator
• Fix the Generator; draw samples from both the real world (label = 1) and the generated images (label = 0)
• Train the Discriminator to distinguish between real-world and generated images

2. Update the Generator
• Fix the Discriminator
• Sample from the Generator
• Backpropagate the error through the Discriminator to update the Generator parameters θ_g

34
Unit 11
The Discriminator’s Cost Function

[Figure: a sample (real x or generated G(z)) → Discriminator D → sigmoid value: 1 = true, 0 = false]

The discriminator’s cost function:

   J^(D)(θ_D, θ_G) = −½ E_{x~p_data(x)}[ log D(x) ] − ½ E_{z~p_z(z)}[ log(1 − D(G(z))) ]

– Cross-entropy for a binary classifier using the Bernoulli distribution


– The classifier is trained with two training sets of equal probability
• Dataset, where the label is 1
• Generator, where the label is 0
– Minimizes negative log-probability of the discriminator being
correct
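A minimal sketch of the discriminator’s cost on a batch of real samples (label 1) and generated samples (label 0); the logistic-model discriminator and the Gaussian “real”/“fake” data below are placeholders for illustration, not the paper’s networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x, w, b):
    """Placeholder discriminator: a logistic model returning P(sample is real)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """J^(D) = -1/2 E[log D(x)] - 1/2 E[log(1 - D(G(z)))]."""
    d_real = np.clip(d_real, eps, 1 - eps)
    d_fake = np.clip(d_fake, eps, 1 - eps)
    return -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))

w, b = rng.normal(size=2), 0.0
x_real = rng.normal(loc=2.0, size=(64, 2))     # samples standing in for the data distribution
x_fake = rng.normal(loc=-2.0, size=(64, 2))    # samples standing in for a (fixed) generator
loss = discriminator_loss(D(x_real, w, b), D(x_fake, w, b))
print(round(loss, 3))                          # smaller when D separates real from fake well
```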

35
Unit 11
The Generator’s Cost Function

[Figure: noise z (e.g., from a normal distribution) → Generator G → sample G(z) → Discriminator D → sigmoid value]

The generator’s cost function (for the minimax game):

   J^(G)(θ_D, θ_G) = E_{z~p_z(z)}[ log(1 − D(G(z))) ]

36
Unit 11
Minimax Game – Value Function
Summary of the D and G cost functions
– For D:  max_D  E_{x~p_data(x)}[ log D(x) ] + E_{z~p_z(z)}[ log(1 − D(G(z))) ]
– For G:  min_G  E_{z~p_z(z)}[ log(1 − D(G(z))) ]

– Minimax two-player game over D and G:
   min_G max_D V(D, G) = E_{x~p_data(x)}[ log D(x) ] + E_{z~p_z(z)}[ log(1 − D(G(z))) ]

– The two models are trainable simultaneously through back-propagation
over the value function using neural networks
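A small numerical sketch of the value function: here p_data and p_g are 1-D Gaussians chosen for illustration, so the optimal discriminator D*(x) = p_data(x)/(p_data(x) + p_g(x)) is available in closed form, and V(D*, G) approaches −2 log 2 as p_g approaches p_data (as derived in the appendix):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def value_function(x_data, x_gen, mu_g):
    # Optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x))
    d_star = lambda x: gauss_pdf(x, 0, 1) / (gauss_pdf(x, 0, 1) + gauss_pdf(x, mu_g, 1))
    # V(D*, G) = E_{x~p_data}[log D*(x)] + E_{x~p_g}[log(1 - D*(x))]
    return np.mean(np.log(d_star(x_data))) + np.mean(np.log(1.0 - d_star(x_gen)))

x_data = rng.normal(0.0, 1.0, size=100_000)          # p_data = N(0, 1)
for mu_g in [3.0, 1.0, 0.0]:                         # generator distributions approaching p_data
    x_gen = rng.normal(mu_g, 1.0, size=100_000)      # p_g = N(mu_g, 1)
    print(mu_g, round(value_function(x_data, x_gen, mu_g), 3))
# As p_g -> p_data, the value approaches -2 log 2 (about -1.386).
```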

37
Unit 11
Example
A Generative Adversarial Network Example
– Build a Generative Adversarial Network (GAN) to generate digit images from a
noise distribution with TensorFlow.
– Sample Code :
– https://github.com/aymericdamien/TensorFlow-
Examples/blob/master/notebooks/3_NeuralNetworks/gan.ipynb

Generate images from noise


using the generator network

MNIST Dataset Overview

More info: http://yann.lecun.com/exdb/mnist/

38
Unit 11

APPENDIX :
FINDING OPTIMAL MODEL
The global minimum of the virtual training criterion is achieved if
and only if p_g = p_data

39
Unit 11
Appendix : Finding Optimal Discriminator
Objective function
   V(D, G) = E_{x~p_data(x)}[ log D(x) ] + E_{z~p_z(z)}[ log(1 − D(G(z))) ]
           = ∫ p_data(x) log D(x) dx + ∫ p_g(x) log(1 − D(x)) dx
• Fix G and maximize over D
• For any (a, b) ∈ ℝ² \ {(0, 0)}, consider the function f(D) = a · log D + b · log(1 − D)

• ⇒ f′(D) = a · (1/D) + b · (1/(1 − D)) · (−1) = 0 ⇒ D = a/(a + b)

• ⇒ f″(a/(a+b)) = −a/D² − b/(1 − D)² < 0

• f achieves its maximum in (0, 1) at D = a/(a + b)

The optimal discriminator D for any given generator G:
   D*_G(x) = p_data(x) / ( p_data(x) + p_g(x) )

40
Unit 11
Appendix : Finding Optimal Discriminator

• Fix D = D*_G and minimize over G:
   C(G) = max_D V(G, D) = V(D*_G, G)
• The global minimum of the virtual training criterion C(G) is achieved
if and only if
   p_g = p_data, at which point D*_G(x) = p_data(x) / (p_data(x) + p_g(x)) = ½

• The global minimum value is C(G) = −log 4
41
Unit 11
Appendix : Finding optimal generator
 The global minimum
min_G max_D V(D, G) = min_G V(D*, G)
   = E_{x~p_data}[ log D*(x) ] + E_{z~p_z}[ log(1 − D*(G(z))) ]
   = E_{x~p_data}[ log D*(x) ] + E_{x~p_g}[ log(1 − D*(x)) ]
   = E_{x~p_data}[ log ( p_data(x) / (p_data(x) + p_g(x)) ) ] + E_{x~p_g}[ log ( p_g(x) / (p_data(x) + p_g(x)) ) ]
   = ∫ p_data(x) log [ (p_data(x) · ½) / ((p_data(x) + p_g(x)) · ½) ] dx + ∫ p_g(x) log [ (p_g(x) · ½) / ((p_data(x) + p_g(x)) · ½) ] dx
   = −2 log 2 + ∫ p_data(x) log [ p_data(x) / ((p_data(x) + p_g(x))/2) ] dx + ∫ p_g(x) log [ p_g(x) / ((p_data(x) + p_g(x))/2) ] dx

42
Unit 11
Appendix : Finding optimal generator

⇒ −log 4 + ∫ p_data(x) log [ p_data(x) / ((p_data(x) + p_g(x))/2) ] dx + ∫ p_g(x) log [ p_g(x) / ((p_data(x) + p_g(x))/2) ] dx

• Kullback–Leibler divergence: KL(P ‖ Q) = ∫ p(x) ln [ p(x)/q(x) ] dx ≥ 0

⇒ −log 4 + KL( p_data ‖ (p_data + p_g)/2 ) + KL( p_g ‖ (p_data + p_g)/2 )
   ≥ −log 4

• The global minimum is achieved if and only if p_g = p_data

• For p_g = p_data, D* = ½ ⇒ V(D*, G) = −2 log 2

43
Unit 11
Global Optimum at p_g = p_data

44
Unit 11

CONVOLUTIONAL
NEURAL NETWORK (CNN)

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,”
in Proc. of Advances in Neural Information Processing Systems (NIPS) 2012

45
Unit 11
Convolutional Neural Network Structure
 It is typical to decompose a CNN into two subnetworks:
1) The feature extraction (FE) subnet :
• Multiple convolutional layers
• Nonlinear activation function (Rectifier)
• Pooling, subsampling (Max pooling, Average pooling)
2) The decision-making (DM) subnet :
• Classification, fully connected layers
[Figure: Input image → Convolution + ReLU → Feature map → Max pooling → Feature map → Convolution + ReLU → Feature map → Max pooling → Feature map → Fully-connected → Output. Legend: 2D input image, 2D feature map, conv filter.]

46
Unit 11
Convolution Operations
Given an input feature map x_1 (4×4, entries A…P), a 2×2 kernel with weights W, X, Y, Z,
zero padding of 1, and a stride of 2, each output entry is the sum of products of the
(180°-rotated) kernel with the overlapping window of the padded input:

Zero-padded input:          Kernel / rotated kernel:
0 0 0 0 0 0                 W X        Z Y
0 A B C D 0                 Y Z        X W
0 E F G H 0
0 I J K L 0
0 M N O P 0
0 0 0 0 0 0

Output (3×3):
0Z+0Y+0X+AW   0Z+0Y+BX+CW   0Z+0Y+DX+0W
0Z+EY+0X+IW   FZ+GY+JX+KW   HZ+0Y+LX+0W
0Z+MY+0X+0W   NZ+OY+0X+0W   PZ+0Y+0X+0W
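A minimal NumPy sketch of the operation above: a 2×2 kernel slid over the zero-padded input with stride 2; numeric values stand in for the symbolic entries A…P and W…Z:

```python
import numpy as np

def conv2d(x, k, stride=2, pad=1):
    """Convolution: rotate the kernel 180 degrees, slide it over the zero-padded input,
    and take the sum of element-wise products at each position."""
    k = np.flipud(np.fliplr(k))                       # 180-degree rotation of the kernel
    xp = np.pad(x, pad)                               # zero padding
    H = (xp.shape[0] - k.shape[0]) // stride + 1
    W = (xp.shape[1] - k.shape[1]) // stride + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[i*stride:i*stride + k.shape[0], j*stride:j*stride + k.shape[1]]
            out[i, j] = np.sum(patch * k)
    return out

x = np.arange(1, 17, dtype=float).reshape(4, 4)       # stands in for the entries A..P
k = np.array([[1.0, 2.0],
              [3.0, 4.0]])                            # stands in for the weights W, X, Y, Z
print(conv2d(x, k, stride=2, pad=1))                  # 3x3 output, matching the table above
```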

47
Unit 11
Feature Maps Channel Mapping from One Layer
to the Next Layer
• Given two channels of input feature maps: x_1, x_2
• Given six weight matrices (kernels): w_{1,1}, w_{1,2}, w_{2,1}, w_{2,2}, w_{3,1}, w_{3,2}
• Let each output channel be the sum of the convolutions over all input channels:
• y_1 = x_1 * w_{1,1} + x_2 * w_{1,2},   y_2 = x_1 * w_{2,1} + x_2 * w_{2,2}

• y_3 = x_1 * w_{3,1} + x_2 * w_{3,2}

48
Unit 11
Motivation for Using Convolution Networks
Convolution leverages three important ideas that can help
improve a machine learning system:
1. Sparse interactions
2. Parameter sharing
3. Equivariance representations
Convolution also allows for working with inputs of variable sizes
The figures illustrate the differences in network links between
CNN and the traditional neural network

[Figures: link patterns of a CNN (sparse, local connections) versus a traditional fully connected neural network]

49
Unit 11
Sparse Connectivity
We highlight one input unit, x3 , and also the output units, si , that
are affected by this unit
– When s is formed by convolution with a kernel of width 3, only
three outputs are affected. (Left)
– When s is formed by matrix multiplication, connectivity is no longer
sparse, so all of the outputs are affected by x (Right)

50
Unit 11
An Example of Convolution

51
Unit 11
The Receptive Field of Deep Networks
The receptive field of the units in the deeper layers of a
convolutional network is larger than the receptive field of the
units in the shallow layers
– Even though direct connections in a convolutional net are very sparse, units
in the deeper layers can be indirectly connected to all or most of the input
image

52
Unit 11
Parameter Sharing
Parameter sharing refers to using the same parameter for more
than one function in a model
– Particular features are captured by the outputs of different filters
The parameter sharing used by the convolution operation means
that rather than learning a separate set of parameters for every
location, we learn only one set
– Black arrows below indicate the connections that use a particular parameter
in two different models

53
Unit 11
Results of Convolutions

Only the first 6 channels of each layer are shown in this figure.

Heat maps of intermediate CNN layer outputs for a video frame obtained from UCF Sports, from “Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition.” Available from: https://www.researchgate.net/figure/HSports_fig2_315877962

54
Unit 11
Results of Convolutions

Fig. 1. The outputs of the different layers of the CNN and the determination of the exact position of the target by
applying a coarse-to-fine method. The red boxes are the estimated location of the target at each layer. W 1, W 2
and W 3 refer to the correlation filters. Blue and red circles show the previous and current positions, respectively.

Deep convolutional particle filter for visual tracking. Available from: https://www.researchgate.net/figure/The-outputs-of-the-different-
layers-of-the-CNN-and-the-determination-of-the-exact_fig1_323351455

55
Unit 11
Equivariance Representations
Equivariant means that if the input changes, the output changes
in the same way
– A function f(x) is equivariant to a function g if f(g(x)) = g(f(x))
Example of equivariance:
– I(x, y) is the image brightness at point (x, y)
– I′ = g(I) is the image function with I′(x, y) = I(x − 1, y), i.e., g shifts every pixel one unit to the right
– If we apply g to I and then apply convolution, the output
will be the same as if we applied convolution to I first, then applied the
transformation g to the output
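A small sketch of this equivariance in 1-D (circular shift and circular convolution are used so boundary effects do not break the equality; the signal and kernel are arbitrary):

```python
import numpy as np

def circular_conv(x, k):
    """Circular (wrap-around) convolution of a 1-D signal x with kernel k."""
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        for j, kj in enumerate(k):
            y[i] += kj * x[(i - j) % n]
    return y

x = np.array([0., 1., 3., 2., 0., 0., 1., 0.])
k = np.array([1., -1.])          # simple edge-detecting kernel
g = lambda s: np.roll(s, 1)      # g shifts the signal one step to the right

lhs = circular_conv(g(x), k)     # shift first, then convolve
rhs = g(circular_conv(x, k))     # convolve first, then shift
print(np.allclose(lhs, rhs))     # True: convolution is equivariant to translation
```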

56
Unit 11

Rectifier Design
 Three rectifiers that are often used:
1. Sigmoid: σ(z) = 1 / (1 + e^(−z))
2. ReLU: f(z) = max(0, z)

3. Leaky ReLU: f(z) = max(αz, z), with a small positive slope α for negative inputs

 Without rectifier, the system cannot differentiate the following two cases:
1. A positive response at the first layer followed by a negative filter weight at
the second layer
2. A negative response at the first layer followed by a positive filter weight at
the second layer.
57
Unit 11
Single-Layer RECOS model
RECOS: The compound operation of “convolution followed by nonlinear
activation” serves as a mechanism to conduct “REctified COrrelations on a
Sphere (RECOS)”
Filter weights: A set of anchor vectors selected for each RECOS model to
capture and represent frequently occurring patterns
Signal convolution: can also be viewed as signal correlation or projection
Feature extraction (FE) subnet: conducts clustering aiming at a new
representation through a sequence of RECOS transforms
[Figure: a RECOS unit projecting the input onto a set of anchor vectors.]
58
Unit 11
Origin-Centered Unit Sphere
Let x be an arbitrary vector on a unit sphere centered at the
origin in the N-dimensional space, denoted by S

The correlation can be viewed as a projection
from an anchor vector a_k to the input x

The geodesic distance between vectors x and a_k
in S is proportional to the magnitude of their angle θ

Correlation rectification in the unit circle

For angles θ_1 and θ_2 less than 90°, the projections are positive. The geodesic distance is
a monotonically decreasing function of the projection value: the larger the
correlation, the shorter the distance.
For an angle θ_3 larger than 90°, the projection is negative. The two vectors x and a_3 are far
apart in terms of the geodesic distance, yet their correlation is strong (although a
negative one).

59
Unit 11
Why Is a Nonlinear Activation Needed?
 Without rectification, a system cannot differentiate the following two cases:
1. A positive response at the first layer followed by a negative filter weight at
the second layer
2. A negative response at the first layer followed by a positive filter weight at
the second layer
 For this reason, it is essential to set a negative correlation value at each
layer to zero (or almost zero) to avoid confusion in a multi-layer
RECOS system
1. Let a_k be an anchor vector (filter)
2. Change the filter weights to their negative values: a_{r,k} = −a_k

60
Unit 11
Pooling Function: Max Pooling
[Figure: Input image → Convolution + ReLU → Feature map → Max pooling → Feature map → Convolution + ReLU → Feature map → Max pooling → Feature map → Fully-connected → Output]

A pooling function replaces the output of the net at a certain
location with a summary statistic of the nearby inputs
– Ex: for max pooling with stride = 2 and pooling size = 2, the operation is:
Input (4×4):          Output (2×2):
1  5 10  5            8 10
6  8  4  3            7  6
2  2  6  1
7  5  2  1
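A minimal NumPy sketch that reproduces the max-pooling example above:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: take the maximum over each size x size window, moving by `stride`."""
    H = (x.shape[0] - size) // stride + 1
    W = (x.shape[1] - size) // stride + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = x[i*stride:i*stride + size, j*stride:j*stride + size].max()
    return out

x = np.array([[1, 5, 10, 5],
              [6, 8,  4, 3],
              [2, 2,  6, 1],
              [7, 5,  2, 1]], dtype=float)
print(max_pool(x))        # [[ 8. 10.]
                          #  [ 7.  6.]]
```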

61
Unit 11
Pooling Introduces Invariance to Translation
Other types of pooling functions:
1. Max pooling : reports the maximum output within a rectangular neighborhood
2. Average of a rectangular neighborhood
3. L2 norm of a rectangular neighborhood
4. Weighted average based on the distance from the central pixel
In all cases, pooling helps make the representation become
approximately invariant to small translations of the inputs
• Same network but inputs are shifted by one pixel

Every input value has changed, but only half the values of the output have changed, because the max-
pooling units are only sensitive to the maximum value in the neighborhood, not its exact location

62
Unit 11
History of Convolutional Neural Network
Invented by Yann LeCun in the 1980s – LeNet-5
Improved by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
in 2012, using rectified linear neurons and dropout regularization
Refer to CIFAR:

63
Unit 11
CNN : The Feature Extraction (FE) Subnet
– w_{k,i}: the kernel (weight) from the i-th channel at layer l−1 to the k-th
channel at layer l
– x_k: the (pre-activation) k-th channel at layer l
– b_k: the bias of the k-th channel at layer l
– N_{l−1}: number of channels at layer l−1
– f: nonlinear activation function (e.g., ReLU)

Each channel at layer l sums the convolutions over all channels of layer l−1:
   x_k = b_k + Σ_{i=1}^{N_{l−1}} s_i * w_{k,i},    s_k = f(x_k), followed by pooling to give y_k

[Figure: layers l−1, l, and l+1. The channels s_i of layer l−1 are convolved with the kernels w_{k,i}, summed with the bias b_k to form x_k, then passed through f and pooling to give y_k, which feeds layer l+1.]
64
Unit 11
Operation of 2D Convolution
• Convolution of the output s_i(m, n) of the current-layer neuron with the kernel
w_{k,i} forms the input x_k(m, n) of the k-th neuron at the next layer

[Figure: a 9×9 map s_i(m, n) convolved with a 3×3 kernel w_{k,i} produces a 7×7 map x_k(m, n); each output entry is the sum of the 3×3 element-wise products over the corresponding input window.]

65
Unit 11
Gradient Descent in CNN
A. Intra-back-propagation (BP) within a CNN neuron:
   Δ_k(m, n) ≜ ∂E/∂x_k(m, n) = Δs_k(m, n) · f′(x_k(m, n))
Note: Δs_k(m, n) ≜ ∂E/∂s_k(m, n) denotes the error at the neuron output, and Δ_k the error at its input.

66
Unit 11
Gradient Descent in CNN
B. Inter-BP among CNN layers (from layer l+1 back to layer l):

   Δs_i = ∂E/∂s_i = Σ_k conv2D( Δ_k, rot180(w_{k,i}) )

where Δ_k is the error at the input x_k of the k-th channel of layer l+1.

For w_{k,i} of dim 3×3, s_i(m, n) contributes to the following entries of x_k:
   x_k(m − 1, n − 1) = ⋯ + s_i(m, n) w_{k,i}(0, 0) + ⋯
   x_k(m − 1, n)     = ⋯ + s_i(m, n) w_{k,i}(0, 1) + ⋯
   ……
   x_k(m + 1, n + 1) = ⋯ + s_i(m, n) w_{k,i}(2, 2) + ⋯

[Figure: layer l+1. Each x_k is formed from the s_i of layer l through the kernels w_{k,i} and bias b_k; the errors Δ_k flow back to Δs_i.]

67
Unit 11
Gradient Descent in CNN
C. Computation of the Weight (Kernel) and Bias Sensitivities

   ∂E/∂w_{k,i}(p, q) = Σ_{m,n} ( ∂E/∂x_k(m, n) ) ( ∂x_k(m, n)/∂w_{k,i}(p, q) )

where x_k = b_k + ⋯ + s_i * w_{k,i} + ⋯, e.g., for a 3×3 kernel:

   x_k(0, 0) = ⋯ + w_{k,i}(1, 2) s_i(0, −1) + w_{k,i}(1, 1) s_i(0, 0) + w_{k,i}(1, 0) s_i(0, 1) + ⋯
   x_k(0, 1) = ⋯ + w_{k,i}(1, 1) s_i(0, 1) + w_{k,i}(1, 0) s_i(0, 2) + w_{k,i}(0, 1) s_i(1, 1) + ⋯
   x_k(1, 0) = ⋯ + w_{k,i}(1, 1) s_i(1, 0) + w_{k,i}(1, 0) s_i(1, 1) + w_{k,i}(0, 1) s_i(2, 0) + ⋯
   x_k(m, n) = ⋯ + w_{k,i}(1, 1) s_i(m, n) + w_{k,i}(1, 0) s_i(m, n + 1) + w_{k,i}(0, 1) s_i(m + 1, n) + ⋯

   x_k(m, n) = Σ_{r,t} w_{k,i}(r + 1, t + 1) s_i(m − r, n − t)

   ∂E/∂w_{k,i}(p, q) = Σ_{m,n} Δ_k(m, n) s_i(m + 1 − p, n + 1 − q)
                     = conv2D( Δ_k, s_i(p − 1, q − 1) ),  p, q = 0, 1, 2

   ∂E/∂b_k = Σ_{m,n} ( ∂E/∂x_k(m, n) ) ( ∂x_k(m, n)/∂b_k ) = Σ_{m,n} Δ_k(m, n)

68
Unit 11
Derivation of the Cost Function
Two-class classification
– The posterior probability of class C_1 can be written as:

   P(C_1|X_n) = P(X_n|C_1) P(C_1) / [ P(X_n|C_1) P(C_1) + P(X_n|C_2) P(C_2) ]
             = 1 / ( 1 + exp(−a) ) = σ(a),   where a = ln [ P(X_n|C_1) P(C_1) / ( P(X_n|C_2) P(C_2) ) ]

   p(C_1|X_n) = y_n = σ(Wᵀ X_n),   p(C_2|X_n) = 1 − p(C_1|X_n)

– The likelihood function

   P(t|W) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n},   where t = [t_1, t_2, …, t_N]ᵀ

– We define the cost function as:

   E(W) = −(1/N) ln P(t|W) = −(1/N) Σ_{n=1}^{N} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ]

• X_n ∈ ℝ^{H×L}: the n-th training image (H, L: height and width of the image)   • W: weights
• t_n: training label of the n-th datum, t_n ∈ {0, 1}                             • N: training set size
• C_1: class one, C_2: class two                                                  • σ(·): logistic sigmoid function

69
Unit 11
Derivation of the Cost Function
Multiclass classification
– The generalization of the logistic sigmoid to a class number K > 2
– The posterior probabilities are given by the softmax:

   P(C_k|X_n) = P(X_n|C_k) P(C_k) / Σ_j P(X_n|C_j) P(C_j) = exp(a_k) / Σ_j exp(a_j)

   where a_k = ln [ P(X_n|C_k) P(C_k) ], k: the k-th class, j: runs over all classes

– Given a data set {X_n, t_n}, t_{nk} ∈ {0, 1} using 1-of-K coding, n = 1 … N

   • t_n = [t_{n1}, t_{n2}, …, t_{nK}] ∈ ℝ^K and Σ_k t_{nk} = 1

– The likelihood function

   P(T|W_1, W_2, …, W_K) = Π_{n=1}^{N} Π_{k=1}^{K} p(C_k|X_n)^{t_{nk}} = Π_n Π_k y_{nk}^{t_{nk}}

   where T is the N × K matrix of target labels with entries t_{nk}

– The cost function

   E(W_1, …, W_K) = −ln P(T|W_1, …, W_K) = −Σ_{n=1}^{N} Σ_{k=1}^{K} t_{nk} ln y_{nk}
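A minimal NumPy sketch of the multiclass cost: softmax posteriors y_nk and the cross-entropy E = −Σ_n Σ_k t_nk ln y_nk with 1-of-K coded targets (all sizes and values are illustrative):

```python
import numpy as np

def softmax(a):
    """Row-wise softmax: exp(a_k) / sum_j exp(a_j), shifted for numerical stability."""
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(T, Y, eps=1e-12):
    """E = - sum_n sum_k t_nk * ln(y_nk) for 1-of-K coded targets T."""
    return -np.sum(T * np.log(Y + eps))

rng = np.random.default_rng(0)
N, K = 5, 3                                   # 5 samples, 3 classes (illustrative)
A = rng.normal(size=(N, K))                   # activations a_k for each sample
Y = softmax(A)                                # posterior probabilities P(C_k | X_n)
labels = rng.integers(0, K, size=N)
T = np.eye(K)[labels]                         # 1-of-K coded target matrix (N x K)
print(round(cross_entropy(T, Y), 3))
```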

70
Unit 11
Backpropagation in CNN
• Given N training data of K classes, define the minibatch size as N_b, with N > N_b
• The updating process:
1. Randomly initialize the weights
2. Forward propagate from the input layer to the output layer
3. Compute the output error Δ_k and back-propagate the error:
   Δs_i = Σ_k conv2D( Δ_k, rot180(w_{k,i}) )
4. Compute the gradients of the weights and biases:
   ∂E/∂w_{k,i} = conv2D( Δ_k, s_i(p − 1, q − 1) ),  p, q = 0, 1, 2
   ∂E/∂b_k = Σ_{m,n} Δ_k(m, n)
5. Update the weights and biases with the learning
rate α and the regularizing parameter λ:
   E(W_1, …, W_K) = −Σ_n Σ_k t_{nk} ln y_{nk} + λ Σ ‖w‖²
[Figure: layer l+1 of the network, showing where Δ_k and Δs_i are computed in steps 3 and 4.]

71
Unit 11
Example
A CNN example for digit classification
– Build a Convolutional Neural Network to classify digit images with TensorFlow.
– Sample Code :
– https://github.com/aymericdamien/TensorFlow-
Examples/blob/master/notebooks/3_NeuralNetworks/convolutional_network.ipynb

MNIST Dataset Overview

More info: http://yann.lecun.com/exdb/mnist/

72
