Kannan M5L3 Notes
Kannan Singaravelu
23 May 2022
In this Lecture…
Understanding AI
Why Deep Learning Now?
1952  Stochastic Gradient Descent (SGD)
1986  Backpropagation - Multi-Layer Perceptron (MLP)
1995  Deep Convolutional Neural Networks - digit recognition; sequential time series
2012  Watershed moment in neural network history - ImageNet
      (pre-2012: with machine learning; post-2012: with deep learning)
Why now? Big Data - large datasets, easier storage and collection.
[Chart: ILSVRC top-5 error (%) on ImageNet, 2010-2015, falling below the human error level once deep learning models arrived in 2012]
What is Deep Learning?
Layered Representations
How Deep Learning Works
Building Blocks of Deep Learning
▪ Perceptron
▪ Forward Propagation
▪ Activation Functions
▪ Weight Initialization
▪ Backpropagation
Perceptron: The Forward Propagation
A perceptron takes inputs $x_1, \dots, x_m$, forms a linear combination of the inputs with weights $w_1, \dots, w_m$ and a bias $w_0$, and passes the result through a non-linear activation function $g$ to produce the output:

$\hat{y} = g\left( w_0 + \sum_{j=1}^{m} x_j w_j \right)$
Perceptron: The Forward Propagation
In vector form,

$\hat{y} = g\left( w_0 + \sum_{j=1}^{m} x_j w_j \right) = g\left( w_0 + X^{T} W \right)$

where $X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}$ and $W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$
Perceptron: Simplified

$z = w_0 + \sum_{j=1}^{m} x_j w_j , \qquad \hat{y} = g(z) = a$

The bias is omitted from the visual representation for simplicity.
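As a concrete illustration of the forward pass above, here is a minimal NumPy sketch; the input values, weights, and the choice of sigmoid for $g$ are made up for the example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up inputs and weights for illustration
x = np.array([1.0, -2.0, 0.5])      # inputs x_1 ... x_m
w = np.array([0.4, 0.1, -0.3])      # weights w_1 ... w_m
w0 = 0.2                            # bias

z = w0 + np.dot(x, w)               # z = w_0 + sum_j x_j w_j
y_hat = sigmoid(z)                  # y_hat = g(z)
print(z, y_hat)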
Multi Output Perceptron

$z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i} , \qquad \hat{y}_i = g(z_i) = a_i$
MLPs are fully connected networks where all inputs are densely connected to all outputs
Single Hidden Layer Network
With layer weights $W^{[1]}$ (input to hidden) and $W^{[2]}$ (hidden to output):

Hidden layer:  $z_i = w_{0,i}^{[1]} + \sum_{j=1}^{m} x_j\, w_{j,i}^{[1]}$

Output layer:  $\hat{y}_i = g\left( w_{0,i}^{[2]} + \sum_{j=1}^{n} g(z_j)\, w_{j,i}^{[2]} \right)$
Single Hidden Layer Network
Zooming in on a single hidden unit, for example $z_2$:

$z_2 = w_{0,2}^{[1]} + \sum_{j=1}^{m} x_j\, w_{j,2}^{[1]} = w_{0,2}^{[1]} + x_1 w_{1,2}^{[1]} + x_2 w_{2,2}^{[1]} + \dots + x_m w_{m,2}^{[1]}$
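A minimal NumPy sketch of the single-hidden-layer forward pass described above; the layer sizes, random weights, and sigmoid activation are arbitrary choices for illustration:

import numpy as np

def g(z):                      # activation, here sigmoid
    return 1.0 / (1.0 + np.exp(-z))

m, n, out = 3, 4, 2            # inputs, hidden units, outputs (arbitrary sizes)
rng = np.random.default_rng(0)

W1 = rng.normal(size=(n, m))   # W^[1]: hidden-layer weights
b1 = np.zeros(n)               # hidden-layer biases w_{0,i}^[1]
W2 = rng.normal(size=(out, n)) # W^[2]: output-layer weights
b2 = np.zeros(out)

x = np.array([0.5, -1.0, 2.0])
z1 = b1 + W1 @ x               # z_i = w_{0,i}^[1] + sum_j x_j w_{j,i}^[1]
y_hat = g(b2 + W2 @ g(z1))     # y_i = g(w_{0,i}^[2] + sum_j g(z_j) w_{j,i}^[2])
print(y_hat)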
Multi Output Perceptron
[Diagram: inputs $x_1, \dots, x_m$ feed a layer of units $z_1, \dots, z_n$, which produce the outputs $\hat{y}_1$ and $\hat{y}_2$]
Deep Neural Network
Stacking hidden layers, the pre-activation of unit $i$ in layer $k$ is

$z_{k,i} = w_{0,i}^{[k]} + \sum_{j=1}^{n_{k-1}} g\left( z_{k-1,j} \right) w_{j,i}^{[k]}$

where $n_{k-1}$ is the number of neurons in the previous layer.
Bias
▪ The bias value allows the activation function to be shifted to the left or right, to better fit the data.
▪ It influences the output values but does not interact with the actual input data.
Activation Functions
Why Non-Linear Functions?
▪ Derivative is constant
▪ Unbounded output

$g(z) = mz + b, \qquad g'(z) = m$
Sigmoid Function
▪ Differentiable
▪ Bounded output
$g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\left( 1 - g(z) \right)$
Tanh Function
▪ Non-linear
▪ Bounded output
$g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad g'(z) = 1 - g(z)^{2}$
ReLU Function
▪ Non-linear
▪ Unbounded output
▪ Sparse activations
$g(z) = \max(0, z), \qquad g'(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases}$
Leaky ReLU Function
▪ Non-zero slope
$g(z) = \max(0.01z, z), \qquad g'(z) = \begin{cases} 1, & z > 0 \\ 0.01, & \text{otherwise} \end{cases}$
Which Activation Functions to Use?
▪ More than 20 activation functions including Hard Sigmoid, Softmax, ELU, PReLU,
Maxout and Swish
▪ A linear activation function can only be used in the output layer, for regression problems
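A short NumPy sketch of the activation functions and derivatives listed above:

import numpy as np

def sigmoid(z):      return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):    return sigmoid(z) * (1.0 - sigmoid(z))

def tanh(z):         return np.tanh(z)
def d_tanh(z):       return 1.0 - np.tanh(z) ** 2

def relu(z):         return np.maximum(0.0, z)
def d_relu(z):       return np.where(z > 0, 1.0, 0.0)

def leaky_relu(z):   return np.maximum(0.01 * z, z)
def d_leaky_relu(z): return np.where(z > 0, 1.0, 0.01)

z = np.linspace(-3, 3, 7)
print(relu(z), d_relu(z))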
Weight Initialization
▪ Uniform Distribution:  $W_{j,i} \sim U\left[ -\frac{1}{\sqrt{fan_{in}}}, \frac{1}{\sqrt{fan_{in}}} \right]$
▪ Xavier Glorot
  ▪ Normal:  $W_{j,i} \sim N(0, \sigma)$, where $\sigma = \sqrt{\frac{2}{fan_{in} + fan_{out}}}$
  ▪ Uniform:  $W_{j,i} \sim U\left[ -\sqrt{\frac{6}{fan_{in} + fan_{out}}}, \sqrt{\frac{6}{fan_{in} + fan_{out}}} \right]$
▪ Uniform and Xavier Glorot initialization work well with the sigmoid activation function
Weight Initialization
▪ He Init
  ▪ Normal:  $W_{j,i} \sim N(0, \sigma)$, where $\sigma = \sqrt{\frac{2}{fan_{in}}}$
  ▪ Uniform:  $W_{j,i} \sim U\left[ -\sqrt{\frac{6}{fan_{in}}}, \sqrt{\frac{6}{fan_{in}}} \right]$
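These schemes map onto built-in Keras initializers; a minimal sketch, assuming tensorflow.keras as in the code walkthroughs later in these notes (the layer sizes and input shape are illustrative):

from tensorflow.keras import layers, models, initializers

model = models.Sequential([
    # Glorot (Xavier) uniform pairs well with sigmoid/tanh activations
    layers.Dense(64, activation='tanh', input_shape=(10,),
                 kernel_initializer=initializers.GlorotUniform()),
    # He normal is a common choice for ReLU activations
    layers.Dense(32, activation='relu',
                 kernel_initializer=initializers.HeNormal()),
    layers.Dense(1, activation='sigmoid'),
])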
Applying Neural Networks
With layer weights $W^{[1]}$ and $W^{[2]}$, feeding a single example $x^{[1]} = [80, 60, 90]$ through the network gives a predicted output of 0.3, while the actual label is 1. The loss

$\mathcal{L}\left( f(x^{[i]}; W), y^{[i]} \right)$

measures how far the prediction is from the target.
Quantifying Loss
Empirical loss measures the total loss over the entire dataset
With layer weights $W^{[1]}$ and $W^{[2]}$, for a batch of inputs the network outputs predictions $f(x)$ that are compared with the actual labels $y$:

  x1   x2   x3  |  f(x)  |  y
  80   60   90  |  0.3   |  1
  70   50   80  |  0.8   |  0
 100   80   90  |  0.6   |  1
  …    …    …   |   …    |  …
Quantifying Loss
▪ Loss of our neural network measures the cost incurred from incorrect predictions.
▪ The loss or objective function is the quantity that will be minimized during training.
▪ Binary Cross Entropy - used with models that output a probability between 0 and 1
$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{[i]} \log f(x^{[i]}; W) + \left( 1 - y^{[i]} \right) \log\left( 1 - f(x^{[i]}; W) \right) \right]$
▪ Mean Squared Error - used with regression models that output continuous values
$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{[i]} - f(x^{[i]}; W) \right)^{2}$
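A direct NumPy translation of the two loss functions above, using the small prediction/label table from the previous slide:

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)          # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y     = np.array([1, 0, 1])
y_hat = np.array([0.3, 0.8, 0.6])
print(binary_cross_entropy(y, y_hat), mean_squared_error(y, y_hat))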
Optimization of Loss
▪ Training a neural network essentially means finding the network weights that achieve the lowest loss:

$W^{*} = \underset{W}{\mathrm{argmin}} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( f(x^{[i]}; W), y^{[i]} \right)$

▪ The optimizer determines how the network weights are updated based on the loss function, by implementing a specific variant of stochastic gradient descent (SGD)
Backpropagation
Vanishing or Exploding Gradient
▪ For a layer to experience this problem, a sufficient number of its weights must satisfy the condition for either vanishing or exploding gradients.
Gradient Clipping / Norm
▪ Clipping by value: clip the derivatives of the loss function to a given threshold if a gradient value is less than the negative threshold or more than the positive threshold.
  ▪ For example, if a gradient value exceeds 0.5 (or falls below -0.5), it is clipped back to the threshold value.
▪ Clipping by norm: rescale the derivatives of the loss function to a given vector norm when the L2 norm (square root of the sum of squared values) of the gradient vector exceeds a threshold value.
  ▪ For example, if the vector norm of the gradient exceeds 1.0, the values in the vector are rescaled so that the norm of the vector equals 1.0
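In Keras, both forms of clipping are exposed as optimizer arguments; a minimal sketch using the example thresholds above (the learning rate is an arbitrary choice):

from tensorflow.keras import optimizers

# clip each gradient element to the range [-0.5, 0.5]
opt_value = optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# rescale the whole gradient vector if its L2 norm exceeds 1.0
opt_norm = optimizers.SGD(learning_rate=0.01, clipnorm=1.0)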
Backpropagation : Computing Gradients
Consider a simple chain $x \xrightarrow{w_1} z_1 \xrightarrow{w_2} \hat{y} \rightarrow J(W)$. Gradients are computed by repeatedly applying the chain rule backward through the network:

$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}, \qquad \frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$
Setting the Learning Rate
▪ A small learning rate converges slowly and can get stuck in false local minima, while too large a learning rate can overshoot and diverge
Mini-batches
$\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
▪ The true gradient is then the average of the gradient from each of those batches
Mini-batches
▪ Mini-batches ensure a more accurate estimation of the gradient and lead to faster training
▪ 1 < Batch Size < Size of Training Set → Mini-batch Gradient Descent
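A minimal NumPy sketch of mini-batch gradient descent for a linear model with squared loss; the toy data, learning rate, and batch size are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                          # toy inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

W = np.zeros(3)
lr, B = 0.1, 32                                         # learning rate, batch size

for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), B):
        batch = idx[start:start + B]
        Xb, yb = X[batch], y[batch]
        grad = (2.0 / len(batch)) * Xb.T @ (Xb @ W - yb)  # dJ/dW averaged over the batch
        W -= lr * grad                                    # gradient descent update
print(W)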
Problem of Overfitting
▪ Underfitting occurs when the model does not have the capacity to fully learn from the data
▪ An overfit model is too complex and does not generalize well, as it starts to memorize the training data
Problem of Overfitting
▪ Regularization I: Dropout
▪ Randomly set some activations to zero
Regularization I : Dropout
[Diagram: a network with inputs $x_1, x_2$, two hidden layers of units $z_i^{[1]}$ and $z_i^{[2]}$, and output $\hat{y}_1$; on each training iteration a random subset of the hidden units is dropped]
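In Keras, dropout is applied as a layer between Dense layers; a minimal sketch (the layer sizes, input shape, and the 0.5 drop rate are illustrative assumptions):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.5),          # randomly zero 50% of the activations during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])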
Regularization II : Early Stopping
[Plot: training and testing loss versus epochs; the training loss keeps falling while the testing loss eventually rises. Stop training where the testing loss starts to increase - to the left lies under-fitting, to the right over-fitting]
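In Keras, early stopping is implemented as a callback that monitors the validation loss; a minimal sketch (the patience value and the fit arguments are assumptions for illustration):

from tensorflow.keras.callbacks import EarlyStopping

# stop once the validation loss has not improved for 5 consecutive epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])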
Neural Network Representation
▪ Shallow vs Deep
▪ Logistic regression is the simplest form of neural network and is a shallow model
▪ A network with multiple, highly interconnected hidden layers is an example of a deep model
Neural Network Representation
$L$ = number of layers
$n^{[l]}$ = number of neurons in layer $l$
[Diagram: inputs $x_1, x_2, x_3$ -> Input Layer, three Hidden Layers, Output Layer -> $\hat{y}$]
Neural Network Dimensions
For the network above (3 inputs, hidden layers of 5 neurons each):

$w^{[l]}$ has shape $\left( n^{[l]}, n^{[l-1]} \right)$ and $b^{[l]}$ has shape $\left( n^{[l]}, 1 \right)$

$z^{[1]} = w^{[1]} x + b^{[1]}$:  $(5, 1) = (5, 3)(3, 1) + (5, 1)$,  and  $a^{[1]} = g\left( z^{[1]} \right)$
$w^{[2]} = \left( n^{[2]}, n^{[1]} \right) = (5, 5)$
$z^{[2]} = w^{[2]} a^{[1]} + b^{[2]}$:  $(5, 1) = (5, 5)(5, 1) + (5, 1)$,  and  $a^{[2]} = g\left( z^{[2]} \right)$
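The shape bookkeeping above can be checked directly in NumPy; a small sketch for the 3-input, 5-neuron example (the all-ones weights are placeholders, only the shapes matter here):

import numpy as np

n0, n1, n2 = 3, 5, 5                            # layer sizes from the example
x  = np.ones((n0, 1))                           # (3, 1) input column vector
W1, b1 = np.ones((n1, n0)), np.zeros((n1, 1))   # (5, 3), (5, 1)
W2, b2 = np.ones((n2, n1)), np.zeros((n2, 1))   # (5, 5), (5, 1)

z1 = W1 @ x + b1                                # (5, 3)(3, 1) + (5, 1) -> (5, 1)
a1 = np.tanh(z1)
z2 = W2 @ a1 + b2                               # (5, 5)(5, 1) + (5, 1) -> (5, 1)
print(z1.shape, z2.shape)                       # (5, 1) (5, 1)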
Neural Network Hyperparameters
▪ Some of the most common hyperparameters that can be optimized for better results
▪ Number of neurons
▪ Choice of activation
▪ Number of epochs
▪ Learning rate
▪ Mini-batch size
▪ Regularization parameters
Deep Learning for Computer Vision
Convolutional Neural Network
Deep Learning for Computer Vision
Images are Numbers
Image Representation in CV
▪ Made up of pixels
Learning Visual Features
▪ Spatial structure is very important in image data, and we need to preserve it
Feature Extraction with Convolution
▪ Filter size: 4 x 4 (16 different weights)
▪ Apply this filter to 4 x 4 patches in the input
▪ Shift by 2 pixels for the next patch
▪ Connect each patch in the input layer to a single neuron in the subsequent layer (enabling, e.g., edge detection)
▪ Slide the window across the input to define connections, applying the set of weights (a weighted sum) to extract local features
▪ Spatially share the parameters of each filter to extract maximum spatial features
Feature Extraction with Convolution
▪ Convolution is the process of adding each element of the image to its local neighbors, weighted by the filter
▪ The filter is a matrix of values whose size and values determine the transformation effect
Convolution Operation
5 x 5 Image          3 x 3 Filter        3 x 3 Feature Map

1 1 1 0 0            1 0 1               4 3 4
0 1 1 1 0      *     0 1 0        =      2 4 3
0 0 1 1 1            1 0 1               2 3 4
0 0 1 1 0
0 1 1 0 0
Vertical Edge Detection
6 x 6 Image               3 x 3 Filter       4 x 4 Feature Map

10 10 10 0 0 0            1 0 -1             0 30 30 0
10 10 10 0 0 0            1 0 -1             0 30 30 0
10 10 10 0 0 0      *     1 0 -1       =     0 30 30 0
10 10 10 0 0 0                               0 30 30 0
10 10 10 0 0 0
10 10 10 0 0 0
▪ In vertical edge detection, a vertical edge is a 3 x 3 region (in the above example) where there are bright pixels on the left and dark pixels on the right
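The vertical-edge example above can be reproduced with a small NumPy sketch of the valid, stride-1 convolution used in CNN layers (technically a cross-correlation):

import numpy as np

def conv2d_valid(image, kernel):
    """Valid, stride-1 cross-correlation as used in CNN layers."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6)   # 6 x 6 image from the slide
kernel = np.array([[1, 0, -1]] * 3)             # vertical edge filter
print(conv2d_valid(image, kernel))              # 4 x 4 map with columns 0 / 30 / 30 / 0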
Horizontal Edge Detection
6 x 6 Image                3 x 3 Filter        4 x 4 Feature Map

10 10 10  0  0  0           1  1  1             0   0   0   0
10 10 10  0  0  0           0  0  0            30  10 -10 -30
10 10 10  0  0  0     *    -1 -1 -1       =    30  10 -10 -30
 0  0  0 10 10 10                               0   0   0   0
 0  0  0 10 10 10
 0  0  0 10 10 10
Other Common Filters
Prewitt (vertical / horizontal)       Sobel (vertical / horizontal)

 1  0 -1         1  1  1               1  0 -1         1  2  1
 1  0 -1         0  0  0               2  0 -2         0  0  0
 1  0 -1        -1 -1 -1               1  0 -1        -1 -2 -1

Scharr (vertical / horizontal)        Learnable filter

 3  0  -3        3  10  3              w1 w2 w3
10  0 -10        0   0  0              w4 w5 w6
 3  0  -3       -3 -10 -3              w7 w8 w9
Producing Feature Maps
-1 -1 -1         0  1  0        -1 -2 -1
-1  9 -1         1 -4  1         0  0  0
-1 -1 -1         0  1  0         1  2  1
Note: If the feature map contains negative values (black portion), one can convert negative values to non-negative values by applying ReLU activation functions,
thus converting the black portions into grey.
Padding
▪ On every convolution operation (e.g., edge detection), the image shrinks, and we end up with a very small image
▪ Information at the edges is used much less compared to other parts of the image, and we miss vital spatial information
▪ Padding before applying the convolution operation helps address these issues by adding a border of pixels around the image
Padding
▪ By convention, we pad with zeros, with one pixel as the padded amount (p = 1)
▪ Same convolutions: the output size is the same as the input size; $p = \frac{f - 1}{2}$
Strided Convolution
▪ The new output dimension is $\left( \frac{n + 2p - f}{s} + 1 \right) \times \left( \frac{n + 2p - f}{s} + 1 \right)$
▪ Filter must lie entirely within the image (or image plus the padded region)
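A quick helper makes the output-size formula concrete; for example, a 6 x 6 image with a 3 x 3 filter, no padding, and stride 1 gives a 4 x 4 output, matching the edge-detection examples above:

def conv_output_size(n, f, p=0, s=1):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  (6x6 image, 3x3 filter)
print(conv_output_size(6, 3, p=1, s=1))  # 6  ("same" convolution, p = (f-1)/2)
print(conv_output_size(28, 3))           # 26 (first Conv2D layer in the walkthrough below)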
Pooling
▪ Pooling down-samples the image data extracted by the convolutional layers
▪ It reduces the dimensionality of the feature map in order to decrease processing time
▪ Max Pooling extracts the maximum value of each sub-region of the feature map
▪ Average Pooling is sometimes used in very deep neural networks to collapse the representation
Convolutional Neural Networks
▪ Pooling: down-sampling the spatial representation of the image to reduce dimensionality and to preserve spatial invariance
Convolutional Neural Networks
CNN : Key Takeaways
▪ The output dimension is given by $\left( \frac{n + 2p - f}{s} + 1 \right) \times \left( \frac{n + 2p - f}{s} + 1 \right)$
Convolutions in Financial Time Series
▪ Convolutions are a unique type of neural network that treats data as a grid
▪ Ensemble with traditional sequence models like LSTM to boost the model score
▪ A CNN-LSTM architecture uses CNN layers for feature extraction, combined with an LSTM to support sequence prediction
▪ CNN-LSTM ≠ ConvLSTM
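A minimal Keras sketch of the CNN-LSTM idea for a univariate time series; the window length of 60 and all layer sizes are illustrative assumptions, not the author's exact architecture:

from tensorflow.keras import layers, models

model = models.Sequential([
    # Conv1D extracts local patterns from a window of 60 past observations
    layers.Conv1D(32, kernel_size=3, activation='relu', input_shape=(60, 1)),
    layers.MaxPooling1D(pool_size=2),
    # the LSTM models the temporal ordering of the extracted features
    layers.LSTM(64),
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')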
Code Walkthrough
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
Code Walkthrough – Model Summary
Layer (type) Output Shape Param #
====================================================================
conv2d_1 (Conv2D) (None, 26, 26, 32) 320
____________________________________________________________________
maxpooling2d_1 (MaxPooling2D) (None, 13, 13, 32) 0
____________________________________________________________________
conv2d_2 (Conv2D) (None, 11, 11, 64) 18496
____________________________________________________________________
maxpooling2d_2 (MaxPooling2D) (None, 5, 5, 64) 0
____________________________________________________________________
conv2d_3 (Conv2D) (None, 3, 3, 64) 36928
____________________________________________________________________
flatten_1 (Flatten) (None, 576) 0
____________________________________________________________________
dense_1 (Dense) (None, 64) 36928
____________________________________________________________________
dense_2 (Dense) (None, 10) 650
====================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0
Deep Sequence Modeling
Long Short Term Memory Network
Deep Sequence Modeling
Deep Sequence Modeling
▪ Sequence data comes in many forms: text, audio, video, and financial time series
▪ We model sequences to predict the next event (word, sound, time-series value)
▪ Sequence models must preserve information about order and share parameters across the sequence
Recurrent Neural Network
▪ RNNs are fast and use fewer computational resources, as there are fewer tensor operations
Recurrent for Sequence Modeling
[Diagram: one-to-one, one-to-many, many-to-one, and many-to-many recurrent configurations]
Adapted from Andrej Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks
Short-term Memory
Problem of Short-term Memory
Long Short Term Memory Network
▪ Special kind of RNN, explicitly designed to avoid the long-term dependency problem
▪ Widely used for sequence prediction problems and proved to be extremely effective
Long Short Term Memory Network
▪ Forget Gate
▪ Input Gate
▪ Output Gate
LSTM : Forget Gate
$f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)$
LSTM : Input Gate
$i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right)$
$\tilde{c}_t = \tanh\left( W_c \cdot [h_{t-1}, x_t] + b_c \right)$
LSTM : Update Cell State
$C_t = f_t * C_{t-1} + i_t * \tilde{c}_t$
LSTM : Output Gate
$o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right)$
$h_t = o_t * \tanh\left( C_t \right)$
LSTM Network
$f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)$
$i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right)$
$\tilde{c}_t = \tanh\left( W_c \cdot [h_{t-1}, x_t] + b_c \right)$
$C_t = f_t * C_{t-1} + i_t * \tilde{c}_t$
$o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right)$
$h_t = o_t * \tanh\left( C_t \right)$
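The gate equations above translate almost line-for-line into NumPy; a minimal sketch of one LSTM time step, with arbitrary dimensions and random weights chosen only for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W holds (Wf, Wi, Wc, Wo) and b holds (bf, bi, bc, bo)."""
    Wf, Wi, Wc, Wo = W
    bf, bi, bc, bo = b
    hx = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f = sigmoid(Wf @ hx + bf)                   # forget gate
    i = sigmoid(Wi @ hx + bi)                   # input gate
    c_bar = np.tanh(Wc @ hx + bc)               # candidate cell state
    c = f * c_prev + i * c_bar                  # update cell state
    o = sigmoid(Wo @ hx + bo)                   # output gate
    h = o * np.tanh(c)                          # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_h, n_x = 4, 2                                 # hidden size, input size (arbitrary)
W = [rng.normal(size=(n_h, n_h + n_x)) for _ in range(4)]
b = [np.zeros(n_h) for _ in range(4)]
h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), W, b)
print(h, c)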
LSTM Gradient Flow
[Diagram: unrolled LSTM cells across time steps; the cell state path C0 -> C1 -> C2 -> C3 provides an uninterrupted path for gradients to flow backward through time]
Code Walkthrough
model = Sequential()
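The remaining layer definitions on this slide did not survive extraction. A sketch consistent with the model summary on the next slide would look like the following; the input shape of (60, 1) is inferred from the parameter counts, while the dropout rate and Dense activation are assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(60, 1)))  # lstm_1: 264,192 params
model.add(Dropout(0.2))                                           # rate assumed
model.add(LSTM(256))                                              # lstm_2: 525,312 params
model.add(Dropout(0.2))                                           # rate assumed
model.add(Dense(64, activation='relu'))                           # dense_1: 16,448 params
model.add(Dense(1))                                               # dense_2: 65 params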
Code Walkthrough – Model Summary
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 60, 256) 264192
_________________________________________________________________
dropout_1 (Dropout) (None, 60, 256) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 256) 525312
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 16448
_________________________________________________________________
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 806,017
Trainable params: 806,017
Non-trainable params: 0
Generative Adversarial Networks
Generator – Discriminator Network
Generative Adversarial Networks
▪ Generative models
▪ Generate content such as images and text similar to what a human can produce
▪ Impressive results on image and video generation such as style transfer using
CycleGAN, and human face generation using StyleGAN
Generative Adversarial Networks
▪ Generative models capture the joint probability P(x, y), or P(x) if there are no labels
▪ Unlike discriminative models, generative models are used for both supervised and unsupervised learning
GAN Structure
Five Steps to GAN
▪ Train the generator to produce fake data that can fool the discriminator
GAN : Discriminator
[Diagram: when training the discriminator, gradients flow back through the discriminator only, with the generator held constant]
Note: Hold the generator values constant when training the discriminator, and the discriminator values constant when training the generator. Each should be trained against a static adversary.
GAN : Generator
[Diagram: when training the generator, gradients flow back through the discriminator into the generator, with the discriminator held constant]
GAN Loss Functions
▪ Loss functions measure the distance between the distribution generated by the GAN and the distribution of the real data
▪ Two common loss functions are Minimax loss and Wasserstein loss
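For reference, the standard form of the minimax objective (stated here for completeness rather than taken from the slide) is

$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_{z}}\left[ \log\left( 1 - D(G(z)) \right) \right]$

where $D$ is the discriminator, $G$ the generator, and $z$ the latent noise vector.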
Pros & Cons of Deep Learning
▪ Most Machine Learning algorithms used in industry aren't Deep Learning algorithms
▪ Deep Learning isn't always the right tool: there may not be enough data available for deep learning to be applicable, and/or the problem may be better solved by a different algorithm
Limitations of Neural Networks
▪ Data Hungry
References
▪ Chigozie, Winifred, Anthony, and Stephen (2018), Activation Functions: Comparison of Trends in Practice and Research for Deep Learning
▪ Michael Phi (2018), Illustrated Guide to LSTM's and GRU's: A Step by Step Explanation
▪ Stanford University, Massachusetts Institute of Technology, Technische Universität München, Notes on Artificial Intelligence
Note: Some of the material from the above resources is adapted for these notes under CC-BY-SA 4.0. For detailed interpretation of the subject, refer to the above resources.