
Introduction to

Deep Learning & Neural Networks

Kannan Singaravelu
23 May 2022
In this Lecture…

▪ Why do we need a different learning algorithm?

▪ Building blocks of neural networks

▪ Understanding activation functions

▪ Deep Learning for computer vision

▪ How do computers understand images?

▪ Underlying sequence modeling

▪ Problem of short term memory

▪ Brief overview of Generative Adversarial Networks

▪ Pros & cons of deep learning

Kannan Singaravelu 2
Understanding AI

▪ Science & Engineering of making intelligent machines

▪ Ability to learn automatically without being explicitly programmed

▪ Layered or hierarchical representations and learning using Neural Networks

Kannan Singaravelu 3
Why Deep Learning Now?

1952  Stochastic Gradient Descent (SGD)
1958  Perceptron - Learnable Weights
1986  Backpropagation - Multi-Layer Perceptron (MLP)
1995  Deep Convolutional Neural Network - Digit Recognition
1997  Long Short-Term Memory - Sequential Timeseries
2012  Watershed Moment in NN History - ImageNet

▪ Big Data - Large datasets. Easier storage and collection

▪ Hardware - Faster CPUs. Massively parallelizable chips: GPUs and TPUs

▪ Algorithms & Software - Improved techniques. New Models & Toolsets. Democratization of Deep Learning

[Figure: ILSVRC top-5 error (%) on ImageNet, 2010-2015, with the human error level marked for reference - pre-2012 with machine learning, post-2012 with deep learning]
Kannan Singaravelu 4
What is Deep Learning?

▪ Inspired from our understanding of human brains

▪ Layered or Hierarchical representations learning

▪ “deep” in deep learning stands for successive layers of representations

▪ Layered representations are learned via models called neural networks

▪ Neural networks are structured layers stacked on top of each other

▪ Mathematical framework for learning representations from data

Kannan Singaravelu 5
Layered Representations

Network of layers transforms image to digit

Image Source: Francois Chollet (2017), Deep Learning with Python

Kannan Singaravelu 6
Layered Representations

Multistage way to learn data representations


Information-distillation operation with successive filters

Image Source: Francois Chollet (2017), Deep Learning with Python

Kannan Singaravelu 7
How Deep Learning Works

Image Source: Francois Chollet (2017), Deep Learning with Python

Kannan Singaravelu 8
Building Blocks of Deep Learning

▪ Perceptron

▪ Forward Propagation

▪ Activation Functions

▪ Weight Initialization

▪ Backpropagation

Kannan Singaravelu 9
Perceptron: The Forward Propagation

A linear combination of the inputs and a bias $w_0$ is passed through a non-linear activation function $g$:

$$\hat{y} = g\Big(w_0 + \sum_{j=1}^{m} x_j w_j\Big)$$

Inputs → Weights → Sum → Non-linear Activation Function → Output
Kannan Singaravelu 10
Perceptron: The Forward Propagation

$$\hat{y} = g\Big(w_0 + \sum_{j=1}^{m} x_j w_j\Big) = g\big(w_0 + \boldsymbol{X}^{T} \boldsymbol{W}\big)$$

where $\boldsymbol{X} = [x_1, \dots, x_m]^{T}$ and $\boldsymbol{W} = [w_1, \dots, w_m]^{T}$

Inputs → Weights → Sum → Activation Function → Output
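As a minimal sketch (not part of the lecture code; it assumes NumPy, a sigmoid activation and hypothetical input and weight values), the forward pass above can be written as:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.5, -1.2, 3.0])   # inputs x1..xm (hypothetical values)
W  = np.array([0.1, 0.4, -0.2])   # weights w1..wm (hypothetical values)
w0 = 0.3                          # bias

z     = w0 + x @ W                # linear combination w0 + X^T W
y_hat = sigmoid(z)                # non-linear activation g(z)
print(y_hat)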

Kannan Singaravelu 11
Perceptron: Simplified

$$z = w_0 + \sum_{j=1}^{m} x_j w_j \qquad \hat{y} = g(z) = a$$

The bias is removed from the visual representation for simplicity and ease of representation.

Kannan Singaravelu 12
Multi Output Perceptron

$$z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i} \qquad \hat{y}_i = g(z_i) = a_i$$

MLPs are fully connected networks where all inputs are densely connected to all outputs.

Kannan Singaravelu 13
Single Hidden Layer Network

With weights $\boldsymbol{W}^{[1]}$ (input to hidden) and $\boldsymbol{W}^{[2]}$ (hidden to output):

$$z_i = w_{0,i}^{[1]} + \sum_{j=1}^{m} x_j\, w_{j,i}^{[1]} \qquad \hat{y}_i = g\Big(w_{0,i}^{[2]} + \sum_{j=1}^{n} z_j\, w_{j,i}^{[2]}\Big)$$

Inputs → Hidden → Output

Kannan Singaravelu 14
Single Hidden Layer Network

Zooming in on one hidden unit, e.g. $z_2$:

$$z_2 = w_{0,2}^{[1]} + \sum_{j=1}^{m} x_j\, w_{j,2}^{[1]} = w_{0,2}^{[1]} + x_1 w_{1,2}^{[1]} + x_2 w_{2,2}^{[1]} + \dots + x_m w_{m,2}^{[1]}$$

Inputs → Hidden → Output

Kannan Singaravelu 15
Multi Output Perceptron

[Diagram: inputs $x_1, x_2, \dots, x_m$ densely connected to hidden units $z_1, \dots, z_n$, which feed the outputs $\hat{y}_1$ and $\hat{y}_2$]

Inputs → Hidden → Output

Kannan Singaravelu 16
Deep Neural Network

Inputs → Hidden (layer $k$) → Output

$$z_{k,i} = w_{0,i}^{[k]} + \sum_{j=1}^{n_{k-1}} g\big(z_{k-1,j}\big)\, w_{j,i}^{[k]}$$

Kannan Singaravelu 17
Bias

▪ Bias can be thought of as analogous to the role of a constant in a linear function.

▪ Bias value allows the activation function to be shifted to the left or right, to better fit
the data.

▪ Influences the output values and doesn't interact with the actual input data.

Kannan Singaravelu 18
Activation Functions

▪ New take on learning representations from data

▪ Introduce non-linearity in the network

▪ Decide whether a neuron can contribute to the next layer

▪ Specify contribution threshold

▪ Should be computationally efficient

Kannan Singaravelu 19
Why a Non-Linear Function?

With a linear activation $g(z) = mz + b$, so that $g'(z) = m$:

▪ Derivative is constant

▪ No gradient relationship with the input data

▪ Unbounded output

▪ An N-layer network collapses to a single layer

Kannan Singaravelu 20
Sigmoid Function

▪ Non-binary activations (analog outputs)

▪ Differentiable

▪ Bounded output

▪ Vanishing gradient problem

$$g(z) = \frac{1}{1 + e^{-z}} \qquad g'(z) = g(z)\,\big(1 - g(z)\big)$$

Kannan Singaravelu 21
Tanh Function

▪ Non-linear

▪ Derivative steeper than Sigmoid

▪ Bounded output

▪ Vanishing gradient problem

$$g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \qquad g'(z) = 1 - g(z)^{2}$$

Kannan Singaravelu 22
ReLU Function

▪ Non-linear

▪ Unbounded output

▪ Sparse activations

▪ Dying ReLU problem

$$g(z) = \max(0, z) \qquad g'(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases}$$

Kannan Singaravelu 23
Leaky ReLU Function

▪ Non-zero slope

▪ Less computationally expensive

▪ A parametric ReLU with fixed slope a = 0.01

$$g(z) = \max(0.01z,\, z) \qquad g'(z) = \begin{cases} 1, & z > 0 \\ 0.01, & \text{otherwise} \end{cases}$$
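A small sketch (not from the lecture code) of the four activation functions above and their derivatives, written with NumPy; the leaky slope of 0.01 follows the slide:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)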

Kannan Singaravelu 24
Which Activation Functions to Use?

▪ More than 20 activation functions including Hard Sigmoid, Softmax, ELU, PReLU,
Maxout and Swish

▪ No single activation function that works in all cases

▪ A linear activation function can be used only in the output layer, for regression problems

▪ ReLU and its variants are preferred, and are used only in hidden layers

▪ Sigmoid and Softmax work better for classifiers and are preferred in output layers

Kannan Singaravelu 25
Weight Initialization

▪ Weights should be small, different from each other, and have good variance

▪ Uniform Distribution: $W_{j,i} \sim U\!\left[\frac{-1}{\sqrt{fan_{in}}},\; \frac{1}{\sqrt{fan_{in}}}\right]$

▪ Xavier Glorot

  ▪ Normal: $W_{j,i} \sim N(0, \sigma)$, where $\sigma = \sqrt{\dfrac{2}{fan_{in} + fan_{out}}}$

  ▪ Uniform: $W_{j,i} \sim U\!\left[\frac{-\sqrt{6}}{\sqrt{fan_{in} + fan_{out}}},\; \frac{\sqrt{6}}{\sqrt{fan_{in} + fan_{out}}}\right]$

▪ Uniform and Xavier Glorot initialization work well with the sigmoid activation function

▪ Xavier Glorot works well with the tanh activation function

Kannan Singaravelu 26
Weight Initialization

▪ He Init

  ▪ Normal: $W_{j,i} \sim N(0, \sigma)$, where $\sigma = \sqrt{\dfrac{2}{fan_{in}}}$

  ▪ Uniform: $W_{j,i} \sim U\!\left[\frac{-\sqrt{6}}{\sqrt{fan_{in}}},\; \frac{\sqrt{6}}{\sqrt{fan_{in}}}\right]$

▪ Works well with the ReLU activation function
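As an illustrative sketch (the layer sizes and input shape below are hypothetical), these schemes can be selected in Keras through the kernel_initializer argument:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation='tanh',
                 kernel_initializer='glorot_uniform',   # Xavier Glorot, pairs well with tanh/sigmoid
                 input_shape=(10,)),
    layers.Dense(64, activation='relu',
                 kernel_initializer='he_normal'),        # He init, pairs well with ReLU
    layers.Dense(1, activation='sigmoid')
])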

Kannan Singaravelu 27
Applying Neural Networks

Example : Will I pass the CQF program?

A network with weights $\boldsymbol{W}^{[1]}$, $\boldsymbol{W}^{[2]}$ takes the three exam scores as input:

$x = [80, 60, 90]$, where $x_1$ = Exam 1 marks | $x_2$ = Exam 2 marks | $x_3$ = Exam 3 marks

Predicted: $\hat{y}_1 = 0.3$    Actual: $y = 1$

The prediction error is measured by the loss $L\big(f(x^{[i]}; \boldsymbol{W}),\, y^{[i]}\big)$

Kannan Singaravelu 28
Quantifying Loss

Empirical loss measures the total loss over the entire dataset

$$\boldsymbol{X} = \begin{bmatrix} 80 & 60 & 90 \\ 70 & 50 & 80 \\ 100 & 80 & 90 \\ \dots & \dots & \dots \end{bmatrix} \qquad f(x) = \begin{bmatrix} 0.3 \\ 0.8 \\ 0.6 \\ \dots \end{bmatrix} \qquad y = \begin{bmatrix} 1 \\ 0 \\ 1 \\ \dots \end{bmatrix}$$

Kannan Singaravelu 29
Quantifying Loss

▪ Loss of our neural network measures the cost incurred from incorrect predictions.

▪ The loss or objective function is the quantity that will be minimized during training.

▪ Binary Cross Entropy - used with models that output a probability between 0 and 1
$$J(\boldsymbol{W}) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y^{[i]} \log f(x^{[i]}; \boldsymbol{W}) + \big(1 - y^{[i]}\big) \log\big(1 - f(x^{[i]}; \boldsymbol{W})\big) \Big]$$

▪ Mean Squared Error - used with regression models that output continuous values

$$J(\boldsymbol{W}) = \frac{1}{n} \sum_{i=1}^{n} \Big( y^{[i]} - f(x^{[i]}; \boldsymbol{W}) \Big)^{2}$$
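A minimal sketch computing both losses with NumPy, using the toy predictions from the earlier table (0.3, 0.8, 0.6 against labels 1, 0, 1):

import numpy as np

y      = np.array([1, 0, 1])          # actual labels y
y_pred = np.array([0.3, 0.8, 0.6])    # model outputs f(x; W)

bce = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))   # binary cross entropy
mse = np.mean((y - y_pred) ** 2)                                     # mean squared error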

Kannan Singaravelu 30
Optimization of Loss

▪ Training the neural network essentially means finding the network weights that
achieve the lowest loss:

$$\boldsymbol{W}^{*} = \underset{\boldsymbol{W}}{\operatorname{argmin}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x^{[i]}; \boldsymbol{W}),\, y^{[i]}\big)$$

▪ The optimizer determines how the network will be updated based on the loss function
by implementing a specific variant of stochastic gradient descent (SGD)

Kannan Singaravelu 31
Backpropagation

▪ Backpropagation is used to compute gradients.

▪ Algorithm typically has the following steps

▪ Initialize random weights $\sim N(0, \sigma^{2})$

▪ Loop until convergence

  ▪ Compute gradient $\dfrac{\partial J(\boldsymbol{W})}{\partial \boldsymbol{W}}$

  ▪ Update weights $\boldsymbol{W}_{new} \leftarrow \boldsymbol{W}_{old} - \eta \dfrac{\partial J(\boldsymbol{W})}{\partial \boldsymbol{W}}$, where $\eta$ is the learning rate

▪ Return weights
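A minimal sketch of this loop for a single-weight least-squares problem (the data, learning rate and fixed number of iterations are illustrative assumptions, not the lecture's code):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w  = np.random.normal(0, 1)               # initialize random weight ~ N(0, sigma^2)
lr = 0.01                                  # learning rate eta

for _ in range(1000):                      # "loop until convergence" (fixed steps here)
    grad = np.mean(2 * (w * x - y) * x)    # dJ/dw for J = mean((w*x - y)^2)
    w = w - lr * grad                      # W_new <- W_old - eta * dJ/dW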

Kannan Singaravelu 32
Vanishing or Exploding Gradient

▪ Vanishing gradient problem occurs when 0 < 𝑤 < 1

▪ Exploding gradient problem occurs when 𝑤 > 1

▪ For a layer to experience this problem, most of its weights must satisfy the
condition for either vanishing or exploding gradients.

Kannan Singaravelu 33
Gradient Clipping / Norm

▪ Basic idea is to set up a rule for avoiding vanishing or exploding gradients.

▪ Clip the derivatives of the loss function to a given threshold value if a gradient value
is less than a negative threshold or more than the positive threshold.

▪ Specify a threshold value; e.g. 0.5.

▪ If the gradient value exceeds 0.5 or − 0.5 , then it will be either scaled back by
the gradient norm or clipped back to the threshold value.

▪ Change the derivatives of the loss function to have a given vector norm when the L2
vector norm (sum of the squared values) of the gradient vector exceeds a threshold
value.

▪ If the vector norm for a gradient exceeds 1.0, then the values in the vector will
be rescaled so that the norm of the vector equals 1.0
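As a sketch (the thresholds 0.5 and 1.0 follow the examples above), both styles of clipping can be set directly on a Keras optimizer:

from tensorflow.keras.optimizers import Adam

opt_by_value = Adam(learning_rate=0.001, clipvalue=0.5)  # clip each gradient component to [-0.5, 0.5]
opt_by_norm  = Adam(learning_rate=0.001, clipnorm=1.0)   # rescale the gradient if its L2 norm exceeds 1.0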

Kannan Singaravelu 34
Backpropagation : Computing Gradients

With a simple chain $x \xrightarrow{w_1} z_1 \xrightarrow{w_2} \hat{y} \rightarrow J(\boldsymbol{W})$, the gradients are obtained with the chain rule:

$$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2} \qquad\qquad \frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

Kannan Singaravelu 35
Setting the Learning Rate

▪ There can be multiple local extremum

▪ Loss functions can be difficult to optimize

▪ A small learning rate converges slowly and gets stuck in false local minima

▪ A large learning rate can overshoot, become unstable and diverge

▪ Selecting adaptive learning rates addresses these issues

▪ Adaptive algorithms used in optimization: Adam, Adadelta, Adagrad, RMSProp

Kannan Singaravelu 36
Mini-batches

▪ Gradient descent algorithms are computationally expensive


▪ One idea is to compute the gradient using a single data point: $\dfrac{\partial J_i(W)}{\partial W}$

▪ Single data point (SGD) computation can be very noisy

▪ Computing the gradient over a batch of points is a good practice:

$$\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$$

▪ The true gradient is then the average of the gradient from each of those batches

Kannan Singaravelu 37
Mini-batches

▪ Mini-batches ensure a more accurate estimation of the gradient and lead to faster training

▪ Batch Size = Size of Training Set → Batch Gradient Descent

▪ Batch Size = 1 → Stochastic Gradient Descent

▪ 1 < Batch Size < Size of Training Set → Mini-batch Gradient Descent
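A minimal sketch of the three regimes using the batch_size argument of model.fit in Keras (the toy data and model below are illustrative):

import numpy as np
from tensorflow.keras import layers, models

x_train = np.random.rand(100, 3)
y_train = np.random.randint(0, 2, size=(100,))

model = models.Sequential([layers.Dense(8, activation='relu', input_shape=(3,)),
                           layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(x_train, y_train, batch_size=len(x_train), epochs=1)  # batch gradient descent
model.fit(x_train, y_train, batch_size=1, epochs=1)             # stochastic gradient descent
model.fit(x_train, y_train, batch_size=32, epochs=1)            # mini-batch gradient descent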

Kannan Singaravelu 38
Problem of Overfitting

▪ An underfit model does not have the capacity to fully learn from the data

▪ An overfit model is too complex and does not generalize well, as it starts to
memorize the training data

▪ The process of fighting overfitting is called Regularization

Kannan Singaravelu 39
Problem of Overfitting

▪ Regularization I: Dropout
▪ Randomly set some activations to zero

▪ Typically drop 50% of activations in layer

▪ Forces network to not rely on any one node

▪ Regularization II: Early Stopping


▪ Stop training before there is a possibility of over-fitting
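A minimal Keras sketch of both regularization techniques (the network, dropout rate and patience are illustrative assumptions); dropout is added as a layer, early stopping as a callback:

from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.5),                    # randomly drop 50% of activations in this layer
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# stop when the validation loss has not improved for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])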

Kannan Singaravelu 40
Regularization I : Dropout

[Diagram: a fully connected network - inputs $x_1, x_2, \dots, x_m$, two hidden layers $z^{[1]}$ and $z^{[2]}$ of four units each, and outputs $\hat{y}_1$, $\hat{y}_2$]

Kannan Singaravelu 41
Regularization I : Dropout

[Diagram: the same network during training with dropout - roughly half of the hidden units in each layer are randomly deactivated]

Kannan Singaravelu 42
Regularization II : Early Stopping

[Figure: training and testing loss versus epochs - the under-fitting region on the left, the over-fitting region on the right; stop training at the point where the testing loss starts to rise]

Kannan Singaravelu 43
Neural Network Representation

▪ Shallow vs Deep

▪ Logistic regression is the simplest form of neural network and is a shallow model

▪ A network of multiple, highly interconnected hidden layers is an example of a
deep model

▪ A neural network with one hidden layer is a 2-layer neural network

Kannan Singaravelu 44
Neural Network Representation

$L$ = number of layers
$n^{[l]}$ = number of neurons in layer $l$

Input Layer → Hidden Layer → Hidden Layer → Hidden Layer → Output Layer

$n^{[0]} = n_x = 3 \qquad n^{[1]} = 5 \qquad n^{[2]} = 5 \qquad n^{[3]} = 3 \qquad n^{[4]} = n^{[L]} = 1$

Kannan Singaravelu 45
Neural Network Dimensions

Input Layer → Hidden Layer → Hidden Layer → Hidden Layer → Output Layer

For the network above ($n^{[0]} = 3$, $n^{[1]} = n^{[2]} = 5$), the parameter shapes are:

$w^{[l]} = (n^{[l]}, n^{[l-1]}) \qquad b^{[l]} = (n^{[l]}, 1)$

$z^{[1]} = w^{[1]} x + b^{[1]}$, with shapes $(5, 1) = (5, 3)(3, 1) + b^{[1]}$, i.e. $(n^{[1]}, 1) = (n^{[1]}, n^{[0]})(n^{[0]}, 1) + b^{[1]}$, where $w^{[1]} = (n^{[1]}, n^{[0]})$

$a^{[1]} = g(z^{[1]})$

$z^{[2]} = w^{[2]} a^{[1]} + b^{[2]}$, with shapes $(5, 1) = (5, 5)(5, 1) + b^{[2]}$, where $w^{[2]} = (n^{[2]}, n^{[1]}) = (5, 5)$

$a^{[2]} = g(z^{[2]})$

Kannan Singaravelu 46
Neural Network Hyperparameters

▪ Some of the most common hyperparameters that can be optimized for better results

▪ Number of hidden layers

▪ Number of neurons

▪ Choice of activation

▪ Number of epochs

▪ Learning rate

▪ Mini-batch size

▪ Regularization parameters

Kannan Singaravelu 47
Deep Learning for Computer Vision
Convolutional Neural Network

Kannan Singaravelu 48
Deep Learning for Computer Vision

▪ Computer Vision is one of the rapidly advancing fields

▪ Field of having a computer understand and label what is present in an image

▪ Giving machines a sense of vision

▪ What computers “see”?

▪ How does it process an image or video?

▪ Why Not a Fully connected Neural Network?

Kannan Singaravelu 49
Images are Numbers

Input Image What the computer sees

An image is just a matrix of numbers in [0, 255]

Kannan Singaravelu 50
Image Representation in CV

▪ Three Dimension: height, width, and color channels

▪ Made up of pixels

▪ 28 x 28 greyscale image is represented as 28 x 28 x 1

▪ 28 x 28 color image is represented as 28 x 28 x 3

▪ Batches of images are stored as 4D tensors, while video data is 5D

Adapted from Francois Chollet (2017), Deep Learning with Python

Kannan Singaravelu 51
Learning Visual Features

▪ In fully connected or dense neural networks, each hidden layer is densely
connected to its previous layer - every input is connected to every output in that
layer

▪ In a densely connected network, the 2D input (spatial structure) is collapsed down
into a 1D vector which is fed into the dense network. Every pixel in that 1D vector is
fed into the next layer and, in the process, we lose all of the very useful spatial
structure of the image

▪ Deep learning on large images with fully connected layers isn't feasible

▪ Spatial structure is very important in image data and we need to preserve it

Kannan Singaravelu 52
Feature Extraction with Convolution

▪ Filter size: 4 x 4
▪ 16 different weights
▪ Apply this filter to 4 x 4 patches in the input
▪ Shift 2 pixels for next patch

▪ The ‘patch’ method is called convolution

▪ Edge detection to connect patch in input layers to a single neuron in subsequent layer

▪ Slide through the window to define connections and apply set of weights (weighted
sum) to extract local features

▪ Use multiple filters – weights – to extract different features

▪ Spatially share the parameters of each filter to extract maximum spatial features

Kannan Singaravelu 53
Feature Extraction with Convolution

▪ Process of adding each element of the image to its local neighbors, weighted by the
filter

▪ One of the most important operations in signal and image processing

▪ Filter is a matrix of values whose size and values determine the transformation
effect

Kannan Singaravelu 54
Convolution Operation

5 x 5 Image:              3 x 3 Filter:        3 x 3 Feature Map:
1 1 1 0 0                 1 0 1                4 3 4
0 1 1 1 0         *       0 1 0        =       2 4 3
0 0 1 1 1                 1 0 1                2 3 4
0 0 1 1 0
0 1 1 0 0

▪ Apply the 3 x 3 filter over the input image

▪ Perform element-wise multiplication

▪ Add the outputs
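A minimal NumPy sketch of this operation (valid convolution, stride 1, no padding), reproducing the feature map above:

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

n, f = image.shape[0], kernel.shape[0]
feature_map = np.zeros((n - f + 1, n - f + 1))
for i in range(n - f + 1):
    for j in range(n - f + 1):
        patch = image[i:i + f, j:j + f]
        feature_map[i, j] = np.sum(patch * kernel)   # element-wise multiply, then add

print(feature_map)   # [[4. 3. 4.], [2. 4. 3.], [2. 3. 4.]]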

Kannan Singaravelu 55
Vertical Edge Detection

6 x 6 Image:                    3 x 3 Filter:        4 x 4 Feature Map:
10 10 10 0 0 0                  1 0 -1               0 30 30 0
10 10 10 0 0 0                  1 0 -1               0 30 30 0
10 10 10 0 0 0          *       1 0 -1       =       0 30 30 0
10 10 10 0 0 0                                       0 30 30 0
10 10 10 0 0 0
10 10 10 0 0 0

▪ In vertical edge detection, a vertical edge is a 3 x 3 region (in the above example)
where there are bright pixels on the left and dark pixels on the right

Kannan Singaravelu 56
Horizontal Edge Detection

6 x 6 Image:                    3 x 3 Filter:        4 x 4 Feature Map:
10 10 10 0 0 0                   1  1  1             0   0   0   0
10 10 10 0 0 0                   0  0  0             30  10 -10 -30
10 10 10 0 0 0          *       -1 -1 -1     =       30  10 -10 -30
0 0 0 10 10 10                                       0   0   0   0
0 0 0 10 10 10
0 0 0 10 10 10

▪ In horizontal edge detection, a horizontal edge is a 3 x 3 region (in the above
example) where the pixels are relatively bright on top and dark on the bottom

Kannan Singaravelu 57
Other Common Filters

Prewitt (vertical, horizontal):        Sobel (vertical, horizontal):
1 0 -1      1  1  1                    1 0 -1      1  2  1
1 0 -1      0  0  0                    2 0 -2      0  0  0
1 0 -1     -1 -1 -1                    1 0 -1     -1 -2 -1

Scharr (vertical, horizontal):         Parametric Filter (learned using backpropagation):
3  0  -3      3  10  3                 w1 w2 w3
10 0 -10      0   0  0                 w4 w5 w6
3  0  -3     -3 -10 -3                 w7 w8 w9

Kannan Singaravelu 58
Producing Feature Maps

Original            Sharpen              Edge Detect          'Strong' Edge Detect
                    -1 -1 -1              0  1  0             -1 -2 -1
                    -1  9 -1              1 -4  1              0  0  0
                    -1 -1 -1              0  1  0              1  2  1

Note: If the feature map contains negative values (black portion), one can convert negative values to non-negative values by applying ReLU activation functions,
thus converting the black portions into grey.

Kannan Singaravelu 59
Padding

▪ An n x n image with an f x f filter will produce an output of [n-f+1] x [n-f+1]

▪ On every convolution operation (edge detection), the image shrinks and we end
up with a very small image

▪ Information on the edges is used much less compared to other parts of the
image, and we miss vital spatial information

▪ Padding the image before applying the convolution operation helps address these
issues by adding a border of one pixel around the edges

Kannan Singaravelu 60
Padding

Kannan Singaravelu 61
Padding

▪ A 6 x 6 image will then become 8 x 8, resulting in a 6 x 6 output, thus preserving
the original input size

▪ By convention, we pad with zeros with one pixel as the padded amount (p = 1)

▪ The new output is of dimension [n+2p-f+1] x [n+2p-f+1]

▪ Two common choices on how much to pad

  ▪ Valid convolutions: no padding

  ▪ Same convolutions: output size is the same as the input size; $p = \dfrac{f-1}{2}$

▪ By convention, f is almost always an odd number

Kannan Singaravelu 62
Strided Convolution

▪ Another basic building block of convolutions

▪ Convolve with a stride of two (s=2); hop over two steps

▪ The new output is of dimension $\left( \dfrac{n + 2p - f}{s} + 1 \right) \times \left( \dfrac{n + 2p - f}{s} + 1 \right)$

▪ Round down the dimension if not an integer

▪ Filter must lie entirely within the image (or image plus the padded region)
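A small helper capturing the output-dimension formula above (the example calls are illustrative):

def conv_output_size(n, f, p=0, s=1):
    """n: input size, f: filter size, p: padding, s: stride (rounded down)."""
    return (n + 2 * p - f) // s + 1

conv_output_size(6, 3)             # 4  -> the 6 x 6 edge-detection examples
conv_output_size(6, 3, p=1)        # 6  -> 'same' convolution
conv_output_size(7, 3, p=1, s=2)   # 4  -> strided convolution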

Kannan Singaravelu 63
Pooling

▪ Pooling down-samples the image data extracted by the convolutional layers

▪ Reduces the dimensionality of the feature map in order to decrease the processing
time

▪ Max Pooling extracts maximum value of the sub-regions of the feature map

▪ For 3D inputs, the computation is done independently on each of the channels

▪ Average Pooling is used sometimes for very deep neural networks to collapse the
representation

Kannan Singaravelu 64
Convolutional Neural Networks

▪ Architecture designed for image classification tasks

▪ Three parts to a CNN


▪ Convolution : apply filters to generate feature maps by extracting features in the image or
in the previous layers (generically)

▪ Non linearity : apply non linear activation function – ReLU

▪ Pooling : down sampling the spatial representation of the image to reduce dimensionality
and to preserve spatial invariance

▪ Some classic CNN architectures are


▪ LeNet, AlexNet, VGGNet, ResNet, GoogLeNet, XceptionNet, Fast R-CNN, U-Net,
EfficientNet

Kannan Singaravelu 65
Convolutional Neural Networks

Image Source: mathworks.com

Kannan Singaravelu 66
CNN : Key Takeaways

▪ Explicitly assume inputs are images

▪ Architecture designed for image classification tasks

▪ Three parts to a CNN: Convolution, Non-linearity, Pooling

▪ Three hyperparameters - Depth, Padding, Stride - decides the output dimension

▪ Output dimension is given by $\left( \dfrac{n + 2p - f}{s} + 1 \right) \times \left( \dfrac{n + 2p - f}{s} + 1 \right)$

Kannan Singaravelu 67
Convolutions in Financial Time Series

▪ Convolutions are a unique type of neural network which treats data as a grid

▪ Applying convolutions to sequence data is an evolving idea

▪ Converts non-image data into synthetic images

▪ Ensemble with traditional sequence models like LSTM to boost the model score

▪ A CNN-LSTM architecture uses CNN layers for feature extraction combined with
LSTM to support sequence prediction

▪ Conv1D for univariate or multivariate time series

▪ Conv2D if we have a time series of images as input

▪ CNN-LSTM ≠ ConvLSTM
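As an illustrative sketch (the layer sizes and the 60-step univariate window are assumptions, not a recommendation), a CNN-LSTM stack in Keras might look like:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(32, kernel_size=3, activation='relu', input_shape=(60, 1)),
    layers.MaxPooling1D(pool_size=2),          # down-sample the extracted features
    layers.LSTM(64),                           # sequence modelling on the CNN features
    layers.Dense(1)                            # next-step prediction
])
model.compile(optimizer='adam', loss='mse')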

Kannan Singaravelu 68
Code Walkthrough


# requires TensorFlow / Keras
from tensorflow.keras import layers, models

model = models.Sequential()

# convolutional base: alternate Conv2D feature extraction and MaxPooling2D down-sampling
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# classifier head: flatten the feature maps and classify into 10 classes
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

Kannan Singaravelu 69
Code Walkthrough – Model Summary


Layer (type) Output Shape Param #
====================================================================
conv2d_1 (Conv2D) (None, 26, 26, 32) 320
____________________________________________________________________
maxpooling2d_1 (MaxPooling2D) (None, 13, 13, 32) 0
____________________________________________________________________
conv2d_2 (Conv2D) (None, 11, 11, 64) 18496
____________________________________________________________________
maxpooling2d_2 (MaxPooling2D) (None, 5, 5, 64) 0
____________________________________________________________________
conv2d_3 (Conv2D) (None, 3, 3, 64) 36928
____________________________________________________________________
flatten_1 (Flatten) (None, 576) 0
____________________________________________________________________
dense_1 (Dense) (None, 64) 36928
____________________________________________________________________
dense_2 (Dense) (None, 10) 650
====================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0

Kannan Singaravelu 70
Deep Sequence Modeling
Long Short Term Memory Network

Kannan Singaravelu 71
Deep Sequence Modeling

Kannan Singaravelu 72
Deep Sequence Modeling

▪ Applying neural networks to problems involving sequential processing of data

▪ Sequence data comes in many forms : text, audio, video and financial time series

▪ Modeling to predict the next sequence of events (word, sound, time series)

▪ Effective for financial time series prediction

▪ Handle different types of network architecture


▪ variable length sequence

▪ track long-term dependencies

▪ preserve information about the order and share parameters across the sequence

Kannan Singaravelu 73
Recurrent Neural Network

▪ Generalization of feedforward neural network that has an internal memory

▪ Good at modeling sequence data

▪ RNNs use sequential memory for prediction

▪ Sequential memory is a mechanism used to identify sequence patterns

▪ RNNs are faster and use fewer computational resources, as there are fewer tensor
operations

Kannan Singaravelu 74
Recurrent for Sequence Modeling

[1] Feed Forward    [2] Sequence Output    [3] Sequence Input    [4] Sequence Input & Output    [5] Synced Sequence Input & Output

Adapted from Andrej Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks

Kannan Singaravelu 75
Short-term Memory

Input $x_t$ → RNN ($h_t$) → Output $\hat{y}_t$

"the clouds are dark, it's about to ….."

"The sky is …"

"This is the 15th day of wildfires in the bay area. There is smoke everywhere, it is snowing
ash and the sky is …"
Kannan Singaravelu 76
Problem of Short-term Memory

▪ Suffer from short-term memory

▪ Vanishing gradients due to the nature of the backpropagation algorithm

▪ Doesn't learn long-range dependencies across time steps

Kannan Singaravelu 77
Long Short Term Memory Network

▪ LSTM algorithm is fundamental to deep learning for timeseries

▪ Special kind of RNN, explicitly designed to avoid the long-term dependency problem

▪ Widely used for sequence prediction problems and proved to be extremely effective

▪ LSTMs have four interacting layers

Kannan Singaravelu 78
Long Short Term Memory Network

▪ Maintain a separate cell state from what is outputted

▪ Use gates to regulate the flow of information


▪ Forget Gate

▪ Input Gate

▪ Update (Cell) State

▪ Output Gate

▪ Uninterrupted gradient flow for backpropagation

Kannan Singaravelu 79
LSTM : Forget Gate

$$f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big)$$

[Diagram: LSTM cell - the forget gate $f_t$, a sigmoid over $[h_{t-1}, x_t]$, scales the previous cell state $C_{t-1}$]

Kannan Singaravelu 80
LSTM : Input Gate

$$i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big) \qquad \bar{c}_t = \tanh\big(W_c \cdot [h_{t-1}, x_t] + b_c\big)$$

[Diagram: the input gate $i_t$ decides which values to update, and $\bar{c}_t$ is the candidate cell state]

Kannan Singaravelu 81
LSTM : Update Cell State

$$C_t = f_t \ast C_{t-1} + i_t \ast \bar{c}_t$$

[Diagram: the old cell state $C_{t-1}$ scaled by $f_t$ plus the candidate $\bar{c}_t$ scaled by $i_t$ gives the new cell state $C_t$]

Kannan Singaravelu 82
LSTM : Output Gate

$$o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big) \qquad h_t = o_t \ast \tanh(C_t)$$

[Diagram: the output gate $o_t$ filters the updated cell state to produce the new hidden state $h_t$]

Kannan Singaravelu 83
LSTM Network

Putting the four gates together, a single LSTM cell computes:

$$f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big)$$
$$i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big)$$
$$\bar{c}_t = \tanh\big(W_c \cdot [h_{t-1}, x_t] + b_c\big)$$
$$C_t = f_t \ast C_{t-1} + i_t \ast \bar{c}_t$$
$$o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big)$$
$$h_t = o_t \ast \tanh(C_t)$$
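A minimal NumPy sketch of one LSTM time step written directly from these equations (the dimensions and random weights below are toy values, not a trained model):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)                   # forget gate
    i_t = sigmoid(Wi @ z + bi)                   # input gate
    c_bar = np.tanh(Wc @ z + bc)                 # candidate cell state
    c_t = f_t * c_prev + i_t * c_bar             # update cell state
    o_t = sigmoid(Wo @ z + bo)                   # output gate
    h_t = o_t * np.tanh(c_t)                     # new hidden state
    return h_t, c_t

# toy usage: hidden size 4, input size 3
rng = np.random.default_rng(0)
H, D = 4, 3
W = [rng.normal(size=(H, H + D)) for _ in range(4)]
b = [np.zeros(H) for _ in range(4)]
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 W[0], b[0], W[1], b[1], W[2], b[2], W[3], b[3])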

Kannan Singaravelu 84
LSTM Gradient Flow

▪ LSTM network is comprised of different memory blocks called cells or units.

[Diagram: unrolled LSTM cells - the cell state flows from $C_0$ through $C_1$, $C_2$, $C_3$, with hidden states $h_0 \dots h_3$, inputs $x_t$ and outputs $y_t$ at each step, giving an uninterrupted gradient flow]

Kannan Singaravelu 85
Code Walkthrough


# requires TensorFlow / Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()

# Add first LSTM layer (60 time steps, 1 feature), returning the full sequence
model.add(LSTM(units=256, input_shape=(60, 1), return_sequences=True))
model.add(Dropout(0.4))

# Add second LSTM layer, returning only the last output
model.add(LSTM(units=256, return_sequences=False))
model.add(Dropout(0.4))

# Add a Dense layer
model.add(Dense(64, activation='relu'))

# Add the output layer
model.add(Dense(1))

Kannan Singaravelu 86
Code Walkthrough – Model Summary


_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 60, 256) 264192
_________________________________________________________________
dropout_1 (Dropout) (None, 60, 256) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 256) 525312
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 16448
_________________________________________________________________
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 806,017
Trainable params: 806,017
Non-trainable params: 0

Kannan Singaravelu 87
Generative Adversarial Networks
Generator – Discriminator Network

Kannan Singaravelu 88
Generative Adversarial Networks

▪ Generative models

▪ Neural networks that mimic a given distribution of the data

▪ Generate content such as images and text, similar to what a human can produce

▪ The objective is to find the hidden latent meaning

▪ Provides foundational-level insights into the explanatory factors behind the data

▪ Impressive results on image and video generation such as style transfer using
CycleGAN, and human face generation using StyleGAN

▪ Different types of GAN : Vanilla, CGAN, DCGAN, LAPGAN, SRGAN

Kannan Singaravelu 89
Generative Adversarial Networks

▪ Consist of two neural networks

▪ Generator : trained to generate new data from the problem domain

▪ Discriminator : trained to distinguish fake data from real data

▪ Most applications in NNs are implemented using discriminative models

▪ GANs are part of a different class of models known as generative models

▪ Discriminative models learn the conditional probability P(y|x)

▪ Generative models capture the joint probability P(x, y), or just P(x) if there are no labels

▪ Unlike discriminative models, generative models are used for both supervised and
unsupervised learning

Kannan Singaravelu 90
GAN Structure

Adapted from Google Developers, Generative Adversarial Networks

Kannan Singaravelu 91
Five Steps to GAN

▪ Define GAN architecture (based on the application)

▪ Train discriminator to distinguish real vs fake data

▪ Train the generator to generate fake data that can fool the discriminator

▪ Continue training both discriminator and generator for multiple epochs

▪ Save the generator model to create new, realistic fake data

Kannan Singaravelu 92
GAN : Discriminator

Gradient Flow

Note: Hold the generator values constant when training the discriminator, and the discriminator values constant when training the generator. Each should be
trained against a static adversary.

Adapted from Google Developers, Generative Adversarial Networks

Kannan Singaravelu 93
GAN : Generator

Gradient Flow

Note: Hold the generator values constant when training the discriminator, and the discriminator values constant when training the generator. Each should be
trained against a static adversary.

Adapted from Google Developers, Generative Adversarial Networks

Kannan Singaravelu 94
GAN Loss Functions

▪ Try to replicate a probability distribution

▪ Loss functions measure the distance between the distribution generated by the GAN and the
distribution of the real data

▪ Two common loss functions are Minimax loss and Wasserstein loss

▪ Minimax: the generator minimizes and the discriminator maximizes the following function

$$\mathbb{E}_x\big[\log D(x)\big] + \mathbb{E}_z\big[\log\big(1 - D(G(z))\big)\big]$$

▪ Wasserstein: used in WGAN, where the discriminator (critic) does not classify instances

  ▪ Critic Loss: $D(x) - D(G(z))$

  ▪ Generator Loss: $D(G(z))$
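A minimal TensorFlow sketch of the minimax objective (not the lecture's code), written in the commonly used non-saturating binary cross-entropy formulation of the two losses:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    # push D(x) towards 1 and D(G(z)) towards 0
    real_loss = bce(tf.ones_like(real_output), real_output)
    fake_loss = bce(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # fool the discriminator: push D(G(z)) towards 1
    return bce(tf.ones_like(fake_output), fake_output)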

Kannan Singaravelu 95
Pros & Cons of Deep Learning

▪ Deep Learning algorithms are versatile and scalable

▪ Preferred for high dimensionality, complex and sequential problems

▪ Learns from data incrementally, layer-by-layer and jointly

▪ Automating feature engineering is the key highlight of Deep Learning

▪ Most Machine Learning algorithms used in industry aren't Deep Learning algorithms

▪ Deep Learning isn't always the right tool, as there may not be enough data available
for deep learning to be applicable, and/or the problem can be better solved by a different
algorithm

Kannan Singaravelu 96
Limitations of Neural Networks

▪ Data Hungry

▪ Computationally intensive to train and deploy

▪ Subject to algorithmic bias

▪ Fooled by adversarial examples

▪ Requires expert knowledge to design and tune architectures

▪ Hype and Promise of AI : The AI Winters

Kannan Singaravelu 97
References

▪ Chigozie, Winifred, Anthony, and Stephen (2018), Activation Functions: Comparison of Trends in Practice and
Research for Deep Learning

▪ Francois Chollet (2017), Deep Learning with Python

▪ Andrej Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks

▪ Christopher Olah (2015), Understanding LSTM Networks

▪ Michael Phi (2018), Illustrated Guide to LSTM's and GRU's: A step by step explanation

▪ Stanford University, Massachusetts Institute of Technology, Technische Universität München, Notes on Artificial
Intelligence

▪ TensorFlow, Keras, API Documentation

▪ Google Developers, Generative Adversarial Networks

Note: Some of the materials from the above resources are adapted for these notes under CC-BY-SA 4.0. For a detailed interpretation of the subject, refer to the above resources.

Kannan Singaravelu 98
