Deep Learning Module-2 & 4
By Prof. Jignasha A. S.
Introduction
● An autoencoder is a neural network that is trained to attempt to copy its input to its output.
● It tries to learn a way to compress data and then reconstruct it back to its original form.
● It has a hidden layer h that describes a code used to represent the input.
● It has two parts:
○ Encoder: Shrinks the data down to a smaller size (called the "code" or "latent space").
○ Decoder: Expands the compressed data back to the original size.
● The network may be viewed as consisting of two parts:
○ an encoder function h = f (x) and
○ a decoder that produces a reconstruction r = g(h).
Fig. The general structure of an autoencoder, mapping an input x to an output (called the reconstruction) r through an internal representation or code h. The autoencoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
Linear Autoencoders
● It is a type of neural network that uses linear transformations to compress and reconstruct the data.
● It consists of an encoder and a decoder.
● Encoder:
○ A linear layer that maps the input data to a lower-dimensional representation, i.e., the latent space.
○ If the input has n dimensions, the latent layer has m dimensions, where m < n.
● Decoder:
○ A linear layer that maps the latent space back to the original input space.
○ If the latent space has m dimensions, the reconstructed output has n dimensions.
Linear Autoencoders …continued
● Encoding:
○ z = Wx + b
where z: encoded data (m dimensions), W: weight matrix (m × n), x: input data (n dimensions), b: bias (m dimensions).
● Decoding:
○ x′ = W′z + b′
where x′: reconstructed data (n dimensions), W′: weight matrix (n × m), z: encoded data (m dimensions), b′: bias (n dimensions).
● Note: W, W′, b, and b′ are learned during training by minimizing the reconstruction error, as in the sketch below.
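A minimal PyTorch sketch of this encode/decode pair, assuming an input dimension n = 100, a latent dimension m = 10, and mean-squared reconstruction error (these are illustrative choices, not values from the slides):

```python
import torch
import torch.nn as nn

n, m = 100, 10  # assumed input and latent dimensions

class LinearAutoencoder(nn.Module):
    def __init__(self, n, m):
        super().__init__()
        self.encoder = nn.Linear(n, m)   # z = Wx + b
        self.decoder = nn.Linear(m, n)   # x' = W'z + b'

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = LinearAutoencoder(n, m)
criterion = nn.MSELoss()                              # reconstruction error
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, n)                                # dummy batch of 64 samples
optimizer.zero_grad()
loss = criterion(model(x), x)                         # compare reconstruction to input
loss.backward()
optimizer.step()
```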
Undercomplete Autoencoders
● An autoencoder whose code (h) dimension is less than the input (x) dimension is
called undercomplete.
● It captures the most salient features of the training data.
● For example, if your original data has 100 numbers, the undercomplete autoencoder
might compress it down to just 10 numbers.
● The learning process can be described simply as minimizing a loss function L(x, g(f(x))), where L (e.g., mean squared error) penalizes g(f(x)) for being dissimilar to x.
● Principal Component Analysis (PCA) handles linear structure in the data, whereas autoencoders can handle non-linear data.
● Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g
can thus learn a more powerful nonlinear generalization of PCA.
● Unfortunately, if the encoder and decoder are allowed too much capacity, the autoencoder memorizes the data instead of learning useful features.
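A hedged sketch of an undercomplete autoencoder with a nonlinear encoder f and decoder g; the layer sizes (100 → 50 → 10) and ReLU activations are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 100-dim input compressed to a 10-dim code (assumed sizes)
encoder = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))   # h = f(x)
decoder = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 100))   # r = g(h)

def loss_fn(x):
    # L(x, g(f(x))): mean squared reconstruction error
    return F.mse_loss(decoder(encoder(x)), x)

x = torch.randn(32, 100)        # dummy batch
print(loss_fn(x).item())
```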
Undercomplete Autoencoders Flaws
1. Overfitting to the Training Data:
● Even though undercomplete autoencoders use a smaller latent space, they can still overfit the
training data if the network is too powerful (e.g., if it has too many layers or units).
● Overfitting occurs when the model learns to memorize the training data rather than generalizing
to unseen data.
2. Inability to Capture Complex Data Distributions:
● An undercomplete autoencoder might struggle to capture the full complexity of the data distribution, especially if the data is highly non-linear.
3. Lack of Regularization:
● Without regularization techniques, undercomplete autoencoders can still find ways to trivially
map inputs to outputs without learning meaningful features.
Overcomplete Autoencoders
● In an overcomplete autoencoder, the latent space (the code) has more dimensions than the input
data.
● This means that instead of compressing the data, the network expands it into a
higher-dimensional space.
● For example, if your input data has 100 features, an overcomplete autoencoder might expand it
to 200 features in the latent space.
● The main goal is to learn a richer, more detailed representation of the input data.
● By having more neurons in the latent space, the autoencoder can capture more subtle patterns
and correlations in the data.
● These models are often used when the goal is to discover complex, high-level features that are
not easily captured by a smaller latent space.
● These autoencoders risk overfitting: with more capacity than needed, the network may simply learn to copy the input directly to the output without extracting meaningful features, memorizing the data rather than learning useful representations.
Regularization in Autoencoders
● The ideal situation would be to train any kind of autoencoder (undercomplete, overcomplete, or
with code dimensions equal to the input) and still get meaningful results.
● Regularization makes it possible to train such autoencoder architectures successfully, choosing the code dimension and the capacity of the encoder and decoder based on the complexity of the distribution to be modeled.
● Instead of limiting the power of the encoder and decoder by making them simple or the code
size small, regularized autoencoders use a special loss function.
● Loss Function:
○ Sparsity: It encourages the autoencoder to create codes where only a few neurons are
active at a time, making the representation more efficient and meaningful.
○ Smoothness: It encourages the autoencoder to make sure small changes in the input lead
to small changes in the code, helping the model to generalize better.
○ Robustness: It makes the autoencoder resistant to noise or missing parts of the input data, ensuring that it can still understand the important features even when the input isn't perfect.
Sparse Autoencoders
● It is a learning algorithm.
● It automatically learns features from unlabelled data.
● Even when the number of hidden units is large (not only when it is small), imposing a sparse constraint on the hidden units allows the autoencoder to discover interesting structure in the data.
● Consist of:
○ Encoder: Used to compress input to latent space representations.
○ Decoder: Reconstruct input from latent space representations.
○ Loss Function
● Sparsity is enforced by adding a penalty term to the loss function that encourages the activations of the hidden units to be sparse.
● Sparsity constraint is implemented in various ways,
○ Sparsity Penalty
○ Sparsity Regularizer
○ Sparsity Proportion
Sparse Autoencoders …continued
● Sparsity Penalty
○ is a term added to the loss function that penalizes the network for having non-sparse
activations.
● Sparsity Regularizer
○ is a function that encourages the network to have sparse activations.
● Sparsity Proportion
○ is a hyperparameter that determines the desired level of sparsity in the activations.
● Neuron is “active”, if its output value is close to “1”.
● Neuron is “inactive”, if its output value is close to “0”.
● Note: We would like to constrain neuron to be inactive most of the time.
p̂j = (1/n) Σi=1..n [ aj(xi) ]
where
aj: activation of hidden unit j in the autoencoder's hidden layer
aj(xi): activation of hidden unit j when the network is given the specific input xi
n: total number of training examples
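A hedged sketch of how this average activation p̂j can be turned into a KL-divergence sparsity penalty added to the loss; the target sparsity ρ = 0.05 and the weight β are assumed hyperparameters, and the hidden layer is assumed to use sigmoid activations (outputs strictly between 0 and 1):

```python
import torch

def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=1.0):
    # hidden_activations: (batch, hidden_units) outputs of a sigmoid hidden layer
    rho_hat = hidden_activations.mean(dim=0)   # p̂_j, averaged over the batch
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()

# total loss = reconstruction loss + kl_sparsity_penalty(h)
h = torch.sigmoid(torch.randn(32, 20))         # dummy hidden activations
print(kl_sparsity_penalty(h).item())
```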
Sparse Autoencoders …continued
Sparse Autoencoders …continued
● There are actually two different ways to construct our sparsity penalty:
○ L1 regularization
○ KL-divergence
● A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
● Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds “absolute value
of magnitude” of coefficient as penalty term to the loss function.
● Ridge regression adds “squared magnitude” of coefficient as penalty term to the loss
function.
● These techniques are used when we are dealing with a large set of features.
● Although L1 and L2 can both be used as regularization terms, the key difference between them is that L1 regularization tends to shrink coefficients all the way to zero, while L2 regularization moves coefficients towards zero but they never reach it exactly.
● Thus L1 regularization is often used as a method of feature selection.
Sparse Autoencoders …continued
● L1 regularization adds the absolute value of the weights to the loss function.
L1 Regularization term = λ ∑i ∣wi∣
where λ is the regularization parameter and wi are the weights of the model.
● L1 regularization tends to produce sparse models, meaning it drives many weights to exactly
zero.
● This can effectively perform feature selection.
● L2 regularization adds the squared value of the weights to the loss function.
L2 Regularization term = λ ∑i wi²
● L2 regularization tends to shrink the weights uniformly but does not drive them to exactly zero.
● Instead, it forces the weights to be small, but not necessarily sparse, resulting in a more evenly
distributed effect on the features.
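A small sketch of how these two penalty terms can be added to a model's loss; the regularization strength λ = 1e-4 is an assumed value:

```python
import torch
import torch.nn as nn

def l1_penalty(model, lam=1e-4):
    # λ Σ|w_i|: tends to drive many weights to exactly zero (sparse solutions)
    return lam * sum(p.abs().sum() for p in model.parameters())

def l2_penalty(model, lam=1e-4):
    # λ Σ w_i²: shrinks weights uniformly without zeroing them out
    return lam * sum((p ** 2).sum() for p in model.parameters())

model = nn.Linear(100, 10)      # any model works here
# total loss = reconstruction_loss + l1_penalty(model)   (or + l2_penalty(model))
print(l1_penalty(model).item(), l2_penalty(model).item())
```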
Sparse Autoencoders …continued
● L1 Regularization
○ Encourages many neurons to be inactive (zero output), which results in a more compact and efficient representation.
○ Drives some weights to zero.
○ Effectively selects the most relevant features.
○ Sparse activations lead to a more interpretable model; it is easier to understand which neurons or features are important.
● L2 Regularization
○ Spreads regularization across all weights, which leads to partial activation rather than full inactivity.
○ Reduces the magnitude of weights.
○ Does not perform effective feature selection.
○ Less sparsity, more distributed activation; it is harder to pinpoint important features.
Contractive Autoencoders
● It also uses regularization.
● It adds a penalty based on the Frobenius (Euclidean) norm of the Jacobian matrix of the encoder's hidden representation with respect to the input.
● Jacobian Matrix:
○ It is the matrix of first-order partial derivatives.
○ The Jacobian matrix for k hidden nodes and n input nodes is given as follows:

Jx(h) = [ ∂h1/∂x1  ∂h1/∂x2  …  ∂h1/∂xn
          ∂h2/∂x1  ∂h2/∂x2  …  ∂h2/∂xn
             :        :             :
          ∂hk/∂x1  ∂hk/∂x2  …  ∂hk/∂xn ]
Contractive Autoencoders …continued
● Frobenius Norm:
○ The Frobenius Norm or Euclidean Norm of a matrix (M) of order n * m is the square root
of the sum of the squares of the elements of the matrix.
○ The Frobenius Norm of matrix M is given by:
■ ||M||F = √( Σi=1..n Σj=1..m |mij|² )
○ Regularization term (squared Frobenius norm of the encoder's Jacobian):
λ || Jx(h) ||F²
○ Loss Function:
Lnew = L + λ || Jx(h) ||F²
where λ controls the strength of the contractive penalty.
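A hedged sketch of the contractive penalty for a single sigmoid encoder layer h = sigmoid(Wx + b), using the closed form ||Jx(h)||F² = Σj (hj(1−hj))² Σi Wji²; the layer sizes and λ are assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(100, 20)        # assumed sizes: 100-dim input, 20-dim code
lam = 1e-3                          # assumed contractive weight λ

def contractive_penalty(x):
    h = torch.sigmoid(encoder(x))                    # (batch, 20)
    dh = (h * (1 - h)) ** 2                          # squared derivative of the sigmoid
    w_sq = (encoder.weight ** 2).sum(dim=1)          # (20,) Σ_i W_ji²
    return lam * (dh * w_sq).sum(dim=1).mean()       # batch-averaged ||J_x(h)||_F²

# total loss = reconstruction_loss + contractive_penalty(x)
x = torch.randn(32, 100)
print(contractive_penalty(x).item())
```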
Contractive Autoencoders …continued
Sparse Autoencoder | Contractive Autoencoder
Encourages sparsity in the hidden-layer representation. | Encourages robustness to small variations in the input.
Adds a sparsity constraint to ensure only a few neurons are active for any input. | Adds a penalty on the Jacobian of the hidden-layer activations with respect to the input, to reduce sensitivity to input changes.
Loss includes reconstruction loss + sparsity penalty (e.g., KL divergence or L1 norm). | Loss includes reconstruction loss + penalty on the Frobenius norm of the Jacobian matrix of the encoder.
Goal: learning efficient, compact, and sparse representations of the input data. | Goal: learning representations that are invariant to small variations in the input.
Sparsity is explicitly enforced (most hidden neurons are inactive for any input). | Sparsity is not explicitly enforced; the goal is robustness, not sparsity.
Useful for feature extraction where a sparse representation is needed (e.g., text data, images). | Useful for tasks where robustness to noise or small input variations is important (e.g., denoising).
Application: feature learning. | Application: denoising.
Denoising Autoencoders
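The original slides illustrate denoising autoencoders with figures only. As a brief, hedged sketch of the idea (corrupt the input, then train the network to reconstruct the clean original), with the layer sizes and noise level as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())     # assumed sizes (e.g., flattened MNIST)
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())

def denoising_loss(x, noise_std=0.3):
    x_noisy = x + noise_std * torch.randn_like(x)   # corrupt the input
    x_hat = decoder(encoder(x_noisy))               # reconstruct from the corrupted version
    return F.mse_loss(x_hat, x)                     # compare against the *clean* input

x = torch.rand(16, 784)                             # dummy batch of "images"
print(denoising_loss(x).item())
```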
Module-3: Convolutional Neural Networks (CNN):
Supervised Learning
3.2 Modern Deep Learning Architectures:
● LeNet: Architecture,
● AlexNet: Architecture,
● ResNet: Architecture
LeNET Architecture
● LeNet is one of the earliest convolutional neural network (CNN) architectures, designed by
Yann LeCun in the late 1980s primarily for handwritten digit recognition (e.g., the MNIST
dataset).
● It laid the foundation for many modern CNNs used in various computer vision tasks.
● Components:
1. Input Layer
2. Convolutional Layer 1 (C1)
3. Subsampling (Pooling) Layer 1 (S2)
4. Convolutional Layer 2 (C3)
5. Subsampling (Pooling) Layer 2 (S4)
6. Fully Connected Layer (C5)
7. Output Layer (F6)
LeNET Architecture …continued
Step | Layer Type | Input Size | Operation | Output Size
1 | Input | 28✕28✕1 | Image padded with zeros to become 32✕32 | 32✕32✕1
2 | Convolution C1 | 32✕32✕1 | No padding, 6 filters, 5✕5, stride=1 | 28✕28✕6
3 | Pooling S2 | 28✕28✕6 | Average pooling, 2✕2, stride=2 | 14✕14✕6
4 | Convolution C3 | 14✕14✕6 | No padding, 16 filters, 5✕5, stride=1 | 10✕10✕16
5 | Pooling S4 | 10✕10✕16 | Average pooling, 2✕2, stride=2 | 5✕5✕16
6 | Flattening | 5✕5✕16 | Flattens 3D feature maps into a 1D vector | 1✕400
7 | Fully Connected | 1✕400 | Fully connected layer with 120 units | 1✕120
8 | Fully Connected | 1✕120 | Fully connected layer with 84 units | 1✕84
9 | Output (softmax) | 1✕84 | Fully connected layer with 10 units (one for each class 0-9) | 1✕10
Final output: the predicted digit (e.g., '5').
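A hedged PyTorch sketch matching the layer stack in the table above; the tanh activations are an assumption (descriptions of LeNet vary between tanh and sigmoid), and softmax is left to the loss function:

```python
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),            # C1: 32x32x1 -> 28x28x6
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),     # S2: 28x28x6 -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),           # C3: 14x14x6 -> 10x10x16
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),     # S4: 10x10x16 -> 5x5x16
    nn.Flatten(),                              # -> 400
    nn.Linear(400, 120), nn.Tanh(),            # FC: 400 -> 120
    nn.Linear(120, 84), nn.Tanh(),             # FC: 120 -> 84
    nn.Linear(84, 10),                         # output: 10 classes (softmax in the loss)
)

x = torch.randn(1, 1, 32, 32)                  # a 28x28 image zero-padded to 32x32
print(lenet(x).shape)                          # torch.Size([1, 10])
```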
AlexNET Architecture
● AlexNet is a groundbreaking convolutional neural network (CNN) architecture that won the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.
● It marked a significant advancement in deep learning, particularly in image classification tasks,
and laid the foundation for future developments in computer vision.
● AlexNet consists of five convolutional layers, followed by three fully connected layers.
● It takes an input of size 227x227x3 (an RGB image).
AlexNET Architecture …continued
Layer | Type | Filter/Kernel Size | Stride | Padding | Output Size | Activation
Input | - | - | - | - | 227x227x3 | -
Conv1 | Convolutional | 11x11x96 | 4 | 0 (valid) | 55x55x96 | ReLU
Max Pooling 1 | Max Pooling | 3x3 | 2 | - | 27x27x96 | -
Conv2 | Convolutional | 5x5x256 | 1 | 2 (same) | 27x27x256 | ReLU
Max Pooling 2 | Max Pooling | 3x3 | 2 | - | 13x13x256 | -
Conv3 | Convolutional | 3x3x384 | 1 | 1 (same) | 13x13x384 | ReLU
Conv4 | Convolutional | 3x3x384 | 1 | 1 (same) | 13x13x384 | ReLU
Conv5 | Convolutional | 3x3x256 | 1 | 1 (same) | 13x13x256 | ReLU
Max Pooling 3 | Max Pooling | 3x3 | 2 | - | 6x6x256 | -
Flatten | - | - | - | - | 9216 | -
FC6 | Fully Connected | - | - | - | 4096 | ReLU + Dropout
FC7 | Fully Connected | - | - | - | 4096 | ReLU + Dropout
FC8 (Output Layer) | Fully Connected | - | - | - | 1000 (classes) | Softmax
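A shape-level PyTorch sketch following the table above; the dropout rate of 0.5 is an assumption, and softmax is applied inside the loss rather than in the model:

```python
import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),      # 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                      # -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),    # -> 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                      # -> 13x13x256
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),   # -> 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),   # -> 13x13x384
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),   # -> 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                      # -> 6x6x256
    nn.Flatten(),                                               # -> 9216
    nn.Linear(9216, 4096), nn.ReLU(), nn.Dropout(0.5),          # FC6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),          # FC7
    nn.Linear(4096, 1000),                                      # FC8 (softmax in the loss)
)

print(alexnet(torch.randn(1, 3, 227, 227)).shape)               # torch.Size([1, 1000])
```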
ResNET Architecture
● ResNet (Residual Network) is a deep convolutional neural network architecture introduced by Kaiming He et al. in 2015. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in the same year.
● Residual Learning:
● The core idea of ResNet is to learn residual functions instead of directly learning the desired
mapping. A residual function is defined as: F(x)=H(x)−x where H(x) is the desired underlying
mapping and x is the input.
● This means that the network is designed to learn the difference (or residual) between the input and
the output, making it easier to optimize.
● Skip Connections:
● ResNet employs skip connections that allow gradients to flow through the network without vanishing,
enabling effective training of very deep architectures.
● Skip connections bypass one or more layers, allowing the output of a layer to be added directly to the
output of a deeper layer: y=F(x)+x
● This architecture helps alleviate the vanishing gradient problem commonly encountered in deep networks, allowing for deeper models without degrading performance.
ResNET Architecture …continued
Layer Type | Output Size | Description
Input Layer | 224x224x3 | Input image.
Convolution Layer | 112x112x64 | 7x7 convolution with 64 filters, stride of 2, followed by ReLU activation.
Max Pooling Layer | 56x56x64 | 3x3 pooling layer with a stride of 2 to reduce spatial dimensions.
Residual Block (xN) | Varies (depends on depth) | Each block consists of two or three convolutional layers (3x3) with ReLU and skip connections.
Average Pooling Layer | 1x1x512 (e.g., for ResNet-34) | Global average pooling to reduce spatial dimensions.
Fully Connected Layer | 1000 | Final layer with 1000 neurons for classification, typically with softmax activation.
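A hedged sketch of a single residual block implementing y = F(x) + x, where F is two 3x3 convolutions; the channel count (64) is an assumption, and the batch normalization layers follow the original paper's basic block:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic block: y = F(x) + x, where F is two 3x3 convolutions."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.relu(self.bn1(self.conv1(x)))   # first conv of F(x)
        f = self.bn2(self.conv2(f))              # second conv of F(x)
        return self.relu(f + x)                  # skip connection: add the input back

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```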
Recurrent Neural Network
● It is a family of neural networks.
● It processes sequential data.
● CNNs are specialized for processing images, whereas RNNs are specialized for processing sequences of values x(1), …, x(τ).
● It can process sequences of both fixed and variable length.
● It shares the same weights across several time steps.
● It predicts output (relevant word) based on :
○ Current input
○ Previous hidden state
● The current input provides the current word.
● The previous hidden state captures contextual information about all the words in the sentence that the network has seen so far.
Recurrent Neural Network …continued
● E.g. The sun rises in the ____.
● Pass "The" as the input together with the initial hidden state h0; then pass "sun" as the next input, and so on.
● So, every time we pass an input word, we also pass the previous hidden state as an input.
● At the last step, pass "the" as the input along with hidden state h3.
● With the current input and h3, the next relevant word ("east") is predicted.
Fig. Unrolled RNN with hidden states h0-h4 in the hidden layer and an output layer predicting the word "east".
Recurrent Neural Network …continued
Difference between Feedforward Neural Network and Recurrent Neural Network
Fig. A feedforward neural network maps x → h → y in a single pass, whereas a recurrent neural network feeds the hidden state h back into itself at each time step.
Recurrent Neural Network …continued
Unfolded version of RNN
● Assume an input sentence with T words; unfolding the network gives T time steps (0 to T-1), one for each word.
● At time step t = 1, output y1 is predicted based on the current input x1 and the previous hidden state h0.
● At time step t = 2, output y2 is predicted based on the current input x2 and the previous hidden state h1.
Fig. The RNN unfolded in time: at each step t, the input xt and the previous hidden state ht-1 produce the hidden state ht and the output yt.
Recurrent Neural Network …continued
Forward Propagation in RNNs
Fig. Forward propagation in an RNN: the input x connects to the hidden layer h through weight matrix U, the hidden layer connects to the output y through weight matrix V, and the hidden layer connects back to itself across time steps through weight matrix W.
Recurrent Neural Network …continued
● Hidden state “h” at a time step t can be calculated as,
ht = tanh(Uxt + Wht-1)
● That is, the hidden state at time step t = tanh([input-to-hidden weight ✕ input] + [hidden-to-hidden weight ✕ previous hidden state]).
● The output at a time step t can be computed as,
● ŷt = softmax (Vht)
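A small sketch of this forward pass with explicit U, W, and V matrices; the input, hidden, and output dimensions are assumed for illustration:

```python
import torch

# assumed sizes: input dim 10, hidden dim 8, output dim 10
U = torch.randn(8, 10) * 0.1    # input-to-hidden weights
W = torch.randn(8, 8) * 0.1     # hidden-to-hidden weights
V = torch.randn(10, 8) * 0.1    # hidden-to-output weights

def rnn_forward(inputs):
    """inputs: list of vectors x_0 ... x_{T-1}, each of shape (10,)."""
    h = torch.zeros(8)                               # h_init
    outputs = []
    for x in inputs:
        h = torch.tanh(U @ x + W @ h)                # h_t = tanh(U x_t + W h_{t-1})
        outputs.append(torch.softmax(V @ h, dim=0))  # ŷ_t = softmax(V h_t)
    return outputs

ys = rnn_forward([torch.randn(10) for _ in range(5)])
print(len(ys), ys[-1].shape)                         # 5 torch.Size([10])
```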
Recurrent Neural Network …continued
The figure below shows how forward propagation works in an RNN.
Fig. Unrolled forward propagation: h0 = tanh(Ux0 + Whinit), h1 = tanh(Ux1 + Wh0), h2 = tanh(Ux2 + Wh1); each input xt enters through U and each hidden state ht feeds the output layer through V.
Types of Recurrent Neural Network
● One to One Architecture
● One to Many Architecture
● Many to One Architecture
● Many to Many Architecture
Types of Recurrent Neural Network
One to One Architecture
● A single input is mapped to a single output.
● The output from the time step t is fed as an input to the next time step.
● E.g., song generation.
Fig. One-to-one RNN architecture unrolled over time: inputs x0-x3 map to hidden states h0-h3, with one output per time step.
Types of Recurrent Neural Network
Many to One Architecture
● It takes a sequence of inputs and maps it to a single output value.
● A sentence is a sequence of words; at each time step one word is passed as the input, and the output is predicted at the final time step.
● E.g. sentiment classification.
Fig. Many-to-one RNN: inputs x0-x3 are processed through hidden states h0-h3, and a single output y is produced at the final time step.
Experiments Python Code
Experiment No 1: https://colab.research.google.com/drive/1NfxdzNvS-kmBHXTT_YAzjaOG8dGzNjyh?usp=drive_link
Experiment No 4: https://colab.research.google.com/drive/1l_7is0lQMrEpY29y3CC5fJ1_MRlhuHfi?usp=drive_link
Experiment No 6: https://colab.research.google.com/drive/1OQ8KSCcD7ntOexuh08spOvEzLR-Wth2Z?usp=drive_link
Experiment No 7: https://colab.research.google.com/drive/1WqeKFPsUK1vuxMDXFEXJ1FczE0pCFoPE?usp=drive_link
Experiment No 8: https://colab.research.google.com/drive/1pbHMydssnousWd3056U8Ji8O-kupRmX6?usp=drive_link
Experiment No 12: https://colab.research.google.com/drive/1KigGNSayaMMr2cmY5TwlfkfzbJsItXwA?usp=drive_link
Other:
Language Translation using RNN:
https://colab.research.google.com/drive/1Hemwdm4kkUcPQo3L8HAvfYcrL-J1ahIW?usp=drive_link
LeNet_Handwritting_recognition:
https://colab.research.google.com/drive/1uz9owpRldL-sxOsx3tdVYvG1nD4bK7fp?usp=drive_link