DL Unit 3 2019PAT
(410251)
(BE Computer 2019 PAT)
A.Y. 2022-23 SEM-II
Prepared by
Mr.Jameer Kotwal
Unit-3 Convolutional Neural Network (CNN)
• Introduction, CNN architecture overview
• The Basic Structure of a Convolutional Network: Padding, Strides, Typical Settings, the ReLU layer, Pooling, Fully Connected Layers, The Interleaving between Layers
• Local Response Normalization, Training a Convolutional Network
CNN
● A convolutional neural network, or CNN, is a network architecture for deep learning.
● It learns directly from images. A CNN is made up of several layers that process and transform an input to
produce an output.
● You can train a CNN to do image analysis tasks, including scene classification, object
detection and segmentation, and image processing.
● In order to understand how CNNs work, we'll cover three key concepts: local receptive fields,
shared weights and biases, and activation and pooling.
● Let's start with the concept of local receptive fields. In a typical neural network, each neuron in the input layer is connected to every neuron in the hidden layer. In a CNN, however, only a small region of input layer neurons connects to each neuron in the hidden layer. These regions are referred to as local receptive fields. The local receptive field is translated across an image to create a feature map from the input layer to the hidden layer neurons.
● The second concept is shared weights and biases. As in a typical neural network, the neurons in a CNN have weights and biases that are learned during training. However, in a CNN the weight and bias values are the same for all hidden neurons in a given layer. This means that all hidden neurons detect the same feature, such as an edge or a blob, in different regions of the image. This makes the network tolerant to translation of objects in an image. For example, a network trained to recognize cats will be able to do so wherever the cat is in the image.
● Our third and final concept is activation and pooling. The activation step applies a transformation to the output of each neuron by using activation functions. Rectified linear unit, or ReLU, is an example of a commonly used activation function. It takes the output of a neuron and, if the output is positive, passes it through unchanged; if the output is negative, the function maps it to zero.
● You can further transform the output of the activation step by applying a pooling step. Pooling reduces the dimensionality of the feature map by condensing the output of small regions of neurons into a single output. This helps simplify the following layers and reduces the number of parameters that the model needs to learn.
● For example, the first hidden layer learns how to detect edges, and the last learns how to detect more complex shapes. Just like in a
typical neural network, the final layer connects every neuron, from the last hidden layer to the output neurons. This produces the final
output. There are three ways to use CNNs for image analysis.
● The first method is to train the CNN from scratch. This method is highly accurate, although it is also the most challenging, as you
might need hundreds of thousands of labeled images and significant computational resources.
● The second method relies on transfer learning, which is based on the idea that you can use knowledge of one type of problem to
solve a similar problem. For example, you could use a CNN model that has been trained to recognize animals to initialize and train a
new model that differentiates between cars and trucks.
● A convolutional neural network has an input layer, an output layer, many hidden layers, and millions of parameters, which give it the ability to learn complex objects and patterns.
● It sub-samples the given input through convolution and pooling and applies an activation function. These operations form the hidden layers, which are partially connected; at the end, a fully connected layer produces the output layer.
● With suitable padding, the output of a convolution can retain the spatial dimensions of the input image.
1.1 Convolution
● Convolution is an operation that combines two functions to produce a third function. In CNNs, the input image is convolved with filters to produce a feature map.
● Filters are randomly initialized arrays of weights, with an associated bias, that the network learns during training. The same weights and bias are shared among the neurons of a feature map, instead of each neuron having unique weights and bias. Many filters can be generated, and every filter captures a different feature from the input. Filters are also referred to as kernels.
Convolution Layer - Convolutions occur in convolution layer which are the building blocks of
CNN. This layer generally has
● Padding adds zeros around the border of the input matrix so that the output can have the same dimensions as the input. It gives the kernel more room to cover the image, which makes the analysis of images more accurate. Thanks to padding, information at the borders of the image is preserved in the same way as information at the center.
● Stride controls how the filter convolves over the input, i.e., the number of pixels by which the filter shifts over the input matrix. If the stride is 1, the filter moves 1 pixel at a time; if the stride is 2, it moves 2 pixels at a time. The larger the stride, the smaller the resulting output, and vice versa.
ReLU is computed after convolution. It is the most commonly deployed activation function and allows the neural network to account for non-linear relationships. In a given matrix (x), ReLU sets all negative values to zero and leaves all other values unchanged. It is mathematically represented as: \( f(x) = \max(0, x) \)
http://makeyourownneuralnetwork.blogspot.com/2020/02/calculating-output-size-of-convolutions.html
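The ReLU definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the slides; the function name `relu` and the sample matrix are made up for the example.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: negative entries become 0, all others are unchanged."""
    return np.maximum(0, x)

# Hypothetical 2x2 feature map with a mix of negative and positive values.
feature_map = np.array([[-2.0, 1.5],
                        [0.0, -0.5]])
activated = relu(feature_map)
print(activated)  # [[0.  1.5] [0.  0. ]]
```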
Pooling / Sub-sampling Layer
Next comes a pooling layer. The pooling layer operates on each feature map independently. It reduces the resolution of the feature map by reducing its height and width, but retains the features of the map required for classification. This is called down-sampling.
● Max pooling: It selects the maximum element from each patch of the feature map. The resulting max-pooled layer holds the most prominent features of the feature map. It is the most common approach, as it often gives better results.
● Average pooling: It computes the average of each patch of the feature map.
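To make the two pooling types concrete, here is a minimal NumPy sketch. It assumes a single 2D feature map and a square window; the function name `pool2d`, the 2x2 window with stride 2, and the sample values are illustrative choices, not taken from the slides.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Down-sample a 2D feature map with max or average pooling."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the patch covered by the window and condense it to one value.
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = reduce(patch)
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]], dtype=float)
max_pooled = pool2d(fm, mode="max")  # [[6, 4], [8, 9]]
avg_pooled = pool2d(fm, mode="avg")  # [[3.75, 2.25], [4.5, 4.0]]
```

Both outputs are 2x2: the height and width are halved, while the largest (or average) response in each patch is kept.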
Why is pooling important?
● It progressively reduces the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network and also helps control overfitting. Without pooling, the output would have the same resolution as the input.
● There can be any number of convolution, ReLU, and pooling layers. The initial convolution layers learn generic information, and the last layers learn more specific/complex features. After the final convolution, ReLU, and pooling layers, the output feature map (matrix) is converted into a vector (one-dimensional array). This is called the flatten layer.
● The feature vector from the flatten layer is passed to the fully connected layer, which is used to classify images into different categories after training. All the inputs from this layer are connected to every activation unit of the next layer. Since most of the parameters are concentrated in the fully connected layer, it is prone to overfitting. Dropout is one of the techniques that reduces overfitting.
● Dropout
● Dropout is an approach used for regularization in neural networks. It is a technique where randomly chosen nodes are ignored in the network at each step of the training phase.
● The dropout rate is usually 0.5, and it can be tuned to produce the best results; dropout can also improve training speed. This method of regularization reduces node-to-node interactions in the network, which leads to learning of important features and also helps the network generalize better to new data.
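A minimal sketch of dropout in NumPy is shown below. The slides only say that randomly chosen nodes are ignored during training; the rescaling of the surviving activations by 1/(1 - rate), often called "inverted dropout", is a common convention assumed here, and the function name `dropout` and its arguments are hypothetical.

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=None):
    """Randomly zero a fraction `rate` of activations during training."""
    if not training or rate == 0.0:
        return activations  # at inference time nothing is dropped
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - rate
    # Boolean mask: True where a node is kept, False where it is ignored.
    mask = rng.random(activations.shape) < keep_prob
    # Assumed convention: rescale kept nodes so the expected sum is unchanged.
    return activations * mask / keep_prob

x = np.ones((4, 5))
dropped = dropout(x, rate=0.5, rng=np.random.default_rng(0))
```

With a rate of 0.5, each entry of `dropped` is either 0 (node ignored) or 2 (node kept and rescaled).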
Soft-Max Layer
Soft-max is an activation layer normally applied to the last layer of the network, where it acts as a classifier. Classification of a given input into distinct classes takes place at this layer. The soft-max function maps the non-normalized output of a network to a probability distribution.
● The output of the last fully connected layer is passed to the soft-max layer, which converts it into probabilities.
● Soft-max assigns a decimal probability to each class in a multi-class problem; these probabilities sum to 1.0.
● This allows the output to be interpreted directly as a probability.
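The mapping from raw scores to probabilities can be sketched as follows. This is an illustrative NumPy version; subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not mentioned in the slides, and the scores are made-up values.

```python
import numpy as np

def softmax(logits):
    """Map a vector of raw scores to a probability distribution."""
    shifted = logits - np.max(logits)  # stability: avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical outputs of the last FC layer
probs = softmax(scores)
predicted_class = int(np.argmax(probs))
```

The entries of `probs` are all positive and sum to 1.0, and the class with the largest raw score also receives the largest probability.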
Why Convolutions
● Parameter sharing: a feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
● Basic Convolution Operation
Step 1: overlay the filter on the input, perform element-wise multiplication, and add up the results.
● Step 2: move the overlay right by one position (or according to the stride setting), and repeat the same calculation to get the next result. Continue in this way until the whole input has been covered.
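The two steps above can be sketched directly as nested loops. This is a minimal NumPy illustration that assumes a single-channel input, a square filter, and no padding; as is usual in CNNs, the filter is not flipped, so strictly speaking this is cross-correlation. The name `conv2d` and the sample values are hypothetical.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2D image (no padding)."""
    h, w = image.shape
    f = kernel.shape[0]
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Step 1: overlay the filter, multiply element-wise, and add.
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(patch * kernel)
            # Step 2: the loops move the overlay by `stride` positions.
    return out

image = np.array([[1, 2, 3, 0],
                  [0, 1, 2, 3],
                  [3, 0, 1, 2],
                  [2, 3, 0, 1]], dtype=float)
kernel = np.array([[1, 0],
                   [0, 1]], dtype=float)
result = conv2d(image, kernel)               # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
result_s2 = conv2d(image, kernel, stride=2)  # [[2, 6], [6, 2]]
```

A larger stride visits fewer positions, which is why the stride-2 result is smaller than the stride-1 result.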
Padding - Padding has the following benefits:
1. It allows us to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as we go to deeper layers.
If we have an activation map of size W x W x D, a pooling kernel of spatial size F, and stride S, then the size of the output volume is \( W_{out} \times W_{out} \times D \), where \( W_{out} = \frac{W - F}{S} + 1 \).
For a convolution over an input of size W x W x D using \( D_{out} \) filters of spatial size F with stride S and amount of padding P, the size of the output volume is \( W_{out} \times W_{out} \times D_{out} \), where \( W_{out} = \frac{W - F + 2P}{S} + 1 \).
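As a quick check of the output-size calculation, the standard relations \( (W - F + 2P)/S + 1 \) for convolution and \( (W - F)/S + 1 \) for pooling can be wrapped in two small helpers. The function names are hypothetical, and rounding down when the division is not exact is an assumption (frameworks differ in how they handle that case).

```python
def conv_output_size(w, f, p=0, s=1):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

def pool_output_size(w, f, s):
    """Spatial output size of a pooling layer: (W - F) / S + 1."""
    return (w - f) // s + 1

no_padding = conv_output_size(32, 5)            # 28: the map shrinks
same_padding = conv_output_size(224, 3, p=1)    # 224: padding keeps the size
halved = pool_output_size(28, 2, 2)             # 14: 2x2 pooling, stride 2
```

The second call illustrates the benefit of padding described above: with P = 1 and a 3 x 3 filter, the height and width do not shrink.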
Convolution Operation on Volume - When the input has more than one channel (e.g. an RGB image), the filter must have a matching number of channels.
Convolution parameters
● Convolution Operation with Multiple Filters
● Multiple filters can be used in a convolution layer to detect multiple features. The output of the layer then has the same number of channels as the number of filters in the layer.
● The total number of multiplications needed to calculate the result is (number of output values) x (multiplications per output value) = (4 x 4 x 2) x (3 x 3 x 3) = 864.
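The multiplication count can be reproduced with a small helper. The sketch below assumes the setting implied by the numbers on the slide: a 4 x 4 x 2 output produced by 2 filters of size 3 x 3 x 3 (which would correspond to a 6 x 6 x 3 input with stride 1 and no padding; the input size is an assumption). The function name is hypothetical.

```python
def conv_multiplications(out_h, out_w, num_filters, f, in_channels):
    """Multiplications needed for one convolution layer (bias ignored)."""
    output_values = out_h * out_w * num_filters  # one value per output entry
    per_output = f * f * in_channels             # one multiply per filter weight
    return output_values * per_output

total = conv_multiplications(out_h=4, out_w=4, num_filters=2, f=3, in_channels=3)
print(total)  # 864
```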
● 1 x 1 Convolution
● This is convolution with 1 x 1 filter. The effect is to flatten or “merge” channels together, which
can save computations later in the network:
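A 1 x 1 convolution is simply a weighted sum over the channels at each pixel, so it can be sketched as a matrix product along the channel axis. The shapes below (a 28 x 28 x 192 volume reduced to 16 channels) follow the Inception example later in these slides; the function name, the random data, and the omission of bias and activation are simplifications for illustration.

```python
import numpy as np

def conv1x1(volume, filters):
    """Apply a 1x1 convolution.

    volume:  (H, W, C_in) input
    filters: (C_in, C_out), one column of channel weights per filter
    """
    # At every (h, w) position, mix the C_in channels into C_out channels.
    return volume @ filters

rng = np.random.default_rng(0)
volume = rng.standard_normal((28, 28, 192))
filters = rng.standard_normal((192, 16))
reduced = conv1x1(volume, filters)
print(reduced.shape)  # (28, 28, 16)
```

The height and width are unchanged; only the number of channels shrinks, which is what makes later, larger filters cheaper to apply.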
● One Convolution Layer
● Finally, to make up a convolution layer, a bias (a real number, \( b \in \mathbb{R} \)) is added and an activation function such as ReLU or tanh is applied.
● Shorthand Representation
● This simpler representation will be used from now on to represent one convolutional layer:
Sample Complete Network- This is a sample network with three convolution layers. At the end of the network, the output
of the convolution layer is flattened and is connected to a logistic regression or a softmax output layer.
● Pooling Layer
● A pooling layer is used to reduce the size of the representations and to speed up calculations, as well as to make some of the features it detects a bit more robust.
● Common types of pooling are max pooling and average pooling; these days max pooling is more common.
● It has hyper-parameters:
○ size (f)
○ stride (s)
○ type (max or avg)
● It has no learnable parameters; there is nothing for gradient descent to learn.
● When done on input with multiple channels, pooling reduces the height and width (nW and
nH) but keeps nC unchanged:
CNN-LeNet – 5 Architecture
● Well Known Architectures
● Classic Network: LeNet – 5
● Number of parameters: ~60 thousand. The network has 5 layers with learnable parameters and is hence named LeNet-5.
● It has three convolution layers, interleaved with average pooling layers.
● After the convolution and average pooling layers, we have two fully connected layers. Finally, a Softmax classifier assigns each image to its class.
● The input to this model is a 32X32 grayscale image, hence the number of channels is one.
● A ConvNet arranges its neurons in three dimensions (width, height, depth), so the input is treated as a 32X32X1 volume.
We then apply the first convolution operation with 6 filters of size 5X5. As a result, we get a feature map of size 28X28X6. Here the number of channels is equal to the number of filters applied.
● After the first convolution operation, we apply average pooling, and the size of the feature map is reduced by half, to 14X14X6. Note that the number of channels is intact.
Next, we have a convolution layer with sixteen filters of size 5X5. The feature map changes again, to 10X10X16. The output size is calculated in a similar manner. After this, we again apply an average pooling (subsampling) layer, which again reduces the size of the feature map by half, i.e. to 5X5X16.
● Then we have a final convolution layer with 120 filters of size 5X5, as shown in the above image, leaving a feature map of size 1X1X120. Flattening this result gives 120 values.
● After these convolution layers, we have a fully connected layer with eighty-four neurons. Finally, we have an output layer with ten neurons, since the data have ten classes.
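The feature-map sizes quoted for LeNet-5 can be traced with a short script. It assumes stride 1 and no padding for the convolutions and a 2 x 2 window with stride 2 for the average pooling layers (the slides only say that pooling halves the size); the layer names and the helper `out_size` are hypothetical.

```python
def out_size(w, f, s=1, p=0):
    """Spatial output size: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# (name, filter size, stride, number of filters; None means pooling)
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, None),
    ("conv3", 5, 1, 120),
]

size, channels = 32, 1  # 32X32 grayscale input
shapes = []
for name, f, s, filters in layers:
    size = out_size(size, f, s)
    if filters is not None:  # pooling keeps the channel count intact
        channels = filters
    shapes.append((name, size, size, channels))

for shape in shapes:
    print(shape)
# ('conv1', 28, 28, 6), ('pool1', 14, 14, 6), ('conv2', 10, 10, 16),
# ('pool2', 5, 5, 16), ('conv3', 1, 1, 120)
```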
CNN-AlexNet
● Classic Network: AlexNet (Number of parameters: ~60 million.)
● AlexNet is another classic CNN architecture, from
ImageNet Classification with Deep Convolutional Neural Networks
● Calculation of AlexNet Parameter
CNN-AlexNet Example
● The number of parameters in the second layer, CONV1 (filter shape = 5*5, stride = 1), and in the fourth layer, CONV2 (filter shape = 5*5, stride = 1), is computed in the same way: ((width of filter * height of filter * number of filters in the previous layer + 1) * number of filters).
● Just because there are no parameters in the pooling layer, it does not imply that pooling has no role in backprop. The pooling layer is responsible for passing values on to the next and previous layers during forward and backward propagation, respectively.
● There are no trainable parameters in a max-pooling layer. In the forward pass, it passes the maximum value within each rectangle to the next layer.
● Convolution: Combine filter values and input values (multiply and add).
● Pooling: Only use input values.
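The parameter formula for convolution layers can be checked in code. Because the AlexNet layer sizes are not listed in this text, the sketch below applies the formula to the LeNet-5 layers described earlier in these slides instead. The helper names are hypothetical, and the fully connected formula, (inputs + 1) * outputs, is the standard one rather than something stated on the slides.

```python
def conv_params(f_w, f_h, prev_filters, num_filters):
    """(filter width * filter height * previous filters + 1 bias) * filters."""
    return (f_w * f_h * prev_filters + 1) * num_filters

def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input plus one bias, per output."""
    return (n_in + 1) * n_out

lenet5 = {
    "conv1": conv_params(5, 5, 1, 6),      # 156
    "pool1": 0,                            # pooling has nothing to learn
    "conv2": conv_params(5, 5, 6, 16),     # 2,416
    "pool2": 0,
    "conv3": conv_params(5, 5, 16, 120),   # 48,120
    "fc": dense_params(120, 84),           # 10,164
    "output": dense_params(84, 10),        # 850
}
total_params = sum(lenet5.values())
print(total_params)  # 61706
```

The total of 61,706 is consistent with the "~60 thousand" figure quoted for LeNet-5, and exactly five entries are non-zero, matching the five layers with learnable parameters.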
CNN-VGG-16
● VGG-16 from
Very Deep Convolutional Networks for Large-Scale Image Recognition.
● The number 16 refers to the fact that the network has 16 trainable layers (i.e. layers that have weights).
● The figure below is another way to visually depict the layers in a network. In the case of VGG-16 there are five
convolutional blocks (Conv-1 to Conv-5). The specific layers within a convolutional block can vary depending on
the architecture.
● However, a convolutional block typically contains one or more 2D convolutional layers followed by a pooling layer.
Other layers are also sometimes incorporated, but we will focus on these two layer types to keep things simple.
● Notice that we have explicitly specified the last layer of the network as SoftMax. This layer applies the softmax
function to the outputs from the last fully connected layer in the network (FC-8).
● It converts the raw output values from the network to normalized values in the range [0,1], which we can interpret as
a probability score for each class.
● In the case of VGG-16, the input is a color image with the shape: (224x224x3). Here we depict a high-level
view of a convolutional layer. Convolutional layers use filters to process the input data. The filter moves
across the input, and at each filter location, a convolution operation is performed, which produces a single
number. This value is then passed through an activation function, and the output from the activation
function populates the corresponding entry in the output, also known as an activation map (224x224x1).
You can think of the activation map as containing a summary of features from the input via the
convolution process.
● Convolutional Layer with a Single Filter
● Let’s now look at a concrete example of a simple convolutional layer. Suppose the input image has a size of
(224x224x3). The spatial size of a filter is a design choice, but it must have a depth of three to match the
depth of the input image. In this example, we use a single filter of size (3x3x3).
● Here we have chosen to use just a single filter. Therefore the filter contains three kernels where each kernel has nine
trainable weights. There are a total of 27 trainable weights in this filter, plus a single bias term, for 28 total trainable
parameters. Because we’ve chosen just a single filter, the depth of our output is one, which means we produce just a
single channel activation map shown. When we convolve this single (3-channel) filter with the (3-channel) input, the
convolution operation is performed for each channel separately. The weighted sum of all three channels plus a bias
term is then passed through an activation function whose output is represented as a single number in the output
activation map (shown in blue).
● So to summarize, the number of channels in a filter must match the number of channels in the input. And the number
of filters in a convolutional layer (a design choice) dictates the number of activation maps that are produced by the
convolutional layer.
● This next example represents a more general case. First, notice that the input has a depth of three, but this doesn’t
necessarily correspond to color channels. It just means that we have an input tensor that has three channels.
Remember that when we refer to the input, we don’t necessarily mean the input to the neural network. But rather the
input to this convolutional layer which could represent the output from a previous layer in the network.
● Each filter learns to detect different structural elements (e.g., horizontal lines, vertical lines, and diagonal lines). To be more precise, these “lines” represent edge structures in an image. In the output activation maps, we emphasize the highly activated neurons associated with each filter.
● A common pattern in CNN architectures like VGG-16 is that there are usually two or three consecutive convolutional layers followed by a max pooling layer. Convolutional layers usually contain at least 32 filters.
● Fully Connected Classifier
● The fully connected (dense) layers in a CNN architecture transform features into class probabilities.
In the case of VGG-16, the output from the last convolutional block (Conv-5) is a series of activation
maps with shape (7x7x512). For reference, we have indicated the number of channels at key points in
the architecture.
Before the data from the last convolutional layer in the feature extractor can flow through the classifier, it needs to be
flattened to a 1-dimensional vector of length 25,088. After flattening, this 1-dimensional layer is then fully connected to
FC-6, as shown below.
● Flattening the output of Conv-5 is required for processing the activation maps through the classifier, but it does not
alter the original spatial interpretation of the data. It’s just a repacking of the data for processing purposes.
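The flattening step can be illustrated with a NumPy reshape. The array below is a stand-in for the Conv-5 output filled with dummy values; only the shape (7x7x512) comes from the text.

```python
import numpy as np

# Placeholder for the Conv-5 output: 7 x 7 spatial positions, 512 channels.
conv5_output = np.arange(7 * 7 * 512, dtype=np.float32).reshape(7, 7, 512)

flat = conv5_output.reshape(-1)    # 1-dimensional vector fed to FC-6
restored = flat.reshape(7, 7, 512) # the same values, repacked back

print(flat.shape)  # (25088,)
```

Reshaping back recovers the original activation maps exactly, which illustrates that flattening only repacks the data.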
Summary
● CNNs designed for a classification task contain an upstream feature extractor and a
downstream classifier.
● The feature extractor comprises convolutional blocks with a similar structure
composed of one or more convolutional layers followed by a max pooling layer.
● The convolutional layers extract features from the previous layer and store the results
in activation maps.
● The number of filters in a convolutional layer is a design choice in the architecture of a
model, but the number of kernels within a filter is dictated by the depth of the input
tensor.
● The depth of the output from a convolutional layer (the number of activation maps) is
dictated by the number of filters in the layer.
● Pooling layers are often used at the end of a convolutional block to downsize the
activation maps. This reduces the total number of trainable parameters in the network
and, therefore, the training time required. Additionally, this also helps mitigate
overfitting.
● The classifier portion of the network transforms the extracted features into class
probabilities using one or more densely connected layers.
● When the number of classes is more than two, a SoftMax layer is used to normalize the
raw output from the classifier to the range [0,1]. These normalized values can be
interpreted as the probability that the input image corresponds to the class label for
each output neuron.
CNN-Inception
● Inception - The motivation of the inception network is that, rather than requiring us to pick the filter size manually, we let the network decide what is best to put in a layer. We give it choices, and hopefully it will pick what is best to use in that layer:
The problem with the above network is computation cost (e.g. for the 5 x 5 filter only, the computation cost is about 120 million multiplications).
Using 1 x 1 convolution will reduce the computation to about 1/10 of that. With this idea, an inception module will look like this:
At last, all the channels in the network are concatenated together i.e. (28 x 28 x (64 +
128 + 32 + 32)) = 28 x 28 x 256.
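The channel concatenation can be sketched with NumPy. The four arrays stand in for the outputs of the parallel branches of an inception module; only their channel counts (64, 128, 32, 32) and the 28 x 28 spatial size come from the text, and the zero-filled data is a placeholder.

```python
import numpy as np

branch_channels = (64, 128, 32, 32)
# Each branch must produce the same height and width (28 x 28) to be stackable.
branches = [np.zeros((28, 28, c)) for c in branch_channels]

# Stack the branch outputs along the channel (last) axis.
merged = np.concatenate(branches, axis=-1)
print(merged.shape)  # (28, 28, 256)
```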
● Now let’s look at the computational cost. To apply this \(1 \times 1 \) convolution we have \(16 \) filters, each of dimension \(1 \times 1 \times 192 \). Computing each value of the resulting \(28 \times 28 \times 16 \) volume takes \(192 \) multiplications, so this step costs \(28 \times 28 \times 16 \times 192 \) multiplications.
● The cost of this first convolutional layer is therefore about \(2.4 \) million multiplications. The cost of the second convolutional layer is about \(10.0 \) million multiplications, because we apply \(32 \) filters of dimension \(5\times 5 \times 16 \) to produce the \(28\times 28\times 32 \) output volume.
● So, the total number of multiplications we need to do is the sum of those, about \(12.4 \) million. Compared with the previous example, we have reduced the computational cost from about \(120 \) million multiplications to about 1/10 of that. The number of additions we need to perform is very similar to the number of multiplications.
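The figures above can be verified with a few lines of arithmetic. The helper name is hypothetical; the shapes (a 28 x 28 x 192 input, a 16-channel bottleneck, and a 28 x 28 x 32 output from 5 x 5 filters) are the ones used in the text, with the output size assumed to be preserved by padding.

```python
def conv_multiplications(out_h, out_w, num_filters, f, in_channels):
    """Multiplications = number of output values * multiplications per value."""
    return (out_h * out_w * num_filters) * (f * f * in_channels)

# Direct 5x5 convolution on the 192-channel input.
direct = conv_multiplications(28, 28, 32, 5, 192)      # 120,422,400

# Bottleneck: 1x1 convolution down to 16 channels, then the 5x5 convolution.
reduce_step = conv_multiplications(28, 28, 16, 1, 192) # 2,408,448
expand_step = conv_multiplications(28, 28, 32, 5, 16)  # 10,035,200
bottleneck = reduce_step + expand_step                 # 12,443,648

ratio = bottleneck / direct  # roughly 0.10
```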
CNN-Local Response Normalization
● Why Normalization?
● Normalization has become important for deep neural networks because it compensates for the unbounded nature of certain activation functions such as ReLU and ELU. With these activation functions, the layer outputs are not constrained to a bounded range (such as [-1,1] for tanh); rather, they can grow as high as training allows. To keep the unbounded activations from inflating the output values, normalization is used just before the activation function.
● Local Response Normalization
● Local Response Normalization (LRN) was first introduced in AlexNet architecture where the
activation function used was ReLU as opposed to the more common tanh and sigmoid at that
time. Apart from the reason mentioned above, the reason for using LRN was to encourage
lateral inhibition.
● It is a concept in Neurobiology that refers to the capacity of a neuron to reduce the activity of its
neighbors [1]. In DNNs, the purpose of this lateral inhibition is to carry out local contrast
enhancement so that locally maximum pixel values are used as excitation for the next layers.
● LRN is a non-trainable layer that square-normalizes the pixel values in a feature map within a local neighborhood. There are two types of LRN, based on the neighborhood defined, as can be seen in the figure below.
● Inter-Channel LRN: This is what the AlexNet paper originally used. The neighborhood is defined across the channels. For each (x,y) position, the normalization is carried out in the depth dimension and is given by the following formula:
\( b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta} \)
https://hackmd.io/@imkushwaha/alexnet
where i indicates the output of filter i, a(x,y) and b(x,y) are the pixel values at position (x,y) before and after normalization, and N is the total number of channels.
The constants (k,α,β,n) are hyper-parameters. k is used to avoid any singularities (division by zero), α scales the sum of squares, and β is the exponent applied to the denominator.
The constant n defines the neighborhood length, i.e. how many consecutive pixel values are considered when carrying out the normalization.
The case (k,α,β,n)=(0,1,1,N) is the standard normalization. In the figure above, n is taken to be 2 while N=4.
● Different colors denote different channels, and hence N=4. Let's take the hyper-parameters to be (k,α,β,n)=(0,1,1,2).
● The value n=2 means that while calculating the normalized value at position (i,x,y), we consider the values at the same position for the previous and next filter, i.e. (i-1,x,y) and (i+1,x,y).
● For (i,x,y)=(0,0,0) we have value(i,x,y)=1, value(i-1,x,y) doesn’t exist, and value(i+1,x,y)=1. Hence normalized_value(i,x,y) = 1/(1² + 1²) = 0.5, as can be seen in the lower part of the figure above.
● The rest of the normalized values are calculated in a similar way.
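The worked example above can be reproduced with a small NumPy sketch of inter-channel LRN. The function name and the channels-first layout (N, H, W) are choices made for this illustration. Only the first two channel values (both 1) come from the text; the remaining two values are made up, since the figure is not reproduced here.

```python
import numpy as np

def inter_channel_lrn(a, k=0.0, alpha=1.0, beta=1.0, n=2):
    """Inter-channel LRN for activations `a` of shape (N, H, W)."""
    num_channels = a.shape[0]
    b = np.zeros_like(a, dtype=float)
    for i in range(num_channels):
        # Neighborhood of channels around i, clipped at the boundaries.
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        sum_sq = np.sum(a[lo:hi + 1] ** 2, axis=0)
        b[i] = a[i] / (k + alpha * sum_sq) ** beta
    return b

# N = 4 channels at a single (x, y) position; the last two values are hypothetical.
a = np.array([1.0, 1.0, 2.0, 2.0]).reshape(4, 1, 1)
b = inter_channel_lrn(a, k=0.0, alpha=1.0, beta=1.0, n=2)
print(b[0, 0, 0])  # 0.5, i.e. 1 / (1**2 + 1**2)
```

Note that with k = 0 the denominator becomes zero wherever a whole neighborhood is zero, which is why k is normally set to a positive value in practice.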
Intra-Channel LRN: In Intra-Channel LRN, the neighborhood is extended within the same channel only, as can be seen in the figure above. The formula is given by
\( b^{c}_{x,y} = a^{c}_{x,y} \Big/ \left( k + \alpha \sum_{i=\max(0,\, x-n/2)}^{\min(W-1,\, x+n/2)} \; \sum_{j=\max(0,\, y-n/2)}^{\min(H-1,\, y+n/2)} \left( a^{c}_{i,j} \right)^{2} \right)^{\beta} \)
● where c is the channel index and (W,H) are the width and height of the feature map (for example, in the figure above (W,H) = (8,8)). The only difference between Inter-Channel and Intra-Channel LRN is the neighborhood used for normalization. The figure below shows the Intra-Channel normalization on a 5x5 feature map with n=2 (i.e. a 2D neighborhood of size (n+1)x(n+1) centered at (x,y)).
In batch normalization, the output of hidden neurons is processed in the following manner before
being fed to the activation function.
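The processing step is not reproduced in this text, so the sketch below uses the standard batch normalization formulation as an assumption: subtract the batch mean, divide by the batch standard deviation, then apply a learnable scale and shift. The names `gamma`, `beta`, and `eps`, and the (batch, features) layout, are choices made for this illustration; it covers the training-time computation only.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x: (batch, features) pre-activation outputs of a hidden layer
    gamma, beta: learnable scale and shift, one per feature
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])
normalized = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With `gamma = 1` and `beta = 0`, every feature of `normalized` has approximately zero mean and unit variance across the batch. Unlike LRN, the scale and shift are trainable.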
Comparison:
LRN has multiple directions to perform normalization across (Inter or Intra Channel), on the other
hand, BN has only one way of being carried out (for each pixel position across all the activations).
The table below compares the two normalization techniques.
Training a Convolutional Network