
Deep Learning

(410251)
(BE Computer 2019 PAT)
A.Y. 2022-23 SEM-II

Prepared by
Mr. Jameer Kotwal
Unit-3 Convolutional Neural Network (CNN)
• Introduction, CNN architecture overview
• The Basic Structure of a Convolutional Network - Padding
• Strides, Typical Settings, the ReLU layer, Pooling
• Fully Connected Layers, The Interleaving between Layers
• Local Response Normalization, Training a Convolutional Network
CNN
● A convolutional neural network, or CNN, is a network architecture for deep learning.
● It learns directly from images. A CNN is made up of several layers that process and transform an input to produce an output.
● You can train a CNN to do image analysis tasks, including scene classification, object detection and segmentation, and image processing.
● In order to understand how CNNs work, we'll cover three key concepts: local receptive fields, shared weights and biases, and activation and pooling.
● Let's start with the concept of local receptive fields. In a typical neural network, each neuron in the input layer is connected to a neuron in the hidden layer. However, in a CNN, only a small region of input layer neurons connects to neurons in the hidden layer. These regions are referred to as local receptive fields. The local receptive field is translated across an image to create a feature map from the input layer to the hidden layer neurons.
CNN

However, in the case of CNNs, the weights and bias values are the same for all hidden neurons in a given layer. This means that all hidden neurons detect the same feature, such as an edge or a blob, in different regions of the image. This makes the network tolerant to translation of objects in an image. For example, a network trained to recognize cats will be able to do so wherever the cat is in the image.
CNN
● Our third and final concept is activation and pooling. The activation step applies a transformation to the output of each neuron by using activation functions. Rectified linear unit, or ReLU, is an example of a commonly used activation function. It passes a positive output of a neuron through unchanged, and if the output is negative, the function maps it to zero.
● You can further transform the output of the activation step by applying a pooling step. Pooling reduces the dimensionality of the feature map by condensing the output of small regions of neurons into a single output. This helps simplify the following layers and reduces the number of parameters that the model needs to learn.
CNN
● For example, the first hidden layer learns how to detect edges, and the last learns how to detect more complex shapes. Just like in a typical neural network, the final layer connects every neuron from the last hidden layer to the output neurons. This produces the final output. There are three ways to use CNNs for image analysis.
● The first method is to train the CNN from scratch. This method is highly accurate, although it is also the most challenging, as you might need hundreds of thousands of labeled images and significant computational resources.
● The second method relies on transfer learning, which is based on the idea that you can use knowledge of one type of problem to solve a similar problem. For example, you could use a CNN model that has been trained to recognize animals to initialize and train a new model that differentiates between cars and trucks.
● The third method is to use a pretrained CNN as a feature extractor: the activations from an intermediate layer are used as features to train a separate classifier.
CNN
● A convolutional neural network has an input layer, an output layer, many hidden layers and millions of parameters, giving it the ability to learn complex objects and patterns.
● It sub-samples the given input by convolution and pooling processes and applies an activation function; all of these are the hidden layers, which are partially connected, and at the very end is the fully connected layer that results in the output layer.
● The output retains the original shape, similar to the input image dimensions.

1.1 Convolution

● Convolution is an operation that combines two functions to produce a third function as a result. In CNNs, the input image is convolved with filters to produce a feature map.

1.2 Filters / Kernels

● Filters are randomly initialized vectors of weights and biases in the network. The same weights and bias are shared among various neurons in a CNN, instead of unique weights and bias for each neuron. Many filters can be generated, where every filter captures a unique feature from the input. Filters are also referred to as kernels.
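To make this concrete, here is a minimal NumPy sketch of ours (the function name conv2d_valid is our own): it slides a filter over a single-channel image with stride 1 and no padding:

    import numpy as np

    def conv2d_valid(image, kernel):
        """'Valid' 2D convolution (cross-correlation, as used in CNNs):
        slide the kernel over the image, multiply element-wise, and sum."""
        ih, iw = image.shape
        kh, kw = kernel.shape
        oh, ow = ih - kh + 1, iw - kw + 1          # output size = input - kernel + 1
        feature_map = np.zeros((oh, ow))
        for y in range(oh):
            for x in range(ow):
                patch = image[y:y + kh, x:x + kw]  # local receptive field
                feature_map[y, x] = np.sum(patch * kernel)
        return feature_map

    image = np.arange(36, dtype=float).reshape(6, 6)
    kernel = np.array([[1., 0., -1.]] * 3)         # a simple vertical-edge detector
    print(conv2d_valid(image, kernel).shape)       # (4, 4)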

CNN
Convolution Layer - Convolutions occur in the convolution layer, which is the building block of a CNN. This layer generally has:

● Input vectors (Image)
● Filters (Feature Detector)
● Output vectors (Feature map)
● Input Image x Feature Detector = Feature Map
● This layer identifies and extracts the best features/patterns from the input image and preserves the generic information in a matrix. The matrix representation of the input image is multiplied element-wise with the filter and summed up to produce a feature map, which is the same as a dot product between the corresponding vectors.
● Convolution involves the following important features:
● Local connectivity
● Each neuron is connected only to a subset of the input image (unlike a regular neural network, where all neurons are fully connected). In a CNN, a filter of a certain dimension is chosen, which slides over these subsets of the input data. Multiple filters are present in a CNN, where each filter moves over the entire image and learns a different portion of the input image.
● Parameter Sharing
● Parameter sharing is the sharing of weights by all neurons in a particular feature map. All of them share the same weights, hence the name parameter sharing.
CNN
● Batch Normalization
● Batch normalization is generally done between the convolution and activation (ReLU) layers. It normalizes the inputs at each layer, reduces internal covariate shift (change in the distribution of network activations), and is a method to regularize a convolutional network.
● Batch normalization allows higher learning rates, which can reduce training time, and gives better performance. It allows each layer to learn by itself without being overly dependent on other layers. Dropout, which is also a regularization technique, is less effective at regularizing convolution layers.
● Padding and Stride
● Padding and stride influence how the convolution operation is performed. Padding and stride can be used to alter the dimensions (height and width) of the input/output vectors, either by increasing or decreasing them.
● Padding is used to make the dimension of the output equal to the input by adding zeros to the input frame of the matrix. Padding gives the kernel more space to cover the image and makes analysis of images more accurate. Due to padding, information on the borders of an image is preserved just as well as at the center.
CNN
● Stride controls how the filter convolves over the input, i.e., the number of pixels the filter shifts over the input matrix. If stride is set to 1, the filter moves across 1 pixel at a time, and if stride is 2, the filter moves 2 pixels at a time. The larger the stride, the smaller the resulting output, and vice versa.

ReLU Layer (Rectified Linear Unit)

ReLU is computed after convolution. It is the most commonly deployed activation function and allows the neural network to account for non-linear relationships. For a given input x, ReLU sets all negative values to zero and leaves all other values unchanged. It is mathematically represented as:

f(x) = max(0, x)

Reference: http://makeyourownneuralnetwork.blogspot.com/2020/02/calculating-output-size-of-convolutions.html
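As a one-line sketch of ours, ReLU is just an element-wise maximum with zero:

    import numpy as np

    feature_map = np.array([[-2.0, 1.5], [0.0, -0.5]])
    activated = np.maximum(0.0, feature_map)  # negatives -> 0, positives unchanged
    print(activated)                          # [[0.  1.5] [0.  0. ]]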
CNN
Pooling / Sub-sampling Layer

Next, there's a pooling layer. The pooling layer operates on each feature map independently. It reduces the resolution of the feature map by reducing the height and width of the feature maps, but retains the features of the map required for classification. This is called down-sampling.

● Max-pooling: selects the maximum element from each patch of the feature map. The resulting max-pooled layer holds the important features of the feature map. It is the most common approach, as it gives better results.
● Average pooling: computes the average of each patch of the feature map.
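A minimal 2 x 2 max-pooling sketch of ours for a single-channel feature map:

    import numpy as np

    def max_pool2d(feature_map, size=2):
        """Non-overlapping max pooling with stride equal to the pool size."""
        h, w = feature_map.shape
        cropped = feature_map[:h - h % size, :w - w % size]
        blocks = cropped.reshape(h // size, size, w // size, size)
        return blocks.max(axis=(1, 3))  # max within each size x size patch

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]: the max of each 2x2 patch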
CNN
Why is pooling important?
● It progressively reduces the spatial size of the representation to reduce the number of parameters and the computation in the network, and also controls overfitting. Without pooling, the output has the same resolution as the input.
● There can be any number of convolution, ReLU and pooling layers. The initial convolution layers learn generic information and the last layers learn more specific/complex features. After the final convolution layer, ReLU and pooling layer, the output feature map (matrix) is converted into a vector (one-dimensional array). This is called the flatten layer.

Fully Connected Layer

● The fully connected layer looks like a regular neural network connecting all neurons, and forms the last few layers in the network. The output from the flatten layer is fed to this fully connected layer.
● The feature vector from the fully connected layer is further used to classify images between different categories after training. All the inputs from this layer are connected to every activation unit of the next layer. Since most of the parameters sit in the fully connected layer, it is prone to overfitting. Dropout is one of the techniques that reduces overfitting.
● Dropout
● Dropout is an approach used for regularization in neural networks. It is a technique where randomly chosen nodes are ignored in the network during the training phase at each stage.
● The dropout rate is usually 0.5; it can be tuned to produce the best results and also improves training speed. This method of regularization reduces node-to-node interactions in the network, which leads to learning of important features and also helps generalize to new data better.
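As an illustrative sketch of ours, inverted dropout zeroes random nodes with a binary mask; rescaling by 1/keep_prob keeps the expected activation unchanged:

    import numpy as np

    def dropout(activations, rate=0.5, training=True):
        """Inverted dropout: randomly zero a fraction `rate` of the nodes."""
        if not training:
            return activations                      # no dropout at inference time
        keep_prob = 1.0 - rate
        mask = np.random.rand(*activations.shape) < keep_prob
        return activations * mask / keep_prob       # rescale to preserve the mean

    a = np.ones((2, 4))
    print(dropout(a, rate=0.5))  # roughly half the entries zeroed, rest scaled to 2.0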
CNN
Soft-Max Layer

Soft-max is an activation layer normally applied to the last layer of the network, which acts as a classifier. Classification of the given input into distinct classes takes place at this layer. The soft-max function is used to map the non-normalized output of a network to a probability distribution.

● The output from the last fully connected layer is directed to the soft-max layer, which converts it into probabilities.
● Soft-max assigns a decimal probability to each class in a multi-class problem, and these probabilities sum to 1.0.
● This allows the output to be interpreted directly as a probability.
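A numerically stable soft-max sketch of ours; subtracting the maximum before exponentiating avoids overflow without changing the result:

    import numpy as np

    def softmax(logits):
        """Map raw network outputs (logits) to a probability distribution."""
        shifted = logits - np.max(logits)   # stability: exp of large values overflows
        exp = np.exp(shifted)
        return exp / np.sum(exp)

    probs = softmax(np.array([2.0, 1.0, 0.1]))
    print(probs, probs.sum())  # e.g. [0.659 0.242 0.099], sums to 1.0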
CNN
Why Convolutions

● Parameter sharing: a feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image.
● Sparsity of connections: in each layer, each output value depends only on a small number of inputs.
CNN
● Basic Convolution Operation
● Step 1: overlay the filter on the input, perform element-wise multiplication, and add the results.
CNN

● Step 2: move the overlay right one position (or according to the stride setting), and do the same calculation to get the next result. And so on.
● The total number of multiplications to calculate the result above is (4 x 4) x (3 x 3) = 144: a 4 x 4 output where each entry costs 3 x 3 multiplications.
CNN
● Stride
● Stride governs how many cells the filter is moved in the input to calculate the next cell in the result. With a stride of 2 on the same input, the output is 2 x 2, and the total number of multiplications to calculate it is (2 x 2) x (3 x 3) = 36.
CNN
Padding - Padding has the following benefits:

1. It allows us to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as we go to deeper layers. If we have an activation map of size W x W x D, a pooling kernel of spatial size F, and stride S, then the size of the output volume is given by ((W - F) / S) + 1.
2. It helps us keep more of the information at the borders of an image.
CNN

Some padding terminologies:

● "valid" padding: Output size = input size - kernel size + 1
● "same" padding: Output size = input size
● "full" padding: Output size = input size + kernel size - 1

● Calculating the Output Dimension
● If we have an input of size W x W x D and Dout kernels with a spatial size of F, with stride S and amount of padding P, then the size of the output volume is determined by the following formula:

Wout = ((W - F + 2P) / S) + 1

The output volume has size Wout x Wout x Dout.
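A small helper of ours that evaluates this formula for the padding modes listed above:

    def conv_output_size(w, f, s=1, p=0):
        """Output spatial size for input w, filter f, stride s, padding p."""
        assert (w - f + 2 * p) % s == 0, "filter does not tile the input evenly"
        return (w - f + 2 * p) // s + 1

    print(conv_output_size(6, 3))          # valid padding: 4
    print(conv_output_size(6, 3, p=1))     # same padding:  6
    print(conv_output_size(32, 5))         # LeNet-5 first conv: 28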
Convolution Operation on Volume - When the input has more than one channel (e.g. an RGB image), the filter has a matching number of channels; each channel is convolved separately and the results are summed into a single output value.
CNN
Convolution parameters

● Filter dimensions: 2D for images.
● Filter size: generally 3x3 or 5x5.
● Number of filters: determines the number of feature maps created by the convolution operation.
● Stride: step for sliding the convolution window. Generally equal to 1.
● Padding: blank rows/columns with all-zero values added on the sides of the input feature map.
CNN

● A convolution layer accepts a volume of size W1 × H1 × D1 and requires four hyperparameters:
● Number of filters K,
● their spatial extent F,
● the stride S,
● the amount of zero padding P.
● It produces a volume of size W2 × H2 × D2, where:
● W2 = (W1 − F + 2P)/S + 1
● H2 = (H1 − F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry)
● D2 = K
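Plugging numbers into these formulas, for example AlexNet's commonly cited first layer (a sketch of ours; input 227 x 227 x 3, K = 96, F = 11, S = 4, P = 0):

    def conv_volume(w1, h1, k, f, s, p):
        """Output volume (W2, H2, D2) of a convolution layer."""
        w2 = (w1 - f + 2 * p) // s + 1
        h2 = (h1 - f + 2 * p) // s + 1
        return w2, h2, k  # depth equals the number of filters K

    # AlexNet first layer: 227x227x3 input, K=96 filters, F=11, S=4, P=0
    print(conv_volume(227, 227, k=96, f=11, s=4, p=0))  # (55, 55, 96)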
CNN

● The total number of multiplications to calculate the result is (4 x 4) x (3 x 3 x 3) = 432.
CNN
● Convolution Operation with Multiple Filters
● Multiple filters can be used in a convolution layer to detect multiple features. The output of the layer then has the same number of channels as the number of filters in the layer.
● The total number of multiplications to calculate the result is (4 x 4 x 2) x (3 x 3 x 3) = 864.
CNN
● 1 x 1 Convolution
● This is convolution with a 1 x 1 filter. The effect is to flatten or "merge" channels together, which can save computation later in the network.
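For instance, here is a rough sketch of ours showing sixteen 1 x 1 filters mixing a 28 x 28 x 192 volume down to 28 x 28 x 16 (bias and activation omitted; the shapes echo the inception example later in this unit):

    import numpy as np

    x = np.random.rand(28, 28, 192)   # input volume
    w = np.random.rand(192, 16)       # sixteen 1x1x192 filters
    y = x @ w                         # per-pixel mixing across channels
    print(y.shape)                    # (28, 28, 16)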
CNN
● One Convolution Layer
● Finally, to make up a convolution layer, a bias (∈ R) is added and an activation function such as ReLU or tanh is applied.
CNN
● Shorthand Representation
● This simpler representation will be used from now on to represent one convolutional layer.

Sample Complete Network - This is a sample network with three convolution layers. At the end of the network, the output of the convolution layer is flattened and connected to a logistic regression or a softmax output layer.
CNN
● Pooling Layer
● The pooling layer is used to reduce the size of the representations and to speed up calculations, as well as to make some of the detected features a bit more robust.
● Common types of pooling are max pooling and average pooling, but these days max pooling is more common.

Interesting properties of the pooling layer:

● it has hyper-parameters:
○ size (f)
○ stride (s)
○ type (max or avg)
● but it has no parameters; there's nothing for gradient descent to learn
CNN
● When done on an input with multiple channels, pooling reduces the height and width (nH and nW) but keeps nC unchanged.
CNN-LeNet – 5 Architecture
● Well Known Architectures
● Classic Network: LeNet-5
● Number of parameters: ~60 thousand. The network has 5 layers with learnable parameters and is hence named LeNet-5.
● It has three convolution layers, combined with average pooling.
● After the convolution and average pooling layers, we have two fully connected layers. At the end, a softmax classifier classifies the images into their respective classes.
CNN-LeNet – 5 Architecture
● The input to this model is a 32 x 32 grayscale image, hence the number of channels is one.
● A ConvNet neuron transforms the input image by arranging its neurons in three dimensions.
● We then apply the first convolution operation with filter size 5x5, and we have 6 such filters. As a result, we get a feature map of size 28x28x6. Here the number of channels equals the number of filters applied.
CNN-LeNet – 5 Architecture
● For the first pooling operation, we apply average pooling, and the size of the feature map is reduced by half, to 14x14x6. Note that the number of channels is intact.
● Next, we have a convolution layer with sixteen filters of size 5x5, changing the feature map to 10x10x16. The output size is calculated in the same manner. After this, we again apply an average pooling or sub-sampling layer, which again reduces the size of the feature map by half, to 5x5x16.
CNN-LeNet – 5 Architecture

● Then we have a final convolution layer of size 5x5 with 120 filters, leaving a feature map of size 1x1x120, which flattens to 120 values.
● After these convolution layers, we have a fully connected layer with eighty-four neurons. Finally, we have an output layer with ten neurons, since the data has ten classes.
● Here is the final architecture of the LeNet-5 model.
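As a sketch of ours (not part of the slides), the LeNet-5 stack described above can be written in tf.keras; the tanh activations follow the original paper, while the layer sizes mirror the slides:

    from tensorflow.keras import layers, models

    # LeNet-5 as described above: 32x32x1 input, three 5x5 conv layers
    # interleaved with 2x2 average pooling, then FC-84 and a 10-way softmax.
    lenet5 = models.Sequential([
        layers.Input(shape=(32, 32, 1)),
        layers.Conv2D(6, kernel_size=5, activation="tanh"),    # -> 28x28x6
        layers.AveragePooling2D(pool_size=2),                  # -> 14x14x6
        layers.Conv2D(16, kernel_size=5, activation="tanh"),   # -> 10x10x16
        layers.AveragePooling2D(pool_size=2),                  # -> 5x5x16
        layers.Conv2D(120, kernel_size=5, activation="tanh"),  # -> 1x1x120
        layers.Flatten(),                                      # -> 120
        layers.Dense(84, activation="tanh"),                   # FC with 84 neurons
        layers.Dense(10, activation="softmax"),                # 10 classes
    ])
    lenet5.summary()  # total parameters come out to roughly 60 thousand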
CNN-AlexNet
● Classic Network: AlexNet (Number of parameters: ~60 million)
● AlexNet is another classic CNN architecture, from the paper ImageNet Classification with Deep Convolutional Neural Networks.
● Calculation of AlexNet parameters:
CNN-AlexNet Example

● The first input layer has no parameters. You know why.
● Parameters in the second CONV1 (filter shape = 5*5, stride = 1) layer: ((width of filter * height of filter * number of filters in the previous layer + 1) * number of filters) = (((5*5*3)+1)*8) = 608.
● The third POOL1 layer has no parameters. You know why.
● Parameters in the fourth CONV2 (filter shape = 5*5, stride = 1) layer: ((width of filter * height of filter * number of filters in the previous layer + 1) * number of filters) = (((5*5*8)+1)*16) = 3216.


CNN-AlexNet Example
● The fifth POOL2 layer has no parameters. You know why.
● Parameters in the sixth FC3 layer: ((current layer c * previous layer p) + 1*c) = 120*400 + 1*120 = 48120.
● Parameters in the seventh FC4 layer: ((current layer c * previous layer p) + 1*c) = 84*120 + 1*84 = 10164.
● The eighth Softmax layer has ((current layer c * previous layer p) + 1*c) parameters = 10*84 + 1*10 = 850.

● Just because there are no parameters in the pooling layer, it does not imply that pooling has no role in backprop. The pooling layer is responsible for passing on values to the next and previous layers during forward and backward propagation respectively.
● There are no trainable parameters in a max-pooling layer. In the forward pass, it passes the maximum value within each rectangle to the next layer.
● Convolution: combines filter values and input values (multiply and add).
● Pooling: only uses input values.
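These counts can be checked with two one-line helpers (a sketch of ours):

    def conv_params(f, c_in, c_out):
        """Weights per filter (f*f*c_in) plus one bias, times the filter count."""
        return (f * f * c_in + 1) * c_out

    def dense_params(n_in, n_out):
        """A weight per input-output pair plus one bias per output neuron."""
        return n_in * n_out + n_out

    print(conv_params(5, 3, 8))      # CONV1: 608
    print(conv_params(5, 8, 16))     # CONV2: 3216
    print(dense_params(400, 120))    # FC3:   48120
    print(dense_params(120, 84))     # FC4:   10164
    print(dense_params(84, 10))      # Softmax: 850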
CNN-VGG-16

● Classic Network: VGG-16 (Number of parameters: ~138 million)
● VGG-16 is from the paper Very Deep Convolutional Networks for Large-Scale Image Recognition.
● The number 16 refers to the fact that the network has 16 trainable layers (i.e. layers that have weights).
CNN-VGG-16
● The figure below is another way to visually depict the layers in a network. In the case of VGG-16, there are five convolutional blocks (Conv-1 to Conv-5). The specific layers within a convolutional block can vary depending on the architecture.
● However, a convolutional block typically contains one or more 2D convolutional layers followed by a pooling layer. Other layers are also sometimes incorporated, but we will focus on these two layer types to keep things simple.
● Notice that we have explicitly specified the last layer of the network as SoftMax. This layer applies the softmax function to the outputs from the last fully connected layer in the network (FC-8).
● It converts the raw output values from the network to normalized values in the range [0,1], which we can interpret as a probability score for each class.
CNN-VGG-16
● In the case of VGG-16, the input is a color image with shape (224x224x3). Here we depict a high-level view of a convolutional layer. Convolutional layers use filters to process the input data. The filter moves across the input, and at each filter location a convolution operation is performed, which produces a single number. This value is then passed through an activation function, and the output from the activation function populates the corresponding entry in the output, also known as an activation map (224x224x1). You can think of the activation map as containing a summary of features from the input via the convolution process.
CNN-VGG-16
● Convolutional Layer with a Single Filter
● Let's now look at a concrete example of a simple convolutional layer. Suppose the input image has a size of (224x224x3). The spatial size of a filter is a design choice, but it must have a depth of three to match the depth of the input image. In this example, we use a single filter of size (3x3x3).
CNN-VGG-16
● Here we have chosen to use just a single filter. The filter therefore contains three kernels, where each kernel has nine trainable weights: a total of 27 trainable weights in this filter, plus a single bias term, for 28 trainable parameters in all. Because we've chosen just a single filter, the depth of our output is one, which means we produce just a single-channel activation map. When we convolve this single (3-channel) filter with the (3-channel) input, the convolution operation is performed for each channel separately. The weighted sum over all three channels plus a bias term is then passed through an activation function, whose output is represented as a single number in the output activation map (shown in blue).
● To summarize: the number of channels in a filter must match the number of channels in the input, and the number of filters in a convolutional layer (a design choice) dictates the number of activation maps produced by the convolutional layer.

Convolutional Layer with Two Filters

● This next example represents a more general case. First, notice that the input has a depth of three, but this doesn't necessarily correspond to color channels. It just means that we have an input tensor with three channels. Remember that when we refer to the input, we don't necessarily mean the input to the neural network, but rather the input to this convolutional layer, which could represent the output from a previous layer in the network.
CNN-VGG-16
● Each filter learns to detect a different structural element (i.e., horizontal lines, vertical lines, or diagonal lines). To be more precise, these "lines" represent edge structures in an image. In the output activation maps, we emphasize the highly activated neurons associated with each filter.
CNN-VGG-16
● A common pattern in CNN architectures like VGG-16 is that there are usually two or three consecutive convolutional layers followed by a max pooling layer, and convolutional layers usually contain at least 32 filters.
CNN-VGG-16
● Fully Connected Classifier
● The fully connected (dense) layers in a CNN architecture transform features into class probabilities. In the case of VGG-16, the output from the last convolutional block (Conv-5) is a series of activation maps with shape (7x7x512). For reference, we have indicated the number of channels at key points in the architecture.
● Before the data from the last convolutional layer in the feature extractor can flow through the classifier, it needs to be flattened to a 1-dimensional vector of length 7 x 7 x 512 = 25,088. After flattening, this 1-dimensional layer is then fully connected to FC-6, as shown below.
CNN
● Flattening the output of Conv-5 is required for processing the activation maps through the classifier, but it does not alter the original spatial interpretation of the data. It's just a repacking of the data for processing purposes.
Summary
● CNNs designed for a classification task contain an upstream feature extractor and a
downstream classifier.
● The feature extractor comprises convolutional blocks with a similar structure
composed of one or more convolutional layers followed by a max pooling layer.
● The convolutional layers extract features from the previous layer and store the results
in activation maps.
● The number of filters in a convolutional layer is a design choice in the architecture of a
model, but the number of kernels within a filter is dictated by the depth of the input
tensor.
● The depth of the output from a convolutional layer (the number of activation maps) is
dictated by the number of filters in the layer.
● Pooling layers are often used at the end of a convolutional block to downsize the
activation maps. This reduces the total number of trainable parameters in the network
and, therefore, the training time required. Additionally, this also helps mitigate
overfitting.
● The classifier portion of the network transforms the extracted features into class
probabilities using one or more densely connected layers.
● When the number of classes is more than two, a SoftMax layer is used to normalize the
raw output from the classifier to the range [0,1]. These normalized values can be
interpreted as the probability that the input image corresponds to the class label for
each output neuron.
CNN-Inception
● Inception - The motivation of the inception network is that, rather than requiring us to pick the filter size manually, we let the network decide what is best to put in a layer. We give it choices, and hopefully it will pick what is best to use in that layer.
● The problem with the above network is computational cost (e.g. for the 5 x 5 filter alone, the computational cost is (28 x 28 x 32) x (5 x 5 x 192) = ~120 million multiplications).
● Using a 1 x 1 convolution reduces the computation to about 1/10 of that. With this idea, an inception module looks like this:
CNN-Inception

● At last, all the channels in the network are concatenated together, i.e. 28 x 28 x (64 + 128 + 32 + 32) = 28 x 28 x 256.
CNN-Inception
● Now let's look at the computational cost. To apply this 1 x 1 convolution we have 16 filters, each of dimension 1 x 1 x 192. The cost of computing the resulting 28 x 28 x 16 volume is the number of multiplications: (28 x 28 x 16) x (1 x 1 x 192) ≈ 2.4 million.
● The cost of the second convolutional layer is about 10.0 million multiplications, because we have to apply a 5 x 5 x 16 dimensional filter at every position of the 28 x 28 x 32 output volume: (28 x 28 x 32) x (5 x 5 x 16) ≈ 10.0 million.
● So, the total number of multiplications we need to do is the sum of those, which is about 12.4 million. Compared with the previous example, we have reduced the computational cost from about 120 million multiplications down to about 1/10 of that, 12.4 million. The number of additions we need to perform is very similar to the number of multiplications.
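The slide's arithmetic can be reproduced directly (a sketch of ours):

    # Multiplications for a conv layer: output positions x weights per output.
    def conv_mults(out_h, out_w, out_c, f_h, f_w, in_c):
        return out_h * out_w * out_c * f_h * f_w * in_c

    direct     = conv_mults(28, 28, 32, 5, 5, 192)    # 5x5 conv applied directly
    bottleneck = (conv_mults(28, 28, 16, 1, 1, 192)   # 1x1 "bottleneck" first
                  + conv_mults(28, 28, 32, 5, 5, 16)) # then the 5x5 conv
    print(direct, bottleneck)  # ~120.4 million vs ~12.4 million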
CNN-Local Response Normalization
● Why Normalization?
● Normalization has become important for deep neural networks to compensate for the unbounded nature of certain activation functions such as ReLU, ELU, etc. With these activation functions, the outputs are not constrained within a bounded range (such as [-1,1] for tanh); rather, they can grow as high as the training allows. To limit the unbounded activations from increasing the output layer values, normalization is used just before the activation function.
● Local Response Normalization
● Local Response Normalization (LRN) was first introduced in the AlexNet architecture, where the activation function used was ReLU, as opposed to the more common tanh and sigmoid at that time. Apart from the reason mentioned above, the reason for using LRN was to encourage lateral inhibition.
● Lateral inhibition is a concept from neurobiology that refers to the capacity of a neuron to reduce the activity of its neighbors [1]. In DNNs, the purpose of this lateral inhibition is to carry out local contrast enhancement so that locally maximal pixel values are used as excitation for the next layers.
● LRN is a non-trainable layer that square-normalizes the pixel values in a feature map within a local neighborhood. There are two types of LRN, based on the neighborhood defined, as can be seen in the figure below.
CNN-Local Response Normalization
● Inter-Channel LRN: This is what the AlexNet paper originally used. The neighborhood is defined across the channels: for each (x,y) position, the normalization is carried out in the depth dimension (see https://hackmd.io/@imkushwaha/alexnet) and is given by the following formula:

b^i_(x,y) = a^i_(x,y) / ( k + α · Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} ( a^j_(x,y) )² )^β
CNN-Local Response Normalization
● Here, i indicates the output of filter i; a(x,y) and b(x,y) are the pixel values at position (x,y) before and after normalization respectively; and N is the total number of channels.
● The constants (k, α, β, n) are hyper-parameters: k is used to avoid any singularities (division by zero), α is used as a normalization constant, and β is a contrasting constant.
● The constant n defines the neighborhood length, i.e. how many consecutive pixel values need to be considered while carrying out the normalization.
● The case (k, α, β, n) = (0, 1, 1, N) is the standard normalization. In the figure above, n is taken to be 2 while N = 4.
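A NumPy sketch of ours for inter-channel LRN, using the simplified hyper-parameters (k, α, β, n) = (0, 1, 1, 2) from the worked example that follows:

    import numpy as np

    def lrn_inter_channel(a, k=0.0, alpha=1.0, beta=1.0, n=2):
        """Inter-channel LRN on a tensor of shape (N_channels, H, W)."""
        N = a.shape[0]
        b = np.empty_like(a)
        for i in range(N):
            lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
            denom = k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)
            b[i] = a[i] / denom ** beta
        return b

    a = np.ones((4, 1, 1))                 # 4 channels, all values 1
    print(lrn_inter_channel(a)[0, 0, 0])   # 1 / (1^2 + 1^2) = 0.5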
CNN-Local Response Normalization
● Different colors denote different channels, hence N = 4. Let's take the hyper-parameters to be (k, α, β, n) = (0, 1, 1, 2).
● The value n = 2 means that while calculating the normalized value at position (i, x, y), we consider the values at the same position in the previous and next filters, i.e. (i−1, x, y) and (i+1, x, y).
● For (i, x, y) = (0, 0, 0) we have value(i,x,y) = 1; value(i−1,x,y) doesn't exist, and value(i+1,x,y) = 1. Hence normalized_value(i,x,y) = 1/(1² + 1²) = 0.5, as can be seen in the lower part of the figure above.
● The rest of the normalized values are calculated in a similar way.

Intra-Channel LRN: In intra-channel LRN, the neighborhood is extended within the same channel only, as can be seen in the figure above. The formula is given by

b_(x,y) = a_(x,y) / ( k + α · Σ_{p = max(0, x−n/2)}^{min(W−1, x+n/2)} Σ_{q = max(0, y−n/2)}^{min(H−1, y+n/2)} ( a_(p,q) )² )^β
CNN-Local Response Normalization

● Here (W, H) are the width and height of the feature map (for example, in the figure above (W, H) = (8, 8)). The only difference between inter- and intra-channel LRN is the neighborhood for normalization. In intra-channel LRN, a 2D neighborhood is defined (as opposed to the 1D neighborhood in inter-channel LRN) around the pixel under consideration. As an example, the figure below shows intra-channel normalization on a 5x5 feature map with n = 2 (i.e. a 2D neighborhood of size (n+1)x(n+1) centered at (x, y)).
CNN-Local Response Normalization
In batch normalization, the output of hidden neurons is processed in the following manner before being fed to the activation function:

1. Normalize the entire batch B to have zero mean and unit variance:

● Calculate the mean of the entire mini-batch output: μ_B
● Calculate the variance of the entire mini-batch output: σ²_B
● Normalize the mini-batch by subtracting the mean and dividing by the standard deviation

Comparison:

LRN has multiple directions along which to perform normalization (inter- or intra-channel); on the other hand, BN has only one way of being carried out (for each pixel position across all the activations). The table below compares the two normalization techniques.
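A minimal batch-normalization sketch of ours (the trainable scale and shift parameters γ and β used in practice are omitted for brevity):

    import numpy as np

    def batch_norm(x, eps=1e-5):
        """Normalize a mini-batch (axis 0) to zero mean and unit variance."""
        mu = x.mean(axis=0)                     # mean of the mini-batch
        var = x.var(axis=0)                     # variance of the mini-batch
        return (x - mu) / np.sqrt(var + eps)    # eps avoids division by zero

    batch = np.random.rand(32, 10) * 5 + 3      # 32 samples, 10 features
    out = batch_norm(batch)
    print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 and ~1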
Training a Convolutional Network

