7 CNN Networks
Dr. Dinesh Kumar Vishwakarma
Professor, Department of Information Technology
Delhi Technological University, Delhi
Email: dinesh@dtu.ac.in
Web page: http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php
Biometric Research Laboratory
http://www.dtu.ac.in/Web/Departments/InformationTechnology/lab_and_infra/bml/
• Recognized letters of the alphabet using the thresholded decision rule
  f(x) = 1 if ω·x + b > 0, and f(x) = 0 otherwise
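The thresholded decision rule above can be sketched in code; the weights and bias here are illustrative placeholders, not values from the original perceptron:

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron decision rule: output 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative weights and bias (hypothetical, for demonstration only)
w = np.array([0.5, -0.4])
b = -0.1
```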
• Reinvigorated (gave a new energy to) research in Deep Learning
Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
• It is a deeper and much wider version of LeNet, and it won the difficult ImageNet competition by a large margin.
• AlexNet scales LeNet into a much larger network that can learn more complex objects and object hierarchies.
• Used NVIDIA GTX 580 GPUs to reduce training time.
“AlexNet”
ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky, Sutskever, Hinton, 2012], NIPS. Citations: 59,454 (02.04.2020)
Application examples and figure credits:
• [Faster R-CNN: Ren, He, Girshick, Sun 2015]. Figures copyright Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, 2015. Reproduced with permission.
• [Farabet et al., 2012]. Figures copyright Clement Farabet, 2012. Reproduced with permission.
• [Toshev, Szegedy 2014]. Images are examples of pose estimation, not actually from Toshev & Szegedy 2014. Copyright Lane McIntosh.
• [Guo et al. 2014]. Figures copyright Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard Lewis, and Xiaoshi Wang, 2014. Reproduced with permission.
• The role of the CNN is to reduce the images into a form that is easier to process, without losing features that are critical for making a good prediction.
• It works well when we design a system that can learn features as well as scale to the data.
3/7/2023 Dinesh K. Vishwakarma, Ph.D. 19
Convolution Layer
32x32x3 image -> preserve spatial structure (32 height, 32 width, 3 depth)
Per-channel dot products of one 3×3 chunk of the input with filter w₀, plus bias b₀:
0×(−1) + 0×0 + 0×1 + 0×0 + 0×0 + 0×1 + 0×1 + 1×(−1) + 0×1 = −1
0×(−1) + 0×0 + 0×1 + 0×1 + 2×(−1) + 1×1 + 0×0 + 2×1 + 1×0 = 1
0×(−1) + 0×1 + 0×1 + 0×1 + 2×1 + 1×0 + 0×0 + 1×(−1) + 0×0 = 1
w₀ ∗ x + b₀ = −1 + 1 + 1 + 1 = 2
Input volume: 7×7×3 (zero-padded); filters w₀ and w₁: each 3×3×3; output volume: 3×3×2.
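The worked example above (a padded 7×7×3 input, two 3×3×3 filters at stride 2, giving a 3×3×2 output) can be sketched as a naive loop; the random input and filter values here are placeholders, not the numbers from the slides:

```python
import numpy as np

def conv2d(x, filters, biases, stride=2):
    """Naive 'valid' convolution over an H x W x C volume.
    filters: list of F x F x C arrays; biases: one scalar per filter."""
    H, W, C = x.shape
    F = filters[0].shape[0]
    out_hw = (H - F) // stride + 1
    out = np.zeros((out_hw, out_hw, len(filters)))
    for k, (w, b) in enumerate(zip(filters, biases)):
        for i in range(out_hw):
            for j in range(out_hw):
                # dot product of the filter with one F x F x C chunk, plus bias
                chunk = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
                out[i, j, k] = np.sum(chunk * w) + b
    return out

# Random 7x7x3 input (e.g. a 5x5x3 image zero-padded by 1) and two 3x3x3 filters
rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 3))
w0, w1 = rng.standard_normal((3, 3, 3)), rng.standard_normal((3, 3, 3))
out = conv2d(x, [w0, w1], biases=[1.0, 0.0], stride=2)
```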
32x32x3 image, 5x5x3 filter.
Convolving the filter at one position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).
Convolving one 5x5x3 filter over the 32x32x3 input gives a 28x28x1 activation map.
With e.g. 6 filters of size 5x5x3, the Convolution Layer produces 6 activation maps: a new "image" of size 28x28x6.
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions:
32x32x3 -> [CONV, ReLU, e.g. 6 5x5x3 filters] -> 28x28x6 -> [CONV, ReLU, e.g. 10 5x5x6 filters] -> 24x24x10 -> ….
7x7 input (spatially), assume 3x3 filter:
• applied with stride 1 => 5x5 output
• applied with stride 2 => 3x3 output!
• applied with stride 3? doesn't fit! cannot apply 3x3 filter on 7x7 input with stride 3.
(recall:) Output size = (N - F) / stride + 1
e.g. N = 7, F = 3: stride 1 => (7 - 3)/1 + 1 = 5; stride 2 => (7 - 3)/2 + 1 = 3; stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer, so it doesn't fit).
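The output-size formula can be sketched as a small helper that also checks whether the filter tiles the input evenly:

```python
def conv_output_size(N, F, stride, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1.
    Returns None when the result is not an integer (filter doesn't fit)."""
    num = N - F + 2 * pad
    if num % stride != 0:
        return None  # e.g. a 3x3 filter on a 7x7 input at stride 3
    return num // stride + 1
```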
Examples time
• Input volume: 32x32x3
• 10 5x5 filters with stride 1, pad 2
• Output volume size: (32 + 2·2 − 5)/1 + 1 = 32 spatially, so 32x32x10.
• Accepts a volume of size W1×H1×D1.
• Requires four hyperparameters:
  ✓ Number of filters K,
  ✓ their spatial extent F,
  ✓ the stride S,
  ✓ the amount of zero padding P.
• Common settings: K = powers of 2 (e.g. 32, 64, 128, 512); F = 3, S = 1, P = 1; F = 5, S = 1, P = 2; F = 5, S = 2, P = ? (whatever fits); F = 1, S = 1, P = 0.
• Produces a volume of size W2×H2×D2 where:
  ✓ W2 = (W1−F+2P)/S+1
  ✓ H2 = (H1−F+2P)/S+1 (i.e. width and height are computed equally by symmetry)
  ✓ D2 = K
• With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
• In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.
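The summary formulas above can be sketched as one helper that returns both the output volume and the parameter count:

```python
def conv_layer_summary(W1, H1, D1, K, F, S, P):
    """Output volume (W2, H2, D2) and total parameter count for a CONV layer,
    using the formulas from the summary above."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    params = (F * F * D1) * K + K  # weights per filter times K, plus K biases
    return (W2, H2, D2), params
```

Applied to the example above (32x32x3 input, 10 5x5 filters, stride 1, pad 2) it gives a 32x32x10 output with 760 parameters.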
1x1 CONV with 32 filters: a 56×56×64 input volume becomes 56×56×32 (each filter has size 1x1x64, and performs a 64-dimensional dot product).
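A 1x1 convolution is just a per-position dot product over the depth axis; in numpy it reduces to a matrix multiply (random values here are placeholders):

```python
import numpy as np

# 56x56x64 volume convolved with 32 filters of size 1x1x64:
# every spatial position independently takes 32 dot products over depth.
rng = np.random.default_rng(1)
x = rng.standard_normal((56, 56, 64))
w = rng.standard_normal((64, 32))  # one 64-dim weight vector per output filter
out = x @ w                        # matmul over the last axis -> (56, 56, 32)
```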
32x32x3 image, 5x5x3 filter.
1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product).
Convolving the 5x5x3 filter over the 32x32x3 image gives a 28x28 activation map.
An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input.
2. All of them share parameters.
"5x5 filter" -> "5x5 receptive field for each neuron"
E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5). There will be 5 different neurons all looking at the same region in the input volume.
Pooling Layer
• Makes the representations smaller and more manageable.
• Pooling is of two types: Max and Average.
• Operates over each activation map independently.
• The pooling layer is responsible for reducing the spatial size of the convolved feature.
• This decreases the computational power required to process the data, through dimensionality reduction.
• Furthermore, it is useful for extracting dominant features that are rotationally and positionally invariant, thus keeping the training of the model effective.
MAX and AVERAGE POOLING
• Max Pooling returns the maximum value from the portion of the image covered by the kernel.
• Average Pooling returns the average of all the values from the portion of the image covered by the kernel.
(Illustration: Max and Average pooling with 2x2 filters and stride 2.)
• Max Pooling also performs as a noise suppressant: it discards the noisy activations altogether, performing de-noising along with dimensionality reduction.
• On the other hand, Average Pooling simply performs dimensionality reduction as a noise-suppressing mechanism. Hence, we can say that Max Pooling performs a lot better than Average Pooling.
• The Conv layer and the Pooling layer together form the i-th layer of a CNN. More such layers can be added to capture low-level details, but computational complexity increases.
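Max and average pooling with 2x2 filters and stride 2 can be sketched as follows; the 4x4 input grid is an illustrative example, not taken from the slides:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling with a size x size window over one activation map."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # reduce each window to a single value
            out[i, j] = reduce_fn(x[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

a = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
```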
Fully Connected Layer: stretch the 32x32x3 input into a 3072 x 1 vector. The layer applies a 10 x 3072 weight matrix W, producing 10 outputs; each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
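The fully connected layer described above can be sketched in numpy (random image and weights are placeholders):

```python
import numpy as np

# Stretch a 32x32x3 image into a 3072-vector, then multiply by a 10 x 3072
# weight matrix: each of the 10 scores is a dot product with one row of W.
rng = np.random.default_rng(2)
image = rng.standard_normal((32, 32, 3))
x = image.reshape(3072)              # 32*32*3 = 3072-dimensional input
W = rng.standard_normal((10, 3072))  # 10 x 3072 weights
scores = W @ x                       # 10 numbers
```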
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html