CNN Iitkgp
1D convolution (continuous and discrete), with kernel $g$:

$(f * g)(x) = \int f(\tau)\, g(x - \tau)\, d\tau$

$(f * g)(x) = \sum_{\tau=0}^{N-1} f(\tau)\, g(x - \tau)$

The output is sometimes called a feature map.
2D convolution (continuous and discrete):

$(f * g)(x, y) = \iint f(\tau, \kappa)\, g(x - \tau, y - \kappa)\, d\tau\, d\kappa$

$(f * g)(x, y) = \sum_{\tau=0}^{N-1} \sum_{\kappa=0}^{N-1} f(\tau, \kappa)\, g(x - \tau, y - \kappa)$
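The discrete 1D formula translates directly into code. Below is a minimal NumPy sketch of full discrete convolution (the function name `conv1d` is our own), checked against NumPy's built-in `np.convolve`:

```python
import numpy as np

def conv1d(f, g):
    """Discrete 1D convolution: (f*g)[x] = sum_t f[t] * g[x - t]."""
    N, M = len(f), len(g)
    out = np.zeros(N + M - 1)
    for x in range(len(out)):
        for t in range(N):
            if 0 <= x - t < M:      # only terms where g's index is valid
                out[x] += f[t] * g[x - t]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
assert np.allclose(conv1d(f, g), np.convolve(f, g))  # matches NumPy's full convolution
```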
Convolution Properties
• Commutative:
f*g = g*f
• Associative:
(f*g)*h = f*(g*h)
• Homogeneous:
f*(a·g) = a·(f*g)
• Additive (Distributive):
f*(g+h)= f*g+f*h
• Shift-Invariant:
if h(x, y) = g(x − x₀, y − y₀), then (f*h)(x, y) = (f*g)(x − x₀, y − y₀)
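These properties are easy to check numerically. A small sketch using `scipy.signal.convolve2d` (full-mode 2D convolution) on random arrays:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
f = rng.standard_normal((5, 5))
g = rng.standard_normal((3, 3))
h = rng.standard_normal((3, 3))

# Commutative: f*g = g*f
assert np.allclose(convolve2d(f, g), convolve2d(g, f))
# Homogeneous: f*(a·g) = a·(f*g)
assert np.allclose(convolve2d(f, 2.5 * g), 2.5 * convolve2d(f, g))
# Additive (distributive): f*(g+h) = f*g + f*h
assert np.allclose(convolve2d(f, g + h), convolve2d(f, g) + convolve2d(f, h))
```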
ConvNet
• ConvNet architectures for images:
– a fully-connected structure does not scale to large images
– ConvNets make the explicit assumption that the inputs are images
– this allows us to encode certain properties into the architecture
– these make the forward function more efficient to implement
– and vastly reduce the number of parameters in the network.
• 3D volumes: neurons arranged in 3 dimensions:
width, height, depth.
ConvNets
[Figure: a 32x32x3 image and a translated copy]
A 32x32x3 image has width 32, height 32, and depth 3.
Andrej Karpathy
Convolutions: More detail
[Figure: a 32x32x3 image and a 5x5x3 filter]
Andrej Karpathy
Convolutions: More detail
Convolution Layer
Convolve a 5x5x3 filter with the 32x32x3 image: slide the filter over the image spatially, computing dot products. Each position yields 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).
Andrej Karpathy
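As a sketch of that single number, here is the 75-dimensional dot product for one (illustrative) filter position, with random data standing in for the image and filter:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # 32x32x3 input
w = rng.standard_normal((5, 5, 3))         # one 5x5x3 filter
b = 0.1                                    # bias

# One output value: dot product between the filter and a 5x5x3 chunk,
# here the chunk at the top-left corner (rows 0..4, cols 0..4).
chunk = image[0:5, 0:5, :]
value = np.dot(w.ravel(), chunk.ravel()) + b   # 75-dim dot product + bias
```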
Convolutions: More detail
Convolution Layer
Convolving one 5x5x3 filter over a 32x32x3 image produces a 28x28x1 activation map.
Andrej Karpathy
Convolutions: More detail
Convolution Layer
Consider a second (green) filter: each 5x5x3 filter convolved over the 32x32x3 image produces its own 28x28x1 activation map.
Andrej Karpathy
Convolutions: More detail
Convolution Layer
For example, if we had six 5x5 filters, we'd get 6 separate activation maps, stacked into a 28x28x6 output volume.
Andrej Karpathy
Convolutions: More detail
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions.
e.g. 32x32x3 -> CONV, ReLU (six 5x5x3 filters) -> 28x28x6
Andrej Karpathy
Convolutions: More detail
Preview: a ConvNet is a sequence of Convolutional Layers, interspersed with activation functions.
e.g. 32x32x3 -> CONV, ReLU (six 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (ten 5x5x6 filters) -> 24x24x10 -> ....
Andrej Karpathy
Convolutions: More detail
Preview [from recent Yann LeCun slides]
Andrej Karpathy
Convolutions: More detail
One filter => one activation map: convolving each 5x5 filter (32 total in this example) over the 32x32x3 input yields a 28x28x1 map.
Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter.
Sliding the filter one position at a time => 5x5 output.
Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 2
=> 3x3 output!
Andrej Karpathy
Convolutions: More detail
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 3?
Doesn't fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.
Andrej Karpathy
Convolutions: More detail
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer: stride 3 doesn't fit)
Andrej Karpathy
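The formula is one line of code. A minimal sketch (the function name is our own) reproducing the three cases above:

```python
def conv_output_size(N, F, stride):
    """Spatial output size of a conv layer with no padding: (N - F) / stride + 1."""
    if (N - F) % stride != 0:
        raise ValueError(f"a {F}x{F} filter with stride {stride} does not fit an input of size {N}")
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) would raise: stride 3 does not fit
```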
Convolutions: More detail
In practice: it is common to zero-pad the border.
e.g. input 7x7, 3x3 filter applied with stride 1, padded with a 1-pixel border of zeros => what is the output?
(recall: (N - F) / stride + 1)
=> 7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F - 1)/2, which preserves the spatial size.
e.g. F = 3 => zero pad with 1; F = 5 => zero pad with 2; F = 7 => zero pad with 3
With padding, the output size is (N + 2*padding - F) / stride + 1.
Andrej Karpathy
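Extending the sketch above with padding makes the size-preservation rule easy to verify:

```python
def conv_output_size(N, F, stride, pad=0):
    """Spatial output size with zero padding: (N + 2*pad - F) / stride + 1."""
    return (N + 2 * pad - F) // stride + 1

# With stride 1 and pad = (F - 1) / 2, the spatial size is preserved:
for F in (3, 5, 7):
    assert conv_output_size(7, F, stride=1, pad=(F - 1) // 2) == 7
```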
Convolutions: More detail
Examples time: (worked examples shown on the slides)
Andrej Karpathy
Spatial arrangement
• Three hyperparameters control the size of the output volume:
– Depth: the number of filters, each learning to look for something different in the input.
– Stride: the step with which we slide the filter.
– Zero-padding: padding the input volume with zeros around the border.
Spatial arrangement
• We compute the spatial size of the output
volume as a function of
– the input volume size (W)
– the receptive field size of the Conv Layer
neurons (F)
– the stride with which they are applied (S)
– the amount of zero padding used (P) on the
border.
• The number of neurons that “fit” is given by
(W − F + 2P)/S + 1
– For a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output.
– With stride 2 we would get a 3x3 output.
• In one spatial dimension (x-axis): neurons with a receptive field size of F = 3, input size W = 5, and zero padding of P = 1.
• Stride 1 gives (5 − 3 + 2)/1 + 1 = 5 outputs; stride 2 gives 3.
[Figure: pooling operations (Max, Sum)]
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
Fully-connected layer
• Neurons in a fully connected layer have full connections to all
activations in the previous layer
• Their activations can hence be computed with a matrix
multiplication followed by a bias offset.
• Converting FC layers to CONV layers
• the only difference between FC and CONV layers is that the
neurons in the CONV layer are connected only to a local
region in the input, and that many of the neurons in a CONV
volume share parameters.
• However, the neurons in both layers still compute dot
products, so their functional form is identical.
Converting FC layers to CONV layers
• For any CONV layer there is an FC layer that implements the same forward
function.
• The weight matrix would be a large matrix that is mostly zero except for at
certain blocks (due to local connectivity) where the weights in many of the
blocks are equal (due to parameter sharing).
• Conversely, any FC layer can be converted to a CONV layer.
• For example, an FC layer with K = 4096 that is looking at an input volume of size 7×7×512
• can be equivalently expressed as a CONV layer with F = 7, P = 0, S = 1, K = 4096.
• In other words, we set the filter size to be exactly the size of the input volume; the output is then simply 1×1×4096, since only a single depth column “fits” across the input volume, giving a result identical to that of the initial FC layer.
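A quick way to see this equivalence is to copy the FC weights into a 7x7 convolution and compare outputs. A minimal PyTorch sketch using the sizes from the example above:

```python
import torch
import torch.nn as nn

# An FC layer with K=4096 looking at a 7x7x512 input volume...
fc = nn.Linear(7 * 7 * 512, 4096)
# ...expressed as a CONV layer with F=7, P=0, S=1, K=4096:
conv = nn.Conv2d(512, 4096, kernel_size=7, stride=1, padding=0)

# Same parameters, just reshaped into 4096 filters of size 7x7x512:
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))        # shape (1, 4096)
out_conv = conv(x).flatten(1)    # a 1x1x4096 volume, flattened to (1, 4096)
assert torch.allclose(out_fc, out_conv, atol=1e-4)
```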
ConvNet Architectures
Layer Patterns
• The most common architecture
• stacks a few CONV-RELU layers,
• follows them with POOL layers,
• and repeats this pattern until the image has been merged spatially
to a small size.
• At some point, it is common to transition to fully-connected layers.
The last fully-connected layer holds the output, such as the class
scores. In other words, the most common ConvNet architecture
follows the pattern:
INPUT -> [[CONV -> RELU]*N -> POOL?]*M ->[FC -> RELU]*K -> FC
• N >= 0 (and usually N <= 3), M >= 0, K >= 0
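As an illustration of the pattern, here is a minimal PyTorch sketch with N = 2, M = 2, K = 1; the 32x32x3 input, channel counts, and 10 classes are illustrative assumptions, not values from the slides:

```python
import torch.nn as nn

# INPUT -> [[CONV -> RELU]*2 -> POOL]*2 -> [FC -> RELU]*1 -> FC
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),                     # final FC holds the class scores
)
```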
Prefer a stack of small-filter CONV layers to one CONV layer with a large receptive field.
Compare three layers of 3x3 CONV with a single CONV layer with 7x7 receptive fields.
• The receptive field size is identical in spatial extent (7x7), but the single layer has several disadvantages:
1. Its neurons compute a linear function over the input, while the stack of three CONV layers contains non-linearities that make the features more expressive.
2. If we suppose that all the volumes have C channels, the single 7x7 CONV layer contains C×(7×7×C) = 49C² parameters, while the three 3x3 CONV layers contain 3×(C×(3×3×C)) = 27C² parameters.
• Intuitively, stacking CONV layers with tiny filters as opposed to
having one CONV layer with big filters allows us to express
more powerful features of the input, and with fewer
parameters.
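The parameter counts are easy to check; a quick sketch with an illustrative channel count (biases ignored):

```python
C = 64  # number of channels (illustrative)

single_7x7 = C * (7 * 7 * C)        # 49*C^2 parameters
stack_3x3 = 3 * (C * (3 * 3 * C))   # 27*C^2 parameters

print(single_7x7, stack_3x3)  # 200704 vs 110592: the 3x3 stack is smaller
```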
Recent Departures
• The conventional paradigm of a linear list of layers has recently been challenged, notably in:
1. Google’s Inception architectures
2. the current state-of-the-art Residual Networks from Microsoft Research Asia.
• Both feature more intricate, non-sequential connectivity structures.
Case Studies
• LeNet. The first successful applications of Convolutional Networks were developed by Yann LeCun in the 1990s; LeNet was used to read zip codes, digits, etc.
• AlexNet. Popularized Convolutional Networks in Computer Vision; developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton.
• AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the runner-up (top-5 error of 16% compared to 26%). The network had an architecture very similar to LeNet, but was deeper and bigger, and featured Convolutional Layers stacked directly on top of each other.
• ZF Net. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus.
• It improved on AlexNet by tweaking the architecture hyperparameters: expanding the size of the middle convolutional layers and making the stride and filter size of the first layer smaller.
Case Studies
• GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google.
• Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet’s 60M).
• Uses Average Pooling instead of Fully Connected layers at the top of the ConvNet.
• There are also several follow-up versions of GoogLeNet, most recently Inception-v4.
• VGGNet. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman.
• Showed that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers
• and features an extremely homogeneous architecture that performs only 3x3 convolutions and 2x2 pooling from the beginning to the end. Their pretrained model is available for plug-and-play use in Caffe.
• A downside: it is more expensive to evaluate and uses a lot more memory and parameters (140M). Most of these parameters are in the first fully connected layer, and it was since found that these FC layers can be removed with no performance downgrade, significantly reducing the number of necessary parameters.
LeNet
• Yann LeCun and his collaborators developed a really good
recognizer for handwritten digits.
• This net was used for reading ~10% of the checks in North
America.
• Demo at http://yann.lecun.com
[Figure: the architecture of LeNet-5]
From hand-written digits to 3-D objects
• Recognizing real objects in color photographs
downloaded from the web is much more complicated
than recognizing hand-written digits:
– A hundred times as many classes (1000 vs 10)
– A hundred times as many pixels (256 x 256 color vs 28 x 28 gray)
– Two dimensional image of three-dimensional scene.
– Cluttered scenes requiring segmentation
– Multiple objects in each image.
• Will the same type of convolutional neural network
work?
The ILSVRC-2012 competition on ImageNet
• The dataset has 1.2 million high-resolution training images.
• The classification task:
– Get the “correct” class in your top 5 bets. There are 1000
classes.
• The localization task:
– For each bet, put a box around the object. Your box must have
at least 50% overlap with the correct box.
• Some of the best existing computer vision methods were tried on
this dataset by leading computer vision groups from Oxford, INRIA,
XRCE, …
– These computer vision systems used complicated multi-stage pipelines.
– The early stages were typically hand-tuned by optimizing a few parameters.
[Figure: examples from the test set, with the network’s guesses]
A neural network for ImageNet
• Alex Krizhevsky (NIPS 2012) developed a very deep convolutional
neural net of the type pioneered by Yann Le Cun. Its architecture
was:
– 7 hidden layers not counting some max pooling layers.
– The early layers were convolutional.
– The last two layers were globally connected.
– 650,000 units, 60 million parameters
• The activation functions were:
– Rectified linear units in every hidden layer. These train much faster
and are more expressive than logistic units.
– Competitive normalization to suppress hidden activities when nearby
units have stronger activities. This helps with variations in intensity.
A Common Architecture: AlexNet
[Figure: AlexNet architecture]
ZF Net is AlexNet but with:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 15.4% -> 14.8%
Andrej Karpathy
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
[Figure: VGGNet configurations; best model highlighted]
Andrej Karpathy
Case Study: GoogLeNet
[Szegedy et al., 2014]
[Figure: GoogLeNet architecture with Inception modules; legend: Convolution, Pooling, Softmax, Other]
Andrej Karpathy
GoogLeNet vs. state of the art
[Figure: GoogLeNet compared with prior architectures; legend: Convolution, Pooling, Softmax, Other]
Andrej Karpathy
Case Study: ResNet
Andrej Karpathy
• Escape from a few layers:
– ReLU to address the vanishing gradient problem
– Dropout …
• Escape from 10 layers:
– Normalized initialization
– Intermediate normalization layers
• Escape from 100 layers:
– Residual networks
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)
at runtime: faster
than a VGGNet!
(even though it has
8x more layers)
Andrej Karpathy
Plain Network
• Plain nets: stacking 3x3 conv layers
• A 56-layer net has higher training error and test error than a 20-layer net
Residual Network
• Deeper ResNets have lower training error
• Other remarks
– No max pooling (almost)
– No hidden fc
– No dropout
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
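A minimal sketch of the basic (non-bottleneck) residual block from He et al.: the output is ReLU(F(x) + x), where F is two 3x3 convolutions with batch normalization and the identity shortcut carries x around them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # the shortcut (identity) connection

x = torch.randn(1, 64, 56, 56)
assert ResidualBlock(64)(x).shape == x.shape  # spatial size and depth preserved
```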
Network Design
• ResNet-152
– Uses bottlenecks
– ResNet-152 (11.3 billion FLOPs) has lower complexity than the VGG-16/19 nets (15.3/19.6 billion FLOPs)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Results
• Deep ResNets can be trained without difficulty
• Deeper ResNets have lower training error, and
also lower test error
Dropout: A simple way to prevent neural networks from overfitting [Srivastava et al., JMLR 2014]
Adapted from Jia-bin Huang
Data Augmentation (Jittering)
• Create virtual training samples
– Horizontal flip
– Random crop
– Color casting
– Geometric distortion
Andrej Karpathy
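The listed jittering operations map directly onto torchvision transforms. A sketch with illustrative parameter values (the crop size, padding, and jitter strengths are assumptions, not values from the slides):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                                 # horizontal flip
    transforms.RandomCrop(224, padding=8),                             # random crop
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # color casting
    transforms.RandomAffine(degrees=10),                               # mild geometric distortion
    transforms.ToTensor(),
])
```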
Transfer Learning with CNNs
Source: classification on ImageNet. Target: some other task/data.
[Figure: freeze the early (pretrained) layers; train the later layers on the target task]
Andrej Karpathy
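A minimal PyTorch sketch of the freeze/train split, assuming an ImageNet-pretrained ResNet-18 as the source model and a 10-class target task (both illustrative choices):

```python
import torch.nn as nn
from torchvision import models

# Source: a model pretrained on ImageNet classification.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze these: all pretrained layers.
for p in model.parameters():
    p.requires_grad = False

# Train this: a new final layer for the target task (requires_grad=True by default).
model.fc = nn.Linear(model.fc.in_features, 10)
```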
Simplest Way to Use CNNs
• Take model trained on, e.g., ImageNet 2012
training set
• Easiest: take the outputs of e.g. the 6th or 7th fully-connected layer, and plug the features from each layer into a linear SVM
• Features are the neuron activations at that level
• Can train a linear SVM for different tasks, not just the one used to learn the deep net
• Better: fine-tune features and/or classifier on
new dataset
• Classify test set of new dataset
Adapted from Lana Lazebnik
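A sketch of this recipe: a pretrained network as a fixed feature extractor feeding a linear SVM. A ResNet-18's penultimate-layer activations stand in for the fc6/fc7 features here, and the images and labels are placeholders:

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()   # expose the 512-d penultimate features
backbone.eval()

def extract(batch):           # batch: (B, 3, 224, 224) preprocessed images
    with torch.no_grad():
        return backbone(batch).numpy()

# Train a linear SVM on the extracted features for the new task:
X_train = extract(torch.randn(32, 3, 224, 224))   # placeholder images
y_train = np.random.randint(0, 2, size=32)        # placeholder labels
clf = LinearSVC().fit(X_train, y_train)
```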
Packages
• Caffe and Caffe Model Zoo
• Torch
• Theano with Keras/Lasagne
• MatConvNet
• TensorFlow
Learning Resources
• http://deeplearning.net/
• http://cs231n.stanford.edu
Things to remember
• Overview
– Neuroscience, perceptron, multi-layer neural
networks
• Convolutional neural network (CNN)
– Convolution, nonlinearity, max pooling
• Training CNN
– Dropout; data augmentation; transfer
learning
• Using CNNs for your own task
– Basic first step: try the pre-trained CaffeNet fc6-fc8 layers as features
Adapted from Jia-bin Huang