@@ -10,16 +10,16 @@ Convolutional Neural Networks (LeNet)
`floatX`_, `downsample`_ , `conv2d`_, `dimshuffle`_. If you intend to run the
code on GPU also read `GPU`_.

- To run this example on a GPU, you need a good GPU. First, it need
- at least 1G of GPU RAM and possibly more if your monitor is
+ To run this example on a GPU, you need a good GPU. It needs
+ at least 1GB of GPU RAM. More may be required if your monitor is
connected to the GPU.
-
- Second, when the GPU is connected to the monitor, there is a limit
+
+ When the GPU is connected to the monitor, there is a limit
of a few seconds for each GPU function call. This is needed as
- current GPU can't be used for the monitor while doing
- computation. If there wasn't this limit, the screen would freeze
- for too long and this look as if the computer froze. User don't
- like this. This example hit this limit with medium GPU . When the
+ current GPUs can't be used for the monitor while doing
+ computation. Without this limit, the screen would freeze
+ for too long and make it look as if the computer froze.
+ This example hits this limit with medium-quality GPUs. When the
GPU isn't connected to a monitor, there is no time limit. You can
lower the batch size to fix the time out problem.

@@ -52,87 +52,87 @@ Convolutional Neural Networks (LeNet)
Motivation
++++++++++

- Convolutional Neural Networks (CNN) are variants of MLPs which are inspired from
- biology. From Hubel and Wiesel's early work on the cat's visual cortex [Hubel68]_,
- we know there exists a complex arrangement of cells within the visual cortex.
- These cells are sensitive to small sub-regions of the input space , called a
- **receptive field**, and are tiled in such a way as to cover the entire visual
- field. These filters are local in input space and are thus better suited to
- exploit the strong spatially local correlation present in natural images.
+ Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs.
+ From Hubel and Wiesel's early work on the cat's visual cortex [Hubel68]_, we
+ know the visual cortex contains a complex arrangement of cells. These cells are
+ sensitive to small sub-regions of the visual field, called a *receptive
+ field*. The sub-regions are tiled to cover the entire visual field. These
+ cells act as local filters over the input space and are well-suited to exploit
+ the strong spatially local correlation present in natural images.

- Additionally, two basic cell types have been identified: simple cells (S) and
- complex cells (C). Simple cells (S) respond maximally to specific edge-like
- stimulus patterns within their receptive field. Complex cells (C) have larger
- receptive fields and are locally invariant to the exact position of the
- stimulus.
+ Additionally, two basic cell types have been identified: Simple cells respond
+ maximally to specific edge-like patterns within their receptive field. Complex
+ cells have larger receptive fields and are locally invariant to the exact
+ position of the pattern.

- The visual cortex being the most powerful "vision" system in existence, it
- seems natural to emulate its behavior. Many such neurally inspired models can be
- found in the litterature. To name a few: the NeoCognitron [Fukushima]_, HMAX
- [Serre07]_ and LeNet-5 [LeCun98]_, which will be the focus of this tutorial.
+ The animal visual cortex being the most powerful visual processing system in
+ existence, it seems natural to emulate its behavior. Hence, many
+ neurally-inspired models can be found in the literature. To name a few: the
+ NeoCognitron [Fukushima]_, HMAX [Serre07]_ and LeNet-5 [LeCun98]_, which will
+ be the focus of this tutorial.

Sparse Connectivity
+++++++++++++++++++

- CNNs exploit spatially local correlation by enforcing a local connectivity pattern between
- neurons of adjacent layers. The input hidden units in the m-th layer are
- connected to a local subset of units in the (m-1)-th layer, which have spatially
- contiguous receptive fields. We can illustrate this graphically as follows:
+ CNNs exploit spatially-local correlation by enforcing a local connectivity
+ pattern between neurons of adjacent layers. In other words, the inputs of
+ hidden units in layer **m** are from a subset of units in layer **m-1**, units
+ that have spatially contiguous receptive fields. We can illustrate this
+ graphically as follows:

.. figure:: images/sparse_1D_nn.png
:align: center

- Imagine that layer **m-1** is the input retina.
- In the above, units in layer **m**
- have receptive fields of width 3 with respect to the input retina and are thus only
- connected to 3 adjacent neurons in the layer below (the retina).
- Units in layer **m** have
- a similar connectivity with the layer below. We say that their receptive
- field with respect to the layer below is also 3, but their receptive field
- with respect to the input is larger (it is 5).
- The architecture thus
- confines the learnt "filters" (corresponding to the input producing the strongest response) to be a spatially local pattern
- (since each unit is unresponsive to variations outside of its receptive field with respect to the retina).
- As shown above, stacking many such
- layers leads to "filters" (not anymore linear) which become increasingly "global" however (i.e
- spanning a larger region of pixel space). For example, the unit in hidden
- layer **m+1** can encode a non-linear feature of width 5 (in terms of pixel
- space).
+ Imagine that layer **m-1** is the input retina. In the above figure, units in
+ layer **m** have receptive fields of width 3 in the input retina and are thus
+ only connected to 3 adjacent neurons in the retina layer. Units in layer **m+1**
+ have a similar connectivity with the layer below. We say that their receptive
+ field with respect to the layer below is also 3, but their receptive field with
+ respect to the input is larger (5). Each unit is unresponsive to variations
+ outside of its receptive field with respect to the retina. The architecture
+ thus ensures that the learnt "filters" produce the strongest response to a
+ spatially local input pattern.
+
+ However, as shown above, stacking many such layers leads to (non-linear)
+ "filters" that become increasingly "global" (i.e. responsive to a larger region
+ of pixel space). For example, the unit in hidden layer **m+1** can encode a
+ non-linear feature of width 5 (in terms of pixel space).

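A quick sketch of the receptive-field arithmetic above, assuming stride 1 and a
filter width of 3 at every layer (this helper is illustrative, not part of the
tutorial code):

.. code-block:: python

    # Receptive field of a unit with respect to the retina after stacking
    # n_layers of width-3, stride-1 "filters" on top of it.
    def receptive_field(n_layers, filter_width=3):
        rf = 1
        for _ in range(n_layers):
            rf += filter_width - 1
        return rf

    print(receptive_field(1))   # 3 -> units in layer m
    print(receptive_field(2))   # 5 -> units in layer m+1
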
Shared Weights
++++++++++++++

- In CNNs, each sparse filter :math:`h_i` is additionally replicated across the
- entire visual field. These " replicated" units form a **feature map**, which
- share the same parametrization, i.e. the same weight vector and the same bias .
+ In addition, in CNNs, each filter :math:`h_i` is replicated across the entire
+ visual field. These replicated units share the same parameterization (weight
+ vector and bias) and form a *feature map*.

.. figure:: images/conv_1D_nn.png
:align: center
In the above figure, we show 3 hidden units belonging to the same feature map.
- Weights of the same color are shared, i.e. are constrained to be identical.
- Gradient descent can still be used to learn such shared parameters, and
- requires only a small change to the original algorithm. The gradient of a
- shared weight is simply the sum of the gradients of the parameters being
- shared.
+ Weights of the same color are shared---constrained to be identical. Gradient
+ descent can still be used to learn such shared parameters, with only a small
+ change to the original algorithm. The gradient of a shared weight is simply the
+ sum of the gradients of the parameters being shared.

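A toy numerical check of that last statement (the function and the values below
are arbitrary):

.. code-block:: python

    # If the same scalar w is used on two connections, the gradient of the
    # output with respect to w is the sum of the per-connection gradients.
    x1, x2 = 2.0, 3.0
    f = lambda w: w * x1 + w * x2       # the shared weight appears twice

    w, eps = 0.5, 1e-6
    numeric_grad = (f(w + eps) - f(w - eps)) / (2 * eps)
    analytic_grad = x1 + x2             # sum of the two copies' gradients
    print(numeric_grad, analytic_grad)  # both are (approximately) 5.0
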
- Why are shared weights interesting ? Replicating units in this way allows for
- features to be detected regardless of their position in the visual field.
- Additionally, weight sharing offers a very efficient way to do this, since it
- greatly reduces the number of free parameters to learn. By controlling model
- capacity, CNNs tend to achieve better generalization on vision problems.
+ Replicating units in this way allows for features to be detected *regardless
+ of their position in the visual field.* Additionally, weight sharing increases
+ learning efficiency by greatly reducing the number of free parameters being
+ learnt. The constraints on the model enable CNNs to achieve better
+ generalization on vision problems.

Details and Notation
++++++++++++++++++++

- Conceptually, a feature map is obtained by convolving the input image with a
- linear filter, adding a bias term and then applying a non-linear function. If
- we denote the k-th feature map at a given layer as :math:`h^k`, whose filters
- are determined by the weights :math:`W^k` and bias :math:`b_k`, then the
- feature map :math:`h^k` is obtained as follows (for :math:`tanh` non-linearities):
+ A feature map is obtained by repeated application of a function across
+ sub-regions of the entire image, in other words, by *convolution* of the
+ input image with a linear filter, adding a bias term and then applying a
+ non-linear function. If we denote the k-th feature map at a given layer as
+ :math:`h^k`, whose filters are determined by the weights :math:`W^k` and bias
+ :math:`b_k`, then the feature map :math:`h^k` is obtained as follows (for
+ :math:`tanh` non-linearities):

.. math::
h^k_{ij} = \tanh ( (W^k * x)_{ij} + b_k ).
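
As a rough sketch of what this formula computes (plain NumPy/SciPy rather than
the tutorial's Theano code; the names ``x``, ``W_k``, ``b_k`` and the 28x28 /
5x5 sizes are arbitrary):

.. code-block:: python

    # Minimal sketch: one feature map h_k = tanh(conv2d(x, W_k) + b_k).
    import numpy
    from scipy.signal import convolve2d

    rng = numpy.random.RandomState(23455)
    x = rng.uniform(size=(28, 28))              # a toy single-channel image
    W_k = rng.uniform(-0.1, 0.1, size=(5, 5))   # one 5x5 linear filter
    b_k = 0.1                                   # scalar bias for this map

    # 'valid' mode only keeps positions where the filter fully overlaps x,
    # so a 28x28 input and a 5x5 filter give a 24x24 feature map.
    h_k = numpy.tanh(convolve2d(x, W_k, mode='valid') + b_k)
    print(h_k.shape)                            # (24, 24)
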
@@ -144,40 +144,39 @@ feature map :math:`h^k` is obtained as follows (for :math:`tanh` non-linearities
This can be extended to 2D as follows:
:math:`o[m,n] = f[m,n]*g[m,n] = \sum_{u=-\infty}^{\infty} \sum_{v=-\infty}^{\infty} f[u,v] g[m-u,n-v]`.
- To form a richer representation of the data, hidden layers are composed of
- a set of multiple feature maps, :math:`\{h^{(k)}, k=0..K\}`.
- The weights :math:`W` of this layer can be parametrized as a 4D tensor
- ( destination feature map index , source feature map index , source vertical position index, source horizontal position index)
- and
- the biases :math:`b` as a vector ( one element per destination feature map index).
- We illustrate this graphically as follows:
+ To form a richer representation of the data, each hidden layer is composed of
+ *multiple* feature maps, :math:`\{h^{(k)}, k=0..K\}`. The weights :math:`W` of
+ a hidden layer can be represented in a 4D tensor containing elements for every
+ combination of destination feature map, source feature map, source vertical
+ position, and source horizontal position. The biases :math:`b` can be
+ represented as a vector containing one element for every destination feature
+ map. We illustrate this graphically as follows:

.. figure:: images/cnn_explained.png
:align: center
**Figure 1**: example of a convolutional layer
- Here, we show two layers of a CNN, containing 4 feature maps at layer (m-1)
- and 2 feature maps (:math:`h^0` and :math:`h^1`) at layer m. Pixels (neuron outputs) in
- :math:`h^0` and :math:`h^1` (outlined as blue and red squares) are computed
- from pixels of layer (m-1) which fall within their 2x2 receptive field in the
- layer below (shown
- as colored rectangles). Notice how the receptive field spans all four input
- feature maps. The weights :math:`W^0` and :math:`W^1` of :math:`h^0` and
- :math:`h^1` are thus 3D weight tensors. The leading dimension indexes the
- input feature maps, while the other two refer to the pixel coordinates.
+ The figure shows two layers of a CNN. **Layer m-1** contains four feature maps.
+ **Hidden layer m** contains two feature maps (:math:`h^0` and :math:`h^1`).
+ Pixels (neuron outputs) in :math:`h^0` and :math:`h^1` (outlined as blue and
+ red squares) are computed from pixels of layer (m-1) which fall within their
+ 2x2 receptive field in the layer below (shown as colored rectangles). Notice
+ how the receptive field spans all four input feature maps. The weights
+ :math:`W^0` and :math:`W^1` of :math:`h^0` and :math:`h^1` are thus 3D weight
+ tensors. The leading dimension indexes the input feature maps, while the other
+ two refer to the pixel coordinates.

Putting it all together, :math:`W^{kl}_{ij}` denotes the weight connecting
each pixel of the k-th feature map at layer m, with the pixel at coordinates
(i,j) of the l-th feature map of layer (m-1).

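To make the indexing concrete, a shape-only sketch (the 2 / 4 / 2x2 sizes come
from Figure 1; the variable names are arbitrary):

.. code-block:: python

    # Shape-only sketch of the layer parameters described above.
    # W[k, l, i, j] connects destination map k to source map l at filter
    # position (i, j); b[k] is the single bias of destination map k.
    import numpy

    n_dst_maps = 2                  # feature maps h^0, h^1 at layer m
    n_src_maps = 4                  # feature maps at layer m-1
    filter_h, filter_w = 2, 2       # the 2x2 receptive field of Figure 1

    W = numpy.zeros((n_dst_maps, n_src_maps, filter_h, filter_w))
    b = numpy.zeros(n_dst_maps)
    print(W.shape, b.shape)         # (2, 4, 2, 2) (2,)
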
- The ConvOp
- ++++++++++
+ The Convolution Operator
+ ++++++++++++++++++++++++

ConvOp is the main workhorse for implementing a convolutional layer in Theano.
- It is meant to replicate the behaviour of scipy.signal.convolve2d. Conceptually,
- the ConvOp (once instantiated) takes two symbolic inputs:
+ ConvOp is used by ``theano.tensor.signal.conv2d``, which takes two symbolic inputs:

* a 4D tensor corresponding to a mini-batch of input images. The shape of the
@@ -284,38 +283,39 @@ This should generate the following output.
Notice that a randomly initialized filter acts very much like an edge detector!
- Also of note, remark that we use the same weight initialization formula as
- with the MLP. Weights are sampled randomly from a uniform distribution in the
- range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a hidden
- unit. For MLPs, this was the number of units in the layer below. For CNNs
- however, we have to take into account the number of input feature maps and the
- size of the receptive fields.
+ Note that we use the same weight initialization formula as with the MLP.
+ Weights are sampled randomly from a uniform distribution in the range
+ [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a hidden unit.
+ For MLPs, this was the number of units in the layer below. For CNNs however, we
+ have to take into account the number of input feature maps and the size of the
+ receptive fields.

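A small sketch of that fan-in computation (the layer sizes below are arbitrary;
only the formula matters):

.. code-block:: python

    # fan-in for a convolutional hidden unit: it sees every input feature map
    # through its receptive field.
    import numpy

    rng = numpy.random.RandomState(1234)

    n_in_maps = 20                  # input feature maps (illustrative)
    filter_h, filter_w = 5, 5       # size of the receptive field
    n_filters = 50                  # output feature maps (illustrative)

    fan_in = n_in_maps * filter_h * filter_w
    W_bound = 1.0 / fan_in
    W = rng.uniform(low=-W_bound, high=W_bound,
                    size=(n_filters, n_in_maps, filter_h, filter_w))
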
MaxPooling
++++++++++

- Another important concept of CNNs is that of max-pooling, which is a form of
+ Another important concept of CNNs is *max-pooling,* which is a form of
non-linear down-sampling. Max-pooling partitions the input image into
a set of non-overlapping rectangles and, for each such sub-region, outputs the
maximum value.

- Max-pooling is useful in vision for two reasons: (1) it reduces the
- computational complexity for upper layers and (2) it provides a form of
- translation invariance. To understand the invariance argument, imagine
- cascading a max-pooling layer with a convolutional layer. There are 8
- directions in which one can translate the input image by a single pixel. If
- max-pooling is done over a 2x2 region, 3 out of these 8 possible
- configurations will produce exactly the same output at the convolutional
- layer. For max-pooling over a 3x3 window, this jumps to 5/8.
+ Max-pooling is useful in vision for two reasons:
+ #. By eliminating non-maximal values, it reduces computation for upper layers.
+
+ #. It provides a form of translation invariance. Imagine
+ cascading a max-pooling layer with a convolutional layer. There are 8
+ directions in which one can translate the input image by a single pixel.
+ If max-pooling is done over a 2x2 region, 3 out of these 8 possible
+ configurations will produce exactly the same output at the convolutional
+ layer. For max-pooling over a 3x3 window, this jumps to 5/8.

- Since it provides additional robustness to position, max-pooling is thus a
- "smart" way of reducing the dimensionality of intermediate representations.
+ Since it provides additional robustness to position, max-pooling is a
+ "smart" way of reducing the dimensionality of intermediate representations.

- Max-pooling is done in Theano by way of ``theano.tensor.signal.downsample.max_pool_2d``.
- This function takes as input an N dimensional tensor (with N >= 2), a
- downscaling factor and performs max-pooling over the 2 trailing dimensions of
- the tensor.
+ Max-pooling is done in Theano by way of
+ ``theano.tensor.signal.downsample.max_pool_2d``. This function takes as input
+ an N dimensional tensor (where N >= 2) and a downscaling factor and performs
+ max-pooling over the 2 trailing dimensions of the tensor.

An example is worth a thousand words:
@@ -366,10 +366,10 @@ This should generate the following output:
[ 0.66379465 0.94459476 0.58655504]
[ 0.90340192 0.80739129 0.39767684]]
- Note that contrary to most Theano code, the ``max_pool_2d`` operation is a little
- *special*. It requires the downscaling factor ``ds`` (tuple of length 2 containing
- downscaling factors for image width and height) to be known at graph build
- time. This may change in the near future.
+ Note that compared to most Theano code, the ``max_pool_2d`` operation is a
+ little *special*. It requires the downscaling factor ``ds`` (tuple of length 2
+ containing downscaling factors for image width and height) to be known at graph
+ build time. This may change in the near future.

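A minimal usage sketch of this point, assuming
``theano.tensor.signal.downsample.max_pool_2d`` as above (the variable names
are arbitrary): ``ds`` is an ordinary Python tuple baked into the graph at
build time, while only the input is symbolic.

.. code-block:: python

    # ds is fixed when the graph is built; only `input` is symbolic.
    import theano
    import theano.tensor as T
    from theano.tensor.signal import downsample

    input = T.dtensor4('input')
    pooled = downsample.max_pool_2d(input, ds=(2, 2), ignore_border=True)
    pool_fn = theano.function([input], pooled)
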
The Full Model: LeNet