
Commit c22e572

Merge pull request lisa-lab#46 from mspandit/conv-mlp-edits
Edits to convolutional multilayer perceptron tutorial, for clarity and grammar.
2 parents a656ce2 + 9882526 commit c22e572

1 file changed (+107, -107)

doc/lenet.txt

Lines changed: 107 additions & 107 deletions
@@ -10,16 +10,16 @@ Convolutional Neural Networks (LeNet)
 `floatX`_, `downsample`_ , `conv2d`_, `dimshuffle`_. If you intend to run the
 code on GPU also read `GPU`_.

-To run this example on a GPU, you need a good GPU. First, it need
-at least 1G of GPU RAM and possibly more if your monitor is
+To run this example on a GPU, you need a good GPU. It needs
+at least 1GB of GPU RAM. More may be required if your monitor is
 connected to the GPU.
-
-Second, when the GPU is connected to the monitor, there is a limit
+
+When the GPU is connected to the monitor, there is a limit
 of a few seconds for each GPU function call. This is needed as
-current GPU can't be used for the monitor while doing
-computation. If there wasn't this limit, the screen would freeze
-for too long and this look as if the computer froze. User don't
-like this. This example hit this limit with medium GPU. When the
+current GPUs can't be used for the monitor while doing
+computation. Without this limit, the screen would freeze
+for too long and make it look as if the computer froze.
+This example hits this limit with medium-quality GPUs. When the
 GPU isn't connected to a monitor, there is no time limit. You can
 lower the batch size to fix the time out problem.

@@ -52,87 +52,87 @@ Convolutional Neural Networks (LeNet)
 Motivation
 ++++++++++

-Convolutional Neural Networks (CNN) are variants of MLPs which are inspired from
-biology. From Hubel and Wiesel's early work on the cat's visual cortex [Hubel68]_,
-we know there exists a complex arrangement of cells within the visual cortex.
-These cells are sensitive to small sub-regions of the input space, called a
-**receptive field**, and are tiled in such a way as to cover the entire visual
-field. These filters are local in input space and are thus better suited to
-exploit the strong spatially local correlation present in natural images.
+Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs.
+From Hubel and Wiesel's early work on the cat's visual cortex [Hubel68]_, we
+know the visual cortex contains a complex arrangement of cells. These cells are
+sensitive to small sub-regions of the visual field, called a *receptive
+field*. The sub-regions are tiled to cover the entire visual field. These
+cells act as local filters over the input space and are well-suited to exploit
+the strong spatially local correlation present in natural images.

-Additionally, two basic cell types have been identified: simple cells (S) and
-complex cells (C). Simple cells (S) respond maximally to specific edge-like
-stimulus patterns within their receptive field. Complex cells (C) have larger
-receptive fields and are locally invariant to the exact position of the
-stimulus.
+Additionally, two basic cell types have been identified: Simple cells respond
+maximally to specific edge-like patterns within their receptive field. Complex
+cells have larger receptive fields and are locally invariant to the exact
+position of the pattern.

-The visual cortex being the most powerful "vision" system in existence, it
-seems natural to emulate its behavior. Many such neurally inspired models can be
-found in the litterature. To name a few: the NeoCognitron [Fukushima]_, HMAX
-[Serre07]_ and LeNet-5 [LeCun98]_, which will be the focus of this tutorial.
+The animal visual cortex being the most powerful visual processing system in
+existence, it seems natural to emulate its behavior. Hence, many
+neurally-inspired models can be found in the literature. To name a few: the
+NeoCognitron [Fukushima]_, HMAX [Serre07]_ and LeNet-5 [LeCun98]_, which will
+be the focus of this tutorial.

 Sparse Connectivity
 +++++++++++++++++++

-CNNs exploit spatially local correlation by enforcing a local connectivity pattern between
-neurons of adjacent layers. The input hidden units in the m-th layer are
-connected to a local subset of units in the (m-1)-th layer, which have spatially
-contiguous receptive fields. We can illustrate this graphically as follows:
+CNNs exploit spatially-local correlation by enforcing a local connectivity
+pattern between neurons of adjacent layers. In other words, the inputs of
+hidden units in layer **m** are from a subset of units in layer **m-1**, units
+that have spatially contiguous receptive fields. We can illustrate this
+graphically as follows:

 .. figure:: images/sparse_1D_nn.png
     :align: center

-Imagine that layer **m-1** is the input retina.
-In the above, units in layer **m**
-have receptive fields of width 3 with respect to the input retina and are thus only
-connected to 3 adjacent neurons in the layer below (the retina).
-Units in layer **m** have
-a similar connectivity with the layer below. We say that their receptive
-field with respect to the layer below is also 3, but their receptive field
-with respect to the input is larger (it is 5).
-The architecture thus
-confines the learnt "filters" (corresponding to the input producing the strongest response) to be a spatially local pattern
-(since each unit is unresponsive to variations outside of its receptive field with respect to the retina).
-As shown above, stacking many such
-layers leads to "filters" (not anymore linear) which become increasingly "global" however (i.e
-spanning a larger region of pixel space). For example, the unit in hidden
-layer **m+1** can encode a non-linear feature of width 5 (in terms of pixel
-space).
+Imagine that layer **m-1** is the input retina. In the above figure, units in
+layer **m** have receptive fields of width 3 in the input retina and are thus
+only connected to 3 adjacent neurons in the retina layer. Units in layer **m**
+have a similar connectivity with the layer below. We say that their receptive
+field with respect to the layer below is also 3, but their receptive field with
+respect to the input is larger (5). Each unit is unresponsive to variations
+outside of its receptive field with respect to the retina. The architecture
+thus ensures that the learnt "filters" produce the strongest response to a
+spatially local input pattern.
+
+However, as shown above, stacking many such layers leads to (non-linear)
+"filters" that become increasingly "global" (i.e. responsive to a larger region
+of pixel space). For example, the unit in hidden layer **m+1** can encode a
+non-linear feature of width 5 (in terms of pixel space).

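The receptive-field arithmetic in the revised paragraph (width 3 per layer, width 5 after stacking two layers) can be sanity-checked with a short Python sketch. The helper below is illustrative only, not part of the tutorial code, and assumes stride-1 filters of width ``k``:

.. code-block:: python

    def receptive_field(num_layers, k=3):
        """Width, in retina pixels, seen by a unit after stacking `num_layers`
        convolutional layers of filter width k with stride 1.
        Each additional layer widens the receptive field by (k - 1)."""
        rf = 1
        for _ in range(num_layers):
            rf += k - 1
        return rf

    print(receptive_field(1))  # 3: a unit in layer m sees 3 retina pixels
    print(receptive_field(2))  # 5: a unit in layer m+1 sees 5 retina pixels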

 Shared Weights
 ++++++++++++++

-In CNNs, each sparse filter :math:`h_i` is additionally replicated across the
-entire visual field. These "replicated" units form a **feature map**, which
-share the same parametrization, i.e. the same weight vector and the same bias.
+In addition, in CNNs, each filter :math:`h_i` is replicated across the entire
+visual field. These replicated units share the same parameterization (weight
+vector and bias) and form a *feature map*.

 .. figure:: images/conv_1D_nn.png
     :align: center

 In the above figure, we show 3 hidden units belonging to the same feature map.
-Weights of the same color are shared, i.e. are constrained to be identical.
-Gradient descent can still be used to learn such shared parameters, and
-requires only a small change to the original algorithm. The gradient of a
-shared weight is simply the sum of the gradients of the parameters being
-shared.
+Weights of the same color are shared---constrained to be identical. Gradient
+descent can still be used to learn such shared parameters, with only a small
+change to the original algorithm. The gradient of a shared weight is simply the
+sum of the gradients of the parameters being shared.

-Why are shared weights interesting ? Replicating units in this way allows for
-features to be detected regardless of their position in the visual field.
-Additionally, weight sharing offers a very efficient way to do this, since it
-greatly reduces the number of free parameters to learn. By controlling model
-capacity, CNNs tend to achieve better generalization on vision problems.
+Replicating units in this way allows for features to be detected *regardless
+of their position in the visual field.* Additionally, weight sharing increases
+learning efficiency by greatly reducing the number of free parameters being
+learnt. The constraints on the model enable CNNs to achieve better
+generalization on vision problems.

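The claim that the gradient of a shared weight is the sum of the gradients of the parameters being shared can be verified with a minimal Theano sketch. It is illustrative only; the scalar variables below are made up and are not part of the tutorial:

.. code-block:: python

    import theano
    import theano.tensor as T

    w = theano.shared(2.0, name='w')           # one shared ("replicated") weight
    x1, x2 = T.dscalar('x1'), T.dscalar('x2')

    # the same weight is used in two places, as a replicated filter would be
    y = w * x1 + w * x2

    # dy/dw = x1 + x2: the sum of the gradients of the individual uses
    grad_w = T.grad(y, w)
    print(theano.function([x1, x2], grad_w)(3.0, 4.0))   # prints 7.0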

 Details and Notation
 ++++++++++++++++++++

-Conceptually, a feature map is obtained by convolving the input image with a
-linear filter, adding a bias term and then applying a non-linear function. If
-we denote the k-th feature map at a given layer as :math:`h^k`, whose filters
-are determined by the weights :math:`W^k` and bias :math:`b_k`, then the
-feature map :math:`h^k` is obtained as follows (for :math:`tanh` non-linearities):
+A feature map is obtained by repeated application of a function across
+sub-regions of the entire image, in other words, by *convolution* of the
+input image with a linear filter, adding a bias term and then applying a
+non-linear function. If we denote the k-th feature map at a given layer as
+:math:`h^k`, whose filters are determined by the weights :math:`W^k` and bias
+:math:`b_k`, then the feature map :math:`h^k` is obtained as follows (for
+:math:`tanh` non-linearities):

 .. math::
     h^k_{ij} = \tanh ( (W^k * x)_{ij} + b_k ).
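As a rough illustration of the formula above, a single feature map can be computed with a naive double loop in plain numpy. This is a sketch only, not the tutorial's implementation (which relies on Theano's convolution ops):

.. code-block:: python

    import numpy

    def feature_map(x, W, b):
        """h_ij = tanh((W * x)_ij + b) using a 'valid' 2D convolution:
        correlate each image patch with the flipped filter, add the bias,
        then apply the tanh non-linearity."""
        fh, fw = W.shape
        out_h, out_w = x.shape[0] - fh + 1, x.shape[1] - fw + 1
        h = numpy.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i:i + fh, j:j + fw]
                h[i, j] = numpy.sum(patch * W[::-1, ::-1]) + b
        return numpy.tanh(h)

    x = numpy.random.rand(5, 5)           # toy "image"
    W = numpy.random.rand(3, 3)           # toy filter
    print(feature_map(x, W, 0.1).shape)   # (3, 3)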
@@ -144,40 +144,39 @@ feature map :math:`h^k` is obtained as follows (for :math:`tanh` non-linearities
 This can be extended to 2D as follows:
 :math:`o[m,n] = f[m,n]*g[m,n] = \sum_{u=-\infty}^{\infty} \sum_{v=-\infty}^{\infty} f[u,v] g[m-u,n-v]`.

-To form a richer representation of the data, hidden layers are composed of
-a set of multiple feature maps, :math:`\{h^{(k)}, k=0..K\}`.
-The weights :math:`W` of this layer can be parametrized as a 4D tensor
-(destination feature map index, source feature map index, source vertical position index, source horizontal position index)
-and
-the biases :math:`b` as a vector (one element per destination feature map index).
-We illustrate this graphically as follows:
+To form a richer representation of the data, each hidden layer is composed of
+*multiple* feature maps, :math:`\{h^{(k)}, k=0..K\}`. The weights :math:`W` of
+a hidden layer can be represented in a 4D tensor containing elements for every
+combination of destination feature map, source feature map, source vertical
+position, and source horizontal position. The biases :math:`b` can be
+represented as a vector containing one element for every destination feature
+map. We illustrate this graphically as follows:

 .. figure:: images/cnn_explained.png
     :align: center

     **Figure 1**: example of a convolutional layer

-Here, we show two layers of a CNN, containing 4 feature maps at layer (m-1)
-and 2 feature maps (:math:`h^0` and :math:`h^1`) at layer m. Pixels (neuron outputs) in
-:math:`h^0` and :math:`h^1` (outlined as blue and red squares) are computed
-from pixels of layer (m-1) which fall within their 2x2 receptive field in the
-layer below (shown
-as colored rectangles). Notice how the receptive field spans all four input
-feature maps. The weights :math:`W^0` and :math:`W^1` of :math:`h^0` and
-:math:`h^1` are thus 3D weight tensors. The leading dimension indexes the
-input feature maps, while the other two refer to the pixel coordinates.
+The figure shows two layers of a CNN. **Layer m-1** contains four feature maps.
+**Hidden layer m** contains two feature maps (:math:`h^0` and :math:`h^1`).
+Pixels (neuron outputs) in :math:`h^0` and :math:`h^1` (outlined as blue and
+red squares) are computed from pixels of layer (m-1) which fall within their
+2x2 receptive field in the layer below (shown as colored rectangles). Notice
+how the receptive field spans all four input feature maps. The weights
+:math:`W^0` and :math:`W^1` of :math:`h^0` and :math:`h^1` are thus 3D weight
+tensors. The leading dimension indexes the input feature maps, while the other
+two refer to the pixel coordinates.

 Putting it all together, :math:`W^{kl}_{ij}` denotes the weight connecting
 each pixel of the k-th feature map at layer m, with the pixel at coordinates
 (i,j) of the l-th feature map of layer (m-1).


-The ConvOp
-++++++++++
+The Convolution Operator
+++++++++++++++++++++++++

 ConvOp is the main workhorse for implementing a convolutional layer in Theano.
-It is meant to replicate the behaviour of scipy.signal.convolve2d. Conceptually,
-the ConvOp (once instantiated) takes two symbolic inputs:
+ConvOp is used by ``theano.tensor.signal.conv2d``, which takes two symbolic inputs:


 * a 4D tensor corresponding to a mini-batch of input images. The shape of the
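A minimal sketch of the two symbolic inputs and their shapes, assuming the ``theano.tensor.nnet.conv.conv2d`` interface; the shapes and variable names below are made up for illustration and are not the tutorial's code:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T
    from theano.tensor.nnet import conv

    rng = numpy.random.RandomState(1234)

    # mini-batch of images: (batch size, input feature maps, image height, image width)
    input = T.tensor4(name='input')

    # filter bank: (output feature maps, input feature maps, filter height, filter width)
    w_shp = (2, 3, 9, 9)
    W = theano.shared(
        numpy.asarray(rng.uniform(low=-0.1, high=0.1, size=w_shp), dtype=input.dtype),
        name='W')

    conv_out = conv.conv2d(input, W)      # 4D output: (batch, 2, output height, output width)
    f = theano.function([input], conv_out)

    minibatch = numpy.random.rand(4, 3, 28, 28).astype(input.dtype)  # four 3-channel 28x28 images
    print(f(minibatch).shape)             # (4, 2, 20, 20) in the default 'valid' mode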
@@ -284,38 +283,39 @@ This should generate the following output.

 Notice that a randomly initialized filter acts very much like an edge detector!

-Also of note, remark that we use the same weight initialization formula as
-with the MLP. Weights are sampled randomly from a uniform distribution in the
-range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a hidden
-unit. For MLPs, this was the number of units in the layer below. For CNNs
-however, we have to take into account the number of input feature maps and the
-size of the receptive fields.
+Note that we use the same weight initialization formula as with the MLP.
+Weights are sampled randomly from a uniform distribution in the range
+[-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a hidden unit.
+For MLPs, this was the number of units in the layer below. For CNNs however, we
+have to take into account the number of input feature maps and the size of the
+receptive fields.

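The CNN fan-in rule described above (number of input feature maps times the size of the receptive field) can be written out as a short sketch; the filter shape below is a made-up example, not taken from the tutorial:

.. code-block:: python

    import numpy

    rng = numpy.random.RandomState(1234)

    # hypothetical layer: 50 output maps, 20 input maps, 5x5 receptive fields
    filter_shape = (50, 20, 5, 5)

    # fan-in of a hidden unit = input feature maps * filter height * filter width
    fan_in = numpy.prod(filter_shape[1:])          # 20 * 5 * 5 = 500

    # sample W uniformly in [-1/fan-in, 1/fan-in], as for the MLP
    W = rng.uniform(low=-1.0 / fan_in, high=1.0 / fan_in, size=filter_shape)
    print(fan_in, W.shape)                         # 500 (50, 20, 5, 5)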

 MaxPooling
 ++++++++++

-Another important concept of CNNs is that of max-pooling, which is a form of
+Another important concept of CNNs is *max-pooling,* which is a form of
 non-linear down-sampling. Max-pooling partitions the input image into
 a set of non-overlapping rectangles and, for each such sub-region, outputs the
 maximum value.

-Max-pooling is useful in vision for two reasons: (1) it reduces the
-computational complexity for upper layers and (2) it provides a form of
-translation invariance. To understand the invariance argument, imagine
-cascading a max-pooling layer with a convolutional layer. There are 8
-directions in which one can translate the input image by a single pixel. If
-max-pooling is done over a 2x2 region, 3 out of these 8 possible
-configurations will produce exactly the same output at the convolutional
-layer. For max-pooling over a 3x3 window, this jumps to 5/8.
+Max-pooling is useful in vision for two reasons:
+    #. By eliminating non-maximal values, it reduces computation for upper layers.
+
+    #. It provides a form of translation invariance. Imagine
+       cascading a max-pooling layer with a convolutional layer. There are 8
+       directions in which one can translate the input image by a single pixel.
+       If max-pooling is done over a 2x2 region, 3 out of these 8 possible
+       configurations will produce exactly the same output at the convolutional
+       layer. For max-pooling over a 3x3 window, this jumps to 5/8.

-Since it provides additional robustness to position, max-pooling is thus a
-"smart" way of reducing the dimensionality of intermediate representations.
+Since it provides additional robustness to position, max-pooling is a
+"smart" way of reducing the dimensionality of intermediate representations.

-Max-pooling is done in Theano by way of ``theano.tensor.signal.downsample.max_pool_2d``.
-This function takes as input an N dimensional tensor (with N >= 2), a
-downscaling factor and performs max-pooling over the 2 trailing dimensions of
-the tensor.
+Max-pooling is done in Theano by way of
+``theano.tensor.signal.downsample.max_pool_2d``. This function takes as input
+an N dimensional tensor (where N >= 2) and a downscaling factor and performs
+max-pooling over the 2 trailing dimensions of the tensor.

 An example is worth a thousand words:

@@ -366,10 +366,10 @@ This should generate the following output:
  [ 0.66379465 0.94459476 0.58655504]
  [ 0.90340192 0.80739129 0.39767684]]

-Note that contrary to most Theano code, the ``max_pool_2d`` operation is a little
-*special*. It requires the downscaling factor ``ds`` (tuple of length 2 containing
-downscaling factors for image width and height) to be known at graph build
-time. This may change in the near future.
+Note that compared to most Theano code, the ``max_pool_2d`` operation is a
+little *special*. It requires the downscaling factor ``ds`` (tuple of length 2
+containing downscaling factors for image width and height) to be known at graph
+build time. This may change in the near future.


 The Full Model: LeNet
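A minimal usage sketch of ``theano.tensor.signal.downsample.max_pool_2d`` as discussed in the two hunks above, with the downscaling factor ``ds`` fixed at graph-build time; the input shape and the ``ignore_border`` setting are made-up examples, not the tutorial's code:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T
    from theano.tensor.signal import downsample

    input = T.dtensor4('input')
    ds = (2, 2)                        # must be a constant known at graph-build time
    pooled = downsample.max_pool_2d(input, ds, ignore_border=True)
    f = theano.function([input], pooled)

    x = numpy.random.rand(1, 1, 6, 6)
    print(f(x).shape)                  # (1, 1, 3, 3): each 2x2 block reduced to its max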
