Commit 20897f5

Merge pull request lisa-lab#51 from mspandit/sda-edits
Edits to SdA tutorial for clarity.
2 parents e64f050 + 5fea1bb commit 20897f5

1 file changed: doc/SdA.txt (54 additions, 52 deletions)

@@ -4,7 +4,7 @@ Stacked Denoising Autoencoders (SdA)
 ====================================

 .. note::
-  This section assumes the reader has already read through :doc:`logreg`
+  This section assumes you have already read through :doc:`logreg`
   and :doc:`mlp`. Additionally it uses the following Theano functions
   and concepts : `T.tanh`_, `shared variables`_, `basic arithmetic ops`_, `T.grad`_, `Random numbers`_, `floatX`_. If you intend to run the code on GPU also read `GPU`_.

@@ -32,46 +32,48 @@ Stacked Denoising Autoencoders (SdA)
 The Stacked Denoising Autoencoder (SdA) is an extension of the stacked
 autoencoder [Bengio07]_ and it was introduced in [Vincent08]_.

-This tutorial builds on the previous tutorial :ref:`dA` and we recommend,
-especially if you do not have experience with autoencoders, to read it
+This tutorial builds on the previous tutorial :ref:`dA`.
+Especially if you do not have experience with autoencoders, we recommend reading it
 before going any further.

 .. _stacked_autoencoders:

 Stacked Autoencoders
 ++++++++++++++++++++

-The denoising autoencoders can be stacked to form a deep network by
+Denoising autoencoders can be stacked to form a deep network by
 feeding the latent representation (output code)
-of the denoising auto-encoder found on the layer
+of the denoising autoencoder found on the layer
 below as input to the current layer. The **unsupervised pre-training** of such an
 architecture is done one layer at a time. Each layer is trained as
-a denoising auto-encoder by minimizing the reconstruction of its input
+a denoising autoencoder by minimizing the error in reconstructing its input
 (which is the output code of the previous layer).
 Once the first :math:`k` layers
 are trained, we can train the :math:`k+1`-th layer because we can now
 compute the code or latent representation from the layer below.
+
 Once all layers are pre-trained, the network goes through a second stage
 of training called **fine-tuning**. Here we consider **supervised fine-tuning**
 where we want to minimize prediction error on a supervised task.
-For this we first add a logistic regression
+For this, we first add a logistic regression
 layer on top of the network (more precisely on the output code of the
 output layer). We then
 train the entire network as we would train a multilayer
 perceptron. At this point, we only consider the encoding parts of
 each auto-encoder.
 This stage is supervised, since now we use the target class during
-training (see the :ref:`mlp` for details on the multilayer perceptron).
+training. (See the :ref:`mlp` for details on the multilayer perceptron.)

 This can be easily implemented in Theano, using the class defined
-before for a denoising autoencoder. We can see the stacked denoising
-autoencoder as having two facades, one is a list of
-autoencoders, the other is an MLP. During pre-training we use the first facade, i.e we treat our model
+previously for a denoising autoencoder. We can see the stacked denoising
+autoencoder as having two facades: a list of
+autoencoders, and an MLP. During pre-training we use the first facade, i.e., we treat our model
 as a list of autoencoders, and train each autoencoder separately. In the
-second stage of training, we use the second facade. These two
-facedes are linked by the fact that the autoencoders and the sigmoid layers of
-the MLP share parameters, and the fact that autoencoders get as input latent
-representations of intermediate layers of the MLP.
+second stage of training, we use the second facade. These two facades are linked because:
+
+* the autoencoders and the sigmoid layers of the MLP share parameters, and
+
+* the latent representations computed by intermediate layers of the MLP are fed as input to the autoencoders.

 .. literalinclude:: ../code/SdA.py
   :start-after: start-snippet-1
@@ -80,78 +82,78 @@ representations of intermediate layers of the MLP.
 ``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade, while
 ``self.dA_layers`` will store the denoising autoencoder associated with the layers of the MLP.

-Next step, we construct ``n_layers`` sigmoid layers (we use the
-``HiddenLayer`` class introduced in :ref:`mlp`, with the only
-modification that we replaced the non-linearity from ``tanh`` to the
-logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`) and ``n_layers``
-denoising autoencoders, where ``n_layers`` is the depth of our model.
-We link the sigmoid layers such that they form an MLP, and construct
-each denoising autoencoder such that they share the weight matrix and the
-bias of the encoding part with its corresponding sigmoid layer.
+Next, we construct ``n_layers`` sigmoid layers and ``n_layers`` denoising
+autoencoders, where ``n_layers`` is the depth of our model. We use the
+``HiddenLayer`` class introduced in :ref:`mlp`, with one
+modification: we replace the ``tanh`` non-linearity with the
+logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`.
+We link the sigmoid layers to form an MLP, and construct
+the denoising autoencoders such that each shares the weight matrix and the
+bias of its encoding part with its corresponding sigmoid layer.

 .. literalinclude:: ../code/SdA.py
   :start-after: start-snippet-2
   :end-before: end-snippet-2

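Since the diff does not reproduce the included snippet, here is a rough, hypothetical sketch of the construction loop described above. The ``HiddenLayer`` and ``dA`` classes (and their argument names) are assumed to be the ones from the :ref:`mlp` and :ref:`dA` tutorials; treat this as an illustration rather than the repository's actual code.

.. code-block:: python

    import numpy
    import theano.tensor as T

    from mlp import HiddenLayer   # HiddenLayer from the mlp tutorial (assumed)
    from dA import dA             # dA from the dA tutorial (assumed)

    numpy_rng = numpy.random.RandomState(89677)
    x = T.matrix('x')                        # symbolic input minibatch
    hidden_layers_sizes = [1000, 1000, 1000]
    n_ins = 28 * 28

    sigmoid_layers, dA_layers, params = [], [], []
    for i in range(len(hidden_layers_sizes)):
        n_in = n_ins if i == 0 else hidden_layers_sizes[i - 1]
        layer_input = x if i == 0 else sigmoid_layers[-1].output

        # Sigmoid layer of the MLP facade.
        sigmoid_layer = HiddenLayer(rng=numpy_rng, input=layer_input,
                                    n_in=n_in, n_out=hidden_layers_sizes[i],
                                    activation=T.nnet.sigmoid)
        sigmoid_layers.append(sigmoid_layer)
        params.extend(sigmoid_layer.params)

        # The dA of this layer reuses (shares) the sigmoid layer's W and b
        # as its encoding parameters.
        dA_layers.append(dA(numpy_rng=numpy_rng, input=layer_input,
                            n_visible=n_in, n_hidden=hidden_layers_sizes[i],
                            W=sigmoid_layer.W, bhid=sigmoid_layer.b))

Because ``W`` and ``bhid`` here are the very same shared variables as the sigmoid layer's parameters, pre-training a dA automatically updates the corresponding MLP layer.
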
-All we need now is to add the logistic layer on top of the sigmoid
+All we need now is to add a logistic layer on top of the sigmoid
 layers such that we have an MLP. We will
 use the ``LogisticRegression`` class introduced in :ref:`logreg`.

 .. literalinclude:: ../code/SdA.py
   :start-after: end-snippet-2
   :end-before: def pretraining_functions

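Continuing the same hypothetical sketch, adding the logistic layer and defining the supervised fine-tuning cost could look as follows (``LogisticRegression`` and its methods are assumed from the :ref:`logreg` tutorial):

.. code-block:: python

    import theano.tensor as T

    from logistic_sgd import LogisticRegression   # assumed, from the logreg tutorial

    y = T.ivector('y')   # labels for the supervised stage
    log_layer = LogisticRegression(input=sigmoid_layers[-1].output,
                                   n_in=hidden_layers_sizes[-1], n_out=10)
    params.extend(log_layer.params)

    # Cost and error used during fine-tuning.
    finetune_cost = log_layer.negative_log_likelihood(y)
    errors = log_layer.errors(y)
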
-The class also provides a method that generates training functions for
-each of the denoising autoencoder associated with the different layers.
+The ``SdA`` class also provides a method that generates training functions for
+the denoising autoencoders in its layers.
 They are returned as a list, where element :math:`i` is a function that
-implements one step of training the ``dA`` correspoinding to layer
+implements one step of training the ``dA`` corresponding to layer
 :math:`i`.

 .. literalinclude:: ../code/SdA.py
   :start-after: self.errors = self.logLayer.errors(self.y)
   :end-before: corruption_level = T.scalar('corruption')

-In order to be able to change the corruption level or the learning rate
-during training we associate a Theano variable to them.
+To be able to change the corruption level or the learning rate
+during training, we associate Theano variables with them.

 .. literalinclude:: ../code/SdA.py
   :start-after: index = T.lscalar('index')
   :end-before: def build_finetune_functions

 Now any function ``pretrain_fns[i]`` takes as arguments ``index`` and
-optionally ``corruption`` -- the corruption level or ``lr`` -- the
-learning rate. Note that the name of the parameters are the name given
-to the Theano variables when they are constructed, not the name of the
-python variables (``learning_rate`` or ``corruption_level``). Keep this
+optionally ``corruption``---the corruption level or ``lr``---the
+learning rate. Note that the names of the parameters are the names given
+to the Theano variables when they are constructed, not the names of the
+Python variables (``learning_rate`` or ``corruption_level``). Keep this
 in mind when working with Theano.

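As an illustration of the point about parameter names, a pre-training function might be compiled roughly like this; ``train_set_x`` is assumed to be the shared variable returned by ``load_data``, and newer Theano releases spell the default-value wrapper ``theano.In`` (older ones used ``theano.Param``):

.. code-block:: python

    import theano
    import theano.tensor as T

    index = T.lscalar('index')              # minibatch index
    corruption_level = T.scalar('corruption')
    learning_rate = T.scalar('lr')
    batch_size = 1

    pretrain_fns = []
    for da in dA_layers:                    # the dAs built in the sketch above
        cost, updates = da.get_cost_updates(corruption_level, learning_rate)
        pretrain_fns.append(theano.function(
            inputs=[index,
                    theano.In(corruption_level, value=0.2),
                    theano.In(learning_rate, value=0.1)],
            outputs=cost,
            updates=updates,
            givens={x: train_set_x[index * batch_size:(index + 1) * batch_size]}))

    # The keyword names come from the Theano variable names above:
    # pretrain_fns[0](index=0, corruption=0.3, lr=0.001)
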
-In the same fashion we build a method for constructing function required
-during finetuning ( a ``train_model``, a ``validate_model`` and a
-``test_model`` function).
+In the same fashion we build a method for constructing the functions required
+during finetuning (``train_fn``, ``valid_score`` and
+``test_score``).

 .. literalinclude:: ../code/SdA.py
   :pyobject: SdA.build_finetune_functions

-Note that the returned ``valid_score`` and ``test_score`` are not Theano
-functions, but rather python functions that also loop over the entire
-validation set and the entire test set producing a list of the losses
+Note that ``valid_score`` and ``test_score`` are not Theano
+functions, but rather Python functions that loop over the entire
+validation set and the entire test set, respectively, producing a list of the losses
 over these sets.

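To make the distinction concrete, ``valid_score`` could be a plain Python closure along these lines, wrapping a compiled per-minibatch Theano function; ``valid_set_x``, ``valid_set_y`` and ``n_valid_batches`` are assumed to come from ``load_data`` as in the earlier tutorials:

.. code-block:: python

    # Compiled Theano function: error on one validation minibatch.
    valid_score_i = theano.function(
        [index], errors,
        givens={x: valid_set_x[index * batch_size:(index + 1) * batch_size],
                y: valid_set_y[index * batch_size:(index + 1) * batch_size]})

    # Plain Python function: loop over all minibatches and collect the losses.
    def valid_score():
        return [valid_score_i(i) for i in range(n_valid_batches)]
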
 Putting it all together
 +++++++++++++++++++++++

-The few lines of code below constructs the stacked denoising
-autoencoder :
+The few lines of code below construct the stacked denoising
+autoencoder:

 .. literalinclude:: ../code/SdA.py
   :start-after: start-snippet-3
   :end-before: end-snippet-3

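If the snippet is not at hand, a hypothetical instantiation along the lines described in the text might be (the constructor keywords here are assumptions, not necessarily the file's exact signature):

.. code-block:: python

    import numpy
    from SdA import SdA   # the class defined in code/SdA.py

    numpy_rng = numpy.random.RandomState(89677)
    # Three hidden layers of 1000 units each, MNIST-sized input, 10 classes.
    sda = SdA(numpy_rng=numpy_rng, n_ins=28 * 28,
              hidden_layers_sizes=[1000, 1000, 1000], n_outs=10)
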
-There are two stages in training this network, a layer-wise pre-training and
-fine-tuning afterwards.
+There are two stages of training for this network: layer-wise pre-training
+followed by fine-tuning.

 For the pre-training stage, we will loop over all the layers of the
-network. For each layer we will use the compiled theano function that
+network. For each layer we will use the compiled Theano function that
 implements an SGD step towards optimizing the weights for reducing
 the reconstruction cost of that layer. This function will be applied
 to the training set for a fixed number of epochs given by
@@ -161,9 +163,9 @@ to the training set for a fixed number of epochs given by
   :start-after: start-snippet-4
   :end-before: end-snippet-4

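In outline, the pre-training stage might look like the following sketch; ``pretraining_epochs``, ``pretrain_lr`` and ``n_train_batches`` are assumed to be set up as in the earlier tutorials:

.. code-block:: python

    import numpy

    corruption_levels = [0.1, 0.2, 0.3]
    for i in range(len(pretrain_fns)):              # one dA per layer
        for epoch in range(pretraining_epochs):
            costs = []
            for batch_index in range(n_train_batches):
                costs.append(pretrain_fns[i](index=batch_index,
                                             corruption=corruption_levels[i],
                                             lr=pretrain_lr))
            print('Pre-training layer %i, epoch %d, cost %f'
                  % (i, epoch, numpy.mean(costs)))
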
-The fine-tuning loop is very similar with the one in the :ref:`mlp`, the
-only difference is that we will use now the functions given by
-``build_finetune_functions`` .
+The fine-tuning loop is very similar to the one in the :ref:`mlp`. The
+only difference is that it uses the functions given by
+``build_finetune_functions``.

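A simplified sketch of that loop, without the patience-based early stopping the real code shares with the :ref:`mlp` tutorial, might read as follows (the ``build_finetune_functions`` keyword names, ``datasets``, ``training_epochs`` and ``n_train_batches`` are assumptions):

.. code-block:: python

    import numpy

    train_fn, valid_score, test_score = sda.build_finetune_functions(
        datasets=datasets, batch_size=batch_size, learning_rate=0.1)

    best_validation_loss = numpy.inf
    for epoch in range(training_epochs):
        for minibatch_index in range(n_train_batches):
            train_fn(minibatch_index)
        this_validation_loss = numpy.mean(valid_score())
        if this_validation_loss < best_validation_loss:
            best_validation_loss = this_validation_loss
            test_loss = numpy.mean(test_score())
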
 Running the Code
 ++++++++++++++++
@@ -175,8 +177,8 @@ The user can run the code by calling:
   python code/SdA.py

 By default the code runs 15 pre-training epochs for each layer, with a batch
-size of 1. The corruption level for the first layer is 0.1, for the second
-0.2 and 0.3 for the third. The pretraining learning rate is was 0.001 and
+size of 1. The corruption levels are 0.1 for the first layer, 0.2 for the second,
+and 0.3 for the third. The pretraining learning rate is 0.001 and
 the finetuning learning rate is 0.1. Pre-training takes 585.01 minutes, with
 an average of 13 minutes per epoch. Fine-tuning is completed after 36 epochs
 in 444.2 minutes, with an average of 12.34 minutes per epoch. The final
@@ -188,13 +190,13 @@ Xeon E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
 Tips and Tricks
 +++++++++++++++

-One way to improve the running time of your code (given that you have
+One way to improve the running time of your code (assuming you have
 sufficient memory available), is to compute how the network, up to layer
 :math:`k-1`, transforms your data. Namely, you start by training your first
 layer dA. Once it is trained, you can compute the hidden units values for
 every datapoint in your dataset and store this as a new dataset that you will
-use to train the dA corresponding to layer 2. Once you trained the dA for
+use to train the dA corresponding to layer 2. Once you have trained the dA for
 layer 2, you compute, in a similar fashion, the dataset for layer 3 and so on.
 You can see now, that at this point, the dAs are trained individually, and
 they just provide (one to the other) a non-linear transformation of the input.
-Once all dAs are trained, you can start fine-tunning the model.
+Once all dAs are trained, you can start fine-tuning the model.
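
A minimal NumPy sketch of this trick, assuming ``W1``/``b1`` are the trained encoder weights and bias of the first dA:

.. code-block:: python

    import numpy

    def sigmoid(z):
        return 1.0 / (1.0 + numpy.exp(-z))

    def encode_dataset(X, W, b):
        # Push every datapoint through one trained encoder layer.
        return sigmoid(numpy.dot(X, W) + b)

    # X_layer1 = encode_dataset(train_X, W1, b1)   # training data for the layer-2 dA
    # X_layer2 = encode_dataset(X_layer1, W2, b2)  # training data for the layer-3 dA

Each dA then trains on a fixed, precomputed array instead of re-propagating every datapoint through all the layers below it at every step.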
