@@ -4,7 +4,7 @@ Stacked Denoising Autoencoders (SdA)
====================================

.. note::
- This section assumes the reader has already read through :doc:`logreg`
+ This section assumes you have already read through :doc:`logreg`
and :doc:`mlp`. Additionally, it uses the following Theano functions
and concepts: `T.tanh`_, `shared variables`_, `basic arithmetic ops`_, `T.grad`_, `Random numbers`_, `floatX`_. If you intend to run the code on GPU also read `GPU`_.

@@ -32,46 +32,48 @@ Stacked Denoising Autoencoders (SdA)
The Stacked Denoising Autoencoder (SdA) is an extension of the stacked
autoencoder [Bengio07]_ and it was introduced in [Vincent08]_.

- This tutorial builds on the previous tutorial :ref:`dA` and we recommend,
- especially if you do not have experience with autoencoders, to read it
+ This tutorial builds on the previous tutorial :ref:`dA`.
+ Especially if you do not have experience with autoencoders, we recommend reading it
before going any further.

.. _stacked_autoencoders:

Stacked Autoencoders
++++++++++++++++++++

- The denoising autoencoders can be stacked to form a deep network by
+ Denoising autoencoders can be stacked to form a deep network by
feeding the latent representation (output code)
- of the denoising auto-encoder found on the layer
+ of the denoising autoencoder found on the layer
below as input to the current layer. The **unsupervised pre-training** of such an
architecture is done one layer at a time. Each layer is trained as
- a denoising auto-encoder by minimizing the reconstruction of its input
+ a denoising autoencoder by minimizing the error in reconstructing its input
(which is the output code of the previous layer).
Once the first :math:`k` layers
are trained, we can train the :math:`k+1`-th layer because we can now
compute the code or latent representation from the layer below.
+
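
To make the layer-wise scheme concrete, here is a small NumPy sketch (not the Theano implementation used in this tutorial) of how the code produced by one layer becomes the training input of the next; ``train_dA`` is a hypothetical helper standing in for the training of a single denoising autoencoder.

.. code-block:: python

    import numpy

    def sigmoid(x):
        return 1.0 / (1.0 + numpy.exp(-x))

    def pretrain_stack(train_x, n_layers, train_dA):
        """Greedy layer-wise pre-training (schematic).

        ``train_dA`` is a hypothetical helper that trains one denoising
        autoencoder on its input and returns the encoding parameters (W, b).
        """
        params = []
        inputs = train_x
        for k in range(n_layers):
            # train the k-th denoising autoencoder on the code of layer k-1
            W, b = train_dA(inputs)
            params.append((W, b))
            # the latent representation of layer k becomes the input of layer k+1
            inputs = sigmoid(numpy.dot(inputs, W) + b)
        return params
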
Once all layers are pre-trained, the network goes through a second stage
of training called **fine-tuning**. Here we consider **supervised fine-tuning**
where we want to minimize prediction error on a supervised task.
- For this we first add a logistic regression
+ For this, we first add a logistic regression
layer on top of the network (more precisely on the output code of the
output layer). We then
train the entire network as we would train a multilayer
perceptron. At this point, we only consider the encoding parts of
each autoencoder.
This stage is supervised, since now we use the target class during
- training (see the :ref:`mlp` for details on the multilayer perceptron).
+ training. (See the :ref:`mlp` for details on the multilayer perceptron.)

This can be easily implemented in Theano, using the class defined
- before for a denoising autoencoder. We can see the stacked denoising
- autoencoder as having two facades, one is a list of
- autoencoders, the other is an MLP. During pre-training we use the first facade, i.e we treat our model
+ previously for a denoising autoencoder. We can see the stacked denoising
+ autoencoder as having two facades: a list of
+ autoencoders, and an MLP. During pre-training we use the first facade, i.e., we treat our model
as a list of autoencoders, and train each autoencoder separately. In the
- second stage of training, we use the second facade. These two
- facedes are linked by the fact that the autoencoders and the sigmoid layers of
- the MLP share parameters, and the fact that autoencoders get as input latent
- representations of intermediate layers of the MLP.
+ second stage of training, we use the second facade. These two facades are linked because:
+
+ * the autoencoders and the sigmoid layers of the MLP share parameters, and
+
+ * the latent representations computed by intermediate layers of the MLP are fed as input to the autoencoders.

.. literalinclude:: ../code/SdA.py
:start-after: start-snippet-1
@@ -80,78 +82,78 @@ representations of intermediate layers of the MLP.
``self.sigmoid_layers`` will store the sigmoid layers of the MLP facade, while
``self.dA_layers`` will store the denoising autoencoders associated with the layers of the MLP.

- Next step , we construct ``n_layers`` sigmoid layers (we use the
- ``HiddenLayer `` class introduced in :ref:`mlp`, with the only
- modification that we replaced the non-linearity from ``tanh`` to the
- logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`) and ``n_layers``
- denoising autoencoders, where ``n_layers`` is the depth of our model .
- We link the sigmoid layers such that they form an MLP, and construct
- each denoising autoencoder such that they share the weight matrix and the
- bias of the encoding part with its corresponding sigmoid layer.
+ Next, we construct ``n_layers`` sigmoid layers and ``n_layers`` denoising
+ autoencoders, where ``n_layers`` is the depth of our model. We use the
+ ``HiddenLayer`` class introduced in :ref:`mlp`, with one
+ modification: we replace the ``tanh`` non-linearity with the
+ logistic function :math:`s(x) = \frac{1}{1+e^{-x}}`.
+ We link the sigmoid layers to form an MLP, and construct
+ the denoising autoencoders such that each shares the weight matrix and the
+ bias of its encoding part with its corresponding sigmoid layer.

.. literalinclude:: ../code/SdA.py
:start-after: start-snippet-2
:end-before: end-snippet-2

- All we need now is to add the logistic layer on top of the sigmoid
+ All we need now is to add a logistic layer on top of the sigmoid
layers such that we have an MLP. We will
use the ``LogisticRegression`` class introduced in :ref:`logreg`.

.. literalinclude:: ../code/SdA.py
:start-after: end-snippet-2
:end-before: def pretraining_functions

- The class also provides a method that generates training functions for
- each of the denoising autoencoder associated with the different layers.
+ The ``SdA`` class also provides a method that generates training functions for
+ the denoising autoencoders in its layers.
They are returned as a list, where element :math:`i` is a function that
- implements one step of training the ``dA`` correspoinding to layer
+ implements one step of training the ``dA`` corresponding to layer
:math:`i`.

.. literalinclude:: ../code/SdA.py
:start-after: self.errors = self.logLayer.errors(self.y)
:end-before: corruption_level = T.scalar('corruption')

- In order to be able to change the corruption level or the learning rate
- during training we associate a Theano variable to them.
+ To be able to change the corruption level or the learning rate
+ during training, we associate Theano variables with them.

.. literalinclude:: ../code/SdA.py
:start-after: index = T.lscalar('index')
:end-before: def build_finetune_functions

Now any function ``pretrain_fns[i]`` takes as arguments ``index`` and
- optionally ``corruption`` -- the corruption level or ``lr`` -- the
- learning rate. Note that the name of the parameters are the name given
- to the Theano variables when they are constructed, not the name of the
- python variables (``learning_rate`` or ``corruption_level``). Keep this
+ optionally ``corruption`` (the corruption level) or ``lr`` (the
+ learning rate). Note that the names of the parameters are the names given
+ to the Theano variables when they are constructed, not the names of the
+ Python variables (``learning_rate`` or ``corruption_level``). Keep this
in mind when working with Theano.
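
As an illustration, the pre-training functions could be called along the following lines. This is only a sketch: it assumes that ``sda``, ``pretrain_fns`` and ``n_train_batches`` have already been defined, and the actual loop used by the tutorial is the one included later in this section.

.. code-block:: python

    import numpy

    # Illustrative settings; the real values are defined in the training script.
    corruption_levels = [.1, .2, .3]
    pretrain_lr = 0.001
    pretraining_epochs = 15

    for i in range(sda.n_layers):
        for epoch in range(pretraining_epochs):
            # one pass over the training set for the dA of layer i
            c = [pretrain_fns[i](index=batch_index,
                                 corruption=corruption_levels[i],
                                 lr=pretrain_lr)
                 for batch_index in range(n_train_batches)]
            print('Pre-training layer %i, epoch %d, cost %f'
                  % (i, epoch, numpy.mean(c)))
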

- In the same fashion we build a method for constructing function required
- during finetuning ( a ``train_model ``, a ``validate_model `` and a
- ``test_model`` function).
+ In the same fashion we build a method for constructing the functions required
+ during finetuning (``train_fn``, ``valid_score``, and
+ ``test_score``).

.. literalinclude:: ../code/SdA.py
:pyobject: SdA.build_finetune_functions

- Note that the returned ``valid_score`` and ``test_score`` are not Theano
- functions, but rather python functions that also loop over the entire
- validation set and the entire test set producing a list of the losses
+ Note that ``valid_score`` and ``test_score`` are not Theano
+ functions, but rather Python functions that loop over the entire
+ validation set and the entire test set, respectively, producing a list of the losses
over these sets.
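
For instance, such a wrapper could look like the following sketch, where ``valid_fn`` is assumed to be a compiled Theano function that returns the loss on one validation minibatch and ``n_valid_batches`` is the number of such minibatches.

.. code-block:: python

    import numpy

    # ``valid_fn`` stands for a compiled Theano function returning the loss
    # on one validation minibatch, selected by its index.
    def valid_score():
        return [valid_fn(i) for i in range(n_valid_batches)]

    # averaging the per-minibatch losses gives a single validation error
    mean_valid_loss = numpy.mean(valid_score())
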

Putting it all together
+++++++++++++++++++++++

- The few lines of code below constructs the stacked denoising
- autoencoder :
+ The few lines of code below construct the stacked denoising
+ autoencoder:

.. literalinclude:: ../code/SdA.py
:start-after: start-snippet-3
:end-before: end-snippet-3

- There are two stages in training this network, a layer-wise pre-training and
- fine-tuning afterwards .
+ There are two stages of training for this network: layer-wise pre-training
+ followed by fine-tuning.

For the pre-training stage, we will loop over all the layers of the
- network. For each layer we will use the compiled theano function that
+ network. For each layer we will use the compiled Theano function that
implements an SGD step towards optimizing the weights for reducing
the reconstruction cost of that layer. This function will be applied
to the training set for a fixed number of epochs given by
@@ -161,9 +163,9 @@ to the training set for a fixed number of epochs given by
:start-after: start-snippet-4
:end-before: end-snippet-4

- The fine-tuning loop is very similar with the one in the :ref:`mlp`, the
- only difference is that we will use now the functions given by
- ``build_finetune_functions`` .
+ The fine-tuning loop is very similar to the one in the :ref:`mlp`. The
+ only difference is that it uses the functions given by
+ ``build_finetune_functions``.
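
As a rough illustration, a stripped-down fine-tuning loop (without the early-stopping machinery of the :ref:`mlp` tutorial) could look as follows; ``datasets``, ``batch_size``, ``finetune_lr``, ``training_epochs`` and ``n_train_batches`` are assumed to be defined elsewhere, and the call signature shown here is only an assumption of how the method is used.

.. code-block:: python

    import numpy

    # build the fine-tuning functions (signature as assumed here; see SdA.py)
    train_fn, valid_score, test_score = sda.build_finetune_functions(
        datasets=datasets,
        batch_size=batch_size,
        learning_rate=finetune_lr
    )

    for epoch in range(training_epochs):
        for minibatch_index in range(n_train_batches):
            train_fn(minibatch_index)
        # average validation loss after each epoch (no early stopping here)
        this_validation_loss = numpy.mean(valid_score())
        print('epoch %i, validation error %f %%'
              % (epoch, this_validation_loss * 100.))
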

Running the Code
++++++++++++++++
@@ -175,8 +177,8 @@ The user can run the code by calling:
python code/SdA.py

By default the code runs 15 pre-training epochs for each layer, with a batch
- size of 1. The corruption level for the first layer is 0.1, for the second
- 0.2 and 0.3 for the third. The pretraining learning rate is was 0.001 and
+ size of 1. The corruption levels are 0.1 for the first layer, 0.2 for the second,
+ and 0.3 for the third. The pretraining learning rate is 0.001 and
the finetuning learning rate is 0.1. Pre-training takes 585.01 minutes, with
an average of 13 minutes per epoch. Fine-tuning is completed after 36 epochs
in 444.2 minutes, with an average of 12.34 minutes per epoch. The final
@@ -188,13 +190,13 @@ Xeon E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
Tips and Tricks
+++++++++++++++

- One way to improve the running time of your code (given that you have
+ One way to improve the running time of your code (assuming you have
sufficient memory available) is to compute how the network, up to layer
:math:`k-1`, transforms your data. Namely, you start by training your first
layer dA. Once it is trained, you can compute the hidden units values for
every datapoint in your dataset and store this as a new dataset that you will
- use to train the dA corresponding to layer 2. Once you trained the dA for
+ use to train the dA corresponding to layer 2. Once you have trained the dA for
layer 2, you compute, in a similar fashion, the dataset for layer 3 and so on.
You can see that, at this point, the dAs are trained individually, and
they just provide (one to the other) a non-linear transformation of the input.
- Once all dAs are trained, you can start fine-tunning the model.
+ Once all dAs are trained, you can start fine-tuning the model.
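
A minimal NumPy sketch of this trick (not part of the tutorial code) might look as follows, where ``train_dA`` is again a hypothetical helper that trains a single denoising autoencoder and returns its encoding parameters, and ``train_set_x`` is the original dataset as a NumPy array.

.. code-block:: python

    import numpy

    def encode(data, W, b):
        # deterministic encoder of a trained dA (no corruption at this point)
        return 1.0 / (1.0 + numpy.exp(-(numpy.dot(data, W) + b)))

    # ``train_dA`` is a hypothetical helper that trains one denoising
    # autoencoder on the given dataset and returns its encoding parameters.
    dataset = train_set_x              # the original inputs, as a numpy array
    layer_params = []
    for k in range(n_layers):
        W, b = train_dA(dataset)       # train the dA for layer k
        layer_params.append((W, b))
        # store the code of layer k as the dataset for layer k+1, so the
        # transformation is computed once rather than at every SGD step
        dataset = encode(dataset, W, b)
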