Commit 195e9fe

Edits to denoising autoencoder tutorial for clarity.
1 parent 1133e56 commit 195e9fe

File tree: 1 file changed (+113, -118 lines)

doc/dA.txt: 113 additions & 118 deletions
@@ -41,78 +41,81 @@ Autoencoders
 
 See section 4.6 of [Bengio09]_ for an overview of auto-encoders.
 An autoencoder takes an input :math:`\mathbf{x} \in [0,1]^d` and first
-maps it (with an *encoder*) to a hidden representation :math:`\mathbf{y} \in [0,1]^{d'}`
+maps it (with an *encoder)* to a hidden representation :math:`\mathbf{y} \in [0,1]^{d'}`
 through a deterministic mapping, e.g.:

 .. math::

   \mathbf{y} = s(\mathbf{W}\mathbf{x} + \mathbf{b})

-Where :math:`s` is a non-linearity such as the sigmoid.
-The latent representation :math:`\mathbf{y}`, or **code** is then mapped back (with a *decoder*) into a
-**reconstruction** :math:`\mathbf{z}` of same shape as
-:math:`\mathbf{x}` through a similar transformation, e.g.:
+Where :math:`s` is a non-linearity such as the sigmoid. The latent
+representation :math:`\mathbf{y}`, or **code** is then mapped back (with a
+*decoder)* into a **reconstruction** :math:`\mathbf{z}` of the same shape as
+:math:`\mathbf{x}`. The mapping happens through a similar transformation, e.g.:

 .. math::

   \mathbf{z} = s(\mathbf{W'}\mathbf{y} + \mathbf{b'})

-where ' does not indicate transpose, and
-:math:`\mathbf{z}` should be seen as a prediction of :math:`\mathbf{x}`, given the code :math:`\mathbf{y}`.
-The weight matrix :math:`\mathbf{W'}` of the reverse mapping may be
-optionally constrained by :math:`\mathbf{W'} = \mathbf{W}^T`, which is
-an instance of *tied weights*. The parameters of this model (namely
-:math:`\mathbf{W}`, :math:`\mathbf{b}`,
-:math:`\mathbf{b'}` and, if one doesn't use tied weights, also
-:math:`\mathbf{W'}`) are optimized such that the average reconstruction
-error is minimized. The reconstruction error can be measured in many ways, depending
-on the appropriate distributional assumptions on the input given the code, e.g., using the
-traditional *squared error* :math:`L(\mathbf{x}, \mathbf{z}) = || \mathbf{x} - \mathbf{z} ||^2`,
-or if the input is interpreted as either bit vectors or vectors of
-bit probabilities by the reconstruction *cross-entropy* defined as :
+(Here, the prime symbol does *not necessarily* indicate matrix transposition.)
+:math:`\mathbf{z}` should be seen as a prediction of :math:`\mathbf{x}`, given
+the code :math:`\mathbf{y}`. Optionally, the weight matrix :math:`\mathbf{W'}`
+of the reverse mapping may be constrained to be the transpose of the forward
+mapping: :math:`\mathbf{W'} = \mathbf{W}^T`. This is referred to as *tied
+weights*. The parameters of this model (namely :math:`\mathbf{W}`,
+:math:`\mathbf{b}`, :math:`\mathbf{b'}` and, if one doesn't use tied weights,
+also :math:`\mathbf{W'}`) are optimized such that the average reconstruction
+error is minimized.
+
+The reconstruction error can be measured in many ways, depending on the
+appropriate distributional assumptions on the input given the code. The
+traditional *squared error* :math:`L(\mathbf{x} \mathbf{z}) = || \mathbf{x} -
+\mathbf{z} ||^2`, can be used. If the input is interpreted as either bit
+vectors or vectors of bit probabilities, *cross-entropy* of the reconstruction
+can be used:

 .. math::

   L_{H} (\mathbf{x}, \mathbf{z}) = - \sum^d_{k=1}[\mathbf{x}_k \log
           \mathbf{z}_k + (1 - \mathbf{x}_k)\log(1 - \mathbf{z}_k)]

-The hope is that the code :math:`\mathbf{y}` is a distributed representation
-that captures the coordinates along the main factors of variation in the data
-(similarly to how the projection on principal components captures the main factors
-of variation in the data).
-Because :math:`\mathbf{y}` is viewed as a lossy compression of :math:`\mathbf{x}`, it cannot
-be a good compression (with small loss) for all :math:`\mathbf{x}`, so learning
-drives it to be one that is a good compression in particular for training
-examples, and hopefully for others as well, but not for arbitrary inputs.
-That is the sense in which an auto-encoder generalizes: it gives low reconstruction
-error to test examples from the same distribution as the training examples,
-but generally high reconstruction error to uniformly chosen configurations of the
-input vector.
-
-If there is one linear hidden layer (the code) and
-the mean squared error criterion is used to train the network, then the :math:`k`
-hidden units learn to project the input in the span of the first :math:`k`
-principal components of the data. If the hidden
-layer is non-linear, the auto-encoder behaves differently from PCA,
-with the ability to capture multi-modal aspects of the input
-distribution. The departure from PCA becomes even more important when
-we consider *stacking multiple encoders* (and their corresponding decoders)
-when building a deep auto-encoder [Hinton06]_.
-
-We want to implement an auto-encoder using Theano, in the form of a class,
-that could be afterwards used in constructing a stacked autoencoder. The
-first step is to create shared variables for the parameters of the
-autoencoder ( :math:`\mathbf{W}`, :math:`\mathbf{b}` and
-:math:`\mathbf{b'}`, since we are using tied weights in this tutorial ):
+The hope is that the code :math:`\mathbf{y}` is a *distributed* representation
+that captures the coordinates along the main factors of variation in the data.
+This is similar to the way the projection on principal components would capture
+the main factors of variation in the data. Indeed, if there is one linear
+hidden layer (the *code)* and the mean squared error criterion is used to train
+the network, then the :math:`k` hidden units learn to project the input in the
+span of the first :math:`k` principal components of the data. If the hidden
+layer is non-linear, the auto-encoder behaves differently from PCA, with the
+ability to capture multi-modal aspects of the input distribution. The departure
+from PCA becomes even more important when we consider *stacking multiple
+encoders* (and their corresponding decoders) when building a deep auto-encoder
+[Hinton06]_.
+
+Because :math:`\mathbf{y}` is viewed as a lossy compression of
+:math:`\mathbf{x}`, it cannot be a good (small-loss) compression for all
+:math:`\mathbf{x}`. Optimization makes it a good compression for training
+examples, and hopefully for other inputs as well, but not for arbitrary inputs.
+That is the sense in which an auto-encoder generalizes: it gives low
+reconstruction error on test examples from the same distribution as the
+training examples, but generally high reconstruction error on samples randomly
+chosen from the input space.
+
+We want to implement an auto-encoder using Theano, in the form of a class, that
+could be afterwards used in constructing a stacked autoencoder. The first step
+is to create shared variables for the parameters of the autoencoder
+:math:`\mathbf{W}`, :math:`\mathbf{b}` and :math:`\mathbf{b'}`. (Since we are
+using tied weights in this tutorial, :math:`\mathbf{W}^T` will be used for
+:math:`\mathbf{W'}`):

 .. literalinclude:: ../code/dA.py
   :start-after: start-snippet-1
   :end-before: end-snippet-1

-Note that we pass the symbolic ``input`` to the autoencoder as a
-parameter. This is such that later we can concatenate layers of
-autoencoders to form a deep network: the symbolic output (the :math:`\mathbf{y}` above) of
-the k-th layer will be the symbolic input of the (k+1)-th.
+Note that we pass the symbolic ``input`` to the autoencoder as a parameter.
+This is so that we can concatenate layers of autoencoders to form a deep
+network: the symbolic output (the :math:`\mathbf{y}` above) of layer :math:`k` will
+be the symbolic input of layer :math:`k+1`.

 Now we can express the computation of the latent representation and of the reconstructed
 signal:
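For reference while reading the hunk above, here is a minimal, self-contained
Theano sketch of the two mappings and the cross-entropy cost, using tied
weights. It is only an illustration, not the tutorial's ``dA`` class (that
lives in ``code/dA.py``); the layer sizes, variable names and weight
initialization below are assumptions made for this sketch.

.. code-block:: python

   import numpy
   import theano
   import theano.tensor as T

   # Illustrative sizes (assumed): 784 visible units (28x28 images), 500 hidden units.
   n_visible, n_hidden = 784, 500
   rng = numpy.random.RandomState(123)

   # Shared parameters: W, b (hidden bias) and b' (visible bias).
   # With tied weights the decoder reuses W.T instead of a separate W'.
   initial_W = numpy.asarray(
       rng.uniform(low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                   high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                   size=(n_visible, n_hidden)),
       dtype=theano.config.floatX)
   W = theano.shared(value=initial_W, name='W')
   b = theano.shared(value=numpy.zeros(n_hidden, dtype=theano.config.floatX),
                     name='b')
   b_prime = theano.shared(value=numpy.zeros(n_visible, dtype=theano.config.floatX),
                           name='b_prime')

   x = T.matrix('x')  # a minibatch, one example per row

   # Encoder: y = s(Wx + b); decoder: z = s(W'y + b') with W' = W^T (tied weights).
   y = T.nnet.sigmoid(T.dot(x, W) + b)
   z = T.nnet.sigmoid(T.dot(y, W.T) + b_prime)

   # Cross-entropy reconstruction cost, averaged over the minibatch.
   L = -T.sum(x * T.log(z) + (1 - x) * T.log(1 - z), axis=1)
   cost = T.mean(L)

From these expressions, ``T.grad`` and ``theano.function`` can be used to
compile a gradient-descent step on ``W``, ``b`` and ``b_prime``, which is
essentially what the snippets included from ``code/dA.py`` do.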
@@ -137,45 +140,41 @@ reconstruction cost is approximately minimized.
   :start-after: theano_rng = RandomStreams(rng.randint(2 ** 30))
   :end-before: start_time = time.clock()

-One serious potential issue with auto-encoders is that if there is no other
-constraint besides minimizing the reconstruction error,
-then an auto-encoder with :math:`n` inputs and an
-encoding of dimension at least :math:`n` could potentially just learn
-the identity function, for which many encodings would be useless (e.g.,
-just copying the input), i.e., the autoencoder would not differentiate
-test examples (from the training distribution) from other input configurations.
-Surprisingly, experiments reported in [Bengio07]_ nonetheless
-suggest that in practice, when trained with
-stochastic gradient descent, non-linear auto-encoders with more hidden units
-than inputs (called overcomplete) yield useful representations
-(in the sense of classification error measured on a network taking this
-representation in input). A simple explanation is based on the
-observation that stochastic gradient
-descent with early stopping is similar to an L2 regularization of the
-parameters. To achieve perfect reconstruction of continuous
-inputs, a one-hidden layer auto-encoder with non-linear hidden units
-(exactly like in the above code)
-needs very small weights in the first (encoding) layer (to bring the non-linearity of
-the hidden units in their linear regime) and very large weights in the
-second (decoding) layer.
-With binary inputs, very large weights are
-also needed to completely minimize the reconstruction error. Since the
-implicit or explicit regularization makes it difficult to reach
-large-weight solutions, the optimization algorithm finds encodings which
-only work well for examples similar to those in the training set, which is
-what we want. It means that the representation is exploiting statistical
-regularities present in the training set, rather than learning to
-replicate the identity function.
-
-There are different ways that an auto-encoder with more hidden units
-than inputs could be prevented from learning the identity, and still
-capture something useful about the input in its hidden representation.
-One is the addition of sparsity (forcing many of the hidden units to
-be zero or near-zero), and it has been exploited very successfully
-by many [Ranzato07]_ [Lee08]_. Another is to add randomness in the transformation from
-input to reconstruction. This is exploited in Restricted Boltzmann
-Machines (discussed later in :ref:`rbm`), as well as in
-Denoising Auto-Encoders, discussed below.
+If there is no constraint besides minimizing the reconstruction error, one
+might expect an auto-encoder with :math:`n` inputs and an encoding of dimension
+:math:`n` (or greater) to learn the identity function, merely mapping an input
+to its copy. Such an autoencoder would not differentiate test examples (from
+the training distribution) from other input configurations.
+
+Surprisingly,
+experiments reported in [Bengio07]_ suggest that, in practice, when trained
+with stochastic gradient descent, non-linear auto-encoders with more hidden
+units than inputs (called overcomplete) yield useful representations. (Here,
+"useful" means that a network taking the encoding as input has low
+classification error.)
+
+A simple explanation is that stochastic gradient descent with early stopping is
+similar to an L2 regularization of the parameters. To achieve perfect
+reconstruction of continuous inputs, a one-hidden layer auto-encoder with
+non-linear hidden units (exactly like in the above code) needs very small
+weights in the first (encoding) layer, to bring the non-linearity of the hidden
+units into their linear regime, and very large weights in the second (decoding)
+layer. With binary inputs, very large weights are also needed to completely
+minimize the reconstruction error. Since the implicit or explicit
+regularization makes it difficult to reach large-weight solutions, the
+optimization algorithm finds encodings which only work well for examples
+similar to those in the training set, which is what we want. It means that the
+*representation is exploiting statistical regularities present in the training
+set,* rather than merely learning to replicate the input.
+
+There are other ways by which an auto-encoder with more hidden units than inputs
+could be prevented from learning the identity function, capturing something
+useful about the input in its hidden representation. One is the addition of
+*sparsity* (forcing many of the hidden units to be zero or near-zero). Sparsity
+has been exploited very successfully by many [Ranzato07]_ [Lee08]_. Another is
+to add randomness in the transformation from input to reconstruction. This
+technique is used in Restricted Boltzmann Machines (discussed later in
+:ref:`rbm`), as well as in Denoising Auto-Encoders, discussed below.

 .. _DA:

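As a deliberately simple illustration of the *sparsity* idea mentioned in the
added paragraph above, one can penalize the code activations on top of the
reconstruction cost; the cited works [Ranzato07]_ [Lee08]_ use more elaborate
formulations. The symbols ``cost`` and ``y`` refer to the sketch given after
the first hunk, and ``sparsity_weight`` is an assumed hyper-parameter.

.. code-block:: python

   import theano.tensor as T

   # Hypothetical L1 sparsity penalty on the code: pushes many hidden
   # activations toward zero while the reconstruction term keeps the
   # code informative about the input.
   sparsity_weight = 0.05  # illustrative value
   sparse_cost = cost + sparsity_weight * T.mean(T.abs_(y))
   # Gradients are then taken w.r.t. the same parameters as before,
   # e.g. gparams = T.grad(sparse_cost, [W, b, b_prime]).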
@@ -188,27 +187,23 @@ from simply learning the identity, we train the
 autoencoder to *reconstruct the input from a corrupted version of it*.

 The denoising auto-encoder is a stochastic version of the auto-encoder.
-Intuitively, a denoising auto-encoder does two things: try to encode the
-input (preserve the information about the input), and try to undo the
-effect of a corruption process stochastically applied to the input of the
-auto-encoder. The latter can only be done by capturing the statistical
-dependencies between the inputs. The denoising
-auto-encoder can be understood from different perspectives
-( the manifold learning perspective,
-stochastic operator perspective,
-bottom-up -- information theoretic perspective,
-top-down -- generative model perspective ), all of which are explained in
-[Vincent08].
-See also section 7.2 of [Bengio09]_ for an overview of auto-encoders.
-
-In [Vincent08], the stochastic corruption process
-consists in randomly setting some of the inputs (as many as half of them)
-to zero. Hence the denoising auto-encoder is trying to *predict the corrupted (i.e. missing)
-values from the uncorrupted (i.e., non-missing) values*, for randomly selected subsets of
-missing patterns. Note how being able to predict any subset of variables
-from the rest is a sufficient condition for completely capturing the
-joint distribution between a set of variables (this is how Gibbs
-sampling works).
+Intuitively, a denoising auto-encoder does two things: try to encode the input
+(preserve the information about the input), and try to undo the effect of a
+corruption process stochastically applied to the input of the auto-encoder. The
+latter can only be done by capturing the statistical dependencies between the
+inputs. The denoising auto-encoder can be understood from different
+perspectives ( the manifold learning perspective, stochastic operator
+perspective, bottom-up -- information theoretic perspective, top-down --
+generative model perspective ), all of which are explained in [Vincent08]_. See
+also section 7.2 of [Bengio09]_ for an overview of auto-encoders.
+
+In [Vincent08]_, the stochastic corruption process randomly sets some of the
+inputs (as many as half of them) to zero. Hence the denoising auto-encoder is
+trying to *predict the corrupted (i.e. missing) values from the uncorrupted
+(i.e., non-missing) values*, for randomly selected subsets of missing patterns.
+Note how being able to predict any subset of variables from the rest is a
+sufficient condition for completely capturing the joint distribution between a
+set of variables (this is how Gibbs sampling works).

 To convert the autoencoder class into a denoising autoencoder class, all we
 need to do is to add a stochastic corruption step operating on the input. The input can be
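The corruption process described in [Vincent08]_ (randomly zeroing a fraction
of the inputs) can be sketched with Theano's ``RandomStreams``, the same
generator that appears in the context lines of the previous hunk. This is an
illustrative fragment rather than a copy of ``dA.py``; ``corruption_level`` is
an assumed hyper-parameter giving the fraction of inputs that are zeroed.

.. code-block:: python

   import numpy
   import theano
   import theano.tensor as T
   from theano.tensor.shared_randomstreams import RandomStreams

   rng = numpy.random.RandomState(123)
   theano_rng = RandomStreams(rng.randint(2 ** 30))

   x = T.matrix('x')
   corruption_level = 0.3  # illustrative: zero out roughly 30% of the inputs

   # Draw a binary mask that keeps each input with probability
   # (1 - corruption_level); multiplying by the mask zeroes the rest.
   mask = theano_rng.binomial(size=x.shape, n=1, p=1 - corruption_level,
                              dtype=theano.config.floatX)
   corrupted_x = mask * x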
@@ -236,11 +231,11 @@ does just that :



-In the stacked autoencoder class (:ref:`stacked_autoencoders`) the
-weights of the ``dA`` class have to be shared with those of an
-corresponding sigmoid layer. For this reason, the constructor of the ``dA`` also gets Theano
-variables pointing to the shared parameters. If those parameters are left
-to ``None``, new ones will be constructed.
+In the stacked autoencoder class (:ref:`stacked_autoencoders`) the weights of
+the ``dA`` class have to be shared with those of a corresponding sigmoid layer.
+For this reason, the constructor of the ``dA`` also gets Theano variables
+pointing to the shared parameters. If those parameters are left to ``None``,
+new ones will be constructed.

 The final denoising autoencoder class becomes :

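The constructor behaviour described above (accept shared parameters from a
caller, or build fresh ones when they are left to ``None``) can be sketched as
follows. The class name, the argument names ``W``, ``bhid`` and ``bvis``, and
the initialization range are assumptions for this sketch; the authoritative
signature is the one in ``code/dA.py``.

.. code-block:: python

   import numpy
   import theano

   class dA_sketch(object):
       """Skeleton only: shows how optional shared parameters can be passed
       in (e.g. from a stacked model's sigmoid layer) or created on the fly
       when they are left to None."""

       def __init__(self, numpy_rng, n_visible=784, n_hidden=500,
                    W=None, bhid=None, bvis=None):
           if W is None:
               initial_W = numpy.asarray(
                   numpy_rng.uniform(low=-0.1, high=0.1,
                                     size=(n_visible, n_hidden)),
                   dtype=theano.config.floatX)
               W = theano.shared(value=initial_W, name='W')
           if bhid is None:
               bhid = theano.shared(
                   value=numpy.zeros(n_hidden, dtype=theano.config.floatX),
                   name='b')
           if bvis is None:
               bvis = theano.shared(
                   value=numpy.zeros(n_visible, dtype=theano.config.floatX),
                   name='b_prime')

           # If W/bhid/bvis were supplied, they are shared with the caller:
           # updating them here also updates the corresponding sigmoid layer.
           self.W, self.b, self.b_prime = W, bhid, bvis
           self.params = [self.W, self.b, self.b_prime]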
@@ -465,14 +460,14 @@ it.
     print ('Training took %f minutes' % (pretraining_time / 60.))

 In order to get a feeling of what the network learned we are going to
-plot the filters (defined by the weight matrix). Bare in mind however,
+plot the filters (defined by the weight matrix). Bear in mind, however,
 that this does not provide the entire story,
 since we neglect the biases and plot the weights up to a multiplicative
 constant (weights are converted to values between 0 and 1).

 To plot our filters we will need the help of ``tile_raster_images`` (see
-:ref:`how-to-plot`) so we urge the reader to familiarize himself with
-it. Also using the help of PIL library, the following lines of code will
+:ref:`how-to-plot`) so we urge the reader to familiarize himself with it. Also
+using the help of the Python Image Library, the following lines of code will
 save the filters as an image :

 .. code-block:: python
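The body of that code block is not part of this hunk. As a rough sketch of
what saving the filters might look like, assuming the tutorial's
``tile_raster_images`` helper from :ref:`how-to-plot`, the Python Image
Library, a trained autoencoder instance named ``da`` and 28x28 inputs:

.. code-block:: python

   import PIL.Image
   from utils import tile_raster_images  # helper described in :ref:`how-to-plot`

   # Tile the learned filters (columns of W, one 28x28 filter per hidden
   # unit) into a single image and save it; shapes and file name are
   # illustrative.
   image = PIL.Image.fromarray(tile_raster_images(
       X=da.W.get_value(borrow=True).T,
       img_shape=(28, 28), tile_shape=(10, 10),
       tile_spacing=(1, 1)))
   image.save('filters.png')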
