
Commit 8840be8

Merge pull request lisa-lab#64 from carriepl/update_lstm_tutorial
Update LSTM tutorial with proper description.
2 parents: 9c3c66c + 7f067b8

File tree

5 files changed: 198 additions & 21 deletions


doc/contents.txt

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Contents
DBN
hmc
rnnslu
+lstm
rnnrbm
utilities
references

doc/images/lstm.png

13.5 KB

doc/images/lstm_memorycell.png

14 KB

doc/index.txt

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ Recurrent neural networks with word embeddings and context window:
* :ref:`Semantic Parsing of Speech using Recurrent Net <rnnslu>`

LSTM network for sentiment analysis:
-* :ref:`LSTM network <lstm>` - Only the code for now
+* :ref:`LSTM network <lstm>`

Energy-based recurrent neural network (RNN-RBM):
* :ref:`Modeling and generating sequences of polyphonic music <rnnrbm>`

doc/lstm.txt

Lines changed: 196 additions & 20 deletions
@@ -1,14 +1,177 @@
.. _lstm:

-LSTM Network for Sentiment Analysis
+LSTM Networks for Sentiment Analysis
**********************************************

Summary
+++++++

-This tutorial aims to provide an example of how a Recurrent Neural Network (RNN) using the Long Short Term Memory (LSTM) architecture can be implemented using Theano. In this tutorial, this model is used to perform sentiment analysis on movie reviews from the `Large Movie Review Dataset <http://ai.stanford.edu/~amaas/data/sentiment/>`_, sometimes known as the IMDB dataset.
+This tutorial aims to provide an example of how a Recurrent Neural Network
+(RNN) using the Long Short Term Memory (LSTM) architecture can be implemented
+using Theano. In this tutorial, this model is used to perform sentiment
+analysis on movie reviews from the `Large Movie Review Dataset
+<http://ai.stanford.edu/~amaas/data/sentiment/>`_, sometimes known as the
+IMDB dataset.
+
+In this task, given a movie review, the model attempts to predict whether it
+is positive or negative. This is a binary classification task.
+
+Data
+++++
+
+As previously mentioned, the provided scripts are used to train an LSTM
+recurrent neural network on the Large Movie Review Dataset.
+
+While the dataset is public, in this tutorial we provide a copy of the dataset
+that has previously been preprocessed according to the needs of this LSTM
+implementation. Running the code provided in this tutorial will automatically
+download the data to the local directory.
+
+Model
++++++
+
+LSTM
+====
+
+In a *traditional* recurrent neural network, during the gradient
+back-propagation phase, the gradient signal can end up being multiplied a
+large number of times (as many as the number of timesteps) by the weight
+matrix associated with the connections between the neurons of the recurrent
+hidden layer. This means that the magnitude of weights in the transition
+matrix can have a strong impact on the learning process.
+
+If the weights in this matrix are small (or, more formally, if the leading
+eigenvalue of the weight matrix is smaller than 1.0), it can lead to a
+situation called *vanishing gradients*, where the gradient signal gets so small
+that learning either becomes very slow or stops working altogether. It can
+also make it more difficult to learn long-term dependencies in the
+data. Conversely, if the weights in this matrix are large (or, again, more
+formally, if the leading eigenvalue of the weight matrix is larger than 1.0),
+it can lead to a situation where the gradient signal is so large that it can
+cause learning to diverge. This is often referred to as *exploding gradients*.
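To see the effect of this repeated multiplication concretely, here is a short NumPy sketch (an illustration added for this write-up, not part of the tutorial code). It rescales a random transition matrix to a chosen leading eigenvalue and pushes a stand-in gradient vector through 50 timesteps; the derivative of the nonlinearity is ignored to keep the example minimal, and the helper name is made up for the illustration.

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(0)
    n_steps, n_hidden = 50, 20

    def rescale_to_spectral_radius(W, rho):
        """Rescale W so that its leading eigenvalue has magnitude rho."""
        return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

    W = rng.uniform(-1, 1, (n_hidden, n_hidden))
    grad = rng.uniform(-1, 1, n_hidden)   # stand-in for a back-propagated gradient signal

    for rho in (0.9, 1.1):
        g = grad.copy()
        W_rho = rescale_to_spectral_radius(W, rho)
        for _ in range(n_steps):          # one multiplication per timestep
            g = W_rho.T.dot(g)
        print("leading eigenvalue %.1f -> gradient norm after %d steps: %.3e"
              % (rho, n_steps, np.linalg.norm(g)))

With a leading eigenvalue below 1.0 the norm of the propagated vector shrinks toward zero; above 1.0 it grows rapidly, which is the vanishing/exploding behaviour described above.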
+
+These issues are the main motivation behind the LSTM model, which introduces a
+new structure called a *memory cell* (see Figure 1 below). A memory cell is
+composed of four main elements: an input gate, a neuron with a self-recurrent
+connection (a connection to itself), a forget gate and an output gate. The
+self-recurrent connection has a weight of 1.0 and ensures that, barring any
+outside interference, the state of a memory cell can remain constant from one
+timestep to another. The gates serve to modulate the interactions between the
+memory cell itself and its environment. The input gate can allow incoming
+signal to alter the state of the memory cell or block it. On the other hand,
+the output gate can allow the state of the memory cell to have an effect on
+other neurons or prevent it. Finally, the forget gate can modulate the memory
+cell's self-recurrent connection, allowing the cell to remember or forget its
+previous state, as needed.
+
+.. figure:: images/lstm_memorycell.png
+    :align: center
+
+    **Figure 1**: Illustration of an LSTM memory cell.
+
+The equations below describe how a layer of memory cells is updated at every
+timestep :math:`t`. In these equations:
+
+* :math:`x_t` is the input to the memory cell layer at time :math:`t`
+* :math:`W_i`, :math:`W_f`, :math:`W_c`, :math:`W_o`, :math:`U_i`,
+  :math:`U_f`, :math:`U_c`, :math:`U_o` and :math:`V_o` are weight
+  matrices
+* :math:`b_i`, :math:`b_f`, :math:`b_c` and :math:`b_o` are bias vectors
+
+First, we compute the values for :math:`i_t`, the input gate, and
+:math:`\widetilde{C_t}`, the candidate value for the states of the memory
+cells at time :math:`t`:
+
+.. math::
+    :label: 1
+
+    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
+
+.. math::
+    :label: 2
+
+    \widetilde{C_t} = \tanh(W_c x_t + U_c h_{t-1} + b_c)
+
+Second, we compute the value for :math:`f_t`, the activation of the memory
+cells' forget gates at time :math:`t`:
+
+.. math::
+    :label: 3
+
+    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
+
+Given the value of the input gate activation :math:`i_t`, the forget gate
+activation :math:`f_t` and the candidate state value :math:`\widetilde{C_t}`,
+we can compute :math:`C_t`, the memory cells' new state at time :math:`t`:
+
+.. math::
+    :label: 4
+
+    C_t = i_t * \widetilde{C_t} + f_t * C_{t-1}
+
+With the new state of the memory cells, we can compute the value of their
+output gates and, subsequently, their outputs:
+
+.. math::
+    :label: 5
+
+    o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)
+
+.. math::
+    :label: 6
+
+    h_t = o_t * \tanh(C_t)
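As a reading aid, here is a minimal NumPy sketch of one update of a layer of memory cells following equations :eq:`1` to :eq:`6`. It is not the tutorial's Theano code; the function name ``lstm_step`` and the parameter dictionary ``p`` are illustrative only.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, C_prev, p):
        """One timestep of a layer of LSTM memory cells (equations 1-6).

        `p` is a dict holding the weight matrices W_*, U_*, V_o and the
        bias vectors b_*, named as in the text above.
        """
        i_t = sigmoid(p['W_i'].dot(x_t) + p['U_i'].dot(h_prev) + p['b_i'])      # eq. (1)
        C_tilde = np.tanh(p['W_c'].dot(x_t) + p['U_c'].dot(h_prev) + p['b_c'])  # eq. (2)
        f_t = sigmoid(p['W_f'].dot(x_t) + p['U_f'].dot(h_prev) + p['b_f'])      # eq. (3)
        C_t = i_t * C_tilde + f_t * C_prev                                      # eq. (4)
        o_t = sigmoid(p['W_o'].dot(x_t) + p['U_o'].dot(h_prev)
                      + p['V_o'].dot(C_t) + p['b_o'])                           # eq. (5)
        h_t = o_t * np.tanh(C_t)                                                # eq. (6)
        return h_t, C_t

Applied repeatedly over a sequence, starting from zero vectors for ``h`` and ``C``, this produces the representation sequence used by the model below.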
+
+Our model
+=========
+
+The model we use in this tutorial is a variation of the standard LSTM model.
+In this variant, the activation of a cell's output gate does not depend on the
+memory cell's state :math:`C_t`. This allows us to perform part of the
+computation more efficiently (see the implementation note, below, for
+details). This means that, in the variant we have implemented, there is no
+matrix :math:`V_o` and equation :eq:`5` is replaced by equation :eq:`5-alt`:
+
+.. math::
+    :label: 5-alt
+
+    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
+
+Our model is composed of a single LSTM layer followed by an average pooling
+and a logistic regression layer, as illustrated in Figure 2 below. Thus, from
+an input sequence :math:`x_0, x_1, x_2, ..., x_n`, the memory cells in the
+LSTM layer will produce a representation sequence :math:`h_0, h_1, h_2, ...,
+h_n`. This representation sequence is then averaged over all timesteps,
+resulting in a representation :math:`h`. Finally, this representation is fed to
+a logistic regression layer whose target is the class label associated with the
+input sequence.
+
+.. figure:: images/lstm.png
+    :align: center
+
+    **Figure 2**: Illustration of the model used in this tutorial. It is
+    composed of a single LSTM layer followed by mean pooling over time and
+    logistic regression.
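The NumPy sketch below is a rough illustration of this pipeline rather than the tutorial's Theano implementation. It reuses the hypothetical ``lstm_step`` from the earlier sketch; ``U_logreg`` and ``b_logreg`` are assumed names for the logistic regression parameters.

.. code-block:: python

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def classify_sequence(x_seq, p, U_logreg, b_logreg):
        """x_seq has shape (n_steps, n_in); returns class probabilities."""
        n_hidden = p['b_i'].shape[0]
        h, C = np.zeros(n_hidden), np.zeros(n_hidden)
        h_all = []
        for x_t in x_seq:                         # LSTM layer: h_0, ..., h_n
            h, C = lstm_step(x_t, h, C, p)
            h_all.append(h)
        h_mean = np.mean(h_all, axis=0)           # mean pooling over all timesteps
        return softmax(U_logreg.dot(h_mean) + b_logreg)   # logistic regression layer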
+
+**Implementation note**: In the code included with this tutorial, the equations
+:eq:`1`, :eq:`2`, :eq:`3` and :eq:`5-alt` are computed in parallel to make
+the computation more efficient. This is possible because none of these
+equations relies on a result produced by the others. It is achieved by
+concatenating the four matrices :math:`W_*` into a single weight matrix
+:math:`W` and performing the same concatenation on the weight matrices
+:math:`U_*` to produce the matrix :math:`U` and on the bias vectors :math:`b_*`
+to produce the vector :math:`b`. Then, the pre-nonlinearity activations can
+be computed with:
+
+.. math::
+
+    z = W x_t + U h_{t-1} + b
+
+The result is then sliced to obtain the pre-nonlinearity activations for
+:math:`i_t`, :math:`f_t`, :math:`\widetilde{C_t}` and :math:`o_t`, and the
+non-linearities are then applied independently to each.
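The following NumPy sketch illustrates this concatenation-and-slice idea for the variant using equation :eq:`5-alt`. The block ordering (i, f, c, o) and the helper names are assumptions made for this illustration; the tutorial's ``lstm.py`` applies the same idea with Theano tensors.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fused_lstm_step(x_t, h_prev, C_prev, W, U, b, n_hidden):
        """Variant LSTM step: all four gate pre-activations from one fused product.

        W is (4*n_hidden, n_in), U is (4*n_hidden, n_hidden), b is (4*n_hidden,);
        they stack the i, f, c and o blocks in that (assumed) order.
        """
        z = W.dot(x_t) + U.dot(h_prev) + b        # all pre-nonlinearity activations at once

        def _slice(v, k):                         # k-th block of size n_hidden
            return v[k * n_hidden:(k + 1) * n_hidden]

        i_t = sigmoid(_slice(z, 0))               # eq. (1)
        f_t = sigmoid(_slice(z, 1))               # eq. (3)
        C_tilde = np.tanh(_slice(z, 2))           # eq. (2)
        o_t = sigmoid(_slice(z, 3))               # eq. (5-alt): no V_o C_t term
        C_t = i_t * C_tilde + f_t * C_prev        # eq. (4)
        h_t = o_t * np.tanh(C_t)                  # eq. (6)
        return h_t, C_t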

-In this task, given a movie review, the model attempts to predict whether it is positive or negative. This is a binary classification task.

Code - Citations - Contact
++++++++++++++++++++++++++
@@ -22,23 +185,33 @@ The LSTM implementation can be found in the two following files :

* `imdb.py <http://deeplearning.net/tutorial/code/imdb.py>`_ : Secondary script. Handles the loading and preprocessing of the IMDB dataset.

-Data
-====
+After downloading both scripts and putting both in the same folder, the user
+can run the code by calling:

-As previously mentionned, the provided scripts are used to train a LSTM
-recurrent neural on the Large Movie Review Dataset dataset.
+.. code-block:: bash

-While the dataset is public, in this tutorial we provide a copy of the dataset
-that has previously been preprocessed according to the needs of this LSTM
-implementation. You can download this preprocessed version of the dataset
-using the script `download.sh <https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/data/download.sh>`_ and uncompress it.
+    THEANO_FLAGS="floatX=float32" python train_lstm.py
+
+The script will automatically download the data and decompress it.

Papers
======

-If you use this tutorial, please cite the following papers:
+If you use this tutorial, please cite the following papers.
+
+Introduction of the LSTM model:
+
+* `[pdf] <http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf>`_ Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
+
+Addition of the forget gate to the LSTM model:

-* `[pdf] <http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf>`_ HOCHREITER, Sepp et SCHMIDHUBER, Jürgen. Long short-term memory. Neural computation, 1997, vol. 9, no 8, p. 1735-1780. 1997.
+* `[pdf] <http://www.mitpressjournals.org/doi/pdf/10.1162/089976600300015015>`_ Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451-2471.
+
+More recent LSTM paper:
+
+* `[pdf] <http://www.cs.toronto.edu/~graves/preprint.pdf>`_ Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
+
+Papers related to Theano:

* `[pdf] <http://www.iro.umontreal.ca/~lisa/pointeurs/nips2012_deep_workshop_theano_final.pdf>`_ Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.

@@ -52,14 +225,17 @@ Contact
Please email `Kyunghyun Cho <http://www.kyunghyuncho.me/>`_ for any
problem report or feedback. We will be glad to hear from you.

-Running the Code
-++++++++++++++++
+References
+++++++++++

-After downloading both the scripts, downloading and uncompressing the data and
-putting all those files in the same folder, the user can run the code by
-calling:
+* Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

-.. code-block:: bash
+* Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451-2471.

-    THEANO_FLAGS="floatX=float32" python train_lstm.py
+* Graves, A. (2012). Supervised sequence labelling with recurrent neural networks (Vol. 385). Springer.
+
+* Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
+
+* Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.

+* Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (pp. 142-150). Association for Computational Linguistics.
