In the last few years, Recurrent Neural Networks (RNNs) have proved effective on several NLP tasks. Despite this success, their ability to model sequence labeling is still limited. This has led research toward solutions where RNNs are combined with models that have already proved effective in this domain, such as CRFs. In this work we propose a far simpler but very effective solution: an evolution of the simple Jordan RNN, where labels are reinjected as input into the network and converted into embeddings, in the same way as words. We compare this RNN variant to the other RNN models (Elman and Jordan RNNs, LSTM and GRU) on two well-known Spoken Language Understanding (SLU) tasks. Thanks to label embeddings and their combination at the hidden layer, the proposed variant, which uses more parameters than Elman and Jordan RNNs but far fewer than LSTM and GRU, is not only more effective than the other RNNs but also outperforms sophisticated CRF models.
- 1.
\(h_*\) denotes the hidden layer of any model, since the output layer is computed in the same way for all networks described in this paper.
- 2.
In the literature, \(\varPhi \) and \(\varGamma \) are the sigmoid and tanh functions, respectively.
- 3.
The one-hot representation of a token with index i in a dictionary is a vector v of the same size as the dictionary that is zero everywhere except at position i, where it is 1.
- 4.
In our case, \(y_i\) is explicitly converted from a probability distribution to a one-hot representation.
- 5.
Indeed, we observed better performance with a word window than with a single word.
- 6.
Available at http://deeplearning.net/tutorial/rnnslu.html.
- 7.
For example, the component localization can be combined with other components such as city, relative-distance, generic-relative-location, street, etc.
- 8.
https://www.gnu.org/software/octave/. Our code is described at http://www.marcodinarelli.it/software.php and is available upon request.
- 9.
http://www.openblas.net. This library allows a speed-up of roughly \(330\times \) on a single matrix-matrix multiplication using 16 cores. This is very attractive compared with the speed-up of \(380\times \) that can be reached with a GPU, taking into account that both Octave and OpenBLAS are available for free.
- 10.
This publication is in French, but the results in its tables are easy to understand and directly comparable to our results.
- 11.
We did not run further experiments because without a GPU, experiments on the Penn Treebank are still quite expensive.
- 12.
The errors made by the system are classified as Insertions (I), Deletions (D) and Substitutions (S). The sum of these errors is divided by the number of concepts in the reference annotation (R): \(CER = \frac{I + D + S}{R}\).
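As an illustration of the CER definition in the last note, the following is a minimal Python sketch, not the scoring tool used in this paper. It assumes that I, D and S are counted on a minimum edit-distance alignment between the reference and the hypothesized concept sequences; the function names `cer_counts` and `concept_error_rate`, as well as the concept labels in the usage example, are hypothetical. When several optimal alignments exist, the split between I, D and S may differ while their sum stays the same.

```python
def cer_counts(reference, hypothesis):
    """Return (insertions, deletions, substitutions) for one minimum-cost
    alignment of two concept sequences (assumption: unit cost per error)."""
    n, m = len(reference), len(hypothesis)
    # cell[i][j] = (total, I, D, S) for reference[:i] versus hypothesis[:j]
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)  # hypothesis is empty: only deletions
    for j in range(1, m + 1):
        cell[0][j] = (j, j, 0, 0)  # reference is empty: only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal move: match (no cost) or substitution.
            t, ins, dele, sub = cell[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                best = (t, ins, dele, sub)
            else:
                best = (t + 1, ins, dele, sub + 1)
            # Reference concept missing from the hypothesis: deletion.
            t, ins, dele, sub = cell[i - 1][j]
            if t + 1 < best[0]:
                best = (t + 1, ins, dele + 1, sub)
            # Extra concept in the hypothesis: insertion.
            t, ins, dele, sub = cell[i][j - 1]
            if t + 1 < best[0]:
                best = (t + 1, ins + 1, dele, sub)
            cell[i][j] = best
    _, ins, dele, sub = cell[n][m]
    return ins, dele, sub


def concept_error_rate(reference, hypothesis):
    """CER = (I + D + S) / R, with R the number of reference concepts."""
    if not reference:
        raise ValueError("the reference must contain at least one concept")
    ins, dele, sub = cer_counts(reference, hypothesis)
    return (ins + dele + sub) / len(reference)


if __name__ == "__main__":
    ref = ["city", "date", "price"]          # hypothetical reference concepts
    hyp = ["city", "time", "price", "room"]  # hypothetical system output
    print(cer_counts(ref, hyp))
    print(concept_error_rate(ref, hyp))
```

In the usage example, the hypothesis contains one substitution and one insertion against three reference concepts, so the CER is 2/3. Note that CER can exceed 1 when the hypothesis contains many insertions.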
