Long Short-Term Memory Recurrent Neural Network Architectures For Large Scale Acoustic Modeling
Haşim Sak, Andrew Senior, Françoise Beaufays
Google, USA
{hasim,andrewsenior,fsb}@google.com
Figure 1: LSTMP RNN architecture. A single memory block is shown for clarity. (Diagram panels: (a) LSTM, (b) DLSTM, (c) LSTMP, (d) DLSTMP.)
i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i)    (1)
f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f)    (2)
c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c)    (3)
o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o)    (4)
m_t = o_t \odot h(c_t)    (5)
y_t = \phi(W_{ym} m_t + b_y)    (6)

where the W terms denote weight matrices (e.g. W_{ix} is the matrix of weights from the input gate to the input), W_{ic}, W_{fc}, W_{oc} are diagonal weight matrices for peephole connections, the b terms denote bias vectors (b_i is the input gate bias vector), \sigma is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. \odot is the element-wise product of the vectors, g and h are the cell input and cell output activation functions, generally and in this paper tanh, and \phi is the network output activation function, softmax in this paper.
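To make Eqs. (1)-(6) concrete, the following minimal C++/Eigen sketch computes one time step. The paper only states that Eigen was used for matrix operations (Section 3); the struct, function and variable names here are illustrative assumptions, not the authors' implementation. The recurrent input is m_{t-1} for the standard LSTM; Section 2.3 replaces it with the projected r_{t-1}.

// A minimal sketch of Eqs. (1)-(6) for one time step; illustrative only.
#include <cmath>
#include <Eigen/Dense>

using Eigen::MatrixXf;
using Eigen::VectorXf;

static VectorXf Sigmoid(const VectorXf& v) {  // logistic sigmoid, element-wise
  return v.unaryExpr([](float z) { return 1.0f / (1.0f + std::exp(-z)); });
}

struct LstmParams {
  MatrixXf W_ix, W_im, W_fx, W_fm, W_cx, W_cm, W_ox, W_om, W_ym;  // dense weights
  VectorXf W_ic, W_fc, W_oc;                 // diagonal peephole weights
  VectorXf b_i, b_f, b_c, b_o, b_y;          // bias vectors
};

// Eqs. (1)-(5): update cell state c and cell output m for input x_t.
// `recurrent` is m_{t-1} for the standard LSTM, r_{t-1} for LSTMP.
void LstmStep(const LstmParams& p, const VectorXf& x,
              const VectorXf& recurrent, VectorXf& c, VectorXf& m) {
  const VectorXf& r = recurrent;
  VectorXf i = Sigmoid(p.W_ix * x + p.W_im * r + p.W_ic.cwiseProduct(c) + p.b_i);  // (1)
  VectorXf f = Sigmoid(p.W_fx * x + p.W_fm * r + p.W_fc.cwiseProduct(c) + p.b_f);  // (2)
  VectorXf g = (p.W_cx * x + p.W_cm * r + p.b_c).array().tanh().matrix();  // g = tanh
  c = f.cwiseProduct(c) + i.cwiseProduct(g);                               // (3)
  VectorXf o = Sigmoid(p.W_ox * x + p.W_om * r + p.W_oc.cwiseProduct(c) + p.b_o);  // (4)
  m = o.cwiseProduct(c.array().tanh().matrix());                           // (5), h = tanh
}

// Eq. (6): softmax over the output layer activations.
VectorXf NetworkOutput(const LstmParams& p, const VectorXf& m) {
  VectorXf z = p.W_ym * m + p.b_y;
  Eigen::ArrayXf e = (z.array() - z.maxCoeff()).exp();  // numerically stable softmax
  return (e / e.sum()).matrix();
}

Stacking such layers, as in the deep LSTM of Section 2.2, amounts to feeding each layer's m_t as the x_t of the layer above.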
2.2. Deep LSTM

As with DNNs with deeper architectures, deep LSTM RNNs have been successfully used for speech recognition [11, 17, 2]. Deep LSTM RNNs are built by stacking multiple LSTM layers. Note that LSTM RNNs are already deep architectures in the sense that they can be considered as a feed-forward neural network unrolled in time where each layer shares the same model parameters. One can see that the inputs to the model go through multiple non-linear layers as in DNNs; however, the features from a given time instant are only processed by a single nonlinear layer before contributing to the output for that time instant.

2.3. LSTMP - LSTM with Recurrent Projection Layer

The standard LSTM RNN architecture has an input layer, a recurrent LSTM layer and an output layer. The input layer is connected to the LSTM layer. The recurrent connections in the LSTM layer are directly from the cell output units to the cell input units, input gates, output gates and forget gates. The cell output units are also connected to the output layer of the network. The total number of parameters N in a standard LSTM network with one cell in each memory block, ignoring the biases, can be calculated as N = n_c \times n_c \times 4 + n_i \times n_c \times 4 + n_c \times n_o + n_c \times 3, where n_c is the number of memory cells (and the number of memory blocks in this case), n_i is the number of input units, and n_o is the number of output units. The computational complexity of learning LSTM models per weight and time step with the stochastic gradient descent (SGD) optimization technique is O(1). Therefore, the learning computational complexity per time step is O(N). The learning time for a network with a moderate number of inputs is dominated by the n_c \times (4 \times n_c + n_o) factor. For tasks requiring a large number of output units and a large number of memory cells to store temporal contextual information, learning LSTM models becomes computationally expensive.

As an alternative to the standard architecture, we proposed the Long Short-Term Memory Projected (LSTMP) architecture to address the computational complexity of learning LSTM models [3]. This architecture, shown in Figure 1, has a separate linear projection layer after the LSTM layer. The recurrent connections now connect from this recurrent projection layer to the input of the LSTM layer, and the network output units are connected to this recurrent layer. The number of parameters in this model is n_c \times n_r \times 4 + n_i \times n_c \times 4 + n_r \times n_o + n_c \times n_r + n_c \times 3, where n_r is the number of units in the recurrent projection layer. In this case, the model size and the learning computational complexity are dominated by the n_r \times (4 \times n_c + n_o) factor. Hence, this allows us to reduce the number of parameters by the ratio n_r / n_c. By setting n_r < n_c we can increase the model memory (n_c) and still control the number of parameters in the recurrent connections and output layer.

With the proposed LSTMP architecture, the equations for the activations of network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1} and the following is added:

r_t = W_{rm} m_t    (7)
y_t = \phi(W_{yr} r_t + b_y)    (8)

where the r_t denote the recurrent unit activations.
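To make these two parameter counts concrete, the small C++ sketch below evaluates both formulas. The function names are arbitrary, and the example sizes are illustrative assumptions loosely echoing Section 4.1 (40-dimensional inputs, 14247 output states) and Table 1 (n_c = 2048, n_r = 512):

// Parameter counts from Section 2.3 (biases ignored, one cell per memory block).
#include <cstdio>

long long LstmParamCount(long long nc, long long ni, long long no) {
  return nc * nc * 4 + ni * nc * 4 + nc * no + nc * 3;            // standard LSTM
}

long long LstmpParamCount(long long nc, long long nr, long long ni, long long no) {
  return nc * nr * 4 + ni * nc * 4 + nr * no + nc * nr + nc * 3;  // LSTMP
}

int main() {
  const long long ni = 40, no = 14247, nc = 2048, nr = 512;       // illustrative sizes
  std::printf("LSTM  (nc=%lld):          %lld parameters\n", nc,
              LstmParamCount(nc, ni, no));
  std::printf("LSTMP (nc=%lld, nr=%lld): %lld parameters\n", nc, nr,
              LstmpParamCount(nc, nr, ni, no));
  return 0;
}

With these sizes the LSTMP count comes to roughly 13M parameters, consistent with the single-layer c2048_r512 row of Table 1, whereas the corresponding unprojected LSTM would need roughly 46M.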
2.4. Deep LSTMP

Similar to deep LSTM, we propose deep LSTMP, where multiple LSTM layers, each with a separate recurrent projection layer, are stacked. LSTMP allows the memory of the model to be increased independently from the output layer and recurrent connections. However, we noticed that increasing the memory size makes the model more prone to overfitting by memorizing the input sequence data. We know that DNNs generalize better to unseen examples with increasing depth. The depth makes the models harder to overfit to the training data, since the inputs to the network need to go through many non-linear functions. With this motivation, we have experimented with deep LSTMP architectures, where the aim is increasing both the memory size and the generalization power of the model.
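Continuing the Eigen sketch given after Eq. (6) (so LstmParams and LstmStep refer to those illustrative definitions, not the authors' code), one plausible rendering of an LSTMP layer and of deep LSTMP stacking is the following: each layer applies Eqs. (1)-(5) with its own r_{t-1} as the recurrent input, projects the cell outputs as in Eq. (7), and passes the projected vector up as the input of the next layer.

// Continues the earlier sketch; LstmParams and LstmStep are defined there.
#include <vector>

struct LstmpLayer {
  LstmParams lstm;   // weights of Eqs. (1)-(5), used with r_{t-1} as recurrent input
  MatrixXf W_rm;     // recurrent projection of Eq. (7)
  VectorXf c, m, r;  // per-layer state carried across time steps
};

// One time step of a deep LSTMP stack: each layer's projected output r_t
// is used as the input of the layer above it.
VectorXf DeepLstmpStep(std::vector<LstmpLayer>& layers, const VectorXf& x_t) {
  VectorXf x = x_t;
  for (LstmpLayer& layer : layers) {
    LstmStep(layer.lstm, x, layer.r, layer.c, layer.m);  // Eqs. (1)-(5)
    layer.r = layer.W_rm * layer.m;                      // Eq. (7)
    x = layer.r;                                         // feed the next layer
  }
  return x;  // top layer's r_t; Eq. (8) applies the softmax output layer to it
}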
Table 1: Experiments with LSTM and LSTMP RNN architectures showing test set WERs and frame accuracies on development and training sets. L indicates the number of layers, for shallow (1L) and deep (2,4,5,7L) networks. C indicates the number of memory cells, P the number of recurrent projection units, and N the total number of parameters.

C      P     Depth   N      Dev frame acc. (%)   Train frame acc. (%)   WER (%)
840    -     5L      37M    67.7                 70.7                   10.9
440    -     5L      13M    67.6                 70.1                   10.8
600    -     2L      13M    66.4                 68.5                   11.3
385    -     7L      13M    66.2                 68.5                   11.2
750    -     1L      13M    63.3                 65.5                   12.4

6000   800   1L      36M    67.3                 74.9                   11.8
2048   512   2L      22M    68.8                 72.0                   10.8
1024   512   3L      20M    69.3                 72.5                   10.7
1024   512   2L      15M    69.0                 74.0                   10.7
800    512   2L      13M    69.0                 72.7                   10.7
2048   512   1L      13M    67.3                 71.8                   11.3
3. Distributed Training: Scaling up to Large Models with Parallelization

We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU. The decision was based on the CPU's relatively simpler implementation complexity, ease of debugging and the ability to use clusters made from commodity hardware. For matrix operations, we used the Eigen matrix library [21]. This templated C++ library provides efficient implementations for matrix operations on CPU using vectorized instructions. We implemented activation functions and gradient calculations on matrices using SIMD instructions to benefit from parallelization.

We use the truncated backpropagation through time (BPTT) learning algorithm [22] to compute parameter gradients on short subsequences of the training utterances. Activations are forward propagated for a fixed number of time steps T_bptt (e.g. 20). Cross-entropy gradients are computed for this subsequence and backpropagated to its start. For computational efficiency, each thread operates on subsequences of four utterances at a time, so matrix multiplies can operate in parallel on four frames at a time. We use asynchronous stochastic gradient descent (ASGD) [23] to optimize the network parameters, updating the parameters asynchronously from multiple threads on a multi-core machine. This effectively increases the batch size and reduces the correlation of the frames in a given batch. After a thread has updated the parameters, it continues with the next subsequence in each utterance, preserving the LSTM state, or starts a new utterance with reset state when one finishes. Note that the last subsequence of each utterance can be shorter than T_bptt but is padded to the full length, though no gradient is generated for these padding frames.
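A hedged sketch of this per-thread scheduling is given below; the types, names and the empty TrainChunk() stub are placeholders rather than the paper's implementation. Four utterance streams are advanced in fixed chunks of T_bptt frames, the LSTM state is carried across chunks within an utterance, and it is reset whenever a stream moves on to a new utterance.

// Hedged sketch of the per-thread truncated-BPTT scheduling; not the paper's code.
#include <cstddef>
#include <vector>

struct Utterance { std::vector<std::vector<float>> frames; };  // feature frames
struct LstmState { /* cell states c and outputs m for every layer */ };

constexpr int kTbptt = 20;    // forward-propagation span in frames (T_bptt)
constexpr int kStreams = 4;   // utterances processed in parallel per thread

// Placeholder: forward up to kTbptt frames per stream (padding short tails),
// backpropagate cross-entropy gradients to the chunk start, apply an ASGD update.
void TrainChunk(const std::vector<const Utterance*>& streams,
                const std::vector<std::size_t>& offsets,
                std::vector<LstmState>& states) {}

void ThreadLoop(const std::vector<Utterance>& data) {
  std::size_t next = 0;                               // next unused utterance
  std::vector<const Utterance*> stream(kStreams, nullptr);
  std::vector<std::size_t> offset(kStreams, 0);
  std::vector<LstmState> state(kStreams);
  while (true) {
    // Refill any stream whose utterance has finished, resetting its LSTM state.
    for (int s = 0; s < kStreams; ++s) {
      if (stream[s] == nullptr || offset[s] >= stream[s]->frames.size()) {
        if (next >= data.size()) return;              // no more utterances
        stream[s] = &data[next++];
        offset[s] = 0;
        state[s] = LstmState{};                       // new utterance: reset state
      }
    }
    TrainChunk(stream, offset, state);                // one truncated-BPTT step
    for (int s = 0; s < kStreams; ++s) offset[s] += kTbptt;
  }
}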
This highly parallel single-machine ASGD framework described in [3] proved slow for training models of the scale we have used for large scale ASR with DNNs (many millions of parameters). To scale further, we replicate the single-machine workers on many (e.g. 500) separate machines, each with three synchronized computation threads. Each worker communicates with a shared, distributed parameter server [23] which stores the LSTM parameters. When a worker has computed the parameter gradient on a minibatch (of 3 × 4 × T_bptt frames), the gradient vector is partitioned and sent to the parameter server shards, which each add the gradients to their parameters and respond with the new parameters. The parameter server shards aggregate parameter updates completely asynchronously; for instance, gradient updates from workers may arrive in different orders at different shards of the parameter server. Despite the asynchrony, we observe stable convergence, though the learning rate must be reduced, as would be expected because of the increase in the effective batch size from the greater parallelism.
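The shard-side bookkeeping can be pictured with the sketch below; it is an illustration under assumptions, not the parameter server of [23]. A worker's gradient vector is split into contiguous slices, and each shard asynchronously applies its slice (scaled by a learning rate) to the parameters it owns and returns the updated values.

// Hedged sketch of gradient partitioning and a shard-side update; illustrative only.
#include <cstddef>
#include <vector>

struct Shard {
  std::vector<float> params;    // the slice of LSTM parameters this shard owns
  float learning_rate = 0.01f;  // illustrative value, not from the paper

  // Apply one asynchronous gradient-descent step and return the new parameters.
  std::vector<float> ApplyGradient(const std::vector<float>& grad_slice) {
    for (std::size_t i = 0; i < params.size() && i < grad_slice.size(); ++i)
      params[i] -= learning_rate * grad_slice[i];
    return params;              // the worker replaces its local copy with these
  }
};

// Partition a worker's full gradient vector into contiguous per-shard slices.
std::vector<std::vector<float>> Partition(const std::vector<float>& grad,
                                          std::size_t num_shards) {
  std::vector<std::vector<float>> slices(num_shards);
  const std::size_t chunk = (grad.size() + num_shards - 1) / num_shards;
  for (std::size_t i = 0; i < grad.size(); ++i)
    slices[i / chunk].push_back(grad[i]);
  return slices;
}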
4. Experiments

We evaluate and compare the performance of LSTM RNN architectures on a large vocabulary speech recognition task: the Google Voice Search task. We use a hybrid approach [24] for acoustic modeling with LSTM RNNs, wherein the neural networks estimate hidden Markov model (HMM) state posteriors. We scale the state posteriors by the state priors, estimated as the relative state frequencies from the training data, to obtain the acoustic frame likelihoods. We deweight the silence state counts by a factor of 2.7 when estimating the state frequencies.
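A small sketch of this conversion follows; it is illustrative only, and it assumes one plausible reading of the deweighting, namely that silence-state counts are divided by 2.7 before the priors are normalized. The priors are then the relative state frequencies, and each frame's scaled log-likelihood is the log posterior minus the log prior.

// Hedged sketch of prior estimation and posterior-to-likelihood scaling.
#include <cmath>
#include <cstddef>
#include <vector>

// counts[s]: frame count of HMM state s in the training alignments;
// is_silence[s]: whether state s is a silence state.
std::vector<double> EstimatePriors(const std::vector<double>& counts,
                                   const std::vector<bool>& is_silence,
                                   double silence_deweight = 2.7) {
  std::vector<double> priors(counts.size());
  double total = 0.0;
  for (std::size_t s = 0; s < counts.size(); ++s) {
    priors[s] = is_silence[s] ? counts[s] / silence_deweight : counts[s];
    total += priors[s];
  }
  for (double& p : priors) p /= total;   // relative state frequencies
  return priors;
}

// Convert one frame's network state posteriors into scaled log-likelihoods.
std::vector<double> FrameLogLikelihoods(const std::vector<double>& posteriors,
                                        const std::vector<double>& priors) {
  std::vector<double> loglik(posteriors.size());
  for (std::size_t s = 0; s < posteriors.size(); ++s)
    loglik[s] = std::log(posteriors[s]) - std::log(priors[s]);
  return loglik;
}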
4.1. Systems & Evaluation

All the networks are trained on a 3 million utterance (about 1900 hours) dataset consisting of anonymized and hand-transcribed utterances. The dataset is represented with 25ms frames of 40-dimensional log-filterbank energy features computed every 10ms. The utterances are aligned with an 85 million parameter DNN with 14247 CD states. The weights in all the networks are initialized to the range (-0.02, 0.02) with a uniform distribution. We try to set the learning rate specific to a network architecture and its configuration to the largest value that results in a stable convergence.
(Figure: Frame Accuracy (%) curves comparing the c750, c385x7, c600x2, c440x5, c800_r512x2 and c2048_r512 networks; only the legend labels and the y-axis name are recoverable from this extract.)