Scaling Up Online Speech Recognition Using ConvNets
A Preprint
Ronan Collobert
Facebook, Menlo Park
locronan@fb.com
Abstract
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS)
convolutions and Connectionist Temporal Classification (CTC). The system has almost three times
the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word
error rate. We improve the core TDS architecture in order to limit the future context and hence reduce
latency while maintaining accuracy. Also important to the efficiency of the recognizer is our highly optimized beam
search decoder. To show the impact of our design choices, we analyze throughput, latency and accuracy, and discuss
how these metrics can be tuned based on user requirements.
1 Introduction
The process of transcribing speech in real-time from an input audio stream is known as online speech recognition. Most
automatic speech recognition (ASR) research focuses on improving accuracy without the constraint of performing
recognition in real time. For certain applications such as live video captioning or on-device transcription, however, low
latency speech recognition is essential. In these cases, online speech recognition with a limited time delay is needed to
provide a good user experience.
Furthermore, deployment of a real-time speech recognition system at scale poses a number of challenges. First, the
system must be computationally efficient to minimize the required hardware resources and to handle a large number of
concurrent requests. Second, the system needs to minimize the time between a spoken word appearing in the audio and
the corresponding text produced by the system. Third, the deployed recognizer should be competitive in accuracy with
an offline research-grade system. We measure the three aforementioned constraints using the following metrics: (i)
throughput, (ii) Real-Time Factor (RTF) and latency, and (iii) Word Error Rate (WER).
We discuss the design of a novel ASR system that achieves 3x better throughput compared to a strong hybrid baseline.
We use Time-Depth Separable convolutions [1] as the building block for our acoustic model and an efficient beam-search
decoder provided by the wav2letter++ [2] open-source ASR toolkit (https://github.com/facebookresearch/wav2letter).
Figure 1: TDS block architecture, consisting of a Conv 1D layer and two Linear layers with ReLU and LayerNorm; the input and output both have shape T x w*c.
The rest of the paper is organized as follows. In Section 3, we review the design principles of our system, including
the acoustic model and the decoder, and present novel ideas to reduce latency and greatly improve computational
efficiency. In Section 4, we discuss our experimental setup: benchmarks for measuring throughput and accuracy, as
well as the hardware configuration. In Section 5, we present our experimental results and ablative analysis. Finally, in
Section 6, we discuss future work and conclude.
2 Related Work
Taking an ASR system from research to a low-latency, high-throughput deployment while maintaining low WER
involves non-trivial changes to the implementation. For example, many research systems use bi-directional RNNs [3, 4]
or (self) attention [5, 6] which require unlimited future context and make low-latency deployment impossible. Fully
convolutional acoustic models can be very efficient to train, but are often used with large future context sizes [7]. Here,
we build on a large body of work in making research-grade ASR systems work in resource constrained low-latency
environments.
Our architecture uses Time-Depth Separable convolution (TDS) [1] as the core building block. Bi-directional RNNs
and Transformers [8] either require considerable changes for low-latency deployment [9, 10] or degrade rapidly when
limiting the amount of future context [11]. While latency controlled bi-directional LSTMs are commonly used in online
speech recognition [12], incorporating future context with convolutions can yield more accurate and lower latency
models [13]. We find that the TDS convolution can maintain low WERs with a limited amount of future context.
Some recent work in low-latency speech recognition uses the RNN Transducer (RNN-T) [14] as the core architecture [15,
16]. Unlike attention models, the RNN-T decoder is causal and can be easily streamed since it does not rely on future
context. Departing from prior work, we find that a CTC trained model can yield better WER than an RNN-T model
while being simpler and more efficient to deploy.
Depth-wise separable convolutions [17] have been used in domains such as computer vision [18] and machine
translation [19] yielding dramatic reductions in model size and computational throughput while maintaining accuracy.
The TDS architecture we use here also results in much lighter-weight models that still achieve low WER. Other
architectures, such as the Time-Channel Separable convolution, have shown similar gains in computational efficiency
with little if any loss in accuracy [20].
3 Technical Details
In this section we describe the architectural design, algorithmic choices and various performance optimizations of our
speech recognizer.
Our acoustic models are based on Time-Depth Separable (TDS) convolutions [1]. The reasons for choosing these
models are two-fold. First, TDS blocks make use of grouped convolutions, which dramatically reduce the number of
parameters while still achieving low WER. This makes inference computationally efficient and keeps the model size
small. Second, limiting the future context of the convolution operations in TDS, which is necessary for maintaining low
latency, leads to only a small degradation in WER.
Figure 1 shows the architecture of the TDS block used in our work. The TDS block we use is modified from the
original TDS implementation to make the architecture streaming friendly: the convolutions use asymmetric padding so
that only a limited amount of future context is consumed, and time is removed from the layer normalization axes so
that the normalization does not depend on frames that have not yet arrived. The impact of these changes is analyzed in
our ablation study.
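The sketch below is a minimal PyTorch rendering of such a streaming-friendly TDS block, assuming the 4-tuple TDS (c, kw, w, r) in Figure 2 denotes (channels, kernel width, input width, future-context frames); the layer ordering, parameter names and residual placement are our assumptions rather than the wav2letter++ implementation.

```python
import torch
import torch.nn as nn

class StreamingTDSBlock(nn.Module):
    """Sketch of a TDS block with limited future context (not the wav2letter++ code)."""

    def __init__(self, channels, kernel_size, feature_dim, future_context):
        super().__init__()
        assert 0 <= future_context <= kernel_size - 1
        self.channels, self.feature_dim = channels, feature_dim
        # Asymmetric padding: most of the context comes from the past, and only
        # `future_context` frames of lookahead are allowed.
        self.left_pad = kernel_size - 1 - future_context
        self.right_pad = future_context
        # 2-D convolution over time with kernel (kw, 1); viewed over the flattened
        # w*c channels this acts as a grouped 1-D convolution.
        self.conv = nn.Conv2d(channels, channels, (kernel_size, 1))
        hidden = channels * feature_dim
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden))
        # LayerNorm over the channel*feature axis only: time is removed from the
        # normalization axes, so the statistics never depend on future frames.
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        # x: (batch, time, channels * feature_dim)
        B, T, _ = x.shape
        y = x.view(B, T, self.channels, self.feature_dim).permute(0, 2, 1, 3)
        y = nn.functional.pad(y, (0, 0, self.left_pad, self.right_pad))
        y = torch.relu(self.conv(y))                       # (B, C, T, F)
        y = y.permute(0, 2, 1, 3).reshape(B, T, -1)
        x = self.norm1(x + y)                              # residual + LayerNorm
        return self.norm2(x + self.fc(x))                  # fully connected sub-block

# e.g. a 15-channel block over 80-dimensional features with one frame of lookahead
block = StreamingTDSBlock(channels=15, kernel_size=9, feature_dim=80, future_context=1)
out = block(torch.randn(2, 100, 15 * 80))                  # -> (2, 100, 1200)
```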
Figure 2 shows the complete acoustic model architecture. The model consists of two 15-channel, three 19-channel,
four 23-channel and five 27-channel TDS blocks. They are separated by 1-D convolution layers which increase the
number of output channels and optionally perform subsampling. A final linear layer produces an N-dimensional output
(N is the size of the token set) which is later passed to a log-softmax layer prior to computing the CTC loss. The model
has a total of 104 million parameters and a total subsampling factor of 8. It takes 80-dimensional log mel-scale filter
bank features as input. The features are extracted with a stride of 10ms, so the model has a receptive field of ∼10
seconds per frame and a future context of 250ms.
Figure 2: Acoustic model architecture (notation: kw = filter size, dw = stride). Bottom to top: Input (T x 80); Conv1D : 80 —> 1200 (kw: 10, dw: 2, pad: {7,1}), LayerNorm, ReLU; 2 x TDS (15, 9, 80, 1); Conv1D : 1200 —> 1520 (kw: 10, dw: 2, pad: {7,1}), LayerNorm, ReLU; 3 x TDS (19, 9, 80, 1); Conv1D : 1520 —> 1840 (kw: 12, dw: 2, pad: {9,1}), LayerNorm, ReLU; …; 5 x TDS (27, 11, 80, 0); Linear : 2160 —> N.
To integrate a language model, we develop an online version of the beam search decoder based on wav2letter++ [2].
The online decoder consumes the encoded frames for the current input audio chunk and extends the beam search
graph. We output the most likely sequence of words based on this extended beam search graph for every chunk. Since
the best path can change as we extend the beam search graph, we allow for correction of partial transcriptions generated
earlier. Also, in order not to have the history in memory grow infinitely with incoming audio chunks, pruning is applied
to the history buffer after consuming a certain number of chunks.
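To make this chunked decoding concrete, here is a minimal, self-contained Python sketch of an online CTC prefix beam search. It is token-level, omits the language model and the history-buffer pruning described above, and none of the names come from wav2letter++ (whose decoder is a heavily optimized C++ implementation); it only illustrates how the beam is extended chunk by chunk and how the best hypothesis can change as more audio arrives.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs)) if m > NEG_INF else NEG_INF

def ctc_prefix_beam_step(log_probs_t, beams, blank=0, beam_size=8):
    """Advance the beam by one frame of log-posteriors.

    `beams` maps a prefix (tuple of token ids) to a pair
    (log P(prefix, ending in blank), log P(prefix, ending in non-blank)).
    """
    next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
    for prefix, (p_b, p_nb) in beams.items():
        for tok, lp in enumerate(log_probs_t):
            if tok == blank:
                b, nb = next_beams[prefix]
                next_beams[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
            elif prefix and tok == prefix[-1]:
                # Repeated token: extending the prefix requires a blank in between;
                # otherwise the repeat collapses onto the same prefix.
                b, nb = next_beams[prefix + (tok,)]
                next_beams[prefix + (tok,)] = (b, logsumexp(nb, p_b + lp))
                b, nb = next_beams[prefix]
                next_beams[prefix] = (b, logsumexp(nb, p_nb + lp))
            else:
                b, nb = next_beams[prefix + (tok,)]
                next_beams[prefix + (tok,)] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
    # Keep only the best `beam_size` prefixes.
    ranked = sorted(next_beams.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
    return dict(ranked[:beam_size])

def stream_decode(chunk_posteriors, blank=0, beam_size=8):
    """Consume per-chunk log-posteriors and yield the best prefix after each chunk."""
    beams = {(): (0.0, NEG_INF)}                 # empty prefix, "ending in blank"
    for log_probs_chunk in chunk_posteriors:     # one (frames x tokens) array per chunk
        for log_probs_t in log_probs_chunk:
            beams = ctc_prefix_beam_step(log_probs_t, beams, blank, beam_size)
        # Best hypothesis so far; it may revise words emitted for earlier chunks.
        yield max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
```

After every chunk the current best prefix is re-emitted, which is exactly what allows earlier partial transcriptions to be corrected before they are finalized.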
Apart from the existing performance optimizations included in the wav2letter++ decoder [7], we introduce two further
pruning techniques. First, we consider only the top-K (e.g. K = 50) tokens according to the acoustic model score when
expanding each hypothesis in the beam. This is also commonly known as acoustic pruning. Second, we propose only
the blank symbol if its posterior probability is larger than 0.95 [22]. With these optimizations, as well as the 8x
reduction in the number of frames from subsampling in the acoustic model, we find that decoding time accounts for
about 5% of the overall inference procedure.
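As a sketch of how the per-frame candidate set might be selected before expanding the beam (the thresholds follow the text above, but the function and parameter names are ours, not wav2letter++ APIs):

```python
import numpy as np

def candidate_tokens(log_probs_t, blank=0, top_k=50, blank_threshold=0.95):
    """Tokens considered when expanding each hypothesis at one frame."""
    # If the blank posterior dominates, propose only the blank symbol.
    if np.exp(log_probs_t[blank]) > blank_threshold:
        return np.array([blank])
    # Otherwise keep only the top-K tokens by acoustic score (acoustic pruning).
    k = min(top_k, len(log_probs_t))
    return np.argpartition(log_probs_t, -k)[-k:]
```

In a full decoder, the inner loop over tokens in the beam search step sketched earlier would iterate only over this candidate set rather than the entire token vocabulary.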
3.3 Inference Implementation
We wrote a standalone, modular platform to perform the full online inference procedure. The pipeline takes audio
chunks as input and processes them in an online manner. To perform the matrix multiplications and 1-D group
convolutions required by the acoustic model, we use
the 16-bit floating-point FBGEMM implementation. We have carefully optimized memory use by relying on efficient
I/O buffers and have profiled extensively to remove any excess computational overhead.
4 Experimental Setup
4.1 Data
The training set used for our experiments consists of around 1 million utterances (∼13.7K hours) of in-house English
videos publicly shared by users. The data is completely anonymized and doesn’t contain any personally-identifiable
information (PII). We report results on two test sets, vid-clean and vid-noisy, each consisting of around 1.4K utterances
(∼20 hours). More information about the data set can be found in [23].
4.2 Training
All our experiments are run using the wav2letter++ framework [2] with 64 GPUs per experiment. We use 80-
dimensional log mel-scale filter banks as input features, with STFTs computed on 25ms Hamming windows strided by
10ms. For the output token set, we use a vocabulary of 5000 sub-word tokens generated with the SentencePiece toolkit
[24]. We use local mean and variance normalization instead of global normalization on the input features prior to the
acoustic model so that the system can run in an online manner. Local normalization computes summary statistics
over the prior n frames and uses them to normalize the input features at the current frame. We use n = 300 for all of
our experiments, which corresponds to a window of approximately 3 seconds. We also use SpecAugment [3] for
data augmentation in all of the experiments.
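The following is a minimal sketch of this front end: 80-dimensional log mel filter bank features from 25ms Hamming windows with a 10ms stride, followed by causal local normalization over the most recent n = 300 frames. It uses torchaudio purely for illustration; the exact feature pipeline in our system may differ in details such as dithering or pre-emphasis.

```python
import torch
import torchaudio

def log_mel_features(waveform, sample_rate=16000):
    """80-dim log mel filter bank features, 25ms Hamming windows, 10ms stride."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=int(0.025 * sample_rate),        # 25ms analysis window
        hop_length=int(0.010 * sample_rate),   # 10ms stride
        n_mels=80,
        window_fn=torch.hamming_window,
    )(waveform)                                # (80, T)
    return torch.log(mel + 1e-6).transpose(0, 1)   # (T, 80)

def local_normalize(features, n=300, eps=1e-5):
    """Normalize each frame with mean/variance of the most recent n frames.

    The window is causal, so normalization never looks at future frames and the
    system can run in an online manner.
    """
    out = torch.empty_like(features)
    for t in range(features.shape[0]):
        window = features[max(0, t - n + 1): t + 1]
        mean = window.mean(dim=0)
        std = window.std(dim=0, unbiased=False)
        out[t] = (features[t] - mean) / (std + eps)
    return out

# e.g. 3 seconds of 16kHz audio -> roughly 300 normalized feature frames
feats = local_normalize(log_mel_features(torch.randn(3 * 16000)))
```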
We compare our work with two strong baselines. All the systems (including ours) use the same training, validation and
test sets.
Baseline 1 The first baseline [23] is a hybrid system based on context-dependent graphemes (chenones) and uses
multi-layer Latency Controlled Bidirectional Long Short-Term Memory layers (LC-BLSTM)[25] in the acoustic model.
The system is initially bootstrapped with cross entropy (CE) training and then fine-tuned with the lattice-free maximum
mutual information (LF-MMI) [26] loss function.
Baseline 2 The second system [15] is based on the RNN-T [14] architecture and uses Latency Controlled Bidirectional
Long Short-Term Memory layers (LC-BLSTM) [25] in the acoustic model. The token set consists of 200 sentence
pieces, constructed with the SentencePiece library [24]. Unlike Baseline 1, the second baseline is trained in an
end-to-end manner.
For consistency in results, all the benchmarks, including baselines, are run on Intel Skylake CPUs with 18 physical
cores and 64GB of RAM. We describe the performance metrics evaluated in our experiments below.
4.4.2 Throughput
Throughput is defined as the rate at which audio is consumed. In other words, it is the number of audio seconds
processed per wall clock second.
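Concretely, and as a simplification (RTF here is the standard real-time factor, processing time divided by audio duration), the two efficiency metrics relate as in the sketch below, where throughput aggregates audio across all concurrent streams:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Wall-clock processing time per second of audio; RTF < 1 is real-time."""
    return processing_seconds / audio_seconds

def throughput(total_audio_seconds, wall_clock_seconds):
    """Audio seconds consumed per wall-clock second, summed over concurrent streams."""
    return total_audio_seconds / wall_clock_seconds

# e.g. 30 concurrent streams, each processing 60s of audio in 30s of wall-clock time
print(real_time_factor(30, 60))      # 0.5
print(throughput(30 * 60, 30))       # 60.0 audio seconds per second
```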
difficult. Instead, we compute this latency empirically and end-to-end, by measuring the timestamp at which a
transcribed word becomes available to the user and comparing it with the same word's timestamp in the original audio.
More formally, we define user-perceived latency as the average delay between the end timestamp of each word in the
reference transcript and the time when it is shown to the user by the system. As an example, consider a 1-second audio
file with transcript “how are you” and corresponding start and end word timestamps of (100ms-200ms), (300ms-400ms)
and (500ms-600ms) respectively. Consider an ASR system with RTF 0.2 which processes the audio file and outputs the
words “how” and “are” after consuming 500ms of audio, and the word “you” after looking at 1000ms of the audio.
For this system, the word “how” is produced at 600ms (500ms for the audio chunk to become available plus 100ms,
i.e. 500ms * RTF, to process it), while the true end timestamp of the word is 200ms, which gives a latency of 400ms.
Similarly, “are” is also produced at 600ms (latency 200ms) and “you” is produced at 1100ms (latency 500ms).
Computing this for every word and taking the average, this system has an average user-perceived latency
of (400ms + 200ms + 500ms) / 3 = 366.67ms.
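A small sketch of this computation (our own helper, reproducing the numbers in the example above):

```python
def user_perceived_latency(word_end_ms, word_emit_ms):
    """Average delay between each word's end timestamp in the audio and the
    time at which the word is shown to the user (all times in milliseconds)."""
    delays = [emit - end for end, emit in zip(word_end_ms, word_emit_ms)]
    return sum(delays) / len(delays)

# Reference word ends at 200, 400, 600ms; the words are shown at 600, 600 and 1100ms.
print(user_perceived_latency([200, 400, 600], [600, 600, 1100]))   # ≈ 366.67
```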
We measure user-perceived latency on a set of 1000 samples from the TIMIT [27] corpus, which provides word-level
alignments. We note that all of the ASR systems we consider produce the correct transcript on these samples
(WER = 0), which is necessary for this latency analysis.
We performed an ablation study to observe the change in WER from the original TDS model [1] with global input
normalization to the low latency version we propose here. Table 2 shows that there is only a ∼4% relative increase in
WER because of the changes we make to decrease the latency of the model. We observe that limiting the future context
of the acoustic model contributes the most to the degradation in WER.
Table 2: The effect on WER of decreasing the latency of the TDS model.
Table 3: The effect of future context size on WER.
Table 3 shows the effect of varying future context sizes on WER. We keep the receptive field of our convolutional
acoustic model fixed at 10 seconds, use local normalization, and remove time from the layer normalization axes for all
of these experiments.
Figure 3: The acoustic model latency when varying the future context size. The transcript of the input audio is “yes”
and the end of this word is marked in the audio with <end>. Word timestamps generated by greedy decoding for
different future context sizes are marked at various locations in the input audio.
Figure 4: RTF and throughput trade-off vs. number of concurrent streams.
Figure 5: Latency and throughput trade-off vs. chunk size.
We see that the WER almost always improves as the future context size increases. Figure 3 shows the word timestamps
generated from greedy decoding on an input audio file with only one spoken word. We see that the transcript can be
produced faster by using smaller future context sizes.
These results demonstrate that asymmetric padding is important for low-latency convolutional acoustic models. We see
a significant reduction in latency with only a ∼4% relative increase in WER when going from a model with symmetric
convolutions (5 sec of future context) to a model with asymmetric convolutions (250ms of future context).
As discussed in Section 1, processing multiple audio streams in parallel is necessary to achieve higher throughput.
Figure 4 shows the effect of increasing the number of concurrent streams on throughput and RTF for a fixed chunk size
of 750ms. We see that throughput increases dramatically up to about 30 concurrent streams, beyond which it no longer
improves. On the other hand, RTF consistently increases as we increase the number of concurrent streams. While any
setting with RTF < 1 is considered real-time, as discussed in Section 4.4.3, a lower RTF improves the user-perceived
latency.
For streaming speech recognition, we feed the acoustic model with audio in chunks every T milliseconds (where T is
the chunk size). The transcript is then generated in an online fashion. Figure 5 shows the effect of increasing the audio
chunk size on throughput and latency. We see that even though throughput can be increased by increasing chunk size,
the user-perceived latency also increases.
Acknowledgements
We would like to thank Mahaveer Jain and Duc Le for help in evaluating the baseline models.
References
[1] Awni Hannun, Ann Lee, Qiantong Xu, and Ronan Collobert. Sequence-to-sequence speech recognition with
time-depth separable convolutions. Interspeech 2019, Sep 2019.
[2] Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and
Ronan Collobert. Wav2letter++: A fast open-source speech recognition system. In IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 6460–6464. IEEE, 2019.
[3] Daniel S. Park, William Chan, Yu Zhang, et al. Specaugment: A simple data augmentation method for automatic
speech recognition. Interspeech 2019, Sep 2019.
[4] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. Improved training of end-to-end attention models for
speech recognition. arXiv preprint arXiv:1805.03294, 2018.
[5] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli
Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-
sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 4774–4778. IEEE, 2018.
[6] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki,
Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. A comparative study on transformer vs rnn
in speech applications. arXiv preprint arXiv:1909.06317, 2019.
[7] Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert.
Fully convolutional speech recognition. arXiv preprint arXiv:1812.06864, 2018.
[8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser,
and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages
5998–6008, 2017.
[9] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared
Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in
english and mandarin. In International conference on machine learning, pages 173–182, 2016.
[10] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long
short-term memory rnns for distant speech recognition. In 2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 5755–5759. IEEE, 2016.
[11] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang,
Andros Tjandra, Xiaohui Zhang, Frank Zhang, et al. Transformer-based acoustic modeling for hybrid speech
recognition. arXiv preprint arXiv:1910.09799, 2019.
[12] Shaofei Xue and Zhijie Yan. Improving latency-controlled blstm acoustic models for online speech recognition.
In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5340–5344.
IEEE, 2017.
[13] Vijayaditya Peddinti, Yiming Wang, Daniel Povey, and Sanjeev Khudanpur. Low latency acoustic modeling using
temporal convolution and lstms. IEEE Signal Processing Letters, 25(3):373–377, 2017.
[14] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
[15] Mahaveer Jain, Kjell Schubert, Jay Mahadeokar, Ching-Feng Yeh, Kaustubh Kalgaonkar, Anuroop Sriram,
Christian Fuegen, and Michael L Seltzer. Rnn-t for latency controlled asr with improved beam search. arXiv
preprint arXiv:1911.01629, 2019.
[16] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach,
Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al. Streaming end-to-end speech recognition for mobile devices.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6381–6385. IEEE,
2019.
[17] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 1251–1258, 2017.
[18] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications.
arXiv preprint arXiv:1704.04861, 2017.
[19] Lukasz Kaiser, Aidan N Gomez, and Francois Chollet. Depthwise separable convolutions for neural machine
translation. arXiv preprint arXiv:1706.03059, 2017.
[20] Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan
Leary, Jason Li, and Yang Zhang. Quartznet: Deep automatic speech recognition with 1d time-channel separable
convolutions. arXiv preprint arXiv:1910.10261, 2019.
[21] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
[22] Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. Fully neural network based speech
recognition on mobile and embedded devices. In Advances in Neural Information Processing Systems, pages
10620–10630, 2018.
[23] Duc Le, Xiaohui Zhang, Weiyi Zheng, Christian Fügen, Geoffrey Zweig, and Michael L. Seltzer. From senones to
chenones: Tied context-dependent graphemes for hybrid speech recognition, 2019.
[24] Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword
candidates, 2018.
[25] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long
short-term memory rnns for distant speech recognition. 2016 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Mar 2016.
[26] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming
Wang, and Sanjeev Khudanpur. Purely sequence-trained neural networks for asr based on lattice-free mmi. In
Interspeech 2016, pages 2751–2755, 2016.
[27] J. Garofolo, Lori Lamel, W. Fisher, Jonathan Fiscus, D. Pallett, N. Dahlgren, and V. Zue. Timit acoustic-phonetic
continuous speech corpus. Linguistic Data Consortium, 11 1992.
[28] Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout,
2019.
[29] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian
Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis, 2018.