CTC Loss Function

The document discusses Connectionist Temporal Classification (CTC) as an end-to-end approach for automatic speech recognition. CTC trains a recurrent neural network to map acoustic input sequences directly to character output sequences without requiring frame-level alignment. It handles variable input/output lengths by introducing a "blank" symbol and summing over all possible alignments. CTC has been shown to produce competitive results on standard speech recognition benchmarks.

End-to-end systems 1: CTC

(Connectionist Temporal Classification)

Steve Renals

Automatic Speech Recognition – ASR Lecture 15


11 March 2019



End-to-end systems

End-to-end systems learn to map directly from an input sequence X to an output sequence Y, estimating P(Y|X)
Y can be a sequence of words or subwords
ML-trained HMMs are a kind of end-to-end system – the HMM estimates P(X|Y), and when combined with a language model gives an estimate of P(Y|X)
Sequence discriminative training of HMMs (using GMMs or DNNs) can be regarded as end-to-end
But training is quite complicated – the denominator (total likelihood) must be estimated using lattices, and the model is first trained conventionally (ML for GMMs, CE for NNs) and then fine-tuned with sequence discriminative training
Lattice-free MMI is one way to address these issues
Other approaches are based on recurrent networks which directly map input sequences to output sequences:
CTC – Connectionist Temporal Classification
Encoder-decoder approaches (next lecture)
Deep Speech

Output: character probabilities (a-z, <apostrophe>, <space>, <blank>)
Trained using CTC
Softmax output layer
Bidirectional recurrent hidden layer
3 feed-forward hidden layers
Input: filter bank features (spectrogram)

Figure 1 of the paper shows the structure of the RNN model and notation. The model is deliberately simple: a single bidirectional recurrent layer (recurrence being the hardest part to parallelise), without LSTM units, on top of three feed-forward hidden layers. (A rough sketch of such a model follows this slide.)

Hannun et al (2014), "Deep Speech: Scaling up end-to-end speech recognition",
https://arxiv.org/abs/1412.5567
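To make the architecture above concrete, here is a rough PyTorch sketch of a Deep-Speech-style acoustic model. It is not the authors' implementation: the feature and hidden sizes, the clipped-ReLU activation, and the use of a plain (non-LSTM) bidirectional RNN are assumptions chosen to match the description on this slide.

```python
import torch
import torch.nn as nn

class DeepSpeechStyle(nn.Module):
    """Rough sketch of a Deep-Speech-like acoustic model (illustrative sizes)."""
    def __init__(self, n_feats=160, n_hidden=512, n_chars=29):
        # n_chars: 26 letters + apostrophe + space + blank (assumed label set)
        super().__init__()
        self.ff = nn.Sequential(                 # 3 feed-forward hidden layers
            nn.Linear(n_feats, n_hidden), nn.Hardtanh(0, 20),
            nn.Linear(n_hidden, n_hidden), nn.Hardtanh(0, 20),
            nn.Linear(n_hidden, n_hidden), nn.Hardtanh(0, 20),
        )
        # Single bidirectional recurrent layer (no LSTM units)
        self.rnn = nn.RNN(n_hidden, n_hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_chars)  # softmax output layer

    def forward(self, x):                        # x: (batch, time, features)
        h = self.ff(x)
        h, _ = self.rnn(h)
        return self.out(h).log_softmax(dim=-1)   # per-frame log P(c_t | x)
```

The per-frame (log-)probabilities this produces are exactly what a CTC loss consumes.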
Deep Speech: Results

Model SWB CH Full


Vesely et al. (GMM-HMM BMMI) [44] 18.6 33.0 25.8
Vesely et al. (DNN-HMM sMBR) [44] 12.6 24.1 18.4
Maas et al. (DNN-HMM SWB) [28] 14.6 26.3 20.5
Maas et al. (DNN-HMM FSH) [28] 16.0 23.7 19.9
Seide et al. (CD-DNN) [39] 16.1 n/a n/a
Kingsbury et al. (DNN-HMM sMBR HF) [22] 13.3 n/a n/a
Sainath et al. (CNN-HMM) [36] 11.5 n/a n/a
Soltau et al. (MLP/CNN+I-Vector) [40] 10.4 n/a n/a
Deep Speech SWB 20.0 31.8 25.9
Deep Speech SWB + FSH 12.6 19.3 16.0

Table 3: Published error rates (%WER) on Switchboard dataset splits. The columns labeled “SWB”
and “CH” are respectively the easy and hard subsets of Hub5’00.

Deep Speech Training

Maps from acoustic frames X to subword sequences S, where S is a sequence of characters (in some other CTC approaches, S can be a sequence of phones)
CTC loss function
Makes good use of large training data
Additional synthetic training data generated by jittering the signal and adding noise
Many computational optimisations
n-gram language model to impose word-level constraints
Competitive results on standard tasks



Connectionist Temporal Classification (CTC)

Train a recurrent network to map from an input sequence X to an output sequence S
The sequences can be different lengths – for speech, the input sequence X (acoustic frames) is much longer than the output sequence S (characters or phonemes)
CTC does not require a frame-level alignment (matching each input frame to an output token)
CTC sums over all possible alignments (similar to the forward-backward algorithm) – it is "alignment free"
It is possible to back-propagate gradients through the CTC loss (a minimal example follows this slide)

Good overview of CTC: Awni Hannun, "Sequence Modeling with CTC", Distill.
https://distill.pub/2017/ctc
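To make the back-propagation point concrete, here is a minimal example using PyTorch's built-in nn.CTCLoss. The tensor shapes, random inputs, and the choice of index 0 for the blank label are illustrative only.

```python
import torch
import torch.nn as nn

T, N, K, S = 100, 4, 29, 12   # frames, batch size, label-set size, target length
# Per-frame log-probabilities as the CTC loss expects them: shape (T, N, K)
log_probs = torch.randn(T, N, K).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, K, (N, S), dtype=torch.long)   # targets contain no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)     # computes -log P(S|X), summing over all alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()               # gradients flow back to the per-frame outputs
```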


CTC: Alignment
Imagine mapping (x1, x2, x3, x4, x5, x6) to [a, b, c]
Possible alignments: aaabbc, aabbcc, abbbbc, ...
However:
We don't always want to map every input frame to an output symbol (e.g. if there is "inter-symbol silence")
We want to be able to have two identical symbols adjacent to each other – and to keep the difference between, say, [h, e, l, l, o] and [h, e, l, o]
Solve this using an additional blank symbol (ϵ)
CTC output compression:
1 Merge repeated characters
2 Remove blanks
Thus, to model the same character twice in succession, separate the two with a blank
Some possible alignments for [h, e, l, l, o] and [h, e, l, o] given a 10-element input sequence:
[h, e, l, l, o]: h e l ϵ l l o ϵ ϵ ϵ;  h e l ϵ l l ϵ o o ϵ  (note the ϵ separating the two l's)
[h, e, l, o]: h e l l l l o ϵ ϵ ϵ;  h h ϵ e ϵ l ϵ o ϵ ϵ
CTC: Alignment example

Alignment:                                 h h e ϵ ϵ l l l ϵ l l o
First, merge repeated characters:          h e ϵ l ϵ l o
Then, remove any ϵ tokens:                 h e l l o
The remaining characters are the output:   h e l l o

(A small code version of this compression is given below.)

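The two-step compression just illustrated is easy to write down in code. A minimal Python sketch follows; the function name and the use of 'ϵ' as the blank symbol are illustrative.

```python
def ctc_collapse(alignment, blank="ϵ"):
    """Apply the CTC compression rules to a framewise label sequence."""
    # 1. Merge repeated characters.
    merged = []
    for symbol in alignment:
        if not merged or symbol != merged[-1]:
            merged.append(symbol)
    # 2. Remove blank tokens.
    return [s for s in merged if s != blank]

# The example from the slide above:
print(ctc_collapse(list("hheϵϵlllϵllo")))   # ['h', 'e', 'l', 'l', 'o']
```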


CTC: Valid and invalid alignments

Consider an output [c, a, t] with an input of length six

Valid alignments            Invalid alignments
ϵ c c ϵ a t                 c ϵ c ϵ a t    (corresponds to Y = [c, c, a, t])
c c a a t t                 c c a a t      (has length 5)
c a ϵ ϵ ϵ t                 c ϵ ϵ ϵ t t    (missing the 'a')

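One way to check the alignments above is simply to collapse them and compare with the target. The sketch below does this for the slide's examples; the function name is illustrative, and the check that the alignment length matches the number of input frames is left to the caller.

```python
from itertools import groupby

def is_valid_alignment(alignment, target, blank="ϵ"):
    """True if collapsing the alignment (merge repeats, drop blanks) gives the target."""
    collapsed = [symbol for symbol, _ in groupby(alignment) if symbol != blank]
    return collapsed == list(target)

print(is_valid_alignment(list("ϵccϵat"), "cat"))   # True
print(is_valid_alignment(list("cϵcϵat"), "cat"))   # False – collapses to [c, c, a, t]
print(is_valid_alignment(list("cϵϵϵtt"), "cat"))   # False – missing the 'a'
```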


CTC: Alignment properties

Monotonic – alignments are monotonic (a left-to-right model); no re-ordering (unlike neural machine translation)
Many-to-one – alignments are many-to-one; many inputs can map to the same output (however a single input cannot map to many outputs)
CTC doesn't find a single alignment: it sums over all possible alignments



CTC: Loss function (1)

Let C be an output label sequence, including blanks and repetitions – the same length as the input sequence X
Posterior probability of the output labels C = (c_1, ..., c_t, ..., c_T) given the input sequence X = (x_1, ..., x_t, ..., x_T):

    P(C|X) = \prod_{t=1}^{T} y(c_t, t)

where y(c_t, t) is the network output for label c_t at time t
This is the probability of a single alignment (a small numerical sketch follows this slide)

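The product above is straightforward to evaluate once the network outputs are available. A small NumPy sketch, assuming the outputs are stored as an array y with y[t, k] = y(k, t):

```python
import numpy as np

def single_alignment_prob(y, labels):
    """P(C|X) = prod_t y(c_t, t) for one framewise label sequence C (list of indices)."""
    T = y.shape[0]
    assert len(labels) == T, "one label per input frame"
    return float(np.prod(y[np.arange(T), labels]))
```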


CTC: Loss function (2)

Let S be the target output sequence after compression
Compute the posterior probability of the target sequence S = (s_1, ..., s_m, ..., s_M), with M ≤ T, given X by summing over the possible CTC alignments:

    P(S|X) = \sum_{C \in A(S)} P(C|X)

where A(S) is the set of label sequences C that can be mapped to S using the CTC compression rules (merge repeated labels, then remove blanks)
The CTC loss function L_CTC is the negative log likelihood of this sum over alignments:

    L_{CTC} = -\log P(S|X)

Perform the sum over alignments using dynamic programming – with a similar structure to the forward-backward and Viterbi algorithms (see Hannun for details; a minimal version of the forward recursion is sketched after this slide)
Various NN architectures can be used with CTC – usually a deep bidirectional LSTM RNN
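Below is a minimal NumPy sketch of that dynamic-programming sum over alignments: the forward recursion over the blank-extended label sequence. It works in probability space for readability; a practical implementation would work in log space to avoid underflow. The names and array layout are illustrative.

```python
import numpy as np

def ctc_forward(probs, target, blank=0):
    """Sum P(C|X) over all CTC alignments of `target` (the forward recursion).
    probs:  (T, K) array of per-frame output probabilities, probs[t, k] = y(k, t)
    target: list of label indices (the compressed sequence S), containing no blanks
    """
    T = probs.shape[0]
    ext = [blank]                      # blank-extended sequence: ϵ s1 ϵ s2 ... sM ϵ
    for s in target:
        ext += [s, blank]
    L = len(ext)

    alpha = np.zeros((T, L))           # alpha[t, i]: prob. of prefixes ending in ext[i] at time t
    alpha[0, 0] = probs[0, blank]
    if L > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for i in range(L):
            a = alpha[t - 1, i]                       # stay on the same extended label
            if i > 0:
                a += alpha[t - 1, i - 1]              # advance by one position
            if i > 1 and ext[i] != blank and ext[i] != ext[i - 2]:
                a += alpha[t - 1, i - 2]              # skip a blank (only between different labels)
            alpha[t, i] = a * probs[t, ext[i]]
    # Valid alignments must end in the last label or the final blank.
    return alpha[T - 1, L - 1] + (alpha[T - 1, L - 2] if L > 1 else 0.0)

# The CTC loss for one utterance is then: loss = -np.log(ctc_forward(probs, target))
```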
CTC: Distribution over alignments
We start with an input sequence, like a spectrogram of audio.
The input is fed into an RNN, for example.
The network gives p_t(a|X), a distribution over the outputs {h, e, l, o, ϵ} for each input step.
With the per-time-step output distribution, we compute the probability of different alignment sequences.
By marginalizing over alignments, we get a distribution over output sequences.
Understanding CTC: Conditional independence assumption

Each output is dependent on the entire input sequence (in Deep Speech this is achieved using a bidirectional recurrent layer)
Given the inputs, each output is independent of the other outputs (conditional independence)
CTC does not learn a language model over the outputs, although a language model can be applied later
Graphical model showing the dependencies in CTC: each output a_1, a_2, ..., a_T depends on the whole input X, but not on the other outputs



Understanding CTC: CTC and HMM

A left-to-right HMM over labels a, b corresponds to a CTC "HMM" with interleaved, skippable blank states: ϵ a ϵ b ϵ

CTC can be interpreted as an HMM with additional (skippable) blank states, trained discriminatively



Applying language models to CTC

Direct interpolation of a language model with the CTC acoustic model (a toy rescoring sketch follows this slide):

    \hat{W} = \arg\max_W ( \alpha \log P(S|X) + \log P(W) )

Only consider word sequences W which correspond to the subword sequence S (using a lexicon)
α is an empirically determined scale factor that matches the acoustic model to the language model
Lexicon-free CTC: use a "subword language model" P(S) (Maas et al, 2015)
WFST implementation: create an FST T which transforms a framewise label sequence C into the subword sequence S, then compose with the lexicon L and grammar G: T ◦ min(det(L ◦ G)) (Miao et al, 2015)

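As a toy illustration of the interpolation formula above, the sketch below picks the best word sequence from a set of candidate hypotheses, given pre-computed acoustic and language-model log-probabilities. The candidate format, the helper name, and the value of α are assumptions.

```python
def rescore(candidates, alpha):
    """Return the word sequence W maximising  alpha * log P(S|X) + log P(W).
    candidates: list of (acoustic_logprob, lm_logprob, word_sequence) tuples,
    where each word sequence corresponds (via a lexicon) to the subword sequence S."""
    best = max(candidates, key=lambda c: alpha * c[0] + c[1])
    return best[2]

# Hypothetical candidates: (log P(S|X), log P(W), W)
hyps = [(-12.0, -8.5, ["the", "cat", "sat"]),
        (-11.5, -15.0, ["the", "cats", "at"])]
print(rescore(hyps, alpha=0.8))   # ['the', 'cat', 'sat'] under this weighting
```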


Mozilla Deep Speech

Mozilla have released an open-source TensorFlow implementation of the Deep Speech architecture:
https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/
https://github.com/mozilla/DeepSpeech
Close to state-of-the-art results on LibriSpeech
Mozilla Common Voice project: https://voice.mozilla.org/en



Summary and reading

CTC is an alternative approach to sequence discriminative training, typically applied to RNN systems
Used in the "Deep Speech" architecture for end-to-end speech recognition

Reading
A Hannun et al (2014), "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567. https://arxiv.org/abs/1412.5567
A Hannun (2017), "Sequence Modeling with CTC", Distill. https://distill.pub/2017/ctc

Background reading
Y Miao et al (2015), "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding", ASRU 2015. https://ieeexplore.ieee.org/abstract/document/7404790
A Maas et al (2015), "Lexicon-free conversational speech recognition with neural networks", NAACL HLT 2015. http://www.aclweb.org/anthology/N15-1038

