
Progress in example based automatic speech recognition

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011


PROGRESS IN EXAMPLE BASED AUTOMATIC SPEECH RECOGNITION

Kris Demuynck, Dino Seppi, Hugo Van hamme, Dirk Van Compernolle
Katholieke Universiteit Leuven, Department of Electrical Engineering - ESAT
Kasteelpark Arenberg 10, Bus 2441, B-3001 Leuven, Belgium
{Kris.Demuynck,Dino.Seppi,Hugo.Vanhamme,Dirk.VanCompernolle}@esat.kuleuven.be

(This work was supported by FWO research project G.0260.07 "Telex", FWO travel grant K.02.105.10N, the Sound-to-Sense EU Marie Curie Research Training Network (RTN-CT-2006-035561) and the Johns Hopkins 2011 Summer Workshop on Speech Recognition with Segmental CRFs.)

ABSTRACT

In this paper we present a number of improvements that were recently made to the template based speech recognition system developed at ESAT. Combining these improvements resulted in a decrease in word error rate from 9.6% to 8.2% on the Nov92, 20k trigram, Wall Street Journal task. The improvements are along different lines. Apart from the time warping already applied within the DTW, it was found beneficial to apply additional length compensation on the template score. The single best score was replaced by a weighted k-NN average, while maintaining natural successor information as an ensemble cost. The local geometry of the acoustic space is now taken into account by assigning a diagonal covariance matrix to each input frame. Context sensitivity of short templates is increased by taking cross boundary scores into account for sorting the N best templates. Furthermore, boundaries on the template segmentations may be relaxed. Finally, context dependent word templates are now being used for short words. Several other variants that were not retained in the final system are discussed as well.

Index Terms— Speech Recognition, Template Based Recognition, Example Based Recognition, k Nearest Neighbours, DTW

1. INTRODUCTION

Speech recognition using an example based approach has gained significant attention over the past couple of years [1, 2, 3, 4, 5, 6]. At ESAT we have developed a large vocabulary template based speech recognition system. The original system was described in [1], with additional improvements reported in [7, 8]. In this paper we present significant new enhancements over these previous systems, bringing the system close in performance to state-of-the-art HMM systems.

The key idea behind our example based approach is to avoid the information loss that results from abstracting large amounts of data into compact models such as HMMs. Instead, we store the observed data enriched with some annotations and, when dealing with a classification or recognition problem, infer on-the-fly from the stored examples by finding those examples –chunks of audio with their corresponding annotations– that resemble the input data best. This on-the-fly inference has the potential advantage of making better use of the available data by focusing on the most relevant data, e.g. those examples of a phone which were uttered at the same speaking rate by a speaker with a similar accent, and which originate from the same word (or syllable) as the word the recognizer hypothesizes. The quality of the inference engine –typically k Nearest Neighbours (k-NN) based– is crucial to the performance of an example based system: it must be fast and yet at the same time make optimal use of the available data.

One weakness in our existing framework is –given the single best Viterbi decoding strategy– the system's inherent sensitivity to errors in the training database, such as incorrect annotations, bad segmentations, highly unusual pronunciations, and inaccurate or missing phonetic transcriptions in the lexicon. Previous work such as data sharpening [7] and data pruning [9] aimed directly at mitigating these problems. Some of the techniques presented in this paper continue along this research line, while other techniques aim to make better use of the available data.

The paper is organized as follows. First, we describe our baseline system. The following sections handle the major enhancements in the inference engine: (i) usage of weighted k-NN template scores instead of the single best Viterbi decoding, (ii) assigning a local sensitivity matrix (diagonal covariance) to each input frame, (iii) modifications to the dynamic time warping constraints and score computation, and (iv) usage of word based templates. The last sections deal with some open issues and possible further improvements.

Fig. 1. Word Graph with Phone Segment Annotations expanded to a Template Graph. (Shown: a word arc "sport" from a word graph, its underlying phone arcs /s/ /p/ /o/ /r/ /t/, and the expansion of each phone arc into k template arcs, i.e. its nearest neighbours.)
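To make the data structures behind figure 1 concrete, the sketch below shows one possible, purely illustrative representation of a word arc, its underlying phone arcs and the k-NN template lists attached to them. All class and field names are hypothetical and do not refer to our actual implementation.

```python
# Illustrative sketch of the template graph of Fig. 1 (hypothetical names, not
# the actual implementation): a word arc from the word graph is expanded into
# its phone arcs, and every phone arc carries a list of k nearest templates.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Template:
    phone: str            # phone label of the stored example
    start: int            # frame range of the example in the training corpus
    end: int
    score: float = 0.0    # DTW distance to the hypothesized segment (filled in later)

@dataclass
class PhoneArc:
    phone: str
    start: int            # frame range of the hypothesized segment in the test utterance
    end: int
    knn: List[Template] = field(default_factory=list)   # k nearest-neighbour templates

@dataclass
class WordArc:
    word: str
    phones: List[PhoneArc]

# The word arc "sport" of Fig. 1, expanded into its underlying phone arcs.
sport = WordArc("sport", [PhoneArc(p, 0, 0) for p in ["s", "p", "o", "r", "t"]])
```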
2. BASELINE SYSTEM

The Wall Street Journal database was used for all experiments reported on in this paper. Results are given for the Nov92 20k open vocabulary non verbalized punctuation test set using the default trigram language model. The phonetic transcriptions for the training data and the 20k test lexicon were drawn from CMUdict 0.6d.

A baseline template based system was created according to the principles described in [8]: an HMM system generates word graphs enriched with phone segmentations; k-NN template lists are added to each phone; the word score is calculated with a Viterbi search through the templates after applying the template transition costs. This process is illustrated in figure 1.

The baseline HMM system is trained on the WSJ0+1 SI-284 data comprising 81 hours from 284 speakers. SPRAAK [10] was used to create a conventional HMM system using Mel spectra, incorporating vocal tract length normalization and mean subtraction, postprocessed by mutual information discriminant analysis (MIDA), as features. The HMM system uses a shared pool of 32k Gaussians and 5875 cross-word context-dependent tied triphone states.

For the template system, all data from the 284 training speakers are used, without any cleaning for pronunciation or transcription errors. The baseline HMM system was used to segment the training database into 2826699 phone templates. The features used in the template system are the same as in the HMM system, except for additional data sharpening to reduce the influence of outliers [7]. Using the decision tree approach from [8], the phone templates were subdivided into 4219 cross-word context-dependent triphone classes, with a minimum of 256 templates per class. Context-dependent variants of the word arcs (see figure 1) are added so that the triphone class for each word boundary phone is uniquely defined by the word internal adjacent phone on the one side and all possible cross-word adjacent phones on the other side.

For development and parameter tuning we relied on the Dev92 development set. Optimization was done either on word error rate (WER) or on the product of the posterior probabilities on the correct path when WER was deemed too noisy.
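As an illustration of the baseline decoding step (a simplified sketch with invented data layouts, not our actual implementation), the following code performs a Viterbi search over the k-NN template lists of consecutive phone arcs, adding a fixed cost on every transition between two templates that are not immediate successors in the training database (the natural successor cost of [1], see section 3).

```python
# Simplified sketch of the baseline template-level Viterbi search; the data
# layouts and cost values are invented for illustration.
import numpy as np

def template_viterbi(scores, successor, ns_cost=1.0):
    """scores:    list of 1-D arrays, scores[a][i] = DTW score of template i
                  on phone arc a (lower is better).
       successor: list of 2-D boolean arrays, successor[a][i, j] = True if
                  template j on arc a+1 is the natural successor of template i
                  on arc a in the training database.
       Returns the best cumulative score over one template per arc."""
    best = scores[0].copy()                              # best score ending in each template of arc 0
    for a in range(1, len(scores)):
        trans = np.where(successor[a - 1], 0.0, ns_cost)  # k x k transition costs
        best = (best[:, None] + trans).min(axis=0) + scores[a]
    return best.min()

# Toy example: 3 phone arcs, k = 4 templates each, random scores and successors.
rng = np.random.default_rng(0)
scores = [rng.random(4) for _ in range(3)]
successor = [rng.random((4, 4)) < 0.1 for _ in range(2)]
print(template_viterbi(scores, successor))
```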
3. WEIGHTED K-NN SCORES

Despite data sharpening, the single best Viterbi decoding approach used in [1, 7, 8] remains sensitive to outlier and mislabeling effects. One solution is to use the forward(-backward) algorithm in combination with a proper transformation of the scores instead of the Viterbi algorithm for decoding, similar to what was done in [11]. The built-in averaging of the scores in the forward algorithm automatically mitigates the impact of outliers. In this work, we opted for an even simpler solution: averaging the scores –again after a proper transformation– before the decoding, hence avoiding Viterbi/forward decoding at the template level altogether. Note that k nearest neighbour templates per phone arc (see figure 1) give rise to k×k transitions, resulting in a non-negligible cost for the template based Viterbi/forward decoding for typical values of k = 50...100.

The natural successor cost [1], a cost that is added on the transition between any two templates that are not immediate successors in the training database, has given consistent relative error reductions between 5 and 10%. Using weighted average scores requires another mechanism to preserve this valuable information. In the new implementation –before averaging– a natural successor cost is added to a template score if none of the adjacent template arcs is the natural successor. This is repeated for the left and right context.

We also noticed that scores could best be adjusted based on the length of the phone arc under consideration. This is due to the fact that for shorter phone arcs there are (i) more candidate templates, (ii) fewer frames to match and hence less chance of badly matching frames, and (iii) the Itakura constraints [12] in the dynamic time warping (DTW) are less restrictive on the boundary frames. Equation 1 shows the score length compensation and score averaging that was used in the final system, with l the length of the phone arc in frames, s_i the i-th best template score, c_i^l and c_i^r the left and right natural successor cost, α = 0.15 and N = 5:

    s' = \alpha^{l} \times \sqrt{\frac{N}{\sum_{i=1}^{N} (s_i + c_i^l + c_i^r)^{-2}}}     (1)

This equation was chosen as it does a reasonable job of approximating the process of first transforming the template scores (norm-2 distances) into some probability measure, followed by the weighted average of the probabilities, followed by a back transformation from the average probability to an average template score.
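The following sketch illustrates the score combination of equation (1): the N best template scores, augmented with their natural successor costs, are transformed to 1/(.)^2, averaged, transformed back, and scaled with the length compensation factor. It is an illustration only; function names and the toy values are arbitrary.

```python
# Illustrative implementation of Eq. (1): harmonic-style averaging of the N
# best template scores (plus natural successor costs) and length compensation
# with alpha**l. Toy values only, not the actual implementation.
import numpy as np

def ensemble_score(s, c_left, c_right, arc_len, alpha=0.15, N=5):
    """s, c_left, c_right: scores and left/right natural successor costs of
       the N best templates of one phone arc; arc_len: arc length in frames."""
    s, c_left, c_right = (np.asarray(a, float)[:N] for a in (s, c_left, c_right))
    inv_sq = 1.0 / (s + c_left + c_right) ** 2   # score -> pseudo-probability
    avg = inv_sq.mean()                          # average over the N best templates
    back = 1.0 / np.sqrt(avg)                    # back-transform: sqrt(N / sum(...))
    return alpha ** arc_len * back               # length compensation alpha**l

# One phone arc of 6 frames with 5 candidate templates and no successor costs.
print(ensemble_score([2.1, 2.3, 2.4, 2.8, 3.0], [0] * 5, [0] * 5, arc_len=6))
```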
4. LOCAL SENSITIVITY MATRIX

The investigation of distance measures for DTW in [13] led to contradictory results: either the Euclidean or the Mahalanobis distance measure performed best, depending on the task (phone classification versus continuous speech recognition). This behaviour could be related to whether one was looking in the near neighbourhood (single frame classification) of the input frame or whether one also needed to consider not so near reference frames, e.g. when matching longer templates. By making the covariance (sensitivity) matrix needed for the Mahalanobis distance dependent on the input frame instead of the reference frame (more correctly, the triphone class the reference frame belongs to), most of the problems that adversely affect the Mahalanobis distance measure in [13] can be avoided. Since the Jacobian now depends on the input frame, it can readily be dropped. Furthermore, the properties of the distance measure no longer change dramatically based on the (reference) frame one compares the input frame with. Yet, at the same time, the distance measure is adaptive to the local distribution (manifold) of speech data.

One available resource that already models the local distribution of the speech data is the shared pool of Gaussians from the HMM baseline system. We used the set of (diagonal covariance) Gaussians N(x; μ_i, Σ_i), i ∈ [1...M], with corresponding a priori probabilities α_i, as follows to assign a covariance Σ_x to an input frame x:

    \Sigma_x = \frac{\sum_{i=1}^{M} \alpha_i \, \mathcal{N}(x;\mu_i,\Sigma_i)^{\beta} \, \Sigma_i}{\sum_{i=1}^{M} \alpha_i \, \mathcal{N}(x;\mu_i,\Sigma_i)^{\beta}}     (2)

By setting the parameter β to a value smaller than 1.0 (β equaled 0.4 in our experiments), Σ_x changes more slowly as a function of x. From a theoretical point of view, β reflects the difference between the physical dimensionality of the observation vector x (39 in our experiments) and the intrinsic dimensionality of speech (which is somewhere between 4 and 11, depending on the phone [14]).
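The sketch below computes the diagonal of Σ_x for a toy pool of diagonal Gaussians according to equation (2). It is illustrative only; the rescaling in the log domain merely avoids numerical underflow and does not change the result.

```python
# Illustrative computation of Eq. (2): a likelihood-weighted average of the
# Gaussian covariances, with the likelihoods tempered by beta < 1.
import numpy as np

def local_sensitivity(x, priors, means, variances, beta=0.4):
    """x: (D,) input frame; priors: (M,); means, variances: (M, D) for a pool
       of M diagonal Gaussians. Returns the (D,) diagonal of Sigma_x."""
    x = np.asarray(x, float)
    # log N(x; mu_i, Sigma_i) for diagonal covariances
    log_gauss = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances).sum(axis=1)
    # weights alpha_i * N(...)**beta, rescaled by the maximum for numerical stability
    w = priors * np.exp(beta * (log_gauss - log_gauss.max()))
    return (w[:, None] * variances).sum(axis=0) / w.sum()

# Toy pool of M = 3 Gaussians in D = 2 dimensions.
rng = np.random.default_rng(1)
print(local_sensitivity(rng.random(2), np.ones(3) / 3, rng.random((3, 2)), np.ones((3, 2))))
```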
5. DTW SCORING

In this section we investigate the integration of the frame distances into a template distance by means of (dynamic) time warping. Dynamic time warping aligns each input frame x_i to one reference frame y_j given some transition constraints, so that the resulting sum of inter frame distances and transition costs is minimal. A common DTW implementation, which is also our reference implementation, allows the three types of transitions depicted in figure 2. Each of these transitions corresponds to a different rate of speech for the test data versus the reference template:

T0: (x_{i-1}, y_j) → (x_i, y_j)        extremely slow rate of speech
T1: (x_{i-1}, y_{j-1}) → (x_i, y_j)    same rate of speech
T2: (x_{i-1}, y_{j-2}) → (x_i, y_j)    double rate of speech

Fig. 2. DTW Path Constraints. (Shown: the transitions T0, T1 and T2 on a grid of test data frames x_i versus reference template frames y_j.)

The Itakura constraints [12] make the type T0 and T2 transitions more symmetric by allowing a T0 transition only after a T1 or T2 transition. This way, the minimal rate of speech of the test data is now at least half of that in the reference template (an alternation of T0 and T1 transitions). We also tried out a quite different warping strategy: linear warping. Given that the average length of a phone is only 8 frames, and given that the training database contains examples uttered at different speaking rates, dynamic warping may be inessential or even undesirable.

Experiments with the different warping constraints were quite conclusive: the limitations imposed by the Itakura constraints improved the accuracy with respect to the baseline, while linear warping was clearly inferior to dynamic warping.

Another open issue with DTW is how to penalize non-diagonal (fast and slow speaking rate) transitions. This penalty can either be a fixed additive cost or a cost proportional to the inter frame distance (implemented by multiplying the local distance with some factor higher than 1.0). Whereas previous experiments [1] were inconclusive, the current WSJ setup showed a clear preference for additive costs. Additive costs also make more sense from a theoretical point of view since (i) additive costs can be interpreted as transition probabilities, similar to the transition probabilities in an HMM system, and (ii) multiplicative costs will upscale the inter frame distances for exactly those frames which need a non-diagonal warping; in other words, more weight is assigned to the acoustic (dis-)similarity of those frames which show a non-standard behaviour.

Finally, we investigated the impact of promoting acoustic continuity when selecting the k-NN templates. This was done by taking the similarity between the audio to the left and right of a reference template and the audio surrounding a hypothesized phone or word arc into account during k-NN template selection. Figure 3 shows the most flexible configuration we investigated. DTW with backtracking over the extended region allows us to subdivide the DTW score into similarity measures for the central part and the left and right surroundings. Characteristic for this approach is that it allows time warping in the surrounding regions, which in turn allows for some automatic adjustment of the template boundaries. Relaxing the template boundaries has the potential benefit of making the system less dependent on the quality of the template boundaries and on the exact begin and end time of the phone and word arcs in the word graph. We also experimented with a second setup which took the template and phone/word arc boundaries as given. Since the left and right surrounding regions only comprise a few frames, time warping in these regions might be unnecessary, i.e. when calculating the acoustic similarity for the left and right l_x surrounding frames, a straightforward 1-to-1 alignment is assumed (the thick dashed alignment path in figure 3).

Fig. 3. Dynamic Time Warping with Boundary Extensions. (Shown: test data (x) versus reference template (y); the central region, delimited by the phone/word arc boundaries, is aligned with dynamic time warping and yields the central DTW score; the l_x left and right surrounding frames, with variable start/end boundaries for the templates, are aligned either with time warping or with a 1-on-1 mapping without time warping.)

In both setups, the extended score (sum of the scores of the central, left and right surrounding regions) was only used for ranking the templates for k-NN selection. Decoding relied on the score of the central part only. Multiple experiments consistently showed a preference for quite large contextual windows (up to 9 frames wide), which may be surprising given that the context-dependent templates already have context sensitivity built in. Given these wider context windows, it was in the end not surprising that the normal dynamic time warping should also be applied in these regions.
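The recursion below sketches the retained DTW variant: the T0/T1/T2 transitions of figure 2, an additive penalty on the non-diagonal transitions, and the Itakura-style restriction that a T0 transition may only follow a T1 or T2 transition. Frame distances and the penalty value are arbitrary placeholders, not the settings of the actual system.

```python
# Illustrative DTW recursion with T0/T1/T2 transitions, additive penalties on
# the non-diagonal transitions, and "no T0 after T0" (Itakura-style) constraint.
import numpy as np

def dtw(dist, penalty=1.0):
    """dist[i, j]: distance between test frame i and reference frame j.
       Returns the best path cost from (0, 0) to (I-1, J-1)."""
    I, J = dist.shape
    INF = np.inf
    # cost[i, j, s]: best cost ending in (i, j); s = 1 if the last move was T0.
    cost = np.full((I, J, 2), INF)
    cost[0, 0, 0] = dist[0, 0]
    for i in range(1, I):
        for j in range(J):
            d = dist[i, j]
            # T1: (i-1, j-1) -> (i, j), same rate of speech, no penalty
            if j >= 1:
                cost[i, j, 0] = min(cost[i, j, 0], cost[i - 1, j - 1].min() + d)
            # T2: (i-1, j-2) -> (i, j), double rate, additive penalty
            if j >= 2:
                cost[i, j, 0] = min(cost[i, j, 0], cost[i - 1, j - 2].min() + d + penalty)
            # T0: (i-1, j) -> (i, j), slow rate, only allowed after T1 or T2
            cost[i, j, 1] = cost[i - 1, j, 0] + d + penalty
    return cost[-1, -1].min()

# Toy example: a 6-frame test segment against an 8-frame reference template.
rng = np.random.default_rng(2)
print(dtw(rng.random((6, 8))))
```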
6. WORD BASED TEMPLATES

The natural successor cost (see [1] and section 3) improves the recognition accuracy by promoting the use of longer template sequences. A logical next step is to actually use longer templates. In this work, we experimented with whole word templates. Whole word templates are expected to be most effective for function words, since these are typically not modeled well by generic phonetic transcriptions and there are sufficient examples available to be used as reference templates by themselves.

The segmentation of the training database into word templates was done with the baseline HMM system, assuring consistency between the phone and word segmentations. The word templates are handled identically to the phone templates, using the same parameters for length compensation and averaging (section 3), the local sensitivity distance measure (section 4) and DTW scoring (section 5). Similar to the phone templates, context dependent variants were generated when sufficient examples were available. The questions for the decision tree for a specific word were limited to those applied to the left context of the word's initial phone and to the right context of the word's final phone. This assures that the word context dependency is a subset of the phone context dependency, which simplifies the decoding process.

We provided for a smooth transition between word and phone modeling for those words with no or only a small number of templates (e.g. non-function words). In those cases the word score was replaced by an average of the word score s_w and the phone score s_p, weighted with the number of context dependent word examples N_w that were returned by the k-NN search:

    s_w' = s_w \times \frac{N_w}{k} + s_p \times \left(1 - \frac{N_w}{k}\right)     (3)

with k the maximum number of near neighbour templates the inference engine looked for, which equaled 75 in our experiments.
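Equation (3) amounts to a simple linear interpolation between the word based and the phone based score, as in the following illustrative snippet (with k = 75 as above and arbitrary toy scores).

```python
# Illustrative implementation of Eq. (3): interpolate the word template score
# s_w and the phone based score s_p with weight N_w / k.
def smoothed_word_score(s_w, s_p, n_w, k=75):
    lam = n_w / k                        # weight of the word template score
    return s_w * lam + s_p * (1.0 - lam)

# A word for which only 15 of the k = 75 requested context dependent word
# examples were found leans mostly on the phone based score:
print(smoothed_word_score(s_w=4.0, s_p=5.0, n_w=15))   # 0.2 * 4.0 + 0.8 * 5.0 = 4.8
```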
7. RESULTS AND DISCUSSION

Table 1 summarizes the main results. Below, we provide additional information and insight and give our main conclusions.

Table 1. Baseline system and impact of the successive improvements. WER on the WSJ Dev92 and Nov92 20k trigram task.

System                                      Dev92    Nov92
initial template system                       -       9.6%
+ score averaging & adjustment              10.6%     8.9%
+ local sensitivity & boundary extension    10.3%     8.5%
+ word templates                            10.0%     8.2%

Score averaging & adjustment: The boost in performance thanks to the weighted average template scores was significant mainly because we managed to keep exploiting the natural successor concept. Natural successor costs are no longer applied on individual inter template transitions but are now incorporated in the single ensemble DTW score as an average natural successor cost for the selected k-NN neighbours.

Local sensitivity & distance metric: To understand the impact of the local distance modeling, it is important to realize that our raw features have undergone a MIDA transformation. MIDA is not only discriminative, but also decorrelates the features and scales the axes properly. The outcome is a feature space in which a Euclidean distance metric performs remarkably well. Hence the modeling of the local manifold, which is not discriminative in its approach, provides only incremental improvements over a single global MIDA transform. When using for example Mel-Cepstra as features, much larger improvements can be observed. Instead of an L2 based distance measure, one may also opt for radically different metrics. One promising alternative is the sparse decompositions proposed in [6].

Boundary extension: Context-dependent templates enforce continuity at the symbolic (phone) level. Natural successors promote acoustic continuity, although in a limited and strict sense only. The consistent positive results obtained with both methods indicate that it is important to foster symbolic and acoustic template continuity. The "boundary extension" does exactly that. While it tackles problems due to poor segmentations in the database, its main contribution is that, by using overlapping segments, the k-NN lists of adjacent phones contain more natural successors (from 5.6% for l_x = 0 to 12.7% for l_x = 9) and hence favor a greater naturalness between adjacent template ensembles.

Multi-phone templates: Word templates are well suited for the frequent (function) words. Less frequent words may benefit from smaller, more generic units, e.g. syllables. The current framework already allows for multiple levels of symbolic representation, so word and syllable units can even be combined.

Computational load: The use of ensemble scores (k-NN weighted average template scores) reduces the decoder complexity significantly: the single best template decoder had a search space that was an order of magnitude larger. By means of a careful implementation, the overhead of the template based score computation was reduced to a 0.1xRT load on a 12-core machine. Hence the current implementation lends itself much better than its predecessors to use with larger databases. When handling large databases, the template induced overhead can be kept constant by creating more context dependent variants and by preselecting templates on the basis of their length (within a small range of the test segment length).

8. CONCLUSIONS

In this paper we presented a set of relatively simple techniques for improving template based speech recognition, leading to a 15% relative word error rate reduction. One key advance was the migration from a single best template decoder to a phone/word decoder which uses ensemble scores (k-NN weighted average template scores). Furthermore, by maintaining the natural successor costs –now calculated based on adjacent template ensembles– and by adding "boundary extensions", template continuity was improved. While already close to state-of-the-art HMM systems, the template based approach still has room for significant improvements by exploiting the wealth of meta-information that is present in the template lists. Such an approach is presented in a companion paper [15]. Further improvements may come from newly emerging techniques such as sparse decompositions [6].

9. REFERENCES

[1] Mathias De Wachter, Mike Matton, Kris Demuynck, Patrick Wambacq, Ronald Cools, and Dirk Van Compernolle, "Template based continuous speech recognition," IEEE Trans. on ASLP, vol. 15, pp. 1377-1390, May 2007.
[2] Viktoria Maier and Roger K. Moore, "Temporal episodic memory model: An evolution of Minerva2," in Proc. INTERSPEECH, Aug. 2007, pp. 866-869.
[3] V. Ramasubramanian, Kaustubh Kulkarni, and Bernhard Kämmerer, "Acoustic modeling by phoneme templates and modified one-pass DP decoding for continuous speech recognition," in Proc. ICASSP, Apr. 2008, pp. 4105-4108.
[4] Xie Sun and Yunxin Zhao, "Integrate template matching and statistical modeling for speech recognition," in Proc. INTERSPEECH, Sept. 2010, pp. 74-77.
[5] Ladan Golipour and Douglas O'Shaughnessy, "Phoneme classification and lattice rescoring based on a k-NN approach," in Proc. INTERSPEECH, Sept. 2010, pp. 1954-1957.
[6] Dimitri Kanevsky, Tara N. Sainath, Bhuvana Ramabhadran, and David Nahamoo, "An analysis of sparseness and regularization in exemplar-based methods for speech classification," in Proc. INTERSPEECH, Sept. 2010, pp. 2842-2845.
[7] Mathias De Wachter, Kris Demuynck, and Dirk Van Compernolle, "Outlier correction for local distance measures in example based speech recognition," in Proc. ICASSP, Apr. 2007, vol. IV, pp. 433-436.
[8] Sébastien Demange and Dirk Van Compernolle, "HEAR: an hybrid episodic-abstract speech recognizer," in Proc. INTERSPEECH, Sept. 2009, pp. 3067-3070.
[9] Dino Seppi and Dirk Van Compernolle, "Data pruning for template-based automatic speech recognition," in Proc. INTERSPEECH, Sept. 2010, pp. 901-904.
[10] Kris Demuynck, Jan Roelens, Dirk Van Compernolle, and Patrick Wambacq, "SPRAAK: An open source speech recognition and automatic annotation kit," in Proc. INTERSPEECH, Sept. 2008, p. 495.
[11] Kris Demuynck, Dirk Van Compernolle, and Patrick Wambacq, "Doing away with the Viterbi approximation," in Proc. ICASSP, May 2002, vol. I, pp. 717-720.
[12] Fumitada Itakura, "Minimum prediction residual principle applied to speech recognition," IEEE Trans. on ASSP, vol. 23, no. 1, pp. 67-72, Feb. 1975.
[13] Mathias De Wachter, Kris Demuynck, Patrick Wambacq, and Dirk Van Compernolle, "Evaluating acoustic distance measures for template based recognition," in Proc. INTERSPEECH, Aug. 2007, pp. 874-877.
[14] Mattias Nilsson and Bastiaan Kleijn, "On the estimation of differential entropy from data located on embedded manifolds," IEEE Trans. on Information Theory, vol. 53, no. 7, July 2007.
[15] Kris Demuynck, Dino Seppi, Dirk Van Compernolle, Geoffrey Zweig, and Patrick Nguyen, "Integrating meta-information into exemplar-based speech recognition with segmental conditional random fields," in Proc. ICASSP, 2011, submitted.