
Advances in automatic transcription of Italian broadcast news

2000


ADVANCES IN AUTOMATIC TRANSCRIPTION OF ITALIAN BROADCAST NEWS

F. Brugnara, M. Cettolo, M. Federico, D. Giuliani
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica
I-38050 Povo, Trento, Italy

ABSTRACT

This paper presents some recent improvements in the automatic transcription of Italian broadcast news obtained at ITC-irst. A preliminary activity was carried out to develop a suitable speech corpus for the Italian language. The resulting corpus, formed by recordings covering 30 hours of radio news, was exploited for developing a baseline system for the transcription of broadcast news. The system operates in several stages: acoustic segmentation and classification, speaker clustering, acoustic model adaptation and speech decoding. The major recent advances behind the performance improvements concern speech segmentation and clustering, acoustic modeling, acoustic model adaptation and the language model. The transcription system achieves a 14.3% word error rate on planned studio speech and 18.7% on the whole test set formed by recordings of radio broadcast news. When applied to a test set formed by recordings of television broadcast news, the system achieves a 16.5% word error rate on planned studio speech and 23.2% on the whole test set.

1. INTRODUCTION

Technologies that ease the management of, and the access to, multimedia archives are receiving more and more attention, due to the increasing availability of large multimedia digital libraries. Audio transcription and indexing of multimedia archives are among these emerging technologies.

This work presents a system for the automatic transcription of Italian broadcast news. A preliminary activity was carried out to develop a suitable speech corpus for the Italian language. 30 hours of recordings, covering radio news of several years, were collected, labelled and transcribed. The annotated material forms a speech corpus called the Italian Broadcast News Corpus (IBNC).
This corpus was used for developing the system. The current system operates in four stages [1]: acoustic segmentation and classification, speaker clustering, acoustic model adaptation, and speech transcription. The transcription system was tested on two different sets of data formed, respectively, by recordings of radio and television broadcast news. On radio recordings, the system achieves a 14.3% word error rate on planned studio speech and 18.7% on the whole test set. When applied to television broadcast news, the system achieves a 16.5% word error rate on planned studio speech and 23.2% on the whole test set.

The paper is organized as follows. Section 2 introduces the data collection and annotation activities ongoing at ITC-irst. Section 3 briefly sketches the audio segmentation, classification and clustering algorithms. Section 4 outlines the adopted acoustic modeling, language model, recognition engine and speaker adaptation. Experimental results are presented in Section 5.

2. BROADCAST NEWS CORPORA

Since Summer 1999, ITC-irst has been collecting Italian broadcast news corpora. First, the IBNC corpus was developed under a contract with ELRA/ELDA (European Language Resources Association). RAI, the major Italian broadcast company, supplied recordings of radio news programs sampled from its internal digital archive. The collection consists of 150 programs, for a total time of about 30 hours, issued between 1992 and 1999. The corpus contains about 7 hours of telephone speech. The IBNC was segmented, labelled and transcribed following conventions similar to those adopted by the Linguistic Data Consortium (www.ldc.upenn.edu) for the HUB-4 corpora. The corpus, which was released in April 2000, will be distributed by ELRA.

Recently, ITC-irst has started to develop an in-house corpus of television broadcast news, a small part of which has been used in this work. According to our plans, by the end of this year about 100 hours of transcribed material will be available for development and evaluation purposes.

3. SEGMENTATION AND CLUSTERING

The Bayesian Information Criterion (BIC) [2] is applied to segment the input audio stream into acoustically homogeneous chunks. Gaussian mixture models are then used to classify segments in terms of acoustic source and channel. Emission probability densities consist of mixtures of 1024 multivariate Gaussian components with diagonal covariance matrices. Observations are 39-dimensional vectors (see Section 4.1). Six classes are considered for classification: female/male wide-band speech, female/male narrow-band speech, pure music, and silence plus other non-speech events. Clustering of speech segments is done by a bottom-up scheme [2, 3] that groups segments which are acoustically close with respect to the BIC. As a result, this step should gather together the segments of the same speaker.

To evaluate how well the segmentation algorithm detects break points, recall and precision are computed with respect to the target (manually annotated) boundaries: with a time tolerance of half a second, the two measures are 82.9% and 85.9%, respectively. Classification accuracy, in terms of the six classes above, is 95.6%. Defining the purity of a cluster as the percentage of speech uttered by its dominant speaker, the clustering algorithm provides an average cluster purity of 94.1%. More details about the content of this section can be found in [4].

4. SPEECH TRANSCRIPTION

4.1. Acoustic Modeling

Acoustic modeling is based on continuous-density HMMs. The acoustic parameter vector comprises 12 mel-scaled cepstral coefficients, the log-energy, and their first and second time-derivatives. For each channel condition, a set of context-dependent units is defined, based on the SAMPA phonetic alphabet. Some additional units have also been introduced to cover silence, background noise and a number of spontaneous speech phenomena.
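The 39-dimensional observation vector is obtained by appending first and second time-derivatives to the 13 static parameters (12 mel-scaled cepstral coefficients plus the log-energy). The sketch below illustrates the derivative computation with the common linear-regression formula over a window of +/-2 frames; the exact window used by the system is not stated in the paper, so it is an assumption here.

```python
import numpy as np

def add_deltas(static, window=2):
    """Append first and second time-derivatives to static features.

    static: (T, 13) array of 12 mel-cepstral coefficients plus log-energy
    per frame; returns a (T, 39) array of observation vectors.
    Derivatives use the usual linear-regression formula over +/- `window`
    frames, padding the sequence edges by repetition.
    """
    def delta(feat):
        feat = np.asarray(feat, dtype=float)
        T = len(feat)
        denom = 2 * sum(th * th for th in range(1, window + 1))
        padded = np.pad(feat, ((window, window), (0, 0)), mode='edge')
        d = np.zeros_like(feat)
        for th in range(1, window + 1):
            # regression numerator: sum_th th * (c[t+th] - c[t-th])
            d += th * (padded[window + th: window + th + T]
                       - padded[window - th: window - th + T])
        return d / denom

    static = np.asarray(static, dtype=float)
    d1 = delta(static)          # first derivatives (velocity)
    d2 = delta(d1)              # second derivatives (acceleration)
    return np.hstack([static, d1, d2])
```

For interior frames of a linear ramp, the first derivative equals the ramp slope exactly, which gives a quick sanity check of the regression formula.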
These additional units are used both in training and in recognition, by allowing their optional insertion between words.

Acoustic training was performed with Baum-Welch reestimation using the training portion of the IBNC corpus, augmented with other datasets collected at ITC-irst. Table 1 shows the amount of training data used for each model set. The lexicon was produced with an automatic transcription tool, and then manually checked to compensate for possible errors in the transcription of acronyms and foreign words.

4.1.1. Context-dependent models

The context-dependent models include a set of triphones which are well represented in the training data, augmented with a set of left-dependent or right-dependent diphones used as backoff models for unseen contexts. A backoff unit is defined as the "union" of all units which share a partial context: for example, a left-dependent unit is defined as the union of all the triphones which share a common left context, plus a possible "remainder" for that left context (see Figure 1). The "remainder" unit is meant to represent the collection of all the contexts of a phone which appear in the training set, but not often enough to have a dedicated model. Only triphones and remainders are directly estimated on the training database. The models of backoff units are built afterwards by means of an agglomeration technique described below.

First, a set of "trainable" triphones is selected by imposing a threshold on their occurrences in the training data, which was set to 10 for these experiments. A transcription of the training data is then generated by using this set of units. When a triphone appears that is not in the selected list, it is mapped to a remainder unit. Of the two possible remainders that could be substituted, the one with the highest number of occurrences is chosen.
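The selection of trainable triphones and the mapping of rare contexts to remainder units can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the "l-p+r" unit notation and the function names are assumptions.

```python
from collections import Counter

def select_units(triphone_counts, threshold=10):
    """Split triphones into trainable units and remainder mappings.

    triphone_counts: Counter of 'l-p+r' context-dependent units.
    Triphones seen at least `threshold` times get their own model; the
    rest are mapped to the left or right remainder unit ('l-p+*' or
    '*-p+r'), whichever covers more training occurrences.
    """
    def parse(t):
        left, rest = t.split('-')
        phone, right = rest.split('+')
        return left, phone, right

    trainable = {t for t, c in triphone_counts.items() if c >= threshold}

    # Occurrences covered by each remainder: all rare triphones that
    # share the corresponding partial context.
    rem_counts = Counter()
    for t, c in triphone_counts.items():
        if t in trainable:
            continue
        left, phone, right = parse(t)
        rem_counts[f'{left}-{phone}+*'] += c
        rem_counts[f'*-{phone}+{right}'] += c

    # Map each rare triphone to the more frequent of its two remainders.
    mapping = {}
    for t in triphone_counts:
        if t in trainable:
            continue
        left, phone, right = parse(t)
        l_rem, r_rem = f'{left}-{phone}+*', f'*-{phone}+{right}'
        mapping[t] = l_rem if rem_counts[l_rem] >= rem_counts[r_rem] else r_rem
    return trainable, mapping
```

For instance, with counts {'a-b+c': 20, 'a-b+d': 3, 'e-b+d': 2} and threshold 10, only 'a-b+c' is trainable, and both rare triphones map to the right remainder '*-b+d', which covers five occurrences against at most three for the left remainders.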
This set of models is then trained with Baum-Welch reestimation, starting from single-component mixtures and iteratively splitting Gaussians, selected according to their usage counters, during the first iterations. Besides, references to Gaussians are pruned from mixtures when their weights fall below a predefined threshold. The tying scheme enforces sharing of Gaussian components among the allophone mixtures of a particular phone that are in the same position within the model. Mixture weights are left specific to each model. During reestimation, all the statistics of transitions and distributions are stored for later use by the agglomeration procedure.

As previously stated, models for backoff units are built by agglomeration. For each backoff unit, a model is built whose topology corresponds to the common topology of the allophones of the base phone. The transition probabilities are then estimated by summing the counters computed during Baum-Welch reestimation of all the subsumed triphones. Similarly, each mixture of the agglomerate model is built by joining the components of all the mixtures in the same position within the triphone models, and estimating their weights by combining the statistics of the weights of the included mixtures. The backoff models therefore contain model-specific mixtures whose components belong to the common Gaussian pool of the allophones. The same procedure was applied to the wide-band and narrow-band models. The characteristics of the resulting model sets are reported in Table 1.

                Training Data   #Words   #Triphones   #Backoff Models   #Gaussians
  wide-band     26h:42m         236000   6150         2327              15752
  narrow-band   15h:42m         121000   3814         1181              10573

Table 1: Properties of the two model sets used in the experiments.

As for the training lexicon, the recognition lexicon is generated using triphones only when there is a minimum number of occurrences in the training set. This time, however, missing triphones are not mapped to remainders, but to backoff units, and the threshold is set to 50. Thresholding is applied to backoff units as well, so that context-independent units can appear.

[Figure 1: Model hierarchy. The symbol * denotes the "remainder" context. The figure shows the hierarchy phone - diphones - triphones for a phone _a_: "agglomerate" backoff diphone models such as b_a_* and *_a_v subsume "real" triphone models such as b_a_n, b_a_r, f_a_v, l_a_v and r_a_v.]

4.2. Language Model

A trigram language model was developed mainly by exploiting newspaper text sources. In particular, a 133M-word collection of the nationwide newspaper La Stampa was employed, which includes all issues between 1992 and 1998. Moreover, the broadcast news transcriptions of the training data were added. Numeric expressions occurring in the texts were replaced by labels; as explained later, during the decoding phase these labels are linked to specific rule-based LMs. A lexicon of the 64K most frequent words was selected. The lexicon gives a 2.2% OOV rate on the newspaper corpus and about 1.6% on the IBNC corpus. An interpolated trigram LM was estimated by employing a nonlinear discounting function and a pruning strategy that deletes trigrams on the basis of their context frequency. A good trade-off was obtained by using a shift-1 discounting function and by pruning all trigrams with a context frequency lower than 10. This results in a pruned LM with a perplexity of 188 and a size of 14M.

4.3. Recognition Engine

The recognizer is a single-step Viterbi decoder. The basic Viterbi beam-search algorithm is extended [5] to deal with recurrent transition networks, which in this case are used to represent subgrammars in the LM for common entities such as cardinal numbers, ordinal numbers, and percentages. The use of subgrammars and a recursive decoder reduces the network size during decoding, as there is no need to embed the subgrammars in the main trigram network for each occurrence, and also allows for a more robust estimation of the LM. The 64K-word trigram LM is mapped into a static network with a shared-tail topology [6]. The main network has about 11M states, 10M named transitions and 17M empty transitions.
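The LM estimation of Section 4.2 combines a shift discounting function with context-frequency pruning. The toy sketch below illustrates the idea under simplifying assumptions (shift-1 discounting interpolated with a shifted bigram, pruned trigrams backing off entirely to the bigram); all function and parameter names are illustrative, not those of the actual toolkit.

```python
from collections import Counter

def shifted_trigram_lm(tokens, beta=1.0, prune_ctx=10):
    """Interpolated trigram LM with shift discounting and pruning.

    A constant `beta` is subtracted from every n-gram count and the
    freed probability mass is redistributed via the lower-order
    distribution. Trigrams whose two-word context occurs fewer than
    `prune_ctx` times are dropped and back off to the bigram.
    Returns p(w | h1, h2) as a callable.
    """
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def p_uni(w):
        return uni[w] / N

    def p_bi(w, h):
        ch = uni[h]
        if ch == 0:
            return p_uni(w)
        disc = max(bi[(h, w)] - beta, 0) / ch
        # interpolation weight = discounted mass for this history
        lam = beta * sum(1 for (a, _) in bi if a == h) / ch
        return disc + lam * p_uni(w)

    def p_tri(w, h1, h2):
        ctx = bi[(h1, h2)]
        if ctx < prune_ctx or ctx == 0:
            return p_bi(w, h2)      # pruned context: back off entirely
        disc = max(tri[(h1, h2, w)] - beta, 0) / ctx
        lam = beta * sum(1 for t in tri if t[:2] == (h1, h2)) / ctx
        return disc + lam * p_bi(w, h2)

    return p_tri
```

Because the mass subtracted from seen trigrams exactly equals the interpolation weight given to the bigram distribution, the conditional probabilities sum to one over the vocabulary for any history.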
4.4. Acoustic Model Adaptation

On each cluster of speech segments, unsupervised acoustic model adaptation is carried out by exploiting the transcriptions generated by a preliminary decoding step. The Gaussian components of the system are adapted with the Maximum Likelihood Linear Regression (MLLR) technique [7, 8]. A single global regression class is considered, for adapting either the means only or both means and variances. Mean vectors are adapted with a full transformation matrix, while a diagonal transformation matrix is used to adapt variances.

5. EXPERIMENTAL RESULTS

Transcription experiments were carried out on two different test sets: the IBNC test set, formed by 6 radio news programs for a total of 1h:15m, and a test set formed by two television news programs, for a total of 40 minutes. In all cases, the automatic transcription experiments exploited the segmentation computed automatically by the acoustic segmentation and classification modules. According to their class, speech segments were processed with the general wide-band or narrow-band acoustic models. The resulting transcripts were used to perform acoustic model adaptation on each cluster. Then, the final decoding step was performed with the adapted acoustic models.

Table 2 reports the results obtained on the IBNC test set. The Word Error Rate (WER) is reported for the speech material corresponding to the different focus conditions and for the whole test set. Recognition experiments were carried out without acoustic model adaptation (Baseline) and with it (MLLR). Acoustic model adaptation refers to Gaussian mean adaptation; adapting both means and variances did not further improve performance. Table 3 reports the results obtained on recordings of television broadcast news. Comparing these results with those obtained on the IBNC test set, it can be noted that they are noticeably worse.
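The mean-only MLLR update with a single global regression class (Section 4.4) admits a closed-form, row-wise solution when covariances are diagonal [7]. The sketch below is an illustrative reimplementation, not ITC-irst's code; the sufficient statistics (Gaussian occupancies and posterior-weighted observation sums) are assumed to come from the preliminary decoding pass.

```python
import numpy as np

def mllr_mean_transform(gammas, obs_sums, means, variances):
    """Estimate a global MLLR mean transform W of shape (d, d+1).

    gammas:    (M,)   total occupancy of each Gaussian on the adaptation data
    obs_sums:  (M, d) per-Gaussian sums of posterior-weighted observations
    means:     (M, d) current Gaussian means
    variances: (M, d) diagonal covariances
    Adapted means are W @ [1, mu]; variances are left untouched.
    """
    M, d = means.shape
    xi = np.hstack([np.ones((M, 1)), means])   # extended mean vectors [1, mu]
    W = np.zeros((d, d + 1))
    for i in range(d):                         # one transform row per dimension
        inv_var = gammas / variances[:, i]
        G = (xi * inv_var[:, None]).T @ xi     # (d+1, d+1) accumulator
        k = xi.T @ (obs_sums[:, i] / variances[:, i])
        W[i] = np.linalg.solve(G, k)
    return W

def adapt_means(W, means):
    """Apply the estimated transform to all Gaussian means."""
    xi = np.hstack([np.ones((len(means), 1)), means])
    return xi @ W.T
```

A useful property for testing: if the adaptation statistics are generated exactly from affinely transformed means, the estimator recovers the true transform up to numerical precision, independently of the variances.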
It should be pointed out that the first test set consists of material recorded in conditions homogeneous with the training portion of the IBNC corpus. The second test set, instead, consists of recordings of television programs, which differ from radio programs in many respects, such as the studio acoustic conditions and the structure of the programs. This is also shown by the different distributions of the uttered words with respect to the focus conditions, as reported in Tables 2 and 3. In the television news programs telephone speech (focus condition F2) is not present, while in the radio programs telephone speech represents a significant part of the whole speech material. Conversely, speech with generic background (focus condition F4) represents a significant part of the television recordings, while in the first test set material in condition F4 is only a marginal part of the whole set of data.

  Focus condition   F0      F1      F2      F3      F4      F5     FX      Global
  Word distrib.     57.4%   0.9%    23.1%   6.1%    3.0%    0.0%   9.4%    100.0%
  Baseline          15.6%   42.6%   27.8%   23.8%   35.1%   -      25.0%   20.6%
  MLLR (means)      14.3%   29.6%   26.0%   21.6%   29.8%   -      20.9%   18.7%

Table 2: Performance of the speech transcriber on the IBNC test set.

  Focus condition   F0      F1      F2     F3      F4      F5     FX      Global
  Word distrib.     36.4%   0.4%    0.0%   3.9%    27.8%   0.0%   31.5%   100.0%
  Baseline          18.1%   48.0%   -      33.3%   36.3%   -      34.5%   29.0%
  MLLR (means)      16.5%   52.0%   -      22.0%   28.5%   -      25.9%   23.2%

Table 3: Performance of the speech transcriber on recordings of television broadcast news.

6. CONCLUSION

Work is in progress at ITC-irst to augment the training data and to improve the algorithms. A comparison with a previous version of the system [1], which used less training material, shows that augmenting the training data and improving the acoustic modeling led to a significant WER reduction on the IBNC test set. Furthermore, the difference in transcription performance between radio and television broadcast news suggests that the training set should be enlarged to include an adequate amount of recordings of television news programs. The collection of a suitable amount of data, for development and evaluation purposes, has already been scheduled. Future work will be devoted mainly to acoustic modeling, fast unsupervised speaker adaptation and language model adaptation [9]. Furthermore, speaker tracking algorithms are under development to improve the indexing capabilities of the system.

7. ACKNOWLEDGMENTS

The work presented here has been carried out within the European project CORETEX (IST-1999-11876).

8. REFERENCES

[1] Brugnara, F., Cettolo, M., Federico, M. and Giuliani, D., "A System for the Segmentation and Transcription of Italian Radio News", in Proc. of the International Conference on Content-Based Multimedia Information Access (RIAO), Paris, France, pp. 364-371, 2000.
[2] Chen, S. S. and Gopalakrishnan, P. S., "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion", in Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998.
[3] Tritschler, A. and Gopinath, R., "Improved Speaker Segmentation and Segment Clustering using Bayesian Information Criterion", in Proc. of EUROSPEECH, Budapest, Hungary, pp. 679-682, 1999.
[4] Cettolo, M., "Segmentation, Classification and Clustering of an Italian Broadcast News Corpus", in Proc. of the International Conference on Content-Based Multimedia Information Access (RIAO), Paris, France, pp. 372-381, 2000.
[5] Brugnara, F. and Federico, M., "Dynamic Language Models for Interactive Speech Applications", in Proc. of EUROSPEECH, Rhodes, Greece, pp. 2751-2754, 1997.
[6] Brugnara, F. and Cettolo, M., "Improvements in Tree-based Language Model Representation", in Proc. of EUROSPEECH, Madrid, Spain, pp. 2075-2078, 1995.
[7] Leggetter, C. J. and Woodland, P. C., "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models", Computer Speech and Language, 9:171-185, 1995.
[8] Gales, M. J. F., "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, 12:75-98, 1998.
[9] Federico, M., "Efficient Language Model Adaptation through MDI Estimation", in Proc. of EUROSPEECH, Budapest, Hungary, pp. 1583-1586, 1999.