ADVANCES IN AUTOMATIC TRANSCRIPTION OF ITALIAN BROADCAST NEWS
F. Brugnara, M. Cettolo, M. Federico, D. Giuliani
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica
I-38050 Povo, Trento, Italy.
ABSTRACT
This paper presents some recent improvements in automatic transcription of Italian broadcast news obtained at ITC-irst.
A preliminary activity was carried out in order to develop a suitable speech corpus for the Italian language. The resulting corpus, consisting of 30 hours of radio news recordings, was exploited to develop a baseline system for the transcription of broadcast news. The system operates in different stages: acoustic segmentation and classification, speaker clustering, acoustic model adaptation and speech decoding. The major recent advances enabling improved performance concern speech segmentation and clustering, acoustic modeling, acoustic model adaptation and language modeling.
The transcription system achieves a 14.3% word error rate on planned studio speech and 18.7% on the whole test set, formed by recordings of radio broadcast news. When applied to a test set formed by recordings of television broadcast news, the system achieves a 16.5% word error rate on planned studio speech and 23.2% on the whole test set.
1. INTRODUCTION
Technologies that ease the management of and the access to multimedia archives are receiving more and more attention, due to the increasing availability of large multimedia digital libraries. Audio transcription and indexing are among these emerging technologies.
This work presents a system for the automatic transcription of Italian broadcast news.
A preliminary activity was carried out in order to develop a suitable speech corpus for the Italian language. Thirty hours of recordings, covering several years of radio news, were collected, labelled and transcribed. The annotated material forms a speech corpus called the Italian Broadcast News Corpus (IBNC). This corpus was used for developing the system.
The current system operates in four stages [1]: acoustic segmentation and classification, speaker clustering, acoustic model adaptation, and speech transcription. The transcription system was tested on two different data sets, formed respectively by recordings of radio and television broadcast news. On radio recordings, the system achieves a 14.3% word error rate on planned studio speech and 18.7% on the whole test set. When applied to television broadcast news, the system achieves a 16.5% word error rate on planned studio speech and 23.2% on the whole test set.
The paper is organized as follows. Section 2 introduces
the activities of data collection and annotation ongoing at
ITC-irst. Section 3 briefly sketches the audio segmentation,
classification and clustering algorithms. Section 4 outlines
the adopted acoustic modeling, language model, recognition
engine and speaker adaptation. Experimental results are presented in Section 5.
2. BROADCAST NEWS CORPORA
Since summer 1999, ITC-irst has been collecting Italian broadcast news corpora. First, the IBNC corpus was developed under a contract with ELRA/ELDA (European Language Resources Association). RAI, the major Italian broadcasting company, supplied recordings of radio news programs sampled from its internal digital archive. The collection consists of 150 programs, for a total time of about 30 hours, issued between 1992 and 1999. The corpus contains about 7 hours of telephone speech. The IBNC was segmented, labelled and transcribed following conventions similar to those adopted by the Linguistic Data Consortium (www.ldc.upenn.edu) for the HUB-4 corpora. The corpus, which was released in April 2000, will be distributed by ELRA.
Recently, ITC-irst has started to develop an in-house corpus of television broadcast news, a small part of which has
been used in this work. According to our plans, by the end
of this year about 100 hours of transcribed material will be
available for development and evaluation purposes.
3. SEGMENTATION AND CLUSTERING
The Bayesian Information Criterion (BIC) [2] is applied to segment the input audio stream into acoustically homogeneous chunks. Gaussian mixture models are then used to classify segments in terms of acoustic source and channel. Emission probability densities consist of mixtures of
1024 multivariate Gaussian components with diagonal covariance matrices. Observations are 39-dimensional vectors (see Section 4.1). Six classes are considered for classification: female/male wide-band speech, female/male narrow-band speech, pure music, and silence plus other non-speech events.
Clustering of speech segments is done by a bottom-up
scheme [2, 3] that groups segments which are acoustically
close with respect to the BIC. As a result, this step should
gather segments of the same speaker.
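As an illustration of the change-point test, the following Python sketch computes the BIC score for a single candidate boundary between two windows of feature vectors. The variable names and the default penalty weight lam=1.0 are our own assumptions, not details of the ITC-irst implementation.

import numpy as np

def delta_bic(x, y, lam=1.0):
    # x, y: (n1, d) and (n2, d) arrays of acoustic feature vectors on the
    # two sides of a candidate break point; full-covariance Gaussians are
    # assumed, following the BIC formulation of [2].
    z = np.vstack([x, y])
    n, d = z.shape
    # log-determinants of the sample covariance matrices
    ld_z = np.linalg.slogdet(np.cov(z, rowvar=False))[1]
    ld_x = np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    ld_y = np.linalg.slogdet(np.cov(y, rowvar=False))[1]
    # penalty: extra parameters introduced by a second full-covariance Gaussian
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    # positive values support an acoustic change at the boundary
    return 0.5 * (n * ld_z - len(x) * ld_x - len(y) * ld_y) - penalty

A boundary is hypothesized wherever the score is positive; in practice the test is applied repeatedly over a sliding, growing window along the stream.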
To evaluate how well the segmentation algorithm detects break points, recall and precision are computed with respect to the target (manually annotated) boundaries: with a time tolerance of half a second, the two measures are 82.9% and 85.9%, respectively. Classification accuracy, in terms of the six classes above, is 95.6%. Defining the purity of a cluster as the percentage of speech uttered by its dominant speaker, the clustering algorithm provides an average cluster purity of 94.1%.
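The evaluation measures can be computed as in the following sketch (our own formulation: hyp and ref are lists of boundary times in seconds, and clusters maps each cluster to per-speaker speech durations; a stricter one-to-one boundary matching would also be possible).

def boundary_recall_precision(hyp, ref, tol=0.5):
    # A reference boundary is recalled if some hypothesis lies within
    # +/- tol seconds; a hypothesis is correct if it matches some reference.
    recalled = sum(any(abs(r - h) <= tol for h in hyp) for r in ref)
    correct = sum(any(abs(h - r) <= tol for r in ref) for h in hyp)
    return recalled / len(ref), correct / len(hyp)

def average_cluster_purity(clusters):
    # clusters: list of dicts, speaker -> seconds of speech in the cluster.
    # Cluster purity = fraction of speech from the dominant speaker;
    # the average is weighted by cluster duration.
    total = sum(sum(c.values()) for c in clusters)
    return sum(max(c.values()) for c in clusters) / total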
More details about the content of this section can be found in
[4].
4. SPEECH TRANSCRIPTION
4.1. Acoustic Modeling
Acoustic modeling is based on continuous-density HMMs.
The acoustic parameter vector comprises 12 mel-scaled cepstral coefficients, the log-energy, and their first and second time-derivatives. For each channel condition, a set of context-dependent units is defined, based on the SAMPA phonetic alphabet. Some additional units have also been introduced to cover silence, background noise and a number of spontaneous speech phenomena. They are used both in training and in recognition, by allowing their optional insertion between words.
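A rough sketch of this 39-dimensional parameterization is given below, using the librosa library; the sampling rate, window and hop sizes are our own assumptions, as the paper does not specify them.

import numpy as np
import librosa

def acoustic_features(wav, sr=16000):
    # 12 mel-cepstral coefficients plus log-energy, with first and second
    # time-derivatives appended: 13 * 3 = 39 dimensions per frame.
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)
    frames = librosa.util.frame(wav, frame_length=400, hop_length=160)
    log_e = np.log(np.maximum((frames ** 2).sum(axis=0), 1e-10))
    n = min(mfcc.shape[1], log_e.shape[0])
    mfcc = mfcc[:, :n]
    mfcc[0, :] = log_e[:n]               # replace c0 with the frame log-energy
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T   # (n_frames, 39)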
Acoustic training was performed with Baum-Welch reestimation using the training portion of the IBNC corpus, augmented
with other datasets collected at ITC-irst. Table 1 shows the
amount of training data used for each model set.
The lexicon was produced with an automatic transcription
tool, and then manually checked to compensate for possible
errors in the transcription of acronyms and foreign words.
4.1.1. Context-dependent models
The context-dependent models include a set of triphones
which are well represented in the training data, augmented
with a set of left-dependent or right-dependent diphones used
as backoff models for unseen contexts.
A backoff unit is defined as the "union" of all units which share a partial context: for example, a left-dependent unit is defined as the union of all the triphones sharing the same left context, plus a possible "remainder" for that left context (see Figure 1). The "remainder" unit is meant to represent the collection of all the different contexts of a phone which appear in the training set, but not often enough to have a dedicated model. Only triphones and remainders are directly estimated on the training database. The models of backoff units
are built afterwards by means of an agglomeration technique
described below.
First, a set of "trainable" triphones is selected by imposing a threshold on their number of occurrences in the training data; the threshold was set to 10 for these experiments. A transcription of the training data is then generated by using this set of units. When a triphone appears that is not in the selected list, it is mapped to a remainder unit. Of the two possible remainders that could be substituted, the one with the higher number of occurrences is chosen.
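A minimal sketch of this selection and mapping step follows, under our own assumptions about unit naming (a triphone written as 'l_p_r', its two remainders as 'l_p_*' and '*_p_r').

from collections import Counter

def select_and_map(triphone_tokens, min_count=10):
    # triphone_tokens: all triphone occurrences in the training
    # transcription, e.g. ['b_a_n', 'f_a_v', ...].
    counts = Counter(triphone_tokens)
    # occurrences of rare triphones accumulate on their remainder units
    remainders = Counter()
    for tri, c in counts.items():
        if c < min_count:
            l, p, r = tri.split('_')
            remainders[f'{l}_{p}_*'] += c
            remainders[f'*_{p}_{r}'] += c
    mapping = {}
    for tri, c in counts.items():
        if c >= min_count:
            mapping[tri] = tri            # trainable triphone
        else:
            l, p, r = tri.split('_')
            left, right = f'{l}_{p}_*', f'*_{p}_{r}'
            # of the two candidate remainders, pick the more frequent one
            mapping[tri] = left if remainders[left] >= remainders[right] else right
    return mapping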
This set of models is then trained using Baum-Welch reestimation, starting with single-component mixtures and iteratively splitting Gaussians, selected according to their usage counters during the first iterations. Besides, Gaussian components are pruned from mixtures if their weights fall below a predefined threshold. The tying scheme enforces sharing of Gaussian components among the allophone mixtures of a given phone which are in the same position within the model. Mixture weights remain specific to each model. During reestimation, all the statistics of transitions and distributions are stored for later use by the agglomeration procedure.
                   wide-band    narrow-band
Training Data      26h:42m      15h:42m
#Words             236000       121000
#Triphones         6150         3814
#Backoff Models    2327         1181
#Gaussians         15752        10573

Table 1: Properties of the two model sets used in the experiments.
As previously stated, models for backoff units are built by agglomeration. For each backoff unit, a model is built whose topology corresponds to the common topology of the allophones of the base phone. The transition probabilities are estimated by summing the counters computed during Baum-Welch reestimation over all the subsumed triphones. Similarly, each mixture of the agglomerate model is built by joining the components of all the mixtures in the same position within the triphone models, and by estimating their weights from the combined statistics of the weights of the included mixtures. The backoff models therefore contain model-specific mixtures whose components belong to the common Gaussian pool of the allophones. The same procedure was applied for the wide-band and narrow-band models. The characteristics of the resulting model sets are reported in Table 1.
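The weight combination can be sketched as follows (our own data layout: each mixture is a dict from a shared Gaussian identifier to its weight, with one occupancy count per mixture accumulated during Baum-Welch training).

def agglomerate_mixture(mixtures, occupancies):
    # Join the state-aligned mixtures of the subsumed triphones: components
    # from the phone's shared Gaussian pool are merged, and the new weights
    # are re-estimated from the training occupancy statistics.
    weights = {}
    for mix, occ in zip(mixtures, occupancies):
        for gauss_id, w in mix.items():
            weights[gauss_id] = weights.get(gauss_id, 0.0) + w * occ
    total = sum(weights.values())
    return {gauss_id: w / total for gauss_id, w in weights.items()}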
As for the training lexicon, the recognition lexicon is generated using triphones only when there is a minimum number
of occurrences in the training set. This time, however, missing triphones are not mapped to remainders, but to backoff units, and the threshold is set to 50. Thresholding is applied to backoff units as well, so that context-independent units can appear.

[Figure 1: Model hierarchy for phone _a_: the diphone backoff units (b_a_, _a_v) agglomerate "real" triphone models (b_a_n, b_a_r, l_a_v, r_a_v, f_a_v) and "remainder" models (b_a_*, *_a_v). The symbol * denotes the "remainder" context.]
4.2. Language Model
A trigram language model was developed by mainly exploiting newspaper text sources. In particular, a 133M-word collection of the nationwide newspaper La Stampa was employed, including all issues between 1992 and 1998. Moreover, the broadcast news transcriptions of the training data were added. Numeric expressions occurring in the texts were replaced by class labels; as explained later, during decoding these labels are linked to specific rule-based LMs. A lexicon of the 64K most frequent words was selected. This lexicon gives a 2.2% OOV rate on the newspaper corpus and about 1.6% on the IBNC corpus. An interpolated trigram LM was estimated by employing a non-linear discounting function and a pruning strategy that deletes trigrams on the basis of their context frequency. A good trade-off was obtained by using a shift-1 discounting function and by pruning all trigrams with context frequency less than 10. This results in a pruned LM with a perplexity of 188 and a size of 14M.
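The two ingredients can be sketched as follows; this is our own formulation, in which shift-1 discounting is taken to be absolute discounting with the constant fixed at 1, the interpolation weight is the standard one for absolute discounting, and n-gram counts are dicts keyed by word tuples.

def prune_trigrams(trigram_counts, bigram_counts, min_context=10):
    # delete trigrams (u, v, w) whose context bigram (u, v) occurs
    # fewer than min_context times in the training corpus
    return {tri: c for tri, c in trigram_counts.items()
            if bigram_counts.get(tri[:2], 0) >= min_context}

def shift1_prob(c_tri, c_context, n_continuations, p_bigram):
    # Interpolated shift-1 discounting: subtract 1 from the observed
    # trigram count and redistribute the freed mass (one unit per
    # distinct continuation of the context) via the bigram probability.
    discounted = max(c_tri - 1, 0) / c_context
    backoff_mass = n_continuations / c_context
    return discounted + backoff_mass * p_bigram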
4.3. Recognition Engine
The recognizer is a single-step Viterbi decoder. The basic
Viterbi beam-search algorithm is extended [5] to deal with
recurrent transition networks, which in this case are used to
represent subgrammars in the LM for common entities such
as cardinal numbers, ordinal numbers, and percentages. The
use of subgrammars and a recursive decoder has the effect of
reducing the network size during decoding, as there is no need
to embed the subgrammars in the main trigram network for
each occurrence, and also allows for a more robust estimation
of the LM.
The 64K-word trigram LM is mapped into a static network with a shared-tail topology [6]. The main network has
about 11M states, 10M named transitions and 17M empty
transitions.
4.4. Acoustic Model Adaptation
On each cluster of speech segments, unsupervised acoustic
model adaptation is carried out by exploiting the transcriptions generated by a preliminary decoding step.
Gaussian components in the system are adapted using the
Maximum Likelihood Linear Regression (MLLR) technique
[7, 8]. A global regression class is considered for adapting
only the means or both means and variances. Mean vectors
are adapted using a full transformation matrix, while a diagonal transformation matrix is used to adapt variances.
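A compact sketch of the estimation of the global mean transform is given below, assuming diagonal covariances so that each row of the transform can be computed independently (cf. [7]). The statistics gamma (per-Gaussian occupancies) and obs_mean (per-Gaussian weighted means of the adaptation frames) are assumed to come from a forward-backward or Viterbi pass over the preliminary transcription.

import numpy as np

def mllr_mean_transform(means, variances, gamma, obs_mean):
    # means, variances: (G, d) Gaussian parameters (diagonal covariance)
    # gamma:            (G,)  occupancy counts on the adaptation data
    # obs_mean:         (G, d) occupancy-weighted means of observed frames
    # Returns W of shape (d, d+1); the adapted mean of Gaussian g is
    # W @ [1, mean_g], i.e. a full matrix plus bias, as in the paper.
    G, d = means.shape
    xi = np.hstack([np.ones((G, 1)), means])      # extended mean vectors
    W = np.zeros((d, d + 1))
    for i in range(d):
        a = gamma / variances[:, i]               # per-Gaussian weights
        G_i = (xi * a[:, None]).T @ xi            # (d+1, d+1) accumulator
        k_i = (xi * (a * obs_mean[:, i])[:, None]).sum(axis=0)
        W[i] = np.linalg.solve(G_i, k_i)
    return W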
5. EXPERIMENTAL RESULTS
Transcription experiments were carried out with two different
test sets: the IBNC test set, formed by 6 radio news programs
for a total of 1h:15m, and a test set formed by two television
news programs, for a total of 40 minutes.
In all cases, the automatic transcription experiments exploited the segmentation automatically computed by the acoustic segmentation and classification modules. According to their class, speech segments were processed with the general wide-band or narrow-band acoustic models. The resulting transcripts were used to perform acoustic model adaptation on each cluster. Then, the final decoding step was performed using the adapted acoustic models.
Table 2 reports results obtained on the IBNC test set.
The Word Error Rate (WER) is reported for speech material corresponding to the different focus conditions and for
the whole test set. Recognition experiments were carried out
without (Baseline column) and with acoustic model adaptation (MLLR column). Acoustic model adaptation refers to Gaussian mean adaptation; adapting both means and variances did not further improve performance.
Table 3 reports the results obtained on recordings of television broadcast news. Compared with those obtained on the IBNC test set, these results are tangibly worse.

It should be pointed out that the first test set consists of material recorded in conditions homogeneous with those of the training portion of the IBNC corpus. The second test set, instead, consists of recordings from television programs, which differ from radio programs in many aspects, such as the studio acoustic conditions and the structure of the programs. This is also shown by the different distributions of the
uttered words with respect to the focus conditions, as reported in Tables 2 and 3. In the television news programs, telephone speech (focus condition F2) is not present, while in the radio programs it represents a significant part of the whole speech material. On the contrary, speech with generic background (focus condition F4) represents a significant part of the television recordings, while in the first test set the material corresponding to condition F4 is only a marginal part of the data.

Focus        Word        Baseline    MLLR
condition    distrib.                (means)
F0            57.4%       15.6%       14.3%
F1             0.9%       42.6%       29.6%
F2            23.1%       27.8%       26.0%
F3             6.1%       23.8%       21.6%
F4             3.0%       35.1%       29.8%
F5             0.0%         -           -
FX             9.4%       25.0%       20.9%
Global       100.0%       20.6%       18.7%

Table 2: Performance of the speech transcriber on the IBNC test set.
Focus        Word        Baseline    MLLR
condition    distrib.                (means)
F0            36.4%       18.1%       16.5%
F1             0.4%       48.0%       52.0%
F2             0.0%         -           -
F3             3.9%       33.3%       22.0%
F4            27.8%       36.3%       28.5%
F5             0.0%         -           -
FX            31.5%       34.5%       25.9%
Global       100.0%       29.0%       23.2%

Table 3: Performance of the speech transcriber on recordings of television broadcast news.
6. CONCLUSION
Work is in progress at ITC-irst to augment the training data
and to improve algorithms.
A comparison with a previous version of the system [1], which used less training material, shows that augmenting the training data and improving acoustic modeling allowed a significant improvement of the WER on the IBNC test set. Furthermore, the difference,
in terms of transcription performance of radio and television
broadcast news, suggests that the training set should be enlarged in order to contain an adequate amount of examples of
recordings of television news programs. Activity for collecting a suitable amount of data, for development and evaluation
purposes, has already been scheduled.
Future work will be devoted mainly to acoustic modeling, fast unsupervised speaker adaptation and language model adaptation [9]. Furthermore, speaker tracking algorithms are under development to improve the indexing capabilities of the system.
7. ACKNOWLEDGMENTS

The work presented here was carried out within the European project CORETEX (IST-1999-11876).

8. REFERENCES
[1] Brugnara, F., Cettolo, M., Federico, M. and Giuliani, D.,
“A System for the Segmentation and Transcription of
Italian Radio News”, in Proc. of the International Conference on Content-Based Multimedia Information Access (RIAO), Paris, France, pp. 364–371, 2000.
[2] Chen, S. S. and Gopalakrishnan, P. S., “Speaker, Environment and Channel Change Detection and Clustering
via the Bayesian Information Criterion”, in Proc. of the
DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998.
[3] Tritschler, A. and Gopinath, R., “Improved Speaker
Segmentation and Segment Clustering using Bayesian
Information Criterion”, in Proc. of EUROSPEECH, Budapest, Hungary, pp. 679–682, 1999.
[4] Cettolo, M., “Segmentation, Classification and Clustering of an Italian Broadcast News Corpus”, in Proc.
of the International Conference on Content-Based Multimedia Information Access (RIAO), Paris, France, pp.
372–381, 2000.
[5] Brugnara, F. and Federico, M., "Dynamic Language
Models for Interactive Speech Applications”, in Proc. of
EUROSPEECH, Rhodes, Greece, pp. 2751–2754, 1997.
[6] Brugnara, F. and Cettolo, M., “Improvements in Treebased Language Model Representation”, in Proc. of EUROSPEECH, Madrid, Spain, pp. 2075–2078, 1995.
[7] Leggetter, C. J. and Woodland, P. C., “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models”, Computer Speech
and Language, 9:171–185, 1995.
[8] Gales, M. J. F., “Maximum likelihood linear transformations for HMM-based speech recognition”, Computer
Speech and Language, 12:75–98, 1998.
[9] Federico, M., “Efficient Language Model Adaptation
through MDI Estimation”, in Proc. of EUROSPEECH,
Budapest, Hungary, pp. 1583–1586, 1999.