Abstract
This paper describes the 2007 meeting speech-to-text system for lecture rooms developed at the Interactive Systems Laboratories (ISL) for the multiple distant microphone condition. The system was evaluated in the RT-07 Rich Transcription Meeting Evaluation sponsored by the US National Institute of Standards and Technology (NIST). We describe the principal differences between our current system and those submitted in previous years: a signal-adaptive front-end (realized by warped-twice minimum variance distortionless response spectral estimation); improved acoustic models (including maximum mutual information estimation) and language models; cross-adaptation between systems that differ in the front-end as well as the phoneme set; a discriminative criterion instead of the signal-to-noise ratio for selecting the channel to be used; and decoder-based speech segmentation.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wölfel, M., Stüker, S., Kraft, F. (2008). The ISL RT-07 Speech-to-Text System. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. RT 2007, CLEAR 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_43
DOI: https://doi.org/10.1007/978-3-540-68585-2_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68584-5
Online ISBN: 978-3-540-68585-2
eBook Packages: Computer Science, Computer Science (R0)