The ISL RT-07 Speech-to-Text System

Wölfel, Matthias; Stüker, Sebastian; Kraft, Florian

doi:10.1007/978-3-540-68585-2_43

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4625)

Abstract

This paper describes the 2007 meeting speech-to-text system for lecture rooms developed at the Interactive Systems Laboratories (ISL) for the multiple distant microphone condition, which was evaluated in the RT-07 Rich Transcription Meeting Evaluation sponsored by the US National Institute of Standards and Technology (NIST). We describe the principal differences between our current system and those submitted in previous years, namely: a signal-adaptive front-end (realized by warped-twice minimum variance distortionless response spectral estimation); improved acoustic models (including maximum mutual information estimation) and language models; cross-adaptation between systems that differ in the front-end as well as the phoneme set; a discriminative criterion instead of the signal-to-noise ratio for selecting the channel to be used; and decoder-based speech segmentation.
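For orientation, the signal-to-noise-ratio baseline that the abstract says was replaced can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`frame_energies`, `estimate_snr_db`, `select_channel`), the energy-percentile SNR estimate, and the defaults (`frame_len=400`, i.e. 25 ms at an assumed 16 kHz sampling rate, and `fraction=0.2`) are all assumptions introduced here. The discriminative criterion used in the RT-07 system is not reproduced.

```python
import math


def frame_energies(samples, frame_len=400):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def estimate_snr_db(samples, frame_len=400, fraction=0.2):
    """Crude SNR estimate (assumption): the loudest frames count as speech,
    the quietest frames count as noise."""
    energies = sorted(frame_energies(samples, frame_len))
    if not energies:
        raise ValueError("signal is shorter than one frame")
    k = max(1, int(len(energies) * fraction))
    noise = sum(energies[:k]) / k
    speech = sum(energies[-k:]) / k
    eps = 1e-12  # avoids log(0) on digitally silent frames
    return 10.0 * math.log10((speech + eps) / (noise + eps))


def select_channel(channels, frame_len=400):
    """Return the index of the channel with the highest estimated SNR."""
    snrs = [estimate_snr_db(c, frame_len) for c in channels]
    return max(range(len(snrs)), key=snrs.__getitem__)
```

Given one list of samples per distant microphone, `select_channel(channels)` returns the index of the recording to pass on to the recognizer. A discriminative criterion would replace `estimate_snr_db` with a different per-channel score while leaving the selection step unchanged.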

Author information

Authors and Affiliations

Interactive Systems Laboratories, Institut für Theoretische Informatik, Universität Karlsruhe (TH), Am Fasanengarten 5, 76131 Karlsruhe, Germany
Matthias Wölfel, Sebastian Stüker & Florian Kraft

Editor information

Rainer Stiefelhagen, Rachel Bowers, Jonathan Fiscus

About this paper

Cite this paper

Wölfel, M., Stüker, S., Kraft, F. (2008). The ISL RT-07 Speech-to-Text System. In: Stiefelhagen, R., Bowers, R., Fiscus, J. (eds) Multimodal Technologies for Perception of Humans. CLEAR 2007, RT 2007. Lecture Notes in Computer Science, vol 4625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68585-2_43

DOI: https://doi.org/10.1007/978-3-540-68585-2_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68584-5
Online ISBN: 978-3-540-68585-2
eBook Packages: Computer Science, Computer Science (R0)
