In this work, we address the noise robustness of pattern recognition systems by investigating the application of Reservoir Computing Networks (RCNs) to speech and image recognition tasks. Our work introduces different architectures of RCN-based systems along with a coherent, task-independent strategy to optimize the reservoir parameters. We show that such systems are more robust than the state of the art in the presence of noise, and that RCNs can be used both for robust recognition and for denoising. Moreover, the successful application of RCNs to different tasks using the proposed strategy supports our claim that it is task-independent.
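An RCN keeps a large, randomly initialized recurrent layer (the reservoir) fixed and trains only a linear readout. The sketch below shows this structure for a generic echo state network in NumPy; all names and hyperparameter values (spectral radius, leak rate, ridge constant) are illustrative assumptions, not the configuration used in this work.

```python
# Minimal echo-state-network sketch (one common RCN variant); hyperparameters
# below are illustrative assumptions, not the paper's actual configuration.
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_in, n_res, spectral_radius=0.9, input_scale=0.5):
    """Random input and recurrent weights; the recurrent matrix is rescaled
    so its spectral radius stays below 1 (echo state property)."""
    W_in = input_scale * (rng.random((n_res, n_in)) * 2 - 1)
    W = rng.random((n_res, n_res)) * 2 - 1
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(W_in, W, inputs, leak=0.3):
    """Drive the fixed reservoir with the input sequence and collect states."""
    x = np.zeros(W.shape[0])
    states = []
    for u in inputs:                       # inputs: (T, n_in)
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.asarray(states)              # (T, n_res)

def train_readout(states, targets, ridge=1e-6):
    """Only the linear readout is trained, via ridge regression."""
    S = states
    return np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ targets)
```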
Speaker diarization includes two steps: speaker segmentation and speaker clustering. Speaker segmentation searches for speaker boundaries, whereas speaker clustering aims at grouping speech segments of the same speaker. In this work, the segmentation is improved by replacing the Bayesian Information Criterion (BIC) with a new iVector-based approach. Unlike BIC-based methods, which trigger on any acoustic dissimilarity, the proposed method suppresses phonetic variations and accentuates speaker differences. More specifically, our method generates boundaries based on the distance between two speaker factor vectors that are extracted on a frame-by-frame basis. The extraction relies on an eigenvoice matrix, so that large differences between speaker factor vectors indicate a different speaker. A Mahalanobis-based distance measure, in which the covariance matrix compensates for the remaining and detrimental phonetic variability, is shown to generate accurate boundaries. The detected segments are clustered by a state-of-the-art iVector Probabilistic Linear Discriminant Analysis system. Experiments on the COST278 multilingual broadcast news database show relative reductions of 50% in boundary detection errors. The speaker error rate is reduced by 8% relative.
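As a rough illustration of the boundary criterion, the sketch below compares mean speaker factor vectors on either side of each candidate boundary with a Mahalanobis distance. The window length, peak picking, and threshold are assumptions for illustration; the covariance matrix is assumed to model the within-speaker (phonetic) variability described above.

```python
# Hedged sketch of Mahalanobis-based speaker boundary detection over
# frame-level speaker factors; window and threshold values are assumptions.
import numpy as np

def mahalanobis_boundary_scores(factors, cov, half_window=100):
    """factors: (T, D) frame-level speaker factors; cov: (D, D) covariance
    of residual phonetic variability. Returns a distance score per frame."""
    prec = np.linalg.inv(cov)
    scores = np.zeros(len(factors))
    for t in range(half_window, len(factors) - half_window):
        left = factors[t - half_window:t].mean(axis=0)
        right = factors[t:t + half_window].mean(axis=0)
        d = left - right
        scores[t] = float(d @ prec @ d)  # large distance -> likely speaker change
    return scores

def detect_boundaries(scores, threshold):
    """Pick local maxima of the distance curve that exceed the threshold."""
    peaks = (scores[1:-1] > scores[:-2]) & (scores[1:-1] > scores[2:])
    return [t + 1 for t in np.flatnonzero(peaks) if scores[t + 1] > threshold]
```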
In this paper we present novel ways of incorporating syllable information into an HMM-based speech recognition system. Syllable-based acoustic modelling is appealing as syllables have certain acoustic-phonetic dependencies that cannot be modeled in a pure phone-based system. On the other hand, syllable-based systems suffer from sparsity issues. In this paper we investigate the potential of different acoustic units such as phones, phone clusters, phones-in-syllables, demi-syllables and syllables in combination with a variety of back-off schemes. Experimental results are presented on the Wall Street Journal database. When working with traditional frame-based features only, results show only minor improvements. However, we expect that the developed system will show its full potential when incorporating additional segmental features at the syllable level.
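To make the back-off idea concrete, the following hypothetical sketch selects, per syllable, the most specific unit decomposition whose units are all observed often enough in the training data. The unit names, candidate ordering and count threshold are invented for illustration and do not correspond to any specific scheme from the paper.

```python
# Illustrative count-based back-off over acoustic units, from the most
# specific (whole syllable) to the most generic (context-independent phones).
def back_off_units(candidates, counts, min_count=100):
    """candidates: unit decompositions of one syllable, ordered from most
    specific to most generic. Returns the first decomposition whose units
    all have at least min_count training occurrences."""
    for units in candidates:
        if all(counts.get(u, 0) >= min_count for u in units):
            return units
    return candidates[-1]  # phones are always trainable

# Example: a sparse syllable backs off to its demi-syllables.
counts = {'syl:s-t-r-ao-ng': 12, 'demi:s-t-r-ao': 240, 'demi:ao-ng': 310,
          'phone:s': 9000, 'phone:t': 12000, 'phone:r': 8000,
          'phone:ao': 4000, 'phone:ng': 3000}
candidates = [('syl:s-t-r-ao-ng',),
              ('demi:s-t-r-ao', 'demi:ao-ng'),
              ('phone:s', 'phone:t', 'phone:r', 'phone:ao', 'phone:ng')]
print(back_off_units(candidates, counts))  # -> the demi-syllable pair
```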
In automatic speech recognition, as in many areas of machine learning, stochastic modeling relies more and more on neural networks. In both acoustic and language modeling, neural networks today mark the state of the art for large vocabulary continuous speech recognition, providing huge improvements over former approaches that were based solely on Gaussian mixture hidden Markov models and count-based language models. We give an overview of current activities in neural network based modeling for automatic speech recognition. This includes discussions of network topologies and cell types, training and optimization, choice of input features, adaptation and normalization, multitask training, as well as neural network based language modeling. Despite the clear progress obtained with neural network modeling in speech recognition, much remains to be done to obtain a consistent and self-contained neural network based modeling approach that ties in with the former state of the art. We conclude with a discussion of open problems as well as potential future directions for the integration of neural networks into automatic speech recognition systems.
In this work we propose to integrate a soft voice activity detection (VAD) module in an iVector-based speaker segmentation system. As speaker change detection should be based on speaker information only, we want it to disregard the non-speech frames by applying speech posteriors during the estimation of the Baum-Welch statistics. The speaker segmentation relies on speaker factors which are extracted on a frame-by-frame basis using an eigenvoice matrix. Speaker boundaries are inserted at positions where the distance between the speaker factors at both sides is large. A Mahalanobis distance seems capable of suppressing the effects of differences in the phonetic content at both sides and, therefore, of generating more accurate speaker boundaries. This iVector-based segmentation significantly outperforms Bayesian Information Criterion (BIC) segmentation methods and can be made adaptive on a file-by-file basis in a two-pass approach. Experiments on the COST278 multilingual broadcast news database show significant reductions of the boundary detection error rate by integrating the soft VAD. Furthermore, the more accurate boundaries induce a slight improvement of the iVector Probabilistic Linear Discriminant Analysis system that is employed for speaker clustering.
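The soft-VAD weighting of the Baum-Welch statistics amounts to scaling each frame's contribution by its speech posterior, so non-speech frames barely influence the speaker factors. A minimal sketch, assuming per-frame UBM responsibilities are already available (shapes and names are illustrative):

```python
# Hedged sketch of soft-VAD-weighted zeroth- and first-order UBM statistics.
import numpy as np

def weighted_bw_stats(features, gmm_posteriors, speech_posteriors):
    """features: (T, D) acoustic frames
    gmm_posteriors: (T, C) frame-level UBM component responsibilities
    speech_posteriors: (T,) soft VAD output in [0, 1]
    Returns zeroth-order stats N (C,) and first-order stats F (C, D)."""
    w = speech_posteriors[:, None] * gmm_posteriors   # (T, C) weighted responsibilities
    N = w.sum(axis=0)                                 # (C,)
    F = w.T @ features                                # (C, D)
    return N, F
```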
Phonetics Day 2003 (De dag van de Fonetiek 2003): on ongoing research into speech and speech technology (http://www.fon.hum.uva.nl/FonetischeVereniging/). Thursday 18 December 2003, Sweelinckzaal, Drift 21, Utrecht. Organized by the Nederlandse Vereniging voor Fonetische Wetenschappen. Admission free!
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper is concerned with the task of speaker verification on audio with multiple overlapping speakers. Most speaker verification systems are designed with the assumption of a single speaker being present in a given audio segment. However, in a real-world setting this assumption does not always hold. In this paper, we demonstrate that current speaker verification systems are not robust against audio with noticeable speaker overlap. To alleviate this issue, we propose margin-mixup, a simple training strategy that can easily be adopted by existing speaker verification pipelines to make the resulting speaker embeddings robust against multi-speaker audio. In contrast to other methods, margin-mixup requires no alterations to regular speaker verification architectures, while attaining better results. On our multi-speaker test set based on VoxCeleb1, the proposed margin-mixup strategy improves the EER by 44.4% relative on average compared to our state-of-the-art speaker verification baseline systems.
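The following is a speculative reconstruction of the margin-mixup idea from the description above: the audio of two speakers is mixed with weight lam (e.g. x = lam * x1 + (1 - lam) * x2 before the encoder), and the additive angular margins on both speakers' target logits are interpolated accordingly. Function names and the margin and scale values are assumptions; the paper's exact formulation may differ.

```python
# Hedged sketch of a mixup-style additive-angular-margin loss; m and s are
# assumed hyperparameters, not the paper's values.
import torch
import torch.nn.functional as F

def margin_mixup_loss(embeddings, weights, labels, lam, perm, m=0.2, s=30.0):
    """embeddings: (B, D) speaker embeddings of the *mixed* audio
    weights: (N_spk, D) classifier weights; labels: (B,) first-component
    speakers; perm: (B,) batch indices of the second mix component."""
    cos = F.normalize(embeddings) @ F.normalize(weights).t()      # (B, N_spk)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    logits = cos.clone()
    idx = torch.arange(len(labels))
    # margins interpolated between the two mixed speakers
    logits[idx, labels] = torch.cos(theta[idx, labels] + lam * m)
    logits[idx, labels[perm]] = torch.cos(theta[idx, labels[perm]] + (1 - lam) * m)
    logits = s * logits
    # cross-entropy targets mixed with the same weights as the audio
    return lam * F.cross_entropy(logits, labels) + \
           (1 - lam) * F.cross_entropy(logits, labels[perm])
```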
In this technical report we describe the IDLAB top-scoring submissions for the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) in the supervised and unsupervised speaker verification tracks. For the supervised verification tracks we trained 6 state-of-the-art ECAPA-TDNN systems and 4 ResNet34-based systems with architectural variations. On all models we apply a large-margin fine-tuning strategy, which enables the training procedure to use higher margin penalties by using longer training utterances. In addition, we use quality-aware score calibration, which introduces quality metrics in the calibration system to generate more consistent scores across varying utterance conditions. A fusion of all systems with both enhancements applied led to first place on the open and closed supervised verification tracks. The unsupervised system is trained through contrastive learning. Subsequent pseudo-label generation by iterative clustering of the training embeddings allows the use of supervised techniques. This procedure led to the winning submission on the unsupervised track, and its performance is closing in on that of supervised training.
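Quality-aware score calibration can be illustrated as logistic-regression calibration with quality metrics appended to the raw score. The sketch below assumes per-side quality features such as log duration; the actual metrics and calibrator used in the submission may differ.

```python
# Hedged sketch of quality-aware calibration: quality metrics of both trial
# sides enter the calibrator alongside the raw verification score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(scores, qual_enroll, qual_test, target_labels):
    """scores: (N,) raw trial scores; qual_*: (N, Q) per-side quality metrics
    (e.g. log duration); target_labels: (N,) 1 for same-speaker trials."""
    X = np.column_stack([scores, qual_enroll, qual_test])
    return LogisticRegression().fit(X, target_labels)

def calibrate(model, scores, qual_enroll, qual_test):
    """Return calibrated, LLR-like scores for new trials."""
    X = np.column_stack([scores, qual_enroll, qual_test])
    return model.decision_function(X)
```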
The Speaker and Language Recognition Workshop (Odyssey 2014)
This paper presents a system to identify the spoken language in challenging audio material such as broadcast news shows. The audio material targeted by the system is characterized by a large range of background conditions (e.g. studio recordings vs. outdoor interviews) and a considerable number of non-native speakers. The designed model-based language classifier automatically identifies intervals of Flemish (Belgian Dutch), English or French speech. The proposed system is iVector-based, but unlike the standard approach it does not model the Total Variability. Instead, it relies on the original Joint Factor Analysis recipe by modeling the different sources of variability separately. For each speaker a fixed-length low-dimensional feature vector is extracted which encodes the language variability and the other sources of variability separately. The language factors are then fed to a simple language classifier. When assessed on a self-composed dataset containing 9 hours of monolingual broadcast news, 9 hours of multilingual broadcast news and 10 hours of documentaries, this classifier is found to outperform a state-of-the-art eigenchannel-compensated, discriminatively trained GMM system by up to 20% relative. A standard iVector baseline is outperformed by up to 40% relative.
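The final stage reduces to a simple classifier on the extracted language factors. A minimal sketch, assuming the JFA-style front-end has already produced a fixed-length language-factor vector per utterance (the classifier choice here is an assumption, since the abstract only says "a simple language classifier"):

```python
# Hedged sketch of the classification stage over precomputed language factors.
import numpy as np
from sklearn.linear_model import LogisticRegression

LANGUAGES = ["nl-BE", "en", "fr"]  # Flemish (Belgian Dutch), English, French

def train_language_classifier(language_factors, labels):
    """language_factors: (N, D) per-utterance language factors;
    labels: (N,) indices into LANGUAGES."""
    return LogisticRegression(max_iter=1000).fit(language_factors, labels)

def identify(clf, language_factors):
    """Map one or more language-factor vectors to language tags."""
    return [LANGUAGES[i] for i in clf.predict(np.atleast_2d(language_factors))]
```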