Text-Independent Speaker Verification Using Long Short-Term Memory Networks

Abstract—In this paper, an architecture based on Long Short-Term Memory (LSTM) networks is proposed for the text-independent scenario, aimed at capturing the temporal speaker-related information by operating over traditional speech features. For speaker verification, a background model must first be created for speaker representation. Then, in the enrollment stage, the speaker models are created based on the enrollment utterances. In this work, the model is trained in an end-to-end fashion that combines the first two stages; the main goal of end-to-end training is a model optimized to be consistent with the speaker verification protocol. The end-to-end training jointly learns the background and speaker models by creating the representation space. The LSTM architecture is trained to create a discriminative space for validating match and non-match pairs for speaker verification. The proposed architecture demonstrates its superiority in the text-independent setting compared to traditional methods.

I. INTRODUCTION

The main goal of Speaker Verification (SV) is to verify that a query utterance belongs to a claimed speaker by comparing it to the existing speaker models. Speaker verification is usually split into two categories: text-independent and text-dependent. Text-dependent covers the scenario in which all speakers utter the same phrase, while in text-independent no prior information is assumed about what the speakers are saying. The latter setting is much more challenging, as it can contain numerous variations of non-speaker information that can be misleading when extracting solely speaker information is desired.

Speaker verification, in general, consists of three stages: training, enrollment, and evaluation. In training, the universal background model is trained using the gallery of speakers. In enrollment, based on the created background model, new speakers are enrolled by creating speaker models; technically, the speakers' models are generated using the universal background model. In the evaluation phase, the test utterances are compared to the speaker models for identification or verification.

Recently, following the success of deep learning in applications such as biomedical analysis [1], [2], automatic speech recognition, image recognition and network sparsity [3]–[6], DNN-based approaches have also been proposed for Speaker Recognition (SR) [7], [8].

Traditional speaker verification models such as the Gaussian Mixture Model-Universal Background Model (GMM-UBM) [9] and i-vectors [10] have long been the state of the art. The drawback of these approaches is their unsupervised fashion, which does not optimize them for the verification setup. Recently, supervised methods have been proposed for adapting models to speaker verification, such as the one presented in [11] and the PLDA-based i-vector model [12]. Convolutional Neural Networks (CNNs) have also been used for speech recognition and speaker verification [8], [13], inspired by their superior power for action recognition [14] and scene understanding [15]. Capsule networks, introduced by Hinton et al. [16], have shown remarkable performance in different tasks [17], [18], and demonstrate the potential to be used for similar purposes.

In the present work, we propose the use of LSTMs over MFCC¹ speech features for directly capturing the temporal speaker-related information, rather than dealing with non-speaker information, which plays no role in speaker verification.

¹Mel Frequency Cepstral Coefficients
II. RELATED WORKS

There is a large literature on speaker verification; here we focus only on the research efforts based on deep learning. One of the successful earlier works in this direction is the use of Locally Connected Networks (LCNs) [19] for the text-dependent scenario. Deep networks have also been used as feature extractors for representing speaker models [20], [21]. As Convolutional Neural Networks [22] have successfully been used for speech recognition [23], some works use such architectures for speaker verification [7], [24]. The most similar work to ours is [20], in which LSTMs are used for the text-dependent setting. On the contrary, we use LSTMs for the text-independent scenario, which is the more challenging one, and investigate them in an end-to-end fashion.

III. SPEAKER VERIFICATION USING DEEP NEURAL NETWORKS

Here, we explain the speaker verification phases using deep learning. In different works, these steps have been adapted to the procedure proposed by the respective research efforts, such as the i-vector [10], [25] and d-vector systems [8].

A. Development

In the development stage, also called training, the speaker utterances are used for background model generation, which ideally should be a universal model for speaker representation. DNNs are employed due to their power for feature extraction: by using deep models, feature learning creates an output space that represents the speaker in a universal model.

B. Enrollment

In this phase, a model must be created for each speaker. For each speaker, by collecting the spoken utterances and feeding them to the trained network, different output features are generated for the speaker's utterances. From this point, different approaches have been proposed for integrating these enrollment features into a speaker model. The traditional one is aggregating the representations by averaging the outputs of the DNN, which is called the d-vector system [8], [19].
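As an illustration, the aggregation step can be sketched as follows, assuming a trained network embed() that maps a 40 × 100 feature window to a fixed-size embedding; the function name and the L2 normalization are illustrative assumptions, not details taken from the paper:

    import numpy as np

    def enroll_speaker(embed, enrollment_windows):
        """Build a d-vector style speaker model by averaging network outputs.

        embed: trained network mapping a 40 x 100 feature window to a 1-D
        embedding (name and normalization are illustrative assumptions).
        """
        d_vectors = np.stack([embed(w) for w in enrollment_windows])
        # L2-normalize so that no single utterance dominates the average.
        d_vectors /= np.linalg.norm(d_vectors, axis=1, keepdims=True)
        return d_vectors.mean(axis=0)   # the speaker model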
C. Evaluation

For evaluation, the test utterance is fed to the network and the output is the utterance representative. This representative is compared to the different speaker models, and the verification criterion is some similarity function. For evaluation purposes, the traditional Equal Error Rate (EER) is often used, which is the operating point at which the false reject rate and the false accept rate are equal.
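For reference, the EER can be computed from the match and non-match similarity scores by sweeping a decision threshold; a minimal sketch:

    import numpy as np

    def equal_error_rate(genuine_scores, impostor_scores):
        """EER: the operating point where false accept rate == false reject rate.

        Scores are similarities (higher means more likely a match).
        """
        thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
        best_gap, eer = np.inf, None
        for th in thresholds:
            frr = np.mean(genuine_scores < th)    # genuine pairs rejected
            far = np.mean(impostor_scores >= th)  # impostor pairs accepted
            if abs(far - frr) < best_gap:
                best_gap, eer = abs(far - frr), (far + frr) / 2
        return eer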
IV. MODEL

The main goal is to implement LSTMs on top of extracted speech features. The input to the model as well as the architecture itself is explained in the following subsections.
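Since each input is a 40 × 100 feature window (40 filterbank energies over 100 frames), the model can be viewed as an LSTM unrolled over the 100 frames. The following PyTorch sketch illustrates this idea; the layer sizes and the choice of the last hidden state as the utterance representation are illustrative assumptions, not the exact configuration of Section IV-B, and the input window must be transposed to (frames, features) before feeding:

    import torch
    import torch.nn as nn

    class SpeakerLSTM(nn.Module):
        """LSTM over the 100 frames of a feature window; the final hidden
        state is projected to a fixed-size utterance embedding."""
        def __init__(self, n_features=40, hidden_size=256, emb_dim=128):
            super().__init__()
            self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                                num_layers=2, batch_first=True)
            self.proj = nn.Linear(hidden_size, emb_dim)

        def forward(self, x):             # x: (batch, 100 frames, 40 features)
            _, (h_n, _) = self.lstm(x)    # h_n: (num_layers, batch, hidden_size)
            emb = self.proj(h_n[-1])      # last layer's final hidden state
            return nn.functional.normalize(emb, dim=1)  # unit-norm embedding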
A. Input

The raw signal is extracted, and 25 ms windows with 60% overlap are used for generating the spectrogram, as depicted in Fig. 1. By selecting 1 second of the sound stream, computing 40 log-energy filter banks per window, and performing mean and variance normalization, a feature window of 40 × 100 is generated for each 1-second utterance. Before feature extraction, voice activity detection is applied to the raw input to eliminate silence. Derivative features were not used, as they yielded no improvement in our empirical evaluations. For feature extraction, we used the SpeechPy library [26].

Fig. 1. The feature extraction from the raw signal.
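This front end can be sketched with SpeechPy as follows; the call signatures follow the SpeechPy documentation, but the exact parameters should be treated as illustrative, and silence is assumed to be already removed by voice activity detection:

    import speechpy  # feature-extraction library used in the paper [26]

    def extract_features(signal, fs=16000):
        """1-second raw waveform -> (~100, 40) normalized log filterbank energies.

        25 ms frames with 60% overlap give a 10 ms stride, i.e. ~100 frames
        per second; the paper's 40 x 100 window is this matrix transposed.
        """
        energies = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                                         frame_length=0.025, frame_stride=0.010,
                                         num_filters=40)
        # Mean and variance normalization over the utterance.
        return speechpy.processing.cmvn(energies, variance_normalization=True)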
possible gradient explosion [34]. It has been shown that effective pair selection can drastically improve verification accuracy [35]. Speaker verification is performed using the protocol consistent with [36], for which the identities whose names start with E are used for evaluation.

Algorithm 1: The utilized pair selection algorithm for selecting the main contributing impostor pairs.
    Update: freeze the weights;
    Evaluate: feed the input data and get the output distance vector;
    Search: return the max and min distances for the match pairs: max_gen & min_gen;
    Thresholding: calculate th = th0 × (max_gen / min_gen);
    while impostor pair do
        if imp > max_gen + th then
            discard;
        else
            feed the pair;
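A Python sketch of Algorithm 1, assuming the pair distances have already been computed under the frozen model; the names genuine_dists, impostor_dists, and th0 are illustrative:

    import numpy as np

    def select_impostor_pairs(genuine_dists, impostor_dists, th0=1.0):
        """Discard impostor pairs that are too easy to contribute to training.

        genuine_dists: distances of match pairs under the frozen model.
        impostor_dists: distances of candidate impostor pairs.
        th0 is an illustrative initial threshold; min_gen must be nonzero.
        """
        max_gen, min_gen = np.max(genuine_dists), np.min(genuine_dists)
        th = th0 * max_gen / min_gen          # Thresholding step of Algorithm 1
        # Keep (feed) only pairs with imp <= max_gen + th; discard the rest.
        return [d for d in impostor_dists if d <= max_gen + th]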
A. Baselines

We compare our method with different baseline methods. The GMM-UBM method [9] is the first candidate; MFCC features with 40 coefficients are extracted and used, and the Universal Background Model (UBM) is trained using 1024 mixture components. The i-vector model [10], with and without Probabilistic Linear Discriminant Analysis (PLDA) [37], has also been implemented as a baseline.
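For reference, a UBM of this kind can be sketched with scikit-learn's GaussianMixture; the paper does not specify its toolkit or covariance type, so both are assumptions, and the component count is reduced to keep the example light:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Frames pooled over all development speakers (stand-in random data here).
    background_frames = np.random.randn(10000, 40)   # (n_frames, 40 MFCCs)

    # The paper trains 1024 mixture components; a small diagonal-covariance
    # UBM is used here only to keep the sketch fast.
    ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=100)
    ubm.fit(background_frames)

    # Average per-frame log-likelihood of a test utterance under the UBM.
    test_frames = np.random.randn(100, 40)
    score = ubm.score(test_frames)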
The other baseline is the use of DNNs with locally-connected layers, as proposed in [19]. In this d-vector system, after the development phase, the d-vectors extracted from the enrollment utterances are aggregated to generate the final representation. Finally, in the evaluation stage, the similarity function determines which speaker model is closest to the d-vector of the test utterance.
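The scoring step can be made concrete with cosine similarity, a common choice for d-vector systems; the paper does not name its similarity function, so this is an assumption:

    import numpy as np

    def verify(test_dvector, speaker_model, threshold):
        """Accept the claimed identity if the cosine similarity is high enough."""
        score = np.dot(test_dvector, speaker_model) / (
            np.linalg.norm(test_dvector) * np.linalg.norm(speaker_model))
        return score >= threshold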
B. Comparison to Different Methods

Here we compare the baseline approaches with the proposed model, as provided in Table I. We utilized the architecture and the setup discussed in Section IV-B and Section IV-C, respectively. As can be seen in Table I, our proposed architecture outperforms the other methods.

TABLE I
COMPARISON OF THE PROPOSED ARCHITECTURE WITH THE BASELINE METHODS (EER %).

    Model                         EER
    GMM-UBM [9]                   27.1
    I-vectors [10]                24.7
    I-vectors [10] + PLDA [37]    23.5
    LSTM [ours]                   22.9

C. Effect of Utterance Duration

One of the main advantages of the baseline methods such as [10] is their ability to capture robust speaker characteristics from long utterances. As demonstrated in Fig. 3, our proposed method outperforms the others for short utterances, considering we used 1-second utterances. However, a fair comparison for longer utterances is worthwhile as well. In order to have a one-to-one comparison, we modified our architecture to feed and train the system on longer utterances. In all experiments, the duration of the utterances utilized for development, enrollment, and evaluation is the same.

Fig. 3. The effect of the utterance duration (EER).

As can be observed in Fig. 3, the superiority of our method holds only for short utterances; for longer utterances, the traditional baseline methods such as [10] are still the winners, and LSTMs fail to effectively capture inter- and intra-speaker variations.