Self-Supervised Learning With Cluster-Aware-DINO For High-Performance Robust Speaker Verification

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.
8, AUGUST 2015 1
Self-Supervised Learning with Cluster-Aware-DINO

for High-Performance Robust Speaker Verification
Bing Han, Student Member, IEEE, Zhengyang Chen, Student Member, IEEE,
and Yanmin Qian, Senior Member, IEEE
Abstract—Automatic speaker verification task has made great so on, to achieve excellent performance compared with tra-
achievements using deep learning approaches with the large- ditional methods such as Gaussian Mixture Model-Universal
scale manually annotated dataset. However, it’s very difficult
arXiv:2304.05754v1 [cs.SD] 12 Apr 2023
Background Model (GMM-UBM) [12], i-vector [13]. How-

and expensive to collect a large amount of well-labeled data for
system building. Recently, self-supervised speaker verification has ever, all these methods are based on fully-supervised training
attracted a lot of interest by the reason of its no-dependency on and usually require large amounts of training data with accu-
labeled data. In this article, we propose a novel and advanced self- rate human annotations, while as we know that the collection
supervised learning framework which can construct a very strong of large-scale well-labeled data is actually very difficult and
speaker verification system with high performance without using expensive.
any labeled data. To avoid the impact of false negative pairs from
the contrastive-learning based self-supervised learning, we adopt To reduce the high dependency on labeled data, recently
the self-distillation with no labels (DINO) framework as the initial self-supervised learning has attracted a lot of interest and some
model, which can be trained without exploiting negative pairs. researchers are focusing on applying it to speaker verification
Then, we further introduce a cluster-aware training strategy for tasks. Inspired by the great success of speech pre-trained
DINO to improve the diversity of data. In the iteration learning models, e.g. wav2vec 2.0 [14] and HuBERT [15], in auto-
stage, due to a mass of unreliable labels from unsupervised
clustering, the quality of pseudo labels is important for the system matic speech recognition (ASR) tasks, some researchers [16]
performance. This motivates us to propose dynamic loss-gate and tried to extract the universal speech representation to fine-
label correction (DLG-LC) methods to alleviate the performance tune on SV task directly. Since these pre-trained models
degradation caused by unreliable labels. More specifically, we are trained without explicit speaker information, the results
model the loss distribution with Gaussian Mixture Model (GMM) of simply fine-tuning are not ideal. In the work [17], the
and obtain the loss-gate threshold dynamically to distinguish
the reliable and unreliable labels. Besides, we adopt the model speech representation learned from large-scale unlabeled data
predictions to correct the unreliable label, for better utilizing the were explored to replace the acoustic features, and then the
unreliable data rather than dropping them directly. Moreover, normal supervised deep model was trained as usual. Although
we extend the DLG-LC from single-modality to multi-modality promising performance has been obtained, it still requires
on the audio-visual dataset to further improve the performance. labeled data for training and the parameter size is unacceptable
The experiments are performed on the commonly used Voxceleb
dataset. Compared to the best-known self-supervised speaker for real applications due to the large pre-trained model.
verification system, our proposed method obtain 22.17%, 27.94% To take full advantage of the large-scale unlabeled data,
and 25.56% relative EER improvement on Vox-O, Vox-E and inspired by text-to-speech (TTS) task, a generative method
Vox-H test sets, even with fewer iterations, smaller models, and has been investigated in [18] to separate speaker represen-
simpler clustering methods. More importantly, the newly pro- tation with the help of phone information. Subsequently,
posed self-supervised learning system even achieves comparable
results with the fully supervised system on Voxceleb dataset, but some researchers came up with a hypothesis that speech
without using any human labeled data. segments truncated from the same utterance belong to the
same speaker while those from different utterances belong
Index Terms—self-supervised speaker verification, cluster-
aware dino, dynamic loss-gate, label correction, multi-modality to different speakers, which is approximately true. Based on
this hypothesis, many efforts [19], [20], [21], [22], [23] have
been made to obtain discriminative speaker representations
by maximizing information between different segments from
I. I NTRODUCTION the same utterance via contrastive-learning. Then, inspired
EAKER verification (SV) is a task that utilizes speech by [24], an iterative learning framework [25] was developed to
S as the biometric feature to verify the speakers’ identities.
Recently, deep learning methods have been widely applied for
further improve the performance of self-supervised SV system.
This state-of-the-art system usually consists of two stages. In
speaker verification (SV) tasks and many efforts have been the first stage, contrastive-learning based objective function is
made such as various model architecture [2], [3], [4], [5], [6], applied to train a speaker encoder. In stage II, it adopts the
training objection [7], [8], [9], pooling methods [10], [11] and pre-trained model in stage I to estimate the pseudo labels by
clustering and then uses them as the supervised signal to train a
Part of the results have been presented at Interspeech 2022 [1]. new encoder. This process is performed iteratively to improve
All the authors are with the X-Lance Lab, Department of Computer the performance continuously.
Science and Engineering & MoE Key Laboratory of Artificial Intelligence,
AI Institute, Shanghai Jiao Tong University, Shanghai, 200240 P. R. China This two-stage framework has obtained excellent perfor-
(e-mail:{hanbing97, zhengyang.chen, yanminqian}@sjtu.edu.cn) mance [26], [27], [28], [29], [30], but there are many short-
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2
comings, which restrict the further improvement of the sys- single-modality to audio-visual multi-modality. Multi-
tem performance. For contrastive-learning methods in stage modal data utilize multi-modal knowledge and make
I, speech segments cropped from different utterances are reliable label selection more efficient.
regarded as negative pairs to be pushed away from each other 4) With these strategies, we achieve a great performance
in speaker space. However, different utterances may belong leap compared with the state-of-the-art (SOTA) sys-
to the same speaker in the real situation, which shows that tem with self-supervised learning nowadays, even with
this inaccurate assumption might make some mistakes. For the fewer iterations, smaller models, and simpler clustering
second iterative stage, [24], [25] have proved that many pseudo methods. More promisingly, this newly proposed self-
labels generated by the clustering algorithm lack reliability, supervised learning framework can approach the current
which would confuse and degrade the model. Hence, the SOTA of the fully supervised system, and achieve com-
key to improving the performance is finding a way to select parable performance.
high-quality pseudo labels. Based on this hypothesis, in [29],
they observed that the data with lower loss is more reliable II. S ELF -S UPERVISED L EARNING FOR S PEAKER
than those with unreliable labels, and then proposed a loss- V ERIFICATION
gate learning strategy to distinguish the reliable labels from
In this section, the commonly utilized two-stage self-
unreliable ones by setting a loss threshold. Only the data
supervised speaker verification framework is reviewed, includ-
whose loss is under the threshold can be used to update the
ing the first contrastive-learning stage for pre-trained model
network. Although this approach led to further improvements,
and the second iterative learning stage.
the manually set thresholds in each iteration are not flexible,
and data with unreliable labels is not fully utilized.
In this paper, we propose several new strategies for self- A. Contrastive based Self-Supervised Speaker Verification
supervised learning speaker verification. Firstly, we introduce Self-supervised learning (SSL) is a type of unsupervised
the DINO (distillation with no labels) [31] in the first pretraining manners, which can design pretext or proxy and learn
training stage, which is only based on maximizing the sim- the representations from the data itself. Common SSL methods
ilarity between the augmented segments pairs sampled from can be roughly divided into two classes: generative [18] and
the same utterance. To minimize channel and environmental contrastive [35], [36] methods. Based on the hypothesis that
impacts and increase data’s diversity, we propose a cluster- segments sampled from the same utterance belong to the
aware training strategy for DINO to further improve its per- same speaker while those from different utterances come
formance. In the second iterative stage, we model the loss from different speakers, most studies of SV tasks focus on
distribution data using GMM with two components, in which contrastive learning approaches. Among them, SimCLR [37]
each component represents the data with reliable labels or un- is one of the most popular contrastive learning frameworks.
reliable labels. Then, the dynamic loss-gate (DLG) threshold, Its basic idea is that minimize the distance between the
computed with the estimated GMM, is used to distinguish the representations of augmented segments cropped from the same
two types of labels, which is more flexible than the manually utterance as well as maximize negative pairs from different
tuned threshold. Besides, inspired by semi-supervised learning utterances. Besides, MoCo [38] framework provides further
works [32], [33], [34], we propose label correction (LC) to performance gain through a dynamic dictionary with a queue
leverage the model’s prediction as target label and use it to and a moving-averaged encoder. Based on these frameworks,
correct the unreliable pseudo label, instead of discarding the many works such as equilibrium learning [23], augmentation
unreliable data directly [29]. Finally, we incorporate multi- adversarial training [20], channel-invariant training [21], proto-
modality into the above proposed DLG-LC strategy and clus- type momentum [22] are proposed to learn more discriminative
tering step. Benefiting from the complementary audio and speaker representation.
visual information of different modalities, DLG-LC can select
the data with reliable labels more effectively.
B. Iterative Framework for Self-Supervised Speaker Verifica-
The main contributions of this paper are summarized as
tion
follows:
1) The DINO framework is introduced as the self-supervised Considering that the assumption of contrastive learning can
learning framework to obtain the initial pre-trained naturally introduce label error and might degrade the model,
model, which is negative-pairs free to avoid the impact of in [25], they proposed an iterative, self-evolving framework
the false negative pairs. In addition, cluster-aware training to further improve the performance of self-supervised speaker
strategy is designed to enhance DINO and it can improve verification systems. This framework is mainly divided into
the diversity of data and obtain better performance. two stages, and they are illustrated as follows:
2) To select the high-quality data more effectively and • Stage I: Pre-training
flexibly in the second iterative stage, dynamic loss-gate 1) Use contrastive learning or other self-supervised learn-
(DLG) is developed which can determine the loss-gate ing methods to pre-train a speaker encoder as the initial
threshold dynamically to select the data with reliable model.
labels. Meanwhile, label correction (LC) is also adopted 2) With the pre-trained model, extract the speaker embed-
to further improve the results. dings for the training set and then apply a clustering
3) The DLG-LC method is further extended from audio algorithm to assign pseudo labels.
Fig. 1: Framework of distillation with no label (DINO) for self-supervised speaker representation learning
• Stage II: Iterative training and pseudo labeling. multi-crop strategy [40], and the long segments can be used
1) Train a new encoder with the pseudo labels generated to extract more stable speaker embedding. It is notable that
by the previous step. when sampling, these segments should overlap as little as
2) Perform a clustering algorithm to update pseudo labels possible. Same as the previous work [20], [21], [22], [23],
with the new encoder. we still obey the assumption that the segments cropped from
3) Repeat stage II several times until the model converges. the same utterance belong to the same speaker and then apply
Although this framework requires high computing resources different kinds of data augmentation on them by adding noise
due to the several iterations, it is widely used in [26], [27], or room impulse response for robust performance. Unlike
[39], [28], [29] for its advanced performance. In addition, SimCLR [37], which only uses one encoder to do contrastive
this framework is extended to the audio-visual dataset in [30] learning, our model consists of not only a student encoder
and achieves better performance with the help of multi-modal but also a momentum teacher encoder whose architecture is
information in the clustering algorithm. similar as knowledge distillation [41]. After augmentation,
all segments pass through the student while only the long
III. C LUSTER -AWARE -DINO FOR S PEAKER V ERIFICATION segments pass through the teacher, thus encouraging the short-
to-long correspondences by minimizing the cross-entropy H(·)
For contrastive-learning based methods in previous work,
between two distributions as the following Equation.2:
they shared the same assumption that segments in a batch
belong to different speakers. But this assumption does not hold
X X
Lce = H(Pt (x) | Ps (x0 )) (2)
all the time because repeat speakers might appear in the same
x∈{xl1 ,xl2 } x0 ∈{xl1 ,xl2 ,xs1 ,...,xs4 }
batch. Taking the statistics on Voxceleb 2 as an example, we
can compute the probability of repeat speakers on Voxceleb 2 where output distributions of momentum teacher network fθt
by Equation.1 and the results are listed in Table. I. and student network fθs are denoted by Pt and Ps respectively.
And P can be computed by using a softmax function to
ANS S!
prepeat (S, N ) = 1 − =1− N (1) normalize the output:
S N S (S − N )!
fθs (x)
where S is the speaker number in training set and N is batch Ps (x) = Sof tmax( ) (3)
s
size.
where s > 0 is the temperature parameter that can control
TABLE I: The probability of repeat speaker in a batch the sharpness of the output distribution. Similarly, there is a
Batch Size 16 32 64 128 256
formula holds for Pt with temperature t > 0, too. Moreover,
a mean computed over batches is used for centering teacher
Probability 0.020 0.080 0.286 0.745 0.996
model’s output distribution. During the training, both sharp-
ening and centering are applied to avoid trivial solution [31].
According to the table, the larger batch size leads to a higher The teacher and student own the same architecture but with
probability of repeating which will cause a bad impact on the different parameters due to the different update methods. The
model. We can use a small batch size to alleviate this problem, student is updated by gradient descent while the teacher is
but it will degrade the performance [38]. updated by the exponential moving average (EMA) of the
student’s parameters. EMA’s update rule is:
A. DINO based Self-Supervised Learning
θt ← λθt + (1 − λ)θs (4)
To tackle this problem, negative-pairs free DINO [31] is
introduced to self-supervised speaker verification task, and the where λ is adjusted by a cosine schedule [42] from 0.996
whole framework is shown in Fig. 1. to 1 during training. Speaker embeddings are extracted by
Firstly, 4 short {xs1 , xs2 , xs3 , xs4 } and 2 long segments Encoders. Then speaker embeddings are fed into the Projec-
{xl1 , xl2 } are randomly sampled from an utterance using a tion Head which contains a 3-layers perceptron with hidden
dimension 2048 followed by `2 normalization and a weight process will enter the next stage. In the second stage, the
normalized fully connected layer with K dimensions. The clustering algorithm is performed using the extracted speaker
whole architecture is similar to [40]. embeddings and we assume that utterances in the same cluster
In addition, a cosine-based consistency loss is added to belong to the same person. As shown in Fig. 2, the positive
ensure that the speaker embedding is encoded into cosine pairs are sampled from several utterances belonging to the
space which is more suitable for the scoring and clustering in same cluster rather than a single utterance. These pairs come
the following. It works by maximizing the cosine similarity from the same speaker, but with different speaking contents
among the embeddings extracted from the same speaker. and channels, which leads to a high data diversity and makes
Finally, the total loss is summarized with coefficient α: the model pay more attention to the speaker’s information
X X e · e0 instead of irrelevant factors. Considering the resource con-
Ldino = Lce +α (1− ) (5) sumption of extracting the embedded speaker, the clustering
kek ke0 k
l l 0 l l s s
e∈{e1 ,e2 } e ∈{e1 ,e2 ,e1 ,...,e4 } operation will be done every few rounds.
where e represents the extracted speaker embedding from IV. I TERATIVE L EARNING WITH DYNAMIC L OSS - GATE
encoder. AND L ABEL C ORRECTION
Based on the proposed CA-DINO self-supervised learning,
B. Cluster-Aware Training on DINO we then apply the iterative learning framework [25] to further
For traditional DINO, all segments are sampled from the improve the performance of self-supervised SV. During the
same utterance to form positive pairs. Limited by the duration iterative process, a serious problem is that the generated
of the utterances, these segments usually have a great degree pseudo labels contain a lot of noises which will confuse and
of overlaps. As mentioned above, the optimization of DINO degrade the network. Considering this limitation, several works
is encourage short-to-long correspondences by minimizing have been done to select high-quality pseudo labels. In [25], an
the cross-entropy between two distributions of positive pairs. aggressive training method is applied to purify the labels using
Because there are a lot of overlapped parts in the segments, the clustering confidence but achieves minor profit. In [29], they
model might tend to pay more attention to the content, channel conducted a toy experiment and observed that data with lower
and other irrelevant information of the overlapped parts, and loss is more reliable. Then, they propose a loss-gate (LG)
ignore the speaker information in the audio. Although we will strategy to select the data with lower loss by setting a fixed
add different types of data augmentation to segments, the data threshold and only use these data to update the model. With
still lacks diversity which could lead the model optimization the LG strategy, the system achieved obvious improvement,
in the wrong direction. but the threshold setting in this method is heavily dependent
on human experience, and the unreliable data are not fully
utilized.
In this section, we will introduce our proposed DLG-LC
to adjust the loss-gate threshold dynamically and correct the
unreliable pseudo label to fully utilize the data, and then this
DLG-LC approach is extended to utilize the multi-modality
for further improvements.
A. Dynamic Loss-Gate
In order to determine an appropriate loss-gate threshold,
we implemented the LG learning and visualized to analyze
the distribution of loss values on Voxceleb 2 [43] dataset.
Fig. 2: Difference between traditional DINO and cluster- The histogram of loss values is provided in Fig. 3. According
aware training DINO. (a) Traditional DINO:long and short to the figure, there exist two sharp peaks in the distribution
segments are sampled from the same utterance to compose obviously. And similar experiments conducted in [44] have
the positive pairs. (b) Cluster-aware training DINO: through shown that these data with reliable and unreliable labels can
simple clustering algorithm, we consider that the same speaker be represented by two peaks respectively. If we can find a way
in the same cluster shares the same identity and segments are to model the distribution, then the loss-gate threshold can be
cropped from the corresponding cluster. determined dynamically as the loss distribution varies, which
can avoid laborious manual tuning.
In order to reduce the overlaps of segments and increase Gaussian distribution is an important continuous probability
the diversity of data, we propose a clustering-aware (CA) distribution of real-valued random variables, whose general
training strategy for DINO while maintaining the original form of the probability density function is defined in Equa-
assumptions as much as possible, which is named CA-DINO tion.6.
in the following. We divide model training into two stages. In 1 1 x−µ 2
N µ, σ 2 = √ exp(− (

) ) (6)
the early stage of training, we optimize the model according to σ 2π 2 σ
the traditional DINO strategy. When the model has the ability where location parameter and scale parameter are denoted by
to extract discriminative speaker representation, the training µ and σ respectively. Gaussian distribution’s shape is like a
Algorithm 1: The proposed Dynamic Loss-Gate and

Label Correction
Input: mini-batch Dm = {(x1 , x2 , y)}n i=1 ; two threshold τ1 and
τ2 ; Network g(·); sharpness factor c
Output: the loss of the mini-batch
1 for (x1 , x2 , y) ∈ Dm do
2 xclean , xaug = x1 , augment(x2 ) # augment one segment
3 pclean , paug = g(xclean ), g(xaug ) # output distribution
4 Compute the AAM-softmax loss lclean and laug according the
pseudo label y
5 Record the lclean value
6 if lclean < τ1 then
7 return laug # pseudo label y is reliable
8 else
9 if max(pclean ) > τ2 then
10 pclean
ˆ = sharp(pclean , c ) # sharp the distribution
11 compute the cross-entropy l between pclean ˆ and paug
Fig. 3: Loss distribution of Loss-gate (LG) learning [29] on 12 return l
Voxceleb 2 [43]. Loss value is scaled by log function, and the 13 else
14 return 0 # prediction isn’t reliable
lines are estimated by GMM with two components. 15 end if
16 end if
17 end for
bell, with low on both sides and high in the middle, which 18 After one epoch, re-estimate the GMM on the recorded loss values
and then update the τ1
is very similar to the “peaks” of loss in Fig. 3. In this case,
Gaussian Mixture Model (GMM) with two components can be
applied to model the loss distribution of reliable and unreliable
samples respectively: that we can utilize the unreliable data effectively. Researchers
in [46] have indicated that the network is capable of clustering
p(x) = λ1 N µ1 , σ12 + λ2 N µ2 , σ22

(7) noisy samples into their correct classes. To leverage this abil-
ity, we hypothesize that the output prediction of the model is
where λ1 and λ2 represent the weights for two Gaussian
more reliable than pseudo labels generated by clustering. Thus
components. After fitting, the fitted curves are plotted in
the predicted posterior probability is regarded as the target
Fig. 3, it’s obvious to find that the two weighted Gaussian
labels and incorporated into the objective loss function to
components can be used to approach these two “peaks”. Then,
prevent the model from fitting into inaccurate labels. However,
by computing the loss values whose probabilities belonging
not all prediction labels are suitable for training. Inspired
to the two components are equal, the loss-gate threshold τ1
by [47], [33], we assume that the prediction label owns high
can be obtained easily to distinguish between the reliable and
confidence if the model assigns a high probability to one of the
unreliable data:
possible classes. Then, another threshold τ2 is introduced to
τ1 : p1 (τ1 ) = p2 (τ1 ) (8)
retain the prediction whose probability of largest class is above
where p1 (x) = λ1 N µ1 , σ1 and p2 (x) = λ2 N µ2 , σ22 .
2 τ2 , and the label correction loss is defined as the following
For each epoch, all loss values are recorded for re-estimating Equation 10:
the parameters of GMM, so τ1 can be tuned dynamically N
1li >τ1 ,max(pˆi )>τ2 H(pclean
X
according to the current training condition. LLC = ˆ | paug ) (10)
Our DLG introduces this dynamical loss-gate threshold i=1
τ1 into the speaker classification loss function ArcMargin
where paug represents the output probability of augmented
Softmax (AAM) [45] to select the data and only these retained
segments and pcleanˆ represents their corresponding clean
data with losses under the threshold are used to update the
version. H(·) here denotes the cross-entropy loss function be-
parameters of the network.
tween two probability distributions. In addition, to encourage a
N peaky distribution, a sharpening operation is applied on pclean
ˆ
es(cos(θyi ,i +m))
1li <τ1 log
X
LDLG = (9) with sharpness factor c which is described in Equation. 3.
i=1
Z
Then, the DLG loss and LC loss are combined to optimize
Pc
where Z = es(cos(θyi ,i +m)) + j=1,j6=i es(cos(θyi ,i )) , θj,i is the speaker model as Equation. 11.
the angle between the column vector Wj and embedding xi . L = LDLG + LLC (11)
s is the scaling factor and m is hyperparameter to control
the margin. AAM can enforce larger gaps between the nearest More specifically, the pseudo-code for describing the flow
speakers and is widely adopted in speaker recognition tasks. of the DLG-LC algorithm is provided in detail and shown in
Algorithm. 1.
B. Label Correction
For those unreliable data with large losses, it’s wasteful to C. Incorporate with Multi-Modality
drop them away directly. Therefore, we propose the label cor- The researchers in [30] have introduced the multi-modality
rection (LC) strategy to correct pseudo labels dynamically so information into the data clustering step to generate more
accurate pseudo labels in self-supervised speaker verification. we excluded some utterances with the video missing in the
In our work, considering the feature that audio and visual from data set. Then, the final audio-visual training set comprises
the same video share the same speaker identity, we also try to 1, 091, 251 utterances among 5, 994 speakers, extracted from
add the visual modality to our DLG-LC method for better data YouTube.
utilization, hoping to obtain further improvement. Our fusion For the evaluation, we report the experimental results on
of visual information is mainly divided into two aspects: one is 3 trials as defined in [43]: the Original, Extended, and Hard
to use visual information to help DLG-LC select more reliable Voxceleb test sets. Vox-O is the original test set of Voxceleb 1
data, and the other is to make clustering results better during contains 37, 720 trials from 40 speakers. Vox-E is an trial list
data clustering. which (using the entire dataset) contains 581, 480 trials from
1) multi-modal based DLG-LC: Different from the single- 1251 speakers. Vox-H is a hard evaluation list consisting of
modal DLG-LC, our strategy of selecting reliable data has 552, 536 pairs sampled from 1190 speakers in Voxceleb 1, all
been slightly adjusted. For multi-modal data, we will use two of which are from the same nationality and gender.
independent encoders to encode audio and visual data. Then,
through recording the loss values, we can obtain two loss- B. Metrics
gate thresholds for audio and visual respectively. For an audio- The main metrics adopted in this paper are (i) Equal Error
visual instance, it can be regarded as a reliable label only if Rate (EER) which is the error rate when both acceptance and
its loss values are both under these two loss-gate thresholds, rejection rates are equal, and (ii) the normalized minimum
and then we will optimize this instance with AAM softmax Detection Cost Function (minDCF) which is defined by Equa-
which is defined as Equation. 9. tion. 12 :
For unreliable data, the multi-modal label correction will
Cdet = Cmiss × Pmiss × Ptar + Cf a × Pf a × (1 − Ptar ) (12)
be performed on it. First, we compare whether the predicted
labels of the two modal networks are consistent. If the where we set the prior target probability Ptar as 0.01 and equal
predictions of the two models belong to the same class, it weights between misses Cmiss and false alarms Cf a . Both
indicates that the accuracy of the prediction is relatively high. EER and minDCF are commonly used as evaluation metrics
Unlike single-mode label correction, which uses soft labels for speaker verification systems.
for training, our output is verified by multi-modal, which has
higher reliability. As a result, we use the “hard” labels (i.e. C. Data Augmentation
the arg max of the model’s distribution) as labels to optimize 1) Audio: To generate extra training samples and increase
the models by AAM softmax. If the network disagrees with the diversity of data, we perform online data augmentation
the predicted labels, then we use the soft labels to optimize strategy [49] by adding background noise or convolutional re-
models respectively. verberation noise from MUSAN [50] and RIR dataset [51] re-
2) multi-modal based data clustering: In the previous spectively. The noise types in MUSAN include ambient noise,
training step, the multi-modal information was only used to music, television, and babble noise for the background additive
select reliable data, and the models of the two modalities noise. We can obtain augmented data by mixing the noise
were not structurally related. As a result, we can obtain with the original speech in time-domain waveform directly
audio ga (·) and visual encoders gv (·) independently. Given and the signal-to-noise ratios (SNR) are randomly applied
a dataset with audio xa and visual modality xv , we can use between 5 to 20 dB. For the reverberation, the convolution
trained encoder to extract audio embedding ea and visual operation is performed with 40,000 simulated room impulse
embedding ev respectively. Considering that the audio and responses (RIR) [51]. After applying the augmentation, we
visual embeddings contain complementary information from normalize the waveform value for stable training. We used
different modalities, we apply an additional clustering on the 80-dimensional log Mel filter-bank energies with 25ms length
joint representation eav = (ea , ev ), which is formed as the Hamming windows and 10ms window shift as the acoustic
concatenation of audio and visual embeddings. With the joint features, while no voice activity detection (VAD) is involved
operation, the representation will be more discriminative and in our experiments.
the cluster will be more robust. Then, pseudo labels for the 2) Visual: For each video segment in VoxCeleb 1 & 2
next iteration will be generated by k-means on these audio- datasets, images are extracted at one frame per second. Then,
visual joint embeddings. we align the faces in extracted frames using the landmarks
predicted by MTCNN [52] and after that, the similarity
V. E XPERIMENTS S ETUP transformation is used to map the face region to the same
shape (3 × 112 × 96). In order to better extract visual features,
A. Dataset we convert the image to the most common size of the
The experiments are conducted on Voxceleb [48], [43] model (3 × 224 × 224). And in the following, several data
which is a large-scale audio-visual dataset for the speaker augmentation strategies including random color distortion,
recognition task. For the model training in stage I and II random horizontal flipping, random grey scaling, and random
of self-supervised learning, we adopt the development set of Gaussian blur are applied to the original images with a certain
Voxceleb 2 [43] for training the networks, and no speaker probability. Finally, we normalize the pixel value of each
identity information is used during this process. Because we image to the range of [-0.5, 0.5] before feeding it into the
introduced visual features into the iterative learning stage, model.
D. CA-DINO Setup TABLE III: Model Architecture of audio encoder ECAPA-

TDNN [53]. C (kernal size, channels) denotes the convo-
1) DINO: For DINO, considering the training time and
lutional 1D layer. F is the dimension of the input acoutic
memory limitation, we adopt ECAPA-TDNN [53] as an audio
features which is determined by the number of frequency bins
encoder to learn discriminative speaker representation, which
of the Mel spectrogram. T relates to the frames of the speech
is a time-delay neural network (TDNN) [3] based backbone
segments.
with emphasized channel attention, propagation, and aggrega-
tion. It employs a channel- and context- dependent attention Layer Structure Output Size
mechanism [54], Multi-layer Feature Aggregation (MFA), as Input - F ×T
well as Squeeze-Excitation (SE) [55] and residual blocks. The Conv1D C(5, 512) 512 × T
model architecture of ECAPA-TDNN is shown in Table.III. C(1, 512)
For each utterance, two long (3 seconds) and four short (2 SE-Res2Block 1 C(3, 64) × 8, dilation 2 512 × T
seconds) segments are randomly cropped and regarded as C(1, 512)
C(1, 512)
positive pairs. It is worth noting that all the segments will be SE-Res2Block 2 C(3, 64) × 8, dilation 3 512 × T
applied data augmentation, and after that, they are encoded C(1, 512)
into 192-dimensional speaker embeddings by the encoder. C(1, 512)
SE-Res2Block 3 C(3, 64) × 8, dilation 4 512 × T
Similar to the configuration in [31], the K in the DINO C(1, 512)
projection head is set as 65, 536. Temperatures for the teacher Conv1D C(1, 1536) 1536 × T
t and the student s are 0.04 and 0.1 respectively. In addition, Pooling Layer Attentive Stat Pooling 3072 × 1
we set cosine loss weight α as 1.0 to balance two losses. The Embedding - 192
whole training process will last 150 epochs. Model parameters
are updated using stochastic gradient descent (SGD) algorithm
with weight decay 5e-5. The learning rate is linearly ramped the number of real speakers as the number of clusters, we
up from 0 to 0.2 in the first 20 epochs, and then it decays to choose 7500 as the cluster number to verify the robustness
1e-5 with the cosine schedule [42]. Moreover, the momentum of our method. For label correction, sharpen parameters c
also follows the cosine schedule from 0.996 to 1.0. and threshold τ2 are set as 0.1 and 0.5 respectively. The
2) Cluster-Aware Training: For cluster-aware training strat- learning rate decays from 0.1 to 5e-5 exponentially and we
egy, we train the model normally in the first 90 epochs. After set momentum and weight decay as 0.9 and 1e-4. Finally, the
that, k-means based clustering algorithm is applied on the training process will last 100 epochs.
whole training set every 5 epochs, which is supported by 2) Multi Modality: For audio-visual based DLG-LC, ex-
faiss library [56]. The results of clustering are used for the cept for the addition of an image encoder, other configura-
generation of training data. Positive pairs are sampled from tions are consistent with the single-modal. We employ the
utterances belonging to the same cluster rather than the single ResNet34 [57] as the backbone network for the visual encoder,
one. which is similar to the recent works [58], [59]. More detail is
shown in Table. II.
TABLE II: Model Architecture of visual encoder
ResNet34 [57]. C (kernal size, channel) denotes the
VI. E XPERIMENTAL R ESULTS
convolutional 2D layer. [·] represents the residual block and
L is the image size of input. The experiments are performed in six parts. In section VI-A,
performance comparison of proposed Cluster-aware DINO
Layer Structure Output Size with previous works in stage I are reported, and we discuss
Input - 3×L×L how the number of clusters affects the cluster-aware training
Conv2D C(3 × 3, 32) 32 × L × L strategy. In section VI-B, we report the speaker verification
performance of CA-DINO finetuned on the small-scale labeled

C(3 × 3, 32) L L
Residual Block 1 × 3, stride 2 32 × ×
C(3 × 3, 32)
2 2
data. In section VI-C, an ablation study of our proposed
C(3 × 3, 64) L L DLG-LC is given to demonstrate its effectiveness. Then, sec-
Residual Block 2 × 4 , stride 2 64 × ×
C(3 × 3, 64)
4 4
C(3 × 3, 128) L L
tion VI-D and section VI-E show that the proposed dynamic
Residual Block 3 × 6 , stride 2 128 × × loss-gate and label correction can improve the performance
C(3 × 3, 128)
8 8
C(3 × 3, 256) L L under both single modal and multi-modal scenarios. Finally, in
Residual Block 4 × 3 , stride 2 256 × ×
C(3 × 3, 256) 16 16
section VI-F, a comprehensive comparison between our newly
Embedding - 192 proposed self-supervised learning method and previous work
demonstrates the superiority and robustness of our system.
E. DLG-LC Setup A. Evaluation of CA-DINO based Speaker Verification

1) Single Modality: In this stage, for a fair comparison with Table IV reports the speaker verification performance of our
[29], we also adopt ECAPA-TDNN [53] as our audio encoder proposed methods and other previous self-supervised speaker
to extract speaker embedding. For clustering, we choose k- models. All the methods are trained on Voxceleb 2 without any
means algorithm to assign the pseudo label to the training speaker label and evaluated on the Vox-O test set. According to
set. Unlike some works [29], [39], [30] that directly regard the results, we can find that the methods based on contrastive
TABLE IV: Performance comparison of the proposed CA- TABLE VI: EER(%) comparison of finetuning the pre-trained
DINO with other self-supervised speaker verification methods. self-supervised model with different amount of labeled data
SSL means Self-Supervised Learning. EER (%) and minDCF from Voxceleb 1. Results are evaluated on Vox-O which is
(p=0.01) are evaluated on Vox-O test set. the test set of Voxceleb 1.
SSL Methods EER (%) minDCF Initial Model None 10% 20% 50% 100%
Disent [35] 22.090 - Random 32.78 6.893 5.276 3.691 2.755
CDDL [36] 17.520 - SimCLR 8.547 4.388 3.797 3.266 2.936
GCL [19] 15.260 - CA-DINO 3.585 2.393 2.356 2.016 1.835
i-vector [20] 15.280 0.63 (p=0.05)
AP + AAT [20] 8.650 0.45 (p=0.05)
SimCLR + uniform [21] 8.280 0.610
MoCo + WavAug [22] 8.230 0.590 supervised learning with pretrain-finetune framework, i.e. fine-
Unif+CEL [23] 8.010 - tuning the self-supervised model with a small amount of
labeled data in the downstream speaker verification task.
DINO 31.233 0.990
+ EMA 4.404 0.434
We randomly sample 10%/20%/50%/100% labeled utterances
+ + Cluster Aware (CA) 3.585 0.353 from Voxceleb1 [48] as the supervision and finetune the self-
supervised models with these data.
From Table VI, it is observed that self-supervised model,
learning [20], [21], [22], [23] have greatly improved the per- both SimCLR and proposed CA-DINO, made great improve-
formance compared with the traditional work [35], [36], [19]. ments compared with model training from scratch, which
Our proposed negative-pairs-free CA-DINO achieves a great shows that a pretraining model with better initialization is very
performance leap again, which shows that negative pairs are important in low-resource conditions. Moreover, comparing
indeed a bottleneck for performance improvement. In addition, the proposed CA-DINO with SimCLR, the proposed non-
we also provide the ablation study of CA-DINO at the bottom contrastive CA-DINO outperforms SimCLR obviously and can
of Table IV. When we train the DINO without exponential obtain a good performance position only with few labeled data
moving average (EMA), it’s difficult to converge and only in downstream speaker verification tasks. Moreover, with only
obtains a very bad result which demonstrates that EMA is 10% part of labeled data, CA-DINO even achieves a better
the key to preventing the model from collapsing. And then we performance than the fully supervised system, i.e. 2.393% vs.
apply the cluster-aware (CA) strategy when training the DINO, 2.755%, which is meaningful to economize lots of manual
the performance has been further improved. The proposed CA- annotation.
DINO achieves the EER of 3.585%, with 55.24% relative
EER improvement compared with the previously published C. Evaluation of proposed DLG-LC
best performance of self-supervised speaker verification [23].
TABLE VII: EER (%) comparison on Vox-O, E, H of the
TABLE V: Performance comparison of cluster-aware training
proposed DLG-LC in Iteration 1. In this experiment, pseudo
with different cluster number. EER (%) is evaluated on Vox-
labels are estimated from our pre-trained CA-DINO system.
O test set. 1080k here means that one utterance is one class,
SimCLR and CA-DINO here mean we used all the data with
which is equivalent to training without cluster-aware strategy.
the estimated pseudo labels as the supervisory signal without
# Cluster 1080k 30k 20k 10k 5k any data selection strategy during the system training.
EER(%) 4.404 3.909 3.946 3.585 3.978 Method Threshold Vox-O Vox-E Vox-H
SimCLR - 6.281 7.428 11.54
During the cluster-aware training, there exists a k-means CA-DINO - 2.909 3.315 5.692
clustering operation. We also conducted an experiment to
CA-DINO
explore the influence of the number of clusters on the per- + LG [29] 1 2.441 2.930 4.892
formance and the results are reported in Table V. It is + LG [29] 3 2.516 3.037 5.094
observed that our proposed cluster-aware training strategy can + LG [29] 5 2.553 3.052 5.173
bring significant and stable improvements for all the given
CA-DINO
number of clusters compared with the baseline system (1080k). + DLG Dynamic 2.186 2.473 4.306
Meanwhile, CA-DINO with 10k cluster number outperforms ++ LC Dynamic 2.021 2.331 4.012
other systems which shows that the reasonable setting for the
number of clusters can maximize the performance improve- Based on pseudo labels generated by pre-trained models
ment. in stage I, we conduct some experiments to illustrate the
effectiveness of our proposed methods. The corresponding
B. Evaluation of CA-DINO with Pretrain-Finetune Frame- results are presented in Table VII. Firstly, following the
work with Labeled Data iterative learning framework proposed by [25], we estimate
In order to better illustrate the superior performance of the pseudo labels based on the speaker embedding extracted
our proposed CA-DINO, we conduct an exploration of self- by CA-DINO and train a new encoder using these labels.
TABLE VIII: EER (%) and minDCF (p=0.01) comparison on Vox-O, Vox-E, and Vox-H test sets for different iterations of
the proposed DLG-LC with other strategies. SimCLR and CA-DINO without DLC-LC mean that we used all the estimated
pseudo labels of the data without data selecting in training process.
Initial Model DLG-LC Iteration Vox-O Vox-E Vox-H

EER(%) minDCF EER(%) minDCF EER(%) minDCF
Initial 8.547 0.6453 9.228 0.6912 14.21 0.7757
1 6.281 0.5811 7.428 0.6221 11.54 0.7213
2 5.914 0.5299 6.745 0.5880 10.54 0.6971
SimCLR 5
3 5.547 0.5259 6.407 0.5580 10.14 0.6698
4 4.872 0.4651 5.593 0.5144 8.923 0.6408
5 4.484 0.4545 5.225 0.5055 8.501 0.6321
Initial 3.585 0.3529 3.852 0.4182 6.918 0.5743
1 2.909 0.3000 3.315 0.3372 5.692 0.4654
CA-DINO 5 2 2.606 0.2887 3.181 0.3211 5.403 0.4489
3 2.558 0.3054 3.064 0.3176 5.342 0.4482
4 2.643 0.2825 3.065 0.3200 5.291 0.4483
Initial 3.585 0.3529 3.852 0.4182 6.918 0.5743
1 2.021 0.2171 2.331 0.2419 4.012 0.3484
CA-DINO 2 1.596 0.1665 2.004 0.2089 3.484 0.3083
3 1.585 0.1671 1.879 0.1963 3.293 0.2941
4 1.606 0.1636 1.906 0.2028 3.274 0.2955
In order to reflect the superiority of our method, we also According to the results, it is observed that the iterative
trained a model based on SimCLR which is the most popular learning method can continuously improve the performance of
self-supervised speaker verification method [21]. From the the system with the increase of iteration number. However, the
results in the Table, we can see that the model based on CA- convergence speed based on SimCLR is significantly slower
DINO surpasses SimCLR on all test sets with a very large than that based on CA-DINO. SimCLR does not converge
improvement. Then based on pre-trained CA-DINO, we also even in the 5th round, while CA-DINO has achieved the best
conduct an exploration of DLG-LC in Iteration 1. According to performance position in the 3rd round. In addition, the final
the results, it can be observed that the loss-gate (LG) learning performance of SimCLR with iterative learning is even still
with fixed thresholds to select data can bring significant worse than the initial performance of CA-DINO. The proposed
improvement compared with the system trained without any CA-DINO owns consistent large advantages over SimCLR in
data selection. It means that loss-gate can effectively select each iteration which further demonstrates the superiority of the
reliable labels which are of benefit to the model. However, proposed CA-DINO in self-supervised speaker verification.
we also try to set different thresholds (1, 3, 5), and find that Based on the pseudo-labels generated by CA-DINO, we ap-
the choice of threshold also has a non-negligible impact on plied the proposed strategy of DLG-LC, and the performance
model performance [29]. Based on the estimated GMM, our significantly improved further. It only takes one round of
proposed dynamic loss-gate (DLG) can adjust the threshold iteration to obtain better results than three rounds of iterations
dynamically considering the current training situation and without DLC-LC, which shows the importance of the dynamic
obtains better performance than LG which only adopts a fixed threshold filtering and label correction on data usage. After
threshold during the whole training process. In addition, we convergence with more iterations, its performance is much
apply the label correction (LC) strategy to make full use of better than the system without DLG-LC. It shows that the pro-
data with unreliable labels and the results are further improved. posed DLG-LC can not only speed up the model convergence
Compared with the baseline system (SimCLR without data and reduce the training time but also significantly boost the
selection), the proposed CA-DINO with DLG-LC outperforms performance upper limit of the self-supervised learning model.
it by relative 70.05%, 68.61%, 65.23% EER reduction on
Vox-O, Vox-E and Vox-H sets respectively. E. Incorporate with Multi-Modality
Then we introduce visual information in the iterative learn-
D. Iterative Learning with DLG-LC ing process. The difference from the work in [30] is that we
In order to further illustrate the superiority of our proposed not only use multi-modality when doing the data clustering
method, we carried out several rounds of iterative training but also utilize multi-modality information when applying data
following [25]. We summarize the performance of EER and selection through DLG-LC. Table IX illustrates the EER and
minDCF of each iteration with or without the proposed DLG- minDCF performance comparison of DLG-LC with single-
LC strategy on Vox-O, Vox-E, Vox-H test sets, and the results and multi-modality.
are presented in Table VIII. Firstly, we compare the iterative It is observed that incorporating both audio-visual modality
results of SimCLR and CA-DINO respectively, and it is noted knowledge in the iterative learning can obtain another large
that both of them are trained without any loss-gate strategies. performance, which demonstrates that extra visual information
TABLE IX: EER (%) and minDCF (p=0.01) comparison on Vox-O, Vox-E, Vox-H test sets for different iterations of the
proposed DLG-LC with single- or multi-modality. It’s noted that they are both initialed with CA-DINO in the first self-
supervised pretraining stage. Both our audio and visual encoders are trained independently, and the fusion of multi-modal
information only performs when clustering data and selecting data in iterative learning. We do the testing still with single
audio modality.
Training Modality Iteration Vox-O Vox-E Vox-H

EER(%) minDCF EER(%) minDCF EER(%) minDCF
Audio Initial 3.585 0.3529 3.852 0.4182 6.918 0.5743
1 2.021 0.2171 2.331 0.2419 4.012 0.3484
2 1.596 0.1665 2.004 0.2089 3.484 0.3083
Audio
3 1.585 0.1671 1.879 0.1963 3.293 0.2941
4 1.606 0.1636 1.906 0.2028 3.274 0.2955
1 1.537 0.1326 1.789 0.1910 3.235 0.3007
Audio-Visual 2 1.292 0.1565 1.571 0.1688 2.799 0.2676
3 1.356 0.1553 1.602 0.1711 2.839 0.2712
TABLE X: EER (%) comparison on Vox-O, Vox-E, Vox-H among the proposed CA-DINO with DLG-LC and other most
advanced self-supervised systems. The model architecture, clustering number, method and iteration rounds of each system are
listed in detail. Noted that AHC and K-M here mean Agglomerative Hierarchical Clustering and k-means. ECAPA-S (Small)
and ECAPA-L (Large) here denote the ECAPA-TDNN with 512 channels and 1024 channels respectively.
Methods Model # Iteration # Clusters Cluster Vox-O (EER) Vox-E (EER) Vox-H (EER)
Fully Supervised [53] ECAPA-S - - - 1.010 1.240 2.320
IDLab [26] ECAPA-L 7 7500 AHC 2.100 - -
JHU [27] Res2Net50 5 7500 AHC 1.890 - -
SNU [28] ECAPA-L 5 7500 AHC 1.660 - -
LG [29] ECAPA-L 5 6000 K-M 1.660 2.180 3.760
DKU + single-modal [30] ResNet34 5 6000 K-M 2.740 3.080 5.480
DKU + multi-modal [30] ResNet34 5 6000 K-M 1.920 2.030 3.720
CA-DINO ECAPA-S 3 7500 K-M 2.558 2.129 5.148
CA-DINO + DLG-LC + single-modal ECAPA-S 3 7500 K-M 1.585 1.879 3.293
CA-DINO + DLG-LC + multi-modal ECAPA-S 2 7500 K-M 1.292 1.571 2.799
CA-DINO + DLG-LC + multi-modal* ECAPA-S 2 7500 K-M 1.191 1.474 2.543
* The results are given with adaptive s-norm [60] for a fair comparison with fully supervised system [53].
can make the data usage better. Take the EER of Vox-H pseudo labels. Moreover, when clustering data, we set the
as an example, with only single modality audio data, the number of clusters to 7500 instead of 6000, because 6000 is
relative EER reduction of the current and previous iterations close to the real number of speakers (5994) in the training set
are 42.01%, 13.16%, and 5.48% on Vox-H trials for the first which is too special. From the results, it’s obviously observed
three iterations. If iterative learning with audio-visual data, the that our proposed new self-supervised speaker verification
relative EER reduction percentages are 53.24%, 13.48% for framework is far superior to all the existing methods in both
the first two iterations. single- and multi-modality, even with fewer iterations, smaller
model, and simpler clustering method. For the single modality
F. Comparison with Other Systems condition, the proposed CA-DINO with DLG-LC outperforms
the best system (LG) [29] by relative 4.52%, 13.81% and
In this section, a performance comparison among our
12.42% on Vox-O, Vox-E and Vox-H sets respectively with
proposed CA-DINO with DLG-LC and other self-supervised
only 3 iterations. If we use audio-visual data in the iterative
speaker verification systems is given in Table X, and most
learning stage, the corresponding improvement is enlarged
of them are from the latest Voxceleb Speaker Recognition
to relative 22.17%, 27.94% and 25.56%, which is a great
Challenge (VoxSRC) [61], [62] which represent the most
performance leap.
advanced systems nowadays. Besides, the fully supervised
system is also illustrated as the first line of Table X for In summary, our proposed system achieves the new state-
comparison. of-the-art performance for self-supervised speaker verification
Compared with the previous works using large-size models, with a large performance improvement, despite we train the
the model we adopt is ECAPA-S (Small, C=512) which has systems with fewer iterations, smaller model, and simpler
fewer parameters and requires fewer computation resources. clustering method. More promisingly, compared to the conven-
Compared to AHC (Agglomerative Hierarchical Clustering), tional fully supervised system with ECAPA-TDNN-Small, our
to make it easier to be implemented, we adopt a simpler and newly proposed self-supervised learning system even obtains
more convenient clustering method K-M (k-means) to generate a comparable performance with the supervised system, but
without using any ground-truth labels. [10] S. Wang, Y. Yang, Y. Qian, and K. Yu, “Revisiting the statistics pooling
layer in deep speaker embedding learning,” in 2021 12th International
Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE,
VII. C ONCLUSION 2021, pp. 1–5.
[11] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, “Self-attentive speaker
In this work, we propose an advanced cluster-aware DINO embeddings for text-independent speaker verification.” in Proc. ISCA
(CA-DINO) with dynamic loss-gate and label correction Interspeech, vol. 2018, 2018, pp. 3573–3577.
(DLG-LC) for self-supervised speaker verification. The DINO [12] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification
framework is introduced so that the system can be trained using adapted gaussian mixture models,” Digital signal processing,
vol. 10, no. 1-3, pp. 19–41, 2000.
without negative samples, which is greatly improved compared [13] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-
with other self-supervised models. Then cluster-aware training end factor analysis for speaker verification,” IEEE/ACM Trans. ASLP.,
is designed into DINO framework, and positive samples are vol. 19, no. 4, pp. 788–798, 2011.
[14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A
collected from the same category rather than only single framework for self-supervised learning of speech representations,” Proc.
sentences, so that the model can utilize more diverse data and NIPS, vol. 33, pp. 12 449–12 460, 2020.
obtain system improvement. In the iterative learning stage, a [15] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and
A. Mohamed, “Hubert: Self-supervised speech representation learning
dynamic loss-gate is obtained by modeling the loss histogram by masked prediction of hidden units,” IEEE/ACM Trans. ASLP., vol. 29,
with Gaussian distribution, to select reliable data when training pp. 3451–3460, 2021.
on pseudo labels. Instead of dropping unreliable data directly, [16] Z. Fan, M. Li, S. Zhou, and B. Xu, “Exploring wav2vec 2.0
the predicted posterior is adopted as the target distribution on speaker verification and language identification,” arXiv preprint
arXiv:2012.06185, 2020.
to prevent fitting into incorrect samples. Moreover, multi- [17] Z. Chen, S. Chen, Y. Wu, Y. Qian, C. Wang, S. Liu, Y. Qian, and
modal information is incorporated into DLG-LC to further M. Zeng, “Large-scale self-supervised speech representation learning for
improve performance. The experiments on Voxceleb show automatic speaker verification,” in Proc. IEEE ICASSP. IEEE, 2022,
pp. 6147–6151.
that our newly proposed CA-DINO with DLG-LC is superior [18] T. Stafylakis, J. Rohdin, O. Plchot, P. Mizera, and L. Burget, “Self-
and achieves the new state-of-the-art performance for self- supervised speaker embeddings,” arXiv preprint arXiv:1904.03486,
supervised speaker verification. More promisingly, the gap 2019.
[19] N. Inoue and K. Goto, “Semi-supervised contrastive learning with
between unsupervised and supervised representation learn- generalized contrastive loss and its application to speaker recognition,”
ing is dramatically reduced for speaker verification, and an in Proc. IEEE APSIPA ASC. IEEE, 2020, pp. 1641–1646.
approaching performance of the fully supervised system is [20] J. Huh, H. S. Heo, J. Kang, S. Watanabe, and J. S. Chung, “Augmen-
obtained with our self-supervised learning method on speaker tation adversarial training for unsupervised speaker recognition,” arXiv
preprint arXiv:2007.12085, 2020.
verification. [21] H. Zhang, Y. Zou, and H. Wang, “Contrastive self-supervised learn-
ing for text-independent speaker verification,” in Proc. IEEE ICASSP.
IEEE, 2021, pp. 6713–6717.
ACKNOWLEDGMENT
[22] W. Xia, C. Zhang, C. Weng, M. Yu, and D. Yu, “Self-supervised
This work was supported in part by China NSFC projects text-independent speaker verification using prototypical momentum con-
under Grants 62122050 and 62071288, and in part by Shang- trastive learning,” in Proc. IEEE ICASSP. IEEE, 2021, pp. 6723–6727.
[23] S. H. Mun, W. H. Kang, M. H. Han, and N. S. Kim, “Unsupervised rep-
hai Municipal Science and Technology Major Project under resentation learning for speaker recognition via contrastive equilibrium
Grant 2021SHZDZX0102. Experiments have been carried out learning,” arXiv preprint arXiv:2010.11433, 2020.
on the PI super-computer at Shanghai Jiao Tong University. [24] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering
for unsupervised learning of visual features,” in Proc. ECCV, 2018, pp.
132–149.
R EFERENCES [25] D. Cai, W. Wang, and M. Li, “An iterative framework for self-supervised
deep speaker representation learning,” in Proc. IEEE ICASSP. IEEE,
[1] B. Han, Z. Chen, and Y. Qian, “Self-supervised speaker verification us- 2021, pp. 6728–6732.
ing dynamic loss-gate and label correction,” in Proc. ISCA Interspeech, [26] J. Thienpondt, B. Desplanques, and K. Demuynck, “The idlab voxceleb
2022. speaker recognition challenge 2020 system description,” arXiv preprint
[2] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez- arXiv:2010.12468, 2020.
Dominguez, “Deep neural networks for small footprint text-dependent
[27] J. Cho, J. Villalba, and N. Dehak, “The jhu submission to voxsrc-21:
speaker verification,” in Proc. IEEE ICASSP, 2014, pp. 4052–4056.
Track 3,” arXiv preprint arXiv:2109.13425, 2021.
[3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur,
“X-vectors: Robust dnn embeddings for speaker recognition,” in Proc. [28] S. H. Mun, M. H. Han, and N. S. Kim, “Snu-hil system for the voxceleb
IEEE ICASSP. IEEE, 2018, pp. 5329–5333. speaker recognition challenge 2021,” VoxSRC, 2021.
[4] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, “But [29] R. Tao, K. A. Lee, R. K. Das, V. Hautamäki, and H. Li, “Self-
system description to voxceleb speaker recognition challenge 2019,” supervised speaker recognition with loss-gated learning,” arXiv preprint
arXiv preprint arXiv:1910.12592, 2019. arXiv:2110.03869, 2021.
[5] B. Han, Z. Chen, and Y. Qian, “Local information modeling with self- [30] D. Cai, W. Wang, and M. Li, “Incorporating visual information in audio
attention for speaker verification,” in Proc. IEEE ICASSP. IEEE, 2022, based self-supervised speaker recognition,” IEEE/ACM Trans. ASLP.,
pp. 6727–6731. vol. 30, pp. 1422–1435, 2022.
[6] B. Han, Z. Chen, B. Liu, and Y. Qian, “Mlp-svnet: A multi-layer [31] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and
perceptrons based network for speaker verification,” in Proc. IEEE A. Joulin, “Emerging properties in self-supervised vision transformers,”
ICASSP. IEEE, 2022, pp. 7522–7526. in Proc. ICCV, 2021, pp. 9650–9660.
[7] Y. Liu, L. He, and J. Liu, “Large margin softmax loss for speaker [32] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn,
verification,” arXiv preprint arXiv:1904.03479, 2019. H. Zhang, and C. Raffel, “Remixmatch: Semi-supervised learning with
[8] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, distribution alignment and augmentation anchoring,” arXiv preprint
S. Jung, B.-J. Lee, and I. Han, “In defence of metric learning for speaker arXiv:1911.09785, 2019.
recognition,” arXiv preprint arXiv:2003.11982, 2020. [33] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel,
[9] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text- E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-
dependent speaker verification,” in Proc. IEEE ICASSP. IEEE, 2016, supervised learning with consistency and confidence,” Proc. NIPS,
pp. 5115–5119. vol. 33, pp. 596–608, 2020.
[34] B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shi- [61] A. Nagrani, J. S. Chung, J. Huh, A. Brown, E. Coto, W. Xie,
nozaki, “Flexmatch: Boosting semi-supervised learning with curriculum M. McLaren, D. A. Reynolds, and A. Zisserman, “Voxsrc 2020:
pseudo labeling,” Proc. NIPS, vol. 34, 2021. The second voxceleb speaker recognition challenge,” arXiv preprint
[35] A. Nagrani, J. S. Chung, S. Albanie, and A. Zisserman, “Disentangled arXiv:2012.06867, 2020.
speech embeddings using cross-modal self-supervision,” in Proc. IEEE [62] A. Brown, J. Huh, J. S. Chung, A. Nagrani, and A. Zisserman, “Voxsrc
ICASSP. IEEE, 2020, pp. 6829–6833. 2021: The third voxceleb speaker recognition challenge,” arXiv preprint
[36] S.-W. Chung, H. G. Kang, and J. S. Chung, “Seeing voices and arXiv:2201.04583, 2022.
hearing voices: learning discriminative embeddings using cross-modal
self-supervision,” arXiv preprint arXiv:2004.14326, 2020.
[37] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework
for contrastive learning of visual representations,” in Proc. ICML.
PMLR, 2020, pp. 1597–1607.
[38] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast
for unsupervised visual representation learning,” in Proc. CVPR, June
2020.
[39] D. Cai and M. Li, “The dku-dukeece system for the self-supervision
speaker verification task of the 2021 voxceleb speaker recognition
challenge,” arXiv preprint arXiv:2109.02853, 2021.
[40] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin,
“Unsupervised learning of visual features by contrasting cluster assign-
ments,” in Proc. NIPS, 2020.
[41] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a
neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.
[42] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with
warm restarts,” arXiv preprint arXiv:1608.03983, 2016.
[43] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker
recognition,” in Proc. ISCA Interspeech, 2018, pp. 1086–1090.
[44] E. Arazo, D. Ortego, P. Albert, N. O’Connor, and K. McGuinness,
“Unsupervised label noise modeling and loss correction,” in Proc. ICML.
PMLR, 2019, pp. 312–321.
[45] X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, “Margin matters:
Towards more discriminative deep neural network embeddings for
speaker recognition,” in Proc. IEEE APSIPA ASC. IEEE, 2019, pp.
1652–1656.
[46] F. Tong, Y. Liu, S. Li, J. Wang, L. Li, and Q. Hong, “Automatic Error
Correction for Speaker Embedding Learning with Noisy Labels,” in
Proc. ISCA Interspeech, 2021, pp. 4628–4632.
[47] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised
learning method for deep neural networks,” in Proc. ICML, vol. 3, no. 2,
2013, p. 896.
[48] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: A large-scale
speaker identification dataset,” in Proc. ISCA Interspeech, 2017, pp.
2616–2620.
[49] W. Cai, J. Chen, J. Zhang, and M. Li, “On-the-fly data loader
and utterance-level aggregation for speaker and language recognition,”
IEEE/ACM Trans. ASLP., vol. 28, pp. 1038–1051, 2020.
[50] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise
corpus,” arXiv preprint arXiv:1510.08484, 2015.
[51] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A
study on data augmentation of reverberant speech for robust speech
recognition,” in Proc. IEEE ICASSP. IEEE, 2017, pp. 5220–5224.
[52] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and
alignment using multitask cascaded convolutional networks,” IEEE
signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[53] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Em-
phasized channel attention, propagation and aggregation in tdnn based
speaker verification,” arXiv preprint arXiv:2005.07143, 2020.
[54] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr,
“Res2net: A new multi-scale backbone architecture,” IEEE transactions
on pattern analysis and machine intelligence, 2019.
[55] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc.
CVPR, 2018, pp. 7132–7141.
[56] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search
with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–
547, 2019.
[57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[58] Y. Kim, W. Park, M.-C. Roh, and J. Shin, “Groupface: Learning
latent groups and constructing group-based representations for face
recognition,” in Proc. CVPR, 2020, pp. 5621–5630.
[59] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular
margin loss for deep face recognition,” in Proc. CVPR, 2019, pp. 4690–
4699.
[60] P. Matějka, O. Novotný, O. Plchot, L. Burget, M. D. Sánchez, and
J. Černocký, “Analysis of score normalization in multilingual speaker
recognition,” in Proc. ISCA Interspeech, 2017, pp. 1567–1571.

Self-Supervised Learning With Cluster-Aware-DINO For High-Performance Robust Speaker Verification

Uploaded by

Copyright:

Available Formats

Self-Supervised Learning With Cluster-Aware-DINO For High-Performance Robust Speaker Verification

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Self-Supervised Learning With Cluster-Aware-DINO For High-Performance Robust Speaker Verification

Uploaded by

Copyright:

Available Formats

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.

Self-Supervised Learning with Cluster-Aware-DINO

Background Model (GMM-UBM) [12], i-vector [13]. How-

Algorithm 1: The proposed Dynamic Loss-Gate and

D. CA-DINO Setup TABLE III: Model Architecture of audio encoder ECAPA-

E. DLG-LC Setup A. Evaluation of CA-DINO based Speaker Verification

Initial Model DLG-LC Iteration Vox-O Vox-E Vox-H

Training Modality Iteration Vox-O Vox-E Vox-H

You might also like