ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding
Abstract

Contrastive learning has been attracting much attention for learning unsupervised sentence embeddings. [...]

1 Introduction

The large-scale pre-trained language model (Devlin et al., 2018; Liu et al., 2019), represented by [...]
A positive pair derived from the same sentence contains the same length information, while a negative pair derived from two different sentences generally would contain different length information. Positive pairs and negative pairs therefore differ in the length information they contain, which can act as a feature to distinguish them. Due to such a difference, a semantic similarity model trained with these pairs can be biased: it tends to consider two sentences of the same or similar lengths to be more similar in semantics.

Dataset    length diff ≤ 3    length diff > 3
STS12      0.7298             0.6035
STS13      0.8508             0.8396
STS14      0.7971             0.6676
STS15      0.8374             0.7603
STS16      0.8134             0.7677
STS-B      0.8148             0.6924

Table 1: The Spearman correlation of sentence pairs with a length difference of ≤ 3 and > 3.
Table 2: An example of different methods to change the length of a sentence. The similarity scores are predicted by the officially released "unsup-simcse-bert-base-uncased" model.
[...] referred to positive pairs. The core idea of unsup-SimCSE is to use identical sentences to build the positive pairs, i.e., x_i^+ = x_i. Note that in Transformer, there is a dropout mask placed on the fully-connected layers and the attention probabilities. Thus the key ingredient is to feed the same input x_i to the encoder twice, applying different dropout masks z_i and z_i^+, and output two separate sentence embeddings to build a positive pair as follows:

    h_i = f_θ(x_i, z_i),   h_i^+ = f_θ(x_i, z_i^+)    (1)

With h_i and h_i^+ for each sentence in a mini-batch with batch size N, the contrastive learning objective w.r.t. x_i is formulated as follows,

    ℓ_i = −log [ exp(sim(h_i, h_i^+)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h_j^+)/τ) ]    (2)

where τ is a temperature hyperparameter and sim(h_i, h_i^+) is the similarity metric, which is typically the cosine similarity function as follows,

    sim(h_i, h_i^+) = (h_i^⊤ h_i^+) / (‖h_i‖ · ‖h_i^+‖)    (3)
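To make Eqs. (1)–(3) concrete, here is a minimal PyTorch sketch of the unsup-SimCSE objective (not the authors' implementation); `encoder` is a hypothetical module that maps a tokenized batch to one embedding per sentence and keeps dropout active, so two forward passes over the same inputs yield the two dropout-based views.

```python
import torch
import torch.nn.functional as F

def unsup_simcse_loss(encoder, input_ids, attention_mask, tau=0.05):
    """Unsup-SimCSE objective (Eqs. 1-3): encode the same batch twice so that
    each sentence receives two embeddings under different dropout masks."""
    h = encoder(input_ids, attention_mask)       # h_i,   shape (N, d)
    h_pos = encoder(input_ids, attention_mask)   # h_i^+, shape (N, d), new dropout mask

    # Pairwise cosine similarities sim(h_i, h_j^+) scaled by the temperature (Eq. 3).
    sim = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)

    # Eq. (2): for row i, the positive logit is the diagonal entry sim(h_i, h_i^+),
    # while the remaining entries of the row act as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```

Because `F.cross_entropy` normalizes each row of the similarity matrix, the diagonal entries play the role of sim(h_i, h_i^+) in Eq. (2) and the off-diagonal entries supply the in-batch negatives.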
We take sub-word repetition as an example. Given a sentence s, after processing by a sub-word tokenizer, we get a sub-word sequence x = {x_1, x_2, ..., x_N}, with N being the length of the sequence. We define the number of repeated tokens as

    dup_len ∈ [0, max(2, int(dup_rate ∗ N))]    (4)

where dup_rate is the maximal repetition rate, which is a hyperparameter. Then dup_len is a randomly sampled number in the set defined above, which introduces more diversity when extending the sequence length. After dup_len is determined, we use a uniform distribution to randomly select the dup_len sub-words that need to be repeated from the sequence, which composes the dup_set as follows,

    dup_set = uniform(range = [1, N], num = dup_len)    (5)

For example, if the first sub-word is in dup_set, then the sequence x becomes x^+ = {x_1, x_1, x_2, ..., x_N}. And different from unsup-SimCSE, which passes x to the pre-trained BERT twice, ESimCSE passes x and x^+ independently.
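The following is a small Python sketch of the sub-word repetition step in Eqs. (4)–(5), assuming the sentence has already been split into sub-word tokens; the function name and the default dup_rate value are illustrative rather than taken from the paper.

```python
import random

def subword_repetition(tokens, dup_rate=0.2):
    """Build the extended positive view x+ of Eqs. (4)-(5) by repeating a few
    randomly chosen sub-words of the tokenized sentence in place."""
    n = len(tokens)
    # Eq. (4): dup_len is drawn from [0, max(2, int(dup_rate * n))].
    dup_len = random.randint(0, max(2, int(dup_rate * n)))
    # Eq. (5): uniformly sample dup_len positions to repeat (0-indexed here;
    # min() guards against dup_len exceeding the length of very short inputs).
    dup_set = set(random.sample(range(n), k=min(dup_len, n)))

    extended = []
    for i, tok in enumerate(tokens):
        extended.append(tok)
        if i in dup_set:
            extended.append(tok)   # repeat the selected sub-word
    return extended

# e.g. ["i", "like", "this", "apple"] may become ["i", "i", "like", "this", "apple"],
# which keeps the semantics while changing the sentence length.
```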
Denoting the parameters of the encoder as θ_e and those of the momentum-updated encoder as θ_m, we update θ_m in the following way,

    θ_m ← λ θ_m + (1 − λ) θ_e    (6)

where λ ∈ [0, 1) is a momentum coefficient parameter. Note that only the parameters θ_e are updated by back-propagation. We introduce θ_m to generate the sentence embeddings for the queue, because the momentum update makes θ_m evolve more smoothly than θ_e. As a result, though the embeddings in the queue are encoded by different encoders (at different "steps" during training), the difference among these encoders can be made small.

With sentence embeddings in the queue, the loss function of ESimCSE is further modified as follows,

    ℓ_i = −log [ exp(sim(h_i, h_i^+)/τ) / ( Σ_{j=1}^{N} exp(sim(h_i, h_j^+)/τ) + Σ_{m=1}^{M} exp(sim(h_i, h_m^+)/τ) ) ]    (7)

where h_m^+ denotes a sentence embedding in the momentum-updated queue, and M is the size of the queue.
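A minimal PyTorch sketch of the momentum update in Eq. (6) and the queue-extended loss in Eq. (7) might look as follows (function and variable names are ours, not the authors'); `queue` is assumed to hold the M sentence embeddings produced by the momentum-updated encoder.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, lam=0.995):
    """Eq. (6): θ_m ← λ·θ_m + (1 − λ)·θ_e. Only θ_e is updated by back-propagation."""
    for p_e, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(lam).add_(p_e.data, alpha=1.0 - lam)

def esimcse_loss(h, h_pos, queue, tau=0.05):
    """Eq. (7): negatives are the other in-batch positives plus the M embeddings
    stored in the momentum-updated queue."""
    sim_batch = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)
    sim_queue = F.cosine_similarity(h.unsqueeze(1), queue.unsqueeze(0), dim=-1) / tau  # (N, M)
    logits = torch.cat([sim_batch, sim_queue], dim=1)    # (N, N + M)
    labels = torch.arange(h.size(0), device=h.device)    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```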
4 Experiment

4.1 Evaluation Setup

Following unsup-SimCSE, we use 1 million sentences randomly drawn from English Wikipedia for training (https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt). Then we conduct our experiments on 7 standard semantic textual similarity (STS) tasks. The detailed statistics are shown in Table 3. The STS12–STS16 datasets do not have train or development sets, and thus we evaluate the models on the development set of STS-B to search for better settings of the hyper-parameters. The SentEval toolkit (https://github.com/facebookresearch/SentEval) is used for evaluation. For the compared baseline unsup-SimCSE, we download the officially published model checkpoints (https://github.com/princeton-nlp/SimCSE) and reproduce the evaluation results with the suggested hyper-parameters in dev/test mode. Experiments are conducted on Nvidia 3090 GPUs.

Semantic Textual Similarity Tasks. Semantic textual similarity measures the semantic similarity of any two sentences. STS 2012–2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016) and STS-B (Cer et al., 2017) are widely used semantic textual similarity benchmark datasets, which measure the semantic similarity of two sentences with the cosine similarity of the corresponding sentence embeddings. After deriving the semantic similarities of all pairs in the test set, we follow unsup-SimCSE in using Spearman correlation to measure the correlation between the ranks of the predicted similarities and the ground truth. For a set of size n, the n raw scores X_i, Y_i are converted to their corresponding ranks rg_{X_i}, rg_{Y_i}; the Spearman correlation is then defined as follows

    r_s = cov(rg_X, rg_Y) / (σ_{rg_X} · σ_{rg_Y})    (8)

where cov(rg_X, rg_Y) is the covariance of the rank variables, and σ_{rg_X} and σ_{rg_Y} are the standard deviations of the rank variables. The Spearman correlation takes a value between -1 and 1, and will be high when the ranks of the predicted similarities and the ground truth are similar.
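As an illustration of Eq. (8), the Spearman correlation between predicted similarities and gold scores can be computed with scipy (the actual evaluation in the paper goes through the SentEval toolkit); the scores below are made-up placeholders.

```python
from scipy.stats import spearmanr

# Made-up predicted cosine similarities and gold STS scores for five sentence pairs.
predicted = [0.82, 0.45, 0.91, 0.30, 0.67]
gold = [4.2, 2.1, 4.8, 1.5, 3.9]

# Eq. (8): Spearman correlation is the Pearson correlation of the rank variables,
# so only the ordering of the scores matters, not their absolute scale.
rho, p_value = spearmanr(predicted, gold)
print(f"Spearman correlation: {rho:.4f}")
```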
4.2 Training Details

We start from the pre-trained checkpoints of BERT (uncased) or RoBERTa (cased), using both the base and the large versions, and we add an MLP layer on top of the [CLS] representation to get the sentence embedding. We implement ESimCSE based on Huggingface's transformers package (https://github.com/huggingface/transformers, version 4.2.1). We train our models for one epoch using the Adam optimizer with batch size = 64 and the temperature hyper-parameter τ = 0.05 in Eq. (3). The learning rate is set to 3e-5 for the ESimCSE-BERTbase model and 1e-5 for the other models. The dropout rate is p = 0.1 for base models and p = 0.15 for large models. For the momentum contrast, we empirically choose a relatively large momentum λ = 0.995. In addition, we evaluate the model every 125 training steps on the development set of STS-B and keep the best checkpoint for the final evaluation on the test sets. We use sub-word repetition instead of word repetition, which will be further discussed in the ablation study section.
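For quick reference, the reported training settings can be summarized in a single configuration sketch (this is only a restatement of the paragraph above, not the authors' actual configuration file):

```python
# Restatement of the Section 4.2 settings for the ESimCSE-BERTbase run;
# comments note where the other model variants differ.
training_config = {
    "optimizer": "Adam",
    "batch_size": 64,
    "temperature": 0.05,        # τ in Eq. (3)
    "learning_rate": 3e-5,      # 1e-5 for the other models
    "dropout": 0.1,             # 0.15 for the large models
    "momentum": 0.995,          # λ in Eq. (6)
    "eval_every_steps": 125,    # on the STS-B development set
    "epochs": 1,
    "length_extension": "sub-word repetition",
}
```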
4.3 Main Results

Table 4 shows the best results obtained on the STS-B development sets. We highlight the highest numbers among models with the same pre-trained encoder in bold. ♣ denotes the evaluation results from the officially published model by (Gao et al., 2021). It can be seen that our proposed ESimCSE outperforms unsup-SimCSE by +2.40% on BERTbase, +2.19% on BERTlarge, +1.19% on RoBERTabase, and +0.26% on RoBERTalarge, respectively.

Model                              STS-B
unsup-SimCSE-BERTbase ♣            82.45
ESimCSE-BERTbase                   84.85 (+2.40)
unsup-SimCSE-BERTlarge ♣           84.41
ESimCSE-BERTlarge                  86.60 (+2.19)
unsup-SimCSE-RoBERTabase ♣         83.91
ESimCSE-RoBERTabase                85.10 (+1.19)
unsup-SimCSE-RoBERTalarge ♣        85.07
ESimCSE-RoBERTalarge               85.33 (+0.26)

Table 4: Sentence embedding performance on semantic textual similarity (STS) development sets in terms of Spearman's correlation, with BERTbase, BERTlarge, RoBERTabase, and RoBERTalarge as base models. ♣: results from the officially published model by (Gao et al., 2021).

The comparison between the proposed ESimCSE and unsup-SimCSE on the development set gives us a first glance at the superiority of the proposed ESimCSE. We then further evaluate the corresponding checkpoints on the test sets. Table 5 shows the evaluation results on the 7 STS test sets. It can be seen that ESimCSE substantially improves the measurement of semantic textual similarity in different settings of base models over the previous state-of-the-art unsup-SimCSE. Specifically, our proposed ESimCSE outperforms unsup-SimCSE by +2.02% on BERTbase, +0.90% on BERTlarge, +0.87% on RoBERTabase, and +0.55% on RoBERTalarge, respectively.

Model                              STS12  STS13  STS14  STS15  STS16  STS-B  SICK-R  Avg.
unsup-SimCSE-BERTbase ♣            68.40  82.41  74.38  80.91  78.56  76.85  72.23   76.25
ESimCSE-BERTbase                   73.40  83.27  77.25  82.66  78.81  80.17  72.30   78.27 (+2.02)
unsup-SimCSE-BERTlarge ♣           70.88  84.16  76.43  84.50  79.76  79.26  73.88   78.41
ESimCSE-BERTlarge                  73.21  85.37  77.73  84.30  78.92  80.73  74.89   79.31 (+0.90)
unsup-SimCSE-RoBERTabase ♣         70.16  81.77  73.24  81.36  80.65  80.22  68.56   76.57
ESimCSE-RoBERTabase                69.90  82.50  74.68  83.19  80.30  80.99  70.54   77.44 (+0.87)
unsup-SimCSE-RoBERTalarge ♣        72.86  83.99  75.62  84.77  81.80  81.98  71.26   78.90
ESimCSE-RoBERTalarge               73.20  84.93  76.88  84.86  81.21  82.79  72.27   79.45 (+0.55)

Table 5: Sentence embedding performance on 7 semantic textual similarity (STS) test sets, in terms of Spearman's correlation, with BERTbase, BERTlarge, RoBERTabase, and RoBERTalarge as base models. ♣: results from the officially published model by (Gao et al., 2021).

We also explore how much improvement word repetition or momentum contrast alone can bring to unsup-SimCSE. As shown in Tables 6 and 7, either word repetition or momentum contrast can bring substantial improvements to unsup-SimCSE. This means that both proposed methods, enhancing the positive pairs and enhancing the negative pairs, are effective. Better yet, these two modifications can be superimposed (ESimCSE) to obtain further improvements.

5 Ablation Study

This section investigates how different dropout rates, repetition rates, sentence-length-extension methods, and momentum contrast queue sizes affect ESimCSE's performance. We only change one hyperparameter at a time. All results use our ESimCSE-BERTbase model and are evaluated on the development set of STS-B.

5.1 Effect of Dropout Rate

Dropout is the key ingredient of the unsup-SimCSE model, so different dropout rates p are crucial to the model's performance. According to (Gao et al., 2021), the optimal dropout rate for unsup-SimCSE-BERTbase is p = 0.1. Considering that ESimCSE additionally introduces the word repetition and momentum contrast mechanisms, we re-examine the impact of different dropout rates on its performance. We experiment with three typical dropout rates, and the results are shown in Table 8. Specifically, when the dropout rate is 0.1, the model achieves the best performance on the STS-B development set. When the dropout rate increases to 0.15, the performance is close to that of 0.1, with no significant drop. Even when the dropout rate reaches 0.2, the performance drops by nearly 1%, but it still outperforms unsup-SimCSE. These results suggest that the superiority of the proposed ESimCSE over unsup-SimCSE is robust to the choice of dropout rate.

5.2 Effect of Repetition Rate

Word repetition can bring improvement by diversifying the length difference of positive pairs in [...]
Table 7: Improvements on 7 STS test sets that word repetition or momentum contrast brings to unsup-SimCSE.
Length-extension Method       STS-B
unsup-SimCSE-BERTbase         82.45
Inserting Stop-words          81.72
Inserting [MASK]              83.08
Word Repetition               84.40
Sub-word Repetition           84.85

Table 10: Effects of the sentence-length-extension method on the STS-B development set in terms of Spearman's correlation.

6 Related Work

Unsupervised sentence representation learning has been widely studied. Socher et al. (2011), Hill et al. (2016), and Le and Mikolov (2014) propose to learn sentence representations according to the internal structure of each sentence. Kiros et al. (2015) and Logeswaran and Lee (2018) predict the surrounding sentences of a given sentence based on the distributional hypothesis. Pagliardini et al. (2017) propose Sent2Vec, a simple unsupervised model that composes sentence embeddings from word vectors along with n-gram embeddings.

Recently, contrastive learning has been explored in unsupervised sentence representation learning and has become a promising trend (Zhang et al., 2020; Wu et al., 2020; Meng et al., 2021; Gao et al., 2021; Yan et al., 2021).
Those contrastive-learning-based methods for sentence embeddings are generally based on the assumption that a good semantic representation should bring similar sentences closer while pushing dissimilar ones apart. Therefore, those methods use various data augmentation techniques to randomly generate two different views for each sentence and design an effective loss function to bring the views closer in the semantic representation space. Among these contrastive methods, the most related ones to our work are unsup-ConSERT and unsup-SimCSE. ConSERT explores various effective data augmentation strategies (e.g., adversarial attack, token shuffling, Cutoff, dropout) to generate different views for contrastive learning and analyzes their effects on unsupervised sentence representation transfer. Unsup-SimCSE, the current state-of-the-art unsupervised method, uses only standard dropout as minimal data augmentation and feeds an identical sentence to a pre-trained model twice with independently sampled dropout masks to generate two distinct sentence embeddings as a positive pair. Unsup-SimCSE is very simple but works surprisingly well, performing on par with previous supervised counterparts. However, we find that unsup-SimCSE constructs each positive pair from two sentences of the same length, which can mislead the learning of sentence embeddings. So we propose a simple but effective method termed "word repetition" to alleviate it. We also propose to use the momentum contrast method to increase the number of negative pairs involved in the loss calculation, which encourages the model towards more refined learning.

7 Conclusion and Future Work

In this paper, we propose optimizations for constructing the positive and negative pairs of unsup-SimCSE and combine them with unsup-SimCSE, which is termed ESimCSE. Through extensive experiments, the proposed ESimCSE achieves considerable improvements over unsup-SimCSE on standard semantic textual similarity tasks.

unsup-SimCSE treats all negative pairs with the same importance. However, some negative pairs are quite different from the positive pairs, while others are relatively close to them. This distinction would be helpful for embedding retrieval tasks, but it is not reflected in the objective function of unsup-SimCSE. Therefore, in the future, we will focus on designing a more refined objective function to improve the discrimination between different negative pairs.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau Claramunt, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511. ACL.

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, and Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task, pages 32–43.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie. 2020. CERT: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. arXiv preprint arXiv:2009.12061.