
ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding


Xing Wu 1,2,3, Chaochen Gao 1,2 *, Liangjun Zang 1, Jizhong Han 1, Zhongyuan Wang 3, Songlin Hu 1,2
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 Kuaishou Technology, Beijing, China
{gaochaochen,zangliangjun,hanjizhong,husonglin}@iie.ac.cn
{wuxing,wangzhongyuan}@kuaishou.com

* Work done during an internship at Kuaishou Inc. The first two authors contribute equally.

Abstract 1 Introduction

Contrastive learning has been attracting much The large-scale pre-trained language model (De-
vlin et al., 2018; Liu et al., 2019), represented by
arXiv:2109.04380v1 [cs.CL] 9 Sep 2021

attention for learning unsupervised sentence


embeddings. The current state-of-the-art un- BERT, benefits many downstream supervised tasks
supervised method is the unsupervised Sim- through finetuning methods. However, when apply-
CSE (unsup-SimCSE). Unsup-SimCSE takes ing BERT’s native sentence embeddings directly
dropout as a minimal data augmentation for semantic similarity tasks without labeled data,
method, and passes the same input sentence
the performance is hardly satisfactory (Gao et al.,
to a pre-trained Transformer encoder (with
dropout turned on) twice to obtain the two cor-
2021; Yan et al., 2021). Recently, researchers have
responding embeddings to build a positive pair. proposed using contrastive learning to learn better
As the length information of a sentence will unsupervised sentence embeddings. Contrastive
generally be encoded into the sentence embed- learning aims to learn effective sentence embed-
dings due to the usage of position embedding dings based on the assumption that effective sen-
in Transformer, each positive pair in unsup- tence embeddings should bring similar sentences
SimCSE actually contains the same length in- closer while pushing away dissimilar ones. It gen-
formation. And thus unsup-SimCSE trained
erally uses various data augmentation methods to
with these positive pairs is probably biased,
which would tend to consider that sentences randomly generate different views for each sen-
of the same or similar length are more similar tence, and assumes a sentence is semantically more
in semantics. Through statistical observations, similar to its augmented counterpart than any other
we find that unsup-SimCSE does have such a sentence. The current state-of-the-art method is
problem. To alleviate it, we apply a simple rep- unsup-SimCSE (Gao et al., 2021), which gener-
etition operation to modify the input sentence, ates the state-of-the-art unsupervised sentence em-
and then pass the input sentence and its modi-
beddings and performs on par with previously su-
fied counterpart to the pre-trained Transformer
encoder, respectively, to get the positive pair.
pervised counterparts. Unsup-SimCSE implicitly
Additionally, we draw inspiration from the hypothesizes dropout acts as a minimal data aug-
community of computer vision and introduce mentation method. Specifically, unsup-SimCSE
a momentum contrast, enlarging the number composes N sentences in a batch and feeds each
of negative pairs without additional calcula- sentence to the pre-trained BERT twice with two
tions. The proposed two modifications are ap- independently sampled dropout masks. Then the
plied on positive and negative pairs separately, embeddings derived from the same sentence consti-
and build a new sentence embedding method,
tute a “positive pair”, while those derived from two
termed Enhanced Unsup-SimCSE (ESimCSE).
We evaluate the proposed ESimCSE on several different sentences constitute a “negative pair”.
benchmark datasets w.r.t the semantic text sim- Using dropout as a minimal data augmentation
ilarity (STS) task. Experimental results show method is simple and effective, but there is a weak
that ESimCSE outperforms the state-of-the-art point. Pretrained language models are built on
unsup-SimCSE by an average Spearman corre- Transformer blocks, which will encode the length
lation of 2.02% on BERT-base.
information of a sentence through position embed-

Work done during internship at Kuaishou Inc. The first dings. And thus a positive pair derived from the
two authors contribute equally. same sentence would contain the same length in-
Figure 1: The schematic diagram of the ESimCSE method. Unlike unsup-SimCSE, ESimCSE performs word repetition operations on the batch so that the lengths of the sentences in a positive pair vary without changing their semantics. This mechanism weakens the same-length hint for the model when predicting positive pairs. In addition, ESimCSE maintains the model outputs of several preceding mini-batches in a queue, termed momentum contrast, which expands the negative pairs involved in the loss calculation. This mechanism allows pairs to be compared more sufficiently in contrastive learning.

And thus a positive pair derived from the same sentence contains the same length information, while a negative pair derived from two different sentences generally contains different length information. Therefore, positive pairs and negative pairs differ in the length information they contain, which can act as a feature to distinguish them. Specifically, due to such a difference, the semantic similarity model trained with these pairs can be biased, probably considering two sentences of the same or similar lengths more similar in semantics.

To confirm the impact of the length difference, we evaluate on standard semantic textual similarity (STS) tasks with the unsup-SimCSE-BERTbase model published by (Gao et al., 2021). We partition the STS task datasets into groups based on the sentence pairs' length difference, and calculate the corresponding semantic similarity with Spearman correlation separately. As shown in Table 1, as the length difference increases, the performance of unsup-SimCSE gets worse. The performance of unsup-SimCSE on sentences with similar length (≤ 3) far exceeds that on sentences with a larger difference in length (> 3).

Dataset | length diff ≤ 3 | length diff > 3
STS12   | 0.7298          | 0.6035
STS13   | 0.8508          | 0.8396
STS14   | 0.7971          | 0.6676
STS15   | 0.8374          | 0.7603
STS16   | 0.8134          | 0.7677
STS-B   | 0.8148          | 0.6924

Table 1: The Spearman correlation of sentence pairs with a length difference of ≤ 3 and > 3.
To alleviate this problem, we propose a simple but effective enhancement to unsup-SimCSE. For each positive pair, we expect to change the length of a sentence without changing its semantic meaning. Existing methods to change the length of a sentence generally use random insertion and random deletion. However, inserting randomly selected words into a sentence may introduce extra noise, which will probably distort its meaning; deleting keywords from a sentence will also change its semantics substantially. Therefore, we propose a safer method, termed "word repetition", which randomly duplicates some words in a sentence. For example, as shown in Table 2, the original sentence is "I like this apple because it looks so fresh and I think it should be delicious." Random insertion may generate "I don't like this apple because but it looks so not fresh and I think it should be dog delicious.", and random deletion may generate "I this apple because it looks so and I think it should be.". Both deviate far from the meaning of the original sentence. On the contrary, the method of "word repetition" may get "I like like this apple because it looks so so fresh and and I think it should be delicious." or "I I like this apple apple because it looks looks so fresh fresh and I think it should be delicious delicious." Both keep the meaning of the original sentence quite well.

Apart from the optimization above for positive pair construction, we further explore how to optimize the construction of negative pairs. Since contrastive learning is carried out between positive pairs and negative pairs, theoretically more negative pairs can lead to better comparison between the pairs (Chen et al., 2020). Thus, a potential optimization direction is to leverage more negative pairs, encouraging the model towards more refined learning. However, according to (Gao et al., 2021), a larger batch size is not always a better choice. For example, as shown in Figure 2, for the unsup-SimCSE-BERTbase model the optimal batch size is 64, and other settings of the batch size lower the performance. Therefore, we aim to figure out how to expand the negative pairs more effectively. In the community of computer vision, a feasible way to alleviate the GPU memory limitation when expanding the batch size is to introduce momentum contrast (He et al., 2020), which has also been applied to natural language understanding (Fang et al., 2020). Momentum contrast allows us to reuse the encoded embeddings from the immediately preceding mini-batches to expand the negative pairs, by maintaining a queue that always enqueues the sentence embeddings of the current mini-batch and meanwhile dequeues the "oldest" ones. As the enqueued sentence embeddings come from the preceding mini-batches, we keep a momentum-updated model by taking the moving average of its parameters, and use the momentum model to generate the enqueued sentence embeddings. Note that we turn off dropout when using the momentum encoder, which can narrow the gap between training and prediction.

Figure 2: The performance trend on the STS-B development set when the batch size changes, for the unsup-SimCSE-BERTbase model.

The above two optimizations are proposed separately for building positive and negative pairs. We finally combine both with unsup-SimCSE, which is termed Enhanced SimCSE (ESimCSE). We illustrate the schematic diagram of ESimCSE in Figure 1. The proposed ESimCSE is evaluated on the semantic text similarity (STS) task with 7 STS test sets. Experimental results show that ESimCSE can substantially improve the similarity measuring performance in different model settings over the previous state-of-the-art unsup-SimCSE. Specifically, ESimCSE gains an average increase in Spearman's correlation over unsup-SimCSE of +2.02% on BERTbase, +0.90% on BERTlarge, +0.87% on RoBERTabase, and +0.55% on RoBERTalarge, respectively.

Our contributions can be summarized as follows:

• We observe that unsup-SimCSE constructs each positive pair with two sentences of the same length, which can bias the learning process. We propose a simple but effective "word repetition" method to alleviate the problem.

• We propose to use the momentum contrast method to increase the number of negative pairs involved in the loss calculation, which encourages the model towards more refined learning.

• We conduct extensive experiments on several benchmark datasets w.r.t. the semantic text similarity task. The experimental results well demonstrate that both proposed optimizations bring substantial improvements to unsup-SimCSE.
Method | Text | Similarity
original sentence | I like this apple because it looks so fresh and I think it should be delicious. | 1.0
random insertion | I don't like this apple because but it looks so not fresh and I think it should be dog delicious. | 0.76
random deletion | I this apple because it looks so and I think it should be. | 0.77
word repetition | I like like this apple because it looks so so fresh and and I think it should be delicious. | 1.0
word repetition | I I like this apple apple because it looks looks so fresh fresh and I think it should be delicious delicious. | 0.98

Table 2: An example of different methods to change the length of a sentence. The similarity scores are predicted by the officially released "unsup-simcse-bert-base-uncased" model.

2 Background: Unsup-SimCSE

Given a set of paired sentences \{x_i, x_i^+\}_{i=1}^{m}, where x_i and x_i^+ are semantically related and are referred to as positive pairs, the core idea of unsup-SimCSE is to use identical sentences to build the positive pairs, i.e., x_i^+ = x_i. Note that in Transformer, there is a dropout mask placed on the fully-connected layers and the attention probabilities. The key ingredient is thus to feed the same input x_i to the encoder twice, applying different dropout masks z_i and z_i^+, and to output two separate sentence embeddings that build a positive pair as follows:

h_i = f_\theta(x_i, z_i), \quad h_i^+ = f_\theta(x_i, z_i^+) \qquad (1)

With h_i and h_i^+ for each sentence in a mini-batch with batch size N, the contrastive learning objective w.r.t. x_i is formulated as follows:

\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau}} \qquad (2)

where \tau is a temperature hyperparameter and \mathrm{sim}(h_i, h_i^+) is the similarity metric, which is typically the cosine similarity function:

\mathrm{sim}(h_i, h_i^+) = \frac{h_i^\top h_i^+}{\|h_i\| \cdot \|h_i^+\|} \qquad (3)
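For concreteness, the following is a minimal PyTorch sketch of the objective in Eqs. (2) and (3); the function name and the batch layout are our own illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def unsup_simcse_loss(h1: torch.Tensor, h2: torch.Tensor, tau: float = 0.05):
    # h1, h2: (N, d) embeddings of the same N sentences encoded twice
    # with different dropout masks, so (h1[i], h2[i]) is a positive pair.
    # Cosine similarity of every h1[i] with every h2[j] (Eq. 3),
    # scaled by the temperature tau, giving an (N, N) logit matrix.
    logits = F.normalize(h1, dim=-1) @ F.normalize(h2, dim=-1).T / tau
    # Row i should score highest at column i (its own positive);
    # the other N-1 columns act as in-batch negatives, as in Eq. (2).
    labels = torch.arange(h1.size(0), device=h1.device)
    return F.cross_entropy(logits, labels)

Averaging \ell_i over the batch reduces to a standard cross-entropy over the similarity matrix, which is why the sketch can delegate to F.cross_entropy.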

3 Proposed ESimCSE: Enhanced Unsup-SimCSE

In this section, we first introduce the word repetition method to construct better positive pairs. We then introduce the momentum contrast method to expand the negative pairs.

3.1 Word Repetition

The word repetition mechanism randomly duplicates some words/sub-words in a sentence. Here we take sub-word repetition as an example. Given a sentence s, after processing by a sub-word tokenizer we get a sub-word sequence x = \{x_1, x_2, ..., x_N\}, with N being the length of the sequence. We define the number of repeated tokens as

\mathrm{dup\_len} \in [0, \max(2, \mathrm{int}(\mathrm{dup\_rate} \times N))] \qquad (4)

where dup_rate is the maximal repetition rate, which is a hyperparameter. dup_len is a number randomly sampled from the set defined above, which introduces more diversity when extending the sequence length. After dup_len is determined, we use a uniform distribution to randomly select the dup_len sub-words to be repeated from the sequence, which composes the set dup_set as follows:

\mathrm{dup\_set} = \mathrm{uniform}(\mathrm{range}=[1, N], \mathrm{num}=\mathrm{dup\_len}) \qquad (5)

For example, if the 1st sub-word is in dup_set, then the sequence x becomes x^+ = \{x_1, x_1, x_2, ..., x_N\}. And unlike unsup-SimCSE, which passes x to the pre-trained BERT twice, ESimCSE passes x and x^+ independently.
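Below is a minimal Python sketch of sub-word repetition following Eqs. (4) and (5); the function name and the plain token-list interface are our own assumptions, and a real implementation would operate on a tokenizer's sub-word output.

import random

def sub_word_repetition(tokens, dup_rate=0.32):
    # tokens: list of sub-word strings produced by a tokenizer.
    n = len(tokens)
    # Eq. (4): sample how many sub-words to repeat; the min() guards
    # the degenerate case where max(2, int(dup_rate * n)) exceeds n.
    dup_len = min(n, random.randint(0, max(2, int(dup_rate * n))))
    # Eq. (5): uniformly choose which positions get repeated.
    dup_set = set(random.sample(range(n), k=dup_len))
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in dup_set:
            out.append(tok)  # repeat this sub-word once
    return out

# For example, sub_word_repetition(["I", "like", "this", "apple"]) may
# return ["I", "like", "like", "this", "apple"].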
3.2 Momentum Contrast

Momentum contrast allows us to reuse the encoded sentence embeddings from the immediately preceding mini-batches by maintaining a queue of a fixed size. Specifically, the embeddings in the queue are progressively replaced: when the output sentence embeddings of the current mini-batch are enqueued, the "oldest" ones in the queue are removed if the queue is full. Note that we use a momentum-updated encoder to encode the enqueued sentence embeddings. Formally, denoting the parameters of the encoder as \theta_e and those of the momentum-updated encoder as \theta_m, we update \theta_m in the following way:

\theta_m \leftarrow \lambda \theta_m + (1 - \lambda) \theta_e \qquad (6)

where \lambda \in [0, 1) is a momentum coefficient parameter. Note that only the parameters \theta_e are updated by back-propagation. We introduce \theta_m to generate the sentence embeddings for the queue because the momentum update makes \theta_m evolve more smoothly than \theta_e. As a result, though the embeddings in the queue are encoded by different encoders (at different "steps" during training), the difference among these encoders can be made small.

With the sentence embeddings in the queue, the loss function of ESimCSE is further modified as follows:

\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau} + \sum_{m=1}^{M} e^{\mathrm{sim}(h_i, h_m^+)/\tau}} \qquad (7)

where h_m^+ denotes a sentence embedding in the momentum-updated queue, and M is the size of the queue.
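The following is a minimal sketch of the momentum update of Eq. (6) together with the embedding queue; the class layout, method names, and the generic encoder callables are our own illustrative assumptions.

import torch

class MomentumQueue:
    # FIFO queue of sentence embeddings from preceding mini-batches.
    def __init__(self, encoder, momentum_encoder, max_size, lam=0.995):
        self.encoder = encoder                    # theta_e, trained by back-propagation
        self.momentum_encoder = momentum_encoder  # theta_m, updated only by Eq. (6)
        self.max_size = max_size                  # M, e.g. 2.5x the batch size
        self.lam = lam                            # momentum coefficient lambda
        self.embeddings = []                      # the queue of (d,) tensors

    @torch.no_grad()
    def step(self, batch_inputs):
        # Eq. (6): theta_m <- lambda * theta_m + (1 - lambda) * theta_e
        for p_m, p_e in zip(self.momentum_encoder.parameters(),
                            self.encoder.parameters()):
            p_m.data.mul_(self.lam).add_(p_e.data, alpha=1 - self.lam)
        # Encode the current batch with the momentum encoder; eval mode
        # turns off dropout, narrowing the train/predict gap.
        self.momentum_encoder.eval()
        self.embeddings.extend(self.momentum_encoder(batch_inputs))
        # Dequeue the "oldest" embeddings once the queue exceeds M.
        self.embeddings = self.embeddings[-self.max_size:]

The queued embeddings then enter the second sum in the denominator of Eq. (7) as extra negatives; since they are produced under torch.no_grad(), no gradient flows through them.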
4 Experiment

4.1 Evaluation Setup

Following unsup-SimCSE, we use 1 million sentences randomly drawn from English Wikipedia for training (https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt). We then conduct our experiments on 7 standard semantic textual similarity (STS) tasks; the detailed statistics are shown in Table 3. The STS12–STS16 datasets have no train or development sets, and thus we evaluate the models on the development set of STS-B to search for better settings of the hyper-parameters. The SentEval toolkit (https://github.com/facebookresearch/SentEval) is used for evaluation. For the compared baseline unsup-SimCSE, we download the officially published model checkpoints (https://github.com/princeton-nlp/SimCSE) and reproduce the evaluation results with the suggested hyper-parameters in dev/test mode. Experiments are conducted on Nvidia 3090 GPUs.

      | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R
train | 0     | 0     | 0     | 0     | 0     | 5,749 | 4,500
dev   | 0     | 0     | 0     | 0     | 0     | 1,500 | 500
test  | 3,108 | 1,500 | 3,750 | 3,000 | 1,186 | 1,379 | 4,927

Table 3: Data statistics of the standard semantic textual similarity (STS) tasks.

Semantic Textual Similarity Tasks. Semantic textual similarity measures the semantic similarity of any two sentences. STS 2012–2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016) and STS-B (Cer et al., 2017) are widely used semantic textual similarity benchmark datasets, which measure the semantic similarity of two sentences with the cosine similarity of the corresponding sentence embeddings. After deriving the semantic similarities of all pairs in the test set, we follow unsup-SimCSE in using Spearman correlation to measure the correlation between the ranks of the predicted similarities and the ground truth. For a set of size n, the n raw scores X_i, Y_i are converted to their corresponding ranks rg_{X_i}, rg_{Y_i}, and the Spearman correlation is then defined as follows:

r_s = \frac{\mathrm{cov}(rg_X, rg_Y)}{\sigma_{rg_X} \sigma_{rg_Y}} \qquad (8)

where \mathrm{cov}(rg_X, rg_Y) is the covariance of the rank variables, and \sigma_{rg_X} and \sigma_{rg_Y} are the standard deviations of the rank variables. Spearman correlation takes a value between -1 and 1, which is high when the ranks of the predicted similarities and the ground truth are similar.
fective. Better yet, these two modifications can be
4.3 Main Results superimposed (ESimCSE) to get further improve-
Table 4 shows the best results obtained on the STS- ments.
B development sets. We highlight the highest num-
bers among models with the same pre-trained en- 5 Ablation Study
coder as bold. ♣ denotes the evaluation results
This section investigates how different dropout
from the official published model by (Gao et al.,
rates, repetition rates, sentence-length-extension
2021). It can be seen that our proposed ESim-
methods, and momentum contrast queue size af-
CSE outperforms unsup-SimCSE by +2.40% on
fect ESimCSE’s performance. We only change
BERTbase , +2.19% on BERTlarge , +1.19% on
one hyperparameter at a time. All results use our
RoBERTabase , +0.26% on RoBERTalarge , respec-
ESimCSE-BERTbase model and are evaluated on
tively.
the development set of STS-B.
The comparison between the proposed ESim-
CSE and unsup-SimCSE on the development set 5.1 Effect of Dropout Rate
gives us the first glance at the superiority of the
proposed ESimCSE. Then we further evaluate the Dropout is the key ingredient to the unsup-SimCSE
corresponding checkpoints on the test sets. Ta- model, so different dropout rates p are crucial to
ble 5 shows the evaluation results on 7 STS test the model’s performance. According to (Gao et al.,
sets. It can be seen that ESimCSE substantially 2021), the optimal dropout rate for unsup-SimCSE-
BERTbase is p = 0.1. Considering that ESimCSE
additionally introduces word repetition and momen-
Model STS-B tum contrast mechanisms, we re-examine the im-
pact of different dropouts on its performance. We
unsup-SimCSE-BERTbase ♣ 82.45
experiment on three typical dropout rates, and the
ESimCSE-BERTbase 84.85 (+2.40)
results are shown in the table 8. Specifically, when
unsup-SimCSE-BERTlarge ♣ 84.41 the dropout is 0.1, it achieves the best performance
ESimCSE-BERTlarge 86.60 (+2.19) on the STS-B development set. When the dropout
unsup-SimCSE-RoBERTabase ♣ 83.91 increases to 0.15, the performance is close to that
ESimCSE-RoBERTabase 85.10 (+1.19) of 0.1, with no significant drop. And even when
the dropout reaches 0.2, the performance drops by
unsup-SimCSE-RoBERTalarge ♣ 85.07 nearly 1%, but it still outperforms unsup-SimCSE.
ESimCSE-RoBERTalarge 85.33 (+0.26) The experimental results kind of show the robust-
ness of the superiority of the proposed ESimCSE
Table 4: Sentence embedding performance on seman-
tic textual similarity (STS) development sets in terms over unsup-SimCSE, in terms of dropout rate.
of Spearman’s correlation, with BERTbase , BERTlarge ,
RoBERTabase , RoBERTalarge as base models. ♣ : 5.2 Effect of Repetition Rate
results from official published model by (Gao et al., Word repetition can bring improvement by diver-
2021). sifying the length difference of positive pairs in
Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg.
unsup-SimCSE-BERTbase ♣ | 68.40 | 82.41 | 74.38 | 80.91 | 78.56 | 76.85 | 72.23 | 76.25
ESimCSE-BERTbase | 73.40 | 83.27 | 77.25 | 82.66 | 78.81 | 80.17 | 72.30 | 78.27 (+2.02)
unsup-SimCSE-BERTlarge ♣ | 70.88 | 84.16 | 76.43 | 84.50 | 79.76 | 79.26 | 73.88 | 78.41
ESimCSE-BERTlarge | 73.21 | 85.37 | 77.73 | 84.30 | 78.92 | 80.73 | 74.89 | 79.31 (+0.90)
unsup-SimCSE-RoBERTabase ♣ | 70.16 | 81.77 | 73.24 | 81.36 | 80.65 | 80.22 | 68.56 | 76.57
ESimCSE-RoBERTabase | 69.90 | 82.50 | 74.68 | 83.19 | 80.30 | 80.99 | 70.54 | 77.44 (+0.87)
unsup-SimCSE-RoBERTalarge ♣ | 72.86 | 83.99 | 75.62 | 84.77 | 81.80 | 81.98 | 71.26 | 78.90
ESimCSE-RoBERTalarge | 73.20 | 84.93 | 76.88 | 84.86 | 81.21 | 82.79 | 72.27 | 79.45 (+0.55)

Table 5: Sentence embedding performance on the 7 semantic textual similarity (STS) test sets, in terms of Spearman's correlation, with BERTbase, BERTlarge, RoBERTabase, and RoBERTalarge as base models. ♣: results from the official published model by (Gao et al., 2021).

Model | STS-B
unsup-SimCSE-BERTbase ♣ | 82.45
+ word repetition | 84.09 (+1.64)
+ momentum contrast | 83.98 (+1.53)
ESimCSE-BERTbase | 84.85 (+2.40)

Table 6: Improvement on the STS-B development set that word repetition or momentum contrast brings to unsup-SimCSE.

Word repetition brings improvement by diversifying the length difference of positive pairs in the proposed ESimCSE. Intuitively, few repetitions have a limited impact on the diversity of the length difference, while many repetitions will distort the semantics of sentences, making the positive pairs not solid enough. To quantitatively study the effect of the repetition rate on model performance, we gradually increase the repetition rate parameter dup_rate from 0.08 to 0.36, in increments of 0.04. As shown in Table 9, ESimCSE-BERTbase achieves the best performance when dup_rate = 0.32; a larger or smaller dup_rate causes performance degradation, which is consistent with our intuition. Although there are small fluctuations, most results of the proposed ESimCSE still exceed the best results of unsup-SimCSE-BERTbase.

dup_rate | 0.08  | 0.12  | 0.16  | 0.20  | 0.24  | 0.28  | 0.32  | 0.36
STS-B    | 83.50 | 83.62 | 82.01 | 83.01 | 84.24 | 82.96 | 84.85 | 83.84

Table 9: Effects of the repetition rate dup_rate on the STS-B development set in terms of Spearman's correlation.

5.3 Effect of Sentence-Length-Extension Method

In addition to sub-word repetition, we also explore three other methods to increase sentence length: word repetition, inserting stop-words, and inserting [MASK] (Devlin et al., 2018). The implementation of word repetition is similar to sub-word repetition, except that the repetition operation occurs before tokenization. For example, given the word "microbiology", word repetition will produce "microbiology microbiology", while sub-word repetition will produce "micro micro ##biology" or "micro ##biology ##biology". Inserting stop-words is another word-level expansion method: the selection of the insertion position is the same as in word repetition, except that the selected word is no longer repeated; instead, a random stop-word is inserted after it. Inserting [MASK] is similar, inserting a [MASK] token after the selected word. This resembles the pre-training input of BERT, so we can regard [MASK] as a dynamic, context-compatible word placeholder. As shown in Table 10, sub-word repetition achieves the best performance, and word repetition also brings a good improvement, which shows that more fine-grained repetition can better alleviate the bias brought by the length difference of positive pairs. Inserting [MASK] also brings a small improvement, but inserting stop-words slightly decreases the performance.

Length-extension Method | STS-B
unsup-SimCSE-BERTbase | 82.45
Inserting Stop-words | 81.72
Inserting [MASK] | 83.08
Word Repetition | 84.40
Sub-word Repetition | 84.85

Table 10: Effects of the sentence-length-extension method on the STS-B development set in terms of Spearman's correlation.
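To make the compared length-extension variants concrete, here is a minimal Python sketch of the three word-level alternatives; the function name, the mode flags, and the tiny stop-word list are our own illustrative assumptions.

import random

STOP_WORDS = ["the", "a", "of", "and", "to"]  # illustrative subset

def extend_sentence(words, dup_rate=0.32, mode="repeat"):
    # Word-level length extension: after each selected position, either
    # repeat the word, insert a random stop-word, or insert [MASK].
    n = len(words)
    dup_len = min(n, random.randint(0, max(2, int(dup_rate * n))))
    positions = set(random.sample(range(n), k=dup_len))
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i in positions:
            if mode == "repeat":        # word repetition
                out.append(w)
            elif mode == "stopword":    # inserting stop-words
                out.append(random.choice(STOP_WORDS))
            elif mode == "mask":        # inserting [MASK]
                out.append("[MASK]")
    return out

# Sub-word repetition differs only in applying mode="repeat" to the
# tokenizer's sub-word sequence instead of to whole words.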
Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg.
unsup-SimCSE-BERTbase ♣ | 68.40 | 82.41 | 74.38 | 80.91 | 78.56 | 76.85 | 72.23 | 76.25
+ word repetition | 69.79 | 83.43 | 75.65 | 82.44 | 79.43 | 79.44 | 71.86 | 77.43 (+1.18)
+ momentum contrast | 71.41 | 82.23 | 74.94 | 82.99 | 79.85 | 79.48 | 71.85 | 77.54 (+1.29)
ESimCSE-BERTbase | 73.40 | 83.27 | 77.25 | 82.66 | 78.81 | 80.17 | 72.30 | 78.27 (+2.02)

Table 7: Improvements on the 7 STS test sets that word repetition or momentum contrast brings to unsup-SimCSE.

5.4 Effect of Queue Size in Momentum Contrast

The size of the momentum contrast queue determines the number of negative pairs involved in the loss calculation. Without considering the time cost and the limitation of GPU memory, can a larger queue size lead to better performance? We take BERTbase as the base model for ESimCSE and experiment with queue sizes equal to different multiples of the batch size. The experimental results are listed in Table 11. The optimal result is reached when the queue size is 2.5 times the batch size; a smaller or larger queue size reduces the effect. This is intuitive: the introduction of momentum contrast encourages more negative pairs to participate in the loss calculation, so that the positive pairs can be compared more sufficiently, but a too-large queue size also reduces the benefit. We conjecture that this is because the negative pairs in the momentum contrast are generated by past "steps" during training, and a larger queue uses the outputs of more outdated encoder models, which are quite different from the current one, thus reducing the reliability of the loss calculation.

Queue Size | STS-B
1 × batch size | 83.83
1.5 × batch size | 83.81
2 × batch size | 83.03
2.5 × batch size | 84.85
3 × batch size | 82.66

Table 11: Effects of the queue size of momentum contrast on the STS-B development set in terms of Spearman's correlation.

6 Related Work

Unsupervised sentence representation learning has been widely studied. Socher et al. (2011), Hill et al. (2016), and Le and Mikolov (2014) propose to learn sentence representations according to the internal structure of each sentence. Kiros et al. (2015) and Logeswaran and Lee (2018) predict the surrounding sentences of a given sentence based on the distributional hypothesis. Pagliardini et al. (2017) propose Sent2Vec, a simple unsupervised model that composes sentence embeddings from word vectors along with n-gram embeddings.
Recently, contrastive learning has been explored in unsupervised sentence representation learning and has become a promising trend (Zhang et al., 2020; Wu et al., 2020; Meng et al., 2021; Gao et al., 2021; Yan et al., 2021). These contrastive-learning-based methods for sentence embeddings are generally based on the assumption that a good semantic representation should be able to bring similar sentences closer while pushing away dissimilar ones. Therefore, these methods use various data augmentation methods to randomly generate two different views for each sentence and design an effective loss function to make them closer in the semantic representation space. Among these contrastive methods, the ones most related to our work are unsup-ConSERT and unsup-SimCSE. ConSERT explores various effective data augmentation strategies (e.g., adversarial attack, token shuffling, cutoff, dropout) to generate different views for contrastive learning and analyzes their effects on unsupervised sentence representation transfer. Unsup-SimCSE, the current state-of-the-art unsupervised method, uses only standard dropout as minimal data augmentation, and feeds an identical sentence to a pre-trained model twice with independently sampled dropout masks to generate two distinct sentence embeddings as a positive pair. Unsup-SimCSE is very simple but works surprisingly well, performing on par with previously supervised counterparts. However, we find that unsup-SimCSE constructs each positive pair with two sentences of the same length, which can mislead the learning of sentence embeddings, so we propose a simple but effective method termed "word repetition" to alleviate it. We also propose to use the momentum contrast method to increase the number of negative pairs involved in the loss calculation, which encourages the model towards more refined learning.

7 Conclusion and Future Work

In this paper, we propose optimizations for constructing the positive and negative pairs of unsup-SimCSE and combine them with unsup-SimCSE, which is termed ESimCSE. Through extensive experiments, the proposed ESimCSE achieves considerable improvements over unsup-SimCSE on standard semantic text similarity tasks.

Unsup-SimCSE treats all negative pairs with the same importance. However, some negative pairs are quite different from positive pairs, while others are relatively close to them. This distinction would be helpful for embedding retrieval tasks, but it is not reflected in the objective function of unsup-SimCSE. Therefore, in the future we will focus on designing a more refined objective function to improve the discrimination between different negative pairs.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez Agirre, Rada Mihalcea, German Rigau Claramunt, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511. Association for Computational Linguistics.

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity, multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie. 2020. CERT: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483.

Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. COCO-LM: Correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2017. Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv preprint arXiv:1703.02507.

Richard Socher, Eric Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, 24.

Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive learning for sentence representation. arXiv preprint arXiv:2012.15466.

Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741.

Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. arXiv preprint arXiv:2009.12061.
