
Adapting BERT for Word Sense Disambiguation with

Gloss Selection Objective and Example Sentences

Boon Peng Yap, Andrew Koh, Eng Siong Chng


Nanyang Technological University, Singapore
{boonpeng001, andr0081, aseschng}@ntu.edu.sg

Abstract

Domain adaptation or transfer learning using pre-trained language models such as BERT has proven to be an effective approach for many natural language processing tasks. In this work, we propose to formulate word sense disambiguation as a relevance ranking task and fine-tune BERT on a sequence-pair ranking task to select the most probable sense definition given a context sentence and a list of candidate sense definitions. We also introduce a data augmentation technique for WSD using existing example sentences from WordNet. With the proposed training objective and data augmentation technique, our models achieve state-of-the-art results on the English all-words benchmark datasets.¹

¹ Code and pre-trained models are available at https://github.com/BPYap/BERT-WSD.

1 Introduction

In natural language processing, Word Sense Disambiguation (WSD) refers to the task of identifying the exact sense of an ambiguous word given its context (Navigli, 2009). More specifically, WSD associates ambiguous words with predefined senses from an external sense inventory, e.g. WordNet (Miller, 1995) or BabelNet (Navigli and Ponzetto, 2010).

Recent studies in learning contextualized word representations from language models, e.g. ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), attempt to alleviate the issue of insufficient labeled data by first pre-training a language model on a large text corpus through self-supervised learning. The weights of the pre-trained language model can then be fine-tuned on downstream NLP tasks such as question answering and natural language inference. For WSD, pre-trained BERT has been utilized in multiple ways with varying degrees of success. Notably, Huang et al. (2019) proposed GlossBERT, a model based on fine-tuning BERT on a sequence-pair binary classification task, and achieved state-of-the-art results in terms of single-model performance on several English all-words WSD benchmark datasets.

In this paper, we extend the sequence-pair WSD model and propose a new task objective that better exploits the inherent relationships within positive and negative sequence pairs. Briefly, our contribution is two-fold: (1) we formulate WSD as a gloss selection task, in which the model learns to select the best context-gloss pair from a group of related pairs; (2) we demonstrate how to make use of additional lexical resources, namely the example sentences from WordNet, to further improve WSD performance.

We fine-tune BERT using the gloss selection objective on SemCor (Miller et al., 1994) plus additional training instances constructed from the WordNet example sentences, and evaluate its impact on several commonly used benchmark datasets for English all-words WSD. Experimental results show that the gloss selection objective can indeed improve WSD performance, and that using WordNet example sentences as additional training data offers a further performance boost.

2 Related Work

BERT (Devlin et al., 2019) is a language representation model based on a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017). Previous experimental results have shown that significant improvements can be achieved on many downstream NLP tasks by fine-tuning BERT on those tasks. Several methods have been proposed to apply BERT to WSD. In this section, we briefly describe two commonly used approaches: feature-based and fine-tuning approaches.

2.1 Feature-based Approaches

Feature-based WSD systems make use of contextualized word embeddings from BERT as input features for task-specific architectures. Vial et al. (2019) used the contextual embeddings as inputs to a Transformer-based classifier. They proposed two sense vocabulary compression techniques that reduce the number of output classes by exploiting the semantic relationships between different senses. The Transformer-based classifiers were trained from scratch on the reduced output classes using SemCor and the WordNet Gloss Corpus (WNGC). Their ensemble model, which consists of 8 independently trained classifiers, achieved state-of-the-art results on the English all-words WSD benchmark datasets.

Besides deep learning-based approaches, Loureiro and Jorge (2019) and Scarlini et al. (2020) construct sense embeddings using the contextual embeddings from BERT. The former generates sense embeddings by averaging the contextual embeddings of sense-annotated tokens taken from SemCor, while the latter constructs sense embeddings by concatenating the contextual embeddings of BabelNet definitions with the contextual embeddings of Wikipedia contexts. For WSD, both approaches make use of the constructed sense embeddings in nearest neighbor classification (kNN); the simple 1-nearest-neighbor approach of Scarlini et al. (2020) showed substantial improvement on the nominal category of the English all-words WSD benchmark datasets.
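The nearest-neighbor step can be illustrated with a short sketch. The snippet below is not the pipeline of Loureiro and Jorge (2019) or Scarlini et al. (2020); it only shows generic 1-nearest-neighbor matching between a contextual embedding of the target word and precomputed sense embeddings, with random vectors standing in for real BERT features and purely illustrative sense keys.

```python
import numpy as np

def nearest_sense(context_vec, sense_vecs):
    """Return the sense key whose embedding is most similar (cosine)
    to the contextual embedding of the target word."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(sense_vecs, key=lambda key: cosine(context_vec, sense_vecs[key]))

# Toy usage: random vectors stand in for real BERT embeddings,
# and the sense keys are illustrative placeholders.
rng = np.random.default_rng(0)
sense_vecs = {"bank_sloping_land": rng.normal(size=768),
              "bank_financial_institution": rng.normal(size=768)}
print(nearest_sense(rng.normal(size=768), sense_vecs))
```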
2.2 Fine-tuning Approaches

Fine-tuning WSD systems directly adjust the pre-trained weights on annotated corpora rather than learning new weights from scratch. Du et al. (2019) fine-tuned two separate and independent BERT models simultaneously: one to encode sense-annotated sentences and another to encode sense definitions from WordNet. The hidden states from the 2 encoders are then concatenated and used to train a multilayer perceptron classifier for WSD.

Huang et al. (2019) proposed GlossBERT, which fine-tunes BERT on a sequence-pair binary classification task. The training data consist of context-gloss pairs constructed from annotated sentences in SemCor and sense definitions in WordNet 3.0. Each context-gloss pair contains a sentence from SemCor with a target word to be disambiguated (the context) and a candidate sense definition of the target word from WordNet (the gloss). During fine-tuning, GlossBERT classifies each context-gloss pair as either positive or negative depending on whether the sense definition corresponds to the correct sense of the target word in the context. Each context-gloss pair is treated as an independent training instance and is shuffled to a random position at the start of each training epoch. At inference time, the context-gloss pair with the highest output score from the positive neuron among all candidates is chosen as the best answer.

In this paper, we use similar context-gloss pairs as inputs for our proposed WSD model. However, instead of treating each context-gloss pair as an independent training instance, we group related context-gloss pairs into one training instance, i.e. context-gloss pairs with the same context but different candidate glosses form one group. Using groups of context-gloss pairs as training data, we formulate WSD as a ranking/selection problem in which the most probable sense is ranked first. By processing all related candidate senses in one go, the WSD model is able to learn better discriminating features between positive and negative context-gloss pairs.

3 Methodology

We describe the implementation details of our approaches in this section. When customizing BERT for WSD, we use a linear layer consisting of a single neuron in the output layer to compute the relevance score for each context-gloss pair, in contrast to the binary classification layer used in GlossBERT. Additionally, we extract example sentences from WordNet 3.0 and use them as additional training data on top of the sense-annotated sentences from SemCor.

3.1 Gloss Selection Objective

Following Huang et al. (2019), we construct positive and negative context-gloss pairs by combining annotated sentences from SemCor and sense definitions from WordNet 3.0. A positive pair contains a gloss representing the correct sense of the target word, while a negative pair contains a negative candidate gloss. Each target word in the contexts is surrounded by two special [TGT] tokens. We group context-gloss pairs with the same context and target word into a single training instance so that they are processed sequentially by the neural network.
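The following is a minimal sketch of this construction step, assuming NLTK's WordNet interface; the function name, the whitespace tokenization and the way the candidate list is capped are illustrative simplifications rather than the exact code in our repository.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def build_training_instance(tokens, target_index, lemma, pos, gold_synset, max_pairs=6):
    """Assemble one training instance: a group of context-gloss pairs for a
    single annotated target word, plus the index of the positive pair."""
    # Surround the target word with two [TGT] tokens.
    context = " ".join(tokens[:target_index] + ["[TGT]", tokens[target_index], "[TGT]"]
                       + tokens[target_index + 1:])
    pairs, positive_index = [], None
    # Simplification: assumes the gold synset is among the first max_pairs candidates.
    for synset in wn.synsets(lemma, pos=pos)[:max_pairs]:
        if synset == gold_synset:
            positive_index = len(pairs)      # position of the correct gloss
        pairs.append((context, synset.definition()))
    return pairs, positive_index

# Example: one instance for the target word "bank".
tokens = "He crawled back up the bank toward the rampart".split()
pairs, positive_index = build_training_instance(
    tokens, target_index=5, lemma="bank", pos=wn.NOUN,
    gold_synset=wn.synset("bank.n.01"))
```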

42
[Figure 1: four context-gloss pairs for the target word "bank" ("... crawl back up the [TGT] bank [TGT] toward ..."), each paired with a candidate WordNet gloss, are encoded by BERT; a shared linear layer maps each [CLS] hidden state to a relevance score s1..s4, and a softmax over the group feeds the cross-entropy loss.]
Figure 1: Visualisation of the gloss selection objective when computing the loss value for a training instance. The
context “He turned slowly and began to crawl back up the bank toward the rampart.” is annotated with the target
word “bank”. A training instance consists of n context-gloss pairs (n=4 in this case), including 1 positive pair
(shown in green) and n-1 negative pairs (shown in red). The order of the context-gloss pairs within each training
instance is randomized during the dataset construction step.

As illustrated in Figure 1, the output layer takes the hidden state of the [CLS] token from each context-gloss pair as input and calculates the corresponding relevance score. A softmax layer then aggregates the relevance scores from the same group and computes the training loss using cross entropy as the loss function. Formally, the gloss selection objective is given as follows:

loss = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \mathbb{1}(y_i, j)\,\log(p_{ij})        (1)

where m is the batch size, n_i is the number of candidate glosses for the i-th training instance, 1(y_i, j) is the binary indicator of whether index j equals the index of the positive context-gloss pair y_i, and p_{ij} is the softmax value for the j-th candidate sense of the i-th training instance, computed using the following equation:

p_{ij} = \frac{\exp(\mathrm{Rel}(\mathrm{context}_i, \mathrm{gloss}_{ij}))}{\sum_{k=1}^{n_i} \exp(\mathrm{Rel}(\mathrm{context}_i, \mathrm{gloss}_{ik}))}        (2)

where Rel(context, gloss) denotes the relevance score of a context-gloss pair from the output layer. A similar formulation was presented for web document ranking (Huang et al., 2013) and question-answering natural language inference (Liu et al., 2019). In the case of WSD, we are only interested in the top-1 context-gloss pair. Hence, during testing, we select the context-gloss pair with the highest relevance score and its corresponding sense as the most probable sense for the target word.
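A minimal PyTorch sketch of Equations (1) and (2) and of the test-time selection is given below; it assumes the relevance scores for each group of context-gloss pairs have already been produced by the BERT encoder and the single-neuron linear head, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def gloss_selection_loss(score_groups, positive_indices):
    """Cross entropy over relevance scores within each group of
    context-gloss pairs (Equations 1 and 2).

    score_groups:     list of 1-D tensors, one per training instance,
                      holding Rel(context_i, gloss_ij) for each candidate j
    positive_indices: list of ints, index of the correct gloss y_i per instance
    """
    losses = []
    for scores, y in zip(score_groups, positive_indices):
        log_p = F.log_softmax(scores, dim=0)   # p_ij over the candidates (Eq. 2)
        losses.append(-log_p[y])               # -log p_{i, y_i}          (Eq. 1)
    return torch.stack(losses).mean()

def predict(scores):
    """At test time, pick the candidate gloss with the highest relevance score."""
    return int(torch.argmax(scores))

# Toy usage: two instances with 4 and 3 candidate glosses respectively.
groups = [torch.randn(4, requires_grad=True), torch.randn(3, requires_grad=True)]
loss = gloss_selection_loss(groups, positive_indices=[2, 0])
loss.backward()
```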
3.2 Data Augmentation using Example Sentences

Most synsets in WordNet 3.0 include one or more short sentences illustrating the usage of the synset members (i.e. synonyms). We introduce a relatively straightforward data augmentation technique that combines these example sentences with positive/negative glosses to form additional context-gloss pairs. First, example sentences (contexts) are extracted from each synset, and target words are identified via keyword matching and annotated with two [TGT] tokens. Then, context-gloss pairs are constructed by combining the annotated contexts with positive and negative glosses. Using this technique, we obtain 37,596 additional training instances (about 17% more training instances).
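A simplified sketch of this augmentation step, again assuming NLTK's WordNet interface, is shown below; the keyword matching here is reduced to matching any lemma of the synset and is only meant to illustrate the idea, not to reproduce our exact matching rules.

```python
from nltk.corpus import wordnet as wn

def augmentation_contexts(lemma, pos):
    """Turn WordNet example sentences into extra annotated contexts:
    find a synset member in each example and wrap it in [TGT] tokens."""
    contexts = []
    for synset in wn.synsets(lemma, pos=pos):
        lemma_names = {l.replace("_", " ").lower() for l in synset.lemma_names()}
        for example in synset.examples():
            tokens = example.split()
            for i, tok in enumerate(tokens):
                if tok.lower().strip(".,;!?") in lemma_names:   # simple keyword matching
                    marked = tokens[:i] + ["[TGT]", tok, "[TGT]"] + tokens[i + 1:]
                    contexts.append((" ".join(marked), synset))
                    break
    return contexts

# Each returned (context, synset) can then be expanded into positive and
# negative context-gloss pairs in the same way as the SemCor sentences.
print(augmentation_contexts("bank", wn.NOUN)[:2])
```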
4 Experiments

In this section, we introduce the datasets and experiment settings used to fine-tune BERT. We also present the evaluation results of each model and compare them against existing WSD systems.

4.1 Datasets

Both training and testing datasets were obtained from the unified evaluation framework for WSD (Raganato et al., 2017b). Our training dataset for gloss selection consists of 2 parts: a baseline dataset with 226,036 training instances constructed from SemCor, and an augmented dataset with 37,596 training instances constructed using the data augmentation method. When constructing the context-gloss pairs for the training datasets, we select a maximum of n = 6 context-gloss pairs per training instance; for the testing datasets, all possible candidate context-gloss pairs are considered.

| Category | System | SE07 (Dev) | SE2 | SE3 | SE13 | SE15 | Noun | Verb | Adj | Adv | ALL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| KB | Most frequent sense baseline | 54.5 | 65.6 | 66.0 | 63.8 | 67.1 | 67.7 | 49.8 | 73.1 | 80.5 | 65.5 |
| KB | Leskext+emb | 56.7 | 63.0 | 63.7 | 66.2 | 64.6 | 70.0 | 51.1 | 51.7 | 80.6 | 64.2 |
| KB | Babelfy | 51.6 | 67.0 | 63.5 | 66.4 | 70.3 | 68.9 | 50.7 | 73.2 | 79.8 | 66.4 |
| Sup | IMS+emb | 62.6 | 72.2 | 70.4 | 65.9 | 71.5 | 71.9 | 56.6 | 75.9 | 84.7 | 70.1 |
| Sup | LSTM-LP | 63.5 | 73.8 | 71.8 | 69.5 | 72.6 | - | - | - | - | - |
| Sup | Bi-LSTM | - | 71.1 | 68.4 | 64.8 | 68.3 | 69.5 | 55.9 | 76.2 | 82.4 | 68.4 |
| Sup | HCAN | - | 72.8 | 70.3 | 68.5 | 72.8 | 72.7 | 58.2 | 77.4 | 84.1 | 71.1 |
| Feat | LMMS2348 (BERT) | 68.1 | 76.3 | 75.6 | 75.1 | 77.0 | - | - | - | - | 75.4 |
| Feat | SemCor+WNGC, hypernyms (single) | - | - | - | - | - | - | - | - | - | 77.1 |
| Feat | SemCor+WNGC, hypernyms (ensemble) | 73.4 | 79.7 | 77.8 | 78.7 | 82.6 | 81.4 | 68.7 | 83.7 | 85.5 | 79.0 |
| Feat | SENSEMBERTsup | - | - | - | - | - | 80.4 | - | - | - | - |
| Feat | BEM² | 74.5 | 79.4 | 77.4 | 79.7 | 81.7 | 81.4 | 68.5 | 83.0 | 87.9 | 79.0 |
| Feat | EWISERhyper² | 75.2 | 80.8 | 79.0 | 80.7 | 81.8 | 82.9 | 69.4 | 83.6 | 87.3 | 80.1 |
| FT | BERTdef | - | 76.4 | 74.9 | 76.3 | 78.3 | 78.3 | 65.2 | 80.5 | 83.8 | 76.3 |
| FT | GlossBERT (Sent-CLS-WS) | 72.5 | 77.7 | 75.2 | 76.1 | 80.4 | 79.3 | 66.9 | 78.2 | 86.4 | 77.0 |
| Ours | BERTbase (baseline) | 73.6 | 79.4 | 76.8 | 77.4 | 81.5 | 80.6 | 67.9 | 82.2 | 87.3 | 78.2 |
| Ours | BERTbase (augmented) | 73.6 | 79.3 | 76.9 | 79.1 | 82.0 | 81.3 | 67.7 | 82.2 | 87.9 | 78.7 |
| Ours | BERTlarge (baseline) | 73.0 | 79.9 | 77.4 | 78.2 | 81.8 | 81.2 | 68.8 | 81.5 | 88.2 | 78.7 |
| Ours | BERTlarge (augmented) | 72.7 | 79.8 | 77.8 | 79.7 | 84.4 | 82.6 | 68.5 | 82.1 | 86.4 | 79.5 |

Table 1: F1-score (%) on the English all-words WSD benchmark datasets in Raganato et al. (2017b). SE07 is the development set; SE2, SE3, SE13 and SE15 are the test sets; Noun, Verb, Adj, Adv and ALL are computed on the concatenation of all datasets. The systems are grouped into 5 categories: i) knowledge-based systems (KB), i.e. the most frequent sense baseline, Leskext+emb (Basile et al., 2014) and Babelfy (Moro et al., 2014); ii) supervised models (Sup), i.e. IMS+emb (Iacobacci et al., 2016), LSTM-LP (Yuan et al., 2016), Bi-LSTM (Raganato et al., 2017a) and HCAN (Luo et al., 2018); iii) feature-based approaches using contextual embeddings from BERT (Feat), i.e. LMMS2348 (Loureiro and Jorge, 2019), SemCor+WNGC (Vial et al., 2019), SENSEMBERTsup (Scarlini et al., 2020), BEM (Blevins and Zettlemoyer, 2020) and EWISERhyper (Bevilacqua and Navigli, 2020); iv) fine-tuning approaches using BERT (FT), i.e. BERTdef (Du et al., 2019) and GlossBERT (Huang et al., 2019); v) our models (Ours).

² For reference, we included the results from ACL 2020. Since these results were not available at the time of writing this paper, we do not compare with them in Section 4.3.

The testing dataset contains 5 benchmark datasets from previous Senseval and SemEval competitions, including Senseval-2 (SE2), Senseval-3 (SE3), SemEval-07 (SE07), SemEval-13 (SE13), and SemEval-15 (SE15). Following Huang et al. (2019) and others, we choose SemEval-07 as the development set for tuning hyperparameters.

4.2 Experiment Settings

We experiment with both uncased BERTbase and BERTlarge models. BERTbase has 110M parameters with 12 Transformer layers, 768 hidden units and 12 self-attention heads, while BERTlarge has 340M parameters with 24 Transformer layers, 1024 hidden units and 16 self-attention heads. We use the implementation from the transformers package (Wolf et al., 2019). In total, we trained 4 models under 2 setups: (1) BERTbase/large (baseline), using only the baseline dataset; (2) BERTbase/large (augmented), using the concatenation of the baseline and augmented datasets.

During fine-tuning, we set the initial learning rate to 2e-5 with a batch size of 128 over 4 training epochs. The remaining hyperparameters are kept at the default values specified in the transformers package.
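The snippet below sketches this setup with the transformers package: a BERT encoder with a single-neuron scoring head, an added [TGT] special token and an AdamW optimizer at the stated learning rate. The class and variable names are illustrative and the data pipeline is omitted; it is a schematic under these assumptions rather than our released training script.

```python
import torch
from transformers import BertModel, BertTokenizer

class BertWSDRanker(torch.nn.Module):
    """BERT encoder plus a single-neuron linear head that maps the [CLS]
    hidden state of each context-gloss pair to a relevance score."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.scorer = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_hidden = outputs[0][:, 0]            # [CLS] hidden state per pair
        return self.scorer(cls_hidden).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[TGT]"]})
model = BertWSDRanker()
model.bert.resize_token_embeddings(len(tokenizer))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # batch size 128, 4 epochs
```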
4.3 Evaluation Results

We evaluate the performance of each model and report the F1-scores in Table 1, along with the results from other WSD systems.

All 4 of our models trained with the proposed gloss selection objective show substantial improvement over the non-ensemble systems across all benchmark datasets, which signifies the effectiveness of this task formulation.³ The addition of the augmented training set further improves performance, particularly in the noun category. It is worth noting that Du et al. (2019) and Huang et al. (2019) reported slightly worse or identical results when fine-tuning BERTlarge, but both of our models fine-tuned on BERTlarge obtain considerably better results than their BERTbase counterparts. This may be partially attributed to the fact that we use the recently released whole-word masking variant of BERTlarge, which was shown to have better performance on the Multi-Genre Natural Language Inference (MultiNLI) benchmark.

³ Statistically different from previously reported results (p = 0.05) under a one-sided randomization test on the F1-scores of the concatenated dataset.

Although the BERTlarge (augmented) model has a lower F1-score on the development dataset, it outperforms the ensemble system consisting of eight independent BERTlarge models on three testing datasets and achieves the best F1-score on the concatenation of all datasets.

To illustrate that the improvement in WSD performance comes from the gloss selection objective rather than the hyperparameter settings, we fine-tune a BERTbase model on the unaugmented training set using the same hyperparameter settings as GlossBERT (Huang et al., 2019), i.e. setting the learning rate and batch size to 2e-5 and 64 respectively, and using 4 context-gloss pairs for each target word. As shown in Table 2, our model fine-tuned with the proposed gloss selection objective consistently outperforms GlossBERT across all benchmark datasets under the same hyperparameter settings.

| System | SE07 | SE2 | SE3 | SE13 | SE15 |
|---|---|---|---|---|---|
| GlossBERT | 72.5 | 77.7 | 75.2 | 76.1 | 80.4 |
| BERTbase | 73.0 | 79.1 | 77.3 | 77.4 | 81.0 |

Table 2: Comparison of F1-score (%) on different benchmark datasets between GlossBERT and a BERTbase model fine-tuned with the gloss selection objective.

5 Conclusion

We proposed the gloss selection objective for supervised WSD, which formulates WSD as a relevance ranking task based on context-gloss pairs. Our models fine-tuned with this objective outperform other non-ensemble systems on five English all-words benchmark datasets. Furthermore, we demonstrate how to generate additional training data without external annotations using existing example sentences from WordNet, which provides an extra performance boost and enables our single-model system to surpass the state-of-the-art ensemble system by a considerable margin on a number of benchmark datasets.

Acknowledgements

We thank the meta-reviewer, the three anonymous reviewers and Ms. Vu Thi Ly for their insightful feedback and suggestions.

References

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. An enhanced Lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1591–1600.

Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2854–2864, Online.

Terra Blevins and Luke Zettlemoyer. 2020. Moving down the long tail of word sense disambiguation with gloss informed bi-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1006–1017, Online.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jiaju Du, Fanchao Qi, and Maosong Sun. 2019. Using BERT for word sense disambiguation. arXiv preprint arXiv:1909.08358.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuan-Jing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3500–3505.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 897–907.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496.

Daniel Loureiro and Alipio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691.

Fuli Luo, Tianyu Liu, Zexue He, Qiaolin Xia, Zhifang Sui, and Baobao Chang. 2018. Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1402–1411.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In Proceedings of the Workshop on Human Language Technology, pages 240–243. Association for Computational Linguistics.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):1–69.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017a. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1156–1167.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017b. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110.

Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. 2020. SensEmBERT: Context-enhanced sense embeddings for multilingual word sense disambiguation. In Proceedings of AAAI.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the semantic knowledge of WordNet for neural word sense disambiguation. In WordNet Conference, page 108.

Thomas Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1374–1385.

Appendix

A Additional Details on Experiment Settings

All models are trained using a single Nvidia Tesla K40 GPU with 12 GB of memory. Gradient accumulation is used to accommodate the large batch size.

For the hyperparameter search, we manually tune for the optimal hyperparameter combination using the following candidate values:

• BERT variant: {cased, uncased}
• Maximum number of glosses per context: {4, 6}
• Batch size: {32, 64, 128}
• Initial learning rate: {2e-5, 3e-5, 5e-5}
• Warm-up steps: {0, 0.1 * total steps}

At the testing stage, the model checkpoint with the highest F1 score on the development dataset (SemEval-07), evaluated every 1000 steps over 4 training epochs, is selected for evaluation on the testing datasets. We use the scoring script downloaded from http://lcl.uniroma1.it/wsdeval/home.
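As a rough illustration of the gradient accumulation mentioned above, the toy loop below accumulates gradients over 8 micro-batches of 16 instances so that each optimizer step corresponds to an effective batch size of 128; the linear model and squared-error loss are placeholders standing in for the BERT ranker and the gloss selection loss.

```python
import torch

# Toy stand-ins; in the real setup these would be the BERT ranker and the
# context-gloss pair loader from the earlier sections.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
micro_batches = [torch.randn(16, 10) for _ in range(32)]   # 16 instances per micro-batch

accumulation_steps = 8          # 8 x 16 = effective batch size of 128
optimizer.zero_grad()
for step, batch in enumerate(micro_batches):
    loss = model(batch).pow(2).mean()                      # placeholder loss
    (loss / accumulation_steps).backward()                  # scale before accumulating
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                    # one update per 128 instances
        optimizer.zero_grad()
```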
