
current state of the art. Two major approaches were followed. First, Melamud et al. (2016) and Yuan et al. (2016), for instance, compute sentence context vectors for ambiguous target words. In the prediction phase, they select nearest neighbors of these context vectors to determine the target word sense. Yuan et al. (2016) additionally use unlabeled sentences in a semi-supervised label propagation approach to overcome the sparse training data problem of the WSD task. Second, Kågebäck and Salomonsson (2016) employ a recurrent neural network to classify sense labels for an ambiguous target word given its surrounding sentence context. In contrast to earlier approaches, which relied on feature engineering (Taghipour and Ng, 2015), their architecture uses only pretrained GloVe word embeddings (Pennington et al., 2014) and achieves SOTA results on two English lexical sample datasets.

For the all-words WSD task, Vial et al. (2018a) also employ a recurrent neural network, but instead of single target words, they sequentially classify sense labels for all tokens in a sentence. They also introduce an approach that collapses the WordNet sense vocabulary into unambiguous hypersenses, which increases the label-to-sample ratio for each label, i.e. sense identifier. By training their network on the large sense-annotated datasets SemCor (Miller et al., 1993) and the Princeton Annotated Gloss Corpus, which is based on WordNet synset definitions (Fellbaum, 1998), they achieve the highest performance so far on most all-words WSD benchmarks. A similar architecture with an enhanced sense vocabulary compression was applied by Vial et al. (2019), but instead of GloVe embeddings, BERT wordpiece embeddings (Devlin et al., 2019) are used as input for training. The BERT embeddings in particular further improved performance, yielding new state-of-the-art results.
2.2 Contextualized Word Embeddings

The idea of modeling sentence- or context-level semantics together with word-level semantics proved to be a powerful innovation. For most downstream NLP tasks, CWEs drastically improved the performance of neural architectures compared to static word embeddings. However, the contextualization methodologies differ widely. We thus hypothesize that they also differ considerably in their ability to capture polysemy.

Like static word embeddings, CWEs are trained on large amounts of unlabeled data by some variant of language modeling. In our study, we investigate the three most prominent and widely applied approaches: Flair (Akbik et al., 2018), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2019).

Flair: For the contextualization provided in the Flair NLP framework, Akbik et al. (2018) take a static pre-trained word embedding vector, e.g. the GloVe word embeddings (Pennington et al., 2014), and concatenate to it two context vectors based on the left and right sentence context of the word. The context vectors are computed by two recurrent character language models, one trained from left to right and another from right to left. This approach has been applied successfully especially to sequence tagging tasks such as named entity recognition and part-of-speech tagging.
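A minimal sketch of this kind of stacked contextualization with the Flair library; the embedding identifiers ('glove', 'news-forward', 'news-backward') are illustrative defaults, not necessarily the exact configuration used in our experiments:

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# Static GloVe vectors concatenated with the states of a forward and a
# backward character language model: one contextualized vector per token.
stacked = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

sentence = Sentence("He sat down on the bank of the river .")
stacked.embed(sentence)

for token in sentence:
    # token.embedding holds the concatenation of all three components
    print(token.text, token.embedding.shape)
```

Because the two character language models read the whole sentence, the vector for an ambiguous token such as 'bank' differs between sentences, which is exactly the property we exploit for WSD.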
ELMo: Embeddings from language models (ELMo) (Peters et al., 2018) approaches contextualization similarly to Flair, but instead of two character language models, two stacked recurrent models over words are trained, again one from left to right and another from right to left. To obtain CWEs, the outputs of the embedding layer and of the two bidirectional recurrent layers are not concatenated, but collapsed into a single layer by a weighted, element-wise summation.
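As a schematic illustration (not tied to any particular ELMo implementation), the collapse of the layer outputs into a single CWE can be written as a softmax-weighted, element-wise sum; the layer shapes below are assumptions for the toy example:

```python
import torch

def collapse_layers(layer_outputs, scalar_weights, gamma=1.0):
    """Weighted element-wise summation over the embedding layer and the two
    recurrent layers: cwe = gamma * sum_j softmax(s)_j * h_j."""
    s = torch.softmax(scalar_weights, dim=0)      # normalize the per-layer scalars
    stacked = torch.stack(layer_outputs, dim=0)   # (num_layers, seq_len, dim)
    return gamma * (s[:, None, None] * stacked).sum(dim=0)

# toy example: 3 layers, 6 tokens, 1024 dimensions
layers = [torch.randn(6, 1024) for _ in range(3)]
weights = torch.zeros(3)                          # learned per task in the original model
print(collapse_layers(layers, weights).shape)     # torch.Size([6, 1024])
```

In contrast to Flair's concatenation, this summation preserves the dimensionality of a single layer.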
BERT: In contrast to the two previous approaches, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) does not rely on merging two uni-directional recurrent language models with a (static) word embedding, but provides contextualized token embeddings in an end-to-end language model architecture. For this, a self-attention-based transformer architecture is used which, in combination with a masked language modeling objective, allows the model to be trained while seeing all left and right contexts of a target word at the same time. Self-attention and the non-directionality of the language modeling task result in extraordinary performance gains compared to previous approaches.
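To illustrate the bidirectional, masked objective: given both contexts of a masked position, a pre-trained BERT model predicts plausible fillers. The snippet below uses the Hugging Face transformers library, which is merely one convenient way to query such a model and is not part of our experimental setup:

```python
from transformers import pipeline

# Fill-mask queries show that the prediction for [MASK] is conditioned on the
# full left and right context simultaneously.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for sent in [
    "He deposited the money at the [MASK].",
    "They had a picnic on the [MASK] of the river.",
]:
    top = unmasker(sent)[0]
    print(f"{sent} -> {top['token_str']} ({top['score']:.2f})")
```

The same contextual sensitivity carries over to the token embeddings we extract from the model's hidden layers in Section 3.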
According to the distributional hypothesis, if the same word regularly occurs in different, distinct contexts, we may assume polysemy of its meaning (Miller and Charles, 1991). Contextualized embeddings should be able to capture this property. In the following experiments, we investigate this hypothesis using the models introduced above.

                           SE-2 (Tr)  SE-2 (Te)  SE-3 (Tr)  SE-3 (Te)  S7-T7 (coarse)  S7-T17 (fine)   SemCor       WNGT
#sentences                     8,611      4,328      7,860      3,944             126            245   37,176    117,659
#CWEs                          8,742      4,385      9,280      4,520             455          6,118  230,558  1,126,459
#distinct words                  313        233         57         57             327          1,177   20,589    147,306
#senses                          783        620        285        260             371          3,054   33,732    206,941
avg #senses p. word             2.50       2.66       5.00       4.56            1.13           2.59     1.64       1.40
avg #CWEs p. word & sense      11.16       7.07      32.56      17.38            1.23           2.00     6.83       5.44
avg k′                          2.75          -       7.63          -               -              -     3.16       2.98

Table 1: Properties of our datasets. For the test sets (Te), we do not report k′ since they are not used as kNN training instances.

3 Nearest Neighbor Classification for WSD

We employ a rather simple approach to WSD using non-parametric nearest neighbor classification (kNN) to investigate the semantic capabilities of contextualized word embeddings. Compared to parametric classification approaches such as support vector machines or neural models, kNN has the advantage that we can directly inspect the training examples that lead to a certain classifier decision.

The kNN classification algorithm (Cover and Hart, 1967) assigns a label by a plurality vote of a sample's nearest labeled neighbors. In the simplest case, one-nearest neighbor, it predicts the label of the nearest training instance according to some defined distance metric. Although complex weighting schemes for kNN exist, we stick to the simple non-parametric version of the algorithm in order to better investigate the semantic properties of the different contextualized embedding approaches.

As distance measure for kNN, we rely on the cosine distance of the CWE vectors. Our approach considers only those senses for a target word that have been observed during training. We call this approach localized nearest neighbor word sense disambiguation. We use spaCy¹ (Honnibal and Johnson, 2015) for pre-processing and take the lemma of a word as the target word representation, e.g. 'danced', 'dances' and 'dancing' are all mapped to the lemma 'dance'. Since BERT uses wordpieces, i.e. subword units instead of entire words or lemmas, we re-tokenize the lemmatized sentence and average all wordpiece CWEs that belong to the target word. Moreover, for the experiments with BERT embeddings², we follow the heuristic of Devlin et al. (2019) and concatenate the averaged wordpiece vectors of the last four layers.

¹ https://spacy.io/
² We use the bert-large-uncased model.
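A minimal sketch of the wordpiece averaging and layer concatenation described above, assuming the Hugging Face transformers interface to bert-large-uncased (the toolkit, the example sentence and the helper name are illustrative, not a prescription of our exact setup):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", output_hidden_states=True)
model.eval()

def target_word_cwe(lemmatized_tokens, target_index):
    """Average the wordpiece vectors of the target word and concatenate the
    averaged vectors of the last four layers (illustrative helper)."""
    enc = tokenizer(lemmatized_tokens, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**enc).hidden_states      # embedding layer + 24 layers
    # wordpiece positions that belong to the target token
    pieces = [i for i, w in enumerate(enc.word_ids()) if w == target_index]
    averaged = [layer[0, pieces].mean(dim=0) for layer in hidden_states[-4:]]
    return torch.cat(averaged)                          # 4 x 1024 = 4,096 dimensions

vec = target_word_cwe(["they", "sit", "on", "the", "bank", "of", "the", "river"], 4)
print(vec.shape)                                        # torch.Size([4096])
```

Collecting such vectors for all sense-annotated occurrences of a lemma in the training data yields the instance sets used by the kNN classifier.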
We test different values for our single hyperparameter k ∈ {1, ..., 10, 50, 100, 500, 1000}. Like words in natural language, word senses follow a power-law distribution. Due to this, simple baseline approaches for WSD such as the most frequent sense (MFS) baseline are rather strong and hard to beat. Another effect of the skewed distribution is imbalanced training sets: many senses described in WordNet have only one or two example sentences in the training sets, or are not present at all. This is severely problematic for larger k and the default implementation of kNN, because the majority class then dominates the classification result. To deal with sense distribution imbalance, we modify the majority voting of kNN to k′ = min(k, |Vs|), where Vs is the set of CWEs belonging to the least frequent sense s of the target word in the training data.
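The complete decision rule can then be sketched as follows; the function name and sense labels are invented for illustration, and ties are simply resolved by Counter ordering rather than by any scheme from the paper:

```python
import numpy as np
from collections import Counter

def localized_knn_wsd(train_cwes, train_senses, test_cwe, k=10):
    """Disambiguate one occurrence of a target lemma: cosine-distance kNN over
    the training CWEs observed for that lemma, with k' = min(k, |Vs|)."""
    # |Vs|: number of training CWEs of the least frequent sense of this lemma
    k_prime = min(k, min(Counter(train_senses).values()))

    # cosine distance = 1 - cosine similarity
    train_unit = train_cwes / np.linalg.norm(train_cwes, axis=1, keepdims=True)
    test_unit = test_cwe / np.linalg.norm(test_cwe)
    distances = 1.0 - train_unit @ test_unit

    # majority vote over the k' nearest training instances
    nearest = np.argsort(distances)[:k_prime]
    return Counter(train_senses[i] for i in nearest).most_common(1)[0][0]

# toy usage with invented sense labels for the lemma "bank"
X = np.random.randn(5, 4096)
y = ["bank%1", "bank%1", "bank%1", "bank%2", "bank%2"]
print(localized_knn_wsd(X, y, np.random.randn(4096), k=10))
```

Because only senses observed for the target lemma enter the vote, this is the localized variant described above; senses unseen in training can never be predicted.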
4 Datasets

We conduct our experiments with the help of four standard WSD evaluation sets: two lexical sample tasks and two all-words tasks. As lexical sample tasks, SensEval-2 (Kilgarriff, 2001, SE-2) and SensEval-3 (Mihalcea et al., 2004, SE-3) each provide a training set and a test set. The all-words tasks of SemEval 2007 Task 7 (Navigli et al., 2007, S7-T7) and Task 17 (Pradhan et al., 2007, S7-T17) solely comprise test data, with a substantial overlap between their documents. The two sets differ in granularity: while ambiguous terms in Task 17 are annotated with one WordNet sense only, annotations in Task 7 are coarser clusters of highly similar WordNet senses. For training of the all-words tasks, we use a) the SemCor dataset (Miller et al., 1993) and b) the Princeton WordNet gloss corpus (WNGT) (Fellbaum, 1998) separately, to investigate the influence of different training sets on our approach. For all experiments, we utilize the suggested datasets as provided by the UFSAC
