Table 1: Properties of our datasets. For the test sets (Te), we do not report k′ since they are not used as kNN training instances.
3 Nearest Neighbor Classification for WSD

We employ a rather simple approach to WSD using non-parametric nearest neighbor classification (kNN) to investigate the semantic capabilities of contextualized word embeddings. Compared to parametric classification approaches such as support vector machines or neural models, kNN has the advantage that we can directly investigate the training examples that lead to a certain classifier decision.

The kNN classification algorithm (Cover and Hart, 1967) assigns to a sample the label chosen by a plurality vote among its nearest labeled neighbors. In the simplest case, one-nearest neighbor, it predicts the label of the nearest training instance according to some defined distance metric. Although complex weighting schemes for kNN exist, we stick to the simple non-parametric version of the algorithm to be able to better investigate the semantic properties of different contextualized embedding approaches.
As distance measure for kNN, we rely on cosine distance of the CWE vectors. Our approach considers only senses for a target word that have been observed during training. We call this approach localized nearest neighbor word sense disambiguation. We use spaCy [1] (Honnibal and Johnson, 2015) for pre-processing and the lemma of a word as the target word representation, e.g. ‘danced’, ‘dances’ and ‘dancing’ are mapped to the same lemma ‘dance’. Since BERT uses wordpieces, i.e. subword units of words instead of entire words or lemmas, we re-tokenize the lemmatized sentence and average all wordpiece CWEs that belong to the target word. Moreover, for the experiments with BERT embeddings [2], we follow the heuristic by Devlin et al. (2019) and concatenate the averaged wordpiece vectors of the last four layers.
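To make this extraction step concrete, the following minimal sketch assumes the Hugging Face transformers and spaCy Python APIs; the paper does not prescribe a particular implementation, and the function name target_word_cwe as well as its single-occurrence handling of the target lemma are illustrative choices, not part of the original method description.

import numpy as np
import spacy
import torch
from transformers import BertModel, BertTokenizerFast

nlp = spacy.load("en_core_web_sm")
tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", output_hidden_states=True)
model.eval()


def target_word_cwe(sentence: str, target_lemma: str) -> np.ndarray:
    """Return a CWE for the target lemma: wordpiece vectors from the last
    four BERT layers are concatenated and averaged over the target's pieces."""
    # Lemmatize the sentence so that e.g. 'danced', 'dances', 'dancing'
    # all map to the lemma 'dance'.
    lemmas = [tok.lemma_ for tok in nlp(sentence)]

    # Re-tokenize the lemmatized sentence into wordpieces, keeping track of
    # which wordpieces belong to which lemma.
    encoding = tokenizer(lemmas, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**encoding).hidden_states  # one tensor per layer

    # Concatenate the last four layers (heuristic of Devlin et al., 2019).
    last_four = torch.cat(hidden_states[-4:], dim=-1).squeeze(0)  # (seq_len, 4*1024)

    # Average all wordpiece vectors that belong to the target lemma
    # (assumes a single occurrence of the target lemma in the sentence).
    target_idx = lemmas.index(target_lemma)
    piece_positions = [i for i, w in enumerate(encoding.word_ids())
                       if w == target_idx]
    return last_four[piece_positions].mean(dim=0).numpy()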
We test different values for our single hyperparameter k ∈ {1, . . . , 10, 50, 100, 500, 1000}. Like words in natural language, word senses follow a power-law distribution. Due to this, simple baseline approaches for WSD such as the most frequent sense (MFS) baseline yield rather high scores and are hard to beat. Another effect of the skewed distribution is imbalanced training sets. Many senses described in WordNet have only one or two example sentences in the training sets, or are not present at all. This is severely problematic for larger k and the default implementation of kNN, because the majority class dominates the classification result. To deal with sense distribution imbalance, we modify the majority voting of kNN to k′ = min(k, |Vs|), where Vs is the set of CWEs with the least frequent training examples for a given word sense s.
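For example, if k = 10 but the least frequent sense of the target word has only two training CWEs, the vote is taken over the k′ = 2 nearest neighbors only. The localized voting can be sketched as follows, assuming NumPy arrays train_cwes and train_senses that hold only the training CWEs and sense labels observed for the target lemma; these names and the helper functions are illustrative, not taken from the paper.

from collections import Counter
import numpy as np


def cosine_distances(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine distance between one query CWE and a matrix of candidate CWEs."""
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query) + 1e-12
    )
    return 1.0 - sims


def disambiguate(query_cwe, train_cwes, train_senses, k=10):
    """Localized kNN WSD over the training CWEs of a single target lemma."""
    # Cap k at the size of the least frequent sense class, k' = min(k, |V_s|),
    # so that a large k cannot let the majority sense drown out rare senses.
    sense_counts = Counter(train_senses)
    k_prime = min(k, min(sense_counts.values()))

    # Plurality vote over the k' nearest training CWEs by cosine distance.
    dists = cosine_distances(query_cwe, train_cwes)
    nearest = np.argsort(dists)[:k_prime]
    votes = Counter(train_senses[i] for i in nearest)
    return votes.most_common(1)[0][0]

Because train_cwes contains only training occurrences of the target lemma, the classifier can only return senses observed during training, which matches the localized setup described above.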
4 Datasets

We conduct our experiments with the help of four standard WSD evaluation sets, two lexical sample tasks and two all-words tasks. As lexical sample tasks, SensEval-2 (Kilgarriff, 2001, SE-2) and SensEval-3 (Mihalcea et al., 2004, SE-3) each provide a training set and a test set. The all-words tasks of SemEval 2007 Task 7 (Navigli et al., 2007, S7-T7) and Task 17 (Pradhan et al., 2007, S7-T17) solely comprise test data, both with a substantial overlap of their documents. The two sets differ in granularity: while ambiguous terms in Task 17 are annotated with one WordNet sense only, annotations in Task 7 are coarser clusters of highly similar WordNet senses. For training of the all-words tasks, we use a) the SemCor dataset (Miller et al., 1993), and b) the Princeton WordNet gloss corpus (WNGT) (Fellbaum, 1998) separately to investigate the influence of different training sets on our approach. For all experiments, we utilize the suggested datasets as provided by the UFSAC

[1] https://spacy.io/
[2] We use the bert-large-uncased model.