
Word Embeddings to Classify Types of Diachronic Semantic Shift
Dylan Koldenhof
University of Twente
P.O. Box 217, 7500AE Enschede
The Netherlands
d.koldenhof@student.utwente.nl

ABSTRACT
Languages are constantly evolving in many ways. One of these ways is semantics, the meaning of words. This presents an interesting challenge for automated Natural Language Processing (NLP), as a thorough manual inspection of this phenomenon is difficult. Much work has already shown promising results in detecting semantic shift, but little work in the field investigates the nature of these shifts. This research aims to fill this gap by investigating whether different types of semantic shift can be classified using word embeddings trained with Word2Vec. Different machine learning classifiers are trained on embeddings which are themselves trained on Project Gutenberg ebooks spanning the period 1800-1849, and on embeddings trained on Wikipedia. Results show promise, but with a top accuracy of 0.5 when validated on another time period, there is room for improvement in future work.

Keywords
word embedding, diachronic semantic shift, language evolution, semantic change, word2vec, computational linguistics, natural language processing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
35th Twente Student Conference on IT, July 2nd, 2021, Enschede, The Netherlands.
Copyright 2021, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

1. INTRODUCTION
The phenomenon of words changing meaning over time becomes rather obvious when looking at any text published some centuries ago. Many of the words used will sound completely out of place with their modern meaning. Hence, this phenomenon has been noticed and researched for quite some time [15, 14, 4], but has seen new light with the advent of Natural Language Processing (NLP) tools, which allow for new ways to detect and analyze it. The phenomenon goes by many names in the literature, but the usual keywords are diachronic (change across time) and semantic (meaning of words). The full term used in this paper is therefore “diachronic semantic shift”, shortened to “semantic shift” or simply “shift”.

Generally, diachronic semantic shift is incremental and happens over the span of multiple generations. An example often cited in the literature is the word “gay”, which over the past century slowly went from meaning happy to nowadays being almost exclusively associated with homosexuality. Another, somewhat older example is the word “car”, which initially referred to a horse-drawn vehicle and later came to mean automobile. Looking back far enough, these changes can add up to the point where some sentences become incomprehensible to a reader who only knows the modern meanings. Due to their incremental nature, comprehending the exact process and nature behind these changes can get very complicated. This means a lot of content spanning long periods needs to be parsed in order to make complete sense of every semantic shift.

Thus, advances in NLP have sparked interest in applying these techniques to semantic shifts. With them, large volumes of text (corpora) can be analyzed much more quickly than any human could, with promising results in detecting shifts [17, 7, 5].

The most popular means of achieving this in current research is by comparing what are known as word embeddings. Word embeddings are vector representations of words, derived from the contexts in which each word appears. The idea behind them is that semantics can be revealed based on the context of a word, and thus the vectors represent a word’s meaning. Consequently, word embeddings trained on different time periods can also reveal a change in word meaning.

Going back to the initial linguistic research on semantic shift, many linguists such as Bloomfield [3] developed categorizations of semantic shift.

So far, research on semantic shift using NLP does not seem to have covered these types of categorizations, whether using word embeddings or any other approach. An automatic categorization could provide interesting insights for historical linguistics, hence the goal of this research is to provide an exploration into this topic. Specifically, the main research question that will be asked is:

RQ0: Can types of diachronic semantic shift be reliably classified automatically?

This leads to three sub-questions:

• RQ1: What classification methods provide the best classification results?
• RQ2: What types can be most clearly classified?
• RQ3: Do the classifiers yield consistent results across corpora?

These research questions will be answered using word embeddings trained with the SGNS (Skip-Gram with Negative Sampling) method, using the Word2Vec implementation of the Python Gensim library (https://radimrehurek.com/gensim/models/word2vec.html). For the training and
initial testing of the classifiers, two corpora will be used. The first consists of Project Gutenberg (https://www.gutenberg.org/) ebooks spanning approximately the first half of the 19th century, while the second consists of recent Wikipedia articles. Two versions of each corpus are used: one uses the words as they appear in the texts, except lowercased and with punctuation removed, while the other contains lemmatized and part-of-speech-tagged tokens. These two approaches will henceforth be referred to as “raw” and “lemma”.

After the models are trained on the corpora, they are aligned using the Orthogonal Procrustes (OP) method, after which the vectors for the same word across the different time periods can be compared. From the words that have shifted most significantly on the lemmatized models, an annotated list is made which will be used to categorize shifts using different supervised machine learning classifiers, with the differences between the vectors as features. Cross-validation will be performed on the annotated list, with the accuracy metrics used for an initial evaluation on both the raw and lemma embeddings.

To determine the generalizability of these classifiers, two different corpora will be used for validation, covering the time periods 1750-1799 and 1900-1949. Both of these are composed of Project Gutenberg texts and will be aligned the same way as the training corpora. The classifiers will then be used on another annotated list composed in the same manner, with the classifier predicting the categories. The performance across these corpora and the training corpora can then be compared, and the metrics of the classifiers that show the best results on both corpus pairs will be used to answer the main research question.

2. RELATED WORK
As mentioned before, Bloomfield [3] proposes a categorization of semantic shifts, which is popular and used as a basis for more recent studies [6].

Bloomfield initially proposed nine shifts, which are as follows:

• Narrowing: A meaning going from a wider to a narrower scope. An example is the word “hound”, which originally was the word for dog in general and later evolved to mean hunting dog specifically.
• Widening: Opposite of narrowing. The word “dog” itself is an example, as it used to refer to a specific sort of dog.
• Metaphor: A word used as a metaphor becoming its primary meaning, e.g. “broadcast”, which originally referred to scattering seed.
• Metonymy: A meaning relating to a place or time shifting to another nearby place or time, e.g. the word “cheek”, which had the meaning of jaw in older English.
• Synecdoche: Shift from a part to whole, or vice versa. An example is the word “commercial” in the sense of advertisement. Being commercial is just one trait of an advertisement, but the word shifted to denote it entirely.
• Hyperbole: A stronger meaning shifting into a weaker one. The word “quite” is an example, which used to mean “completely” or “wholly”.
• Litotes: Opposite of hyperbole, e.g. the word “kill”, which in its original Germanic root had meanings in the sense of tormenting or vexing.
• Degeneration: A meaning shifting into a lower status, for instance the word “silly”, which has gone through many meanings, starting in the sense of happy or blessed, then weak, and finally the meaning it has now.
• Elevation: Opposite of degeneration. The word “knight” went from a low servant to someone exalted.

Although there were some earlier methods, the advent of neural word embedding algorithms led to a large increase in research into diachronic semantic shift [12]. The first popular implementation of this method is the Word2Vec framework proposed by Mikolov et al. [10]. This is a neural network model that offers two different training methods: Continuous Bag of Words (CBOW) and Skip-Gram. CBOW is generally more appropriate for small corpora while Skip-Gram is more suited for large ones [12], and thus Skip-Gram will be used for this research.

The Word2Vec Skip-Gram model works by sliding a window of a given size across the input text. For every word, pairs of the word and another word within the window are formed. These pairs then form samples for training a neural network with a single hidden layer. The task of the network is, given an input word, to evaluate the probability of each word in the vocabulary being in the window of the input word. This task is not itself the goal of the method; rather, the weights of the neurons in the hidden layer after training make up the vector of the input word [8].

Along with this, Negative Sampling is used in the training process. Normally, for each training pair, the weights for all words in the vocabulary are updated, with the goal that the input word assigns probability 1 to the other word in the pair and probability 0 to all words outside the pair. With a large vocabulary this means a lot of unnecessary computation, since most words do not share much context, and thus most weight updates will be rather insignificant. With Negative Sampling, only a limited number of “negative” words (random words from the vocabulary) are chosen for every training sample, and weights are updated only for these words and the word in the pair [9]. Skip-Gram with Negative Sampling (SGNS) was generally deemed most effective by extensive evaluation studies [18, 19].

In order to identify diachronic semantic shift, however, there needs to be a way of comparing these word embeddings across time. There have been two approaches to this: static and dynamic embeddings.

The static approach relies on separating a corpus into different time slices and training embeddings for each of these slices independently. The problem with this is that the embeddings cannot be compared directly. Due to the stochastic nature of the model, embeddings trained on different slices will not be directly comparable; only their relative distances (provided meanings stay similar) will be [2, 12] (see Figure 1).

One way to compensate for this is to align the different vector spaces. The best-performing alignment method was found [18] to be the Orthogonal Procrustes (OP) method proposed by Hamilton et al. [5].

Figure 1. Illustrating what happens when word embeddings are trained on two different time periods and not aligned, using as an example the word gay (plotted alongside homosexual, LGBT, nice and happy). Relative distances are roughly the same between unchanged words, but their absolute positions in the vector space are completely different and hence they cannot be compared directly. Based on figures by Bianchi et al. [2].

The other approach is dynamic embeddings. The key difference between these and static embeddings is that different time slices are not modeled independently of one another, thus word embeddings at a given time slice are based on their position at previous ones. Taking this into account yields more directed and smoother shifts in some cases, but a disadvantage is that time slices flow into one another too much [12]. Some examples of dynamic embeddings are those proposed by Bamler and Mandt [1] and Rudolph and Blei [16].

There have been some papers inquiring into the nature of semantic change, but none with categorizations like the one proposed by Bloomfield [3]. Mitra et al. [11] use a different approach from word embeddings to derive sense clusters. The shifts observed in these sense clusters are then assigned four categories: split, join, birth and death. Hamilton et al. [5] derived statistical laws of shifts, such as that frequent words shift at slower rates.

3. METHODOLOGY
3.1 Corpora and Preprocessing
Project Gutenberg texts from approximately 1800-1849 were chosen for the older corpus. As Gutenberg metadata does not list the publishing date, nor any other means of retrieving it easily (for example, an ISBN), the average of the author’s birth and death years, which are in Gutenberg’s metadata, was used to estimate the publishing date. This yielded 4536 texts, consisting of around 341.8M words.

Due to the large number of proper nouns that would otherwise dominate the most changed words, along with some words having only changed in one function of the word (for instance, the verb “to power” relates to power in the physical sense, while before it had the sense of might; see https://www.etymonline.com/word/power), it was decided to also prepare a lemmatized and part-of-speech (POS)-tagged version of the corpora. Lemmatization means words are reduced to their simplest grammatical forms, such as plural nouns being turned into their singular forms and verbs into the first person present. POS-tagging consists of marking the part of speech, such as verb, noun, and adjective, for a given word. For the corpora used, all words are processed into strings consisting of the lemmatized form and the POS tag, separated by an underscore. The lemma forms were retrieved using the Python SpaCy library (https://spacy.io/).

For the raw Wikipedia corpus, the archive from May 2021 was used, only including articles with more than 5000 words. This consists of around 1.63B words. Due to the long processing time, the lemma Wikipedia corpus is from a slightly older archive from 2017, for which pre-trained Word2Vec embeddings were used (from http://vectors.nlpl.eu/repository/).

For validating the generalizability of the classifier and answering RQ3, two different corpora are used, spanning the time periods 1750-1799 and 1900-1949. These are both retrieved from Gutenberg and preprocessed in the same manner as the 1800-1849 corpus. The 1750-1799 corpus consists of 765 texts containing 66.4M words, while the 1900-1949 corpus consists of 10838 texts containing 646M words.

3.2 Training the model
Embeddings are subsequently trained on the corpora with Gensim’s Word2Vec model, which includes an implementation of Skip-Gram with Negative Sampling (SGNS). The exception is the lemma Wikipedia corpus, for which pre-trained embeddings were used; these were also trained with SGNS. An overview of all parameters used can be found in Appendix A.

3.3 Comparing embeddings
The alignment of the different models was done using the Orthogonal Procrustes (OP) method. To illustrate further, the Orthogonal Procrustes problem is, given two matrices A and B, to find the orthogonal matrix most closely mapping A to B. This mapping can be used on the two embedding spaces to map the older period to the space of the newer one. Specifically, when $W_t \in \mathbb{R}^{|V| \times d}$, where $d$ is the number of dimensions in an embedding and $V$ the shared vocabulary of the models, is a matrix of all word embeddings at period $t$, the following equation needs to be solved:

$$R_t = \arg\min_{Q} \lVert W_t Q - W_{t+1} \rVert_F \qquad (1)$$

with the constraint that $Q^\top Q = I$ (orthogonality). The resulting transformation $R_t \in \mathbb{R}^{d \times d}$ can then be applied to $W_t$ to map it to the space of $W_{t+1}$, allowing the spaces to be compared [5]. The assumption of this method is that most words retain their meaning, because the alignment seeks to minimize the distance between the periods. Shifts are detected because they are the outliers that do not fit with the alignment.

Before this, however, the embedding matrix was mean-centered along columns, i.e. the means of all embedding dimensions are zero. Then the vectors were L2 (Euclidean)-normalized. These steps improve the performance of the alignment method [18].

The cosine distance (CD) between embeddings is then used to determine the magnitude of the detected shift.

3.4 Classification
From words that were above a threshold of 0.75 CD in the lemma models, words were selected that could reasonably be manually classified. The reason the lemma models were used is that proper nouns, which dominate the words above the threshold in the raw corpora, could be filtered out. The selection process generally consisted of inspecting a random selection of 250 words above 0.75 CD and then verifying candidates and determining a category for them using the Online Etymology Dictionary (https://www.etymonline.com/).

This process yielded a list of 88 words, annotated with three different categories. The full list can be found in
Appendix B.1. The categories are a reduction of Bloomfield’s categories as described in Section 2; this was done because of the small size of the list and because many of Bloomfield’s categories were quite rare in the selection process. The categories are as follows:

• Scope change (narrowing and widening)
• Metaphor and metonymy
• Other types/cannot be determined

For each word in this list, the difference between the vectors of the word from the different time periods in the aligned embedding space is computed. The elements of this delta vector are then used as features for training different machine learning classifiers, with the categories functioning as labels.

Before training the classifiers, the features are scaled to have zero mean and unit variance. This reduces the effect of outliers, and potentially yields training data that generalizes better across corpora.

To validate the performance of the models, another annotated list is used, created in the same manner as the first. The difference is that this list uses vectors from the 1750-1799 and 1900-1949 corpora, and only contains 22 words. It can be found in Appendix B.2. The best performing classifiers on the 1800-1849/wiki pair will be tested on this list in order to determine whether they still perform similarly. The best performing classifier on both of these can then be concluded to be the most reliable for categorizing semantic shift on Word2Vec embeddings.

4. EVALUATION
4.1 First results on training corpora
The best classifiers are determined based on average prediction accuracy across a Leave-One-Out (LOO) cross-validation (CV) on the list. In LOO-CV a model is trained on all but one of the elements in the data set, with the remaining single element used for testing. This training is then repeated with a different test element every time, until all elements in the data have been tested. Given the small size of the list, this allows the classifier to be trained on as much data as possible, as opposed to other forms of CV, such as 10-fold, where the data is split into 10 parts and each part in turn is held out as the test set.

The results of this first step can be found in Appendix C. The classifiers are referred to by their class names in the Python scikit-learn library (https://scikit-learn.org/stable/index.html), which is used for all of them.

What can be noted first is that these initial accuracy numbers are quite low. Crucially, however, some outperform random or majority guessing. In Table 1 the distribution of the labels is shown. As a benchmark, the accuracy of a ZeroR classifier, which always predicts the label that forms the majority of the training data, would be 33/(28 + 27 + 33) = 0.375.

Label                    Count
Scope (0)                28
Metaphor/metonymy (1)    33
Other (2)                27

Table 1. Distribution of labels

Furthermore, the differences between the raw and lemma models appear to be fairly small except for two classifiers: Random Forest, which performs much better on the raw models, and Stochastic Gradient Descent (SGD) with log loss, which is much better on the lemma models. Further analysis of the parameters showed that these differences are only due to the random elements of these classifiers, which give them vastly different results on every run. This makes them not very generalizable, and thus it is more beneficial to look at methods with less randomness. Excluding SGD and Random Forest, the best classifiers for raw and lemma models are shown in Table 2.

Classifier              Accuracy         Tokens
BernoulliNB             0.443 ± 0.163    raw
KNeighborsClassifier    0.397 ± 0.151    raw
SVC                     0.386 ± 0.053    raw
SVC                     0.432 ± 0.097    lemma
LogisticRegression      0.432 ± 0.125    lemma
GaussianNB              0.420 ± 0.138    lemma

Table 2. Best classifiers for raw and lemma models.

4.2 SVC grid-search
With most of these classifiers there is little room for improvement, except for the Support Vector Classifier (SVC), since it allows a lot of parameter tweaking. Thus a grid-search was performed on this classifier, testing many combinations of parameters to determine the best combination. The best results for both corpora are shown in Table 3.

Kernel     Parameters                 Accuracy         Tokens
Sigmoid    C = 100, gamma = 0.1       0.477 ± 0.125    raw
Sigmoid    C = 1000, gamma = 0.1      0.477 ± 0.136    raw
Sigmoid    C = 10000, gamma = 0.1     0.477 ± 0.119    raw
Sigmoid    C = 10, gamma = 0.3        0.614 ± 0.146    lemma
Sigmoid    C = 1, gamma = 0.4         0.614 ± 0.150    lemma
Sigmoid    C = 10, gamma = 0.9        0.591 ± 0.141    lemma

Table 3. Top 3 results from parameter tweaking on raw and lemma models. Here and in subsequent tables, parameters not given are the defaults used in scikit-learn, and the values after ± represent the standard deviation as evaluated from a 10-fold cross-validation.

The sigmoid kernel is the only one that showed up in the top 3 for both the raw and lemma models, so it is clearly best. However, the best parameters for the lemma and raw models are different, and on the lemma models the accuracy goes up to much higher values than on the raw models. The confusion matrix of the classifier with the best accuracy is shown in Figure 2.

4.3 Validation performance
However, although the sigmoid classifier seems ideal, it performs poorly on the validation set, as can be seen in Table 4.
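The comparison behind Table 4 can be illustrated with a minimal sketch. This is not the code used in this study: the delta vectors are replaced by random data (only the label counts of Table 1 and the list sizes of 88 and 22 words are reused), the feature dimension of 20 is arbitrary, and the helper names `cv_and_validation_accuracy` and `zero_r_accuracy` are hypothetical. Putting the scaler inside a scikit-learn pipeline, so that it is refit on each fold and reused for the validation list, is likewise an assumption about the setup.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def cv_and_validation_accuracy(X_train, y_train, X_val, y_val, C=10.0, gamma=0.3):
    """Return (LOO-CV accuracy on the training pair, accuracy on the validation pair)."""
    # Assumption: scaling is part of the pipeline, so every CV fold is scaled
    # using only the samples it is trained on.
    model = make_pipeline(StandardScaler(), SVC(kernel="sigmoid", C=C, gamma=gamma))
    # Each Leave-One-Out fold scores 0 or 1; the mean is the overall accuracy.
    cv_scores = cross_val_score(model, X_train, y_train, cv=LeaveOneOut())
    # Refit on the full training list before scoring the held-out validation list.
    model.fit(X_train, y_train)
    return float(cv_scores.mean()), float(model.score(X_val, y_val))


def zero_r_accuracy(y):
    """Accuracy obtained by always predicting the majority label (ZeroR)."""
    counts = np.bincount(y)
    return counts.max() / counts.sum()


# Synthetic stand-ins for the delta vectors (random, so accuracies are meaningless).
rng = np.random.default_rng(0)
y_train = np.repeat([0, 1, 2], [28, 33, 27])  # label counts from Table 1
X_train = rng.normal(size=(88, 20))
y_val = rng.integers(0, 3, size=22)
X_val = rng.normal(size=(22, 20))

cv_acc, val_acc = cv_and_validation_accuracy(X_train, y_train, X_val, y_val)
print(f"ZeroR: {zero_r_accuracy(y_train):.3f}  CV: {cv_acc:.3f}  validation: {val_acc:.3f}")
```

With real delta vectors in place of the random arrays, a large gap between the two returned accuracies would correspond to the kind of drop between cross-validation and validation reported for the sigmoid kernel.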
Figure 2. Confusion matrix for the cross-validated lemma models on sigmoid SVC with C = 10 and γ = 0.3. The values are the proportion of the true label that is predicted for a given label. Labels are as given in Table 1.

Kernel     Parameters              Acc. CV          Acc. val.
Sigmoid    C = 10, gamma = 0.3     0.614 ± 0.146    0.23
Sigmoid    C = 1, gamma = 0.4      0.614 ± 0.150    0.23
Sigmoid    C = 10, gamma = 0.9     0.591 ± 0.141    0.27

Table 4. Sigmoid kernel compared on the cross-validated training model pair and the validation pair.

Since this classifier is not able to categorize semantic shift on different models, it is rather useless for general purposes. Lower values for gamma, which also performed decently on the lemma model, yield more positive results on the validation set. Along with this, the other kernels that performed better than zeroR were also tested on the other pair. The best results from this evaluation can be seen in Table 5.

Kernel     Parameters                     Acc. CV          Acc. val.
Sigmoid    C = 1, gamma = 0.1             0.500 ± 0.166    0.36
Sigmoid    C = 1, gamma = 0.05            0.375 ± 0.128    0.41
Sigmoid    C = 1, gamma = 0.01            0.352 ± 0.091    0.50
Linear     C = 1                          0.432 ± 0.097    0.41
RBF        C = 100, gamma = 0.001         0.420 ± 0.086    0.36
RBF        C = 100, gamma = 0.0001        0.432 ± 0.074    0.41
RBF        C = 10000, gamma = 0.00001     0.432 ± 0.097    0.41

Table 5. Comparison of kernels across both model pairs

Though a sigmoid kernel shows the highest performance on the validation model pair, its accuracy on the training/CV pair is poor. Hence the most generalizable classifiers seem to be either the RBF or linear kernels, with near equal results, including equal confusion matrices. However, the RBF kernel with C = 100 and γ = 0.0001 is slightly better on the raw models than the linear kernel, and has a slightly lower standard deviation, so this can be seen as the best classifier. The confusion matrices using the RBF kernel with C = 100 and γ = 0.0001 are shown in Figure 3. It can be seen that despite the similar accuracy, the classification results are still vastly different, with category 1 being predicted well on the training pair, but not at all on the validation pair. The other classifiers that performed well on the training/CV pair showed much worse performance than the SVCs on the validation pair, so these are not covered further.

Figure 3. The confusion matrices for the SVC with RBF kernel, C = 100 and γ = 0.0001. On the left the performance on the training/CV pair and on the right the validation pair.

4.4 Aligning the pairs
The differing confusion matrices in Figure 3 suggested that the delta vectors of the two pairs were not aligned and thus not comparable, similarly to two embeddings trained on different time periods. Thus it was thought that an alignment of the two spaces of delta vectors, in the same manner as the alignment of the embeddings, could yield better results. Unlike with aligning the embeddings, a shared vocabulary is not necessary afterwards, because there is no need to subtract the aligned vectors for the same word. While the transformation matrix is based on a shared vocabulary, this transformation can be applied to the entire space, including words not in the other pair. All the classifiers in Tables 4 and 5 were again validated with this new alignment and the results are shown in Table 6, along with the best result found by parameter tweaking at the top.

Kernel     Parameters                     Acc. al.    Acc. no al.    Acc. CV
Sigmoid    C = 5, gamma = 0.8             0.50        0.36           0.557 ± 0.109
Sigmoid    C = 10, gamma = 0.3            0.41        0.23           0.614 ± 0.146
Sigmoid    C = 1, gamma = 0.4             0.36        0.23           0.614 ± 0.150
Sigmoid    C = 10, gamma = 0.9            0.45        0.27           0.591 ± 0.141
Sigmoid    C = 1, gamma = 0.1             0.23        0.36           0.500 ± 0.166
Sigmoid    C = 1, gamma = 0.05            0.32        0.41           0.375 ± 0.128
Sigmoid    C = 1, gamma = 0.01            0.36        0.50           0.352 ± 0.091
Linear     C = 1                          0.36        0.41           0.432 ± 0.097
RBF        C = 100, gamma = 0.001         0.36        0.36           0.420 ± 0.086
RBF        C = 100, gamma = 0.0001        0.36        0.41           0.432 ± 0.074
RBF        C = 10000, gamma = 0.00001     0.36        0.41           0.432 ± 0.097

Table 6. Comparison of kernels across both model pairs with alignment.

As can be seen in the table, the best result, with a sigmoid kernel, C = 5 and γ = 0.8, performs fairly well on both CV and the aligned deltas, with a substantial improvement coming from the alignment. With some other classifiers the opposite effect is shown, and the RBF kernel performs better unaligned, perhaps as a result of the classifier only being effective on the particular difference existing between the pairs before alignment. The confusion matrices of the classifier for the cross-validation and the aligned validation pair on the best result are shown in Figure 4.

Figure 4. The confusion matrices for the SVC with sigmoid kernel, C = 5 and γ = 0.8. On the left the performance on the training/CV pair and on the right the aligned validation pair.

From the confusion matrices it can be seen that the classification is a lot more consistent across the pairs compared to the RBF kernel (Figure 3). The main difference here is the worse performance in predicting label 1, but otherwise the performance is very similar across both.

5. DISCUSSION
In the end, the results suggest that some accuracy can be found in classifying semantic shift across different corpora, after aligning them. Though this is far from perfect, an accuracy of 0.5 or higher for both pairs is clearly better than the zeroR benchmark of 0.38, and it goes above this threshold in predicting every label, as can be seen in Figure 4. However, the aligned validation pair still shows poorer performance than the cross-validation, and the best performance on cross-validation does not necessarily reflect the best performance on the validation pair. This is most likely the result of, firstly, the small size of the training and test word lists, and secondly, the limitations of the alignment, which is ultimately susceptible to some inaccuracy, as it relies on the shared vocabulary between the pairs. Many words could have changed differently from 1750 to 1949 than from 1800 to the present. Furthermore, on all confusion matrices it is clear that classifying label 2 (other) is less accurate. This is most likely a result of it being a grouping of different categories, so there can be many differences within it. Label 1 (metaphor/metonymy) is only more inaccurate on the validation pair, which might just be due to the small size of the validation word list.

6. CONCLUSIONS
Overall, RQ0 has to be answered somewhat inconclusively, but with promising directions for future research. An accuracy of 0.6 is possible on the Gutenberg 1800-1849 and Wikipedia corpora, but classification is somewhat less effective across different corpora, and even an accuracy of 0.6 still cannot be considered that reliable.

The answer to RQ1 seems quite clearly to be a form of Support Vector Classifier, which is the only classifier that could be tweaked to yield an accuracy well above the zeroR benchmark consistently on both raw and lemma tokens as well as on the validation pair.

RQ2 can be answered in terms of the reduction of Bloomfield’s types that was performed (scope, metaphor/metonymy and other), but this reduction was mostly done due to the imbalance in the annotated list. Within these types, the best result on the training pair seemed to perform equally well on scope change and metaphor/metonymy (see Figure 2), with some reduced performance in the other category. On the validation pair scope change performs as well as on the training pair, so it could be said that this type can be predicted most easily.

Finally, RQ3 also has to be answered somewhat inconclusively. With the right parameters some classifiers show decent results after alignment, but not as good as the cross-validation shows (Table 6).

There were quite some limitations in this research, so there is much room for future work here. Firstly, Gutenberg and Wikipedia have texts from quite different domains, and many detected shifts were actually noise as a result of this. For example, the word “bye” as a shortening of “goodbye”, which would only appear in a conversational text, rarely appears on Wikipedia, but its sports sense does, which is rare in the books. Thus this was detected as a shift despite both senses having existed for a long time. There is a balanced corpus of English text spanning the years 1810-2009, known as COHA (available from https://www.corpusdata.org/), but this is sold for a rather high price and was thus not available to me.

Furthermore, this also ties into the small size of the annotated list, which was time-consuming to create due to the many words that only seemed to be detected as shifts because of the differences in corpora. It was also complicated to classify many words, since they seemed to involve aspects from multiple categories or simply not enough information could be found on them. I am also not an expert on linguistics by any means, so having a group of linguists come to a consensus over a bigger list could potentially greatly improve results.

Another method that has not been attempted in this research is to use a dynamic embedding method instead of alignment, with more time periods. This could give a more detailed look into shifts across time: a word that has undergone multiple shifts could be separated into its different shifts, because the gradual process is explicit in the dynamic embeddings.

And finally, there is the method of Word2Vec itself. In recent years a different type of embedding method has emerged, resulting in contextualized embeddings. These methods produce different embeddings for every context a word appears in, and by grouping these embeddings together the different senses that a word has can be represented as different vectors. This could make the detected semantic shift more precise for polysemous (having multiple meanings) words, as the vector would be free from the influence of other senses that have not undergone a semantic shift. However, this method is not perfect in detecting polysemy [13] and not always better than non-contextual methods [19].

7. ACKNOWLEDGEMENTS
I would like to thank my supervisor, Shenghui Wang, for task 1: Unsupervised lexical semantic change
giving me important directions and advice for this re- detection. ArXiv, abs/2007.11464, 2020.
search, and the Intelligent Interaction track chair, for the
useful info and feedback sessions.

8. REFERENCES
[1] R. Bamler and S. Mandt. Dynamic word
embeddings. In ICML, 2017.
[2] F. Bianchi, V. Di Carlo, P. Nicoli, and
M. Palmonari. Compass-aligned distributional
embeddings for studying semantic differences across
corpora. ArXiv, abs/2004.06519, 2020.
[3] L. Bloomfield. Language. George Allen & Unwin,
1933.
[4] A. Darmesteter. La vie des mots. Delagrave, 1887.
[5] W. L. Hamilton, J. Leskovec, and D. Jurafsky.
Diachronic word embeddings reveal statistical laws
of semantic change. CoRR, abs/1605.09096, 2016.
[6] A. Kutuzov, L. Øvrelid, T. Szymanski, and
E. Velldal. Diachronic word embeddings and
semantic shifts: a survey. In COLING, 2018.
[7] M. Martinc, P. K. Novak, and S. Pollak. Leveraging
contextual embeddings for detecting diachronic
semantic shift. ArXiv, abs/1912.01072, 2020.
[8] C. McCormick. Word2vec tutorial - the skip-gram
model. http://mccormickml.com/2016/04/19/
word2vec-tutorial-the-skip-gram-model/, Apr.
2016. Accessed: 27-06-2021.
[9] C. McCormick. Word2vec tutorial part 2 - negative
sampling. https://mccormickml.com/2017/01/11/
word2vec-tutorial-part-2-negative-sampling/,
Jan. 2017. Accessed: 27-06-2021.
[10] T. Mikolov, K. Chen, G. Corrado, and J. Dean.
Efficient estimation of word representations in
vector space. In ICLR, 2013.
[11] S. Mitra, R. Mitra, M. Riedl, C. Biemann,
A. Mukherjee, and P. Goyal. That’s sick dude!:
Automatic identification of word sense change across
different timescales. In ACL, 2014.
[12] S. Montariol. Models of diachronic semantic change
using word embeddings. Theses, Université
Paris-Saclay, Feb. 2021.
[13] S. Montariol, E. Zosa, M. Martinc, and
L. Pivovarova. Capturing evolution in word usage:
Just add more clusters? Companion Proceedings of
the Web Conference 2020, 2020.
[14] H. Paul. Prinzipien der Sprachgeschichte. Niemeyer,
1880.
[15] K. Reisig. Professor K. Reisig’s Vorlesungen über
lateinische Sprachwissenschaft. Lehnold, 1839.
[16] M. Rudolph and D. Blei. Dynamic embeddings for
language evolution. In Proceedings of the 2018
World Wide Web Conference, WWW ’18, pages
1003–1011, Republic and Canton of Geneva, CHE,
2018. International World Wide Web Conferences
Steering Committee.
[17] E. Sagi, S. Kaufmann, and B. Clark. Tracing
semantic change with latent semantic analysis.
Current Methods in Historical Semantics, pages
161–183, Dec. 2011.
[18] D. Schlechtweg, A. Hätty, M. D. Tredici, and S. S.
im Walde. A wind of change: Detecting and
evaluating lexical semantic change across times and
domains. In ACL, 2019.
[19] D. Schlechtweg, B. McGillivray, S. Hengchen,
H. Dubossarsky, and N. Tahmasebi. Semeval-2020

7
APPENDIX
A. PARAMETERS WORD2VEC

Corpus              Parameters (a)
1800-1849 (raw)     vector_size = 300, window = 5, min_count = 50, sg = 1
1800-1849 (lemma)   vector_size = 300, window = 5, min_count = 100, sg = 1
Wikipedia (raw)     vector_size = 300, window = 5, min_count = 200, sg = 1
Wikipedia (lemma)   vector_size = 300, window = 5, sg = 1 (b)
1750-1799           vector_size = 300, window = 5, min_count = 30, sg = 1
1900-1949           vector_size = 300, window = 5, min_count = 100, sg = 1
(a) Parameters not given are the default in Gensim.
(b) Other parameters unknown (pretrained).

B. WORD LIST
B.1 Training list (1800-1849 - today)

Word              Label  Word              Label  Word              Label
brand NOUN        1      home VERB         1      guy NOUN          0
closet VERB       0      bumper NOUN       2      destroyer NOUN    1
trans ADJ         0      operative NOUN    0      squad NOUN        1
untitled ADJ      0      chair VERB        1      moonshine NOUN    2
party VERB        0      fag NOUN          2      album NOUN        2
ejaculate VERB    2      cartel NOUN       2      retarded ADJ      0
concourse NOUN    0      shaver NOUN       2      hectic ADJ        0
unseasonably ADV  0      jade NOUN         2      quite ADJ         2
stint NOUN        1      chum NOUN         2      clinch VERB       1
urge NOUN         1      unused ADJ        2      assist NOUN       0
task VERB         1      famously ADV      2      sedate VERB       0
portmanteau NOUN  1      merchandise VERB  1      campaign VERB     1
installation NOUN 1      tally NOUN        0      devious ADJ       1
annexed ADJ       0      intriguing ADJ    2      franchise NOUN    2
peak VERB         1      lap NOUN          2      milestone NOUN    1
acoustic ADJ      0      virtual ADJ       0      calculator NOUN   1
presently ADV     1      serving NOUN      0      episode NOUN      0
expectancy NOUN   0      fatality NOUN     2      garner VERB       1
commitment NOUN   2      screen VERB       1      rack VERB         2
scribe VERB       0      stereotype NOUN   1      picket VERB       1
cameo NOUN        1      rap NOUN          2      documentary NOUN  0
coach NOUN        1      power VERB        2      gig NOUN          2
ill ADV           0      craft VERB        1      definitely ADV    2
trend NOUN        1      gay ADJ           2      cockpit NOUN      1
alongside ADV     1      commercial NOUN   2      rendition NOUN    1
unavailable ADJ   0      dramatically ADV  2      shrink VERB       1
focus VERB        1      animated ADJ      1      impact NOUN       1
figure VERB       1      catcher NOUN      0      sampler NOUN      2
caller NOUN       2      outstanding ADJ   1      film NOUN         2

B.2 Validation list (1750-1799 – 1900-1949)
Word Label
radical NOUN 1
abstractedly ADV 2
divan NOUN 2
gamut NOUN 0
deplete VERB 0
denizen NOUN 0
bored ADJ 2
civilian NOUN 1
experimentally ADV 0
monitor NOUN 1
projector NOUN 1
electorate NOUN 0
obstreperous ADJ 0
awfully ADV 2
exploit VERB 2
reclamation NOUN 2
salamander NOUN 1
exponent NOUN 2
sporadic ADJ 1
outfit NOUN 0
recording NOUN 1
lot NOUN 2
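
The discussion compares classifier accuracy against a ZeroR (majority-class) benchmark. As a minimal sketch of how such a baseline can be obtained, the snippet below parses the validation entries listed above and computes the accuracy of always predicting the most frequent label. The helper names `parse_entries` and `zero_r` are hypothetical, and the figure is computed only on this list as printed, so it need not coincide with the 0.38 benchmark quoted in the discussion, whose exact computation is not specified here.

```python
from collections import Counter

# Entries copied from the B.2 validation list, one "word POS label" per line.
VALIDATION_LIST = """
radical NOUN 1
abstractedly ADV 2
divan NOUN 2
gamut NOUN 0
deplete VERB 0
denizen NOUN 0
bored ADJ 2
civilian NOUN 1
experimentally ADV 0
monitor NOUN 1
projector NOUN 1
electorate NOUN 0
obstreperous ADJ 0
awfully ADV 2
exploit VERB 2
reclamation NOUN 2
salamander NOUN 1
exponent NOUN 2
sporadic ADJ 1
outfit NOUN 0
recording NOUN 1
lot NOUN 2
"""


def parse_entries(text):
    """Turn 'word POS label' lines into (word, pos, label) tuples."""
    entries = []
    for line in text.strip().splitlines():
        word, pos, label = line.split()
        entries.append((word, pos, int(label)))
    return entries


def zero_r(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


entries = parse_entries(VALIDATION_LIST)
labels = [label for _, _, label in entries]
majority_label, baseline = zero_r(labels)
print(f"{len(entries)} words, majority label {majority_label}, "
      f"ZeroR accuracy {baseline:.3f}")
```

Any classifier evaluated on this list would have to exceed the printed value to do better than guessing the most common label.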

C. INITIAL CLASSIFIER RESULTS


Classifier              Parameters (a)            Accuracy       Tokens
KNeighborsClassifier    n_neighbors = 5           0.397 ± 0.151  raw
KNeighborsClassifier    n_neighbors = 10          0.330 ± 0.137  raw
KNeighborsClassifier    n_neighbors = 20          0.375 ± 0.121  raw
KNeighborsClassifier    n_neighbors = 40          0.397 ± 0.151  raw
RandomForestClassifier  -                         0.466 ± 0.138  raw
GaussianNB              -                         0.352 ± 0.129  raw
BernoulliNB             -                         0.443 ± 0.163  raw
LogisticRegression      -                         0.386 ± 0.067  raw
SGDClassifier           loss = "log"              0.386 ± 0.093  raw
SGDClassifier           loss = "hinge"            0.409 ± 0.109  raw
SGDClassifier           loss = "modified_huber"   0.386 ± 0.066  raw
SGDClassifier           loss = "squared_hinge"    0.409 ± 0.147  raw
SGDClassifier           loss = "huber"            0.398 ± 0.121  raw
SGDClassifier           loss = "perceptron"       0.432 ± 0.068  raw
SVC                     kernel = "linear"         0.386 ± 0.053  raw
KNeighborsClassifier    n_neighbors = 5           0.318 ± 0.087  lemma
KNeighborsClassifier    n_neighbors = 10          0.318 ± 0.115  lemma
KNeighborsClassifier    n_neighbors = 20          0.364 ± 0.145  lemma
KNeighborsClassifier    n_neighbors = 40          0.386 ± 0.124  lemma
RandomForestClassifier  -                         0.398 ± 0.157  lemma
GaussianNB              -                         0.420 ± 0.138  lemma
BernoulliNB             -                         0.363 ± 0.123  lemma
LogisticRegression      -                         0.432 ± 0.125  lemma
SGDClassifier           loss = "log"              0.432 ± 0.125  lemma
SGDClassifier           loss = "hinge"            0.455 ± 0.088  lemma
SGDClassifier           loss = "modified_huber"   0.398 ± 0.068  lemma
SGDClassifier           loss = "squared_hinge"    0.363 ± 0.122  lemma
SGDClassifier           loss = "huber"            0.307 ± 0.112  lemma
SGDClassifier           loss = "perceptron"       0.420 ± 0.076  lemma
SVC                     kernel = "linear"         0.432 ± 0.097  lemma
(a) Parameters not given are the default in scikit-learn.
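
The accuracies above are reported in the form "mean ± spread". As an illustration only, and assuming the ± term denotes the standard deviation of the per-fold accuracies from cross-validation (the table itself does not say so), such a figure could be formatted as follows. The function name `summarize_folds` and the fold values are made up for the example and are not results from this research.

```python
from statistics import mean, pstdev


def summarize_folds(fold_accuracies):
    """Format per-fold accuracies as 'mean ± std' with three decimals."""
    avg = mean(fold_accuracies)
    # Population standard deviation, an assumption about what "±" means here.
    spread = pstdev(fold_accuracies)
    return f"{avg:.3f} ± {spread:.3f}"


# Hypothetical fold accuracies, chosen only to demonstrate the formatting.
example_folds = [0.5, 0.25, 0.5, 0.25]
summary = summarize_folds(example_folds)
print(summary)
```

With so few training words, each fold contains only a handful of items, which is one plausible reason for the wide spreads in the table.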
