Semantic Shift
Dylan Koldenhof
University of Twente
P.O. Box 217, 7500AE Enschede
The Netherlands
d.koldenhof@student.utwente.nl
initial testing of the classifiers, two corpora will be used. The first consists of Project Gutenberg2 ebooks spanning approximately the first half of the 19th century, while the second consists of recent Wikipedia articles. Two versions of each corpus are used: one uses the words as they appear in the texts, except lowercased and with punctuation removed, while the other contains lemmatized and part-of-speech-tagged tokens. These two approaches will henceforth be referred to as "raw" and "lemma".

After the models are trained on the corpora, they are aligned using the Orthogonal Procrustes (OP) method, after which the vectors for the same word across the different time periods can be compared. From the words that have shifted most significantly on the lemmatized models, an annotated list is made, which will be used to categorize shifts using different supervised machine learning classifiers, with the differences between the vectors as features. Cross-validation will be performed on the annotated list, and the resulting accuracy metrics will be used for an initial evaluation, on both the raw and lemma embeddings.

To determine the generalizability of these classifiers, two further corpora will be used for validation, covering the time periods 1750-1799 and 1900-1949. Both are composed of Project Gutenberg texts and will be aligned in the same way as the training corpora. The classifiers will then be used on another annotated list composed in the same manner, with the classifier predicting the categories. The performance across these corpora and the training corpora can then be compared, and the classifiers that show the best results on both corpus pairs will be used to answer the main research question.

2 https://www.gutenberg.org/
2. RELATED WORK
As mentioned before, Bloomfield [3] proposes a categorization of semantic shifts which is popular and used as a basis for more recent studies [6]. Bloomfield initially proposed nine shifts, which are as follows:

• Narrowing: A meaning going from a wider to a narrower scope. An example is the word "hound", which originally was the word for dog in general and later evolved to mean hunting dog specifically.

• Widening: Opposite of narrowing; the word "dog" itself is actually an example, as it used to refer to a specific sort of dog.

• Metaphor: A word used as a metaphor becoming its primary meaning, e.g. "broadcast" as described before.

• Metonymy: A meaning relating to a place or time shifting to another nearby place or time, e.g. the word "cheek", which had the meaning of jaw in older English.

• Synecdoche: Shift from a part to a whole, or vice versa. An example is the word "commercial" in the sense of advertisement: being commercial is just one trait of an advertisement, but it shifted into the word for it entirely.

• Hyperbole: A stronger meaning shifting into a weaker one. The word "quite" is an example, which used to mean "completely" or "wholly".

• Litotes: Opposite of hyperbole, e.g. the word "kill", which in its original Germanic root had meanings in the sense of tormenting or vexing.

• Degeneration: A meaning shifting into a lower status, for instance the word "silly", which has gone through many meanings, starting in the sense of happy or blessed, then weak, and finally the meaning it has now.

• Elevation: Opposite of degeneration; the word "knight" went from a low servant to someone exalted.

Although there were some earlier methods, the advent of neural word embedding algorithms led to a large increase in research into diachronic semantic shift [12]. The first popular implementation of this method is the Word2Vec framework proposed by Mikolov et al. [10]. This is a neural network model that comes with two different training methods: Continuous Bag of Words (CBOW) and Skip-Gram. CBOW is generally more appropriate for small corpora while Skip-Gram is more suited for large ones [12], and thus Skip-Gram will be used for this research.

The Word2Vec Skip-Gram model works by sliding a window of a given size across the input text. For every word, pairs of that word and each other word within the window are formed. These pairs then form the samples for training a neural network with a single hidden layer. The task of the network is, given an input word, to estimate the probability of every word in the vocabulary appearing within the window of the input word. This task itself is not the goal of the method at all; rather, the weights of the neurons in the hidden layer after training are what make up the vector of the input word [8].

Along with this, Negative Sampling is used in the training process. Normally, given a training sample pair, the output weights for all words are updated, with the goal that every word outside the pair gets probability 0 while the other word in the pair gets probability 1. With a large vocabulary this means a lot of unnecessary computation, since most words do not share much context, and thus their weight updates will be rather insignificant. With Negative Sampling, only a limited number of "negative" words (random words from the vocabulary) are chosen for every training sample, and only their weights are updated [9]. Skip-Gram with Negative Sampling (SGNS) was generally deemed most effective by extensive evaluation studies [18, 19].
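To make this concrete, below is a minimal sketch of how Skip-Gram pairs and negative samples could be drawn from a tokenized sentence. It is an illustration, not Word2Vec itself: the actual implementation trains the network on these pairs and draws negatives from a smoothed unigram distribution, whereas this sketch samples uniformly for simplicity.

import random

def skipgram_pairs(tokens, window=5):
    # Slide a window over the tokens, yielding (input, context) pairs.
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield word, tokens[j]

def add_negatives(pairs, vocab, k=5):
    # For each positive pair, draw k "negative" words; during training only
    # the output weights of the context word and these k words are updated,
    # rather than those of the whole vocabulary.
    for word, context in pairs:
        yield word, context, random.sample([w for w in vocab if w != context], k)

tokens = "the hound chased the other dog across the field".split()
vocab = sorted(set(tokens))
for word, context, negatives in add_negatives(skipgram_pairs(tokens, window=2), vocab, k=2):
    print(word, "->", context, "negatives:", negatives)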
In order to identify diachronic semantic shift, however, there needs to be a way of comparing these word embeddings across time. There have been two approaches to this: static and dynamic embeddings.

The static approach relies on separating a corpus into different time slices and training embeddings on each time slice independently. The problem with this is that, due to the stochastic nature of the model, embeddings trained on different slices cannot be compared directly; only their relative distances (provided meanings stay similar) are comparable [2, 12] (see Figure 1).

Figure 1. Illustrating what happens when word embeddings are trained on two different time periods and not aligned, using the word gay as an example. Relative distances are roughly the same between unchanged words, but their absolute positions in the vector space are completely different, and hence they cannot be compared directly. Based on figures by Bianchi et al. [2].

Hence, one way to compensate for this is to align the different vector spaces. The best of these alignment methods was found to be the Orthogonal Procrustes (OP) method [18], proposed by Hamilton et al. [5].

The other approach is dynamic embeddings. The key difference between these and static embeddings is that different time slices are not modeled independently of one another; word embeddings at a given time slice are based on their positions at previous ones. Taking this into account yields a more directed and smoother shift in some cases, but a disadvantage is that time slices can flow into one another too much [12]. Some examples of dynamic embeddings are those proposed by Bamler and Mandt [1] and Rudolph and Blei [16].

There have been some papers inquiring into the nature of semantic change, but none with categorizations like the one proposed by Bloomfield [3]. Mitra et al. [11] use an approach different from word embeddings to derive sense clusters. The shifts observed in these sense clusters are then assigned four categories: split, join, birth and death. Hamilton et al. [5] derived statistical laws of shifts, such as that frequent words shift at slower rates.
3.1 Corpora
For the Wikipedia corpus, a recent dump was used, only including articles with more than 5000 words. This consists of around 1.63B words. Due to the long processing time, the lemma Wikipedia corpus is from a slightly older archive from 2017, which was pre-trained using Word2Vec5.

For validating the generalizability of the classifier and answering RQ3, two different corpora are used, spanning the time periods 1750-1799 and 1900-1949. These are both retrieved from Gutenberg and preprocessed in the same manner as the 1800-1849 corpus. The 1750-1799 corpus consists of 765 texts containing 66.4M words, while the 1900-1949 corpus consists of 10,838 texts containing 646M words.

3.2 Training the model
The corpora are subsequently trained with Gensim's Word2Vec model, which includes an implementation of Skip-Gram with Negative Sampling (SGNS). The exception is the lemma Wikipedia corpus, which was pre-trained, also using SGNS. An overview of all parameters used can be found in Appendix A.
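As an illustration of this step, training one of the Gutenberg corpora could look like the sketch below (Gensim 4.x; the file name and its one-tokenized-sentence-per-line format are hypothetical, and the parameters are those listed in Appendix A for the 1800-1849 raw corpus):

from gensim.models import Word2Vec

# Hypothetical input: one sentence per line, lowercased, punctuation removed.
with open("gutenberg_1800-1849_raw.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# sg=1 selects Skip-Gram; Gensim trains with negative sampling by default.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=50, sg=1)
model.save("w2v_1800-1849_raw.model")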
3.3 Comparing embeddings
The alignment of the different models was done using the Orthogonal Procrustes (OP) method. To illustrate: the Orthogonal Procrustes problem is, given two matrices A and B, to find the orthogonal matrix that most closely maps A to B. This mapping can be used on the two embedding spaces to map the older period into the space of the newer one. Specifically, with W_t ∈ R^{|V|×d} the matrix of all word embeddings at period t, where d is the number of dimensions of an embedding and V the shared vocabulary of the models, the following problem needs to be solved:

R_t = argmin_Q ||W_t Q − W_{t+1}||_F, subject to Q^T Q = I    (1)
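SciPy ships a solver for exactly this problem, so the alignment and comparison can be sketched as below. The model names are hypothetical, and length-normalizing the vectors before alignment (so that the comparison is purely by direction, i.e. cosine) is a common-practice assumption rather than a detail stated above.

import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_and_rank(model_old, model_new, topn=20):
    # Rows of W_t and W_{t+1}: vectors of the shared vocabulary V.
    vocab = [w for w in model_old.wv.index_to_key if w in model_new.wv.key_to_index]
    W_old = np.stack([model_old.wv[w] for w in vocab])  # |V| x d
    W_new = np.stack([model_new.wv[w] for w in vocab])  # |V| x d
    W_old /= np.linalg.norm(W_old, axis=1, keepdims=True)
    W_new /= np.linalg.norm(W_new, axis=1, keepdims=True)

    # Equation (1): R = argmin_Q ||W_old Q - W_new||_F over orthogonal Q.
    R, _ = orthogonal_procrustes(W_old, W_new)
    W_aligned = W_old @ R

    # Orthogonal maps preserve norms, so the row-wise dot product is the
    # cosine similarity; 1 - cosine ranks how far each word has moved.
    shift = 1.0 - np.sum(W_aligned * W_new, axis=1)
    order = np.argsort(-shift)
    return [(vocab[i], float(shift[i])) for i in order[:topn]]

Ranking by this cosine distance is one way to obtain candidate shifted words such as those annotated for this research.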
The annotated training list can be found in Appendix B.1. Its categories are a reduction of Bloomfield's categories as described in Section 2, which was done because of the small size of the list, along with many of Bloomfield's categories being quite rare in the selection process. The categories are as follows:

• Scope change (narrowing and widening)

• Metaphor and metonymy

• Other types/cannot be determined

For each word in this list, the difference between the vectors of the word from the different time periods in the aligned embedding space is computed. The elements of this delta vector are then used as features for training different machine learning classifiers, with the categories functioning as labels.

Before training the classifiers, the features are scaled to have zero mean and unit variance. This reduces the effect of outliers, and potentially makes the training data generalize better across corpora.
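A minimal sketch of this feature construction, where the dictionaries mapping each annotated word to its aligned old-period vector and its new-period vector are hypothetical inputs:

import numpy as np
from sklearn.preprocessing import StandardScaler

def build_features(words, labels, old_vectors, new_vectors):
    # The delta vector between the two periods is the feature vector.
    X = np.stack([new_vectors[w] - old_vectors[w] for w in words])
    y = np.asarray(labels)
    # Scale every feature to zero mean and unit variance.
    return StandardScaler().fit_transform(X), y

In a stricter setup the scaler would sit inside a scikit-learn Pipeline, so that each cross-validation fold is scaled using only its training portion.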
To validate the performance of the models, another annotated list is used, created in the same manner as the first. The difference is that this list uses vectors from the 1750-1799 and 1900-1949 corpora and contains only 22 words. It can be found in Appendix B.2. The best performing classifiers on the 1800-1849/wiki pair will be tested on this list in order to determine whether they still perform similarly. The best performing classifier on both of these can then be concluded to be the most reliable for categorizing semantic shift on Word2Vec embeddings.

4. EVALUATION
4.1 First results on training corpora
The best classifiers are determined based on average prediction accuracy across a Leave-One-Out (LOO) cross-validation (CV) on the list. In LOO-CV, a model is trained on all but one of the elements in the data set, with the remaining single element used for testing; this is repeated with a different test element every time, until all elements in the data have been tested. Given the small size of the list, this allows the classifier to be trained on as much data as possible, as opposed to other forms of CV, such as 10-fold, where the data is split into ten parts and each part is held out in turn as a testing set.
The results of this first step can be found in Appendix C. The classifiers are referred to by their class names in the Python scikit-learn library7, which is used for all of them.

7 https://scikit-learn.org/stable/index.html

What can initially be noted is that these accuracy numbers are quite low. Crucially, however, some outperform random or majority guessing. Table 1 shows the distribution of the labels. As a benchmark, consider the accuracy of a ZeroR classifier, which always predicts the label that forms the majority of the training data: here that accuracy would be 33/(28 + 27 + 33) = 0.375.

Label                     Count
Scope (0)                 28
Metaphor/metonymy (1)     33
Other (2)                 27

Table 1. Distribution of labels.
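This evaluation can be sketched with scikit-learn as follows, with X and y as in the earlier feature sketch and the SVC standing in for any of the tested classifiers; the ZeroR benchmark corresponds to its DummyClassifier with the majority-class strategy:

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

loo = LeaveOneOut()  # each annotated word is held out once for testing
acc = cross_val_score(SVC(), X, y, cv=loo).mean()

# ZeroR: always predict the majority label, here 33/88 = 0.375.
zeror = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=loo).mean()
print(f"LOO accuracy: {acc:.3f}, ZeroR benchmark: {zeror:.3f}")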
Furthermore, the differences between the raw and lemma models appear to be fairly small, except for two classifiers: Random Forest, which performs much better on the raw models, and Stochastic Gradient Descent (SGD) with log loss, which is much better on the lemma models. Upon further analysis of the parameters, it turned out that these differences are only due to the random elements of these classifiers, which give them vastly different results every run. This makes them not very generalizable, and thus it is more beneficial to look at methods with less randomness. Excluding SGD and Random Forest, Table 2 shows the best classifiers for the raw and lemma models.

Classifier            Accuracy         Tokens
BernoulliNB           0.443 ± 0.163    raw
KNeighborsClassifier  0.397 ± 0.151    raw
SVC                   0.386 ± 0.053    raw
SVC                   0.432 ± 0.097    lemma
LogisticRegression    0.432 ± 0.125    lemma
GaussianNB            0.420 ± 0.138    lemma

Table 2. Best classifiers for raw and lemma models.

4.2 SVC grid-search
With most of these classifiers there is little room for improvement, except for the Support Vector Classifier (SVC), since it allows a lot of parameter tweaking. Thus a grid-search was performed on this classifier, testing many combinations of parameters to determine the best one. The best results for both corpora are shown in Table 3.

Kernel   Parameters8               Accuracy9        Tokens
Sigmoid  C = 100, gamma = 0.1      0.477 ± 0.125    raw
Sigmoid  C = 1000, gamma = 0.1     0.477 ± 0.136    raw
Sigmoid  C = 10000, gamma = 0.1    0.477 ± 0.119    raw
Sigmoid  C = 10, gamma = 0.3       0.614 ± 0.146    lemma
Sigmoid  C = 1, gamma = 0.4        0.614 ± 0.150    lemma
Sigmoid  C = 10, gamma = 0.9       0.591 ± 0.141    lemma

Table 3. Top 3 results from parameter tweaking on the raw and lemma models.

8 Here and in subsequent tables, parameters not given are the scikit-learn defaults.
9 Here and in subsequent tables, the values after ± represent the standard deviation as evaluated from a 10-fold cross-validation.
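Such a search can be sketched with scikit-learn's GridSearchCV. The grid below is an assumption that covers the kernels and values appearing in Tables 3 and 6, not necessarily the exact grid used:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "kernel": ["linear", "rbf", "sigmoid"],
    "C": [1, 5, 10, 100, 1000, 10000],
    "gamma": [0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.3, 0.4, 0.8, 0.9],
}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)  # X, y as in the feature sketch above
print(search.best_params_, search.best_score_)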
The sigmoid kernel is the only one that showed up in the top 3 for both the raw and lemma models, so it is clearly the best. However, the best parameters for the lemma and raw models differ, and on the lemma models the accuracy reaches much higher values than on the raw models. The confusion matrix of the classifier with the best accuracy is shown in Figure 2.

4.3 Validation performance
However, although the sigmoid classifier seems ideal, it performs poorly on the validation set, as can be seen in Table 4.
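The validation step itself amounts to fitting on the training pair's features and scoring on the validation pair's, sketched below with hypothetical array names and one of the parameter settings from Table 6:

from sklearn.svm import SVC

# X_train, y_train: delta features from the 1800-1849/Wikipedia pair;
# X_val, y_val: delta features from the aligned 1750-1799/1900-1949 pair.
clf = SVC(kernel="sigmoid", C=5, gamma=0.8)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))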
Figure 3. The confusion matrices for the SVC with RBF kernel, C = 100 and γ = 0.0001. On the left the performance on the training/CV pair, and on the right the validation pair.
Kernel   Parameters                   Acc. (aligned)  Acc. (no alignment)  Acc. (CV)
Sigmoid  C = 5, gamma = 0.8           0.50            0.36                 0.557 ± 0.109
Sigmoid  C = 10, gamma = 0.3          0.41            0.23                 0.614 ± 0.146
Sigmoid  C = 1, gamma = 0.4           0.36            0.23                 0.614 ± 0.150
Sigmoid  C = 10, gamma = 0.9          0.45            0.27                 0.591 ± 0.141
Sigmoid  C = 1, gamma = 0.1           0.23            0.36                 0.500 ± 0.166
Sigmoid  C = 1, gamma = 0.05          0.32            0.41                 0.375 ± 0.128
Sigmoid  C = 1, gamma = 0.01          0.36            0.50                 0.352 ± 0.091
Linear   C = 1                        0.36            0.41                 0.432 ± 0.097
RBF      C = 100, gamma = 0.001       0.36            0.36                 0.420 ± 0.086
RBF      C = 100, gamma = 0.0001      0.36            0.41                 0.432 ± 0.074
RBF      C = 10000, gamma = 0.00001   0.36            0.41                 0.432 ± 0.097

Table 6. Comparison of kernels across both model pairs, with alignment.

Figure 4. The confusion matrices for the SVC with sigmoid kernel, C = 5 and γ = 0.8. On the left the performance on the training/CV pair, and on the right the aligned validation pair.

Some success was found in classifying semantic shift across different corpora, after aligning them. Though this is far from perfect, an accuracy above 0.5 for both pairs is clearly better than the ZeroR benchmark of 0.38, and the classifier goes above this threshold in predicting every label, as can be seen in Figure 3. However, the aligned validation pair still shows poorer performance than the cross-validation, and the best performance on cross-validation does not necessarily reflect the best performance on the validation pair. This is most likely the result of, firstly, the small size of the training and test word lists, and secondly the limitations of the alignment, which is ultimately susceptible to some inaccuracy, as it relies on the shared vocabulary between the pairs. Many words could have changed differently from 1750 to 1949 than from 1800 to the present. Furthermore, on all confusion matrices it is clear that classifying label 2 (other) is less accurate. This is most likely a result of it being a grouping of different categories, so there can be many differences within it. Label 1 (metaphor/metonymy) is only more inaccurate on the validation pair, which might just be due to the small size of the validation word list.

6. CONCLUSIONS
Overall, RQ0 has to be answered somewhat inconclusively, but with promising directions for future research. An accuracy of 0.6 is possible on the Gutenberg 1800-1849 and Wikipedia corpora, but the approach is somewhat less effective across different corpora, and even 0.6 accuracy can still not be considered that reliable.

The answer to RQ1 seems quite clearly to be a form of Support Vector Classifier, which is the only classifier that could be tweaked to yield an accuracy well above the ZeroR benchmark consistently on both raw and lemma tokens, as well as on the validation pair.

RQ2 can be answered by the reduction of Bloomfield's types that was performed (scope, metaphor/metonymy and other), but this was mostly done due to the imbalance of the annotated list. Within these types, the best result on the training pair seemed to perform equally well on scope change and metaphor/metonymy (see Figure 2), with some reduced performance on the other category. On the validation pair, scope change performs as well as on the training pair, so it could be said that this type can be predicted most easily.

Finally, RQ3 also has to be answered somewhat inconclusively. With the right parameters, some classifiers show decent results after alignment, but not as good as the cross-validation suggests (Table 6).

There were quite some limitations in this research, so there is much room for future work. Firstly, Gutenberg and Wikipedia have texts from quite different domains, and many detected shifts were actually noise as a result of this. For example, the word "bye" as a shortening of "goodbye", which would only appear in conversational text, rarely appears on Wikipedia, but its sports sense does, which in turn is rare in the books. Thus this was detected as a shift despite both senses having existed for a long time. There is a balanced corpus of English text spanning the years 1810-2009, known as COHA10, but it is sold for a rather high price and was thus not available to me.

10 Available from https://www.corpusdata.org/

Furthermore, this also ties into the small size of the annotated list, which was time-consuming to create due to the many words that only seemed to be detected as shifts because of the differences between the corpora. It was also complicated to classify many words, since they seemed to involve aspects from multiple categories, or simply not enough information could be found on them. I am also not an expert on linguistics by any means, so having a group of linguists come to a consensus over a bigger list could potentially greatly improve results.

Another method that has not been attempted in this research is to use a dynamic embedding method instead of alignment, with more time periods. This could give a more detailed look into shifts across time, with a word that has undergone multiple shifts being able to be separated into its different shifts, since the gradual process is explicit in the dynamic embeddings.

And finally, there is the method of Word2Vec itself. In recent years a different type of embedding method has emerged, resulting in contextualized embeddings. These methods produce a different embedding for every context a word appears in, and by grouping these embeddings together, the different senses of a word can be represented as different vectors. This could make the detected semantic shift more precise for polysemous (having multiple meanings) words, as the vector would be free from the influence of other senses that have not undergone a semantic shift. However, this method is not perfect in detecting polysemy [13] and not always better than non-contextual methods [19].

7. ACKNOWLEDGEMENTS
I would like to thank my supervisor, Shenghui Wang, for giving me important directions and advice for this research, and the Intelligent Interaction track chair for the useful info and feedback sessions.
8. REFERENCES
[1] R. Bamler and S. Mandt. Dynamic word
embeddings. In ICML, 2017.
[2] F. Bianchi, V. Di Carlo, P. Nicoli, and
M. Palmonari. Compass-aligned distributional
embeddings for studying semantic differences across
corpora. ArXiv, abs/2004.06519, 2020.
[3] L. Bloomfield. Language. George Allen & Unwin,
1933.
[4] A. Darmesteter. La vie des mots. Delagrave, 1887.
[5] W. L. Hamilton, J. Leskovec, and D. Jurafsky.
Diachronic word embeddings reveal statistical laws
of semantic change. CoRR, abs/1605.09096, 2016.
[6] A. Kutuzov, L. Øvrelid, T. Szymanski, and
E. Velldal. Diachronic word embeddings and
semantic shifts: a survey. In COLING, 2018.
[7] M. Martinc, P. K. Novak, and S. Pollak. Leveraging
contextual embeddings for detecting diachronic
semantic shift. ArXiv, abs/1912.01072, 2020.
[8] C. McCormick. Word2vec tutorial - the skip-gram
model. http://mccormickml.com/2016/04/19/
word2vec-tutorial-the-skip-gram-model/, Apr.
2016. Accessed: 27-06-2021.
[9] C. McCormick. Word2vec tutorial part 2 - negative
sampling. https://mccormickml.com/2017/01/11/
word2vec-tutorial-part-2-negative-sampling/,
Jan. 2017. Accessed: 27-06-2021.
[10] T. Mikolov, K. Chen, G. Corrado, and J. Dean.
Efficient estimation of word representations in
vector space. In ICLR, 2013.
[11] S. Mitra, R. Mitra, M. Riedl, C. Biemann,
A. Mukherjee, and P. Goyal. That’s sick dude!:
Automatic identification of word sense change across
different timescales. In ACL, 2014.
[12] S. Montariol. Models of diachronic semantic change using word embeddings. PhD thesis, Université Paris-Saclay, Feb. 2021.
[13] S. Montariol, E. Zosa, M. Martinc, and
L. Pivovarova. Capturing evolution in word usage:
Just add more clusters? Companion Proceedings of
the Web Conference 2020, 2020.
[14] H. Paul. Prinzipien der Sprachgeschichte. Niemeyer,
1880.
[15] K. Reisig. Professor K. Reisig’s Vorlesungen über
lateinische Sprachwissenschaft. Lehnold, 1839.
[16] M. Rudolph and D. Blei. Dynamic embeddings for
language evolution. In Proceedings of the 2018
World Wide Web Conference, WWW ’18, pages
1003–1011, Republic and Canton of Geneva, CHE,
2018. International World Wide Web Conferences
Steering Committee.
[17] E. Sagi, S. Kaufmann, and B. Clark. Tracing
semantic change with latent semantic analysis.
Current Methods in Historical Semantics, pages
161–183, Dec. 2011.
[18] D. Schlechtweg, A. Hätty, M. D. Tredici, and S. S.
im Walde. A wind of change: Detecting and
evaluating lexical semantic change across times and
domains. In ACL, 2019.
[19] D. Schlechtweg, B. McGillivray, S. Hengchen, H. Dubossarsky, and N. Tahmasebi. SemEval-2020 task 1: Unsupervised lexical semantic change detection. ArXiv, abs/2007.11464, 2020.
APPENDIX
A. PARAMETERS WORD2VEC

Corpus              Parametersa
1800-1849 (raw)     vector_size = 300, window = 5, min_count = 50, sg = 1
1800-1849 (lemma)   vector_size = 300, window = 5, min_count = 100, sg = 1
Wikipedia (raw)     vector_size = 300, window = 5, min_count = 200, sg = 1
Wikipedia (lemma)   vector_size = 300, window = 5, sg = 1b
1750-1799           vector_size = 300, window = 5, min_count = 30, sg = 1
1900-1949           vector_size = 300, window = 5, min_count = 100, sg = 1

a Parameters not given are the default in Gensim.
b Other parameters unknown (pre-trained).

B. WORD LIST
B.1 Training list (1800-1849 – today)

Word               Label   Word              Label
tally NOUN         0       campaign VERB     1
intriguing ADJ     2       devious ADJ       1
lap NOUN           2       franchise NOUN    2
virtual ADJ        0       milestone NOUN    1
serving NOUN       0       calculator NOUN   1
fatality NOUN      2       episode NOUN      0
screen VERB        1       garner VERB       1
stereotype NOUN    1       rack VERB         2
rap NOUN           2       picket VERB       1
power VERB         2       documentary NOUN  0
craft VERB         1       gig NOUN          2
gay ADJ            2       definitely ADV    2
commercial NOUN    2       cockpit NOUN      1
dramatically ADV   2       rendition NOUN    1
animated ADJ       1       shrink VERB       1
catcher NOUN       0       impact NOUN       1
outstanding ADJ    1       sampler NOUN      2
guy NOUN           0       film NOUN         2
destroyer NOUN     1
squad NOUN         1
moonshine NOUN     2
album NOUN         2
retarded ADJ       0
hectic ADJ         0
quite ADJ          2
clinch VERB        1
assist NOUN        0
sedate VERB        0
brand NOUN         1
closet VERB        0
trans ADJ          0
untitled ADJ       0
party VERB         0
ejaculate VERB     2
concourse NOUN     0
unseasonably ADV   0
stint NOUN         1
urge NOUN          1
task VERB          1
portmanteau NOUN   1
installation NOUN  1
annexed ADJ        0
peak VERB          1
acoustic ADJ       0
presently ADV      1
expectancy NOUN    0
commitment NOUN    2
scribe VERB        0
cameo NOUN         1
coach NOUN         1
ill ADV            0
trend NOUN         1
alongside ADV      1
unavailable ADJ    0
focus VERB         1
figure VERB        1
caller NOUN        2
home VERB          1
bumper NOUN        2
operative NOUN     0
chair VERB         1
fag NOUN           2
cartel NOUN        2
shaver NOUN        2
jade NOUN          2
chum NOUN          2
unused ADJ         2
famously ADV       2
merchandise VERB   1
B.2 Validation list (1750-1799 – 1900-1949)
Word Label
radical NOUN 1
abstractedly ADV 2
divan NOUN 2
gamut NOUN 0
deplete VERB 0
denizen NOUN 0
bored ADJ 2
civilian NOUN 1
experimentally ADV 0
monitor NOUN 1
projector NOUN 1
electorate NOUN 0
obstreperous ADJ 0
awfully ADV 2
exploit VERB 2
reclamation NOUN 2
salamander NOUN 1
exponent NOUN 2
sporadic ADJ 1
outfit NOUN 0
recording NOUN 1
lot NOUN 2