
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Intrinsic and Extrinsic Evaluations of Word Embeddings

Michael Zhai, Johnny Tan, Jinho D. Choi


Department of Mathematics and Computer Science
Emory University
Atlanta, GA 30322
{michael.zhai,johnny.tan,jinho.choi}@emory.edu

Abstract

In this paper, we first analyze the semantic composition of word embeddings by cross-referencing their clusters with the manual lexical database, WordNet. We then evaluate a variety of word embedding approaches by comparing their contributions to two NLP tasks. Our experiments show that the word embedding clusters give high correlations to the synonym and hyponym sets in WordNet, and give 0.88% and 0.17% absolute improvements in accuracy to named entity recognition and part-of-speech tagging, respectively.

[Figure 1: The t-SNE projection of word embeddings with respect to the synonym sets in WordNet.]

Introduction

Distributional semantics, the field of finding semantic similarities between entities using large data, has recently gained much interest. Word clusters induced from distributional semantics have been shown to be helpful for handling unseen words in several NLP tasks (Turian, Ratinov, and Bengio 2010). Furthermore, recent advances in embedding approaches have produced superior word representations for word similarity and analogy tasks (Mikolov et al. 2013). In this paper, we analyze the semantic composition of word embeddings by comparing their clusters to the manual lexical database, WordNet (Fellbaum 1998), and give extrinsic evaluations of different word embedding approaches through two NLP tasks, named entity recognition and part-of-speech tagging.
Intrinsic Evaluation

Word embeddings are continuous-valued vectors representing word semantics. In our experiments, they are generated by the continuous bag-of-words (CBOW), skip-gram with negative sampling (SGNS), and GloVe (Pennington, Socher, and Manning 2014) models, and clustered by the k-means, g-means, hierarchical g-means, and agglomerative clustering algorithms using cosine similarity. Brown clusters are induced directly from the text.
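As a minimal sketch of this pipeline (assuming gensim for SGNS/CBOW training and scikit-learn for clustering; the corpus path and hyperparameters are illustrative, not our exact settings):

    from gensim.models import Word2Vec
    import numpy as np
    from sklearn.cluster import KMeans

    # Train SGNS (sg=1) or CBOW (sg=0) embeddings on a tokenized corpus.
    sentences = [line.split() for line in open("corpus.txt")]  # hypothetical corpus file
    model = Word2Vec(sentences, vector_size=100, window=5,
                     sg=1, negative=5, min_count=5)

    words = list(model.wv.index_to_key)
    vectors = model.wv[words]

    # k-means minimizes Euclidean distance; L2-normalizing the vectors first
    # makes Euclidean distance monotonic in cosine distance, so clustering
    # the normalized vectors amounts to clustering by cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    kmeans = KMeans(n_clusters=1500, n_init=10, random_state=0).fit(vectors)  # cluster count illustrative
    word2cluster = dict(zip(words, kmeans.labels_))

Agglomerative clustering over the same normalized vectors can be run analogously with scikit-learn's AgglomerativeClustering using a cosine metric; g-means and hierarchical g-means require separate implementations.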
For intrinsic evaluation, WordNet is used as the reference for our semantic analysis of word embeddings. From WordNet, sets of synonyms and hyponyms of the 100 most frequent nouns and verbs in the New York Times corpus [1] are extracted and compared to the clusters generated from the word embeddings.

Figure 1 shows that our k-means clusters (colored shapes) display a high degree of agreement with the WordNet synonym sets (subscripts). [2] Figure 2 shows that hard-bound clustering such as k-means achieves much higher purity scores than fuzzy-bound clustering such as c-means.

[Figure 2: The purity scores achieved by k-means and c-means clustering with respect to the number of clusters.]
Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

[1] https://catalog.ldc.upenn.edu/LDC2008T19
[2] The other clusters show similar results as Figure 1.

Embedding    Cluster         F1-score
Baseline     -               85.31
-            Brown           86.15
SGNS         agglomerative   86.19
SGNS         k-means         85.72
SGNS         g-means         85.83
SGNS         g-means hier    85.68
SGNS (w+c)   agglomerative   86.14
SGNS (w+c)   k-means         85.65
SGNS (w+c)   g-means         85.70
SGNS (w+c)   g-means hier    85.71
CBOW         agglomerative   85.98
CBOW         k-means         85.81
CBOW         g-means         85.67
CBOW         g-means hier    85.70
GloVe        agglomerative   86.08
GloVe        k-means         85.72
GloVe        g-means         85.71
GloVe        g-means hier    86.10

Table 1: Named entity recognition results on the test set.

[Figure 3: The F1-scores for named entity recognition with respect to different sizes of the training data using SGNS, grouped by all clustering algorithms.]

Embedding    Cluster         Accuracy
Baseline     -               97.34
-            Brown           97.51
SGNS         agglomerative   97.43
SGNS (w+c)   agglomerative   97.39
CBOW         agglomerative   97.42
GloVe        g-means         97.40

Table 2: Part-of-speech tagging results on the test set; only the best result is displayed for each approach.
Extrinsic Evaluation

For extrinsic evaluation, we use the word embedding clusters as features for two NLP tasks, named entity recognition and part-of-speech tagging. The English portion of OntoNotes 5 is used for experiments, following the standard split suggested by Pradhan et al. (2013). AdaGrad is used for training the statistical models. As recommended by Levy, Goldberg, and Dagan (2015), additional experiments are conducted by concatenating the word and contextual vectors (w+c).
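A minimal sketch of how cluster IDs can serve as discrete tagging features, assuming the word2cluster mapping above; the feature template and downstream learner are illustrative, not our exact configuration:

    def token_features(tokens, i, word2cluster):
        """Discrete features for token i, including cluster IDs of a small window."""
        w = tokens[i]
        return {
            "word": w.lower(),
            "suffix3": w[-3:],
            "is_title": w.istitle(),
            # Cluster IDs of the current and neighboring words; unseen words
            # fall back to a sentinel, which is where clusters help most.
            "cluster0": word2cluster.get(w.lower(), -1),
            "cluster-1": word2cluster.get(tokens[i - 1].lower(), -1) if i > 0 else -2,
            "cluster+1": word2cluster.get(tokens[i + 1].lower(), -1) if i < len(tokens) - 1 else -2,
        }

    print(token_features(["Emory", "University", "in", "Atlanta"], 0, {"emory": 42}))

These feature dictionaries would then be fed to a statistical tagger trained with AdaGrad, as described above.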
For the NER experiments, the highest F1-score of 86.19 is achieved by the skip-gram with negative sampling embeddings (SGNS) using agglomerative clustering. On the other hand, the highest accuracy of 97.51 is achieved by Brown clustering (using raw text instead of embeddings). These results outperform the previous work (Pradhan et al. 2013), showing absolute improvements of 3.77% and 0.42% for the NER and POS tasks, respectively.

All of the above experiments use a maximum cluster size of 1,500. We also tested a maximum cluster size of 15,000, which showed very similar results. This implies that increasing the cluster size does not improve the quality of the clusters, at least for these two tasks. For the NER task, SGNS and Brown give a constant additive increase in performance regardless of the size of the training data.

Conclusion

Word embeddings have been shown to be useful for several NLP tasks. In this paper, we first analyze the nature of the vector spaces created by different word embedding approaches and compare their clusters to the ontologies in WordNet. From our experiments, we found that the embedding clusters show high correlations with the synonyms and hyponyms in WordNet, although the correlation level decreases as the cluster size increases.

We also show the impact of different word embedding approaches coupled with several clustering algorithms on two NLP tasks, named entity recognition and part-of-speech tagging. Our experiments show that hierarchical clustering algorithms such as Brown or agglomerative are more suitable for finding clustering features than partition-based clustering algorithms such as k-means and g-means for these tasks. [3]

References

Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Levy, O.; Goldberg, Y.; and Dagan, I. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. TACL 3:211–225.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, 3111–3119.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global Vectors for Word Representation. In EMNLP, 1532–1543.

Pradhan, S.; Moschitti, A.; Xue, N.; Ng, H. T.; Björkelund, A.; Uryupina, O.; Zhang, Y.; and Zhong, Z. 2013. Towards Robust Linguistic Analysis using OntoNotes. In CoNLL, 143–152.

Turian, J.; Ratinov, L.; and Bengio, Y. 2010. Word Representations: A Simple and General Method for Semi-supervised Learning. In ACL, 384–394.

[3] All resources are available at http://github.com/emorynlp.

