Linguistic Regularities in Continuous Space Word Representations


Tomas Mikolov*, Wen-tau Yih, Geoffrey Zweig

Microsoft Research
Redmond, WA 98052

* Currently at Google, Inc.

Abstract

Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector-space word representations that are implicitly learned by the input-layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, "King - Man + Woman" results in a vector very close to "Queen". We demonstrate that the word vectors capture syntactic regularities by means of syntactic analogy questions (provided with this paper), and are able to correctly answer almost 40% of the questions. We demonstrate that the word vectors capture semantic regularities by using the vector offset method to answer SemEval-2012 Task 2 questions. Remarkably, this method outperforms the best previous systems.

1 Introduction

A defining feature of neural network language models is their representation of words as high-dimensional real-valued vectors. In these models (Bengio et al., 2003; Schwenk, 2007; Mikolov et al., 2010), words are converted via a learned lookup table into real-valued vectors which are used as the inputs to a neural network. As pointed out by the original proposers, one of the main advantages of these models is that the distributed representation achieves a level of generalization that is not possible with classical n-gram language models; whereas an n-gram model works in terms of discrete units that have no inherent relationship to one another, a continuous space model works in terms of word vectors where similar words are likely to have similar vectors. Thus, when the model parameters are adjusted in response to a particular word or word sequence, the improvements will carry over to occurrences of similar words and sequences.

By training a neural network language model, one obtains not just the model itself, but also the learned word representations, which may be used for other, potentially unrelated, tasks. This has been used to good effect, for example in (Collobert and Weston, 2008; Turian et al., 2010), where induced word representations are used with sophisticated classifiers to improve performance in many NLP tasks.

In this work, we find that the learned word representations in fact capture meaningful syntactic and semantic regularities in a very simple way. Specifically, the regularities are observed as constant vector offsets between pairs of words sharing a particular relationship. For example, if we denote the vector for word i as x_i, and focus on the singular/plural relation, we observe that x_apple - x_apples ≈ x_car - x_cars, x_family - x_families ≈ x_car - x_cars, and so on. Perhaps more surprisingly, we find that this is also the case for a variety of semantic relations, as measured by the SemEval-2012 task of measuring relation similarity.

The remainder of this paper is organized as follows. In Section 2, we discuss related work; Section 3 describes the recurrent neural network language model we used to obtain word vectors; Section 4 discusses the test sets; Section 5 describes our proposed vector offset method; Section 6 summarizes our experiments; and we conclude in Section 7.

2 Related Work

Distributed word representations have a long history, with early proposals including (Hinton, 1986; Pollack, 1990; Elman, 1991; Deerwester et al., 1990). More recently, neural network language models have been proposed for the classical language modeling task of predicting a probability distribution over the next word, given some preceding words. These models were first studied in the context of feed-forward networks (Bengio et al., 2003; Bengio et al., 2006), and later in the context of recurrent neural network models (Mikolov et al., 2010; Mikolov et al., 2011b). This early work demonstrated outstanding performance in terms of word prediction, but also the need for more computationally efficient models. This has been addressed by subsequent work using hierarchical prediction (Morin and Bengio, 2005; Mnih and Hinton, 2009; Le et al., 2011; Mikolov et al., 2011b; Mikolov et al., 2011a). Also of note, the use of distributed topic representations has been studied in (Hinton and Salakhutdinov, 2006; Hinton and Salakhutdinov, 2010), and (Bordes et al., 2012) presents a semantically driven method for obtaining word representations.

3 Recurrent Neural Network Model

Figure 1: Recurrent Neural Network Language Model.

The word representations we study are learned by a recurrent neural network language model (Mikolov et al., 2010), as illustrated in Figure 1. This architecture consists of an input layer, a hidden layer with recurrent connections, plus the corresponding weight matrices. The input vector w(t) represents the input word at time t encoded using 1-of-N coding, and the output layer y(t) produces a probability distribution over words. The hidden layer s(t) maintains a representation of the sentence history. The input vector w(t) and the output vector y(t) have dimensionality equal to the size of the vocabulary. The values in the hidden and output layers are computed as follows:

    s(t) = f(Uw(t) + Ws(t-1)),    (1)
    y(t) = g(Vs(t)),              (2)

where

    f(z) = 1 / (1 + e^{-z}),    g(z_m) = e^{z_m} / \sum_k e^{z_k}.    (3)

In this framework, the word representations are found in the columns of U, with each column representing a word. The RNN is trained with backpropagation to maximize the data log-likelihood under the model. The model itself has no knowledge of syntax, morphology, or semantics. Remarkably, training such a purely lexical model to maximize likelihood will induce word representations with striking syntactic and semantic properties.
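To make this forward computation concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3); the vocabulary size, hidden size, random initialization, and variable names are illustrative assumptions rather than details of the paper's toolkit.

```python
import numpy as np

# Illustrative sizes; the paper's systems use an 82k vocabulary and
# hidden layers of 80 to 640 units.
vocab_size, hidden_size = 10000, 80

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input weights; column i is the vector for word i
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward(word_ids, s=None):
    """Run the RNNLM forward over a sequence of word indices (Eqs. 1-3)."""
    if s is None:
        s = np.zeros(hidden_size)
    outputs = []
    for w_id in word_ids:
        # w(t) is 1-of-N, so U w(t) is simply the w_id-th column of U.
        s = sigmoid(U[:, w_id] + W @ s)   # Eq. (1)
        outputs.append(softmax(V @ s))    # Eqs. (2)-(3)
    return outputs, s

# The learned representation of word i is the i-th column of U:
word_vector = U[:, 42]
```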
4 Measuring Linguistic Regularity

4.1 A Syntactic Test Set

To better understand the syntactic regularities which are inherent in the learned representation, we created a test set of analogy questions of the form "a is to b as c is to __", testing base/comparative/superlative forms of adjectives; singular/plural forms of common nouns; possessive/non-possessive forms of common nouns; and base, past, and 3rd person present tense forms of verbs. More precisely, we tagged 267M words of newspaper text with Penn Treebank POS tags (Marcus et al., 1993). We then selected 100 of the most frequent comparative adjectives (words labeled JJR); 100 of the most frequent plural nouns (NNS); 100 of the most frequent possessive nouns (NN POS); and 100 of the most frequent base form verbs (VB). We then systematically generated analogy questions by randomly matching each of the 100 words with 5 other words from the same category, and creating variants as indicated in Table 1. The total test set size is 8000. The test set is available online.[1]

[1] http://research.microsoft.com/en-us/projects/rnn/default.aspx

Category     Relation                           Patterns Tested        # Questions   Example
Adjectives   Base/Comparative                   JJ/JJR, JJR/JJ         1000          good:better  rough:__
Adjectives   Base/Superlative                   JJ/JJS, JJS/JJ         1000          good:best  rough:__
Adjectives   Comparative/Superlative            JJS/JJR, JJR/JJS       1000          better:best  rougher:__
Nouns        Singular/Plural                    NN/NNS, NNS/NN         1000          year:years  law:__
Nouns        Non-possessive/Possessive          NN/NN POS, NN POS/NN   1000          city:city's  bank:__
Verbs        Base/Past                          VB/VBD, VBD/VB         1000          see:saw  return:__
Verbs        Base/3rd Person Singular Present   VB/VBZ, VBZ/VB         1000          see:sees  return:__
Verbs        Past/3rd Person Singular Present   VBD/VBZ, VBZ/VBD       1000          saw:sees  returned:__

Table 1: Test set patterns. For a given pattern and word pair, both orderings occur in the test set. For example, if see:saw return:__ occurs, so will saw:see returned:__.
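As an illustration of this construction, here is a small Python sketch of the matching step; the (base, inflected) pairs, function names, and sampling details are hypothetical stand-ins for whatever the authors' own scripts did with the POS-tagged corpus.

```python
import random

def build_questions(pairs, n_per_word=5, seed=0):
    """Given (base, inflected) pairs for one relation (e.g. good:better), randomly
    match each pair with n_per_word others from the same category to form analogy
    questions 'a is to b as c is to __' (answer d), in both orderings as in Table 1.
    100 pairs x 5 matches x 2 orderings = 1000 questions per pattern."""
    rng = random.Random(seed)
    questions = []
    for i, (a, b) in enumerate(pairs):
        others = pairs[:i] + pairs[i + 1:]
        for c, d in rng.sample(others, n_per_word):
            questions.append((a, b, c, d))   # e.g. good:better  rough:(rougher)
            questions.append((b, a, d, c))   # reversed pattern, e.g. JJR/JJ
    return questions

# Hypothetical input: base/comparative adjective pairs extracted from the
# POS-tagged corpus (the 100 most frequent JJR forms in the paper).
adj_pairs = [("good", "better"), ("rough", "rougher"), ("big", "bigger")]
print(build_questions(adj_pairs, n_per_word=2)[:4])
```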

4.2 A Semantic Test Set

In addition to syntactic analogy questions, we used SemEval-2012 Task 2, Measuring Relation Similarity (Jurgens et al., 2012), to estimate the extent to which RNNLM word vectors contain semantic information. The dataset contains 79 fine-grained word relations, of which 10 are used for training and 69 for testing. Each relation is exemplified by 3 or 4 gold word pairs. Given a group of word pairs that supposedly have the same relation, the task is to order the target pairs according to the degree to which this relation holds. This can be viewed as another analogy problem. For example, take the Class-Inclusion:Singular Collective relation with the prototypical word pair clothing:shirt. To measure the degree to which a target word pair dish:bowl has the same relation, we form the analogy "clothing is to shirt as dish is to bowl," and ask how valid it is.

5 The Vector Offset Method

As we have seen, both the syntactic and semantic tasks have been formulated as analogy questions. We have found that a simple vector offset method based on cosine distance is remarkably effective in solving these questions. In this method, we assume relationships are present as vector offsets, so that in the embedding space, all pairs of words sharing a particular relation are related by the same constant offset. This is illustrated in Figure 2.

In this model, to answer the analogy question a:b :: c:d where d is unknown, we find the embedding vectors x_a, x_b, x_c (all normalized to unit norm), and compute y = x_b - x_a + x_c. Here y is the continuous space representation of the word we expect to be the best answer. Of course, no word might exist at that exact position, so we then search for the word whose embedding vector has the greatest cosine similarity to y and output it:

    w* = argmax_w (x_w · y) / (||x_w|| ||y||).

When d is given, as in our semantic test set, we simply use cos(x_b - x_a + x_c, x_d) for the words provided.
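A minimal NumPy sketch of this procedure follows; the vocabulary list, embedding matrix, and helper names are placeholders for the learned word vectors (the columns of U from Section 3), not the authors' implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def answer_analogy(a, b, c, vocab, E):
    """Answer 'a is to b as c is to __': compute y = x_b - x_a + x_c and return
    the vocabulary word whose vector has the greatest cosine similarity to y."""
    idx = {w: i for i, w in enumerate(vocab)}
    y = E[idx[b]] - E[idx[a]] + E[idx[c]]
    scores = [cosine(E[i], y) for i in range(len(vocab))]
    return vocab[int(np.argmax(scores))]

def analogy_score(a, b, c, d, vocab, E):
    """When d is given, as in the semantic test set: cos(x_b - x_a + x_c, x_d)."""
    idx = {w: i for i, w in enumerate(vocab)}
    return cosine(E[idx[b]] - E[idx[a]] + E[idx[c]], E[idx[d]])

# Toy usage with a hypothetical 4-word vocabulary and random unit-norm vectors
# (with random vectors the answer is arbitrary; real RNN vectors are needed
# for meaningful output).
vocab = ["king", "man", "woman", "queen"]
E = np.random.default_rng(0).normal(size=(len(vocab), 80))
E /= np.linalg.norm(E, axis=1, keepdims=True)
print(answer_analogy("man", "king", "woman", vocab, E))
```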
We have explored several related methods and found that the proposed method performs well for both syntactic and semantic relations. We note that this measure is qualitatively similar to the relational similarity model of (Turney, 2012), which predicts similarity between the members of the word pairs (x_b, x_d), (x_c, x_d) and dissimilarity for (x_a, x_d).

Figure 2: Left panel shows vector offsets for three word pairs illustrating the gender relation. Right panel shows a different projection, and the singular/plural relation for two words. In high-dimensional space, multiple relations can be embedded for a single word.

6 Experimental Results

To evaluate the vector offset method, we used vectors generated by the RNN toolkit of Mikolov (2012). Vectors of dimensionality 80, 320, and 640 were generated, along with a composite of several systems, with total dimensionality 1600. The systems were trained with 320M words of Broadcast News data as described in (Mikolov et al., 2011a), and had an 82k vocabulary. Table 2 shows results for both RNNLM and LSA vectors on the syntactic task. LSA was trained on the same data as the RNN. We see that the RNN vectors capture significantly more syntactic regularity than the LSA vectors, and do remarkably well in an absolute sense, answering more than one in three questions correctly.[2]

[2] Guessing gets a small fraction of a percent.

Method     Adjectives   Nouns   Verbs   All
LSA-80          9.2      11.1    17.4   12.8
LSA-320        11.3      18.1    20.7   16.5
LSA-640         9.6      10.1    13.8   11.3
RNN-80          9.3       5.2    30.4   16.2
RNN-320        18.2      19.0    45.0   28.5
RNN-640        21.0      25.2    54.8   34.7
RNN-1600       23.9      29.2    62.2   39.6

Table 2: Results for identifying syntactic regularities for different word representations. Percent correct.

In Table 3 we compare the RNN vectors with those based on the methods of Collobert and Weston (2008) and Mnih and Hinton (2009), as implemented by (Turian et al., 2010) and available online.[3] Since different words are present in these datasets, we computed the intersection of the vocabularies of the RNN vectors and the new vectors, and restricted the test set and word vectors to those. This resulted in a 36k word vocabulary, and a test set with 6632 questions. Turian's Collobert and Weston based vectors do poorly on this task, whereas the Hierarchical Log-Bilinear Model vectors of (Mnih and Hinton, 2009) do essentially as well as the RNN vectors. These representations were trained on 37M words of data, and this may indicate a greater robustness of the HLBL method.

[3] http://metaoptimize.com/projects/wordreprs/

Method     Adjectives   Nouns   Verbs   All
RNN-80         10.1       8.1    30.4   19.0
CW-50           1.1       2.4     8.1    4.5
CW-100          1.3       4.1     8.6    5.0
HLBL-50         4.4       5.4    23.1   13.0
HLBL-100        7.6      13.2    30.2   18.7

Table 3: Comparison of RNN vectors with Turian's Collobert and Weston based vectors and the Hierarchical Log-Bilinear model of Mnih and Hinton. Percent correct.

We conducted similar experiments with the semantic test set. For each target word pair in a relation category, the model measures its relational similarity to each of the prototypical word pairs, and then uses the average as the final score. The results are evaluated using the two standard metrics defined in the task, Spearman's rank correlation coefficient and MaxDiff accuracy. In both cases, larger values are better. To compare to previous systems, we report the average over all 69 relations in the test set.
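A short sketch of this scoring, reusing the hypothetical analogy_score helper from the Section 5 sketch; how out-of-vocabulary pairs or ties are handled is not specified here and is left as an assumption.

```python
def relational_similarity(target_pair, prototype_pairs, vocab, E):
    """Average the Section 5 analogy score of the target pair against each
    prototypical pair, e.g. prototype clothing:shirt vs. target dish:bowl."""
    c, d = target_pair
    scores = [analogy_score(a, b, c, d, vocab, E) for (a, b) in prototype_pairs]
    return sum(scores) / len(scores)

def rank_targets(target_pairs, prototype_pairs, vocab, E):
    """Order target pairs by how strongly they exhibit the relation; Spearman's
    rank correlation and MaxDiff accuracy are then computed on this ranking."""
    return sorted(target_pairs,
                  key=lambda p: relational_similarity(p, prototype_pairs, vocab, E),
                  reverse=True)
```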
Method     Spearman's   MaxDiff Acc.
LSA-640       0.149        0.364
RNN-80        0.211        0.389
RNN-320       0.259        0.408
RNN-640       0.270        0.416
RNN-1600      0.275        0.418
CW-50         0.159        0.363
CW-100        0.154        0.363
HLBL-50       0.149        0.363
HLBL-100      0.146        0.362
UTD-NB        0.230        0.395

Table 4: Results in measuring relation similarity.

From Table 4, we see that as with the syntactic regularity study, the RNN-based representations perform best. In this case, however, Turian's CW vectors are comparable in performance to the HLBL vectors. With the RNN vectors, the performance improves as the number of dimensions increases. Surprisingly, we found that even though the RNN vectors are not trained or tuned specifically for this task, the model achieves better results (RNN-320, RNN-640 & RNN-1600) than the previously best performing system, UTD-NB (Rink and Harabagiu, 2012).
7 Conclusion

We have presented a generally applicable vector offset method for identifying linguistic regularities in continuous space word representations. We have shown that the word representations learned by an RNNLM do an especially good job in capturing these regularities. We present a new dataset for measuring syntactic performance, and achieve almost 40% correct. We also evaluate semantic generalization on the SemEval-2012 task, and outperform the previous state-of-the-art. Surprisingly, both results are the byproducts of an unsupervised maximum likelihood training criterion that simply operates on a large amount of text data.

References

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(6).

Y. Bengio, H. Schwenk, J.S. Senecal, F. Morin, and J.L. Gauvain. 2006. Neural probabilistic language models. Innovations in Machine Learning, pages 137-186.

A. Bordes, X. Glorot, J. Weston, and Y. Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics.

R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160-167. ACM.

S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(96).

J.L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2):195-225.

G.E. Hinton and R.R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507.

G. Hinton and R. Salakhutdinov. 2010. Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, 3(1):74-91.

G.E. Hinton. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12. Amherst, MA.

David Jurgens, Saif Mohammad, Peter Turney, and Keith Holyoak. 2012. SemEval-2012 Task 2: Measuring degrees of relational similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics (SemEval 2012), pages 356-364. Association for Computational Linguistics.

Hai-Son Le, I. Oparin, A. Allauzen, J.-L. Gauvain, and F. Yvon. 2011. Structured output layer neural network language model. In Proceedings of ICASSP 2011.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Tomas Mikolov, Martin Karafiat, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech 2010.

Tomas Mikolov, Anoop Deoras, Daniel Povey, Lukas Burget, and Jan Cernocky. 2011a. Strategies for training large scale neural network language models. In Proceedings of ASRU 2011.

Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2011b. Extensions of recurrent neural network based language model. In Proceedings of ICASSP 2011.

Tomas Mikolov. 2012. RNN toolkit.

A. Mnih and G.E. Hinton. 2009. A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems, 21:1081-1088.

F. Morin and Y. Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, pages 246-252.

J.B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1):77-105.

Bryan Rink and Sanda Harabagiu. 2012. UTD: Determining relational similarity using lexical patterns. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics (SemEval 2012), pages 413-418. Association for Computational Linguistics.

Holger Schwenk. 2007. Continuous space language models. Computer Speech and Language, 21(3):492-518.

J. Turian, L. Ratinov, and Y. Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the Association for Computational Linguistics (ACL 2010).

P.D. Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533-585.
