Linguistic Regularities in Continuous Space Word Representations
2 Related Work
Distributed word representations have a long history, with early proposals including (Hinton, 1986; Pollack, 1990; Elman, 1991; Deerwester et al., 1990). More recently, neural network language models have been proposed for the classical language modeling task of predicting a probability distribution over the next word, given some preceding words. These models were first studied in the context of feed-forward networks (Bengio et al., 2003; Bengio et al., 2006), and later in the context of recurrent neural network models (Mikolov et al., 2010; Mikolov et al., 2011b). This early work demonstrated outstanding performance in terms of word prediction, but also the need for more computationally efficient models. This has been addressed by subsequent work using hierarchical prediction (Morin and Bengio, 2005; Mnih and Hinton, 2009; Le et al., 2011; Mikolov et al., 2011b; Mikolov et al., 2011a). Also of note, the use of distributed topic representations has been studied in (Hinton and Salakhutdinov, 2006; Hinton and Salakhutdinov, 2010), and (Bordes et al., 2012) presents a semantically driven method for obtaining word representations.

3 Recurrent Neural Network Model

[Figure 1: Recurrent Neural Network Language Model.]

The word representations we study are learned by a recurrent neural network language model (Mikolov et al., 2010), as illustrated in Figure 1. This architecture consists of an input layer, a hidden layer with recurrent connections, plus the corresponding weight matrices. The input vector w(t) represents the input word at time t encoded using 1-of-N coding, and the output layer y(t) produces a probability distribution over words. The hidden layer s(t) maintains a representation of the sentence history. The input vector w(t) and the output vector y(t) have the dimensionality of the vocabulary. The values in the hidden and output layers are computed as follows:

    s(t) = f(Uw(t) + Ws(t-1))                                (1)
    y(t) = g(Vs(t)),                                         (2)

where

    f(z) = 1 / (1 + e^(-z)),    g(z_m) = e^(z_m) / Σ_k e^(z_k).    (3)

In this framework, the word representations are found in the columns of U, with each column representing a word. The RNN is trained with back-propagation to maximize the data log-likelihood under the model. The model itself has no knowledge of syntax, morphology, or semantics. Remarkably, training such a purely lexical model to maximize likelihood will induce word representations with striking syntactic and semantic properties.

4 Measuring Linguistic Regularity

4.1 A Syntactic Test Set

To understand better the syntactic regularities which are inherent in the learned representation, we created a test set of analogy questions of the form "a is to b as c is to __", testing base/comparative/superlative forms of adjectives; singular/plural forms of common nouns; possessive/non-possessive forms of common nouns; and base, past and 3rd person present tense forms of verbs. More precisely, we tagged 267M words of newspaper text with Penn
Category    Relation                          Patterns Tested       # Questions   Example
Adjectives  Base/Comparative                  JJ/JJR, JJR/JJ        1000          good:better  rough:__
Adjectives  Base/Superlative                  JJ/JJS, JJS/JJ        1000          good:best  rough:__
Adjectives  Comparative/Superlative           JJS/JJR, JJR/JJS      1000          better:best  rougher:__
Nouns       Singular/Plural                   NN/NNS, NNS/NN        1000          year:years  law:__
Nouns       Non-possessive/Possessive         NN/NN POS, NN POS/NN  1000          city:city's  bank:__
Verbs       Base/Past                         VB/VBD, VBD/VB        1000          see:saw  return:__
Verbs       Base/3rd Person Singular Present  VB/VBZ, VBZ/VB        1000          see:sees  return:__
Verbs       Past/3rd Person Singular Present  VBD/VBZ, VBZ/VBD      1000          saw:sees  returned:__

Table 1: Test set patterns. For a given pattern and word-pair, both orderings occur in the test set. For example, if see:saw return:__ occurs, so will saw:see returned:__.
Treebank POS tags (Marcus et al., 1993). We then selected 100 of the most frequent comparative adjectives (words labeled JJR); 100 of the most frequent plural nouns (NNS); 100 of the most frequent possessive nouns (NN POS); and 100 of the most frequent base form verbs (VB). We then systematically generated analogy questions by randomly matching each of the 100 words with 5 other words from the same category, and creating variants as indicated in Table 1. The total test set size is 8000. The test set is available online.¹

¹ http://research.microsoft.com/en-us/projects/rnn/default.aspx

4.2 A Semantic Test Set

In addition to syntactic analogy questions, we used the SemEval-2012 Task 2, Measuring Relation Similarity (Jurgens et al., 2012), to estimate the extent to which RNNLM word vectors contain semantic information. The dataset contains 79 fine-grained word relations, where 10 are used for training and 69 for testing. Each relation is exemplified by 3 or 4 gold word pairs. Given a group of word pairs that supposedly have the same relation, the task is to order the target pairs according to the degree to which this relation holds. This can be viewed as another analogy problem. For example, take the Class-Inclusion:Singular Collective relation with the prototypical word pair clothing:shirt. To measure the degree that a target word pair dish:bowl has the same relation, we form the analogy "clothing is to shirt as dish is to bowl", and ask how valid it is.

5 The Vector Offset Method

As we have seen, both the syntactic and semantic tasks have been formulated as analogy questions. We have found that a simple vector offset method based on cosine distance is remarkably effective in solving these questions. In this method, we assume relationships are present as vector offsets, so that in the embedding space, all pairs of words sharing a particular relation are related by the same constant offset. This is illustrated in Figure 2.

In this model, to answer the analogy question a:b c:d where d is unknown, we find the embedding vectors x_a, x_b, x_c (all normalized to unit norm), and compute y = x_b - x_a + x_c. Here y is the continuous space representation of the word we expect to be the best answer. Of course, no word might exist at that exact position, so we then search for the word whose embedding vector has the greatest cosine similarity to y and output it:

    w* = argmax_w (x_w · y) / (||x_w|| ||y||)

When d is given, as in our semantic test set, we simply use cos(x_b - x_a + x_c, x_d) for the words
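The vector offset computation is simple to sketch. The snippet below is a minimal illustration, not the paper's implementation: the embedding table is a tiny hand-built toy standing in for RNN-trained vectors, and excluding the question words from the candidate set is a common convention assumed here rather than stated above. `answer_analogy` implements w* = argmax_w cos(x_w, y) with y = x_b - x_a + x_c, and `relation_score` the cos(x_b - x_a + x_c, x_d) form used when d is given.

```python
import numpy as np

# Toy embedding table (hypothetical vectors, not learned ones),
# normalized to unit norm as in the method described above.
emb = {
    "king":   np.array([0.9, 0.1, 0.4]),
    "man":    np.array([0.8, 0.0, 0.1]),
    "woman":  np.array([0.7, 0.6, 0.1]),
    "queen":  np.array([0.8, 0.7, 0.4]),
    "banana": np.array([0.1, 0.9, 0.2]),   # distractor
}
emb = {w: v / np.linalg.norm(v) for w, v in emb.items()}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def answer_analogy(a, b, c, vocab=emb):
    """Solve a:b :: c:? by forming y = x_b - x_a + x_c and returning the
    word whose embedding has the greatest cosine similarity to y."""
    y = vocab[b] - vocab[a] + vocab[c]
    # Exclude the question words themselves from the candidates.
    candidates = (w for w in vocab if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vocab[w], y))

def relation_score(a, b, c, d, vocab=emb):
    """When d is given, score the analogy directly as cos(x_b - x_a + x_c, x_d)."""
    return cosine(vocab[b] - vocab[a] + vocab[c], vocab[d])

print(answer_analogy("man", "king", "woman"))   # "queen" for this toy table
```

Excluding a, b, and c from the argmax matters in practice: y often lies closest to x_b or x_c, so without the exclusion the method would frequently return a question word.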
[Figure 2: Left panel shows vector offsets for three word pairs illustrating the gender relation. Right panel shows a different projection, and the singular/plural relation for two words. In high-dimensional space, multiple relations can be embedded for a single word.]

Method     Adjectives   Nouns   Verbs    All
LSA-80        9.2        11.1    17.4   12.8
LSA-320      11.3        18.1    20.7   16.5
LSA-640       9.6        10.1    13.8   11.3
RNN-80        9.3         5.2    30.4   16.2
RNN-320      18.2        19.0    45.0   28.5
RNN-640      21.0        25.2    54.8   34.7
RNN-1600     23.9        29.2    62.2   39.6

Table 2: Results for identifying syntactic regularities for different word representations. Percent correct.
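The percent-correct figures in Table 2 amount to answering every analogy question with the vector offset method and counting exact matches. A minimal scoring loop in that spirit is sketched below; the embeddings are synthetic (plurals built as singular plus a fixed offset, so the method recovers them by construction), and the words and dimensionality are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def evaluate(questions, emb):
    """Percent of a:b :: c:d questions answered exactly right by
    guess = argmax_w cos(x_b - x_a + x_c, x_w), excluding a, b, c."""
    words = list(emb)
    mat = np.stack([emb[w] for w in words])
    norms = np.linalg.norm(mat, axis=1)
    correct = 0
    for a, b, c, d in questions:
        y = emb[b] - emb[a] + emb[c]
        sims = (mat @ y) / (norms * np.linalg.norm(y))   # cosine similarities
        for w in (a, b, c):                              # never return a question word
            sims[words.index(w)] = -np.inf
        correct += words[int(np.argmax(sims))] == d
    return 100.0 * correct / len(questions)

# Synthetic check: each plural is its singular plus one shared offset vector,
# so the constant-offset assumption holds exactly.
rng = np.random.default_rng(0)
offset = rng.normal(size=20)
emb = {}
for w in ["year", "law", "dog"]:
    v = rng.normal(size=20)
    emb[w] = v
    emb[w + "s"] = v + offset

questions = [("year", "years", "law", "laws"),
             ("law", "laws", "dog", "dogs"),
             ("year", "years", "dog", "dogs")]
print(evaluate(questions, emb))   # 100.0 for these exact-offset embeddings
```

On real learned embeddings the offsets hold only approximately, which is why Table 2 reports scores well below 100 even for the largest models.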