Automatic measures of semantic distance can be classified into two kinds: (1) those, such as
WordNet, that rely on the structure of manually created lexical resources and (2) those that
rely only on co-occurrence statistics from large corpora. Each kind has inherent strengths and
limitations. Here we present a hybrid approach that combines corpus statistics with the structure
of a Roget-like thesaurus to gain the strengths of each while avoiding many of their limitations.
We create distributional profiles (co-occurrence vectors) of coarse thesaurus concepts, rather
than words. This allows us to estimate the distributional similarity between concepts, rather
than words. We show that this approach can be ported to a cross-lingual framework, so as to
estimate semantic distance in a resource-poor language by combining its text with a thesaurus
in a resource-rich language. Extensive experiments, both monolingually and cross-lingually,
on ranking word pairs in order of semantic distance, correcting real-word spelling errors,
and solving word-choice problems show that these distributional measures of concept distance
markedly outperform traditional distributional word-distance measures and are competitive with
the best WordNet-based measures.
1. Introduction
Semantic distance is a measure of how close or distant two units of language are,
in terms of their meaning. The units of language may be words, phrases, sentences,
paragraphs, or documents. For example, the nouns dance and choreography are closer in
meaning than the nouns clown and bridge. These units of language, especially words,
may have more than one possible meaning. However, their context may be used to
determine the intended senses. For example, star can mean both CELESTIAL BODY and
CELEBRITY; however, star in the sentence below refers only to CELESTIAL BODY and is
much closer to sun than to famous:
Thus, semantic distance between words in context is in fact the distance between word
senses or concepts. (We use the terms word senses and concepts interchangeably here,
although later on we will make a distinction. Throughout this paper, example words
will be written in italics, as in the example sentence above, whereas example senses or
concepts will be written in small capitals.)
∗ E-mail: saif@umiacs.umd.edu
∗∗ E-mail: gh@cs.toronto.edu
2. Background
Semantic distance is of two kinds: semantic similarity and semantic relatedness. The
former is a subset of the latter, but the two may be used interchangeably in certain
contexts, making it even more important to be aware of their distinction. Two concepts are
considered to be semantically similar if there is a hyponymy (hypernymy), antonymy,
or troponymy relation between them. Two concepts are considered to be semantically
related if there is any lexical semantic relation between them—classical or non-classical.
Semantically similar concepts tend to share a number of common properties. For
example, consider APPLES and BANANAS. They are both hyponyms of FRUIT. They are
both edible, they grow on trees, they have seeds, etc.
Table 1
Word-pair datasets that have been manually annotated with distance values. Pearson’s
correlation was used to determine inter-annotator correlation (last column). Those used for
experiments reported in this paper are marked in bold. “n.r.” stands for “not reported”.
Dataset Year Language # pairs PoS # subjects Correlation
Rubenstein and Goodenough 1965 English 65 N 51 n.r.
Miller and Charles 1991 English 30 N n.r. .90
Resnik and Diab 2000 English 27 V n.r. .76 and .79
Finkelstein 2002 English 153 N 13 n.r.
Finkelstein 2002 English 200 N 16 n.r.
Gurevych 2005 German 65 N 24 .81
Zesch and Gurevych 2006 German 350 N, V, A 8 .69
Many will agree that humans are adept at estimating semantic distance, but consider
the following questions. How strongly will two people agree or disagree on distance
estimates? Will the agreement vary over different sets of concepts? In our minds, is
there a clear distinction between related and unrelated concepts or are concept-pairs
spread across the whole range from synonymous to unrelated?
Some of the earliest work that begins to answer these questions is by Rubenstein
and Goodenough (1965a). They conducted quantitative experiments with human subjects
(51 in all) who were asked to rate 65 English word pairs on a scale from 0.0 to 4.0
as per their semantic distance. The word pairs chosen ranged from almost synonymous
to unrelated. However, they were all noun pairs and those that were semantically close
were semantically similar; the dataset did not contain word pairs that were semantically
related but not semantically similar. The subjects repeated the annotation after two
weeks and the new distance values had a Pearson’s correlation r of 0.85 with the
old ones. Miller and Charles (1991) also conducted a similar study on 30 word pairs
taken from the Rubenstein-Goodenough pairs. These annotations had a high correlation
(r = 0.97) with the mean annotations of Rubenstein and Goodenough (1965a). Resnik
(1999) repeated these experiments and found the inter-annotator correlation (r) to be
0.90. Finkelstein (2002) asked human judges to rank two sets of noun pairs (153 pairs
and 200 pairs) in order of semantic distance. However, this dataset has certain
politically biased word pairs, such as Arafat–peace, Arafat–terror, Jerusalem–Israel, Jerusalem–
Palestinian, and so there might be less human agreement on ranking this data.
Resnik and Diab (2000) conducted annotations of 48 verb pairs and found inter-
annotator correlation (r) to be 0.76 when the verbs were presented without context and
0.79 when presented in context. Gurevych (2005) and Zesch et al. (2007) asked native
German speakers to mark two different sets of German word pairs with distance values.
Set 1 was a German translation of the Rubenstein and Goodenough (1965a) dataset. It
had 65 noun–noun word pairs. Set 2 was a larger dataset containing 350 word pairs
made up of nouns, verbs, and adjectives. The semantically close word pairs in the 65-pair
set were mostly synonyms or hypernyms (hyponyms) of each other, whereas
those in the 350-pair set had both classical and non-classical relations with each other.
Details of these semantic distance benchmarks are summarized in Table 1. Inter-subject
correlations (last column in Table 1) are indicative of the degree of ease in annotating
the datasets.
The high correlation values suggest that humans are quite good and consistent at
estimating semantic distance of noun-pairs; however, annotating verbs and adjectives
and combinations of parts of speech is harder. This also means that estimating semantic
relatedness is harder than estimating semantic similarity. It should be noted here that
even though the annotators were presented with word-pairs and not concept-pairs, it is
reasonable to assume that they were annotated as per their closest senses. For example,
given the noun pair bank and interest, most if not all will identify it as semantically
related even though both words have more than one sense and many of the sense–sense
combinations are unrelated (for example, the RIVER BANK sense of bank and the SPECIAL
ATTENTION sense of interest).
Apart from proving that humans can indeed estimate semantic distance, these
datasets act as “gold standards” to evaluate automatic distance measures. However,
lack of large amounts of data from human subject experimentation limits the reliability
of this mode of evaluation. Therefore automatic distance measures are also evaluated
by their usefulness in natural language tasks such as correcting real-word spelling
errors (Budanitsky and Hirst 2006) and solving word-choice problems (Turney 2001).
We evaluate our distributional concept-distance measures both intrinsically through
ranking human-judged word pairs in order of semantic distance as well as extrinsically
through natural language tasks such as correcting spelling errors and solving word-
choice problems.
Even though there are numerous distributional measures, many of which may seem
dramatically different from each other, all of them do the following: (1) choose a unit
of co-occurrence (e.g., word, word–syntactic-relation combination); (2) choose a measure
of strength of association (SoA) of the co-occurrence unit with the target word
(e.g., conditional probability, pointwise mutual information); (3) represent the target
words by vectors or points in the co-occurrence space (and possibly apply dimension
reduction);1 and (4) calculate the distance between the target vectors using a suitable
distributional measure (e.g., cosine, Euclidean distance). While any of the measures of
vector distance may be used with any of the measures of strength of association, in
practice only certain combinations are used (see Table 2) and certain other combinations are not.
1 The co-occurrence space is a hyper-dimensional space where each dimension is a unique co-occurrence
unit. If words are used as co-occurrence units, then this space has |V| dimensions, where V is the
vocabulary.
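To make the four-step recipe concrete, here is a minimal Python sketch of a distributional word-distance measure using the cosine–conditional probability combination (Cos–CP). The toy corpus, window size, and function names are illustrative only, not taken from the experiments reported here.

```python
from collections import Counter, defaultdict
from math import sqrt

def build_profiles(tokens, window=5):
    """Steps (1)-(3): count co-occurrences within a +/- window and turn the
    counts into conditional probabilities P(w | target)."""
    cooc = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[target][tokens[j]] += 1
    profiles = {}
    for target, counts in cooc.items():
        total = sum(counts.values())
        profiles[target] = {w: n / total for w, n in counts.items()}
    return profiles

def cosine(dp1, dp2):
    """Step (4): cosine between two distributional profiles."""
    num = sum(p * dp2.get(w, 0.0) for w, p in dp1.items())
    den = sqrt(sum(p * p for p in dp1.values())) * \
          sqrt(sum(p * p for p in dp2.values()))
    return num / den if den else 0.0

profiles = build_profiles("the star emits light the sun emits light".split())
print(cosine(profiles["star"], profiles["sun"]))
```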
Table 2
Measures of vector distance, measures of strength of association, and standard combinations.
Those used for experiments reported in this paper are marked in bold.
Measures of DP distance:
α-skew divergence (ASD), cosine (Cos), Dice coefficient (Dice), Euclidean distance
(L2 norm), Hindle's measure (Hin), Kullback-Leibler divergence (KLD), Manhattan
distance (L1 norm), Jensen–Shannon divergence (JSD), Lin's measure (Lin)

Measures of strength of association (SoA):
φ coefficient (Phi), conditional probability (CP), cosine (Cos), Dice coefficient (Dice),
log likelihood ratio (LLR), odds ratio (Odds), pointwise mutual information (PMI),
Yule's coefficient (Yule)

Standard combinations:
α-skew divergence—conditional probability (ASD–CP)
cosine—conditional probability (Cos–CP)
Dice coefficient—conditional probability (Dice–CP)
Euclidean distance—conditional probability (L2 norm–CP)
Hindle's measure—pointwise mutual information (Hin–PMI)
Kullback-Leibler divergence—conditional probability (KLD–CP)
Manhattan distance—conditional probability (L1 norm–CP)
Jensen–Shannon divergence—conditional probability (JSD–CP)
Lin's measure—pointwise mutual information (Lin–PMI)
As an example of a distributional profile, consider the word fusion:
fusion: heat 0.16, hydrogen 0.16, energy 0.13, bomb 0.09, light 0.09, space 0.04, ...
It shows that fusion has a strong tendency to co-occur with words such as heat, hydrogen,
and energy. The values are the pointwise mutual information between the target and
co-occurring words.
All experiments in this paper use simple word co-occurrences, and standard
combinations of vector distance and measure of association. To avoid clutter, instead of
referring to a distributional measure by its measure of vector distance and measure
of association (for example, α-skew divergence—conditional probability), we will refer
to it simply by the measure of vector distance (in this case, α-skew divergence). The
measures used in our experiments are α-skew divergence (ASD) (Lee 2001), cosine
(Cos) (Schütze and Pedersen 1997), Jensen-Shannon divergence (JSD) (Manning and
Schütze 2008), and that proposed by Lin (1998a) (Lin). Jensen–Shannon divergence
and α-skew divergence calculate the difference in distributions of words that co-occur
with the targets. Lin’s distributional measure follows from his information-theoretic
definition of similarity (Lin 1998b).
Resource-based measures are only as good as the lexical resource on which they rely.
3.1.2 Poor estimation of semantic relatedness. The most widely used WordNet-based
measures rely only on its extensive is-a hierarchy. This is because networks of other
lexical relations such as meronymy are much less developed. Further, the networks
for different parts of speech are not well connected. Thus, even though resource-based
measures are successful at estimating semantic similarity between nouns, they are poor
at estimating semantic relatedness—especially in pairs other than noun–noun. Also,
as Morris and Hirst (2004) pointed out, a large number of terms have a non-classical
relation between them and are semantically related (not semantically similar). On the
other hand, distributional measures can be used to determine both semantic relatedness
and semantic similarity (Mohammad and Hirst 2007).
3.1.3 Inability to cater to specific domains. Given a concept pair, measures that rely only
on a network and no text, such as that of Rada et al. (1989), give just one distance value.
However, two concepts may be very close in a certain domain but not so much in another.
For example, SPACE and TIME are close in the domain of quantum mechanics but not
so much in most others. Resources created for specific domains do exist; however, they
are rare. Some of the more successful WordNet-based measures, such as that of Jiang
and Conrath (1997), rely on text as well, and do indeed capture domain-specificity to
some extent, but the distance values are still largely affected by the underlying network,
which is not domain-specific. On the other hand, distributional measures rely primarily
(if not completely) on text, and large corpora specific to particular domains
can easily be collected.
3.2.2 Data sparseness. Since Zipf’s law seems to hold even for the largest of corpora,
there will always be words that occur too few times for distributional measures to
accurately estimate their distance with other words. On the other hand, a large number
of relatively obscure words may be listed in high-coverage resources such as WordNet
(WordNet has more than 155,000 unique tokens). Of course, manually created resources
are also lacking in a number of word-types. However, they tend to have groupings of
words into coarse concepts. This allows even corpus-based approaches to determine
properties of these coarse concepts through occurrences of the more frequent members
of a concept. In Section 4, we will propose a hybrid method of semantic distance that
does exactly that using the categories in a published thesaurus.
2 Even though WordNet-based and distributional measures give non-zero similarity and relatedness
values to a large number of term pairs (concept pairs and word pairs), values below a suitable threshold
can be reset to 0.
Figure 1
An overview of the distributional concept-distance approach (the key components are
the word–category co-occurrence matrix, bootstrapping, and sense disambiguation).
We now propose a hybrid approach that combines corpus statistics with a published
thesaurus (Mohammad and Hirst 2006b; Mohammad et al. 2007). It overcomes, with
varying degrees of success, many of the limitations described in Section 3 earlier. Our
goal is to gain the performance of resource-based methods and the breadth of
distributional methods. The central ideas are these:

- The concepts (word senses) are taken from the category structure of a Roget-like thesaurus.
- In order to avoid data sparseness, the concepts are very coarse-grained.
- The distributional component of the method is based on concepts, not surface strings. We create distributional profiles (co-occurrence vectors) of concepts.
The sub-sections below describe our approach in detail. Figure 1 depicts the key steps.
A Roget-style thesaurus classifies all word types into approximately 1000 categories.
Words within a category tend to be semantically related to each other. Words with more
than one sense are listed in more than one category. Each category has a head word that
best represents the meaning of all the words in the category. Some example categories
are CLOTHING, HONESTY, and DESIRE. Each category is divided into paragraphs that
classify lexical units more finely; however, we do not make use of this information.
We take these thesaurus categories as the coarse-grained concepts of our method.
That is, for our semantic distance measure, there are only around 1000 concepts (word-senses)
in the world; each lexical unit is a pairing of the surface string with the thesaurus
category (concept) in which it is listed.
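For illustration, such a thesaurus can be represented as a mapping from categories to word sets, with an inverted index giving each word its coarse senses. The fragment below is a hypothetical, heavily abridged sketch; the real resource has about 1000 categories.

```python
# Toy fragment of a Roget-style thesaurus (abridged and illustrative).
thesaurus = {
    "CELESTIAL BODY": {"constellation", "planet", "star", "sun"},
    "CELEBRITY": {"celebrity", "hero", "star"},
}

# Inverted index: surface string -> set of categories (its coarse senses).
senses = {}
for category, words in thesaurus.items():
    for w in words:
        senses.setdefault(w, set()).add(category)

print(senses["star"])  # {'CELESTIAL BODY', 'CELEBRITY'}: an ambiguous word
```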
If we apply the distributional hypothesis (Firth 1957; Harris 1968) to word senses (in-
stead of words), then the hypothesis states that words when used in different senses tend
to keep different “company” (co-occurring words). Therefore, we propose the creation of
distributional profiles (DPs) of word senses or concepts, rather than those of words. The
closer the distributional profiles of two concepts, the smaller is their semantic distance.
Below are example distributional profiles of two senses of the word star:
CELESTIAL BODY: space 0.36, light 0.27, constellation 0.11, hydrogen 0.07, ...
CELEBRITY: famous 0.24, movie 0.14, rich 0.14, fan 0.10, ...
It should be noted that creating such distributional profiles of concepts is much more
challenging than creating distributional profiles of words, which involve simple
word–word co-occurrence counts. (In the next sub-section, we show how these profiles may be
estimated without the use of any sense-annotated data). However, once created, any of
the many measures of vector distance can be used to estimate the distance between the
DPs of two target concepts (just as in the case of traditional word-distance measures,
measures of vector distance are used to estimate the distance between the DPs of
two target words). For example, here is how cosine is traditionally used to estimate
distributional distance between two words.
\[
\mathrm{Cos}_{cp}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \times P(w|w_2)}
{\sqrt{\sum_{w \in C(w_1)} P(w|w_1)^2} \times \sqrt{\sum_{w \in C(w_2)} P(w|w_2)^2}} \tag{1}
\]
C(t) is the set of words that co-occur (within a certain window) with the word t in
a corpus. The conditional probabilities in the formula are taken from the distributional
profiles of words. We adapt the formula to estimate distributional distance between two
concepts as shown below:
\[
\mathrm{Cos}_{cp}(c_1, c_2) = \frac{\sum_{w \in C(c_1) \cup C(c_2)} P(w|c_1) \times P(w|c_2)}
{\sqrt{\sum_{w \in C(c_1)} P(w|c_1)^2} \times \sqrt{\sum_{w \in C(c_2)} P(w|c_2)^2}} \tag{2}
\]
3 WordNet has more than 117,000 synsets. To counter this fine-grainedness, methods to group synsets into
coarser senses have been proposed (Agirre and Lopez de Lacalle Lekuona 2003; Navigli 2006).
C(x) is now the set of words that co-occur with concept x within a pre-determined
window. The conditional probabilities in the formula are taken from the distributional
profiles of concepts.
If the distance between two words is required, and their intended senses are not
known, then the distance between all relevant sense pairs is determined and the minimum
is chosen. (This is the heuristic described earlier in Section 3.2.1, and is exactly
how WordNet-based measures of concept-distance are used too (Budanitsky and Hirst
2006).) For example, if star has the two senses mentioned above and fusion has one (let’s
call it FUSION), then the distance between them is determined by first applying cosine
(or any vector distance measure) to the DPs of CELESTIAL BODY and FUSION:
CELESTIAL BODY: space 0.36, light 0.27, constellation 0.11, hydrogen 0.07, ...
FUSION: heat 0.16, hydrogen 0.16, energy 0.13, bomb 0.09, light 0.09, space 0.04, ...
and then to the DPs of CELEBRITY and FUSION. Finally, the score that implies the greatest closeness (least distance) is chosen.
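A minimal sketch of this closest-sense heuristic, reusing the illustrative cosine function and senses index from the earlier sketches; dpc stands for a hypothetical dictionary mapping each category to its distributional profile:

```python
def word_distance(w1, w2, senses, dpc):
    """Distance between two words: evaluate every pair of candidate senses
    and keep the closest pair (cosine is a similarity, so take the max)."""
    best = max(cosine(dpc[c1], dpc[c2])
               for c1 in senses[w1] for c2 in senses[w2])
    return 1.0 - best  # convert the similarity into a distance

# For star and fusion this compares CELESTIAL BODY vs FUSION and
# CELEBRITY vs FUSION, and keeps whichever sense pair is closer.
```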
To create distributional profiles of concepts, we build a word–category co-occurrence
matrix with word types as one dimension and thesaurus categories (concepts) as the other:

\[
\begin{array}{c|ccccc}
       & c_1    & c_2    & \cdots & c_j    & \cdots \\
\hline
w_1    & m_{11} & m_{12} & \cdots & m_{1j} & \cdots \\
w_2    & m_{21} & m_{22} & \cdots & m_{2j} & \cdots \\
\vdots & \vdots & \vdots & \ddots & \vdots &        \\
w_i    & m_{i1} & m_{i2} & \cdots & m_{ij} & \cdots \\
\vdots & \vdots & \vdots &        & \vdots & \ddots \\
\end{array}
\]
The matrix is populated with co-occurrence counts from a large corpus. A particular
cell mij, corresponding to word wi and category or concept cj, is populated with the
number of times wi co-occurs (in a window of ±5 words) with any word that has cj
as one of its senses (i.e., wi co-occurs with any word listed under concept cj in the
thesaurus). For example, assume that the concept of CELESTIAL BODY is represented by
four words in the thesaurus: constellation, planet, star, and sun. If the word space co-occurs
with constellation (15 times), planet (50 times), star (40 times), and sun (65 times) in the
given text corpus, then the cell for space and CELESTIAL BODY in the WCCM is populated
with 170 (15 + 50 + 40 + 65). This matrix, created after a first pass of the corpus, is called
the base word–category co-occurrence matrix (base WCCM).
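A sketch of how such a base WCCM might be built in a single pass, under the same illustrative data structures as before; the tokenized-corpus input and the function name are assumptions:

```python
from collections import Counter, defaultdict

def build_base_wccm(tokens, senses, window=5):
    """Cell (w, c) counts how often w co-occurs, within a +/- window, with
    any word listed under category c in the thesaurus.  No disambiguation:
    every candidate sense of a co-occurring word is credited."""
    wccm = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            for c in senses.get(tokens[j], ()):
                wccm[w][c] += 1
    return wccm
```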
The choice of ±5 words as window size is somewhat arbitrary and hinges on the
intuition that words close to a target word are more indicative of its semantic properties
than those more distant. Church and Hanks (1990), in their seminal work on word–word
co-occurrence association, also use a window size of ±5 words and argue that this size
is large enough to capture many verb–argument dependencies and yet small enough
that adjacency information is not diluted too much.
A contingency table for any particular word w and category c can be easily gener-
ated from the WCCM by collapsing cells for all other words and categories into one and
summing up their frequencies.
\[
\begin{array}{c|cc}
        & c          & \neg c       \\
\hline
w       & n_{wc}     & n_{w\neg}    \\
\neg w  & n_{\neg c} & n_{\neg\neg} \\
\end{array}
\]
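Applying a suitable statistic to these counts then yields the strength of association of the word with the category. Below is a sketch of pointwise mutual information computed from the WCCM of the previous sketch; the function name and marginal bookkeeping are assumptions:

```python
from math import log

def pmi(wccm, w, c):
    """Pointwise mutual information of word w and category c, estimated
    from the word-category co-occurrence counts."""
    n_wc = wccm[w][c]
    if n_wc == 0:
        return float("-inf")                       # never co-occur
    n_w = sum(wccm[w].values())                    # row marginal for w
    n_c = sum(row[c] for row in wccm.values())     # column marginal for c
    n = sum(sum(row.values()) for row in wccm.values())
    return log(n_wc * n / (n_w * n_c))
```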
Figure 2
The word space will co-occur with a number of words w ∈ {constellation, planet, star, sun}
that each have one sense of CELESTIAL BODY in common.

Figure 3
The base WCCM captures strong word–category co-occurrence strength of association (SoA).
The sense of the target word whose category has the highest cumulative strength of
association with its co-occurring words is chosen as the intended sense of the target word.
In this second pass, a new bootstrapped WCCM is created such that each cell mij,
corresponding to word wi and concept cj, is populated with the number of times wi co-occurs with any word
used in sense cj . For example, consider again the 40 times star co-occurs with space. If
the contexts of 25 of these instances have higher cumulative strength of association
with CELESTIAL BODY than with CELEBRITY, suggesting that in only 25 of those 40
occurrences star was used in the CELESTIAL BODY sense, then the cell for space–CELESTIAL
BODY is incremented by 25 rather than 40 (as was the case in the base WCCM). This
bootstrapped WCCM, created after simple and fast word sense disambiguation, will
better capture word–concept co-occurrence values, and hence strengths of association
values, than the base WCCM.4
The bootstrapping step can be repeated; however, further iterations do not improve
results significantly. This is not surprising because the base WCCM was created without
any word sense disambiguation and so the first bootstrapping iteration with word sense
disambiguation will markedly improve the matrix. The same is not true for subsequent
iterations.
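A sketch of the bootstrapping pass under the same illustrative structures; soa would typically be the PMI computed from the base WCCM, and the disambiguation rule is the cumulative-SoA choice described above:

```python
from collections import Counter, defaultdict

def bootstrap_wccm(tokens, senses, soa, window=5):
    """Second pass: disambiguate each token as the candidate category with
    the highest cumulative SoA with its context, then recount so that only
    the chosen sense is credited."""
    new_wccm = defaultdict(Counter)
    for i, x in enumerate(tokens):
        cats = senses.get(x, ())
        if not cats:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        best = max(cats, key=lambda c: sum(soa(w, c) for w in context))
        for w in context:
            new_wccm[w][best] += 1  # credit only the chosen sense of x
    return new_wccm
```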
4 Speed of disambiguation is important here as all words in the corpus are to be disambiguated. After
determining co-occurrence counts from the BNC (a 100 million word corpus), creating the bootstrapped
WCCM from the base WCCM took only about 4 hours on a 1.3GHz machine with 16GB memory.
5 Recall that the Macquarie Thesaurus has 98,000 word types and 812 categories.
Table 3
Correlations with human ranking of Rubenstein and Goodenough word pairs of automatic
rankings using traditional word–word co-occurrence–based distributional word-distance
measures and the word–concept co-occurrence–based distributional concept-distance measures.
Best results for each measure-type are shown in boldface.
                                            Measure-type
Distributional measure           Word-distance    Concept-distance
                                                  closest   average
α-skew divergence                    0.45           0.60       –
cosine                               0.54           0.69      0.42
Jensen–Shannon divergence            0.48           0.61       –
Lin's distributional measure         0.52           0.71      0.59
Figure 4
Correlations with human ranking of Rubenstein and Goodenough word pairs of automatic
rankings using traditional word–word co-occurrence–based distributional word-distance
measures and the word–concept co-occurrence–based distributional concept-distance measures.
The set of Rubenstein and Goodenough word pairs is much too small to safely assume
that measures that work well on them do so for the entire English vocabulary.
Consequently, semantic measures have traditionally been evaluated through more extensive
applications such as the work by Hirst and Budanitsky (2005) on correcting real-word
spelling errors (or malapropisms). If a word in a text is not semantically close to any
other word in its context, then it is considered a suspect. If the suspect has a spelling-
variant that is semantically close to a word in its context, then the suspect is declared a
probable real-word spelling error and an alarm is raised; the semantically close spelling-
variant is considered its correction. Hirst and Budanitsky tested the method on 500
articles from the 1987–89 Wall Street Journal corpus, replacing one noun in every 200
words with a spelling-variant and looking at whether the method could restore the
original word. This resulted in text with 1408 real-word spelling errors out of a total of
107,233 noun tokens. We adopt this method and this test data,
but whereas Hirst and Budanitsky used WordNet-based semantic measures, we use
distributional concept- and word-distance measures.
In order to determine whether two words are “semantically close” or not as per
any measure of distance, a threshold must be set. If the distance between two words
is less than the threshold, then they will be considered semantically close. Hirst and
Budanitsky (2005) pointed out that there is a notably wide band in the human ratings
of the Rubenstein and Goodenough word pairs such that no word-pair was assigned
a distance value between 1.83 and 2.36 (on a scale of 0–4). They argue that somewhere
within this band is a suitable threshold between semantically close and semantically
distant, and therefore set thresholds for the WordNet-based measures such that there
was maximum overlap in what the automatic measures and human judgments
considered semantically close and distant. Following this idea, we use an automatic method
to determine thresholds for the various distributional concept- and word-distance
measures. Given a list of Rubenstein and Goodenough word pairs ordered according to
a distance measure, we take the mean of each pair of adjacent distance values as a
candidate threshold. For each candidate threshold, we then determine the number of
word-pairs correctly classified as semantically close or semantically distant, considering
which side of the band they lie on as per the human judgments. The candidate threshold
with the highest accuracy is chosen as the threshold.
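A sketch of this threshold search; representing each Rubenstein and Goodenough pair as a (distance, judged-close) tuple is bookkeeping assumed for illustration, not part of the original description:

```python
def pick_threshold(pairs):
    """pairs: (distance, human_judged_close) tuples for the Rubenstein and
    Goodenough word pairs.  Candidate thresholds are means of adjacent
    distance values; keep the one that agrees most with the human side of
    the band."""
    pairs = sorted(pairs)
    best_t, best_acc = None, -1
    for (d1, _), (d2, _) in zip(pairs, pairs[1:]):
        t = (d1 + d2) / 2.0
        acc = sum((d < t) == close for d, close in pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```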
We follow the Hirst and St. Onge (1998) metrics to evaluate real-word spelling
correction. Suspect ratio and alarm ratio evaluate the processes of identifying suspects
and raising alarms, respectively.
Detection ratio is the product of the two, and measures overall performance in detecting
the errors.
Notice that the correction ratio is the product of the detection ratio and correction
accuracy. The overall (single-point) precision (P), recall (R), and F-score (F) of detection
are also computed.
\[
P = \frac{\text{number of true-alarms}}{\text{number of alarms}} \tag{9}
\]
\[
R = \frac{\text{number of true-alarms}}{\text{number of malapropisms}} \tag{10}
\]
\[
F = \frac{2 \times P \times R}{P + R} \tag{11}
\]
The product of detection F-score and correction accuracy, which we will call correction
performance, can also be used as a bottom-line performance metric.
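As a worked instance of this chain of metrics, using only the definitions above, consider the α-skew divergence row in the distributional word-distance block of Table 4: detection ratio = suspect ratio × alarm ratio = 3.36 × 1.78 ≈ 5.98; correction ratio = detection ratio × correction accuracy = 5.98 × 0.84 ≈ 5.03; detection F = (2 × 7.37 × 45.53) / (7.37 + 45.53) ≈ 12.69; and correction performance = detection F × correction accuracy = 12.69 × 0.84 ≈ 10.66.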
Table 4 details the performance of distributional word- and concept-distance mea-
sures. For comparison, the table also lists results obtained by Hirst and Budanitsky
(2005) using WordNet-based concept-distance measures: those of Hirst and St. Onge
(1998), Jiang and Conrath (1997), Leacock and Chodorow (1998), Lin (1997), and Resnik
(1995). The last two are information content measures that rely on finding the lowest
common subsumer (lcs) of the target synsets in WordNet’s hypernym hierarchy and use
corpus counts to determine how specific or general this concept is. The more specific the
lcs is and the smaller the difference of its specificity with that of the target concepts, the
closer the target concepts are considered. (See Budanitsky and Hirst (2001) for more
details.)
Observe that the correction ratio results for the distributional word-distance
measures are poor compared to the distributional concept-distance measures; the
concept-distance measures are clearly superior, in particular α-skew divergence and cosine.
(Figure 5 depicts the results in a graph.) Moreover, if we consider correction ratio to be
the bottom-line statistic, then three of the four distributional concept-distance measures
outperform all the WordNet-based measures except the Jiang–Conrath measure. If we
consider correction performance to be the bottom-line statistic, then again we see that
the distributional concept-distance measures outperform the word-distance measures,
except in the case of Lin’s distributional measure, which gives slightly poorer results
with concept-distance.
Table 4
Real-word spelling error correction. The best results as per the two bottom-line statistics, correction ratio and correction performance, are shown in
boldface.
Measure                          suspect  alarm  detection  correction  correction    detection         correction
                                 ratio    ratio  ratio      accuracy    ratio        P      R      F    performance

Distributional word-distance
α-skew divergence                 3.36    1.78    5.98        0.84       5.03        7.37  45.53  12.69    10.66
cosine                            2.91    1.64    4.77        0.85       4.06        5.97  37.15  10.28     8.74
Jensen–Shannon divergence         3.29    1.77    5.82        0.83       4.88        7.19  44.32  12.37    10.27
Lin's distributional measure      3.63    2.15    7.78        0.84       6.52        9.38  58.38  16.16    13.57

Distributional concept-distance
α-skew divergence                 4.11    2.54   10.43        0.91       9.49       12.19  25.28  16.44    14.96
cosine                            4.00    2.51   10.03        0.90       9.05       11.77  26.99  16.38    14.74
Jensen–Shannon divergence         3.58    2.46    8.79        0.90       7.87       10.47  34.66  16.08    14.47
Lin's distributional measure      3.02    2.60    7.84        0.88       6.87        9.45  36.86  15.04    13.24

WordNet-based concept-distance
Hirst–St-Onge                     4.24    1.95    8.27        0.93       7.70        9.67  26.33  14.15    13.16
Jiang–Conrath                     4.73    2.97   14.02        0.92      12.91       14.33  46.22  21.88    20.13
Leacock–Chodorow                  3.23    2.72    8.80        0.83       7.30       11.56  60.33  19.40    16.10
Lin's WordNet-based measure       3.57    2.71    9.70        0.87       8.48        9.56  51.56  16.13    14.03
Resnik                            2.58    2.75    7.10        0.78       5.55        9.00  55.00  15.47    12.07
Figure 5
Correction ratio obtained on the real-word spelling correction task using traditional word–word
co-occurrence–based distributional word-distance measures and the word–concept
co-occurrence–based distributional concept-distance measures.
CELESTIAL BODY (celestial body, sun, ...): space 0.36, light 0.27, constellation 0.11,
hydrogen 0.07, ...
CELEBRITY (celebrity, hero, ...): famous 0.24, movie 0.14, rich 0.14, fan 0.10, ...
Table 5
Vocabulary of German words needed to understand this discussion.
German word      Meaning(s)
Bank             1. financial institution; 2. bench (furniture)
berühmt          famous
Bombe            bomb
Erwärmung        heat
Film             movie (motion picture)
Himmelskörper    heavenly body
Konstellation    constellation
Licht            light
Morgensonne      morning sun
Raum             space
reich            rich
Sonne            sun
Star             star (celebrity)
Stern            star (celestial body)
Verschmelzung    fusion
CELESTIAL BODY (celestial body, sun, ...): Raum 0.36, Licht 0.27, Konstellation 0.11, ...
CELEBRITY (celebrity, hero, ...): berühmt 0.24, Film 0.14, reich 0.14, ...
The values are the strength of association (usually pointwise mutual information or
conditional probability) of the target concept with co-occurring words. In order to
calculate the strength of association, we must first determine individual word and concept
counts, as well as their co-occurrence counts. The next section describes how these can
be estimated without the use of any word-aligned parallel corpora and without any
sense-annotated data. The closer the cross-lingual DPs of two concepts, the smaller
is their semantic distance. Just as in the case of monolingual distributional concept-
distance measures (described in Section 4.2 earlier), distributional measures can be used
to estimate the distance between the cross-lingual DPs of two target concepts. For
example, recall how cosine is used in a monolingual framework to estimate distributional
distance between two concepts:
\[
\mathrm{Cos}_{cp}(c_1, c_2) = \frac{\sum_{w \in C(c_1) \cup C(c_2)} P(w|c_1) \times P(w|c_2)}
{\sqrt{\sum_{w \in C(c_1)} P(w|c_1)^2} \times \sqrt{\sum_{w \in C(c_2)} P(w|c_2)^2}} \tag{12}
\]
C(x) is the set of English words that co-occur with English concept x within a pre-
determined window. The conditional probabilities in the formula are taken from the
monolingual distributional profiles of concepts. We can adapt the formula to estimate
cross-lingual distributional distance between two concepts as shown below:
\[
\mathrm{Cos}_{cp}(c^{en}_1, c^{en}_2) = \frac{\sum_{w^{de} \in C(c^{en}_1) \cup C(c^{en}_2)} \bigl( P(w^{de}|c^{en}_1) \times P(w^{de}|c^{en}_2) \bigr)}
{\sqrt{\sum_{w^{de} \in C(c^{en}_1)} P(w^{de}|c^{en}_1)^2} \times \sqrt{\sum_{w^{de} \in C(c^{en}_2)} P(w^{de}|c^{en}_2)^2}} \tag{13}
\]
C(x) is now the set of German words that co-occur with English concept x within a
pre-determined window. The conditional probabilities in the formula are taken from
the cross-lingual DPCs.
If the distance between two German words is required, then the distance between
all relevant English cross-lingual candidate sense pairs is determined and the minimum
is chosen. For example, if Stern has the two cross-lingual candidate senses mentioned
above and Verschmelzung has one (FUSION), then the distance between them is
determined by first applying cosine (or any distributional measure) to the cross-lingual DPs
of CELESTIAL BODY and FUSION:
CELESTIAL BODY (celestial body, sun, ...): Raum 0.36, Licht 0.27, Konstellation 0.11, ...
FUSION (thermonuclear reaction, atomic reaction, ...): Erwärmung 0.16, Bombe 0.09,
Licht 0.09, Raum 0.04, ...
and then to the cross-lingual DPs of CELEBRITY and FUSION. Finally, the sense pair
with the minimum semantic distance, that is, maximum similarity/relatedness, is chosen.
A cross-lingual word–category co-occurrence matrix is created, with German word
types w^{de} as one dimension and English thesaurus categories c^{en} as the other:

\[
\begin{array}{c|ccccc}
         & c^{en}_1 & c^{en}_2 & \cdots & c^{en}_j & \cdots \\
\hline
w^{de}_1 & m_{11}   & m_{12}   & \cdots & m_{1j}   & \cdots \\
w^{de}_2 & m_{21}   & m_{22}   & \cdots & m_{2j}   & \cdots \\
\vdots   & \vdots   & \vdots   & \ddots & \vdots   &        \\
w^{de}_i & m_{i1}   & m_{i2}   & \cdots & m_{ij}   & \cdots \\
\vdots   & \vdots   & \vdots   &        & \vdots   & \ddots \\
\end{array}
\]
The matrix is populated with co-occurrence counts from a large German corpus.
A particular cell m_{ij}, corresponding to German word w^{de}_i and English concept
c^{en}_j, is populated with the number of times the German word w^{de}_i co-occurs (in a
window of ±5 words) with any German word having c^{en}_j as one of its cross-lingual candidate senses. For example,
the Raum–CELESTIAL BODY cell will have the sum of the number of times Raum co-
occurs with Himmelskörper, Sonne, Morgensonne, Star, Stern, and so on (see Figure 7). This
matrix, created after a first pass of the corpus, is called the cross-lingual base WCCM. A
contingency table for any particular German word wde and English category cen can be
easily generated from the WCCM by collapsing cells for all other words and categories
into one and summing up their frequencies.
\[
\begin{array}{c|cc}
            & c^{en}           & \neg c^{en}    \\
\hline
w^{de}      & n_{w^{de}c^{en}} & n_{w^{de}\neg} \\
\neg w^{de} & n_{\neg c^{en}}  & n_{\neg\neg}   \\
\end{array}
\]
The application of a suitable statistic, such as PMI or conditional probability, will then
yield the strength of association between the German word and the English category.
As the cross-lingual base WCCM is created from unannotated text, it will be noisy
(for the same word-sense-ambiguity reasons as to why the monolingual base WCCM
is noisy—explained in Section 4.3.1 earlier). Yet, again, the cross-lingual base WCCM
does capture strong associations between a category (concept) and co-occurring words
(just like the monolingual base WCCM). For example, even though we increment counts
for both Raum–CELESTIAL BODY and Raum–CELEBRITY for a particular instance where
Raum co-occurs with Star, Raum will co-occur with a number of words such as
Himmelskörper, Sonne, and Morgensonne that each have the sense of CELESTIAL BODY in common
(see Figures 7 and 8), whereas all their other senses are likely different and distributed
across the set of concepts. Therefore, the co-occurrence count of Raum and CELESTIAL
BODY, and thereby their strength of association, will be relatively higher than those of
Raum and CELEBRITY (Figure 9). Therefore, we bootstrap the matrix just as described
before in the monolingual case (Section 4.3.2).
Figure 8
The word Raum will also co-occur with a number of other words x ∈ {Stern, Sonne,
Himmelskörper, Morgensonne, Konstellation} that each have one sense of CELESTIAL BODY
in common.

Figure 9
The base WCCM captures strong word–category co-occurrence strength of association (SoA).
The German-English distributional profiles were created using the following
resources: the German newspaper corpus taz6 (Sep 1986 to May 1999; 240 million words),
the English Macquarie Thesaurus (Bernard 1986) (about 98,000 word types), and the
German–English bilingual lexicon BEOLINGUS7 (about 265,000 entries). Multi-word
expressions in the thesaurus and the bilingual lexicon were ignored. We used a context
of ±5 words on either side of the target word for creating the base and bootstrapped
WCCMs. No syntactic pre-processing was done, nor were the words stemmed,
lemmatized, or part-of-speech tagged.
In order to compare results with state-of-the-art monolingual approaches we
conducted experiments using GermaNet measures as well. The specific distributional
measures and GermaNet-based measures used are listed in Table 6. The GermaNet measures
used are of two kinds: (1) information content measures, and (2) Lesk-like measures
that rely on n-gram overlaps in the glosses of the target senses, proposed by Gurevych
(2005). As GermaNet does not have glosses for synsets, Gurevych (2005) proposed a
way of creating a bag-of-words-type pseudo-gloss for a synset by including the words
in the synset and in synsets close to it in the network. The information content measures
rely on finding the lowest common subsumer (lcs) of the target synsets in a hypernym
hierarchy and using corpus counts to determine how specific or general this concept is.
The more specific the lcs is and the smaller the difference of its specificity with that of
the target concepts, the closer the target concepts are.
6 http://www.taz.de
7 http://dict.tu-chemnitz.de
Table 6
Distance measures used in the experiments.
(Cross-lingual) Distributional measures:
  α-skew divergence (Lee 2001)
  cosine (Schütze and Pedersen 1997)
  Jensen-Shannon divergence (Dagan, Lee, and Pereira 1994)
  Lin (1998a)

(Monolingual) GermaNet measures:
  Information content–based: Jiang and Conrath (1997), Lin (1998b), Resnik (1995)
  Lesk-like: hypernym pseudo-gloss (Gurevych 2005), radial pseudo-gloss (Gurevych 2005)
7.1.2 Results and Discussion. Word-pair distances determined using different distance
measures are compared in two ways with the two human-created benchmarks. The rank
ordering of the pairs from closest to most distant is evaluated with Spearman’s rank
order correlation ρ; the distance judgments themselves are evaluated with Pearson’s
correlation coefficient r. The higher the correlation, the more accurate the measure is.
Spearman’s correlation ignores actual distance values after a list is ranked—only the
ranks of the two sets of word pairs are compared to determine correlation. On the other
hand, Pearson’s coefficient takes into account actual distance values. So even if two lists
are ranked the same but one has distances between consecutively ranked word-pairs
more in line with the human annotations of distance than the other, Pearson's
coefficient will capture this difference. However, this makes Pearson's coefficient sensitive
to outlier data points, and so one must interpret it with caution. Therefore, Spearman’s
rank correlation is more common in the semantic distance literature. However, many of
the experiments on German data report Pearson's correlation. We report both
correlations in Table 7.
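For illustration, both statistics are available in standard libraries; the sketch below uses SciPy's implementations, and the toy values exist only to show the call pattern:

```python
from scipy.stats import pearsonr, spearmanr

human  = [0.4, 1.2, 2.9, 3.8]      # illustrative gold distance values
system = [0.1, 0.3, 0.7, 0.8]      # illustrative measure outputs

rho, _ = spearmanr(human, system)  # compares only the two rankings
r, _   = pearsonr(human, system)   # sensitive to the actual values
print(rho, r)
```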
Observe that on both datasets and by both measures of correlation, the cross-
lingual measures of concept-distance perform not just as well as the best monolingual
measures, but in fact slightly better. (Figure 10 depicts the results in a graph.) In general,
the correlations are lower for Gur350 as it contains cross-PoS word pairs and non-
classical relations, making it harder to judge even by humans (as shown by the inter-
annotator correlations for the datasets in Table 1). As per Spearman’s rank correlation,
α-skew divergence and Jensen-Shannon divergence perform best on both datasets. The
correlations of cosine and Lin’s distributional measure are not far behind. Amongst the
monolingual GermaNet measures, radial pseudo-gloss performs best. As per Pearson’s
Table 7
Ranking German word pairs: Correlations of distance measures with human judgments. The
best results obtained using monolingual and cross-lingual measures are marked in bold.
Gur65 Gur350
Spearman’s Pearson’s Spearman’s Pearson’s
Measure rank correlation correlation rank correlation correlation
Monolingual
hypernym pseudo-gloss 0.672 0.702 0.346 0.331
radial pseudo-gloss 0.764 0.565 0.492 0.420
Jiang and Conrath measure 0.665 0.748 0.417 0.410
Lin’s GermaNet measure 0.607 0.739 0.475 0.495
Resnik’s measure 0.623 0.722 0.454 0.466
Cross-lingual
α-skew divergence 0.794 0.597 0.520 0.413
cosine 0.778 0.569 0.500 0.212
Jensen-Shannon divergence 0.793 0.633 0.522 0.422
Lin’s distributional measure 0.775 0.816 0.498 0.514
Figure 10
Ranking German word pairs: Spearman’s rank correlation obtained when using the best
cross-lingual distributional concept-distance measure and that obtained when using the best
monolingual GermaNet-based measure.
correlation, Lin’s distributional measure performs best overall and radial pseudo-gloss
does best amongst the monolingual measures.
In a word-choice problem, a target word is given along with alternatives, and
the objective is to pick the alternative that is most closely related to the target. For
example:9
Duplikat (duplicate)
a. Einzelstück (single copy) b. Doppelkinn (double chin)
c. Nachbildung (replica) d. Zweitschrift (copy)
Torsten Zesch compiled the Reader’s Digest Word Power (RDWP) benchmark for Ger-
man, which consists of 1072 of these word-choice problems collected from the January
2001 to December 2005 issues of the German-language edition (Wallace and Wallace
2005). Forty-four problems that had more than one correct answer and twenty problems
that used a phrase instead of a single term as the target were discarded. The remaining
1008 problems form our evaluation dataset, which is significantly larger than any of the
previous datasets employed in a similar evaluation.
We evaluate the various cross-lingual and monolingual distance measures by their
ability to choose the correct answer. The distance between the target and each of the
alternatives is computed by a measure, and the alternative that is closest is chosen. If
two or more alternatives are equally close to the target, then the alternatives are said to
be tied. If one of the tied alternatives is the correct answer, then the problem is counted
as correctly solved, but the corresponding score is reduced. The system assigns a score
of 0.5, 0.33, and 0.25 for 2, 3, and 4 tied alternatives, respectively (in effect approximating
the score obtained by randomly guessing one of the tied alternatives). If more than
one alternative has a sense in common with the target, then the thesaurus-based cross-
lingual measures will mark them each as the closest sense. However, if one or more of
these tied alternatives is in the same semicolon group of the thesaurus as the target, then
only these are chosen as the closest senses.10
Even though we discard questions from the German RDWP dataset that contained
a phrasal target, we did not discard questions that had phrasal alternatives simply
because of the large number of such questions. Many of these phrases cannot be found
in the knowledge sources (GermaNet or Macquarie Thesaurus via translation list). In
these cases, we remove stopwords (prepositions, articles, etc.) and split the phrase
into component words. As German words in a phrase can be highly inflected, all
components are lemmatized. For example, the target imaginär (imaginary) has nur in
der Vorstellung vorhanden (exists only in the imagination) as one of its alternatives. The
phrase is split into its component words nur, Vorstellung, and vorhanden. The system
computes semantic distance between the target and each phrasal component and selects
the minimum value as the distance between target and potential answer.
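A sketch of the overall decision procedure, combining the tie scoring and the phrasal-alternative handling just described; the distance, stopwords, and lemmatize components are assumed to be supplied by the caller, and the semicolon-group tie refinement is omitted:

```python
def solve_word_choice(target, alternatives, distance, stopwords, lemmatize):
    """Pick the alternative closest to the target.  A phrasal alternative is
    split into content words and scored by its minimum component distance.
    Returns the tied closest alternatives and the score the problem earns
    if the correct answer is among them (0.5 for 2 ties, 0.33 for 3, ...)."""
    def phrase_distance(phrase):
        words = [lemmatize(w) for w in phrase.split() if w not in stopwords]
        return min(distance(target, w) for w in words)

    dists = [phrase_distance(a) for a in alternatives]
    best = min(dists)
    tied = [a for a, d in zip(alternatives, dists) if d == best]
    return tied, 1.0 / len(tied)
```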
7.2.2 Results and Discussion. Table 8 presents the results obtained on the German
RDWP benchmark for both monolingual and cross-lingual measures. Only those
questions for which the measures have some distance information are attempted; the column
‘# attempted’ shows the number of questions attempted by each measure, which is the
maximum score that the measure can hope to get. Observe that the thesaurus-based
cross-lingual measures have a much larger coverage than the GermaNet-based
monolingual measures. The cross-lingual measures have a much larger number of correct
Table 8
Solving word choice questions: Performance of monolingual and cross-lingual distance
measures. The best results for each class of measures are marked in bold.
Reader’s Digest Word Power benchmark
Measure # attempted # correct # ties Score P R F
Monolingual
hypernym pseudo-gloss 222 174 11 171.5 .77 .17 .28
radial pseudo-gloss 266 188 15 184.7 .69 .18 .29
Jiang and Conrath 357 157 1 156.0 .44 .16 .23
Lin’s GermaNet measure 298 153 1 152.5 .51 .15 .23
Resnik’s measure 299 154 33 148.3 .50 .15 .23
Cross-lingual
α-skew divergence 438 185 81 151.6 .35 .15 .21
cosine 438 276 90 223.1 .51 .22 .31
Jensen-Shannon divergence 438 276 90 229.6 .52 .23 .32
Lin’s distributional measure 438 274 90 228.7 .52 .23 .32
answers too (column ‘# correct’), but this number is bloated due to the large number
of ties. We see more ties when using the cross-lingual measures because they rely on
the Macquarie Thesaurus, a very coarse-grained sense inventory (around 800 categories),
whereas the monolingual measures operate on the fine-grained GermaNet. ‘Score’ is
the score each measure gets after it is penalized for the ties. The cross-lingual measures
cosine, Jensen-Shannon divergence, and Lin’s distributional measure obtain the highest
scores. But ‘Score’ by itself does not present the complete picture either as, given the
scoring scheme, a measure that attempts more questions may get a higher score just
from random guessing. We therefore present precision (P), recall (R), and F measure
(F):
\[
P = \frac{\text{Score}}{\text{\# attempted}} \tag{15}
\]
\[
R = \frac{\text{Score}}{1008} \tag{16}
\]
\[
F = \frac{2 \times P \times R}{P + R} \tag{17}
\]
Figure 11 depicts the results in a graph. Observe that the cross-lingual measures have
a higher coverage (recall) than the monolingual measures but lower precision. The F
measures show that the best cross-lingual measures do slightly better than the best
monolingual ones, despite the large number of ties. The measures of cosine, Jensen-
Shannon divergence, and Lin’s distributional measure remain the best cross-lingual
measures, whereas hypernym pseudo-gloss and radial pseudo-gloss are the best
monolingual ones.
8. Related work
Figure 11
Solving word choice questions: Performance of the best monolingual and cross-lingual distance
measures.
See Curran (2004), Weeds et al. (2004), and Mohammad and
Hirst (2007) for comprehensive surveys of distributional measures of word-distance.
Yarowsky (1992) proposed a model for unsupervised word sense disambiguation
using Roget’s Thesaurus. A mutual information–like measure was used to identify words
that best represent each category in the thesaurus, which he calls the salient words. The
presence of a salient word in the context of a target word is evidence that the word
is used in a sense corresponding to the salient word. The evidence is incorporated in
a Bayesian model. The word-category co-occurrence matrix (WCCM) we created can
be seen as a means of determining the degree of salience of any word co-occurring
with a concept. We further improved the accuracy of the WCCM using a bootstrapping
technique.
Jarmasz and Szpakowicz (2003) use the taxonomic structure of the Roget’s Thesaurus
to determine semantic similarity. Two words are considered maximally similar if they
occur in the same semicolon group in the thesaurus. From there, in decreasing order of
similarity, come word pairs in the same paragraph, word pairs in different paragraphs
belonging to the same part of speech within the same category, word pairs in the same
category, and so on, until word pairs that have nothing in common except that they are in the thesaurus
(maximally distant). However, a large number of words that are in different thesaurus
categories may be semantically related. Thus, this approach is better suited for
estimating semantic similarity than semantic relatedness. Our approach is specifically intended
to determine the semantic relatedness between word pairs across thesaurus categories.
Pantel and Lin (2002) proposed a method to discover word senses from text using
word co-occurrence information. The approach produces clusters of words that are
semantically similar, with a numeric score representing the distance of each
word in a cluster from the centroid of that cluster. Note that these clusters carry no
information about which words co-occur with the clusters (concepts), and so they are
not distributional profiles of concepts (DPCs). Rather, the output of the Pantel and Lin
system is more like a Roget's or Macquarie Thesaurus, except that it is automatically
generated. One could create DPCs using our method with the Pantel and Lin thesaurus
(instead of the Macquarie), and it would be interesting to determine their usefulness. However,
we suspect that there will be more complementarity between the information encoded in a
human-created lexical resource and the co-occurrence information in text.
Pantel (2005) also provides a way to create co-occurrence vectors for WordNet
senses. The lexical co-occurrence vectors of words in a leaf node are propagated up
the WordNet hierarchy. A parent node inherits those co-occurrences that are shared
by its children. Lastly, co-occurrences not pertaining to the leaf nodes are removed
from its vector. Even though the methodology attempts to associate a WordNet node or
sense with only those co-occurrences that pertain to it, no attempt is made at correcting
the frequency counts. After all, word1–word2 co-occurrence frequency (or association) is
likely not the same as SENSE 1–word2 co-occurrence frequency (or association), simply
because word1 may have senses other than SENSE 1, as well. Further, in Pantel’s system,
the co-occurrence frequency associated with a parent node is the weighted sum of co-
occurrence frequencies of its children. The frequencies of the child nodes are used as
weights. Sense ambiguity issues apart, this is still problematic because a parent concept
(say, BIRD) may co-occur much more frequently (or infrequently) with a word than its
children do. In contrast, the bootstrapped WCCM not only identifies which words co-
occur with which concepts, but also has more accurate estimates of the co-occurrence
frequencies.
Patwardhan and Pedersen (2006) create aggregate co-occurrence vectors for a
WordNet sense by adding the co-occurrence vectors of the words in its WordNet gloss.
The distance between two senses is then determined by the cosine of the angle between
their aggregate vectors. However, such aggregate co-occurrence vectors are expected to
be noisy because they are created from data that is not sense-annotated. The
bootstrapping procedure introduced in Section 4.3.2 minimizes such errors and, as we showed in
Mohammad and Hirst (2006a), markedly improves accuracies of natural language tasks
that use these co-occurrence vectors.
Véronis (2004) presents a graph theory–based approach to identify the various
senses of a word in a text corpus without the use of a dictionary. For each target word,
a graph of inter-connected nodes is created. Every word that co-occurs with the target
word is a node. Two nodes are connected with an edge if they are found to co-occur
with each other. Highly interconnected components of the graph represent the different
senses of the target word. The node (word) with the most connections in a component
is representative of that sense and its associations with words that occur in a test
instance are used to quantify evidence that the target word is used in the corresponding
sense. However, these strengths of association are at best only rough estimates of the
associations between the sense and co-occurring words, since a sense in his system is
represented by a single (possibly ambiguous) word.
Erk and Padó (2008) proposed a way of determining the distributional profile of
a word in context. They use dependency relations and selectional preferences of the
target words and combine multiple co-occurrence vectors in a manner that gives
more weight to co-occurring words pertaining to the intended senses of the target
words. This approach effectively assumes that each occurrence of a word in a different
context has a unique meaning. In contrast, our approach explores the use of only about a
thousand very coarse concepts to represent the meaning of all words in the vocabulary.
By choosing to work with much coarser concepts, the approach foregoes the ability
to make fine-grained distinctions in meaning, but is able to better estimate semantic
distance between the coarser concepts as there is much more information pertaining to
them.
9. Conclusion
Acknowledgments
This paper incorporates research that was first reported in Mohammad and Hirst
(2006a), Mohammad and Hirst (2006b), and Mohammad et al. (2007). The work
described in Section 7 was carried out in collaboration with Iryna Gurevych and Torsten
Zesch, Technische Universität Darmstadt. The research was supported by the Natural
Sciences and Engineering Research Council of Canada, the University of Toronto, the
U.S. National Science Foundation, and the Human Language Technology Center of
Excellence. Any opinions, findings, and conclusions or recommendations expressed in
this material are those of the authors and do not necessarily reflect the views of the
sponsor. We thank Afra Alishahi, Alex Budanitsky, Michael Demko, Afsaneh Fazly, Diana
McCarthy, Rada Mihalcea, Siddharth Patwardhan, Gerald Penn, Philip Resnik, Frank
Rudzicz, Suzanne Stevenson, Vivian Tsang, and Xinglong Wang for helpful discussions.
References
Agirre, Eneko and Oier Lopez de Lacalle Lekuona. 2003. Clustering WordNet word senses. In
Proceedings of the 1st International Conference on Recent Advances in Natural Language Processing
(RANLP-2003), Borovets, Bulgaria.
Bernard, John R. L., editor. 1986. The Macquarie Thesaurus. Macquarie Library, Sydney, Australia.
Budanitsky, Alexander and Graeme Hirst. 2001. Semantic distance in WordNet: An
experimental, application-oriented evaluation of five measures. In Workshop on WordNet and
Other Lexical Resources, in the North American Chapter of the Association for Computational
Linguistics (NAACL-2001), Pittsburgh, Pennsylvania.
Budanitsky, Alexander and Graeme Hirst. 2006. Evaluating WordNet-based measures of
semantic distance. Computational Linguistics, 32(1):13–47.
Church, Kenneth W. and Patrick Hanks. 1990. Word association norms, mutual information and
lexicography. Computational Linguistics, 16(1):22–29.
Cruse, D. Allen. 1986. Lexical semantics. Cambridge University Press, Cambridge, UK.
Cucerzan, Silviu and David Yarowsky. 2002. Bootstrapping a multilingual part-of-speech tagger
in one person-day. In Proceedings of the 6th Conference on Computational Natural Language
Learning, pages 132–138, Taipei, Taiwan.
Curran, James R. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, School of
Informatics, University of Edinburgh, Edinburgh, UK.
Dagan, Ido, Lillian Lee, and Fernando Pereira. 1994. Similarity-based estimation of word
cooccurrence probabilities. In Proceedings of the 32nd Annual Meeting of the Association for
Computational Linguistics (ACL-1994), pages 272–278, Las Cruces, New Mexico.
Erk, Katrin and Sebastian Padó. 2008. A structured vector space model for word meaning in
context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing
(EMNLP-2008), pages 897–906, Honolulu, HI.
Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman,
and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on
Information Systems, 20(1):116–131.
Firth, John R. 1957. A synopsis of linguistic theory 1930–55. In Studies in Linguistic Analysis
(special volume of the Philological Society), pages 1–32, Oxford, England. The Philological Society.
Gurevych, Iryna. 2005. Using the structure of a conceptual network in computing semantic
relatedness. In Proceedings of the 2nd International Joint Conference on Natural Language Processing
(IJCNLP-2005), pages 767–778, Jeju Island, Republic of Korea.
Harris, Zellig. 1968. Mathematical Structures of Language. Interscience Publishers, New York, NY.
Hirst, Graeme and Alexander Budanitsky. 2005. Correcting real-word spelling errors by
restoring lexical cohesion. Natural Language Engineering, 11(1):87–111.
Hirst, Graeme and David St-Onge. 1998. Lexical chains as representations of context for the
detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An
Electronic Lexical Database. The MIT Press, Cambridge, MA, chapter 13, pages 305–332.
Jarmasz, Mario and Stan Szpakowicz. 2003. Roget’s Thesaurus and semantic similarity. In
Proceedings of the International Conference on Recent Advances in Natural Language Processing
(RANLP-2003), pages 212–219, Borovets, Bulgaria.
Jiang, Jay J. and David W. Conrath. 1997. Semantic similarity based on corpus statistics and
lexical taxonomy. In Proceedings of International Conference on Research on Computational
Linguistics (ROCLING X), Taipei, Taiwan.
Landauer, Thomas K. and Susan T. Dumais. 1997. A solution to Plato’s problem: The latent
semantic analysis theory of acquisition, induction, and representation of knowledge.
Psychological Review, 104:211–240.
Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. 1998. Introduction to latent semantic
analysis. Discourse Processes, 25(2–3):259–284.
Leacock, Claudia and Martin Chodorow. 1998. Combining local context and WordNet similarity
for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical
Database. The MIT Press, Cambridge, MA, chapter 11, pages 265–283.
Lee, Lillian. 2001. On the effectiveness of the skew divergence for statistical language analysis. In
Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics
(AISTATS-2001), pages 65–72, Key West, Florida.
Lin, Dekang. 1997. Using syntactic dependency as local context to resolve word sense ambiguity.
In Proceedings of the 8th Conference of the European Chapter of the Association for Computational
Linguistics (ACL/EACL-1997), pages 64–71, Madrid, Spain.
Lin, Dekang. 1998a. Automatic retrieval and clustering of similar words. In Proceedings of the 17th
International Conference on Computational Linguistics (COLING-1998), pages 768–773, Montreal,
Canada.
Lin, Dekang. 1998b. An information-theoretic definition of similarity. In Proceedings of the 15th
International Conference on Machine Learning, pages 296–304, San Francisco, CA. Morgan
Kaufmann.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, Massachusetts.
Miller, George A. and Walter G. Charles. 1991. Contextual correlates of semantic similarity.
Language and Cognitive Processes, 6(1):1–28.
Mohammad, Saif, Iryna Gurevych, Graeme Hirst, and Torsten Zesch. 2007. Cross-lingual
distributional profiles of concepts for measuring semantic distance. In Proceedings of the Joint
Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning (EMNLP/CoNLL-2007), pages 571–580, Prague, Czech Republic.
Mohammad, Saif and Graeme Hirst. 2006a. Determining word sense dominance using a
thesaurus. In Proceedings of the 11th Conference of the European Chapter of the Association for
Computational Linguistics (EACL), pages 121–128, Trento, Italy.
Mohammad, Saif and Graeme Hirst. 2006b. Distributional measures of concept-distance: A
task-oriented evaluation. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP-2006), pages 35–43, Sydney, Australia.
Mohammad, Saif and Graeme Hirst. 2007. Distributional measures of semantic distance: A
survey. http://www.cs.toronto.edu/compling/Publications.
Mohammad, Saif, Graeme Hirst, and Philip Resnik. 2007. Tor, tormd: Distributional profiles of
concepts for unsupervised word sense disambiguation. In Proceedings of the Fourth International
Workshop on Semantic Evaluations (SemEval-07), pages 326–333, Prague, Czech Republic.
Morris, Jane and Graeme Hirst. 2004. Non-classical lexical semantic relations. In Proceedings of the
Workshop on Computational Lexical Semantics, Human Language Technology Conference of the North
American Chapter of the Association for Computational Linguistics, pages 46–51, Boston,
Massachusetts.
Navigli, Roberto. 2006. Meaningful clustering of senses helps boost word sense disambiguation
performance. In Proceedings of the 21st International Conference on Computational Linguistics and
the 44th annual meeting of the Association for Computational Linguistics (COLING-ACL 2006),
pages 105–112, Sydney, Australia.
Pantel, Patrick. 2005. Inducing ontological co-occurrence vectors. In Proceedings of the 43rd
Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 125–132, Ann
Arbor, Michigan.
Pantel, Patrick and Dekang Lin. 2002. Discovering word senses from text. In Proceedings of the 8th
Association of Computing Machinery SIGKDD International Conference On Knowledge Discovery
and Data Mining, pages 613–619, Edmonton, Canada.
Patwardhan, Siddharth, Satanjeev Banerjee, and Ted Pedersen. 2003. Using measures of semantic
relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference
on Intelligent Text Processing and Computational Linguistics (CICLING-03), pages 17–21, Mexico
City, Mexico.
Patwardhan, Siddharth and Ted Pedersen. 2006. Using WordNet based context vectors to
estimate the semantic relatedness of concepts. In Proceedings of the European Chapter of the
Association for Computational Linguistics Workshop Making Sense of Sense—Bringing
Computational Linguistics and Psycholinguistics Together, pages 1–8, Trento, Italy.
Rada, Roy, Hafedh Mili, Ellen Bicknell, and Maria Blettner. 1989. Development and application
of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30.
Resnik, Philip. 1995. Using information content to evaluate semantic similarity. In Proceedings of
the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pages 448–453,
Montreal, Canada.
Resnik, Philip. 1998. WordNet and class-based probabilities. In Christiane Fellbaum, editor,
WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, Massachusetts, pages
239–263.
Resnik, Philip. 1999. Semantic similarity in a taxonomy: An information-based measure and its
application to problems of ambiguity in natural language. Journal of Artificial Intelligence
Research, 11:95–130.
Resnik, Philip and Mona Diab. 2000. Measuring verb similarity. In Proceedings of the 22nd Annual
Meeting of the Cognitive Science Society (CogSci 2000), pages 399–404, Philadelphia,
Pennsylvania.
Rubenstein, Herbert and John B. Goodenough. 1965. Contextual correlates of synonymy.
Communications of the ACM, 8(10):627–633.
Schütze, Hinrich and Jan O. Pedersen. 1997. A cooccurrence-based thesaurus and two
applications to information retrieval. Information Processing and Management, 33(3):307–318.
Turney, Peter. 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings
of the Twelfth European Conference on Machine Learning (ECML-2001), pages 491–502, Freiburg,
Germany.
Turney, Peter. 2006. Expressing implicit semantic relations without supervision. In Proceedings of
the 21st International Conference on Computational Linguistics and the 44th annual meeting of the
Association for Computational Linguistics, pages 313–320, Sydney, Australia.
Véronis, Jean. 2004. Hyperlex: Lexical cartography for information retrieval. Computer Speech and
Language, Special Issue on Word Sense Disambiguation, 18(3):223–252.
Wallace, DeWitt and Lila Acheson Wallace. 2005. Reader’s Digest, das Beste für Deutschland. Jan
2001–Dec 2005. Verlag Das Beste, Stuttgart.
Weeds, Julie, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical
distributional similarity. In Proceedings of the 20th International Conference on Computational
Linguistics (COLING-04), pages 1015–1021, Geneva, Switzerland.
Yarowsky, David. 1992. Word-sense disambiguation using statistical models of Roget’s
categories trained on large corpora. In Proceedings of the 14th International Conference on
Computational Linguistics (COLING-1992), pages 454–460, Nantes, France.
Zesch, Torsten, Iryna Gurevych, and Max Mühlhäuser. 2007. Comparing Wikipedia and German
WordNet by evaluating semantic relatedness on multiple datasets. In Proceedings of Human
Language Technologies: The Annual Conference of the North American Chapter of the Association for
Computational Linguistics (NAACL-HLT-2007), pages 205–208, Rochester, New York.