
In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, 1997, pp. 507–509.

Choosing the Word Most Typical in Context


Using a Lexical Co-occurrence Network

Philip Edmonds
Department of Computer Science, University of Toronto
Toronto, Canada, M5S 3G4
pedmonds@cs.toronto.edu

Abstract

This paper presents a partial solution to a component of the problem of lexical choice: choosing the synonym most typical, or expected, in context. We apply a new statistical approach to representing the context of a word through lexical co-occurrence networks. The implementation was trained and evaluated on a large corpus, and results show that the inclusion of second-order co-occurrence relations improves the performance of our implemented lexical choice program.

1 Introduction

Recent work views lexical choice as the process of mapping from a set of concepts (in some representation of knowledge) to a word or phrase (Elhadad, 1992; Stede, 1996). When the same concept admits more than one lexicalization, it is often difficult to choose which of these 'synonyms' is the most appropriate for achieving the desired pragmatic goals; but this is necessary for high-quality machine translation and natural language generation.

Knowledge-based approaches to representing the potentially subtle differences between synonyms have suffered from a serious lexical acquisition bottleneck (DiMarco, Hirst, and Stede, 1993; Hirst, 1995). Statistical approaches, which have sought to explicitly represent differences between pairs of synonyms with respect to their occurrence with other specific words (Church et al., 1994), are inefficient in time and space.

This paper presents a new statistical approach to modeling context that provides a preliminary solution to an important sub-problem: determining the near-synonym that is most typical, or expected, if any, in a given context. Although this task is weaker than full lexical choice, because it doesn't choose the 'best' word, we believe it is a necessary first step, because it would allow one to determine the effects of choosing a non-typical word in place of the typical word. The approach relies on a generalization of lexical co-occurrence that allows for an implicit representation of the differences between two (or more) words with respect to any actual context. For example, our implemented lexical choice program selects mistake as most typical for the 'gap' in sentence (1), and error in (2).

(1) However, such a move also would run the risk of cutting deeply into U.S. economic growth, which is why some economists think it would be a big {error | mistake | oversight}.

(2) The {error | mistake | oversight} was magnified when the Army failed to charge the standard percentage rate for packing and handling.

2 Generalizing Lexical Co-occurrence

2.1 Evidence-based Models of Context

Evidence-based models represent context as a set of features, say words, that are observed to co-occur with, and thereby predict, a word (Yarowsky, 1992; Golding and Schabes, 1996; Karov and Edelman, 1996; Ng and Lee, 1996). But if we use just the context surrounding a word, we might not be able to build up a representation satisfactory enough to uncover the subtle differences between synonyms, because of the massive volume of text that would be required.

Now, observe that even though a word might not co-occur significantly with another given word, it might nevertheless predict the use of that word if the two words are mutually related to a third word. That is, we can treat lexical co-occurrence as though it were moderately transitive. For example, in (3), learn provides evidence for task because it co-occurs (in other contexts) with difficult, which in turn co-occurs with task (in other contexts), even though learn is not seen to co-occur significantly with task.

(3) The team's most urgent task was to learn whether Chernobyl would suggest any safety flaws at KWU-designed plants.

So, by augmenting the contextual representation of a word with such second-order (and higher) co-occurrence relations, we stand to have greater predictive power, assuming that we assign less weight to them in accordance with their lower information content. And as our results will show, this generalization of co-occurrence is necessary.

[Figure 1 (network diagram): A fragment of the lexical co-occurrence network for task. The root task/NN connects, directly or transitively, to nodes such as urgent/JJ, difficult/JJ, easy/JJ, called/VBD, forces/NNS, learn/VB, costly/JJ, find/VB, practice/NN, plants/NNS, safety/NN, flaws/NNS, To/TO, team/NN, and suggest/VB, with t-scores labelling the edges. The dashed line is a second-order relation implied by the network.]

We can represent these relations in a lexical co-occurrence network, as in figure 1, that connects lexical items by just their first-order co-occurrence relations. Second-order and higher relations are then implied by transitivity.

2.2 Building Co-occurrence Networks

We build a lexical co-occurrence network as follows: given a root word, connect it to all the words that significantly co-occur with it in the training corpus;[1] then, recursively connect these words to their significant co-occurring words up to some specified depth.

[1] Our training corpus was the part-of-speech-tagged 1989 Wall Street Journal, which consists of N = 2,709,659 tokens. No lemmatization or sense disambiguation was done. Stop words were numbers, symbols, proper nouns, and any token with a raw frequency greater than F = 800.
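
As a minimal sketch, this construction is a bounded breadth-first traversal. We assume a helper significant_cooccurrents(word) that applies the mutual-information and t-score filter described in the next paragraph and returns each qualifying co-occurrent with its t-score; the function names and dictionary representation are ours, not the paper's:

    from collections import deque

    def build_network(root, significant_cooccurrents, max_depth=3):
        # network[u][v] holds the t-score of the first-order relation (u, v);
        # higher-order relations stay implicit, as paths through the network.
        network = {root: {}}
        frontier = deque([(root, 0)])
        while frontier:
            word, depth = frontier.popleft()
            if depth == max_depth:
                continue  # stop expanding at the specified depth
            for neighbour, t_score in significant_cooccurrents(word).items():
                if neighbour not in network:
                    network[neighbour] = {}
                    frontier.append((neighbour, depth + 1))
                # co-occurrence is symmetric, so store the edge both ways
                network[word][neighbour] = t_score
                network[neighbour][word] = t_score
        return network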

We use the intersection of two well-known measures of significance, mutual information scores and t-scores (Church et al., 1994), to determine if a (first-order) co-occurrence relation should be included in the network; however, we use just the t-scores in computing significance scores for all the relations. Given two words, w_0 and w_d, in a co-occurrence relation of order d, and a shortest path P(w_0, w_d) = (w_0, ..., w_d) between them, the significance score is

    sig(w_0, w_d) = \frac{1}{d^3} \sum_{w_i \in P(w_0, w_d)} \frac{t(w_{i-1}, w_i)}{2^{i-1}}

This formula ensures that significance is inversely proportional to the order of the relation. For example, in the network of figure 1, sig(task, learn) = [t(task, difficult) + (1/2) t(difficult, learn)]/8 = 0.41.

A single network can be quite large. For instance, the complete network for task (see figure 1) up to the third order has 8998 nodes and 37,548 edges.
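
This worked example can be reproduced directly from the formula. The sketch below takes the t-scores along the shortest path; the values t(task, difficult) = 2.60 and t(difficult, learn) = 1.36 are read off the edge labels of figure 1, and the function name is ours:

    def sig(path_t_scores):
        # path_t_scores[i-1] is t(w_{i-1}, w_i) along the shortest path
        # w_0, ..., w_d; each term is halved per extra step, and the sum
        # is divided by d**3 to penalize higher-order relations.
        d = len(path_t_scores)
        return sum(t / 2**i for i, t in enumerate(path_t_scores)) / d**3

    print(round(sig([2.60, 1.36]), 2))  # sig(task, learn) = 0.41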

2.3 Choosing the Most Typical Word

The amount of evidence that a given sentence provides for choosing a candidate word is the sum of the significance scores of each co-occurrence of the candidate with a word in the sentence. So, given a gap in a sentence S, we find the candidate c for the gap that maximizes

    M(c, S) = \sum_{w \in S} sig(c, w)

For example, given S as sentence (3), above, and the network of figure 1, M(task, S) = 4.40. However, job (using its own network) matches best with a score of 5.52; duty places third with a score of 2.21.
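
In other words, lexical choice reduces to an argmax over the candidate set. A sketch, assuming a helper sig_score(network, c, w) that evaluates the formula above over the shortest path from c to w and returns 0.0 when w is not in c's network (the names are ours):

    def most_typical(candidates, sentence_words, networks, sig_score):
        # M(c, S): total evidence the sentence words provide for candidate c
        def M(c):
            return sum(sig_score(networks[c], c, w) for w in sentence_words)
        return max(candidates, key=M)

Note that for sentence (3) this argmax picks job (5.52) over the author's task (4.40), an error that section 3 traces to the job safety collocation.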

Set  POS  Synonyms (with training corpus frequency)
1    JJ   difficult (352), hard (348), tough (230)
2    NN   error (64), mistake (61), oversight (37)
3    NN   job (418), task (123), duty (48)
4    NN   responsibility (142), commitment (122), obligation (96), burden (81)
5    NN   material (177), stuff (79), substance (45)
6    VB   give (624), provide (501), offer (302)
7    VB   settle (126), resolve (79)

Table 1: The sets of synonyms for our experiment.

3 Results and Evaluation

To evaluate the lexical choice program, we selected several sets of near-synonyms, shown in table 1, that have low polysemy in the corpus, and that occur with similar frequencies. This is to reduce the confounding effects of lexical ambiguity.

For each set, we collected all sentences from the yet-unseen 1987 Wall Street Journal (part-of-speech-tagged) that contained any of the members of the set, ignoring word sense. We replaced each occurrence by a 'gap' that the program then had to fill. We compared the 'correctness' of the choices made by our program to the baseline of always choosing the most frequent synonym according to the training corpus.

But what are the 'correct' responses? Ideally, they should be chosen by a credible human informant. But regrettably, we are not in a position to undertake a study of how humans judge typical usage, so we will turn instead to a less ideal source: the authors of the Wall Street Journal. The problem is, of course, that authors aren't always typical. A particular word might occur in a 'pattern' in which another synonym was seen more often, making it the typical choice. Thus, we cannot expect perfect accuracy in this evaluation.
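
The evaluation procedure amounts to the following loop, sketched here with hypothetical helpers: test_cases yields a tagged sentence and the position of the synonym the author actually used, and choose stands in for the lexical choice program of section 2.3:

    def evaluate(test_cases, synonym_set, training_freq, choose):
        # Compare the program against the baseline of always choosing
        # the synonym most frequent in the training corpus.
        baseline = max(synonym_set, key=training_freq.get)
        hits = baseline_hits = total = 0
        for words, gap_pos in test_cases:
            original = words[gap_pos]                # the author's own choice
            context = words[:gap_pos] + words[gap_pos + 1:]  # leave a 'gap'
            hits += (choose(context, synonym_set) == original)
            baseline_hits += (baseline == original)
            total += 1
        return hits / total, baseline_hits / total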

Table 2 shows the results for all seven sets of synonyms under different versions of the program. We varied two parameters: (1) the window size used during the construction of the network: either narrow (±4 words), medium (±10 words), or wide (±50 words); (2) the maximum order of co-occurrence relation allowed: 1, 2, or 3.

Set           1       2       3       4       5       6       7
Size          6665    1030    5402    3138    1828    10204   1568
Baseline      40.1%   33.5%   74.2%   36.6%   62.8%   45.7%   62.2%
Narrow   1    31.3%   18.7%   34.5%   27.7%   28.8%   33.2%   41.3%
         2    47.2%   44.5%   66.2%   43.9%   61.9%a  48.1%   62.8%a
         3    47.9%   48.9%   68.9%   44.3%   64.6%a  48.6%   65.9%
Medium   1    24.0%   25.0%   26.4%   29.3%   28.8%   20.6%   44.2%
         2    42.5%   47.1%   55.3%   45.3%   61.5%a  44.3%   63.6%a
         3    42.5%   47.0%   53.6%   —       —       —       —
Wide     1    9.2%    20.6%   17.5%   20.7%   21.2%   4.1%    26.5%
         2    39.9%a  46.2%   47.1%   43.2%   52.7%   37.7%   58.6%

a Difference from baseline not significant.

Table 2: Accuracy of several different versions of the lexical choice program. The best score for each set is in boldface. Size refers to the size of the sample collection. All differences from baseline are significant at the 5% level according to Pearson's χ² test, unless indicated.

The results show that at least second-order co-occurrences are necessary to achieve better than baseline accuracy in this task; regular co-occurrence relations are insufficient. This justifies our assumption that we need more than the surrounding context to build adequate contextual representations.

Also, the narrow window gives consistently higher accuracy than the other sizes. This can be explained, perhaps, by the fact that differences between near-synonyms often involve differences in short-distance collocations with neighboring words, e.g., face the task.

There are two reasons why the approach doesn't do as well as an automatic approach ought to. First, as mentioned above, our method of evaluation is not ideal; it may make our results just seem poor. Perhaps our results actually show the level of 'typical usage' in the newspaper.

Second, lexical ambiguity is a major problem, affecting both evaluation and the construction of the co-occurrence network. For example, in sentence (3), above, it turns out that the program uses safety as evidence for choosing job (because job safety is a frequent collocation), but this is the wrong sense of job. Syntactic and collocational red herrings can add noise too.

4 Conclusion

We introduced the problem of choosing the most typical synonym in context, and gave a solution that relies on a generalization of lexical co-occurrence. The results show that a narrow window of training context (±4 words) works best for this task, and that at least second-order co-occurrence relations are necessary. We are planning to extend the model to account for more structure in the narrow window of context.

Acknowledgements

For comments and advice, I thank Graeme Hirst, Eduard Hovy, and Stephen Green. This work is financially supported by the Natural Sciences and Engineering Research Council of Canada.

References

Church, Kenneth Ward, William Gale, Patrick Hanks, Donald Hindle, and Rosamund Moon. 1994. Lexical substitutability. In B.T.S. Atkins and A. Zampolli, editors, Computational Approaches to the Lexicon. Oxford University Press, pages 153–177.

DiMarco, Chrysanne, Graeme Hirst, and Manfred Stede. 1993. The semantic and stylistic differentiation of synonyms and near-synonyms. In AAAI Spring Symposium on Building Lexicons for Machine Translation, pages 114–121, Stanford, CA, March.

Elhadad, Michael. 1992. Using Argumentation to Control Lexical Choice: A Functional Unification Implementation. Ph.D. thesis, Columbia University.

Golding, Andrew R. and Yves Schabes. 1996. Combining trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.

Hirst, Graeme. 1995. Near-synonymy and the structure of lexical knowledge. In AAAI Symposium on Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pages 51–56, Stanford, CA, March.

Karov, Yael and Shimon Edelman. 1996. Learning similarity-based word sense disambiguation from sparse data. In Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, August.

Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.

Stede, Manfred. 1996. Lexical Semantics and Knowledge Representation in Multilingual Sentence Generation. Ph.D. thesis, University of Toronto.

Yarowsky, David. 1992. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 454–460.
