APPLICATIONS

Using WordNet and Lexical Operators to Improve Internet Searches

Dan I. Moldovan and Rada Mihalcea
Southern Methodist University

A natural language interface system for an Internet search engine shows substantial increases in the precision of query results and the percentage of queries answered correctly. The system expands queries based on a word-sense-disambiguation method and postprocesses retrieved documents to extract only the parts relevant to a query.

A vast amount of information is available on the Internet, and many tools have been developed to gather it. These include search engines such as AltaVista, Infoseek, Lycos, and many others. A main problem with current search engines is that broad, general queries produce a large volume of documents, many of which are totally irrelevant. At the same time, many relevant documents can be missing because the query does not contain the keywords that index them; for the same reason, specific queries often fail to produce any documents at all. Boolean operators can sometimes help, but they can also further restrict a query such that it fails to find relevant documents.

The lack of a natural language interface is another limitation of current search engines. Many users, particularly those who are not computer professionals, would prefer to ask, "Who were the U.S. presidents of the past century?" rather than form a Boolean query such as (US NEAR presidents) AND (past NEAR century). These users would undoubtedly benefit from an interface that could transform sentences into Boolean queries. But there is another, perhaps even greater advantage in using natural language questions. With a modest amount of linguistic processing, the words in an English question can be "disambiguated" and the query subsequently expanded to include similar words from online dictionaries.

In this article, we describe such a system for broadening Web searches. The large number of documents that result from the search are then subjected to a new search using an operator that further capitalizes on natural language constructs by extracting only the paragraphs that render information relevant to the query. We conclude with test results that show significant improvements in two metrics:

■ performance is a standard information-retrieval system measure of the number of relevant documents retrieved over the total number of documents retrieved;
■ system productivity is the percentage of questions answered satisfactorily, a new measure that we introduce to address the Internet environment.

34 • JANUARY/FEBRUARY 2000 • http://computer.org/internet/ • 1089-7801/00/$10.00 © 2000 IEEE • IEEE INTERNET COMPUTING
INTERFACE SYSTEM ARCHITECTURE
Figure 1 shows the system architecture. The input
query or sentence expressed in English is first presented to the lexical processing module. This module was adopted from an information extraction system that we developed for the Message
Understanding Conference (MUC) competition.1
The word and sentence boundaries are located via
a process called tokenization. The words are tagged
for their part of speech using a version of Brill’s tagger.2 A phrase parser segments each sentence into
constituent noun and verb phrases and recognizes
the head words. After eliminating stopwords (conjunctions, prepositions, pronouns, and modal
verbs), we are left with some keywords xi that represent the important concepts of the input sentence.
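The keyword-extraction step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stopword list is a small hypothetical sample, and the naive tokenizer does no POS tagging or phrase parsing, so multiword names like "the United States" come out as separate tokens.

```python
import re

# Illustrative sketch of the lexical processing module: tokenize a question,
# drop stopwords (conjunctions, prepositions, pronouns, modal verbs, and
# question words), and keep the remaining keywords x_i.
# The STOPWORDS set here is a hypothetical sample, not the authors' list.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "for", "to", "and", "or", "but",
    "who", "what", "which", "where", "how", "much", "many", "does", "do",
    "is", "are", "was", "were", "can", "could", "will", "would", "i", "some",
}

def extract_keywords(question: str) -> list[str]:
    """Return the content words of a question, in order, lowercased."""
    tokens = re.findall(r"[A-Za-z]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(extract_keywords(
    "How much tax does an average salary worker pay in the United States?"))
# ['tax', 'average', 'salary', 'worker', 'pay', 'united', 'states']
```

A real front end would additionally run a part-of-speech tagger and a phrase parser to recover head words, as the text describes.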
In the next three sections, we describe the wordsense-disambiguation (WSD), query-expansion,
and postprocessing modules in our system. The
current implementation uses WordNet for WSD
and query expansion, and the AltaVista search
engine for Internet search. For more information
on these tools, see the sidebar “Development
Resources for Improving Internet Searches.”
WORD-SENSE DISAMBIGUATION
Word-sense disambiguation is a novelty of our system. Each keyword in the query is mapped into its
corresponding semantic form as defined in WordNet. This step enables subsequent query expansion
based on semantic concepts rather than keywords.
Our approach takes advantage of the sentence
context. The words are paired, and each word is
disambiguated by searching the Internet with
queries formed using different senses of one word
while keeping the other word fixed. The senses are
ranked simply by the number of hits. In this way
all the words are processed and senses are ranked.
The next step refines the ordering of senses by
using a semantic density method that measures the
number of common words within a semantic distance of two or more words. The method uses
WordNet glosses. The algorithms and performance
results are presented in the remainder of this section (for an example application of the algorithms,
see the sidebar “Applying the WSD Algorithms”).
Algorithm 1: Contextual Ranking of Word Senses
From a semantically untagged word pair (W1 – W2),
we first select one of the words, say W2, and form
[Figure 1. System architecture. English queries enter the system through the lexical processing module, where parts of speech are tagged and sentences parsed for subsequent processing. The diagram's components: query input, lexical processing module (parts-of-speech tagging, phrase parser), word-sense disambiguation, similarity lists drawn from WordNet, query formulation, Internet search, and postprocessing, looping until the desired answer is found.]
a similarity list for each of its senses, using WordNet’s synset for that word.
Consider, for example, that W2 has m senses.
This means that W2 appears in m similarity lists:
(W2^1, W2^1(1), W2^1(2), …, W2^1(k1))
(W2^2, W2^2(1), W2^2(2), …, W2^2(k2))
...
(W2^m, W2^m(1), W2^m(2), …, W2^m(km))

where W2^1, W2^2, …, W2^m are the senses of W2, and W2^i(s) represents synonym number s of the sense W2^i as defined in WordNet. We can then form W1 – W2^i(s) pairs, specifically:

(W1 – W2^1, W1 – W2^1(1), W1 – W2^1(2), …, W1 – W2^1(k1))
(W1 – W2^2, W1 – W2^2(1), W1 – W2^2(2), …, W1 – W2^2(k2))
...
(W1 – W2^m, W1 – W2^m(1), W1 – W2^m(2), …, W1 – W2^m(km))
Development Resources for Improving Internet Searches

In the system implementation and tests reported in this article, we used WordNet to translate and expand a query from a natural language question, and AltaVista to fetch documents from the Internet.

WordNet
WordNet is a machine-readable dictionary (MRD) developed at Princeton University by a group led by George Miller.1,2 Our system used WordNet 1.6 to disambiguate word sense and generate similarity lists.

WordNet covers the vast majority of nouns, verbs, adjectives, and adverbs from the English language. Its 129,509 words are organized in 99,643 synonym sets, called synsets. Each synset represents a concept. For example, consider the noun "computer." It has two senses defined in WordNet, hence two synsets: {computer, data processor, electronic computer, information processing system} and {calculator, reckoner, figurer, estimator, computer}.

WordNet features a rich set of 299,711 relation links among words, between words and synsets, and between synsets.

AltaVista
AltaVista is a search engine developed in 1995 by Digital Equipment Corporation in its Palo Alto research labs. Its URL is http://www.altavista.com. Several characteristics make it one of the most powerful search engines available today. We based our decision to use AltaVista in our system on two of these features:

■ the amount of information available; its growing index has more than 160,000,000 unique World Wide Web pages; and
■ the complex Boolean searches available through its advanced search function.

The complex search feature allowed us to create specific relationships among the query keywords by using brackets, AND, OR, NOT, and NEAR operators.

Our main concern with AltaVista was its reliability, so we tested it on a set of 1,100 words (nouns, verbs, adjectives, and adverbs) built from one of the texts in the Brown corpus. A test run consisted of searching the Internet for each of these words, and recording the number of hits obtained. For searches performed at different time intervals, the number of hits obtained for a query should vary only within a small range. We performed 20 tests using the same words over a 10-day period, a test run every 12 hours.

The overall results showed that given an average of the number of hits, AV, for a particular word:

■ the hits are in the range [0.99AV, 1.01AV] 90 percent of the time, and
■ the hits are in the range [0.85AV, 1.15AV] 100 percent of the time.

Considering the amount of information on the Internet and its highly unstructured nature, these small variations qualify this search engine as a reliable one.

References
1. G.A. Miller, "WordNet: A Lexical Database," Comm. ACM, Vol. 38, No. 11, 1993, pp. 39-41.
2. C. Fellbaum, An Electronic Lexical Database, MIT Press, Cambridge, Mass., 1998.
Finally, we perform an Internet search for each set
of W1 – W 2i (s) pairs. The query uses the operators
provided by AltaVista to find occurrences of W1
together with that sense of W2 for each set. For
example, one such query is
("W1* W2^i*" OR "W1* W2^i(1)*" OR "W1* W2^i(2)*" OR … OR "W1* W2^i(ki)*") for all 1 ≤ i ≤ m.
The asterisk (*) is used as a wild card to increase the
number of hits with morphologically related words.
Using such a query, we get the number of hits for
each sense i of W2, and this provides a ranking of
the m senses of W2 as they relate with W1.
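The hit-count ranking above can be sketched in a few lines. This is a minimal sketch, not the production system: AltaVista supplied the real hit counts, so `count_hits` here is a hypothetical stand-in, and the similarity lists and canned counts below are abridged stand-ins loosely based on the revise/law example discussed in the sidebar.

```python
# Sketch of Algorithm 1 (contextual ranking of word senses).
# count_hits is injected so a live search backend could be swapped in;
# here we pass a dict lookup with canned, hypothetical hit counts.

def build_query(w1: str, synonyms: list[str]) -> str:
    """OR together wildcarded pairings of W1 with every synonym of one sense of W2."""
    return " OR ".join(f'"{w1}* {syn}*"' for syn in synonyms)

def rank_senses(w1: str, sense_synonyms: dict, count_hits) -> list[int]:
    """Return the sense numbers of W2 ordered by decreasing hit count."""
    hits = {sense: count_hits(build_query(w1, syns))
            for sense, syns in sense_synonyms.items()}
    return sorted(hits, key=hits.get, reverse=True)

# Hypothetical, abridged similarity lists for two senses of "law",
# with canned hit counts standing in for the live search engine.
senses = {1: ["law", "jurisprudence"], 2: ["law", "natural law"]}
canned = {build_query("revise", senses[1]): 224,
          build_query("revise", senses[2]): 2829}
print(rank_senses("revise", senses, canned.get))  # [2, 1]
```

The same routine, called with the roles of W1 and W2 swapped, produces the ranking of W1's senses described next.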
A similar algorithm is used to rank the senses of
W1 while keeping W2 constant (un-disambiguated).
Since these two procedures are performed over a
a large corpus (the Internet) and with the help of
similarity lists, there is little correlation between the
results they produce.
Evaluation of Algorithm 1. We tested this method
on 384 pairs: 200 verb-noun, 127 adjective-noun,
and 57 adverb-verb extracted from the first text of
the SemCor 1.6 from the Brown corpus. Using the
AltaVista query form, we obtained the results shown
in Table 1 (on page 38).
The table indicates the percentages of correct
senses (as given by SemCor) ranked by us as the
first, second, third, and fourth choices of our list.
Applying the WSD Algorithms

Consider the verb-noun collocation revise law. The verb revise has two possible senses in WordNet 1.6, and the noun law has seven senses.

We applied Algorithm 1 and searched the Internet using AltaVista for all possible pairs V – N that can be created using revise and the words from the similarity lists of law. We obtained the following ranking of senses: law#2 (2,829), law#3 (648), law#4 (640), law#6 (397), law#1 (224), law#5 (37), law#7 (0), where the number in parentheses indicates the number of hits.

By setting the threshold t = 2, we kept only senses #2 and #3. (The notation #i/n means sense i out of n possible senses given by WordNet.)

Next, we applied Algorithm 2 to rank the four possible combinations (two for the verb times two for the noun). Table A summarizes the results, according to Equation 1 from the main text.

Table A. Values used in computing the conceptual density Cij.

        |cdij|          descj            Cij
        n2     n3       n2     n3        n2     n3
v1      5      4        975    1,265     0.30   0.28
v2      0      0        975    1,265     0      0

|cdij| = number of common concepts between the verb and noun hierarchies.
descj = number of nouns within the hierarchy of each sense nj.
Cij = conceptual density for each pair vi – nj.

The largest conceptual density, C12 = 0.30, corresponds to v1 – n2: revise#1/2 – law#2/5. This combination of verb-noun senses also appears in SemCor, file br-a01.

We concluded that keeping the top four choices for verbs and nouns and the top two choices for adjectives and adverbs would cover all relevant senses in the mid and upper 90 percent range.

From one point of view, a possible use of the procedure so far is to exclude senses that do not apply. This can save considerable computation time, as many words are highly polysemous.

Algorithm 2: Conceptual Density Ranking
A measure of the relatedness between words can be a knowledge source for several decisions in natural language processing (NLP) applications. Our approach is to construct a linguistic context for each sense of the verb and noun, and to measure the number of nouns shared by the verb and the noun contexts.

In WordNet each concept has a gloss that acts as a microcontext for that concept. This rich source of linguistic information proved useful in determining the conceptual density between words, though it applies only to verb-noun pairs and not to adjectives or adverbs.

We developed an algorithm that takes a semantically untagged verb-noun pair and a ranking of noun senses (as determined by Algorithm 1) as its input and gives a sense-tagged verb-noun pair as output. Given a verb-noun pair V – N, we use WordNet to determine the possible senses of the verb and the noun, <v1, v2, …, vh> and <n1, n2, …, nl>, respectively. Then we use Algorithm 1 to rank the senses of the noun. Only the first t possible senses of this ranking will be considered; the rest are dropped to reduce the computational complexity.

For each possible pair vi – nj, the conceptual density Cij is computed as follows:

1. Extract all glosses from the subhierarchy including vi (the rationale for selecting the subhierarchy is explained below).
2. Determine the nouns from these glosses. These constitute the noun-context of the verb. Each such noun is stored together with a weight w that indicates the level in the subhierarchy of the verb concept in whose gloss the noun was found.
3. Determine the nouns from the noun subhierarchy including nj.
4. Determine the conceptual density Cij of common concepts between the nouns obtained at Step 2 and the nouns obtained at Step 3 by using the metric:

   Cij = ( ∑ k=1..|cdij| wk ) / log(descendantsj)    (1)

where |cdij| is the number of common concepts between the hierarchies of vi and nj; wk represents the levels of the nouns in the hierarchy of verb vi; and descendantsj is the total number of words within the hierarchy of noun nj.

Given the conceptual density Cij, the last step of Algorithm 2 ranks each pair vi – nj, for all i and j.
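Equation 1 can be sketched directly. This is an illustrative computation only: the level weights wk are hypothetical values chosen to reproduce the shape of Table A, and the base of the logarithm is assumed to be 10, since the article does not state either.

```python
import math

# Sketch of the conceptual density metric (Equation 1): the weighted count
# of concepts shared by the verb's noun-context and the noun hierarchy,
# normalized by the log of the noun hierarchy's size.
# Assumption: base-10 logarithm; the weights passed in are hypothetical.

def conceptual_density(common_weights: list, descendants: int) -> float:
    """C_ij = (sum of the weights of the |cd_ij| common concepts) / log(descendants_j)."""
    if not common_weights or descendants <= 1:
        return 0.0
    return sum(common_weights) / math.log10(descendants)

# Reproducing the shape of Table A: v1-n2 shares 5 concepts, v2-n2 none.
# A uniform hypothetical weight of 0.18 per shared concept yields ~0.30.
c_v1_n2 = conceptual_density([0.18] * 5, 975)
c_v2_n2 = conceptual_density([], 975)
print(round(c_v1_n2, 2), c_v2_n2)  # 0.3 0.0
```

The pair with the largest Cij wins, so in the revise/law example v1 – n2 would be selected.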
Rationale for Algorithm 2. This algorithm capitalizes on WordNet's gloss, which explains a concept and provides one or more examples with typical usage of that concept. To determine the most appropriate noun and verb hierarchies, we performed some experiments using SemCor and concluded that the noun subhierarchy should include all the nouns in the class of nj. The subhierarchy of verb vi is taken as the hierarchy of the highest hypernym hi of the verb vi. It is necessary to consider a larger hierarchy than just the one provided by synonyms and direct hyponyms. As we replaced the role of a corpus with glosses, we achieved better results with more glosses. Still, we don't want to enlarge the context too much.

The nouns with a big hierarchy tend to have a larger value for |cdij|, so the weighted sum of common concepts is normalized with respect to the dimension of the noun hierarchy. Since a hierarchy's size grows exponentially with its depth, we used the logarithm of the total number of descendants in the hierarchy, log(descendantsj). We experimented with a few other metrics, but after running the program on several examples, the formula from Algorithm 2 provided the best results.

Evaluation of WSD Method
Table 2 shows the overall results using Algorithm 1 followed by Algorithm 2 on 384 word pairs. Comparing Table 2 results with those for Table 1 will show the percentage increase in accuracy contributed by Algorithm 2 beyond Algorithm 1.

Table 1. Accuracy statistics for 384 word pairs using Algorithm 1.

            Top 1    Top 2    Top 3    Top 4
Noun        76%      83%      86%      98%
Verb        60%      68%      86%      87%
Adjective   79.8%    93%
Adverb      87%      97%

Table 2. Final results obtained for 384 word pairs using both Algorithms 1 and 2.

            Top 1    Top 2    Top 3    Top 4
Noun        86.5%    96%      97%      98%
Verb        67%      79%      86%      87%
Adjective   79.8%    93%
Adverb      87%      97%

To our knowledge, there is only one other method, recently reported, that disambiguates unrestricted nouns, verbs, adverbs, and adjectives in texts.3 The method uses WordNet and attempts to exploit sentential and discourse contexts; it is based on the idea of semantic distance between words and on lexical relations. There are several accurate statistical methods, such as the one presented in Yarowsky,4 but they disambiguate only one part of speech (nouns in this case) and focus on only a few words because they lack training corpora.

Table 3 presents a comparison between our results and the results reported in those papers. The baseline for the comparison is the occurrences of the first senses from WordNet.

For applications such as query expansion in information retrieval, our method has the additional advantage of potentially considering the first two senses for each word, in which case the average accuracy (as determined in Table 2) is 91 percent.

QUERY EXPANSION
The technology of query expansion is almost 30 years old.5 It can be used either to broaden the set of documents retrieved or to increase the retrieval precision. In the former case, the query is expanded with terms similar to those from the original query, while in the second case, the expansion procedure adds completely new terms. We take the first approach, using WordNet to find words semantically related to concepts from the original query. (An example of the second technique is the Smart system, developed at Cornell University, which uses words derived from documents relevant to the original query.6)

The query expansion module in our system has two main functions:

■ the construction of similarity lists using WordNet, and
■ the formation of the actual query.

Once we have a sense ranking for each word of the input sentence, it is relatively easy to use WordNet's rich semantic information to identify many words that are semantically similar to a given input word. Doing this increases the chance of finding more answers to input queries. WordNet can provide semantic similarity between words that belong to the same synonym set.
Consider, for example, the word activity. WordNet gives seven senses for this word. The synset for
the first sense includes two other synonyms, action
and activeness. The similarity list for this sense of
the word is therefore
W = {action, activity, activeness}
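Similarity-list construction can be sketched as a simple synset lookup. A real implementation would query WordNet 1.6 synsets (for example, via NLTK's wordnet corpus); the toy SYNSETS table below is a hypothetical stand-in covering just the example word, so the code runs self-contained.

```python
# Sketch of similarity-list construction from synsets.
# SYNSETS is a hypothetical, hand-rolled stand-in for the WordNet database:
# word -> list of synsets, one per sense, each a list of synonyms.
SYNSETS = {
    "activity": [
        ["action", "activity", "activeness"],  # sense 1, as in the text
        ["activity"],                          # senses 2..7 elided here
    ],
}

def similarity_list(word: str, sense: int) -> list:
    """Return the similarity list W for one sense of a word (1-based sense number)."""
    return SYNSETS[word][sense - 1]

print(similarity_list("activity", 1))  # ['action', 'activity', 'activeness']
```

Each disambiguated keyword xi is mapped this way to its list Wi, which the next step turns into a Boolean query.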
The efficacy of expanding a query for search in
large text collections was investigated by Voorhees.7
She used WordNet to experiment with four
expanding strategies:
■ by synonyms only,
■ by synonyms plus all descendants in an is-a hierarchy,
■ by synonyms plus parents and all descendants in an is-a hierarchy, and
■ by synonyms plus any synset directly related to the given synset.
Her results showed no significant differences in the
precision obtained using any one of these four
expanding strategies.
Let’s denote with xi the words of a question or
sentence, and with Wi = {xi, xik } the similarity lists
provided by WordNet for each word xi. The elements of a list are xki where k enumerates the elements in each list (that is, the words on the same
level of similarity with the word xi).We can nowuse
these lists to formulate the actual query, using the
Boolean operators accepted by current search
engines. The OR operator is used to link words
within a similarity list Wi, while the AND and
NEAR operators link the similarity lists.
While different combinations of similarity lists
linked by AND or NEAR operators are possible,
two basic forms
W1 AND W2 AND ... AND Wn
W1 NEAR W2 NEAR ... NEAR Wn
give the maximum and minimum, respectively, of
the number of documents retrieved. In most cases,
the maximum format gathered thousands of documents, while the minimum format almost always
had null results.
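The two basic query forms can be generated mechanically: OR within each similarity list, AND or NEAR between lists. The sketch below is an assumption about the string format actually sent to AltaVista; multiword synonyms are quoted so the engine treats them as phrases.

```python
# Sketch of the query formulation step: OR joins words inside one
# similarity list; AND (maximum form) or NEAR (minimum form) joins lists.

def formulate_query(similarity_lists: list, connector: str = "AND") -> str:
    """Build a Boolean query string from a list of similarity lists."""
    clauses = []
    for words in similarity_lists:
        # Quote multiword synonyms so they are matched as phrases.
        quoted = [f'"{w}"' if " " in w else w for w in words]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return f" {connector} ".join(clauses)

lists = [["tax", "taxation", "revenue enhancement"], ["worker"], ["pay"]]
print(formulate_query(lists))
# (tax OR taxation OR "revenue enhancement") AND (worker) AND (pay)
```

Calling `formulate_query(lists, "NEAR")` produces the restrictive minimum form instead.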
We can assume that any documents containing
the answers must be among the large number of
documents provided by the AND operators, but
the search engine failed to rank them in the top of
the list. Thus, we sought a new operator that would
filter out many irrelevant texts.
Table 3. A comparison with other WSD methods.

            Baseline   Stetina   Yarowsky   Our method
Noun        80.3%      85.7%     93.9%      86.5%
Verb        62.5%      63.9%                67%
Adjective   81.8%      83.6%                79.8%
Adverb      84.3%      86.5%                87%
Average     77%        80%                  80.1%
POSTPROCESSING WITH A NEW OPERATOR
Our approach to filtering documents is to first search
the Internet using weak operators (AND, OR) and
then to further search this large number of documents
using a more restrictive operator. For this second
phase, we propose the following additional operator:
PARAGRAPH n (... similarity lists ... )
The PARAGRAPH operator searches as an AND
operator for the words in the similarity lists, but
with the constraint that the words belong only to
some n consecutive paragraphs, where n is a positive integer. The parameter n selects the number of
paragraphs, thus controlling the size of the text
retrieved from a document considered relevant.
The rationale is that most likely the information
requested is found in a few paragraphs rather than
being dispersed over an entire document. (A similar idea can be found in Callan.8)
To apply this new operator, the documents gathered from the Internet must be segmented into sentences and paragraphs. Separating a text into sentences is an easy task that can be solved just by
using the punctuation. However, the unstructured
texts on the Web make paragraph segmentation
much more difficult. Both Callan8 and Hearst9
have developed work in this direction, but their
methods work only for structured texts containing
lexical separators known a priori (for example, a tag
or an empty line). Thus, we had to use a method
that covers almost all possible paragraph separators
that can occur in Web texts. The paragraph separators that we’ve considered so far are HTML tags,
empty lines, and paragraph indentations.
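The PARAGRAPH operator can be sketched as segmentation plus a windowed AND test. This is a simplified illustration of the idea, not the authors' code: the splitting regex covers the three separators named above (HTML tags, empty lines, indentation) only crudely, and word matching is by substring rather than proper stemming.

```python
import re

# Sketch of the PARAGRAPH n operator: split a fetched document into
# paragraphs (on <p> tags, empty lines, or leading indentation), then keep
# every window of n consecutive paragraphs that contains at least one word
# from EVERY similarity list, the way AND requires all lists to match.

def split_paragraphs(text: str) -> list:
    parts = re.split(r"<p[^>]*>|\n\s*\n|\n[ \t]+", text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

def paragraph_operator(text: str, similarity_lists: list, n: int = 1) -> list:
    paragraphs = split_paragraphs(text)
    hits = []
    for i in range(len(paragraphs) - n + 1):
        window = " ".join(paragraphs[i:i + n]).lower()
        # Substring match is a simplification ("tax" also matches "taxes").
        if all(any(w.lower() in window for w in wl) for wl in similarity_lists):
            hits.append("\n".join(paragraphs[i:i + n]))
    return hits

doc = "Taxes rose last year.\n\nThe average worker pays income tax.\n\nUnrelated news."
print(paragraph_operator(doc, [["tax"], ["worker"], ["pay", "pays"]], n=1))
# ['The average worker pays income tax.']
```

Only the matching windows are returned, which is why the operator shrinks both the number of results and the amount of text shown per result.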
We give a complete example of our system in the
sidebar, “Finding a Relevant Answer: A Query
Example.”
TEST RESULTS
To test our system, we used 50 questions from real
Internet searches and 50 questions derived from
Finding a Relevant Answer: A Query Example

Suppose you want to answer the question: "How much tax does an average salary worker pay in the United States?"

The linguistic processing module (shown in Figure 1, main text) identified keywords, including part-of-speech tags, which were then ranked for word sense as follows:

x1 = (tax), pos = noun, sense #1/1
x2 = (average), pos = adjective, sense #4/5
x3 = (salary), pos = noun, sense #1/1
x4 = (the United States), pos = noun, sense #1/2
x5 = (worker), pos = noun, sense #1/4
x6 = (pays), pos = verb, sense #1/7

The sense number indicates the actual WordNet sense that resulted from the disambiguation of all possible senses in WordNet. For instance, the adjective average has five senses and the system picked sense #4.

These keywords are the input for the next step of our system. Using the similarity relation encoded in the WordNet synsets, it yields the following six similarity lists:

W1 = {tax, taxation, revenue enhancement}
W2 = {average, intermediate, medium, middle}
W3 = {salary, wage, pay, earnings, remuneration}
W4 = {United States, United States of America, America, US, U.S., USA, U.S.A.}
W5 = {worker}
W6 = {pay}

These lists are used to formulate queries for the search engine. Table A shows some queries and the number of documents retrieved by AltaVista. Though AltaVista has one of the most powerful sets of operators available from search engines today, the ranking provided by AltaVista is of no use for us here. None of the 10 leading documents in any category provided the desired information. Nor did the single document fetched by Query 4:

....The proposed tax cut, and the bigger one promised for next year, if enacted, will be paid for by the Social Security wage taxes of middle and low-income workers of America. Employees have been willing to pay these taxes because of the promise of guaranteed Social Security retirement benefits. This Republican tax bill is a betrayal of the low and middle-income workers. The unfairness of these proposals is breathtaking.

Table A. Query results with various combinations of operators.

Query                                                      No. of documents
1  W1 AND W2 AND W3 AND W4 AND W5 AND W6                   49,182
2  W1 AND (W2 NEAR W3) AND W4 AND W5 AND W6                9,766
3  W1 NEAR (W2 NEAR W3) AND W4 AND W5 AND W6               976
4  W1 NEAR W2 NEAR W3 NEAR W4 NEAR W5 NEAR W6              1 (not relevant)
5  W1 AND {average W3} AND W4 AND W5 AND W6                9,045
6  W1 NEAR {average W3} NEAR W4 NEAR W5 NEAR W6            0

Analysis of the table results indicates a gap in the volume of documents retrieved with the AltaVista operators. For instance, using only the AND operator (Query 1) obtained 49,182 documents, but the NEAR operator (Queries 4 and 6) produced only one (irrelevant) document and zero documents, respectively. This operator seems too restrictive, yet it still fails to identify the right answer. Various combinations of AND and NEAR operators achieved no great results.

Using the PARAGRAPH operator, however, the system found a relevant answer:

In 1910, American workers paid no income tax. In 1995, a worker earning an average wage of $26,000 pays about 24% (about $6,000) in income taxes. The average American worker's pay has risen greatly since 1910. Then, the average worker earned about $600 per year. Today, the figure is $26,000.
each of 50 topics defined for ad hoc queries at the
Sixth Text Retrieval Conference (TREC-6),10
cosponsored by the U.S. National Institute of Standards and Technology (NIST) and the Defense
Advanced Research Projects Agency (DARPA).
Figure 2 presents an example topic from the TREC-6 ad hoc collection. Each topic is a frame-like data structure with the following fields:

■ <num> identifies the topic number;
■ <title> classifies the topic within a domain;
■ <desc> describes the topic briefly (for TREC-6, this section was intended to be an initial search query); and
■ <narr> provides a further explanation of what relevant material may look like.
We edited the <desc> field to derive natural language questions similar to those normally asked by
real users searching the Internet. For example, from
the corpus entry presented above, the question
derived was “Which are some of the organizations
participating in international criminal activity?”
Let’s denote the two sets of questions as REAL
and TREC. In our experiment, the REAL queries
posed by users could usually be classified as concrete
queries—that is, based on specialized knowledge
of a domain, while the TREC topics led to more
abstract queries.11
Table 4 presents five randomly selected questions from the TREC set and five questions from
the REAL set, together with the results obtained.
Each table cell contains two numbers: on the
top, the number of documents or—for the
PARAGRAPH operator—paragraphs retrieved
for the question; on the bottom, the number of
relevant documents or paragraphs found in the
top 10 ranking.
The AND xi and NEAR xi columns contain the
results for the search when AND and NEAR operators were applied to the input words xi. By replacing the words xi with their similarity lists derived
from WordNet, the number of documents
retrieved increased, as expected. The results
obtained in these cases are presented in the AND
wi and NEAR wi columns.
<num> Number: 301
<title> International Organized Crime
<desc> Description:
Identify organizations that participate in international criminal activity,
and, if possible, collaborating organizations and the countries involved.
<narr> Narrative:
A relevant document must as a minimum identify the organization and the
type of illegal activity (e.g., Colombian cartel exporting cocaine). Vague
references to international drug trade without identification of the
organization(s) involved would not be relevant.
Figure 2. Example ad hoc topic and its data structure.
The next column contains the number of documents extracted when the operator PARAGRAPH 2
(meaning two consecutive paragraphs) was applied
to words from the similarity lists. The results were
encouraging; the number of documents retrieved
was small, and correct answers were found in
almost all cases.
Table 5 (next page) presents a summary of
results for the 100 questions used to test our system. First, it shows the number of documents
retrieved for an average TREC and REAL ques-
Table 4. A sample of the results obtained for randomly selected questions from the TREC and the REAL sets. Each cell shows documents (or, for PARAGRAPH, paragraphs) retrieved / relevant ones found in the top 10 ranking.

                                                       AND xi      NEAR xi    AND wi      NEAR wi    PARAGRAPH wi
TREC questions
Which are some of the organizations participating
in international criminal activity?                    27,716/0    3/1        48,133/0    5/1        6/1
Is the disease of Poliomyelitis (polio) under
control in the world?                                  9,432/1     13/3       10,271/2    15/3       40/11
Which are some of the positive accomplishments of
the Hubble telescope since it was launched?            178/1       4/0        504/1       4/0        2/1
Which are some of the endangered mammals?              32,133/0    6,214/1    32,133/0    6,214/1    150/80
Which are the most crashworthy, and least
crashworthy, passenger vehicles?                       246/0       5/1        260/1       5/1        15/6

REAL questions
Where can I find cheap airline fares?                  1,360/2     3/3        2,608/2     35/5       61/34
Find out about Fifths disease.                         2/0         0/0        30/1        0/0        10/1
What is the price of ICI?                              4,503/0     202/0      10,221/0    575/1      117/10
Where can I shop online for Canada?                    36,049/0    858/1      36,049/0    858/1      15/8
What are the average wages for event planners?         6/1         0/0        70/0        0/0        6/6
Table 5. Summary of results for 50 questions from the TREC collection and 50 questions from frequently asked queries on the Internet.

                                      AND xi    NEAR xi   AND wi    NEAR wi   PARAGRAPH wi
Number of documents retrieved
Average question from the TREC set    7,746     258       25,803    332       26.04
Average question from the REAL set    13,510    1,843     28,715    3,003     48.95
Precision
Average question from the TREC set    1.6%      4.8%      4.4%      8.8%      43%
Average question from the REAL set    6.3%      12.43%    6.09%     13.65%    27.7%
Productivity
Average question from the TREC set    36%       44%       20%       36%       90%
Average question from the REAL set    30%       42%       28%       48%       66%
tion. Naturally, the query expansion produced an increase in the number of documents by a factor varying from 1 (meaning an equal number of documents retrieved for both the unexpanded and expanded queries) to 32. Instead of hundreds,
thousands, and even tens of thousands of documents, the PARAGRAPH operator returns just
26 and 48 documents for the TREC and REAL
questions, respectively.
Moreover, instead of returning full documents, the new operator identifies only the portion of the document where the answer is; this
constitutes another reduction factor not captured
in the table.
Next, Table 5 shows the precision, or ratio
between the number of relevant documents
retrieved and the total number of documents
retrieved. Because it is impractical to search for the
relevant documents among all those retrieved by an
AltaVista query, we considered only the relevant
documents in the first ten ranked documents. In
the case of PARAGRAPH, however, the number of
paragraphs retrieved is small, so the precision was
considered over the entire set.
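The two measures can be stated concretely. This is a minimal sketch of the definitions used above; the relevance judgments themselves were of course made by hand, so the inputs below are hypothetical.

```python
# Sketch of the two evaluation measures: precision over the top 10 ranked
# documents (pass cutoff=None to score an entire retrieved set, as done for
# PARAGRAPH results), and system productivity, the fraction of questions
# answered satisfactorily.

def precision(relevant_flags, cutoff=10):
    """Fraction of relevant items among the first `cutoff` retrieved."""
    scored = relevant_flags if cutoff is None else relevant_flags[:cutoff]
    return sum(scored) / len(scored) if scored else 0.0

def productivity(answered):
    """Fraction of questions for which a satisfactory answer was found."""
    return sum(answered) / len(answered) if answered else 0.0

# Hypothetical run: 4 of the top 10 documents relevant; 9 of 10 questions answered.
print(precision([True, False, True, True, False, True,
                 False, False, False, False]))  # 0.4
print(productivity([True] * 9 + [False]))       # 0.9
```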
With the PARAGRAPH operator, the actual
precision reaches 43 percent for the TREC questions and 27.7 percent for the REAL questions.
(The difference can be explained by the short questions that Internet users tend to ask, which retrieve a very large number of documents and make it much harder to find relevant information.)
The biggest gain in Table 5, however, is in system productivity. From the TREC set, 90 percent of the questions were answered correctly; from the REAL set, 66 percent. This is a significant improvement over current technology.
CONCLUSION
In general, because the range of questions is so broad, it is difficult to compare the performance of question-answering systems. Other systems implemented for the REAL type of questions operate in narrow domains. For example, Burke, Hammond, and Kozlovsky [12] describe a system that uses the "Frequently Asked Questions" files associated with many Usenet groups.
The results obtained in the TREC tests can be compared with the work described by Voorhees [13], though the latter retrieves information from very large text collections rather than the Internet. Voorhees reported an average precision of 36 percent for full-topic statements. Our result of 43 percent precision in retrieving information for narrow questions over heterogeneous Internet domains is thus encouraging.
Our system can still fail to return relevant answers for some questions, for example, questions with very specialized terms. The test results nevertheless demonstrate a substantial increase in both precision and the percentage of queries answered correctly, while reducing the amount of text presented to the user compared with current Internet search engine technology.
The system can easily be extended to restrict the output to several sentences instead of paragraphs. Also, a more flexible NEAR search could be implemented with a new operator, SEQUENCE(W1 d W2 d … d Wn), where d is a numeric variable that indicates the maximum distance between the words in the W lists for which the search is done.
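One way such a SEQUENCE operator might behave is sketched below (an assumption, since the article leaves the semantics open): the query words must occur in order, with at most d tokens between consecutive matches.

```python
def sequence_match(tokens, words, d):
    """Return True if `words` occur in `tokens` in order, with at most
    d intervening tokens between consecutive matched words."""
    pos = -1
    for w in words:
        start = pos + 1
        # The first word may occur anywhere; later words must fall
        # within a window of d + 1 tokens after the previous match.
        window = tokens[start:start + d + 1] if pos >= 0 else tokens[start:]
        try:
            offset = [t.lower() for t in window].index(w.lower())
        except ValueError:
            return False
        pos = start + offset
    return True

doc = "the stock market closed sharply higher on Friday".split()
print(sequence_match(doc, ["stock", "higher"], d=4))  # True: 3 words intervene
print(sequence_match(doc, ["stock", "higher"], d=2))  # False: window too small
```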
Indexing words by their WordNet senses, so-called semantic or conceptual indexing, could also improve Internet searches. This method implies some online parsing and word-sense disambiguation that may be possible in the not-too-distant future. Semantic indexing has the potential to improve the ranking of search results, as well as to allow information extraction of objects and their relationships (for example, see Pustejovsky et al. [14]).
Finally, Web searches could use compound nouns or collocations. WordNet includes thousands of word groups (for example, blue-collar worker, stock market, and mortgage interest rate) that point to their respective concepts. Indexing each compound noun as one term reduces the search engine's storage space and might further increase precision.
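Treating a known compound as a single index term can be sketched with a greedy longest-match tokenizer. The two-entry COMPOUNDS set here is a stand-in for WordNet's thousands of multiword entries:

```python
# Stand-in for WordNet's compound-noun list (the real list has thousands
# of entries such as blue-collar worker and mortgage interest rate).
COMPOUNDS = {("stock", "market"), ("mortgage", "interest", "rate")}
MAX_LEN = max(len(c) for c in COMPOUNDS)

def tokenize_with_compounds(text):
    """Greedy longest-match: merge known compounds into single index terms."""
    words = text.lower().split()
    terms, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in COMPOUNDS:
                terms.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            terms.append(words[i])
            i += 1
    return terms

print(tokenize_with_compounds("the stock market fell"))
# ['the', 'stock_market', 'fell']
```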
■
REFEREN CES
1. D. Moldovan et al., “USC: Description of the SNAP
System Used for MUC-5,” Proc. 5th Message Understanding Conf., Morgan Kaufmann, San Francisco, 1993,
pp. 305-320.
2. E. Brill, "Some Advances in Transformation-Based Part-of-Speech Tagging," Proc. 12th Nat'l Conf. on Artificial Intelligence (AAAI-94), AAAI Press, Menlo Park, Calif., 1994, pp. 256-261.
3. J. Stetina, S. Kurohashi, and M. Nagao, “General Word
Sense Disambiguation Method Based on a Full Sentential
Context,” Proc. Workshop on Usage of WordNet in Natural
Language Processing, Morgan Kaufmann, San Francisco,
1998, pp. 1-8.
4. D. Yarowsky, “Unsupervised Word Sense Disambiguation
Rivaling Supervised Methods,” Proc. 33rd Ann. Meeting of
the Association of Computational Linguistics, Morgan Kaufmann, San Francisco, 1995, pp. 189-196.
5. G. Salton and M.E. Lesk, “Computer Evaluation of
Indexing and Text Processing,” in The SMART Retrieval
System: Experiments in Automatic Document Processing, G.
Salton, ed., Prentice Hall, Englewood Cliffs, N.J., 1971,
pp. 143-180.
6. C. Buckley et al., “Automatic Query Expansion Using
SMART,” Proc. Third Text Retrieval Conf. (TREC-3),
NIST Special Publications, Washington, DC, 1994, pp.
69-81; available for download at http://trec.nist.gov/pubs/
trec3/t3_proceedings.html.
7. E.M. Voorhees, "Using WordNet for Text Retrieval," in WordNet: An Electronic Lexical Database, C. Fellbaum, ed., MIT Press, Cambridge, Mass., 1998, pp. 285-303.
8. J.P. Callan, "Passage-Level Evidence in Document Retrieval," Proc. 17th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, Dublin, Ireland, 1994, pp. 302-310.
9. M.A. Hearst, “Multi-Paragraph Segmentation of Expository Text,” Proc. 32nd Ann. Meeting of the Association for
Computational Linguistics, Morgan Kaufmann, San Francisco, 1994, pp. 143-180.
10. E.M. Voorhees and D. Harman, eds., Sixth Text Retrieval
Conf. (TREC 6), NIST Special Publication 500-240,
Washington, DC, 1997, available online at http://
trec.nist.gov/pubs/trec6/t6_proceedings.html.
11. M.K. Leong, "Concrete Queries in Specialized Domains: Known Item as Feedback for Query Formulation," Sixth Text Retrieval Conf. (TREC-6), NIST Special Publications, Washington, DC, 1997, pp. 541-550; available for download at http://trec.nist.gov/pubs/trec6/t6_proceedings.html.
12. R. Burke, K. Hammond, and J. Kozlovsky, "Knowledge-Based Information Retrieval from Semi-Structured Text," Proc. American Assn. for Artificial Intelligence Conf., Fall Symp. on AI Applications in Knowledge Navigation and Retrieval, AAAI Press, Menlo Park, Calif., 1995, pp. 15-19.
13. E.M. Voorhees, "Query Expansion Using Lexical-Semantic Relations," Proc. 17th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, Dublin, Ireland, 1994, pp. 61-69.
14. J. Pustejovsky et al., “Semantic Indexing and Typed Hyperlinking,” Proc. American Assn. for Artificial Intelligence
Conf., Spring Symp., AAAI Press, Menlo Park, Calif., 1997,
pp. 120-128.
Dan I. Moldovan is a professor in the Department of Computer Science and Engineering at Southern Methodist University. His current research interests are in the field of natural
language processing, particularly in linguistic knowledge
bases, text inference, and question answering systems.
Moldovan received a PhD in electrical engineering and computer science from Columbia University in 1978.
Rada Mihalcea is a PhD student in the Department of Computer Science at Southern Methodist University. Her
research interests are in the field of natural language processing, particularly in word sense disambiguation, information extraction, and question answering systems. Mihalcea is a member of the AAAI and the ACL.
Readers may contact the authors at {moldovan, rada}@seas.smu.edu.
Next issue in IEEE Internet Computing:
Agent Technology and the Internet
Guest Editors: Mike Wooldridge and Keith Decker