
Using WordNet and Lexical Operators to Improve Internet Searches

2000, IEEE Internet Computing

Dan I. Moldovan and Rada Mihalcea, Southern Methodist University

A natural language interface system for an Internet search engine shows substantial increases in the precision of query results and the percentage of queries answered correctly. The system expands queries based on a word-sense-disambiguation method and postprocesses retrieved documents to extract only the parts relevant to a query.

A vast amount of information is available on the Internet, and many tools have been developed to gather it. These include search engines such as AltaVista, Infoseek, Lycos, and many others. A main problem with current search engines is that broad, general queries produce a large volume of documents, many of which are totally irrelevant. At the same time, many relevant documents can be missed because the query does not contain the keywords that index them; for the same reason, specific queries often fail to produce any documents at all. Boolean operators can sometimes help, but they can also restrict a query so far that it fails to find relevant documents.

The lack of a natural language interface is another limitation of current search engines. Many users, particularly those who are not computer professionals, would prefer to ask, "Who were the U.S. presidents of the past century?" rather than form a Boolean query such as (US NEAR presidents) AND (past NEAR century). These users would undoubtedly benefit from an interface that could transform sentences into Boolean queries. But there is another, perhaps even greater advantage in using natural language questions. With a modest amount of linguistic processing, the words in an English question can be "disambiguated" and the query subsequently expanded to include similar words from online dictionaries.

In this article, we describe such a system for broadening Web searches. The large number of documents that result from the search are then subjected to a new search using an operator that further capitalizes on natural language constructs by extracting only the paragraphs that contain information relevant to the query. We conclude with test results that show significant improvements in two metrics:

■ precision, a standard information-retrieval measure: the number of relevant documents retrieved over the total number of documents retrieved; and
■ system productivity, the percentage of questions answered satisfactorily, a new measure that we introduce to address the Internet environment.

INTERFACE SYSTEM ARCHITECTURE

Figure 1 shows the system architecture.

[Figure 1. System architecture. English queries enter the system through the lexical processing module, where parts of speech are tagged and sentences parsed for subsequent processing by the word-sense-disambiguation, query-formulation, Internet-search, and postprocessing stages.]

The input query or sentence, expressed in English, is first presented to the lexical processing module. This module was adopted from an information extraction system that we developed for the Message Understanding Conference (MUC) competition.1 Word and sentence boundaries are located through a process called tokenization. The words are tagged for their part of speech using a version of Brill's tagger.2 A phrase parser segments each sentence into constituent noun and verb phrases and recognizes the head words. After eliminating stopwords (conjunctions, prepositions, pronouns, and modal verbs), we are left with keywords x_i that represent the important concepts of the input sentence.
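As a concrete illustration of this preprocessing stage, the sketch below reproduces the pipeline with off-the-shelf NLTK components. The tokenizer, tagger, and stopword list are stand-ins we chose for illustration, not the MUC-derived modules or Brill's tagger that the authors actually used.

```python
# A minimal sketch of the lexical processing stage, assuming NLTK.
# One-time setup: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger").
import nltk
from nltk.corpus import stopwords

# Penn Treebank tag prefixes for the open-class words kept as keywords;
# modal verbs (MD), pronouns (PRP), and prepositions (IN) fall outside.
CONTENT_TAGS = ("NN", "VB", "JJ", "RB")

def extract_keywords(question: str) -> list[tuple[str, str]]:
    """Tokenize, tag, and keep the content words x_i of a question."""
    tokens = nltk.word_tokenize(question)
    tagged = nltk.pos_tag(tokens)
    stop = set(stopwords.words("english"))
    return [(word, tag) for word, tag in tagged
            if tag.startswith(CONTENT_TAGS) and word.lower() not in stop]

print(extract_keywords("Who were the U.S. presidents of the past century?"))
```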
In the next three sections, we describe the word-sense-disambiguation (WSD), query-expansion, and postprocessing modules in our system. The current implementation uses WordNet for WSD and query expansion, and the AltaVista search engine for Internet search. For more information on these tools, see the sidebar "Development Resources for Improving Internet Searches."

Sidebar: Development Resources for Improving Internet Searches

In the system implementation and tests reported in this article, we used WordNet to translate and expand a query from a natural language question, and AltaVista to fetch documents from the Internet.

WordNet. WordNet is a machine-readable dictionary (MRD) developed at Princeton University by a group led by George Miller.1,2 Our system used WordNet 1.6 to disambiguate word senses and generate similarity lists. WordNet covers the vast majority of nouns, verbs, adjectives, and adverbs of the English language. Its 129,509 words are organized in 99,643 synonym sets, called synsets. Each synset represents a concept. For example, consider the noun "computer." It has two senses defined in WordNet, hence two synsets: {computer, data processor, electronic computer, information processing system} and {calculator, reckoner, figurer, estimator, computer}. WordNet features a rich set of 299,711 relation links among words, between words and synsets, and between synsets.

AltaVista. AltaVista is a search engine developed in 1995 by Digital Equipment Corporation in its Palo Alto research labs. Its URL is http://www.altavista.com. Several characteristics make it one of the most powerful search engines available today. We based our decision to use AltaVista in our system on two of these features:

■ the amount of information available; its growing index has more than 160,000,000 unique World Wide Web pages; and
■ the complex Boolean searches available through its advanced search function, which allowed us to create specific relationships among the query keywords by using brackets and the AND, OR, NOT, and NEAR operators.

Our main concern with AltaVista was its reliability, so we tested it on a set of 1,100 words (nouns, verbs, adjectives, and adverbs) built from one of the texts in the Brown corpus. A test run consisted of searching the Internet for each of these words and recording the number of hits obtained. For searches performed at different time intervals, the number of hits obtained for a query should vary only within a small range. We performed 20 tests using the same words over a 10-day period, a test run every 12 hours. The overall results showed that, given an average number of hits AV for a particular word:

■ the hits are in the range [0.99 AV, 1.01 AV] 90 percent of the time, and
■ the hits are in the range [0.85 AV, 1.15 AV] 100 percent of the time.

Considering the amount of information on the Internet and its highly unstructured nature, these small variations qualify this search engine as a reliable one.

Sidebar references:
1. G.A. Miller, "WordNet: A Lexical Database," Comm. ACM, vol. 38, no. 11, 1995, pp. 39-41.
2. C. Fellbaum, ed., WordNet: An Electronic Lexical Database, MIT Press, Cambridge, Mass., 1998.

WORD-SENSE DISAMBIGUATION

Word-sense disambiguation is a novelty of our system. Each keyword in the query is mapped into its corresponding semantic form as defined in WordNet. This step enables subsequent query expansion based on semantic concepts rather than keywords. Our approach takes advantage of the sentence context. The words are paired, and each word is disambiguated by searching the Internet with queries formed using different senses of one word while keeping the other word fixed. The senses are ranked simply by the number of hits. In this way all the words are processed and their senses ranked. The next step refines the ordering of senses with a semantic density method that measures the number of common words within a semantic distance of two or more words; the method uses WordNet glosses. The algorithms and performance results are presented in the remainder of this section (for an example application of the algorithms, see the sidebar "Applying the WSD Algorithms").

Algorithm 1: Contextual Ranking of Word Senses

From a semantically untagged word pair W_1 - W_2, we first select one of the words, say W_2, and form a similarity list for each of its senses, using WordNet's synset for that sense. Consider, for example, that W_2 has m senses. This means that W_2 appears in m similarity lists:

(W_2^1, W_2^{1(1)}, W_2^{1(2)}, ..., W_2^{1(k_1)})
(W_2^2, W_2^{2(1)}, W_2^{2(2)}, ..., W_2^{2(k_2)})
...
(W_2^m, W_2^{m(1)}, W_2^{m(2)}, ..., W_2^{m(k_m)})

where W_2^1, W_2^2, ..., W_2^m are the senses of W_2, and W_2^{i(s)} represents synonym number s of the sense W_2^i as defined in WordNet. We can then form W_1 - W_2^{i(s)} pairs, specifically:

(W_1 - W_2^1, W_1 - W_2^{1(1)}, W_1 - W_2^{1(2)}, ..., W_1 - W_2^{1(k_1)})
(W_1 - W_2^2, W_1 - W_2^{2(1)}, W_1 - W_2^{2(2)}, ..., W_1 - W_2^{2(k_2)})
...
(W_1 - W_2^m, W_1 - W_2^{m(1)}, W_1 - W_2^{m(2)}, ..., W_1 - W_2^{m(k_m)})
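The similarity lists above are simply the synset members for each sense of W_2. A minimal sketch of building them, using NLTK's WordNet interface as a stand-in for the direct WordNet 1.6 access the authors used (sense inventories differ across WordNet versions):

```python
# One similarity list per sense i of `word`: (W^i, W^{i(1)}, ..., W^{i(k_i)}).
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def similarity_lists(word: str, pos: str) -> list[list[str]]:
    """Return the m similarity lists of `word`, one per WordNet sense."""
    lists = []
    for synset in wn.synsets(word, pos=pos):
        # lemma_names() are the synset members; underscores mark collocations.
        lists.append([name.replace("_", " ") for name in synset.lemma_names()])
    return lists

# The m similarity lists for the noun "law" (seven senses in WordNet 1.6;
# the count may differ in the WordNet version NLTK ships).
for i, sense in enumerate(similarity_lists("law", wn.NOUN), start=1):
    print(f"sense #{i}: {sense}")
```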
Finally, we perform an Internet search for each set of W_1 - W_2^{i(s)} pairs. The query uses the operators provided by AltaVista to find occurrences of W_1 together with that sense of W_2 for each set. For example, one such query is

("W_1* W_2^i*" OR "W_1* W_2^{i(1)}*" OR "W_1* W_2^{i(2)}*" OR ... OR "W_1* W_2^{i(k_i)}*")

for all 1 <= i <= m. The asterisk (*) is used as a wildcard to increase the number of hits with morphologically related words. Using such a query, we get the number of hits for each sense i of W_2, and this provides a ranking of the m senses of W_2 as they relate to W_1.

A similar algorithm is used to rank the senses of W_1 while keeping W_2 constant (undisambiguated). Since these two procedures are performed over a large corpus (the Internet) and with the help of similarity lists, there is little correlation between the results they produce.
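The per-sense queries and the hit-count ranking can be sketched as below. The hit-count backend is entirely hypothetical here (AltaVista no longer exists), so `count_hits` is an assumed callable the reader must supply:

```python
# Sketch of Algorithm 1's ranking step. `count_hits(query) -> int` is a
# hypothetical function standing in for an AltaVista hit-count lookup.
def sense_query(w1: str, sense_list: list[str]) -> str:
    """Build ("W1* W2^i*" OR "W1* W2^{i(1)}*" OR ...) for one sense of W2."""
    return "(" + " OR ".join(f'"{w1}* {syn}*"' for syn in sense_list) + ")"

def rank_senses(w1: str, lists: list[list[str]], count_hits) -> list[tuple[int, int]]:
    """Rank the senses of W2 by the number of hits of each sense query."""
    hits = [(i + 1, count_hits(sense_query(w1, sense)))
            for i, sense in enumerate(lists)]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```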
Evaluation of Algorithm 1. We tested this method on 384 pairs: 200 verb-noun, 127 adjective-noun, and 57 adverb-verb pairs extracted from the first text of SemCor 1.6, from the Brown corpus. Using the AltaVista query form, we obtained the results shown in Table 1. The table indicates the percentages of correct senses (as given by SemCor) ranked by us as the first, second, third, and fourth choices in our list. We concluded that keeping the top four choices for verbs and nouns and the top two choices for adjectives and adverbs would cover all relevant senses in the mid- and upper-90-percent range. One possible use of the procedure so far is thus to exclude senses that do not apply; this can save considerable computation time, as many words are highly polysemous.

Sidebar: Applying the WSD Algorithms

Consider the verb-noun collocation revise law. The verb revise has two possible senses in WordNet 1.6, and the noun law has seven senses. We applied Algorithm 1 and searched the Internet using AltaVista for all possible pairs V - N that can be created using revise and the words from the similarity lists of law. We obtained the following ranking of senses: law#2 (2,829), law#3 (648), law#4 (640), law#6 (397), law#1 (224), law#5 (37), law#7 (0), where the number in parentheses indicates the number of hits. By setting the threshold t = 2, we kept only senses #2 and #3. (The notation #i/n means sense i out of n possible senses given by WordNet.)

Next, we applied Algorithm 2 to rank the four possible combinations (two for the verb times two for the noun). Table A summarizes the results, according to Equation 1 from the main text. The largest conceptual density, C_12 = 0.30, corresponds to v_1 - n_2: revise#1/2 - law#2/7. This combination of verb-noun senses also appears in SemCor, file br-a01.

Table A. Values used in computing the conceptual density C_ij.

          |cd_ij|          desc_j             C_ij
          n_2    n_3       n_2      n_3       n_2     n_3
v_1       5      4         975      1,265     0.30    0.28
v_2       0      0         975      1,265     0       0

|cd_ij| = number of common concepts between the verb and noun hierarchies.
desc_j = number of nouns within the hierarchy of each sense n_j.
C_ij = conceptual density for each pair v_i - n_j.

Algorithm 2: Conceptual Density Ranking

A measure of the relatedness between words can be a knowledge source for several decisions in natural language processing (NLP) applications. Our approach is to construct a linguistic context for each sense of the verb and noun, and to measure the number of nouns shared by the verb and noun contexts. In WordNet each concept has a gloss that acts as a microcontext for that concept. This rich source of linguistic information proved useful in determining the conceptual density between words, though it applies only to verb-noun pairs and not to adjectives or adverbs.

We developed an algorithm that takes a semantically untagged verb-noun pair and a ranking of noun senses (as determined by Algorithm 1) as its input and gives a sense-tagged verb-noun pair as output. Given a verb-noun pair V - N, we use WordNet to determine the possible senses of the verb and the noun, <v_1, v_2, ..., v_h> and <n_1, n_2, ..., n_l>, respectively. Then we use Algorithm 1 to rank the senses of the noun. Only the first t senses in this ranking are considered; the rest are dropped to reduce the computational complexity. For each possible pair v_i - n_j, the conceptual density C_ij is computed as follows:

1. Extract all glosses from the subhierarchy including v_i (the rationale for selecting the subhierarchy is explained below).
2. Determine the nouns in these glosses. These constitute the noun context of the verb. Each such noun is stored together with a weight w that indicates the level, in the subhierarchy of the verb concept, at which the gloss containing the noun was found.
3. Determine the nouns in the noun subhierarchy including n_j.
4. Determine the conceptual density C_ij of the common concepts between the nouns obtained at Step 2 and the nouns obtained at Step 3, using the metric

C_ij = ( Σ_{k=1}^{|cd_ij|} w_k ) / log(descendants_j)    (1)

where |cd_ij| is the number of common concepts between the hierarchies of v_i and n_j, w_k represents the levels of the nouns in the hierarchy of verb v_i, and descendants_j is the total number of words within the hierarchy of noun n_j.

Given the conceptual density C_ij, the last step of Algorithm 2 ranks each pair v_i - n_j, for all i and j.
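A rough sketch of this computation over NLTK's WordNet follows. The gloss tokenization and the level-weighting scheme are our simplifying assumptions; the paper leaves both unspecified, and the hierarchy traversal here is deliberately naive:

```python
# Sketch of Algorithm 2's conceptual density, assuming NLTK's WordNet.
import math
from nltk.corpus import wordnet as wn

def noun_context(verb_synset, weight=lambda level: 1.0 / (1 + level)):
    """Nouns from the glosses of the verb's subhierarchy, weighted by the
    level at which each gloss was found (weighting scheme is assumed)."""
    context, seen = {}, set()
    frontier = [verb_synset.root_hypernyms()[0]]   # highest hypernym h_i
    level = 0
    while frontier:
        next_frontier = []
        for s in frontier:
            if s in seen:
                continue
            seen.add(s)
            for word in s.definition().split():    # crude gloss tokenizer
                if wn.synsets(word, pos=wn.NOUN):   # keep words usable as nouns
                    context.setdefault(word, weight(level))
            next_frontier.extend(s.hyponyms())
        frontier, level = next_frontier, level + 1
    return context

def conceptual_density(verb_synset, noun_synset):
    """C_ij = (sum of w_k over common concepts) / log(descendants_j)."""
    verb_nouns = noun_context(verb_synset)
    hierarchy = {noun_synset} | set(noun_synset.closure(lambda s: s.hyponyms()))
    noun_words = {name for s in hierarchy for name in s.lemma_names()}
    descendants = max(len(noun_words), 2)          # guard against log(1) = 0
    common = sum(w for word, w in verb_nouns.items() if word in noun_words)
    return common / math.log(descendants)
```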
Rationale for Algorithm 2. This algorithm capitalizes on WordNet's glosses, each of which explains a concept and provides one or more examples of its typical usage. To determine the most appropriate noun and verb hierarchies, we performed some experiments using SemCor and concluded that the noun subhierarchy should include all the nouns in the class of n_j. The subhierarchy of verb v_i is taken as the hierarchy of the highest hypernym h_i of the verb v_i. It is necessary to consider a larger hierarchy than just the one provided by synonyms and direct hyponyms: since we replaced the role of a corpus with glosses, we achieved better results with more glosses. Still, we do not want to enlarge the context too much. Nouns with a big hierarchy tend to have a larger value of |cd_ij|, so the weighted sum of common concepts is normalized with respect to the dimension of the noun hierarchy. Since a hierarchy's size grows exponentially with its depth, we used the logarithm of the total number of descendants in the hierarchy, log(descendants_j). We experimented with a few other metrics, but after running the program on several examples, the formula in Equation 1 provided the best results.

Evaluation of the WSD Method

Table 2 shows the overall results of using Algorithm 1 followed by Algorithm 2 on the 384 word pairs. Comparing the Table 2 results with those of Table 1 shows the percentage increase in accuracy contributed by Algorithm 2 beyond Algorithm 1.

Table 1. Accuracy statistics for 384 word pairs using Algorithm 1.

             Top 1    Top 2    Top 3    Top 4
Noun         76%      83%      86%      98%
Verb         60%      68%      86%      87%
Adjective    79.8%    93%
Adverb       87%      97%

Table 2. Final results obtained for 384 word pairs using both Algorithms 1 and 2.

             Top 1    Top 2    Top 3    Top 4
Noun         86.5%    96%      97%      98%
Verb         67%      79%      86%      87%
Adjective    79.8%    93%
Adverb       87%      97%

To our knowledge, there is only one other method, recently reported, that disambiguates unrestricted nouns, verbs, adverbs, and adjectives in texts.3 That method uses WordNet and attempts to exploit sentential and discourse contexts; it is based on the idea of semantic distance between words and on lexical relations. There are several accurate statistical methods, such as the one presented by Yarowsky,4 but they disambiguate only one part of speech (nouns in this case) and focus on only a few words because they lack training corpora. Table 3 presents a comparison between our results and the results reported in those papers. The baseline for the comparison is the occurrence of the first senses from WordNet.

Table 3. A comparison with other WSD methods.

             Baseline    Stetina    Yarowsky    Our method
Noun         80.3%       85.7%      93.9%       86.5%
Verb         62.5%       63.9%                  67%
Adjective    81.8%       83.6%                  79.8%
Adverb       84.3%       86.5%                  87%
Average      77%         80%                    80.1%

For applications such as query expansion in information retrieval, our method has the additional advantage that it can consider the first two senses for each word, in which case the average accuracy (as determined in Table 2) is 91 percent.

QUERY EXPANSION

The technology of query expansion is almost 30 years old.5 It can be used either to broaden the set of documents retrieved or to increase the retrieval precision. In the former case, the query is expanded with terms similar to those in the original query, while in the latter case the expansion procedure adds completely new terms. We take the first approach, using WordNet to find words semantically related to the concepts in the original query.
(An example of the second technique is the Smart system, developed at Cornell University, which uses words derived from documents relevant to the original query.6)

The query expansion module in our system has two main functions:

■ the construction of similarity lists using WordNet, and
■ the formation of the actual query.

Once we have a sense ranking for each word of the input sentence, it is relatively easy to use WordNet's rich semantic information to identify many words that are semantically similar to a given input word. Doing this increases the chance of finding more answers to input queries. WordNet can provide semantic similarity between words that belong to the same synonym set. Consider, for example, the word activity. WordNet gives seven senses for this word. The synset for the first sense includes two other synonyms, action and activeness. The similarity list for this sense of the word is therefore

W = {action, activity, activeness}

The efficacy of expanding a query for searches in large text collections was investigated by Voorhees.7 She used WordNet to experiment with four expansion strategies:

■ by synonyms only;
■ by synonyms plus all descendants in an is-a hierarchy;
■ by synonyms plus parents and all descendants in an is-a hierarchy; and
■ by synonyms plus any synset directly related to the given synset.

Her results showed no significant differences in the precision obtained with any of these four expansion strategies.

Let us denote by x_i the words of a question or sentence, and by W_i = {x_i, x_i^k} the similarity lists provided by WordNet for each word x_i. The elements of a list are x_i^k, where k enumerates the elements in each list (that is, the words on the same level of similarity with the word x_i). We can now use these lists to formulate the actual query, using the Boolean operators accepted by current search engines. The OR operator links words within a similarity list W_i, while the AND and NEAR operators link the similarity lists. While different combinations of similarity lists linked by AND or NEAR operators are possible, the two basic forms

W_1 AND W_2 AND ... AND W_n
W_1 NEAR W_2 NEAR ... NEAR W_n

give, respectively, the maximum and the minimum number of documents retrieved. In most cases, the maximum format gathered thousands of documents, while the minimum format almost always returned null results. We can assume that any documents containing the answers must be among the large number of documents provided by the AND operators, but the search engine fails to rank them at the top of the list. Thus, we sought a new operator that would filter out many irrelevant texts.
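A minimal sketch of this query-formulation step, emitting AltaVista-style advanced-search syntax as described above (the quoting conventions are our assumption):

```python
# OR within a similarity list, AND (or NEAR) between similarity lists.
def formulate(lists: list[list[str]], between: str = "AND") -> str:
    """Build the maximum (AND) or minimum (NEAR) form of a query."""
    groups = ["(" + " OR ".join(f'"{w}"' for w in wlist) + ")"
              for wlist in lists]
    return f" {between} ".join(groups)

W1 = ["tax", "taxation", "revenue enhancement"]
W3 = ["salary", "wage", "pay", "earnings", "remuneration"]
print(formulate([W1, W3]))          # maximum form: W1 AND W3
print(formulate([W1, W3], "NEAR"))  # minimum form: W1 NEAR W3
```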
POSTPROCESSING WITH A NEW OPERATOR

Our approach to filtering documents is to first search the Internet using weak operators (AND, OR) and then to further search the resulting large set of documents using a more restrictive operator. For this second phase, we propose the following additional operator:

PARAGRAPH n ( ... similarity lists ... )

The PARAGRAPH operator searches like an AND operator for the words in the similarity lists, but with the constraint that the words belong to at most n consecutive paragraphs, where n is a positive integer. The parameter n selects the number of paragraphs, thus controlling the size of the text retrieved from a document considered relevant. The rationale is that the requested information is most likely found in a few paragraphs rather than dispersed over an entire document. (A similar idea can be found in Callan.8)

To apply this new operator, the documents gathered from the Internet must be segmented into sentences and paragraphs. Separating a text into sentences is an easy task that can be solved using punctuation alone. However, the unstructured texts on the Web make paragraph segmentation much more difficult. Both Callan8 and Hearst9 have done work in this direction, but their methods work only for structured texts containing lexical separators known a priori (for example, a tag or an empty line). Thus, we had to use a method that covers almost all possible paragraph separators that can occur in Web texts. The paragraph separators we have considered so far are HTML tags, empty lines, and paragraph indentations. We give a complete example of our system in the sidebar "Finding a Relevant Answer: A Query Example."
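A sketch of the PARAGRAPH n operator under these assumptions follows: split a fetched document on the separators just named, then keep any window of n consecutive paragraphs in which every similarity list is matched by at least one of its words. The separator regex is our approximation of the segmentation the paper describes.

```python
import re

# Paragraph separators: HTML tags, empty lines, and line indentation.
SEPARATORS = re.compile(r"<p[^>]*>|<br\s*/?>|\n\s*\n|\n[ \t]+", re.IGNORECASE)

def paragraphs(text: str) -> list[str]:
    return [p.strip() for p in SEPARATORS.split(text) if p and p.strip()]

def paragraph_search(text: str, lists: list[list[str]], n: int = 2) -> list[str]:
    """PARAGRAPH n: windows of n consecutive paragraphs that contain at
    least one word from every similarity list."""
    paras = paragraphs(text)
    hits = []
    for start in range(len(paras) - n + 1):
        window = " ".join(paras[start:start + n]).lower()
        if all(any(w.lower() in window for w in wlist) for wlist in lists):
            hits.append("\n".join(paras[start:start + n]))
    return hits
```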
Sidebar: Finding a Relevant Answer: A Query Example

Suppose you want to answer the question: "How much tax does an average salary worker pay in the United States?" The linguistic processing module (shown in Figure 1 of the main text) identified keywords, including part-of-speech tags, which were then ranked for word sense as follows:

x_1 = (tax), pos = noun, sense #1/1
x_2 = (average), pos = adjective, sense #4/5
x_3 = (salary), pos = noun, sense #1/1
x_4 = (the United States), pos = noun, sense #1/2
x_5 = (worker), pos = noun, sense #1/4
x_6 = (pays), pos = verb, sense #1/7

The sense number indicates the WordNet sense that resulted from disambiguation among all possible senses; for instance, the adjective average has five senses, and the system picked sense #4. These keywords are the input for the next step of our system. Using the similarity relation encoded in the WordNet synsets, it yields the following six similarity lists:

W_1 = {tax, taxation, revenue enhancement}
W_2 = {average, intermediate, medium, middle}
W_3 = {salary, wage, pay, earnings, remuneration}
W_4 = {United States, United States of America, America, US, U.S., USA, U.S.A.}
W_5 = {worker}
W_6 = {pay}

These lists are used to formulate queries for the search engine. Table A shows some queries and the number of documents retrieved by AltaVista.

Table A. Query results with various combinations of operators.

Query                                                      No. of documents
1  W_1 AND W_2 AND W_3 AND W_4 AND W_5 AND W_6             49,182
2  W_1 AND (W_2 NEAR W_3) AND W_4 AND W_5 AND W_6          9,766
3  W_1 NEAR (W_2 NEAR W_3) AND W_4 AND W_5 AND W_6         976
4  W_1 NEAR W_2 NEAR W_3 NEAR W_4 NEAR W_5 NEAR W_6        1 (not relevant)
5  W_1 AND {average W_3} AND W_4 AND W_5 AND W_6           9,045
6  W_1 NEAR {average W_3} NEAR W_4 NEAR W_5 NEAR W_6       0

Though AltaVista has one of the most powerful sets of operators among search engines today, the ranking it provides is of no use to us here. None of the 10 leading documents in any category provided the desired information. Nor did the single document fetched by Query 4:

"...The proposed tax cut, and the bigger one promised for next year, if enacted, will be paid for by the Social Security wage taxes of middle and low-income workers of America. Employees have been willing to pay these taxes because of the promise of guaranteed Social Security retirement benefits. This Republican tax bill is a betrayal of the low and middle-income workers. The unfairness of these proposals is breathtaking."

Analysis of the table results indicates a gap in the volume of documents retrieved with the AltaVista operators. For instance, using only the AND operator (Query 1) obtained 49,182 documents, but the NEAR operator (Queries 4 and 6) produced only one irrelevant document and zero documents, respectively. The NEAR operator seems too restrictive, yet it still fails to identify the right answer. Various combinations of AND and NEAR operators achieved no better results. Using the PARAGRAPH operator, however, the system found a relevant answer:

"In 1910, American workers paid no income tax. In 1995, a worker earning an average wage of $26,000 pays about 24% (about $6,000) in income taxes. The average American worker's pay has risen greatly since 1910. Then, the average worker earned about $600 per year. Today, the figure is $26,000."
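For illustration, the query builder sketched earlier in this article can reproduce query 1 of Table A from the sidebar's six similarity lists (word lists exactly as given above; the `formulate` helper is our sketch, not the authors' code):

```python
# Rebuilding query 1 (the maximum AND form) from the sidebar's lists.
W = [["tax", "taxation", "revenue enhancement"],
     ["average", "intermediate", "medium", "middle"],
     ["salary", "wage", "pay", "earnings", "remuneration"],
     ["United States", "United States of America", "America",
      "US", "U.S.", "USA", "U.S.A."],
     ["worker"],
     ["pay"]]
print(formulate(W))  # W_1 AND W_2 AND ... AND W_6
```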
TEST RESULTS

To test our system, we used 50 questions from real Internet searches and 50 questions derived from the 50 topics defined for ad hoc queries at the Sixth Text Retrieval Conference (TREC-6),10 cosponsored by the U.S. National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA). Figure 2 presents an example topic from the TREC-6 ad hoc collection. Each topic is a frame-like data structure with the following fields:

■ <num> identifies the topic number;
■ <title> classifies the topic within a domain;
■ <desc> describes the topic briefly (for TREC-6, this section was intended to be an initial search query); and
■ <narr> provides a further explanation of what relevant material may look like.

Figure 2. Example ad hoc topic and its data structure:

<num> Number: 301
<title> International Organized Crime
<desc> Description: Identify organizations that participate in international criminal activity, and, if possible, collaborating organizations and the countries involved.
<narr> Narrative: A relevant document must as a minimum identify the organization and the type of illegal activity (e.g., Colombian cartel exporting cocaine). Vague references to international drug trade without identification of the organization(s) involved would not be relevant.

We edited the <desc> field to derive natural language questions similar to those normally asked by real users searching the Internet. For example, from the topic presented above, the derived question was "Which are some of the organizations participating in international criminal activity?" Let us denote the two sets of questions as REAL and TREC. In our experiment, the REAL queries posed by users could usually be classified as concrete queries, that is, based on specialized knowledge of a domain, while the TREC topics led to more abstract queries.11

Table 4 presents five randomly selected questions from the TREC set and five questions from the REAL set, together with the results obtained. Each table cell contains two numbers: the number of documents (or, for the PARAGRAPH operator, paragraphs) retrieved for the question, followed by the number of relevant documents or paragraphs found in the top 10 ranking. The "AND x_i" and "NEAR x_i" columns contain the results when the AND and NEAR operators were applied to the input words x_i. Replacing the words x_i with their similarity lists derived from WordNet increased the number of documents retrieved, as expected; these results appear in the "AND W_i" and "NEAR W_i" columns. The last column contains the number of paragraphs extracted when the operator PARAGRAPH 2 (meaning two consecutive paragraphs) was applied to words from the similarity lists. The results were encouraging; the number of documents retrieved was small, and correct answers were found in almost all cases.

Table 4. A sample of the results for randomly selected questions from the TREC and REAL sets (each cell: retrieved / relevant in top 10).

TREC questions                                       AND x_i      NEAR x_i    AND W_i      NEAR W_i    PARAGRAPH W_i
Which are some of the organizations participating
in international criminal activity?                  27,716 / 0   3 / 1       48,133 / 0   5 / 1       6 / 1
Is the disease of Poliomyelitis (polio) under
control in the world?                                9,432 / 1    13 / 3      10,271 / 2   15 / 3      40 / 11
Which are some of the positive accomplishments of
the Hubble telescope since it was launched?          178 / 1      4 / 0       504 / 1      4 / 0       2 / 1
Which are some of the endangered mammals?            32,133 / 0   6,214 / 1   32,133 / 0   6,214 / 1   150 / 80
Which are the most crashworthy, and least
crashworthy, passenger vehicles?                     246 / 0      5 / 1       260 / 1      5 / 1       15 / 6

REAL questions
Where can I find cheap airline fares?                1,360 / 2    3 / 3       2,608 / 2    35 / 5      61 / 34
Find out about Fifths disease.                       2 / 0        0 / 0       30 / 1       0 / 0       10 / 1
What is the price of ICI?                            4,503 / 0    202 / 0     10,221 / 0   575 / 1     117 / 10
Where can I shop online for Canada?                  36,049 / 0   858 / 1     36,049 / 0   858 / 1     15 / 8
What are the average wages for event planners?       6 / 1        0 / 0       70 / 0       0 / 0       6 / 6
Table 5 presents a summary of results for the 100 questions used to test our system.

Table 5. Summary of results for 50 questions from the TREC collection and 50 questions from frequently asked queries on the Internet.

                                        AND x_i   NEAR x_i   AND W_i   NEAR W_i   PARAGRAPH W_i
Number of documents retrieved
  Average question from the TREC set    7,746     258        25,803    332        26.04
  Average question from the REAL set    13,510    1,843      28,715    3,003      48.95
Precision
  Average question from the TREC set    1.6%      4.8%       4.4%      8.8%       43%
  Average question from the REAL set    6.3%      12.43%     6.09%     13.65%     27.7%
Productivity
  Average question from the TREC set    36%       44%        20%       36%        90%
  Average question from the REAL set    30%       42%        28%       48%        66%

First, the table shows the number of documents retrieved for an average TREC and REAL question. Naturally, the query expansion increased the number of documents by a factor varying from 1 (an equal number of documents retrieved for the unexpanded and expanded queries) to 32. Instead of hundreds, thousands, or even tens of thousands of documents, the PARAGRAPH operator returns on average just 26 and 48 documents for the TREC and REAL questions, respectively. Moreover, instead of returning full documents, the new operator identifies only the portion of the document where the answer lies; this constitutes another reduction factor not captured in the table.

Next, Table 5 shows the precision, or ratio between the number of relevant documents retrieved and the total number of documents retrieved. Because it is impractical to search for the relevant documents among all those retrieved by an AltaVista query, we considered only the relevant documents among the first ten ranked documents. In the case of PARAGRAPH, however, the number of paragraphs retrieved is small, so the precision was computed over the entire set. With the PARAGRAPH operator, the actual precision reaches 43 percent for the TREC questions and 27.7 percent for the REAL questions. (The difference can be explained by the short questions that Internet users tend to ask, which retrieve a very large number of documents and make it much harder to find relevant information.)

The biggest gain in Table 5, however, is in system productivity: 90 percent of the questions from the TREC set were answered correctly, and 66 percent from the REAL set. This is a significant improvement over current technology.

CONCLUSION

In general, because the range of questions is so broad, it is difficult to compare the performance of question-answering systems. Other systems implemented for the REAL type of questions operate in narrow domains. For example, Burke, Hammond, and Kozlovsky12 describe a system that uses the "Frequently Asked Questions" files associated with many Usenet groups. The results obtained during the TREC tests can be compared with the work described in Voorhees,13 though the latter retrieves information from very large text collections rather than the Internet. Voorhees reported an average precision of 36 percent for full-topic statements. Our result of 43 percent precision in retrieving information for narrow questions over heterogeneous Internet domains is thus encouraging.

Our system can still fail to return relevant answers for some questions, for example, questions with very specialized terms. The test results nevertheless demonstrate a substantial increase in both the precision and the percentage of queries answered correctly, while reducing the amount of text presented to the user in comparison with current Internet search engine technology.

The system could easily be extended to restrict the output to several sentences instead of paragraphs. Also, a more flexible NEAR search could be implemented with a new operator SEQUENCE(W_1, d, W_2, d, ..., W_n), where d is a numeric variable that indicates the distance between the words in the W lists for which the search is done.
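Since the paper only proposes SEQUENCE without defining it precisely, the following is one plausible reading, sketched as a greedy matcher: each similarity list must be matched, in order, within d tokens of the previous list's match.

```python
# Purely illustrative sketch of the proposed SEQUENCE(W1, d, W2, ..., Wn)
# operator; the greedy left-to-right matching is our assumption.
def sequence_match(tokens: list[str], lists: list[list[str]], d: int) -> bool:
    """True if every list is matched, in order, within d tokens of the
    previous list's match position."""
    pos = None
    for wlist in lists:
        wanted = {w.lower() for w in wlist}
        start = 0 if pos is None else pos + 1
        stop = len(tokens) if pos is None else min(len(tokens), pos + 1 + d)
        for i in range(start, stop):
            if tokens[i].lower() in wanted:
                pos = i
                break
        else:
            return False    # this list found no match within distance d
    return True
```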
Indexing words by their WordNet senses, so-called semantic or conceptual indexing, could also improve Internet searches. This method implies some online parsing and word-sense disambiguation that may become feasible in the not-too-distant future. Semantic indexing has the potential to improve the ranking of search results, as well as to allow information extraction of objects and their relationships (for example, see Pustejovsky et al.14).

Finally, Web searches could use compound nouns or collocations. WordNet includes thousands of word groups, for example blue-collar worker, stock market, and mortgage interest rate, that point to their respective concepts. Indexing each compound noun as one term reduces the storage space for the search engine and might further increase the precision. ■

REFERENCES

1. D. Moldovan et al., "USC: Description of the SNAP System Used for MUC-5," Proc. 5th Message Understanding Conf., Morgan Kaufmann, San Francisco, 1993, pp. 305-320.
2. E. Brill, "Some Advances in Rule-Based Part-of-Speech Tagging," Proc. 12th Nat'l Conf. on Artificial Intelligence (AAAI-94), AAAI Press, Menlo Park, Calif., 1994, pp. 256-261.
3. J. Stetina, S. Kurohashi, and M. Nagao, "General Word Sense Disambiguation Method Based on a Full Sentential Context," Proc. Workshop on Usage of WordNet in Natural Language Processing, Morgan Kaufmann, San Francisco, 1998, pp. 1-8.
4. D. Yarowsky, "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods," Proc. 33rd Ann. Meeting of the Assoc. for Computational Linguistics, Morgan Kaufmann, San Francisco, 1995, pp. 189-196.
5. G. Salton and M.E. Lesk, "Computer Evaluation of Indexing and Text Processing," The SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton, ed., Prentice Hall, Englewood Cliffs, N.J., 1971, pp. 143-180.
6. C. Buckley et al., "Automatic Query Expansion Using SMART," Proc. Third Text Retrieval Conf. (TREC-3), NIST Special Publications, Washington, D.C., 1994, pp. 69-81; available at http://trec.nist.gov/pubs/trec3/t3_proceedings.html.
7. E.M. Voorhees, "Using WordNet for Text Retrieval," WordNet: An Electronic Lexical Database, C. Fellbaum, ed., MIT Press, Cambridge, Mass., 1998, pp. 285-303.
8. J.P. Callan, "Passage-Level Evidence in Document Retrieval," Proc. 17th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, Dublin, Ireland, 1994, pp. 302-310.
9. M.A. Hearst, "Multi-Paragraph Segmentation of Expository Text," Proc. 32nd Ann. Meeting of the Assoc. for Computational Linguistics, Morgan Kaufmann, San Francisco, 1994, pp. 9-16.
10. E.M. Voorhees and D. Harman, eds., Proc. Sixth Text Retrieval Conf. (TREC-6), NIST Special Publication 500-240, Washington, D.C., 1997; available at http://trec.nist.gov/pubs/trec6/t6_proceedings.html.
11. M.K. Leong, "Concrete Queries in Specialized Domains: Known Item as Feedback for Query Formulation," Proc. Sixth Text Retrieval Conf. (TREC-6), NIST Special Publications, Washington, D.C., 1997, pp. 541-550; available at http://trec.nist.gov/pubs/trec6/t6_proceedings.html.
12. R. Burke, K. Hammond, and J. Kozlovsky, "Knowledge-Based Information Retrieval from Semi-Structured Text," Proc. AAAI Fall Symp. on AI Applications in Knowledge Navigation and Retrieval, AAAI Press, Menlo Park, Calif., 1995, pp. 15-19.
13. E.M. Voorhees, "Query Expansion Using Lexical-Semantic Relations," Proc. 17th Ann. Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, Dublin, Ireland, 1994, pp. 61-69.
14. J. Pustejovsky et al., "Semantic Indexing and Typed Hyperlinking," Proc. AAAI Spring Symp., AAAI Press, Menlo Park, Calif., 1997, pp. 120-128.

Dan I. Moldovan is a professor in the Department of Computer Science and Engineering at Southern Methodist University. His current research interests are in the field of natural language processing, particularly linguistic knowledge bases, text inference, and question-answering systems.
Moldovan received a PhD in electrical engineering and computer science from Columbia University in 1978.

Rada Mihalcea is a PhD student in the Department of Computer Science at Southern Methodist University. Her research interests are in the field of natural language processing, particularly word-sense disambiguation, information extraction, and question-answering systems. Mihalcea is a member of the AAAI and the ACL.

Readers may contact the authors at {moldovan, rada}@seas.smu.edu.