
The Design of a System for the Automatic Extraction of a Lexical Database Analogous to WordNet from Raw Text

Reinhard Rapp
reinhardrapp@gmx.de
LIF-CNRS, Aix-Marseille Université

Michael Zock
Michael.Zock@lif.univ-mrs.fr
LIF-CNRS, Aix-Marseille Université

This research was supported by a Marie Curie Intra-European Fellowship within the 7th European Community Framework Programme.

ABSTRACT


Constructing a lexical database such as WordNet manually takes several years of work and is very costly. On the other hand, methods for the automatic identification of semantically related words and for computing relational similarities based on large corpora of raw text have reached a considerable degree of maturity, with the results coming close to native speakers' performance. We describe ongoing work which aims at further refining and extending these approaches, thereby making it possible to fully automatically generate a resource similar to WordNet. The developed system will be largely language independent and is to be applied to four European languages, namely English, French, German, and Spanish. This is an outline of the approach: Starting from a raw corpus, we first compute related words by applying an algorithm based on distributional similarity. Next, to identify synsets, an algorithm for unsupervised word sense induction is applied, and each word in the vocabulary is assigned to one or (if ambiguous) several of the synsets. Finally, to determine the relations between words, a method for computing relational similarities is applied.





Categories and Subject Descriptors


[Computing methodologies]: Lexical semantics, Language resources

1. INTRODUCTION

WordNet [8] is a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept (see Figure 1 for a sample entry relating to the word motion). Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet has become an invaluable resource for processing semantic aspects of language, and it has served as a model for similar lexical databases created for other languages. However, creating a WordNet for a new language in the established way takes several years of work. It also involves countless subjective decisions which may be controversial. On the other hand, methods for the automatic identification of semantically related words based on large text corpora, which can be used to identify the synsets, have reached a considerable degree of maturity, and the results have been shown to come close to human judgement [32]. In a recent paper, Turney & Pantel [45] highlighted the success of Vector Space Models (VSM): "The success of the VSM for information retrieval has inspired researchers to extend the VSM to other semantic tasks in natural language processing, with impressive results. For instance, Rapp (2003) [29] used a vector-based representation of word meaning to achieve a score of 92.5% on multiple-choice synonym questions from the Test of English as a Foreign Language (TOEFL), whereas the average human score was 64.5%. Turney (2006) [42] used a vector-based representation of semantic relations to attain a score of 56% on multiple-choice analogy questions from the SAT college entrance test, compared to an average human score of 57%." Together with similar work conducted by other researchers, which confirmed these findings, the two papers mentioned in this citation are the basis of the methodological considerations to be described here.

Our aim is to further refine and extend these approaches, and to apply them to a new task, namely the fully automatic generation of a resource similar to WordNet. The developed system will be largely language independent and is to be applied to four major European languages, namely English, French, German, and Spanish. This paper is mainly conceptual in nature, aiming to give an overview of the various parts of a larger project. Although various foundational steps have been completed (see Section 3), this is still work in progress. For this reason and due to space constraints, further results will be presented in subsequent publications. This work, although anchored in computational linguistics, is also related and relevant to the field of ontology learning, which has attracted considerable attention in computer science since the semantic web was announced to be among the foci of the World Wide Web Consortium. In this field, a number of resources sharing some properties with WordNet have been designed and developed in recent years, among them LexInfo [3], LingInfo [4], LexOnto [7], the Linguistic Information Repository (LIR [20]), and the Linguistic Watermark Suite (LWS [25]).

Figure 1: WordNet entry for the word motion.

However, these resources are too recent to be able to claim a status similar to WordNet's. WordNet has been developed and optimized over decades; it has been scrutinized and used by a large community of scientists and, despite some criticism, the validity of its underlying principles is widely acknowledged. This is the main reason why the work described here is based on WordNet rather than on one of the above-mentioned ontologies. Another reason is that WordNet appears to be a more direct representation of native speakers' intuitions about language than most ontologies. Therefore, when working with it, it should be somewhat more straightforward to draw conclusions concerning human cognition. As far as applicable, we nevertheless take experiences from ontology building into account (for an overview see [2]).

The focus of this work is to generate lexical databases as similar as possible to the existing manually created WordNets, in a way that is easily adaptable to other languages for which no WordNets exist yet. The evaluation is conducted in a direct way by taking the existing WordNet and experimental data on word synonymy as a gold standard, and by comparing the generated data to this gold standard. An alternative would be to conduct an indirect evaluation by comparing the performance of the automatically generated WordNet to the performance of existing WordNets in some application areas, such as word sense disambiguation and information retrieval. However, applications are too numerous to consider one (or a few) of them as authoritative. For example, Rosenzweig et al. [34] list 868 papers, many of which describe applications of WordNet. For such reasons we suggest here the automatic construction of a general-purpose WordNet which is not optimized with respect to a specific application. However, looking at specific applications would be a logical next step following the work described here. Let us mention that although there has been quite some work on specific aspects of what is described here, to our knowledge no previous attempt to automatically create a WordNet-like system using state-of-the-art modules for lexical acquisition has been fully completed, with the resulting lexical database being finalized and published (ongoing work for Polish is described in [27]).

However, variations of a completely different method for automatically creating WordNets have been described and put into practice many times. They are based on the idea that a raw version of a WordNet for a new language can be created by simply translating an existing WordNet of a closely related language. Recent examples are [40] for Thai and WOLF for French (http://alpage.inria.fr/sagot/wolf-en.html). However, this methodology is rather unrelated to what we investigate here: it does not have as much cognitive plausibility, cannot produce WordNets specific to the genre of a particular corpus, and can only be applied if a WordNet of a related language is available.

2. RESEARCH METHODOLOGY AND PREVIOUS WORK


Our approach comprises three major steps, which are described in more detail in the following subsections:

1. Computing word similarities. Starting from a large part-of-speech tagged corpus of the respective language, various methods for computing related words, e.g. using syntactic parsing or latent semantic analysis, are considered. The results are evaluated by comparing them to a recently published data set comprising the 200,000 human similarity judgments from the Princeton Evocation Project (http://wordnet.cs.princeton.edu/downloads.html), in addition to the commonly used but less adequate 80-item TOEFL data set, which has been the standard so far.

2. Word sense induction. To identify each word's senses, an algorithm for unsupervised word sense induction is applied which in essence clusters the entire semantic space into synsets. Subsequently, each word in the vocabulary is assigned to one or (if ambiguous) several of the synsets. The WordNet glosses (short descriptions of the synsets) are replaced by concordances of the respective senses.

3. Conceptual relations between words. To reveal the relations between words (e.g. hyponymy, holonymy, and meronymy), we mainly use the methodology for computing relational similarities introduced by Turney [42] and refined by Pennacchiotti & Pantel [26], but we also take into account the results of related work on the construction of ontologies [2].
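To make the intended data flow explicit, the following skeleton sketches how the three steps might be composed in code. It is purely illustrative: the container types, function names and signatures are our own assumptions and do not correspond to the project's actual implementation (the individual steps are sketched in more detail in the following subsections).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical containers for the intermediate results; placeholders for illustration only.
@dataclass
class Thesaurus:
    # word -> ranked list of (related word, similarity score), the output of step 1
    neighbours: Dict[str, List[Tuple[str, float]]]

@dataclass
class SynsetInventory:
    # each induced synset is a list of (word, sense id) members, the output of step 2
    synsets: List[List[Tuple[str, int]]]

@dataclass
class RelationGraph:
    # (synset index, relation label, synset index) triples, the output of step 3
    edges: List[Tuple[int, str, int]]

def compute_word_similarities(tagged_corpus: List[List[str]]) -> Thesaurus:
    """Step 1: distributional similarity over a part-of-speech tagged corpus."""
    raise NotImplementedError

def induce_word_senses(tagged_corpus: List[List[str]], thesaurus: Thesaurus) -> SynsetInventory:
    """Step 2: unsupervised word sense induction by clustering the semantic space."""
    raise NotImplementedError

def extract_relations(tagged_corpus: List[List[str]], synsets: SynsetInventory) -> RelationGraph:
    """Step 3: relational similarities between pairs of words (or word senses)."""
    raise NotImplementedError

def build_wordnet_like_resource(tagged_corpus: List[List[str]]):
    """Compose the three steps into one pipeline producing a WordNet-like resource."""
    thesaurus = compute_word_similarities(tagged_corpus)
    synsets = induce_word_senses(tagged_corpus, thesaurus)
    relations = extract_relations(tagged_corpus, synsets)
    return synsets, relations
```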

Figure 2: Words co-occurring with red and blue.

Table 1: Performances for the TOEFL synonym test

Description | Score | Reference
Random guessing (four alternatives, of which one is correct) | 25.00% | Rapp [30]
Average non-English US college applicant taking the TOEFL | 64.50% | Landauer & Dumais [14]
Non-native speakers of English living in Australia | 86.75% | Rapp [30]
Native speakers of English living in Australia | 97.75% | Rapp [30]

2.1 Computing word similarities


In his seminal paper "Distributional Structure", Zellig S. Harris [10] hypothesized that words occurring in similar contexts tend to have similar meanings. This finding is often referred to as the distributional hypothesis. It was put into practice by Ruge [35], who showed that the semantic relatedness of two words can be computed by looking at the agreement of their lexical neighborhoods. For example, as illustrated in Figure 2, a certain degree of semantic relatedness between the words red and blue can be derived from the fact that they both frequently co-occur with words like color, dress, flower, etc., although there are other context words occurring only with one of them. If, on the basis of a large text corpus, a matrix of word co-occurrences is compiled, then the semantic similarities between words can be determined by comparing the vectors in the matrix. This can be done using any of the standard vector similarity measures, such as the cosine coefficient. Since Ruge's pioneering work, many researchers, e.g. Schütze [38], Rapp [28], and Turney [42], have used this type of distributional analysis as a basis to determine semantically related words. An important characteristic of some algorithms (e.g. as used by Ruge [35], Grefenstette [9], and Lin [17]) is that they parse the corpus and only consider co-occurrences of word pairs showing a specific relationship, e.g. a head-modifier, verb-object, or subject-object relation. Others do not parse, but perform a singular value decomposition (SVD) on the co-occurrence matrix, which also improves results ([14], [15]). As an alternative to the SVD, Sahlgren [37] uses random indexing, which is computationally less demanding.

Our aim here is to systematically compare some of the best algorithms, among them the ones described in [23] and [32], and to come up with an improved version which combines the advantages of all. In doing so we hope to be able to provide at least partial answers to questions such as the following: Does the analysis of syntax help in determining semantic similarities between words? Is it true that a dimensionality reduction of the semantic space using singular value decomposition uncovers latent semantic structures between words? Does the optimal number of dimensions as found empirically reflect cognitive processes, or can a similar behavior be achieved by simply applying various strengths of smoothing?

To be able to come up with good answers to these questions, an accurate evaluation method is necessary which allows fine-grained distinctions to be analyzed, e.g. with regard to word frequency, ambiguity, saliency or part of speech. For this purpose, many possibilities can be thought of and have been applied in the past. For example, Grefenstette [9] used available dictionaries as a gold standard, Lin [17] compared his results to WordNet, and Landauer & Dumais [14] used experimental data taken from the synonym portion of the Test of English as a Foreign Language (TOEFL). As it more directly reflects human judgements, the TOEFL data has often been preferred in the literature over dictionaries or lexical databases for the purpose of evaluation. As pointed out by Turney [42], another advantage of the TOEFL data is that it has gained considerable acceptance among researchers. The TOEFL is an obligatory test for non-native speakers of English who intend to study at universities with English as the teaching language. The data used by Landauer & Dumais had been acquired from the Educational Testing Service and comprises 80 test items. Each item consists of a problem word embedded in a sentence and four alternative words, from which the test taker is asked to choose the one with the most similar meaning to the problem word. For example, given the test sentence "Both boats and trains are used for transporting the materials" and the four alternative words planes, ships, canoes, and railroads, the subject would be expected to choose the word ships, which is supposed to be the one most similar to boats. Table 1 [32] shows a number of relevant baselines for the TOEFL test as obtained by either random guessing or by three groups of test persons with different levels of language proficiency. As can be seen, with 97.75% correct answers, the performance of the native speakers is more than 30% better than that of language learners intending to apply for admission at a US college. Systems capable of computing semantically related words can also answer the TOEFL synonym questions. Usually the context is simply discarded and the system is only offered the test word. Then the similarity scores between the test word and the four alternative words are determined, and the word with the best score is considered to be the system's answer.
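As a minimal illustration of how a distributional system of this kind answers a synonym item, consider the following sketch. The toy corpus, the window size and the use of raw co-occurrence counts (rather than the association measures and SVD discussed elsewhere in this paper) are simplifying assumptions made only for this example.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(corpus, window=2):
    """Count, for every word, how often each other word appears within +/- window."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        for i, w in enumerate(sentence):
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if i != j:
                    vectors[w][sentence[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def answer_synonym_item(vectors, problem_word, alternatives):
    """Pick the alternative whose co-occurrence vector is most similar to the problem word's."""
    return max(alternatives, key=lambda a: cosine(vectors[problem_word], vectors[a]))

# Toy corpus: "boat" and "ship" share contexts, so "ship" should win.
corpus = [
    "the boat crossed the harbour".split(),
    "the ship crossed the harbour".split(),
    "a boat carries cargo to the port".split(),
    "a ship carries cargo to the port".split(),
    "the plane landed at the airport".split(),
    "the canoe drifted down the river".split(),
]
vecs = cooccurrence_vectors(corpus)
print(answer_synonym_item(vecs, "boat", ["plane", "ship", "canoe", "railroad"]))  # -> 'ship'
```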
Let us now look at the state of the art in automatically solving the standard TOEFL synonym test comprising 80 items. In the literature, there are essentially three basic approaches. One is lexicon-based, another is corpus-based, and the third, which is usually referred to as hybrid, is a mixture of the first two.

With the lexicon-based approaches (used e.g. by Leacock & Chodorow [16], Hirst & St-Onge [11], and Jarmasz & Szpakowicz [12]), a given word is looked up in a large lexicon (or lexical database) of synonyms and it is determined whether there is a match between any of the retrieved synonyms and the four alternative words presented in the TOEFL question. If there is a match, the respective word is considered to be the solution to the question. Otherwise, the procedure can be extended to indirect matches, e.g. involving synonyms of synonyms. This procedure works rather well if the lexicon has a good coverage of the respective vocabulary. In the literature, typically WordNet [8] has been used, and performances of up to 78.75% on the TOEFL task have been reported [12]. On the other hand, both the TOEFL questions and the lexicons are handcrafted and therefore reflect human intuitions. So it is not surprising that a high correspondence between these two closely related types of human intuitions can be observed. In our setting, as the purpose of our work is to generate a lexical database similar to WordNet, it would be contradictory to presuppose WordNet for the similarity computations.

Therefore we concentrate here on the second method. This is a corpus-based machine learning approach which appears to be more interesting from a cognitive perspective, as it potentially better captures the relevant aspects of human vocabulary acquisition. Table 2 (derived from [32] and the ACL Wiki) gives an overview of the current state of the art with regard to performance figures on the TOEFL synonym test. With 90.9% and 92.5% correct answers, the best performances were achieved by Pantel & Lin [23] and Rapp [32]. This is why we will concentrate here on combining these two rather different approaches, the first being syntax-based and the second using singular value decomposition for dimensionality reduction of the semantic space. The intention is to introduce some amount of syntax to the second approach by operating it on a part-of-speech-tagged rather than a raw text corpus. This should not only lead to better results, but is also necessary to obtain WordNet-like entries which distinguish between parts of speech.

The third approach is hybrid ([33], [13], [18], [12], [44]) and is basically a fall-back strategy for the first approach: by default the lexicon-based approach is used, as its results tend to be more reliable. However, if the relevant words cannot be found in the lexicon, then it is of course better to use a corpus-based approach rather than to guess randomly. With a performance of up to 97.5% on the TOEFL synonym test [44], the results of the hybrid approach are the best. However, it is nevertheless inappropriate for our research because, like the lexicon-based approach, it also presupposes readily available lexical knowledge.

Although the scores from the 80-item TOEFL synonym test, which has been the standard so far, give some idea of the overall performance of an algorithm, it can be argued that this test set is rather small and therefore prone to statistical variation. Also, this test was not designed to measure the strengths and weaknesses of various algorithms concerning particular properties of the input words, e.g. their frequency, saliency, part of speech, or ambiguity. We will therefore base our future evaluation on a much larger data set, namely the 200,000 sense-specific human similarity judgments that were collected in the Princeton Evocation Project.

Table 2: Comparison of corpus-based approaches

Characterization of algorithm | Score | Ref.
Latent semantic analysis | 64.38% | [14]
Raw co-occurrences and city-block | 69.00% | [28]
Dependency space | 73.00% | [22]
Pointwise mutual information (MI) | 73.75% | [41]
PairClass | 76.25% | [43]
Pointwise mutual information | 81.25% | [39]
Context window overlapping | 82.55% | [36]
Positive pointwise MI with cosine | 85.00% | [5]
Generalized latent semantic analysis | 86.25% | [19]
Similarities between parsed relations | 90.90% | [30], [23]
Modified latent semantic analysis | 92.50% | [32]

Such a large-scale data set will allow a much more detailed analysis of the behavior of the algorithms, and we would like to see it become the future gold standard for such comparisons.
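One way to use such a collection of judgments as a gold standard is to correlate the similarity scores produced by an algorithm with the human ratings, for example by rank correlation. The sketch below illustrates this; the word pairs, ratings and system scores are invented for illustration, and SciPy is assumed to be available (the real evaluation would of course draw on the Evocation data itself).

```python
from scipy.stats import spearmanr

# Invented (word1, word2, human rating) triples; in the real setting these
# would come from the Princeton Evocation data set.
human_judgments = [
    ("car", "automobile", 9.0),
    ("car", "bicycle", 5.5),
    ("car", "banana", 0.5),
    ("cup", "mug", 8.5),
    ("cup", "tree", 1.0),
]

def similarity(w1, w2):
    """Placeholder for the system's similarity score (e.g. cosine of association vectors)."""
    toy_scores = {("car", "automobile"): 0.92, ("car", "bicycle"): 0.55,
                  ("car", "banana"): 0.04, ("cup", "mug"): 0.88, ("cup", "tree"): 0.07}
    return toy_scores[(w1, w2)]

system_scores = [similarity(w1, w2) for w1, w2, _ in human_judgments]
human_scores = [rating for _, _, rating in human_judgments]

rho, p_value = spearmanr(system_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```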

2.2 Word sense induction


The previous step of introducing a similarity measure that generates judgements of word relatedness akin to human intuition lays the foundation for the next steps, which are to find out about possible senses (called synsets in the WordNet terminology) and to assign to each word at least one of them. This is what we call word sense induction. Pantel & Lin [23] have conducted such work with considerable success. Using a specially designed clustering algorithm (clustering by committee), they divided the semantic space into thousands of basic concepts which can be seen as analogous to WordNet's synsets. As the clustering is based on similarities between global co-occurrence vectors, i.e. vectors that are based on the co-occurrence counts from an entire corpus, we call this global clustering. Since (by looking at differential vectors) their algorithm allows a word to belong to more than one cluster, each cluster a word is assigned to can be considered as one of its senses. However, there is a potential problem with this approach, as it allows only as many senses as there are clusters, thereby limiting the granularity of the meaning space. This problem is avoided by Neill [21], who uses local instead of global clustering. In this case, to find the senses of a given word, only its close associations are clustered, i.e. for each word new clusters will be found. Concerning the type of co-occurrence vectors used, most approaches to word sense induction (including [23] and [1]) that have been published so far rely on global co-occurrence vectors based on string identity. Since most words are semantically ambiguous, this means that these vectors reflect the sum of the contextual behavior of a word's underlying senses, i.e. they are mixtures of all senses occurring in the corpus. However, since reconstructing the sense vectors from such mixtures is difficult, the question arises whether we really need to base our work on the mixtures or whether there is not a more direct way to observe the contextual behavior of the senses, thereby avoiding the mixtures right from the beginning.

Table 3: Term/context matrix for the word palm (rows: arm, beach, coconut, finger, hand, shoulder, tree; columns: contexts c1 to c6; dots mark the contexts in which each word occurs)

Here we suggest to compare Pantel & Lin's approach [23] to a method outlined in [31], which looks at local rather than global co-occurrence vectors. As can be seen from human performance, in almost all cases the local context of an ambiguous word is sufficient to disambiguate its sense. This means that if we consider words within their local context, they are hardly ever ambiguous. The basic idea is now that we do not cluster the global co-occurrence vectors of the words (based on an entire corpus) but local ones, which are derived from the various contexts of a single word. That is, the computations are based on the concordance of a word. Also, we do not consider a term/term but a term/context matrix. This means that for each word to be analyzed we get an entire matrix.

Let us illustrate this using the ambiguous word palm, which can refer to a tree or to a part of the hand. If we assume that our corpus contains six occurrences of palm, i.e. that there are six local contexts, then we can derive six local co-occurrence vectors for palm. Considering only strong associations to palm, these vectors could, for example, look as shown in Table 3. The dots in the matrix indicate whether the respective word occurs in a particular context or not. We use binary vectors since we assume short contexts, where words usually occur only once. The matrix reveals that the contexts c1, c3, and c6 seem to relate to the hand sense of palm, whereas the contexts c2, c4, and c5 relate to its tree sense. These intuitions can be reproduced by using a method for computing vector similarities such as the cosine coefficient. If we then apply an appropriate clustering algorithm to the context vectors, we should obtain the two expected clusters. Each of the two clusters corresponds to one of the senses of palm, and the words closest to the geometric centers of the clusters should be good descriptors of each sense.

However, as matrices of the above type can be extremely sparse, clustering is a difficult task, and common algorithms often produce sub-optimal results. Fortunately, the sparsity problem can be minimized by reducing the dimensionality of the matrix. An appropriate algebraic method which has the capability to reduce the dimensionality of a rectangular or square matrix in an optimal way is singular value decomposition. As shown by Schütze [38], by reducing the dimensionality a generalization effect can be achieved which often yields improved results. The approach that we suggest here involves reducing the number of columns (contexts) and then applying a clustering algorithm to the row vectors (words) of the resulting matrix. This should work well, as it is one of the strengths of SVD to reduce the effects of sampling errors and to close gaps in the data. Although SVD is computationally demanding, previous experience shows that it is feasible to deal with matrices of several hundred thousand dimensions [32].

In summary, we will compare two fundamental types of algorithms for word sense induction, one based on global and the other on local clustering of words. If the results are similar, we will give preference to global clustering, as it better matches the WordNet approach. On the other hand, local clustering makes it easier to provide contexts for each sense, which will be used as replacements for the WordNet glosses. Empirical verification of these issues may give us important arguments to question some underlying principles of WordNet.
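The following sketch illustrates the local approach on the palm example: a binary term/context matrix is built from the local contexts of a single ambiguous word, the number of context columns is reduced by SVD, and the row (word) vectors are then clustered into groups that are meant to correspond to senses. The invented contexts, the choice of k-means with two clusters, and the use of scikit-learn are simplifications for illustration, not the actual algorithms compared in the project.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Six invented local contexts of the ambiguous word "palm" (cf. Table 3).
contexts = [
    "he held the coin in the palm of his hand",          # hand sense
    "a tall palm tree grew on the sandy beach",          # tree sense
    "she traced a line across her palm with a finger",   # hand sense
    "coconut palm plantations cover the tropical coast", # tree sense
    "palm leaves swayed above the beach",                # tree sense
    "he pressed his palm against her shoulder",          # hand sense
]

# Binary term/context matrix: one row per context word, one column per context.
vocab = sorted({w for c in contexts for w in c.split() if w != "palm"})
matrix = np.array([[1 if w in c.split() else 0 for c in contexts] for w in vocab])

# Reduce the number of context columns with an SVD, then cluster the row (word) vectors.
reduced_rows = TruncatedSVD(n_components=2, random_state=0).fit_transform(matrix)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced_rows)

# Each cluster of words is taken as a descriptor set for one induced sense of "palm".
for sense in sorted(set(labels)):
    print(f"induced sense {sense}:", [w for w, l in zip(vocab, labels) if l == sense])
```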

2.3 Conceptual relations between words


WordNet distinguishes a number of relations between words, e.g. hyponymy, holonymy, and meronymy. It has long been unclear how such relations could be automatically extracted from a corpus. Caraballo & Charniak [6] did some pioneering work, and so did Turney [42], who further elaborated on the concept of so-called relational similarities. The basic idea is to consider pairs of co-occurring content words, and to assume that two pairs have the same relation if the content words constituting a pair are separated by the same sequence of function words. For example, the sequence "in the" may indicate a part-of relationship (holonymy), whereas "and" is more likely to express the idea of coordination between two similar terms, hence synonymy. The current state of the art in this respect is the Espresso system [26] and follow-up work [24]. What we suggest below is roughly along these lines, but we will also take related work from computer science into account, dealing with association rule mining, social annotations, formal concept analysis, and ontology learning [2].

An unsupervised approach would involve clustering all possible pairs of content words co-occurring within a distance of about four words, according to their separating word sequences, and manually assigning some meaning to the most salient clusters. Alternatively, as the aim is to reproduce the relations as defined in WordNet, a weakly supervised bootstrapping approach is suggested. For each of the WordNet relations a few typical pairs will be manually chosen and taken as seeds. Next, their behavior with regard to typical separating word sequences is quantitatively analyzed. The typical separators would then serve as samples to find more word pairs with the same relations. That is, using this bootstrapping mechanism, the set of seeds is extended by assigning the appropriate type of relation to each word pair. Work along these lines has already been successfully conducted for Polish and is described in detail by Piasecki et al. [27].
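To make the separating-sequence idea concrete, the sketch below collects pairs of nearby content words and records the tokens between them; pattern counts of this kind are the raw material for both the unsupervised clustering and the seed-based bootstrapping described above. The simple stopword-based notion of "content word" and the toy sentences are assumptions for illustration, not the actual Espresso machinery.

```python
from collections import Counter

# Crude stand-in for function-word detection; the real system would rely on POS tags.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to",
                  "is", "are", "was", "were", "with", "for"}

def harvest_patterns(sentences, max_gap=3):
    """For each pair of nearby content words, count the function-word sequence between them."""
    patterns = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, w1 in enumerate(tokens):
            if w1 in FUNCTION_WORDS:
                continue
            # Look for the nearest following content word within the allowed gap.
            for j in range(i + 1, min(len(tokens), i + max_gap + 2)):
                w2 = tokens[j]
                if w2 in FUNCTION_WORDS:
                    continue
                gap = tokens[i + 1:j]
                if all(t in FUNCTION_WORDS for t in gap):
                    patterns[(w1, " ".join(gap), w2)] += 1
                break
    return patterns

sentences = [
    "the handle of the door was broken",
    "the wheel of the car spun freely",
    "cats and dogs were everywhere",
]
for (w1, gap, w2), freq in harvest_patterns(sentences).items():
    print(f"{w1} [{gap}] {w2}: {freq}")
# Separators such as "of the" or "in the" hint at part-of relations (holonymy),
# whereas "and" hints at coordination between similar terms, as discussed above.
```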

We intend to conduct such work not only for several languages, but also to go beyond this: to improve results, in a further step an analogous procedure will be applied to a word sense disambiguated corpus. This means that instead of considering the relations between pairs of words, we will consider the relations between pairs of word senses. The reasoning is that different senses of an ambiguous word may well have different types of relations with another word (or another word's sense). Words represent mixtures of senses, and looking at these mixtures leads to blurred results. To avoid this, we must first perform a word sense disambiguation on the entire corpus, and then apply the procedure for relation detection. From the previous step (of word sense induction) we already have the possible word senses readily available, so that using available software (including some of our own) it is relatively straightforward to conduct a word sense disambiguation. It can be expected that in the disambiguated corpus the relations between word senses are more salient than they would be for words. Nevertheless, it will be necessary to optimize the algorithm in a procedure of stepwise refinement by comparing its results to a representative subset of the relations found in WordNet.
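A minimal sketch of such a disambiguation step is given below: each occurrence of an ambiguous word is represented by the words in its local context and assigned to the induced sense whose centroid vector is most similar. The sense labels, centroid vectors and context are invented for illustration and do not correspond to the project's actual software.

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Illustrative sense centroids for "palm", e.g. averaged context vectors per induced sense.
sense_centroids = {
    "palm/hand": Counter({"hand": 3, "finger": 2, "shoulder": 1, "hold": 1}),
    "palm/tree": Counter({"tree": 3, "beach": 2, "coconut": 2, "leaf": 1}),
}

def disambiguate(context_words, centroids):
    """Assign an occurrence to the sense whose centroid is closest to its context vector."""
    context_vector = Counter(context_words)
    return max(centroids, key=lambda sense: cosine(context_vector, centroids[sense]))

occurrence = ["he", "held", "the", "coin", "in", "his", "hand"]
print(disambiguate(occurrence, sense_centroids))   # expected: 'palm/hand'
```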

3. RESULTS

The focus of this paper is to give an overview of the AutoWordNet project. As the project is ongoing (and also due to space constraints), detailed results concerning the various aspects of the project cannot be presented here, but will be published separately. However, in order to give the reader an idea of the outcome, let us summarize here the results concerning one of the fundamental aspects of the project, namely the computation of thesauri of related words. Although this work has been completed for several languages, as more information can be found in [32], we will confine our description here to the English version. For the other languages, the procedure is essentially the same.

As our underlying textual basis we used the British National Corpus (BNC). While being considerably smaller than more recent corpora (e.g. the WaCky or the LDC Gigaword corpora, which were used for some of the other languages), our experience is that it leads to somewhat better results for this task as it is well balanced, whereas the other corpora have a stronger tendency to produce idiosyncrasies. In a pre-processing step, we lemmatized this corpus and removed the function words (for details concerning this step see [32]). Based on a window size of 2 words, we then computed a co-occurrence matrix comprising all of the approximately 375,000 lemmas occurring in the BNC. The raw co-occurrence counts were converted to association strengths using the entropy-based association measure described in [32]. Inspired by Latent Semantic Analysis [14], in a further step we applied a Singular Value Decomposition to the association matrix, thereby reducing the dimensionality of the semantic space to 300 dimensions. This dimensionality reduction has a generalization and smoothing effect which has been shown to improve the results of the subsequent similarity computations [30]. Given the resulting dimensionality-reduced matrix, word similarities were computed by comparing word association vectors using the standard cosine similarity measure. This led to results like the ones shown in Table 4 (the lists are ranked according to decreasing cosine values).

For a quantitative evaluation we used the system for solving the TOEFL synonym test (see Section 2.1) and compared the results to the correct answers as provided by the Educational Testing Service. Remember that in this test the subjects had to choose the word most similar to a given stimulus word from a list of four alternatives.

Table 4: Sample lists of related words as computed

enormously: greatly (0.52), immensely (0.51), tremendously (0.48), considerably (0.48), substantially (0.44), vastly (0.38), hugely (0.38), dramatically (0.35), materially (0.34), appreciably (0.33)
flaw: shortcomings (0.43), defect (0.42), deficiencies (0.41), weakness (0.41), fault (0.36), drawback (0.36), anomaly (0.34), inconsistency (0.34), discrepancy (0.33), fallacy (0.31)
issue: question (0.51), matter (0.47), debate (0.38), concern (0.38), problem (0.37), topic (0.34), consideration (0.31), raise (0.30), dilemma (0.29), discussion (0.28)
build: building (0.55), construct (0.48), erect (0.39), design (0.37), create (0.37), develop (0.36), construction (0.34), rebuild (0.34), exist (0.29), brick (0.27)
discrepancy: disparity (0.44), anomaly (0.43), inconsistency (0.43), inaccuracy (0.40), difference (0.36), shortcomings (0.35), variance (0.34), imbalance (0.34), flaw (0.33), variation (0.33)
essentially: primarily (0.50), largely (0.49), purely (0.48), basically (0.48), mainly (0.46), mostly (0.39), fundamentally (0.39), principally (0.39), solely (0.36), entirely (0.35)

In the simulation, we assumed that the system made the right decision if the correct answer was ranked best among the four alternatives. This was the case for 74 of the 80 test items, which gives an accuracy of 92.5%. In comparison, the performance of human subjects had been 97.75% for native speakers and 86.75% for highly proficient non-native speakers (see Table 1). This means that our program's performance lies between these two levels, with about equal margins towards both sides. An interesting observation is that in Table 4 most words listed are of the same part of speech as the stimulus word. This is surprising insofar as the simulation system never obtained any information concerning part of speech, but in the process of computing term relatedness implicitly determines it. This observation is consistent with other work (e.g. [14]). As mentioned above, the method has also been applied to other languages, namely French, German, Spanish and Russian [32]. Apart from corpus pre-processing (e.g. segmentation and lemmatization), the algorithm remained unchanged, but nevertheless delivered similarly good results. As an outcome, large thesauri of related words (analogous to the samples shown in Table 4), each comprising on the order of 50,000 entries, are available for these languages.
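The processing chain described above can be summarized in the following sketch: co-occurrence counting within a window of two words, conversion of counts to association strengths, dimensionality reduction to 300 dimensions, and cosine-based ranking of neighbours. Since the entropy-based association measure of [32] is not reproduced here, positive pointwise mutual information is used as a stand-in, and scikit-learn's truncated SVD replaces a full singular value decomposition; these substitutions and all parameter values are assumptions for illustration only.

```python
import numpy as np
from collections import Counter, defaultdict
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def build_thesaurus(sentences, window=2, n_dims=300, top_n=10):
    """Return, for every word, the top_n distributionally most similar words."""
    # 1. Co-occurrence counts within +/- window (the paper uses a window of 2 words).
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    counts[w][sent[j]] += 1
    vocab = sorted(counts)
    index = {w: k for k, w in enumerate(vocab)}

    # 2. Dense count matrix, then PPMI association strengths
    #    (a stand-in for the entropy-based measure used in the actual system).
    m = np.zeros((len(vocab), len(vocab)))
    for w, ctx in counts.items():
        for c, n in ctx.items():
            m[index[w], index[c]] = n
    total = m.sum()
    p_wc = m / total
    p_w = m.sum(axis=1, keepdims=True) / total
    p_c = m.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # 3. Dimensionality reduction (300 dimensions in the paper; capped for small inputs).
    n_comp = max(1, min(n_dims, len(vocab) - 1))
    reduced = TruncatedSVD(n_components=n_comp, random_state=0).fit_transform(ppmi)

    # 4. Cosine similarities and ranked neighbour lists.
    sims = cosine_similarity(reduced)
    thesaurus = {}
    for w in vocab:
        order = np.argsort(-sims[index[w]])
        thesaurus[w] = [(vocab[j], float(sims[index[w], j]))
                        for j in order if vocab[j] != w][:top_n]
    return thesaurus

# Toy usage (the actual input is the lemmatized BNC with about 375,000 lemmas):
example = [["red", "dress", "colour"], ["blue", "dress", "colour"], ["red", "flower"], ["blue", "flower"]]
print(build_thesaurus(example)["red"][:3])
```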

4. SUMMARY, CONCLUSIONS, OUTLOOK


Our methodology builds on previous work concerning the three steps which have been described above, namely the computation of word similarities, word sense induction, and the identification of conceptual relations between words. Our aim is to advance the state of the art for each of these tasks, and to combine them into an overall system.

For the computation of word similarities, as described in the previous section, human intuitions have been successfully replicated by an automatic system building on previous studies such as [14], [23], and [32].

In word sense induction, current methods can make rough sense distinctions, but are far from reaching the sophistication of human judgements. Here our current work focuses on comparing methods based on local versus global co-occurrence vectors, as well as local versus global clustering. There are deep theoretical questions behind these choices which also correlate with some design principles of WordNet. We intend to compare three existing systems which can be seen as prototypical for different choices, namely the ones described by Pantel & Lin [23], Rapp [31], and Bordag [1]. By providing empirical evidence, this should enable us to at least partially answer these questions. By combining the best choices we hope to be able to come up with an improved algorithm.

Concerning the identification of conceptual relations holding between words, the field is still at an early stage and it is unclear whether the aim of automatically replicating WordNet's relations through unsupervised learning from raw text is realistic. However, attempting to do so is certainly of interest. On the one hand, it is still rather unclear what the empirical basis for these relations is, and how they can be extracted from a corpus. On the other hand, WordNet provides such relations and can therefore be used as a gold standard for the iterative refinement of an algorithm. As a possible outcome, it may well turn out that the empirical support for WordNet's conceptual relations is not equally strong for all types. This would raise the question whether the choices underlying WordNet were sound, and what the most salient alternative relations would be. Also, there may be interesting findings within each category, as most categories are only applicable to certain subsets of words (e.g. holonymy cannot easily be applied to abstract terms).

Although the envisaged advances concerning the three steps are of a more evolutionary nature, their sum is supposed to lead to a time-saving and largely language-independent algorithm for the automatic extraction of a WordNet-like resource from a large text corpus. The work is also of interest from a cognitive perspective, as WordNet is a collection of different types of human intuitions, namely intuitions on word similarity, on word senses, and on word relations. The question is whether all of these intuitions find their counterpart in corpus evidence. Should this be the case, this would support the view that human language acquisition can be explained by unsupervised learning (i.e. low-level statistical mechanisms) on the basis of perceived spoken and/or written language. If not, other sources of information available for language learning would have to be identified, which may e.g. include knowledge derived from visual perception, world knowledge as postulated in Artificial Intelligence, or some inherited high-level mechanisms such as Pinker's language instinct or Chomsky's language acquisition device.

Although the suggested methodology is unlikely to completely replace current manual techniques of compiling lexical databases in the near future, it should at least be useful to efficiently aggregate relevant information for subsequent human inspection, thereby making the manual work more efficient. This is of particular importance as the suggested methods should in principle be applicable to all languages, so that the potential savings multiply. Another aspect is that automatic methods will in principle allow generating WordNets for particular genres, domains or dialects by simply running the algorithm on a large text corpus of the respective type. This is something that would not be easy to obtain manually, as human intuitions tend to be based on the sum of lifetime experience, so that it is difficult to concentrate on specific aspects. Let us conclude by citing Piasecki et al. [27]: "A language without a wordnet is at a severe disadvantage. ... Language technology is a signature area of ... the Internet, ... including increasingly clever search engines and more and more adequate machine translation. A wordnet, a rich repository of knowledge about words, is a key element of ... language processing."

5. REFERENCES
[1] S. Bordag. Word sense induction: triplet-based clustering and automatic evaluation. Proc. of EACL, 2006.
[2] P. Buitelaar, P. Cimiano (eds.). Ontology Learning and Population: Bridging the Gap between Text and Knowledge. Selected Contributions to Ontology Learning and Population from Text. IOS Press, 2008.
[3] P. Buitelaar, P. Cimiano, P. Haase, M. Sintek. Towards linguistically grounded ontologies. Proc. of the 6th ESWC, Heraklion, Greece, 111–125, 2009.
[4] P. Buitelaar, T. Declerck, A. Frank, S. Racioppa, M. Kiesel, M. Sintek, R. Engel, M. Romanelli, D. Sonntag, B. Loos, V. Micelli, R. Porzel, P. Cimiano. LingInfo: Design and applications of a model for the integration of linguistic information in ontologies. Proc. of the OntoLex Workshop, Genoa, Italy, 28–34, 2006.
[5] J.A. Bullinaria, J.P. Levy. Extracting semantic representations from word co-occurrence statistics: a computational study. Behavior Research Methods, 39, 510–526, 2007.
[6] S.A. Caraballo, E. Charniak. Determining the specificity of nouns from text. Proc. of EMNLP-VLC, 63–70, 1999.
[7] P. Cimiano, P. Haase, M. Herold, M. Mantel, P. Buitelaar. LexOnto: a model for ontology lexicons for ontology-based NLP. Proc. of the OntoLex07 Workshop at ISWC07, South Korea, 2007.
[8] C. Fellbaum (ed.). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[9] G. Grefenstette. Explorations in Automatic Thesaurus Discovery. Dordrecht: Kluwer, 1994.
[10] Z.S. Harris. Distributional structure. Word, 10(2-3), 146–162, 1954.
[11] G. Hirst, D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In: C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 305–332, 1998.
[12] M. Jarmasz, S. Szpakowicz. Roget's thesaurus and semantic similarity. Proc. of RANLP, Borovets, Bulgaria, 212–219, 2003.
[13] J.J. Jiang, D.W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. Proc. of the International Conference on Research in Computational Linguistics, Taiwan, 1997.
[14] T.K. Landauer, S.T. Dumais. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240, 1997.
[15] T.K. Landauer, D.S. McNamara, S. Dennis, W. Kintsch (eds.). Handbook of Latent Semantic Analysis. Lawrence Erlbaum, 2007.
[16] C. Leacock, M. Chodorow. Combining local context and WordNet similarity for word sense identification. In: C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 265–283, 1998.
[17] D. Lin. Automatic retrieval and clustering of similar words. Proc. of COLING-ACL, Montreal, Vol. 2, 768–773, 1998.
[18] D. Lin. An information-theoretic definition of similarity. Proc. of the 15th International Conference on Machine Learning (ICML-98), Madison, WI, 296–304, 1998.
[19] I. Matveeva, G. Levow, A. Farahat, C. Royer. Generalized latent semantic analysis for term representation. Proc. of RANLP, Borovets, Bulgaria, 2005.
[20] E. Montiel-Ponsoda, W. Peters, G. Aguado de Cea, M. Espinoza, A. Gómez-Pérez, M. Sini. Multilingual and localization support for ontologies. Technical report, D2.4.2 NeOn Project Deliverable, 2008.
[21] D.B. Neill. Fully Automatic Word Sense Induction by Semantic Clustering. Cambridge University, Master's Thesis, M.Phil. in Computer Speech, 2002.
[22] S. Pado, M. Lapata. Dependency-based construction of semantic space models. Computational Linguistics, 33(2), 161–199, 2007.
[23] P. Pantel, D. Lin. Discovering word senses from text. Proc. of ACM SIGKDD, Edmonton, 613–619, 2002.
[24] P. Pantel, M. Pennacchiotti. Automatically harvesting and ontologizing semantic relations. In: P. Buitelaar, P. Cimiano (eds.), Ontology Learning and Population: Bridging the Gap between Text and Knowledge. Selected Contributions to Ontology Learning and Population from Text, IOS Press, 2008.
[25] M.T. Pazienza, A. Stellato. Exploiting linguistic resources for building linguistically motivated ontologies in the Semantic Web. Proc. of the 2nd OntoLex Workshop, 2006.
[26] M. Pennacchiotti, P. Pantel. A bootstrapping algorithm for automatically harvesting semantic relations. Proc. of Inference in Computational Semantics (ICoS), Buxton, England, 87–96, 2006.
[27] M. Piasecki, S. Szpakowicz, B. Broda. A WordNet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej, 2009.
[28] R. Rapp. The computation of word associations: comparing syntagmatic and paradigmatic approaches. Proc. of the 19th COLING, Taipei, ROC, Vol. 2, 821–827, 2002.
[29] R. Rapp. Word sense discovery based on sense descriptor dissimilarity. Proc. of the Ninth MT Summit, 315–322, 2003.
[30] R. Rapp. A freely available automatically generated thesaurus of related words. Proc. of the 4th LREC, Lisbon, Vol. II, 395–398, 2004.
[31] R. Rapp. A practical solution to the problem of automatic word sense induction. Proc. of the 42nd Meeting of the ACL, Companion Volume, 195–198, 2004.
[32] R. Rapp. The automatic generation of thesauri of related words for English, French, German, and Russian. International Journal of Speech Technology, 11(3), 147–156, 2009.
[33] P. Resnik. Using information content to evaluate semantic similarity. Proc. of the 14th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, 448–453, 1995.
[34] J. Rosenzweig, R. Mihalcea, A. Csomai. WordNet bibliography: a bibliography referring to research involving the WordNet lexical database. http://lit.csci.unt.edu/wordnet/, 2007.
[35] G. Ruge. Experiments on linguistically-based term associations. Information Processing and Management, 28(3), 317–332, 1992.
[36] M. Ruiz-Casado, E. Alfonseca, P. Castells. Using context-window overlapping in synonym discovery and ontology extension. Proc. of RANLP, Borovets, Bulgaria, 2005.
[37] M. Sahlgren. Vector-based semantic analysis: representing word meanings based on random labels. In: A. Lenci, S. Montemagni, V. Pirrelli (eds.), Proc. of the ESSLLI Workshop on the Acquisition and Representation of Word Meaning, Helsinki, 2001.
[38] H. Schütze. Ambiguity Resolution in Language Learning: Computational and Cognitive Models. Stanford: CSLI Publications, 1997.
[39] E. Terra, C.L.A. Clarke. Frequency estimates for statistical word similarity measures. Proc. of HLT/NAACL, Edmonton, Alberta, 244–251, 2003.
[40] S. Thoongsup, K. Robkop, C. Mokarat, T. Sinthurahat, T. Charoenporn, V. Sornlertlamvanich, H. Isahara. Thai WordNet construction. Proc. of the 7th Workshop on Asian Language Resources at ACL-IJCNLP, Suntec, Singapore, 139–144, 2009.
[41] P.D. Turney. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proc. of the Twelfth European Conference on Machine Learning, Freiburg, Germany, 491–502, 2001.
[42] P.D. Turney. Similarity of semantic relations. Computational Linguistics, 32(3), 379–416, 2006.
[43] P.D. Turney. A uniform approach to analogies, synonyms, antonyms, and associations. Proc. of the 22nd COLING, Manchester, UK, 905–912, 2008.
[44] P.D. Turney, M.L. Littman, J. Bigham, V. Shnayder. Combining independent modules to solve multiple-choice synonym and analogy problems. Proc. of RANLP, Borovets, Bulgaria, 482–489, 2003.
[45] P.D. Turney, P. Pantel. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188, 2010.
