Ontology via terminology

Lee Gillam

2004

AI-generated Abstract

The paper explores the interplay between ontology and terminology, arguing that recent definitions of ontology have shifted towards its role in information representation, linking it to language understanding. It discusses methods for automatically generating ontologies from text using techniques like term-frequency/inverse-document-frequency (TF/IDF) and the need for human evaluation in these processes. The application of these methods is illustrated through a case study on carbon nanostructures, demonstrating their effectiveness in ontology construction.

Ontology Via Terminology?

Lee Gillam and Mariam Tariq
Department of Computing, University of Surrey, Guildford, GU2 7XH, United Kingdom
{l.gillam, m.tariq}@surrey.ac.uk

Introduction

The Encyclopaedia Britannica’s (EB) definition of ontology as “the theory or study of being as such; i.e., of the basic characteristics of all reality” has been overtaken in recent years by the use of “ontology” to describe the representation of information such that it can be reasoned over, which some consider to be “knowledge”. Sowa’s view of ontology as “the study of the categories of things that exist or may exist in some domain” [18] seems to provide a link between the EB definition and Gruber’s commonly cited “explicit specification of a conceptualization” [5]. The effective reduction of “ontology” to representation leads to its consideration as a tool for developing solutions to, for example, problems of translation [14], information retrieval [15], [6], knowledge management [12] and other issues related to knowledge-based activities [2]. For these and other authors, an ontology is produced by hand-crafting a representation of a specific domain, or by renaming an existing language resource: Wordnet and its EuroWordnet variants are examples of the latter [15], [18].

The reduction of ontology to language resources occurs, perhaps, as an evolution of the work of philosophers such as Wittgenstein, who reduced the study of existence to the study of language; this alignment of language and ontology is apparent elsewhere [8], [9]. By accepting such a reduction, we can argue that the production of (the representation of) an ontology can be reduced to the problem of language understanding and analysis. Work on ontology construction, especially that of Alexander Maedche, suggests the potential for automatic population of ontologies from text, referred to by some as “ontology learning” [11], [3], [13].
These approaches use measures such as the information retrieval favourite term-frequency/inverse-document-frequency (TF/IDF), entropy measures, part-of-speech tagging and other devices for suggesting terms to the user. The burden of constructing the ontology from the results of these operations rests squarely with the user. By constraining this reduction further to specialist languages, the reduction of the study of ontology to the study of language suggests that the problem of ontology engineering, or learning, can be reduced to one that is previously unsolved: the automatic acquisition of terminological knowledge from domain texts. Most advocates of ontology (as representation), with a few exceptions [3], pay little heed to terminology science, and yet arguably they are only creating (limited) terminologies. Sowa notes that “subsets of the terminology can be used as starting points for formalization”, and that this is a valid endeavour since “most fields of science, engineering, business, and law have evolved systems of terminology or nomenclature for naming, classifying, and standardizing their concepts” [18]. Encouraged by the works of Sowa and Maedche, we seek to develop a method for the automatic derivation of ontologies, informed by work in terminology science, using mechanisms for extracting and organising terms from text corpora. Maedche suggests the use of an ontology structure to map between Wordnet and an ontology representation. If we describe an approach to terminology acquisition, informed by recent developments in international standards for terminology that can be used to seed such terminology collections, and map between terminology and ontology via an ontology structure, we can potentially reduce the burden of both terminology and ontology acquisition. If such a mapping can be made, large-scale validated terminology collections may be of value to ontology developers as seeds for a domain. 
Here, we consider collections of terminology developed in accordance with an international standard that enables the development of terminology standards (ISO 704). Adopting this approach to terminology production provides a ready-to-use, peer-agreed resource for such activities, although creating such a collection is human-resource intensive. Existing collections of ontologies (see for example http://www.daml.org/ontologies) are generally small scale, with a few notable exceptions, and these exceptions are closer in form to terminology collections.

We can demonstrate the gap between texts currently available (e.g. Web texts, journal papers, and so on) and existing ontologies using the terms diode and tunneling diode. Although both successfully retrieve a number of texts from common search engines, in the indexed collections of ontologies mentioned previously the query tunneling diode and its text variant tunnelling diode produce no evidence of ontology classes; only by (a human) knowing a further variant, tunnel-diode, do we achieve success. This success is limited, since the class found appears to have no subclasses, although from domain texts we can produce evidence of light emitting tunnel diodes and a range of other subtypes. Such ontologies are therefore not sufficiently useful for purposes such as assisting information retrieval, despite claims elsewhere.

C:\Documents and Settings\csp1sa\Desktop\saif_web\saif\ONtology Via Terminology LG MT 2003 sub 101203 accepted 080104.doc 02/03/05 20:58

To produce ontologies, we consider the essential organisation of science as evidenced through single words and related multi-word expressions (which some refer to as compounds). We initially use statistical techniques for the extraction of candidate terms from text collections, including weirdness [1], collocation statistics [17] and term clusters [10]. This produces a candidate term hierarchy. Linguistic techniques are then used to augment this hierarchy.
The extracted information is organised according to two international (ISO) standards for terminology, specifically ISO 12620 for terminology data categories, and ISO 16642 for the terminological markup framework. These provide a basis for a terminology collection and, via an ontology structure, for a so-called “lightweight” ontology. The synthesis of statistical and linguistic techniques, in conjunction with these standards, enables us to produce an ontology (representation) suitable for refinement within an ontology editor such as Protégé.

Method

The method requires two text corpora: a general language corpus (GL), which for English is the British National Corpus (BNC, 100 million tokens), and the specialist text corpus (SL). We adopt descriptions of relative frequencies and weirdness calculations [1]. About 75% of the tokens of the BNC are accounted for by the 2000 most frequent words in the BNC, and we remove these from the analysis of SL. We wish to consider high-frequency, high-weirdness words in SL.

The adopted definition of weirdness is problematic when a word in SL is not in GL, since the denominator is zero regardless of the frequency of occurrence in the specialist corpus. To overcome this, we redefine the weirdness calculation by assigning a minimum value to the GL frequency of any word not in GL: half the minimum frequency in GL. For every word in SL, we then have a value for both relative frequency and weirdness. We can therefore consider the strength above the standard deviation (z-scores) for the distributions given by these values, both of which tend to have a large kurtosis (shown experimentally). By taking both z-scores > 1, using SL, we automatically generate inputs to Smadja’s collocation method (step 1.2), with a set of automatically selected words rather than “a given word w” (one word only, manually selected for Smadja). To provide a larger number of words, we can vary the strength above the standard deviation.
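The word-selection step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes weirdness is the ratio of a word's relative frequency in SL to its relative frequency in GL, it applies the half-minimum floor described above to words absent from GL, and the function names and toy counts are hypothetical.

```python
from statistics import mean, pstdev


def weirdness_scores(sl_counts, gl_counts):
    """Return {word: (relative frequency in SL, weirdness)} for every SL word."""
    n_sl = sum(sl_counts.values())
    n_gl = sum(gl_counts.values())
    # Assumed floor: half the minimum GL frequency, used for words not in GL.
    floor = min(gl_counts.values()) / 2
    scores = {}
    for word, freq in sl_counts.items():
        rel_sl = freq / n_sl
        rel_gl = gl_counts.get(word, floor) / n_gl
        scores[word] = (rel_sl, rel_sl / rel_gl)
    return scores


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def select_seeds(sl_counts, gl_counts, threshold=1.0):
    """Keep words whose frequency and weirdness z-scores both exceed threshold."""
    scores = weirdness_scores(sl_counts, gl_counts)
    words = list(scores)
    z_freq = z_scores([scores[w][0] for w in words])
    z_weird = z_scores([scores[w][1] for w in words])
    return [w for w, zf, zw in zip(words, z_freq, z_weird)
            if zf > threshold and zw > threshold]


# Toy counts (invented): "nanotubes" is frequent in SL and absent from GL,
# "electron" is equally frequent in SL but also common in GL.
sl = {"nanotubes": 40, "electron": 40, "alpha": 5, "beta": 5, "gamma": 5, "delta": 5}
gl = {"electron": 40, "alpha": 20, "beta": 20, "gamma": 10, "delta": 10}
seeds = select_seeds(sl, gl)
```

In this toy data both nanotubes and electron pass the frequency threshold, but only nanotubes passes the weirdness threshold, so it alone is kept as a seed. Lowering `threshold` admits more words, which corresponds to varying the strength above the standard deviation.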
For collocations, we again remove the top 2000 words of GL, and use Smadja’s U-scores, z-scores and a neighbourhood of 5 words. For the creation of multiword expressions in English, we use collocations in the immediate neighbourhood (we still consider the full neighbourhood for determining the strength of the collocation). U-score and z-score are used to produce bigrams representing candidate compounds, and the process is repeated using significant bigrams to form n-word collocations. From this analysis, we form candidate trees that evidence “term inclusion” through left- and right-extension of the collocating phrase at every iteration.

For the indication of “concepts”, we consider the computation of a word-word similarity matrix, using the Dice coefficient [16] to measure the number of matching n-grams. In our case, we consider trigrams: patterns of three letters, where each pattern generally overlaps the previous one by 2 letters. We produce the set of all substrings of length n characters in each word that is a component of the phrase. Trigram patterns made up from each candidate term are compared using a selection strategy where the match is 80% or more [19]. This value may be increased or decreased depending on the strength of match required. Since we are considering similar compounds rather than single words alone, it is perhaps worth considering a value of 90% or greater.

Having identified and related domain terms statistically through inclusion, we also consider that terms in a domain are often related to each other through semantic relations like hyponymy and meronymy, often exemplified in the domain texts through particular recurrent grammatical patterns. For example, scientists often use the device of enumeration to explain certain concepts.
The sentence “Various copper compounds such as copper oxides, nitrides, and sulphides have been studied extensively due to their excellent optical and electronic properties,” signals the semantic relationship of hyponymy between the terms through the cue such as: copper oxide, nitride, and sulphide are hyponyms (subtypes) of copper compound. The use of such phrases to encode complex relationships appears to have its own rules of description: a kind of local grammar governs the behaviour of clauses. Cruse has discussed the notion of semantic frames: a triplet of phrases, X REL Y, where X and Y are noun phrases (NPs) and REL is a phrase generally expressed as IS A or IS A TYPE OF/KIND OF for hyponymic relationships and PART OF for meronymic relationships [4]. Apart from these cues, most commonly used in biological classifications, it has been suggested that certain lexico-syntactic patterns occurring in texts can be similarly used, such as the frame (X1, ..., Xn) OR OTHER Y, where each Xi and Y are NPs and each Xi in the list (X1, ..., Xn) is a hyponym of Y [7]. Within our method we have employed cues suggested by Hearst as well as other patterns to automatically extract relevant sentences, which are processed to elicit the hypernym-hyponym pairs [1]. These partial graphs can then be used to augment the candidate ontology.

Combining the statistical and linguistic methods produces a “tree” of terms and relations, mostly organised hierarchically. The use of the recently developed and developing terminology standards ISO 12620 and ISO 16642 enables us to produce a terminology markup language (TML) that represents these results.
We use terminology data categories from ISO 12620, which include relational data categories such as superordinate concept and subordinate concept, along with mechanisms for developing a TML from ISO 16642, including the use of the notions of style and vocabulary, to produce an XML-conformant encoding. This encoding can be converted to, for example, MARTIF (ISO 12200), or to the TermBase eXchange (TBX) format developed by the Localisation Industry Standards Association (LISA). The combination of these standards with the extraction method can provide the basis for a terminology collection.

By following Maedche’s approach to defining an “ontology wrapper” for WordNet [11], we can consider the provision of a wrapper for a (concept oriented) terminology collection created in this way. To do so, we map from TermEntry (ISO 16642) to concept; from term (ISO 12620) to lexicon; from broader concept generic (‘is a’), superordinate concept, superordinate concept generic, subordinate concept, subordinate concept generic (ISO 12620) to hierarchical relation; and from relationships including broader concept partitive (‘has a’), sequentially related concept, temporally related concept and spatially related concept (ISO 12620) to relations. We could consider extending this model to retain information about relationships between terms, which may be of importance in establishing relationships between parts of concepts, and which can also be extended to lexicographical collections and term-oriented terminology collections. Such relationships in ISO 12620 include short form of term, initialism, acronym, clipped term, homonym and homograph.

In the supertype/subtype relationship-based Resource Description Framework Schema (RDFS), each term would form the content of an rdfs:label, and suitable concept identifiers (rdf:ID) would be used to present the classing and subclassing (rdfs:Class, rdfs:subClassOf).
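As a rough illustration of the RDFS output described here, the sketch below serialises subtype/supertype pairs as RDF/XML using Python's standard library. It is an assumption-laden sketch, not the authors' converter: the function name `to_rdfs`, the derivation of rdf:ID values by replacing spaces with underscores, and the example pairs are all hypothetical choices made for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def concept_id(term):
    # Assumed identifier scheme: spaces become underscores.
    return term.replace(" ", "_")


def to_rdfs(hyponym_pairs):
    """Serialise (subtype, supertype) term pairs as RDFS classes in RDF/XML."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("rdfs", RDFS)
    root = ET.Element(f"{{{RDF}}}RDF")
    classes = {}

    def get_class(term):
        # Create one rdfs:Class per term, labelled with the term itself.
        if term not in classes:
            node = ET.SubElement(root, f"{{{RDFS}}}Class",
                                 {f"{{{RDF}}}ID": concept_id(term)})
            ET.SubElement(node, f"{{{RDFS}}}label").text = term
            classes[term] = node
        return classes[term]

    for subtype, supertype in hyponym_pairs:
        get_class(supertype)
        child = get_class(subtype)
        ET.SubElement(child, f"{{{RDFS}}}subClassOf",
                      {f"{{{RDF}}}resource": "#" + concept_id(supertype)})
    return ET.tostring(root, encoding="unicode")


rdf_xml = to_rdfs([("nanowire array", "nanowire"),
                   ("nanowire", "metal nanostructure")])
```

Only the hierarchical relation survives in this form; term-level information such as abbreviations or short forms would need further properties or a richer target language.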
In this conversion, there is a degree of information loss, since RDFS does not cater for much of the information needed for a terminology format and is not expressive enough to cater for natural languages. However, the mapping to an ontology language shows the ability to directly populate such an ontology system. Ontology editing applications that understand RDFS, including Protégé and OilEd, can use such output to seed their ontologies for further development.

Case Study: Carbon Nanotubes

A corpus of 1,012,096 tokens was collected, comprising 404 learned articles from the Applied Physics Letters section on Nanoscale Science and Design. Analysis produces a list of 26861 words. Removal of the top 2000 words of the BNC reduces the length of this list to 25339 (1522 fewer words, a reduction of 5.7% of the vocabulary). Of these 25339, 14142 words produce an “INFINITE” weirdness: nearly 56% of words do not occur in the BNC. With z-scores > 1 for both frequency and weirdness, a subset of 46 words (0.18%) is selected to consider for collocations (relaxing both z-score thresholds to values above 0.5 would result in 90 words being selected for the next phase of this analysis). The 46 words contain high frequency-high weirdness combinations; for example, nanotubes is selected in 6th position when ordered by frequency (1378 and “INFINITE” respectively). When ordered by weirdness, the first 10 results include: nanotubes, nanotube, nanoparticles, nanowires, tunneling and cnts. Results obtained from this corpus for frequency and weirdness, such as 2142 and 225 respectively for electron and 126 and “INFINITE” respectively for fiber, determine the exclusion of these terms from this particular set.

These automatically selected 46 words are then used as the seeds in our collocation process. For nanotubes, with a distance of –5 to +5, 1811 collocating words are found.
The word nanotubes collocates with carbon a total of 690 times, 647 of which are at position –1 (giving the compound carbon nanotubes). Applying U > 10 and z > 1 reduces the number of results to consider from 1811 to 22 (98.8% of collocates ignored). Our further constraint with regard to positions +1 and –1 reduces this list further to 4, with consideration of carbon nanotubes, z nanotubes, nanotubes cnts and nanotubes grown. Relaxing the thresholds to U > 5 and z > 0.5 increases the initial number of considerations to 47; however, this only increases the list for positions –1 and +1 by 7 compounds. For carbon nanotubes, with U > 10 and z > 5, we achieve a further list of 25 collocations, reduced by considering position information to 11.

Taking the four examples at position –1, still applying the constraints, we derive (frequencies in brackets): aligned carbon nanotubes (48), vertically aligned carbon nanotubes (15), aligned carbon nanotubes kai (4), multiwalled carbon nanotubes (46), multiwalled carbon nanotubes mwnts (13), single-wall carbon nanotubes (24), single-wall carbon nanotubes swnts (4). Interestingly, singlewalled carbon nanotubes is also extended by swnts (f = 19); however, when we consider multiwall carbon nanotubes, the mwnts extension does not satisfy the conditions. There is, perhaps, some tension between wall and walled within this collection. Analysis by hand would suggest that vertically aligned carbon nanotubes is valid, while single-wall carbon nanotube and multiwalled carbon nanotubes appear to be extended only by abbreviations. Extending this analysis to lower frequencies (relaxing the U and z constraints), we find longer terms such as: conventional horizontal-type metalorganic chemical vapor deposition reactor; ridge-type ingaas quantum-wire field-effect transistors; and trench-type narrow ingaas quantum-wire field effect transistor.
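The trigram matching set out in the Method section can be sketched as follows. This is one plausible reading rather than the authors' code: it assumes character trigrams are generated within each word of a term (not across word boundaries), that duplicates are collapsed into a set, and that the Dice coefficient is computed over those sets. The function names are hypothetical.

```python
def trigrams(term, n=3):
    """Set of all length-n character substrings of each word in the term."""
    grams = set()
    for word in term.lower().split():
        # Words shorter than n characters contribute no n-grams.
        grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice(term_a, term_b):
    """Dice coefficient: twice the shared trigrams over the total trigram count."""
    grams_a, grams_b = trigrams(term_a), trigrams(term_b)
    if not grams_a and not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def same_cluster(term_a, term_b, threshold=0.8):
    """Treat two candidate terms as variants when the match reaches threshold."""
    return dice(term_a, term_b) >= threshold


variants = [
    "multiwalled carbon nanotube",
    "multiwall carbon nanotube",
    "multiwalled carbon nanotubes",
]
pairwise = {(a, b): dice(a, b) for a in variants for b in variants if a < b}
```

Under these assumptions the three variants score between roughly 0.92 and 0.97 pairwise, so they would be grouped at either the 80% or the 90% threshold.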
From the resulting list of term candidates, we consider the matching mechanism for determining potential term clusters (possible synonyms). If we consider multiwalled carbon nanotube, multiwall carbon nanotube and multiwalled carbon nanotubes, these can be shown to match with a value greater than about 0.92, so we present them as a term cluster. From this analysis, we can produce hyponymies such as (the arrow indicates subtype → supertype):

[nanowire array], [boron nanowire], [nanowire transistor] → [nanowire]
[fe nanowire array], [thicker nanowire array], [thin nanowire array] → [nanowire array]

For the linguistic analysis, 722 sentences were extracted using a set of 8 cues, out of which 55% embodied a domain-related hyponymic relationship. Out of all the cues, such as was the most productive, being used in 66% of the valid sentences. Below we list some example sentences illustrating the use of the cues such as, and other, including and like.

1. This method has been successfully applied in recent years in the synthesis of various metal nanostructures such as nanowires, nanorods, and nanoparticles.
2. Occasional multiwall carbon nanotubes and other carbon nanostructures were also found following annealing at higher (> °C) temperatures.
3. The present method will be extended to find and fix nanoparticles including polymers, colloids, micelles, and hopefully biological molecules/tissues in solution.
4. This technique is promising because many different types of nanowires, like nanotubes or semiconductor nanowires, are now synthetically available.
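A deliberately naive sketch of cue-based extraction for the such as pattern is given below. It is not the procedure used in the paper: it assumes the enumeration runs to the end of the sentence and that the hypernym is the last two words before the cue, both of which are simplifications standing in for proper noun-phrase detection (for example, via part-of-speech tagging). The function name and the `hypernym_words` parameter are hypothetical.

```python
import re

# Splits an enumeration on commas and on "and"/"or" conjunctions.
LIST_SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def such_as_pairs(sentence, hypernym_words=2):
    """Return (hyponym, hypernym) pairs for a 'Y such as X1, X2, and X3' sentence."""
    before, cue, after = sentence.partition(" such as ")
    if not cue:
        return []
    # Assumption: the hypernym noun phrase is the last few words before the cue.
    hypernym = " ".join(before.split()[-hypernym_words:])
    # Assumption: the enumeration continues to the end of the sentence.
    items = LIST_SEPARATOR.split(after.rstrip(" ."))
    return [(item.strip(), hypernym) for item in items if item.strip()]


sentence_1 = ("This method has been successfully applied in recent years in the "
              "synthesis of various metal nanostructures such as nanowires, "
              "nanorods, and nanoparticles.")
pairs = such_as_pairs(sentence_1)
```

For sentence 1 this yields nanowires, nanorods and nanoparticles as hyponyms of metal nanostructures. Sentences in which the enumeration is followed by a verb phrase would need a tagger or chunker to find where the list ends.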
From these sentences, various terms can be linked together based on the hyponymic relationship, for example (the arrow indicates subtype → supertype):

[nanotube], [semiconductor nanowire] → [nanowire] → [metal nanostructure] (sentences 1, 4)
[micelle], [polymer], [colloid] → [nanoparticle] → [metal nanostructure] (sentences 1, 3)

Sentences such as 2 and 4 above may confirm a synonymy relationship between multiwall carbon nanotubes and multiwalled carbon nanotubes. Furthermore, the partial graphs above can be merged with results from collocation analysis. Collocates of nanowire can be linked to the nanowire node of the sub-graph [nanotube] → [nanowire] → [metal nanostructure]. For example:

[ [fe nanowire array], [thicker nanowire array], [thin nanowire array] → [nanowire array] ], [ [nanotube], [semiconductor nanowire] ] → [nanowire] → [metal nanostructure]

This graph could similarly be expanded for other extracted relations and collocations, for example those of nanoparticle and other subtypes of metal nanostructure.

Mapping these results to RDFS enables us to produce a candidate ontology that can be edited (pruned, adapted and so on) and visualised within Protégé, as shown in Figure 1. The issues of multiple inheritance and the use of synonyms and abbreviations can then be handled within the ontology editor. Results of this method still require (human) evaluation, and determination of the appropriate parameters for term extraction, but show early promise.

Figure 1: Screen shot of the Protégé Ontology Editor displaying a section of the automatically constructed Carbon Nanotube candidate ontology in RDFS format.

References

[1] Ahmad, K., Tariq, M., Vrusias, B. and Handy, C. (2003). “Corpus-Based Thesaurus Construction for Image Retrieval in Specialist Domains.” In: Sebastiani, F. (ed.): Proceedings of ECIR’03. LNCS-2633. Springer Verlag, Heidelberg. pp.502-510.
[2] Alani, H., Kim, S., Millard, D., Weal, M., Hall, W., Lewis, P. and Shadbolt, N. (2003). “Automatic Ontology-Based Knowledge Extraction from Web Documents.” IEEE Intelligent Systems, Vol.18, No.1, pp.14-21.
[3] Aussenac-Gilles, N., Biebow, B. and Szulman, S. (2000). “Revisiting Ontology Design: A Method Based on Corpus Analysis.” Proceedings of EKAW 2000, LNAI-1937, Springer-Verlag, Berlin Heidelberg. pp.172-188.
[4] Cruse, D. A. (1986). Lexical Semantics. Cambridge University Press, Avon, Great Britain.
[5] Gruber, T. (1993). “A translation approach to portable ontologies.” Knowledge Acquisition, Vol.5, No.2, pp.199-220.
[6] Guarino, N., Masolo, C. and Vetere, G. (1999). “ONTOSEEK: Content-Based Access to the Web.” IEEE Intelligent Systems, Vol.14, No.3, pp.70-80.
[7] Hearst, M. A. (1992). “Automatic Acquisition of Hyponyms from Large Text Corpora.” Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France.
[8] Kaminsky, J. (1969). “Language and Ontology.” Southern Illinois University Press.
[9] Küng, G. (1967). “Ontology and the Logistic Analysis of Language.” D. Reidel Publishing Company, Dordrecht, Holland.
[10] Lewis, D. and Croft, W. (1990). “Term clustering of syntactic phrases.” ACM SIGIR-90. pp.385-404.
[11] Maedche, A. (2002). “Ontology Learning for the Semantic Web.” The Kluwer International Series in Engineering and Computer Science, Vol.665, ISBN: 0792376560.
[12] Maedche, A., Motik, B., Stojanovic, L., Studer, R. and Volz, R. (2003). “Ontologies for enterprise knowledge management.” IEEE Intelligent Systems, March-April 2003, Vol.18, Issue 2, pp.26-33.
[13] Mikheev, A. and Finch, S. (1995). “A Workbench for Acquisition of Ontological Knowledge from Natural Text.” In: Proceedings of the 7th conference of the European Chapter for Computational Linguistics (EACL'95). Dublin, Ireland. pp.194-201.
[14] Navigli, R., Velardi, P. and Gangemi, A. (2003). “Ontology Learning and Its Application to Automated Terminology Translation.” IEEE Intelligent Systems, Vol.18, No.1, pp.22-31.
[15] Oard, D.W. (1997). “Alternative approaches for cross-language text retrieval.” In: AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997.
[16] Salton, G. and McGill, M. J. (1983). “Introduction to Modern Information Retrieval.” McGraw-Hill, New York. pp.201 et seq.
[17] Smadja, F. (1993). “Retrieving collocations from text: Xtract.” Computational Linguistics, Vol.19, No.1. Oxford University Press. pp.143-178.
[18] Sowa, J.F. (2000). “Knowledge Representation: Logical, Philosophical, and Computational Foundations.” Brooks Cole Publishing Co., Pacific Grove, CA. pp.492, 497 et seq.
[19] Srinivasan, P. and Ruiz, M. E. (1998). “Crosslingual Information Retrieval with the UMLS: An Analysis of Errors.” In: Proceedings of the 61st Annual Meeting of the American Society for Information Science, Pittsburgh, PA. pp.153-165.
