Ontology Via Terminology?
Lee Gillam and Mariam Tariq
Department of Computing,
University of Surrey,
Guildford, GU2 7XH, United Kingdom
{l.gillam, m.tariq}@surrey.ac.uk
Introduction
The Encyclopaedia Britannica’s (EB) definition of ontology as “the theory or study of being as such;
i.e., of the basic characteristics of all reality” has been overtaken in recent years by the use of
“ontology” to describe the representation of information such that it can be reasoned over, which some
consider to be “knowledge”. Sowa’s view of ontology as “the study of the categories of things that
exist or may exist in some domain” [18] seems to provide a link between the EB definition and
Gruber’s commonly cited “explicit specification of a conceptualization” [5]. The effective reduction of
“ontology” to representation leads to its consideration as a tool for developing solutions to, for
example, problems of translation [14], information retrieval [15], [6], knowledge management [12] and
other issues related to knowledge-based activities [2]. For these and other authors, an ontology is
produced by hand-crafting a representation of a specific domain, or by renaming an existing language
resource: WordNet and its EuroWordNet variants are examples of the latter [15], [18].
The reduction of ontology to language resources occurs, perhaps, as an evolution of the work of
philosophers such as Wittgenstein who reduced the study of existence to the study of language; this
alignment of language and ontology is apparent elsewhere [8], [9]. By accepting such a reduction, we
can argue that the production of (the representation of) an ontology can be reduced to the problem of
language understanding and analysis. Work on ontology construction, especially that of Alexander
Maedche, suggests the potential for automatic population of ontologies from text, referred to by some
as “ontology learning” [11], [3], [13]. These approaches use measures such as the information retrieval
favourite term-frequency/inverse-document-frequency (TF/IDF), entropy measures, part-of-speech
tagging and other devices for suggesting terms to the user. The burden of constructing the ontology
from the results of these operations rests squarely with the user. If we constrain this reduction further
to specialist languages, the problem of ontology engineering, or learning, reduces to one that remains
unsolved: the
automatic acquisition of terminological knowledge from domain texts. Most advocates of ontology (as
representation), with a few exceptions [3], pay little heed to terminology science, and yet arguably they
are only creating (limited) terminologies. Sowa notes that “subsets of the terminology can be used as
starting points for formalization”, and that this is a valid endeavour since “most fields of science,
engineering, business, and law have evolved systems of terminology or nomenclature for naming,
classifying, and standardizing their concepts” [18].
Encouraged by the works of Sowa and Maedche, we seek to develop a method for the automatic
derivation of ontologies, informed by work in terminology science, using mechanisms for extracting
and organising terms from text corpora. Maedche suggests the use of an ontology structure to map
between WordNet and an ontology representation. If we describe an approach to terminology
acquisition, informed by recent developments in international standards for terminology that can be
used to seed such terminology collections, and map between terminology and ontology via an ontology
structure, we can potentially reduce the burden of both terminology and ontology acquisition. If such a
mapping can be made, large-scale validated terminology collections may be of value to ontology
developers as seeds for a domain. Here, we consider collections of terminology developed in
accordance with an international standard that enables the development of terminology standards (ISO
704). Adopting this approach to terminology production provides a ready-to-use peer-agreed resource
for such activities, although the approach to creation of such a collection is human-resource intensive.
Existing collections of ontologies (see for example http://www.daml.org/ontologies) are generally
small scale, with a few notable exceptions, and these exceptions are closer in form to terminology
collections. We can demonstrate the gap between texts currently available (e.g. Web texts, journal
papers, and so on) and existing ontologies using the terms diode and tunneling diode. Although both
successfully retrieve a number of texts from common search engines, in the indexed collections of
ontologies mentioned previously, the query tunneling diode and its text variant tunnelling diode
produce no evidence of ontology classes; only by (a human) knowing a further variant, tunnel-diode,
C:\Documents and Settings\csp1sa\Desktop\saif_web\saif\ONtology Via Terminology LG MT 2003 sub 101203 accepted
080104.doc
02/03/05 20:58
do we achieve success. This success is limited since the class found appears to have no subclasses, although from
domain texts we can produce evidence of light emitting tunnel diodes and a range of other subtypes.
Such ontologies are therefore not sufficiently useful for purposes such as assisting information
retrieval, despite claims elsewhere.
To produce ontologies, we consider the essential organisation of science as evidenced through single
words and related multi-word expressions (which some refer to as compounds). We initially use
statistical techniques for the extraction from text collections of candidate terms, including weirdness
[1], collocation statistics [17] and term clusters [10]. This produces a candidate term hierarchy.
Linguistic techniques are then used to augment this hierarchy. The extracted information is organised
according to two international (ISO) standards for terminology, specifically ISO 12620 for terminology
data categories, and ISO 16642 for the terminological markup framework. These are used for
providing a basis for a terminology collection and, via an ontology structure, for a so-called
“lightweight” ontology. The synthesis of statistical and linguistic techniques, in conjunction with these
standards, enables us to produce an ontology (representation) suitable for refinement within an
ontology editor such as Protégé.
Method
The method requires two text corpora, a general language corpus (GL), for English the British National
Corpus (BNC, 100 million tokens), and the specialist text corpus (SL). We adopt descriptions of
relative frequencies and weirdness calculations [1]. About 75% of the tokens of the BNC are
represented by the first 2000 most frequent words in the BNC, and we remove these from the analysis
of SL. We wish to consider high-frequency, high-weirdness words in SL. The adopted definition of
weirdness is problematic when a word in SL is not in GL, since the denominator is zero, regardless of
the frequency of occurrence in the specialist corpus. To overcome this, we redefine the weirdness
calculation by inventing a minimum value for the GL frequency of any word not in GL, half that of the
minimum frequency in GL. For every word in SL, we have a value for both relative frequency and
weirdness. We can therefore consider the strength above the standard deviation (z-scores) for the
distributions given by these values, both of which tend to have a large kurtosis (shown experimentally).
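The seed selection described so far can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used here: weirdness is assumed to be the ratio of a word's relative frequency in SL to its relative frequency in GL, the floor for unseen words is assumed to apply to relative frequencies, and all function names are hypothetical. Removal of the 2000 most frequent GL words is omitted for brevity.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def weirdness_table(sl_tokens, gl_tokens):
    """Return {word: (relative frequency in SL, weirdness)} for every SL word."""
    sl = relative_freqs(sl_tokens)
    gl = relative_freqs(gl_tokens)
    # Assumption: words absent from GL get half the smallest GL relative frequency.
    floor = min(gl.values()) / 2
    return {word: (freq, freq / gl.get(word, floor)) for word, freq in sl.items()}


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def select_seeds(table, threshold=1.0):
    """Keep words whose frequency and weirdness z-scores both exceed the threshold."""
    words = list(table)
    z_freq = z_scores([table[w][0] for w in words])
    z_weird = z_scores([table[w][1] for w in words])
    return [w for w, zf, zw in zip(words, z_freq, z_weird)
            if zf > threshold and zw > threshold]
```

Lowering `threshold` corresponds to varying the strength above the standard deviation in order to obtain a larger set of seed words.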
By taking both z-scores > 1, using SL, we automatically generate inputs to Smadja’s collocation
method (step 1.2), with a set of automatically selected words rather than “a given word w” (one word
only, manually selected for Smadja). To provide a larger number of words, we can vary the strength
above the standard deviation. For collocations, we again remove the top 2000 words of GL, and use
Smadja’s U-scores, z-scores and the neighbourhood of 5 words. For creation of multiword expressions
in English, we use collocations in the immediate neighbourhood (we still consider the full
neighbourhood for determining the strength of the collocation). U-score and z-score are used to
produce bigrams representing candidate compounds, and the process is repeated using significant
bigrams to form n-word collocations. From this analysis, we form candidate trees that evidence “term
inclusion” through left- and right-extension of the collocating phrase at every iteration. For the
indication of “concepts”, we consider the computation of a word-word similarity matrix, using the Dice
coefficient [16] to measure the number of matching n-grams. In our case, we consider trigrams as
patterns of three letters, where each pattern generally overlaps the previous by 2 letters and all patterns
are of length 3. We produce the set of all substrings of length n characters in each word that is a
component of the phrase. Trigram patterns made up from each candidate term are compared using a
selection strategy where the match is 80% or more [19]. This value may be increased or decreased
depending on the strength of match required. Since we are considering similar compounds rather than
single words alone, it is perhaps worth considering a value of 90% or greater.
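The trigram comparison can be illustrated with a short Python sketch. It assumes that the trigrams of a phrase are pooled as a set over its component words and that words shorter than three letters are kept whole; neither detail is specified above, and the function names are hypothetical.

```python
def trigrams(term, n=3):
    """Set of all length-n character substrings of each word in the term."""
    grams = set()
    for word in term.lower().split():
        if len(word) < n:
            grams.add(word)  # assumption: short words are kept whole
        else:
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice(term_a, term_b):
    """Dice coefficient: twice the shared trigrams over the total trigram count."""
    grams_a, grams_b = trigrams(term_a), trigrams(term_b)
    if not grams_a and not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def same_cluster(term_a, term_b, threshold=0.8):
    """True when two candidate terms match at or above the chosen threshold."""
    return dice(term_a, term_b) >= threshold
```

Raising `threshold` to 0.9 gives the stricter match suggested for compounds.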
Having identified and related domain terms statistically through inclusion, we also consider that terms
in a domain are often related to each other through semantic relations like hyponymy and meronymy,
often exemplified in the domain texts through particular recurrent grammatical patterns. For example,
scientists often use the device of enumeration to explain certain concepts. The sentence “Various
copper compounds such as copper oxides, nitrides, and sulphides have been studied extensively due to
their excellent optical and electronic properties,” signals the semantic relationship of hyponymy
between the terms through the use of the cue “such as”, since copper oxide, nitride, and sulphide are
hyponyms (subtypes) of copper compound. The use of such phrases to encode complex relationships
appears to have its own rules of description: a kind of local grammar governs the behaviour of clauses.
Cruse has discussed the notion of semantic frames: a triplet of phrases - X REL Y where X and Y are
noun phrases (NPs) and REL is a phrase generally expressed as IS A, IS A TYPE OF/KIND OF and PART OF
for illustrating hyponymic and meronymic relationships respectively [4]. Apart from these cues, most
commonly used in biological classifications, it has been suggested that certain lexico-syntactic patterns
occurring in texts can be similarly used, such as the frame (X1, …, Xn) OR OTHER Y where each X and Y
are NPs and each Xi in the list (X1, …, Xn) is a hyponym of Y [7]. Within our method we have
employed cues suggested by Hearst as well as other patterns to automatically extract relevant
sentences, which are processed to elicit the hypernym-hyponym pairs [1]. These partial graphs can
then be used to augment the candidate ontology.
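A rough Python sketch of cue-based extraction for one family of patterns (Y such as / including / like X1, …, Xn) is given below. It is deliberately crude and rests on assumptions that are not part of the method described here: the hypernym noun phrase is approximated by up to two words before the cue, a small hypothetical stoplist marks where that phrase begins, and the enumeration is assumed to run to the end of the clause. A realistic implementation would rely on part-of-speech tagging or noun-phrase chunking instead.

```python
import re

CUES = re.compile(r"\b(?:such as|including|like)\b", re.IGNORECASE)
# Assumption: a small stoplist marks where the hypernym noun phrase begins.
STOP = {"a", "an", "the", "of", "in", "various", "many", "different", "other"}


def hyponym_pairs(sentence, max_words=2):
    """Return (hyponym, hypernym) pairs signalled by the first cue found."""
    match = CUES.search(sentence)
    if match is None:
        return []
    # Hypernym: trailing words before the cue, stopping at a stoplist word.
    head = []
    for token in reversed(re.findall(r"[\w-]+", sentence[:match.start()])):
        if token.lower() in STOP or len(head) == max_words:
            break
        head.insert(0, token)
    if not head:
        return []
    hypernym = " ".join(head)
    # Hyponyms: the enumeration after the cue, up to the end of the clause.
    clause = re.split(r"[.;:]", sentence[match.end():], maxsplit=1)[0]
    items = re.split(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", clause.strip())
    return [(item, hypernym) for item in items if item]
```

The sketch does not normalise plurals and does not handle frames in which the hypernym follows the list, such as OR OTHER.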
Combining the statistical and linguistic methods produces a “tree” of terms and relations,
mostly organised hierarchically. The use of recently developed and developing terminology standards
ISO 12620 and ISO 16642 enables us to produce a terminology markup language (TML) that
represents these results. We use terminology data categories from ISO 12620, which include relational
data categories such as superordinate concept and subordinate concept, along with mechanisms for
developing a TML from ISO 16642, including the use of the notions of style and vocabulary, to
produce an XML-conformant encoding. This encoding can be converted to, for example, MARTIF
(ISO 12200), or to the TermBase eXchange (TBX) format developed by the Localisation Industry
Standards Association (LISA). The combination of these standards with the extraction method can
provide the basis for a terminology collection. By following Maedche’s approach to defining an
“ontology wrapper” for WordNet [11], we can consider the provision of a wrapper for a (concept
oriented) terminology collection created in this way. To do so, we map from TermEntry (ISO 16642)
to concept; from term (ISO 12620) to lexicon; from broader concept generic (‘is a’), superordinate
concept, superordinate concept generic, subordinate concept, subordinate concept generic (ISO 12620)
to hierarchical relation; and from relationships including broader concept partitive (‘has a’),
sequentially related concept, temporally related concept and spatially related concept (ISO 12620) to
relations. We could consider the extension of this model to retain information with respect to
relationships between terms, which may be of importance in establishing relationships between parts of
concepts, but which can be extended to lexicographical collections and term-oriented terminology
collections. Such relationships in ISO 12620 include short form of term, initialism, acronym, clipped
term, homonym and homograph. In the supertype/subtype relationship-based Resource Description
Framework Schema (RDFS), each term would form the content of an rdfs:label, and suitable concept
identifiers (rdf:ID) would be used to present the classing and subclassing (rdfs:Class, rdfs:subClassOf).
In this conversion, there is a degree of information loss, since RDFS does not cater for much of the
information needed for a terminology format and it is not expressive enough to cater for natural
languages; however, the mapping to an ontology language shows the ability to directly populate such an
ontology system. Ontology editing applications that understand RDFS, including Protégé and OilEd,
can use such output to seed their ontologies for further development.
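A minimal Python sketch of the final step, serialising a term hierarchy as RDFS in RDF/XML, is shown below. It assumes the hierarchy is available as a mapping from each term to its supertype, and that concept identifiers can be formed by replacing spaces with underscores; both are illustrative choices, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def concept_id(term):
    """Assumption: identifiers are the term with spaces replaced by underscores."""
    return term.replace(" ", "_")


def to_rdfs(hierarchy):
    """Serialise {term: supertype term, or None for a root} as RDF/XML."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("rdfs", RDFS)
    root = ET.Element(f"{{{RDF}}}RDF")
    for term, parent in hierarchy.items():
        # One rdfs:Class per concept, identified by rdf:ID and labelled by the term.
        cls = ET.SubElement(root, f"{{{RDFS}}}Class",
                            {f"{{{RDF}}}ID": concept_id(term)})
        ET.SubElement(cls, f"{{{RDFS}}}label").text = term
        if parent is not None:
            ET.SubElement(cls, f"{{{RDFS}}}subClassOf",
                          {f"{{{RDF}}}resource": "#" + concept_id(parent)})
    return ET.tostring(root, encoding="unicode")
```

Only the hierarchical relation and the term label survive in this output, which reflects the information loss noted above.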
Case Study: Carbon Nanotubes
A corpus of 1,012,096 tokens was collected comprising 404 learned articles from the Applied Physics
Letters section on Nanoscale Science and Design. Analysis produces a list of 26861 words. Removal
of the top 2000 words of the BNC reduces the length of this list to 25339 (1522 fewer words, a reduction
of 5.7% of the vocabulary). Of these 25339, 14142 words produce an “INFINITE” weirdness – nearly
56% of words do not occur in the BNC. With z-score (> 1) for frequency and weirdness, a subset of 46
words (0.18%) is selected to consider for collocations (relaxing both z-scores to values above 0.5
would result in 90 words being selected for the next phase of this analysis). The 46 words contain high
frequency-high weirdness combinations, for example in the selection of nanotubes in 6th position (1378
and “INFINITE” respectively) when ordered by frequency. For ordering by weirdness, the first 10
results include: nanotubes, nanotube, nanoparticles, nanowires, tunneling and cnts. Results obtained
from this corpus for frequency and weirdness, such as 2142 and 225 respectively for electron and 126
and “INFINITE” respectively for fiber, determine the exclusion of these terms from this particular set.
These automatically selected 46 words are then used as the seeds in our collocation process.
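The two collocation statistics applied to each seed word can be sketched in Python as follows. The sketch assumes the usual Xtract definitions: the z-score (strength) standardises a collocate's total frequency against all collocates of the seed, and the U-score (spread) is the variance of its counts across the ten positions from -5 to +5. The input layout and function names are hypothetical.

```python
from statistics import mean, pstdev


def collocation_scores(position_counts):
    """position_counts maps each collocate to its counts at offsets -5..-1, +1..+5.

    Returns {collocate: (z, u)}, with z the strength and u the spread.
    """
    freqs = {w: sum(counts) for w, counts in position_counts.items()}
    values = list(freqs.values())
    mu, sigma = mean(values), pstdev(values)
    scores = {}
    for word, counts in position_counts.items():
        z = (freqs[word] - mu) / sigma if sigma else 0.0
        avg = freqs[word] / len(counts)
        # Spread: variance of the counts over the positions considered.
        u = sum((c - avg) ** 2 for c in counts) / len(counts)
        scores[word] = (z, u)
    return scores


def significant(position_counts, min_u=10.0, min_z=1.0):
    """Collocates that are both frequent (z) and positionally peaked (u)."""
    return [word for word, (z, u) in collocation_scores(position_counts).items()
            if u > min_u and z > min_z]
```

A collocate that is frequent but evenly spread over all positions has a U-score near zero and is rejected, whereas one concentrated at a single position, such as –1, passes.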
For nanotubes, with a distance of –5 to +5, 1811 collocating words are found. nanotubes collocates
with carbon a total of 690 times, 647 of which are at position –1 (giving the compound carbon
nanotubes). Applying U > 10 and z > 1 reduces the number of results to consider from 1811 to 22 (=
98.8% of collocates ignored). Our further constraint with regard to positions +1 and –1 reduces this list
further to 4, with consideration of carbon nanotubes, z nanotubes, nanotubes cnts and nanotubes
grown. Relaxing to U > 5 and z > 0.5 increases the initial number of considerations to 47; however, this
only increases the list for positions –1 and +1 by 7 compounds. For carbon nanotubes, with U > 10
and z > 5, we achieve a further list of 25 collocations, reduced by considering position information to
11. Taking the four examples at position –1, still applying the constraints, we derive (frequencies in
brackets): aligned carbon nanotubes (48), vertically aligned carbon nanotubes (15), aligned carbon
nanotubes kai (4), multiwalled carbon nanotubes (46), multiwalled carbon nanotubes mwnts (13),
single-wall carbon nanotubes (24), single-wall carbon nanotubes swnts (4). Interestingly,
singlewalled carbon nanotubes is also extended by swnts (f = 19); however, when we consider multiwall
carbon nanotubes, the mwnts extension does not satisfy the conditions. There is, perhaps, some tension
between wall and walled within this collection. Analysis by hand would suggest that vertically aligned
carbon nanotubes is valid, while single-wall carbon nanotube and multiwalled carbon nanotubes
appear to only be extended by abbreviations. Extending this analysis to lower frequencies (relaxing U
and z constraints), we find longer terms such as: conventional horizontal-type metalorganic chemical
vapor deposition reactor; ridge-type ingaas quantum-wire field-effect transistors; and trench-type
narrow ingaas quantum-wire field effect transistor. From the resulting list of term candidates, we
consider the matching mechanism for determining potential term clusters (possible synonyms), and if
we consider multiwalled carbon nanotube, multiwall carbon nanotube and multiwalled carbon
nanotubes, these can be shown to match with a value greater than about 0.92, so we present them as a
term cluster. From this analysis, we can produce hyponymies such as (the arrow indicates subtype →
supertype):
[nanowire array], [boron nanowire], [nanowire transistor] → [nanowire]
[fe nanowire array], [thicker nanowire array], [thin nanowire array] → [nanowire array]
For the linguistic analysis, 722 sentences were extracted using a set of 8 cues, out of which 55%
embodied a domain-related hyponymic relationship. Out of all the cues, “such as” was the most
productive, being used in 66% of the valid sentences. Below we list some example sentences
illustrating the use of the cues “such as”, “and other”, “including” and “like”.
1. This method has been successfully applied in recent years in the synthesis of various metal
nanostructures such as nanowires, nanorods, and nanoparticles.
2. Occasional multiwall carbon nanotubes and other carbon nanostructures were also found
following annealing at higher (> °C) temperatures.
3. The present method will be extended to find and fix nanoparticles including polymers, colloids,
micelles, and hopefully biological molecules/tissues in solution.
4. This technique is promising because many different types of nanowires, like nanotubes or
semiconductor nanowires, are now synthetically available.
From these sentences, various terms can be linked together based on the hyponymic relationship, for
example (the arrow indicates subtype → supertype):
[nanotube], [semiconductor nanowire] → [nanowire] → [metal nanostructure] (sentences 1, 4)
[micelle], [polymer], [colloid] → [nanoparticle] → [metal nanostructure] (sentences 1, 3)
Sentences such as 2 and 4 above may confirm a synonymy relationship between multiwall carbon
nanotubes and multiwalled carbon nanotubes. Furthermore, the partial graphs above can be merged
with results from collocation analysis. Collocates of nanowire can be linked to the nanowire node of
the sub-graph [nanotube] → [nanowire] → [metal nanostructure]. For example:
[ [fe nanowire array], [thicker nanowire array], [thin nanowire array] → [nanowire array] ],
[ [nanotube], [semiconductor nanowire] ]
→ [nanowire] → [metal nanostructure]
This graph could similarly be expanded for other extracted relations and collocations, for example
those of nanoparticle and other subtypes of metal nanostructure.
Mapping these results to RDFS enables us to produce a candidate ontology that can be edited (pruned,
adapted and so on) and visualised within Protégé, as shown in the following figure. The issues of
multiple inheritance and the use of synonyms and abbreviations can then be handled within the
ontology editor. Results of this method still require (human) evaluation, and determination of the
appropriate parameters for term extraction, but show early promise.
Figure 1: Screen shot of the Protégé Ontology Editor displaying a section of the automatically
constructed Carbon Nanotube candidate ontology in RDFS format.
References
[1] Ahmad, K., Tariq, M., Vrusias, B. and Handy, C. (2003). “Corpus-Based Thesaurus Construction for
Image Retrieval in Specialist Domains”. In: Sebastiani, F. (ed.): Proceedings of ECIR’03. LNCS-2633.
Springer Verlag, Heidelberg, pp.502-510.
[2] Alani, H., Kim, S., Millard, D., Weal, M., Hall, W., Lewis, P. and Shadbolt, N. (2003). “Automatic
Ontology-Based Knowledge Extraction from Web Documents.” IEEE Intelligent Systems, Vol.18, No.1,
pp.14-21.
[3] Aussenac-Gilles, N., Biebow, B. and Szulman, S. (2000). “Revisiting Ontology Design: A Method
Based on Corpus Analysis.” Proceedings of EKAW 2000, LNAI-1937, Springer-Verlag, Berlin
Heidelberg. pp.172-188.
[4] Cruse, D. A. (1986). Lexical Semantics. Cambridge University Press, Avon, Great Britain.
[5] Gruber, T. (1993). “A translation approach to portable ontologies,” Knowledge Acquisition, Vol.5,
No.2, pp.199-220.
[6] Guarino, N., Masolo, C., and Vetere, G. (1999). “ONTOSEEK: Content-Based Access to the Web.”
IEEE Intelligent Systems, Vol.14, No.3, pp.70-80.
[7] Hearst, M. A. (1992). “Automatic Acquisition of Hyponyms from Large Text Corpora.” Proceedings of
the Fourteenth International Conference on Computational Linguistics, Nantes, France.
[8] Kaminsky, J. (1969). “Language and Ontology.” Southern Illinois University Press.
[9] Küng, G. (1967). “Ontology and the Logistic Analysis of Language.” D.Reidel Publishing Company,
Dordrecht, Holland.
[10] Lewis, D. and Croft, W. (1990). “Term clustering of syntactic phrases.” ACM SIGIR-90. pp.385-404.
[11] Maedche, A. (2002). “Ontology Learning for the Semantic Web.” The Kluwer International Series in
Engineering and Computer Science, Vol.665, ISBN: 0792376560.
[12] Maedche, A., Motik, B., Stojanovic, L., Studer, R. and Volz, R. (2003). “Ontologies for enterprise
knowledge management”. IEEE Intelligent Systems, March-April 2003, Vol.18, Issue 2, pp.26-33.
[13] Mikheev, A. and Finch, S. (1995). “A Workbench for Acquisition of Ontological Knowledge from
Natural Text”. In: Proceedings of the 7th conference of the European Chapter for Computational
Linguistics (EACL'95). Dublin, Ireland. pp.194-201.
[14] Navigli, R., Velardi, P. and Gangemi, A. (2003). “Ontology Learning and Its Application to Automated
Terminology Translation”. IEEE Intelligent Systems, Vol.18, No.1, pp.22-31.
[15] Oard, D.W. (1997). “Alternative approaches for cross-language text retrieval.” In: AAAI Symposium
on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March
1997.
[16] Salton, G. and McGill, M. J. (1983). “Introduction to Modern Information Retrieval.” McGraw-Hill, New
York. pp.201 et seq.
[17] Smadja, F. (1993). “Retrieving collocations from text: Xtract.” Computational Linguistics, Vol.19,
No.1. Oxford University Press. pp.143-178.
[18] Sowa, J.F. (2000). “Knowledge Representation: Logical, Philosophical, and Computational
Foundations.” Brooks Cole Publishing Co., Pacific Grove, CA. pp.492, 497 et seq.
[19] Srinivasan, P. and Ruiz, M. E. (1998). “Crosslingual Information Retrieval with the UMLS: An
Analysis of Errors”. In: Proceedings of the 61st Annual Meeting of the American Society for
Information Science, Pittsburgh, PA. pp.153-165.