Using WordNet for Text Categorization
Abstract: This paper explores a method that uses WordNet concepts to categorize text documents. The bag-of-words representation commonly used for text representation is unsatisfactory because it ignores possible relations between terms. The proposed method extracts generic concepts from WordNet for all the terms in the text, then combines them with the terms in different ways to form a new representative vector. The effects of this method are examined in several experiments using the multivariate chi-square statistic to reduce the dimensionality, the cosine distance for classification, and two benchmark corpora for evaluation: the Reuters-21578 newswire articles and the 20 Newsgroups data. The proposed method is especially effective in raising the macro-averaged F1 value, which increased from 0.649 to 0.714 for Reuters and from 0.667 to 0.719 for 20 Newsgroups.
Keywords: 20 Newsgroups, ontology, Reuters-21578, text categorization, WordNet, cosine distance.
board of directors) and to disambiguate these homonymic significances, "a board" will also belong to the synset {board, committee}. The definition of synsets varies from the very specific to the very general. The most specific synsets gather a restricted number of lexical significances, whereas the most general synsets cover a very broad range of significances.

The organization of WordNet around lexical significances instead of lexemes distinguishes it from traditional dictionaries and thesauri [11]. The other difference between WordNet and traditional dictionaries is the separation of the data into four databases associated with the categories of verbs, nouns, adjectives, and adverbs. This choice of organization is justified by psycholinguistic research on how humans associate words with syntactic categories. Each database is organized differently from the others: the nouns are organized in a hierarchy, the verbs by relations, and the adjectives and adverbs by N-dimensional hyperspaces [11].

The following list enumerates the semantic relations available in WordNet. These relations hold between concepts, but the examples we give are based on words.

• Synonymy: relation binding two equivalent or close concepts (frail/fragile). It is a symmetrical relation.
• Antonymy: relation binding two opposite concepts (small/large). This relation is symmetrical.
• Hyperonymy: relation binding a concept-1 to a more general concept-2 (tulip/flower).
• Hyponymy: relation binding a concept-1 to a more specific concept-2. It is the reciprocal of hyperonymy. This relation can be useful in information retrieval: if all the texts dealing with vehicles are sought, it can be interesting to also retrieve those which speak about cars or motorbikes.
• Meronymy: relation binding a concept-1 to a concept-2 which is one of its parts (flower/petal), one of its members (forest/tree), or a substance it is made of (pane/glass).
• Metonymy: relation binding a concept-1 to a concept-2 of which it is one of the parts. It is the opposite of the meronymy relation.
• Implication: relation binding a concept-1 to a concept-2 which results from it (to walk/to take a step).
• Causality: relation binding a concept-1 to its purpose (to kill/to die).
• Value: relation binding a concept-1 (adjective) which is a possible state of a concept-2 (poor/financial condition).
• Has the value: relation binding a concept-1 to its possible values (adjectives) (size/large). It is the opposite of the value relation.
• See also: relation between concepts having a certain affinity (cold/frozen).
• Similar to: relation gathering adjectival concepts whose meanings are close. A synset is designated as central to the grouping, and the 'Similar to' relation binds each peripheral synset to the central one (moist/wet).
• Derived from: indicates a morphological derivation between the target concept (adjective) and the origin concept (coldly/cold).

2.1. Synonymy in WordNet

A synonym is a word which can be substituted for another without an important change of meaning. Cruse [2] distinguishes three types of synonymy:

• Absolute synonyms.
• Cognitive synonyms.
• Plesionyms.

According to Cruse's definition of cognitive synonyms [2], X and Y are cognitive synonyms if they have the same syntactic function and all grammatical declarative sentences containing X have the same truth conditions as the identical sentences in which X is replaced by Y.

Example: car/automobile.

The relation of synonymy is at the base of the structure of WordNet. Lexemes are gathered into sets of synonyms ("synsets"); a synset thus contains all the terms used to denote a concept. The definition of synonymy used in WordNet [11] is as follows: "Two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not modify the truth value of the sentence in which the substitution is made."

Example of synset: {person, individual, someone, somebody, mortal, human, soul}.

2.2. Hyponyms/Hyperonyms in WordNet

X is a hyponym of Y (and Y is a hyperonym of X) if:

• F(X) is the minimal indefinite expression compatible with the sentence "A is F(X)", and
• "A is F(X)" implies "A is F(Y)".

In other words, hyponymy is the relation between a narrower term and a generic term, expressed by the phrase "is-a".

Example: It is a dog → It is an animal [2]. A dog is a hyponym of animal, and animal is a hyperonym of dog.
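These relations can be explored directly in software. As an illustration (ours, not part of the original paper), the following Python sketch uses NLTK's WordNet interface to query the synonymy and hyponymy/hyperonymy relations just defined:

```python
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

# Synonymy: every sense of "board" is a synset gathering its synonyms.
for synset in wn.synsets('board', pos=wn.NOUN):
    print(synset.name(), synset.lemma_names())

# Hyponymy / hyperonymy: "a dog is an animal".
dog = wn.synsets('dog', pos=wn.NOUN)[0]   # most common sense is listed first
print(dog.hypernyms())                    # more general concepts (hyperonyms)
print(dog.hyponyms())                     # more specific concepts (hyponyms)

# The relation is transitive, so a full "is-a" chain can be climbed:
for path in dog.hypernym_paths():
    print(' -> '.join(s.name() for s in path))
```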
In WordNet, hyponymy is a lexical relation between meanings of words, and more precisely between synsets (synonym sets). The relation is defined by: X is a hyponym of Y if "X is a kind of Y" is true. It is a transitive and asymmetrical relation, and it generates a downward inheritance hierarchy for the organization of the nouns and the verbs. Hyponymy is represented in WordNet by the symbol '@', which is interpreted as "is-a" or "is a kind of".

Example: It is a tree → It is a plant.

3. WordNet-Based Text Categorization

The suggested approach is composed of two stages, as indicated in Figure 1. The first stage is the learning phase. It consists of:

• Generating a new text representation based on merging terms with their associated concepts.
• Selecting the characteristic features for creating the categories' profiles.

The second stage is the classification phase. It consists of:

• Weighting the features in the categories' profiles.
• Calculating the distance between the categories' profiles and the profile of the document to be classified.

[Figure 1. Overview of the approach: the bag of words is generated from the document to be classified and from the categories using WordNet, and the cosine distances between the resulting profiles are then calculated.]

3.1. The Learning Phase

The first issue that needs to be addressed in text categorization is how to represent texts so as to facilitate machine manipulation while retaining as much information as needed. The commonly used text representation is the bag-of-words, which simply uses a set of words and the number of occurrences of the words to represent documents and categories [12]. Many efforts have been made to improve this simple and limited representation. For example, [6] uses phrases or word sequences to replace single words. In our approach, we use a method that merges terms with their associated concepts to represent texts. To generate a text representation using this method, four steps are required (a code sketch of the first three steps is given at the end of this subsection):

• Mapping terms into concepts and choosing a merging strategy.
• Applying a strategy for word sense disambiguation.
• Applying a strategy for considering hypernyms.
• Applying a strategy for feature selection.

3.1.1. Mapping Terms into Concepts

The process of mapping terms into concepts is illustrated with the example shown in Figure 2. For simplicity, suppose there is a text consisting of only 10 words: government (2), politics (1), economy (1), natural philosophy (2), life science (1), math (1), political economy (1), and science (1), where the number indicated is the number of occurrences.

[Figure 2 maps the key words to concepts: government (2) and politics (1) → government (3); economy (1) and political economy (1) → economics (2); natural philosophy (2) → physics (2); life science (1) → bioscience (1); math (1) → mathematics (1); science (1) → science (1).]
Figure 2. Example of mapping terms into concepts.

The words are then mapped into their corresponding concepts in the ontology. In the example, the two words government (2) and politics (1) are mapped to the concept government, and the term frequencies of these two words are added to the concept frequency. From this point, three strategies for adding or replacing terms by concepts can be distinguished, as proposed by [1]:

A. Add Concept
This strategy extends each term vector t_d with new entries for the WordNet concepts C appearing in the text set. Thus, the vector t_d is replaced by the concatenation of t_d and c_d, where c_d = (cf(d, c_1), ..., cf(d, c_l)) is the concept vector, l = |C|, and cf(d, c) denotes the frequency with which a concept c ∈ C appears in a text d. The terms which appear in WordNet as a concept are therefore accounted for at least twice in the new representation: once in the old term vector t_d and at least once in the concept vector c_d.

B. Replace Terms by Concepts
This strategy is similar to the first one; the only difference is that it avoids the duplication of terms in the new representation, i.e., the terms which appear in WordNet are taken into account only in the concept vector. The term vector thus contains only the terms which do not appear in WordNet.

C. Concept Vector Only
This strategy differs from the second one in that it excludes all the terms from the new representation, including the terms which do not appear in WordNet; the concept vector c_d alone is used to represent the category.

3.1.2. Strategies for Disambiguation

The assignment of terms to concepts is ambiguous: one word may have several meanings and may thus be mapped into several concepts. In this case, we need to determine which meaning is being used, which is the problem of sense disambiguation [8]. Since a sophisticated solution for sense disambiguation is often impractical [1], we have considered the two simple disambiguation strategies used in [7].

A. All Concepts
This strategy considers all the proposed concepts as appropriate for augmenting the text representation. It is based on the assumption that texts contain central themes, which in our case will be indicated by certain concepts having high weights. In this case, the concept frequencies are calculated as follows:

    cf(d, c) = tf(d, {t ∈ T | c ∈ ref_c(t)})    (1)

B. First Concept
This strategy considers only the most often used sense of a word as its most appropriate concept. It is based on the assumption that the ontology used returns an ordered list of concepts in which more common meanings are listed before less common ones [10]:

    cf(d, c) = tf(d, {t ∈ T | first(ref_c(t)) = c})    (2)

3.1.3. Adding Hypernyms

If concepts are used to represent texts, the relations between concepts play a key role in capturing the ideas in these texts. Recent research shows that simply changing terms to concepts without considering the relations does not bring a significant improvement, and sometimes even performs worse than terms alone [1]. For this purpose, we have considered the hypernym relation between concepts by adding to the concept frequency of each concept in a text the frequencies with which its hyponyms appear. The frequencies of the concept vector part are then updated in the following way:

    cf'(d, c) = Σ_{b ∈ H(c)} cf(d, b)    (3)

where H(c) denotes the concept c together with its hyponyms.
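To make the steps above concrete, here is a minimal sketch of the representation-building phase. It is our illustration rather than the authors' code: it assumes NLTK's WordNet interface, implements the "First concept" strategy of equation (2) (keeping the whole synset list gives the "All concepts" strategy of equation (1)), and realizes equation (3) by propagating each concept's frequency upward to its hypernyms; the function names are ours.

```python
from collections import Counter
from nltk.corpus import wordnet as wn

def concept_frequencies(term_freqs, first_concept=True):
    """Map term frequencies tf(d, t) to concept frequencies cf(d, c).

    first_concept=True keeps only the most common sense (equation 2);
    False counts every returned sense (equation 1).
    """
    cf = Counter()
    for term, tf in term_freqs.items():
        synsets = wn.synsets(term.replace(' ', '_'), pos=wn.NOUN)
        for s in (synsets[:1] if first_concept else synsets):
            cf[s.name()] += tf
    return cf

def add_hypernym_frequencies(cf, levels=1):
    """Equation 3, read upwards: each concept passes its frequency on to its
    hypernyms, so a hypernym accumulates the frequencies of the hyponyms
    that occur in the text."""
    updated = Counter(cf)
    for name, freq in cf.items():
        synset = wn.synset(name)
        for _ in range(levels):
            hypers = synset.hypernyms()
            if not hypers:
                break
            synset = hypers[0]
            updated[synset.name()] += freq
    return updated

# The running example of Figure 2 (numbers are term occurrences).
tf = {'government': 2, 'politics': 1, 'economy': 1, 'natural philosophy': 2,
      'life science': 1, 'math': 1, 'political economy': 1, 'science': 1}
cf = add_hypernym_frequencies(concept_frequencies(tf))

# "Add concept" strategy: concatenate the term and concept vectors.
# "Replace terms by concepts" would keep only terms unknown to WordNet,
# and "Concept vector only" would drop the term part entirely.
doc_vector = dict(Counter(tf) + cf)
```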
• tf(t_k, c_i) denotes the number of times feature t_k occurs in category c_i.
• df(t_k) denotes the number of categories in which feature t_k occurs.
• |C| denotes the number of categories.

3.2.2. Distance Calculation

The similarity measure is used to determine the degree of resemblance between two vectors. To achieve reasonable classification results, a similarity measure should generally respond with larger values to documents that belong to the same class and with smaller values otherwise. The dominant similarity measure in information retrieval and text classification is the cosine similarity.

Precision and recall are two standard measures widely used in the text categorization literature to evaluate an algorithm's effectiveness on a given category, where:

    precision = true positives / (true positives + false positives) × 100    (8)

    recall = true positives / (true positives + false negatives) × 100    (9)

We also use the macro-averaged F1 to evaluate the overall performance of our approach on the given datasets. The macro-averaged F1 computes the F1 value for each category and then takes the average over the per-category F1 scores. Given a training dataset with m categories, and assuming the F1 value for the i-th category is F1(i), the macro-averaged F1 is defined as:

    macro-averaged F1 = (Σ_{i=1}^{m} F1(i)) / m
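As an illustration (ours, not the paper's), the cosine similarity used for the distance calculation and the measures of equations (8) and (9) can be computed as follows; we assume the standard per-category F1 = 2 × precision × recall / (precision + recall), whose formula falls outside the extracted text:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors (dicts: feature -> weight)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def macro_f1(per_category_counts):
    """Macro-averaged F1 from per-category (tp, fp, fn) counts."""
    f1_scores = []
    for tp, fp, fn in per_category_counts:
        precision = tp / (tp + fp) if tp + fp else 0.0   # equation (8), as a ratio
        recall = tp / (tp + fn) if tp + fn else 0.0      # equation (9), as a ratio
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)               # average over the m categories
```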
Table 2. Details of the 20Newsgroups categories.

Category                    # Train Docs   # Test Docs   Total # Docs
alt.atheism                      480            319            799
comp.graphics                    584            389            973
comp.os.ms-windows.misc          572            394            966
comp.sys.ibm.pc.hardware         590            392            982
comp.sys.mac.hardware            578            385            963
comp.windows.x                   593            392            985
misc.forsale                     585            390            975
rec.autos                        594            395            989
rec.motorcycles                  598            398            996
rec.sport.baseball               597            397            994
rec.sport.hockey                 600            399            999
sci.crypt                        595            396            991
sci.electronics                  591            393            984
sci.med                          594            396            990
sci.space                        593            394            987
soc.religion.christian           598            398            996
talk.politics.guns               545            364            909
talk.politics.mideast            564            376            940
talk.politics.misc               465            310            775
talk.religion.misc               377            251            628
Total                          11293           7528          18821

4.2. Results

Tables 3 and 4 summarize the results of our approach compared with the bag-of-words representation over the Reuters-21578 (10 largest categories) and the 20Newsgroups categories. The results obtained in the experiments suggest that the integration of conceptual features improves text classification results. On the Reuters categories (see Table 3), the best overall value is achieved by the combination of the "Add concept" merging strategy with the "First concept" disambiguation strategy and a profile size k = 200. Macro-averaged values then reached 71.7%, yielding a relative improvement of 6.8% compared to the bag-of-words representation.

The same remarks can be made on the 20Newsgroups categories (see Table 4). The best performance is obtained with the profile size k = 500. The relative improvement is about 5.2% compared to the bag-of-words representation.

5. Related Work

The importance of WordNet as a source of conceptual information for all kinds of linguistic processing has been recognized through many different experiences and specialized workshops. There are a number of interesting uses of WordNet in information retrieval and supervised learning. Green [4, 5] uses WordNet to construct chains of related synsets (which he calls 'lexical chains') from the occurrences of terms in a document, producing a WordNet-based document representation using a word sense disambiguation strategy and term weighting. Dave [13] has explored WordNet using synsets as features for document representation and subsequent clustering. He did not perform word sense disambiguation and found that WordNet synsets decreased clustering performance in all his experiments. Voorhees [15], as well as Moldovan and Mihalcea, have explored the possibility of using WordNet for retrieving documents by keyword search. It has already become clear from their work that particular care must be taken in order to improve precision and recall.
6. Conclusion and Future Work

In this paper, we have proposed a new approach for text categorization based on incorporating background knowledge (WordNet) into the text representation, together with the multivariate χ2 statistic, which consists of extracting the K features that best characterize a category compared to the others. The experimental results on both the Reuters-21578 and 20Newsgroups datasets show that incorporating background knowledge in order to capture relationships between words is especially effective in raising the macro-averaged F1 value.

The main difficulty is that a word usually has multiple synonyms with somewhat different meanings, and it is not easy to automatically find the correct synonyms to use. Our word sense disambiguation technique is not capable of determining the correct senses. Our future work includes a better disambiguation strategy for a more precise identification of the proper synonym and hyponym synsets. Some work has been done on creating WordNets for specialized domains and integrating them into MultiWordNet; we plan to make use of it to achieve further improvement.

References

[1] Bloehdorn S. and Hotho A., "Text Classification by Boosting Weak Learners Based on Terms and Concepts", in Proceedings of the Fourth IEEE International Conference on Data Mining, IEEE Computer Society Press, 2004.
[2] Cruse D., Lexical Semantics, Cambridge University Press, Cambridge, 1986.
[3] Dash M. and Liu H., "Feature Selection for Classification", Intelligent Data Analysis, Elsevier, vol. 1, no. 3, 1997.
[4] Green S., "Building Hypertext Links in Newspaper Articles Using Semantic Similarity", in Proceedings of the Third Workshop on Applications of Natural Language to Information Systems (NLDB'97), pp. 178-190, 1997.
[5] Green S., "Building Hypertext Links by Computing Semantic Similarity", IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 11, no. 5, pp. 713-730, 1999.
[6] Hofmann T., "ProbMap: A Probabilistic Approach for Mapping Large Document Collections", Intelligent Data Analysis, vol. 4, pp. 149-164, 2000.
[7] Hotho A., Staab S., and Stumme G., "Ontologies Improve Text Document Clustering", in Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM'03), pp. 541-544, 2003.
[8] Ide N. and Véronis J., "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art", Computational Linguistics, vol. 24, no. 1, pp. 1-40, 1998.
[9] Kehagias A., Petridis V., Kaburlasos V., and Fragkou P., "A Comparison of Word and Sense-Based Text Categorization Using Several Classification Algorithms", Journal of Intelligent Information Systems, vol. 21, no. 3, pp. 227-247, 2001.
[10] McCarthy D., Koeling R., Weeds J., and Carroll J., "Finding Predominant Senses in Untagged Text", in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp. 280-287, 2004.
[11] Miller G., "Nouns in WordNet: A Lexical Inheritance System", International Journal of Lexicography, vol. 3, no. 4, 1990.
[12] Peng X. and Choi B., "Document Classifications Based on Word Semantic Hierarchies", in Proceedings of the International Conference on Artificial Intelligence and Applications (IASTED), pp. 362-367, 2005.
[13] Pennock D., Dave K., and Lawrence S., "Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews", in Proceedings of the Twelfth International World Wide Web Conference (WWW'2003), ACM, 2003.
[14] Sebastiani F., "Machine Learning in Automated Text Categorization", ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[15] Voorhees E., "Query Expansion Using Lexical-Semantic Relations", in Proceedings of ACM SIGIR, Dublin, Ireland, pp. 61-69, ACM/Springer, 1994.

Zakaria Elberrichi is a lecturer in computer science and a researcher at the Evolutionary Engineering and Distributed Information Systems Laboratory (EEDIS) at the University Djillali Liabes, Sidi Bel Abbes, Algeria. He holds a master's degree in computer science from California State University, in addition to a PGCert in higher education. He has more than 17 years of experience teaching computer science at both the BSc and MSc levels and in planning and leading data mining related projects, the latest of which is called "New Methodologies for Knowledge Acquisition". He supervises five master's students in e-learning, text mining, web services, and workflow.