
Sequential patterns for text categorization

2006, Intelligent Data Analysis


Simon Jaillet, Anne Laurent and Maguelonne Teisseire
LIRMM-CNRS – Université Montpellier 2, 161 rue Ada, 34392 Montpellier Cedex 5, France
E-mail: {jaillet,laurent,teisseire}@lirmm.fr

Received 22 June 2005; revised 25 August 2005; accepted 12 November 2005

Abstract. Text categorization is a well-known task based essentially on statistical approaches using neural networks, Support Vector Machines and other machine learning algorithms. Texts are generally considered as bags of words without any order. Although these approaches have proven to be efficient, they do not provide users with comprehensible and reusable rules about their data. Such rules are, however, very important for users to describe trends in the data they have to analyze. In this framework, an association-rule based approach (CBA) has been proposed by Bing Liu. In this paper, we propose to extend this approach by using sequential patterns in the SPaC method (Sequential Patterns for Classification) for text categorization. Taking order into account allows us to represent the succession of words through a document without the complex and time-consuming representations and treatments performed in natural language and grammatical methods. The original method we propose here consists of mining sequential patterns in order to build a classifier. We experimentally show that our proposal is relevant: in particular, our method outperforms CBA and provides better results than SVM on some corpora.

Keywords: Text mining, categorization, sequential patterns, SPaC

1. Introduction

Automatic text classification goes back at least to the 1960s [24], but with the growing volume of available digital documents, it has been extensively addressed by research in the past few years in order to define efficient and scalable methods [33,39]. In this domain, two distinct types of approaches have been proposed: supervised and unsupervised classification. In supervised classification (also known as categorization), categories are defined by an expert, while in the unsupervised case (also called clustering) they are learned automatically [14,31]. In this setting, the goal is to define a function which associates texts with categories. The learning step involves automatically defining this function using a training set, which consists of texts for which the category is known.
Then this function can be used to associate a category (e.g. politics or sport) to a new text that has never been processed. The more accurate the automatic decision, the better the classifier. Currently, the best classifiers are mostly based on the statistical text representation TF-IDF (Term Frequency, Inverse Document Frequency) [31] and machine learning algorithms such as neural networks or Support Vector Machines (SVM). However, most of these methods do not provide understandable descriptions of the extracted knowledge. In order to cope with this problem, an association-rule based approach was first proposed by Bing Liu (CBA) [21] and subsequently enhanced in [5,8,12,16,19,37], among others.

All these methods consider each text as a so-called bag of words, where no order between words is taken into account for categorization. This textual representation has proven to be useful and almost as efficient as complex representations which require time-consuming methods such as syntactic analysis. It is thus interesting to investigate methods that take order into account while remaining scalable. For this purpose, we study sequential patterns. Sequential patterns aim at discovering temporal relationships between facts embedded in a database. In a market basket analysis problem, sequential patterns refer to rules such as "a customer who bought a TV together with a DVD player later bought a recorder". In this framework, the databases considered consist of customer transactions, recording the items bought by customers (clients) at given dates.

In this paper, we thus propose to extend the CBA method by taking order into account. In our SPaC (Sequential Patterns for Categorization) approach, order is handled by using sequential patterns instead of association rules. Sequential patterns have three main advantages in this framework: first, they provide understandable rules (contrary to SVM, Rocchio, naive Bayes, etc.); secondly, they allow trend analysis, as shown in [18]; thirdly, they extract patterns that are more precise and informative than association rules.

In the original text classification method using sequential patterns that we propose here, sentences are distinguished and ordered in each text: the text is considered as an ordered list of sentences, and each sentence as an unordered set of words. If we compare the market basket analysis problem to our approach, a text plays the role of a client; the sentences of a text play the role of the transactions of this client; the position of a sentence within the text plays the role of the date; and the set of words of a sentence plays the role of the set of items bought.

Experiments show that sequential pattern-based classification with SPaC is very efficient, particularly when Support Vector Machines do not perform well. Our approach is evaluated not only with accuracy, but also with the precision and recall measures merged into the Fβ-measure [30]; these measures have proven to be more relevant for comparing text classification methods [33].

The paper is organized as follows. Section 2 presents the background of the problem addressed by introducing sequential patterns and textual representations.
Section 3 details existing methods that deal with text mining based on frequent patterns and sequential patterns. Section 4 details our method based on sequential patterns (SPaC). Section 5 shows that our method performs well on datasets in French and English. Finally, Section 6 summarizes the paper and presents future work.

2. Problem statement

First, we introduce the categorization problem. Secondly, we formulate the sequence mining problem by summarizing the formal description introduced in [3] and extended in [35].

2.1. Textual representation and categorization

Text categorization is the task of assigning a boolean value to each pair (document, category). For instance, text categorization is used to automatically determine whether a text belongs to the politics or the sport category. In order to build such automatic classifiers, a textual database in which the class of each text is known is considered. This database is partitioned into two parts: the first sub-database is a training set, and the second one is a test set used to evaluate the quality of the classifier.

In usual methods, texts are represented as bags of words [33], meaning that order is not considered. Each document is represented by a vector where each component is a word weighted by a numerical value. The most widely used weight is TF-IDF (Term Frequency – Inverse Document Frequency) [31]. For a word w, we have:

    tfidf(w) = tf(w) · log(N / df(w))

where tf(w) is the number of occurrences of w in the document, df(w) is the number of documents containing w, and N is the total number of documents. The weight tfidf(w) thus represents the relative importance of the word in the document. These vectors describing documents are used to extract knowledge from the training set with common algorithms such as k-nearest neighbors, SVM or naive Bayes.

2.2. Mining sequential patterns

Let DB be a set of customer transactions, where each transaction T consists of a customer-id, a transaction time and a set of items involved in the transaction. Let I = {i1, i2, ..., im} be a set of literals called items. An itemset is a non-empty set of items. A sequence s is a set of itemsets ordered according to their timestamps, denoted by <s1 s2 ... sp> where each sj, j ∈ 1..p, is an itemset. An n-sequence is a sequence of n items (or of length n). For example, consider a customer who purchased items 1, 2, 3, 4, 5 according to the sequence s = <(1) (2,3) (4) (5)>. This means that, apart from 2 and 3 which were purchased together, i.e. during the same transaction, the items of the sequence were bought separately; s is a 5-sequence.

A sequence <s1 s2 ... sp> is a sub-sequence of another sequence <s'1 s'2 ... s'm> if there are integers i1 < i2 < ... < ip such that s1 ⊆ s'i1, s2 ⊆ s'i2, ..., sp ⊆ s'ip. For example, the sequence s' = <(2) (5)> is a sub-sequence of s because (2) ⊆ (2,3) and (5) ⊆ (5). However, <(2) (3)> is not a sub-sequence of s: in s, items 2 and 3 were bought during the same transaction, whereas <(2) (3)> requires 3 to appear in a later transaction than 2.

All transactions from the same customer are grouped together and sorted in increasing order of time; they are called a data sequence. A support value supp(s) for a sequence gives its number of actual occurrences in DB, as illustrated by the sketch below.
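These definitions translate directly into code. The following is a minimal Python sketch (ours, not part of the paper) of the sub-sequence test and of support counting over data sequences:

```python
def is_subsequence(sub, seq):
    """True if `sub` is a sub-sequence of `seq` (both lists of frozensets):
    each itemset of `sub` must be included in a strictly later itemset of
    `seq` than the previous one."""
    pos = 0
    for itemset in sub:
        while pos < len(seq) and not itemset <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1  # the next itemset must match strictly later in `seq`
    return True

def support(sub, data_sequences):
    """Fraction of data sequences containing `sub`, each counted once."""
    return sum(is_subsequence(sub, ds) for ds in data_sequences) / len(data_sequences)

# The running example: s = <(1) (2,3) (4) (5)>
s = [frozenset({1}), frozenset({2, 3}), frozenset({4}), frozenset({5})]
assert is_subsequence([frozenset({2}), frozenset({5})], s)      # <(2) (5)>
assert not is_subsequence([frozenset({2}), frozenset({3})], s)  # <(2) (3)>
```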
Note that a sequence is counted at most once per data sequence when computing the support, even if several occurrences are discovered within it. In other words, the support of a sequence s is the fraction of distinct data sequences that contain s. In order to decide whether a sequence is frequent or not, a minimum support value (minSupp) is specified by the user: a sequence s is said to be frequent if supp(s) ≥ minSupp. Given a database of customer transactions, the problem of sequential pattern mining is to find all sequences whose support is greater than this threshold [28].

Sequential patterns are usually extracted from a database built on the scheme (date, client, items). For instance, consider the database of client purchases in a supermarket shown in Table 1. Each line (transaction, tuple) of this table corresponds to the set of items bought by the client at the corresponding date. In this example, Peter has bought the items 1, 2, 3, 4, 5 in the sequence <(1) (2,3) (4) (5)>, meaning that he first bought 1, then 2 together with 3, then 4 and finally 5.

Table 1. Database of purchases

    Client    Date        Items
    Peter     04/01/12    TV (1)
    Martin    04/02/28    Chocolate (5)
    Peter     04/03/02    DVD Player (2), Camera (3)
    Peter     04/03/12    Printer (4)
    Peter     04/04/26    Chocolate (5)

3. Related work

Text mining has been widely investigated [1,4,18,33]. In this section, we focus on text classification with frequent patterns, since our method is based on rules.

3.1. Classification based on associations: the CBA method

In [21] the authors propose CBA, a text categorization method based on association rules. Contrary to C4.5 [29], CN2 [9] or RIPPER [10], which use heuristic search to learn a subset of the regularities in the data, CBA is based on exhaustive search and aims at finding all rules respecting a minimum support value. CBA consists of two parts: a rule generator (CBA-RG), based on the well-known Apriori algorithm [2], and a classifier builder (CBA-CB), based on the generated rules.

3.1.1. CBA-RG

In this first step, each assignment <text, category> is represented by a ruleitem ρ = <condset, Ci>, where condset is a set of items (in data mining, a condset is also called an itemset) and Ci is a class label. Each ruleitem ρ is equivalent to a rule of the form condset → Ci, whose support and confidence are defined by:

    sup(ρ) = #texts from Ci matching condset / #texts in D

    conf(ρ) = #texts from Ci matching condset / #texts in D matching condset

Ruleitems that satisfy the minimum support are called frequent ruleitems. If two ruleitems have the same condset, only the one with the highest confidence is kept as a possible rule (PR); if several ruleitems have the same condset and the same confidence, the PR is chosen randomly among them. The set of PRs is thus the subset of the frequent ruleitems determined by these two constraints. The set of class association rules (CARs) then consists of all PRs whose confidence is greater than a minimum confidence; in other words, the CARs are the ruleitems that satisfy both the minimum support and the minimum confidence levels. (A minimal sketch of this step is given below.)

In these approaches, frequent patterns are extracted using a single minimum support threshold. However, categories are not always equi-distributed, so considering a single value is not relevant.
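To make the CBA-RG measures and rule selection concrete, here is a minimal sketch restricted to single-word condsets (the level-wise Apriori enumeration of larger condsets is omitted; the data layout and names are ours):

```python
from collections import Counter

def generate_cars(texts, min_sup, min_conf):
    """CBA-RG-style rule generation for 1-item condsets.

    `texts` is a list of (set_of_words, category) pairs. Returns CARs
    as (condset_word, category, confidence, support) tuples.
    """
    n = len(texts)
    match = Counter()      # word -> #texts matching the condset
    match_in = Counter()   # (word, Ci) -> #texts of Ci matching the condset
    for words, cat in texts:
        for w in words:
            match[w] += 1
            match_in[(w, cat)] += 1

    # Keep, for each condset, the highest-confidence frequent ruleitem (the PR).
    best = {}
    for (w, cat), k in match_in.items():
        sup, conf = k / n, k / match[w]
        if sup >= min_sup and (w not in best or conf > best[w][1]):
            best[w] = (cat, conf, sup)

    # The CARs are the PRs that also pass the minimum confidence.
    return [(w, cat, conf, sup)
            for w, (cat, conf, sup) in best.items() if conf >= min_conf]
```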
Choosing a relevant minimum support threshold is crucial for the frequent patterns to be relevant for the categorization task: a high support prevents the system from finding frequent patterns for small categories, while a low support leads to the generation of a huge number of rules, which results in overfitting. Approaches with multiple minimum supports (msCBA) have therefore been proposed [16,22]. In these approaches, ruleitems are extracted using a multiple minimum support strategy: the minimum support level of each category is defined according to the frequency distribution of the categories and the user-defined minimum support threshold:

    minSupCi = minSupuser × freqDistr(Ci),  where freqDistr(Ci) = #texts from Ci / #texts

3.1.2. CBA-CB

Once all CARs have been generated, they are ordered according to the following total order.

Definition 1. Let ri and rj be two classification rules (CARs). ri ≺ rj if:
– conf(ri) > conf(rj);
– or conf(ri) = conf(rj) and supp(ri) > supp(rj);
– or conf(ri) = conf(rj) and supp(ri) = supp(rj) and ri was generated before rj.

Let R be the set of CARs and D the training data. The basic idea of the algorithm is to choose a set of high-precedence rules in R to cover D. The categorizer is thus represented by a list of rules ri ∈ R ordered according to the total order of Definition 1: <(r1, r2, ..., rk), Ci> (where Ci is the target category and rj one of the associated rules). Each rule is then tested over D; if a rule does not improve the accuracy of the classifier, this rule and the following ones are discarded from the list. Once the categorizer has been built, each ordered rule ri ∈ R is tested on each new text to classify: as soon as the condset part of a rule is supported by the text, the text is assigned to the target class of the rule. If no rule applies, the text is assigned to a default class. (The per-category thresholds and this rule ordering are sketched in code below.)

3.2. Enhancements and other approaches

Many other studies have used association rules for classification. In [16], the authors replace confidence by the intensity of implication when sorting the rules that build the classifier; according to the authors, this measure is powerful when classes are not equally distributed and for low minsup values. In [23], the authors combine the CBA method with other methods such as decision trees, naive Bayes or RIPPER to increase the classification score. In [8], the authors investigate L3, a method for association-rule classification. Contrary to CBA-CB, which takes only one rule into account, they determine the category of a text by considering several rules that are combined by majority voting. In order to cope with the huge number of rules, the authors prune rules during the extraction of the classification rules using χ2, as done in [19]. But contrary to most pruning strategies, L3 performs a "lazy" pruning in order to eliminate only "harmful" rules and not "useful knowledge". In L3, maxrules stands for the maximum number of rules used to classify new cases. Moreover, rules are separated into two levels in order to increase the classification accuracy.
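As announced in Section 3.1, here is a minimal sketch of the msCBA per-category thresholds and of the total order of Definition 1 (the rule layout extends the CAR tuples of the previous sketch with a generation rank; all names are ours):

```python
from collections import Counter

def category_min_supports(texts, min_sup_user):
    """msCBA-style thresholds: minSupCi = minSupUser * freqDistr(Ci)."""
    freq = Counter(cat for _, cat in texts)
    return {c: min_sup_user * k / len(texts) for c, k in freq.items()}

def sort_cars(cars):
    """Total order of Definition 1: confidence desc, then support desc,
    then generation order (position in the list) as the tie-breaker."""
    # each rule is a (word, category, confidence, support) tuple
    return [car for _, car in
            sorted(enumerate(cars),
                   key=lambda t: (-t[1][2], -t[1][3], t[0]))]
```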
L3 has also been enhanced in [7] by considering a separate minimum support threshold for each category. In [4], association rules are used for partial classification, meaning that the classifier does not cover all cases; this work is particularly interesting when dealing with missing values. CAEP [12], LB [27], ADT [37] and CMAR [19] are other existing classification systems using association rules. LB and CAEP are based on rule aggregation rather than rule selection. The particularity of ADT is to prune rules with low support, which are considered meaningless, as in [16]; to avoid overfitting, ADT also uses a learning strategy based on a decision tree. The advantage of CMAR is its categorization policy and a data structure which allows a large number of extracted rules to be stored.

All of these methods are hampered by the same problem: except for msCBA and L3, they use a single minsup value. This limitation leads to overlooking minority classes or overfitting majority classes, depending on the chosen minsup value. Moreover, most of them are based on an Apriori-like method to extract association rules, and the number of generated rules increases dramatically when a low support has to be used. Apart from relatively small numerical datasets from the UCI archives [13], most of these methods are thus unusable for classification tasks that require low minsup values, as text categorization does. ARC-BC [5] proposes a solution for multi-classification (i.e. a text is associated with one or more classes), but no comparison with other association-rule based classifiers is performed to highlight the impact of the method.

In the text mining framework, [18,38] propose to use sequential patterns. The proposal in [38] is based on two methods: the first one relies on the visualization of word occurrences in order to detect sequential patterns, the second one on classical methods to extract sequential patterns. However, the authors do not propose a method to classify texts using sequential patterns. Moreover, the texts considered are associated with a date (the corpus consists of 1,170 articles collected over 6 years), which makes this work very different from our proposal and more difficult to apply, since texts are rarely associated with a date. In [18], the authors demonstrate how sequential patterns are useful for text mining: they are used to extract trends from textual databases. In [34,36], the authors propose to use sequential patterns for categorization. However, this approach does not exploit the full power of sequential patterns, since the patterns considered are lists of items, whereas we aim at considering lists of itemsets: each element of their patterns is composed of a single morpheme (or n-gram), whereas it is interesting to consider patterns whose elements may be more complex (sets of words) and automatically composed.

We thus propose an original method based on sequential patterns for classification. We argue that this method is able to deal with order in texts without being time-consuming. The next section details our approach; Section 5 shows that it obtains good results compared to others.
4. Sequential patterns for classification: the SPaC method

In this paper, we propose an original method (SPaC) for text classification based on sequential patterns [15]. The method consists of two steps: in the first step, sequential patterns are mined from the texts; in the second step, the sequential patterns are used to classify texts. Hereafter, we use the notations introduced in Table 3.

4.1. From texts to sequential patterns

Each text is a set of words, and our method is based on sequential pattern mining: texts are represented as ordered sets of words, using the TF-IDF representation. Each text is considered as the equivalent of a client. The text consists of a set of sentences; each sentence is associated with a date (its position in the text); finally, the set of words contained in a sentence corresponds to the set of items purchased by the client in the market basket analysis framework. Table 2 summarizes the correspondence between the two terminologies.

This representation is coupled with a stemming step and a stop-list. The stemming step replaces each word by its root form; the stop-list prevents the system from learning from noisy words such as "the" or "a". Some words are further discarded by considering the entropy of each stem over the corpus. This eliminates words that could skew the classifier because they are not discriminant enough, and it allows low supports to be applied in the sequential pattern discovery without deteriorating the results. For this purpose, a user-defined threshold is considered. For each word w, we consider its entropy H(w) over all classes Ci, defined as:

    H(w) = −Σ_Ci [ p(w) · p(Ci|w) · log p(Ci|w) + (1 − p(w)) · p(Ci|w̄) · log p(Ci|w̄) ]

In SPaC, sequential patterns are extracted using a multiple minimum support strategy, as in msCBA: a different support is applied for each category Ci. The training set is divided into n training sets, one per category, i.e. texts are grouped depending on their category. Sequential pattern mining algorithms are applied separately on these n databases using the corresponding minimum supports. For each category, the frequent sequential patterns are computed and their supports stored. The support of a frequent pattern is the proportion of texts containing the sequence of words.

Definition 2. Let <s1 ... sp> be a sequence. Its support is defined as:

    supp(<s1 ... sp>) = #texts matching <s1 ... sp> / #texts

Contrary to msCBA, minimum supports are defined automatically in the following way:
(1) the minimum support is set to the lowest possible value, i.e. one text (for example, if the training set contains 200 texts, minsup is set to 0.5%);
(2) if the mining step provides more than X rules, the process is started again with a higher minsup value, i.e. one more text (for example, if the training set contains 200 texts, minsup is increased by 0.5%).

The use of SPAM [6] to find the sequential patterns makes the training step extremely fast: extracting all sequential patterns from the training set takes only a few minutes on a current desktop (Pentium IV 2.4 GHz, 520 MB). The limit on the number of rules (X) is detailed in the experiments (Section 5). Algorithm 1 describes SPaC sequential pattern generation; a minimal sketch of its structure is given below.
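The overall shape of this generation step can be sketched as follows (ours, under the assumption that a SPAM-like miner is available as a function; `mine_sequences` and the data layout are hypothetical):

```python
def spac_mine(texts_by_category, mine_sequences, max_rules=3000):
    """Per-category mining with automatic minsup escalation.

    `texts_by_category` maps each category Ci to its list of texts, each
    text being a list of sentence itemsets; `mine_sequences(texts, minsup)`
    is any sequential pattern miner (e.g. SPAM) returning a dict
    {pattern: support}.
    """
    patterns = {}
    for cat, texts in texts_by_category.items():
        k = 1  # minsup expressed as a number of texts (lowest value: one text)
        while True:
            found = mine_sequences(texts, k / len(texts))
            if len(found) <= max_rules:
                patterns[cat] = found
                break
            k += 1  # too many rules: restart with a higher minsup
    return patterns
```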
The SPAM algorithm is used through the SPMining() function in order to find all frequent sequences in the transactional databases DB [6]. For instance, the following frequent patterns were extracted from the "Purchasing-Logistics" category of our French database:

    <(cacao) (ivoir) (abidjan)>
    <(blé soja) (maï)>
    <(soj) (blé lespin victor) (maï soj) (maï) (grain soj) (soj)>

The first sequential pattern means that some texts contain the word cacao, then ivoir (ivory), then abidjan, in three different sentences. The second one means that some texts contain the stems blé and soja in the same sentence, and then maï (in French, blé means wheat and maïs stands for corn). The third one means, among other things, that the stem maï occurs in two successive sentences before the stem grain.

Experiments have led us to consider a threshold that eliminates about 5 to 10% of the words (according to the Zipf law [32]). Note that sequential patterns take multiple occurrences of the same word in the text into account, contrary to association rules. Moreover, some frequent co-occurrences can be identified with sequential pattern mining.

Table 2. Application of the sequential pattern terminology to textual data

    Usual databases          Textual databases
    client              ↔    text
    item                ↔    word
    items/transaction   ↔    sentence (set of words)
    date                ↔    position of the sentence

Table 3. Notations

    C = {C1, ..., Cn}       set of n categories.
    Ci ∈ C                  a given category.
    minSupCi                user-defined minimum support for category Ci.
    T                       set of texts.
    TCi ⊆ T                 set of texts belonging to category Ci.
    TTrain = {(Ci, TCi)}    training set, constituted by a set of texts associated with their category.
    SEQ                     set of sequences found for category Ci.
    SP                      table of sequential patterns.
    RuleSP                  table of tuples (spj, Ci, confi,j) corresponding to the sequence spj, the category Ci and the confidence confi,j of the rule spj → Ci.

4.2. From sequential patterns to categories

Once sequential patterns have been extracted for each category, the goal is to derive a categorizer from the obtained patterns. This is done by computing, for each category, the confidence of each associated sequential pattern. A rule γ is generated in the following way:

    γ: <s1 ... sp> → Ci

where <s1 ... sp> is a sequential pattern for category Ci. This rule means that if a text contains s1, then s2, ..., then sp, it will belong to category Ci. Each rule is associated with its confidence level, indicating the extent to which the sequential pattern is characteristic of this category:

    conf(γ) = #texts from Ci matching <s1 ... sp> / #texts matching <s1 ... sp>

Rules are sorted according to their confidence level and the size of the associated sequence. When a new text has to be classified, a simple categorization policy is applied: among the rules supported by the text, the K rules with the best confidence level are applied, and the text is assigned to the class obtaining the majority within these K rules. This is the same policy as the majority voting of [8]. If two categories obtain the same score, a random choice is made, which prevents the system from always choosing the same category. A minimal sketch of this classification step is given below.
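A sketch of this majority-voting step (ours; it reuses the is_subsequence test sketched in Section 2, and the rule layout is hypothetical):

```python
import random
from collections import Counter

def spac_classify(text_sequence, rules, k=10):
    """Assign a category by majority vote over the K best supported rules.

    `rules` is a list of (pattern, category, confidence) tuples sorted by
    decreasing confidence; `text_sequence` is the text as a list of
    sentence itemsets.
    """
    supported = [cat for pat, cat, conf in rules
                 if is_subsequence(pat, text_sequence)][:k]
    if not supported:
        return None  # no rule applies to this text
    votes = Counter(supported)
    top = max(votes.values())
    # ties are broken at random so that no category is systematically favored
    return random.choice([c for c, v in votes.items() if v == top])
```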
The SPaC classification step (SPaC-C) is described in Algorithm 2.

5. Experiments

Experiments were conducted on three databases. The first two are the well-known English corpora 20 Newsgroups (with 30% of the texts used for training) and Reuters [13]. The third one is a real database of French news, with 8,239 texts divided into 28 categories; for this corpus, a training set of 33% was used. The experiments compare our approach with the results obtained by CBA and SVM. Sequential patterns are mined using SPAM [6]. Table 4 details these results.

Comparisons are based on the Fβ measure [33], which combines recall and precision into a global evaluation. The Fβ measure is more relevant than accuracy, since accuracy does not take into account the case where a text is not classified: a classifier that classifies no text (and thus never makes an error) would reach an accuracy close to 100%! Accuracy was taken as the reference measure in [8,20]; for the reasons presented above, we argue that this is not as relevant as relying on recall and precision, as mentioned in [33].

Table 4. Comparison of SPaC, msCBA and SVM tested on three different corpora

    French news       F1M     F1µ     Acc.    #Rules
      SPaC            0.461   0.497   0.963   31060
      msCBA           0.367   0.401   0.956   315
      SVM             0.485   0.486   0.969   −
    Reuters
      SPaC            0.322   0.694   0.992   80985
      msCBA           0.082   0.679   0.992   640
      SVM             0.500   0.840   0.996   −
    20 Newsgroups
      SPaC            0.452   0.494   0.946   19938
      msCBA           0.423   0.436   0.941   642
      SVM             0.423   0.455   0.941   −

Table 5. Results of SPaC on the French news corpus

    Minsup allowed   Acc.    F1M     F1µ     #Rules   Time
    1%               0.963   0.461   0.497   31060    441 s
    3%               0.963   0.455   0.495   29931    414 s
    4%               0.963   0.441   0.488   12227    271 s
    5%               0.962   0.435   0.479   5328     221 s
    6%               0.962   0.433   0.477   3019     208 s
    12%              0.957   0.352   0.390   394      181 s
    21%              0.958   0.262   0.320   63       181 s
    49%              0.963   0.018   0.033   1        180 s

Definition 3. The Fβ measure is defined, for each class Ci, as:

    Fβ = (β² + 1) · πi · ρi / (β² · πi + ρi)

where ρ stands for the recall and π for the precision.

Definition 4. Precision and recall are defined as:

    πi = TPi / (TPi + FPi),    ρi = TPi / (TPi + FNi)

where TPi, FPi and FNi stand, respectively, for the number of texts of class Ci which are correctly classified (true positives), the number of texts mistakenly put in class Ci (false positives), and the number of texts of Ci mistakenly put in a different class (false negatives).

Definition 5. Accuracy is defined as:

    Accuracyi = (TPi + TNi) / (TPi + TNi + FPi + FNi)

In order to evaluate the classifier over all classes, we consider micro-averaging (µ) and macro-averaging (M) [33,39], defined as follows:

Definition 6. Macro-averaging and micro-averaging:

    π̂M = Σ_{i=1..|C|} πi / |C|,    ρ̂M = Σ_{i=1..|C|} ρi / |C|

    π̂µ = Σ_{i=1..|C|} TPi / Σ_{i=1..|C|} (TPi + FPi),    ρ̂µ = Σ_{i=1..|C|} TPi / Σ_{i=1..|C|} (TPi + FNi)

Micro-averaging gives the same importance to each document, contrary to macro-averaging, which computes the average class by class (thus placing more importance on small categories). In this paper, we consider that recall and precision have the same importance: we thus take β = 1 for the Fβ measure. A sketch of these evaluation measures follows.
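These measures can be computed directly from the predictions; a minimal sketch (ours), where an unclassified text is represented by a None prediction and counts as a false negative for its true class, which is precisely what accuracy fails to penalize:

```python
from collections import Counter

def f_measures(y_true, y_pred, beta=1.0):
    """Macro- and micro-averaged F_beta following Definitions 3-6."""
    classes = set(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if p == t:
            tp[t] += 1
        else:
            fn[t] += 1          # missed (or unclassified) text of class t
            if p is not None:
                fp[p] += 1      # text mistakenly put in class p

    def f(pi, rho):
        return (beta**2 + 1) * pi * rho / (beta**2 * pi + rho) if pi + rho else 0.0

    def safe(a, b):
        return a / b if b else 0.0

    # Macro: average precision/recall class by class, then combine.
    pi_m = sum(safe(tp[c], tp[c] + fp[c]) for c in classes) / len(classes)
    rho_m = sum(safe(tp[c], tp[c] + fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts over all classes, then combine.
    tp_s, fp_s, fn_s = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return f(pi_m, rho_m), f(safe(tp_s, tp_s + fp_s), safe(tp_s, tp_s + fn_s))
```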
Table 6. Results of SPaC on the 20 Newsgroups corpus

    Minsup allowed   Acc.    F1M     F1µ     #Rules   Time
    1%               0.946   0.452   0.494   19938    594 s
    3%               0.946   0.446   0.491   13246    480 s
    4%               0.945   0.428   0.478   9441     446 s
    6%               0.942   0.386   0.441   4950     400 s
    12%              0.942   0.275   0.348   1209     250 s
    21%              0.943   0.148   0.231   32       178 s
    49%              0.942   0.050   0.079   2        177 s

Fig. 1. Results of SPaC on the French news corpus (F1µ and #Rules as a function of the allowed minsup).

Fig. 2. Results of SPaC on the 20 Newsgroups corpus (F1µ and #Rules as a function of the allowed minsup).

Table 7. Results of SPaC on the Reuters corpus

    Minsup allowed   Acc.    F1M     F1µ     #Rules   Time
    1%               0.992   0.322   0.694   80985    196 s
    2%               0.992   0.307   0.691   79556    172 s
    13%              0.991   0.283   0.619   27931    147 s
    30%              0.989   0.208   0.545   6929     45 s
    50%              0.987   0.216   0.432   791      22 s
    92%              0.986   0.154   0.148   45       19 s
    99%              0.986   0.020   0.030   23       16 s

SVM results are obtained with SVMLight [17] using a linear kernel (no better result is provided by a more complex kernel). The supports chosen here are the values providing the best results while remaining computable (supports that are too low lead to a very time-consuming application). SPaC is applied with K = 10 rules taken into account when classifying new documents. The automatic minsup definition is done by limiting the mining process to X = 3000 rules per category: if SPAM exceeds this limit, the mining process is started again with a higher minsup value. The experiments showed that a larger limit did not provide significant enhancement during the evaluation step.

In this work, we argue that obtaining understandable knowledge is as important as (or even more important than) getting the most accurate classifier. For this reason, comparisons are essentially studied between CBA and SPaC. CBA is tested with the msCBA version of the algorithm, using the best results we obtained when testing different supports. The results show that SPaC always outperforms CBA, because SPaC is able to use more specific rules to classify. SPaC is similar to SVM on the French texts, and obtains better performance than SVM on the 20 Newsgroups database. Nevertheless, SVM remains the best classifier on the Reuters corpus. But this corpus has particularities: 2 categories (out of 90) contain two thirds of the training set, and Table 7 shows that a few rules with a very high support (>90%) alone provide a large part of the score.

Up to this point, the automatic definition of minsup in SPaC provides a lot of rules compared to CBA. In the following experiments, we reduce the number of categorization rules used by SPaC by increasing the starting minsup value in the automatic minsup definition process. Tables 5, 6 and 7, together with Figs 1, 2 and 3, show the number of rules used for categorization and the associated performance in the F1µ measure. These experiments show the stability of F1µ, even when a lot of rules are pruned.
On the other hand, SPaC probably reaches almost its best performance with X = 3,000, since increasing its performance further would require an exponential number of rules. CBA, for its part, cannot increase the number of generated rules to increase its performance, since it is based on a hard rule pruning strategy. Moreover, at the cost of a slight drop in performance, the number of generated sequential patterns becomes small enough to be usable by an expert, for trend analysis, or for manually tuning the classifier.

In Fig. 4, we compare the results obtained for the F1 measure with respect to the number of rules considered for the choice of the class. Experiments have shown that K = 10 provides good results (similar to [8], where the best results correspond to maxrules = 9). Experiments are carried out (1) with order, where the sequential pattern (sequence) must be supported by the text, and (2) without order, where the sub-sequences may be supported by the text in an unspecified order: the corresponding sentences are in the document but are not ordered as in the sequential pattern. We note that the results are always better when order is taken into account.

Fig. 3. Results of SPaC on the Reuters corpus (F1µ and #Rules as a function of the allowed minsup).

Fig. 4. SPaC: F1 as a function of the number of rules K considered, with and without order.

6. Conclusion and further work

In this paper, we address the problem of text categorization using sequential patterns. In our framework, texts are represented by TF-IDF vectors, and each category is associated with a set of sequential patterns. When classifying new data, a text is matched to a category depending on the number of sequential patterns involved, and the corresponding category is determined using majority voting. Even if SVMs have proven to be efficient in such a task, we argue that it is very important to provide users with understandable knowledge about their data. In this framework, sequential patterns are well adapted: they provide rules that are used for classification. We show that this approach is efficient and relevant, in particular when SVMs do not perform well.

Moreover, the method we propose is simple and adaptable to drifting concepts, since it is possible to update the sequential patterns without performing the whole process again, using incremental sequential pattern mining [25]. This possibility is of great importance for text categorization, especially for the automatic analysis of news, which is a very fast and variable area. This is thus a first step towards an On-Line Classification Process (OLCP) using sequential patterns.

Future work includes the application of our approach to different languages in order to determine how important order is for each language. We are working on the automatic definition of the best number of rules to take into account (the K parameter of our method). Our approach may also be enhanced by mining generalized sequential patterns [3]; this framework allows time constraints to be integrated, as shown in [26]. Finally, we aim at integrating multi-level sequential patterns, as proposed in [7].
Very specific rules can thus be kept without damaging the classifier performance. In this framework, we argue that it is interesting to build a very compact set of rules (rules whose left-hand part is as short as possible). For this purpose, we aim at extending the studies on δ-free sets [11] to sequential patterns.

References

[1] R. Agrawal, R.J. Bayardo Jr. and R. Srikant, Athena: Mining-Based Interactive Management of Text Databases, in: Proc. of the 7th Int. Conf. on Extending Database Technology (EDBT'00), Springer, March 2000, 365–379.
[2] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, in: Proc. of the 20th Int. Conf. on Very Large Data Bases (VLDB'94), J.B. Bocca, M. Jarke and C. Zaniolo, eds, Morgan Kaufmann, 1994, 487–499.
[3] R. Agrawal and R. Srikant, Mining Sequential Patterns, in: Proc. of the 11th Int. Conf. on Data Engineering (ICDE'95), Taipei, Taiwan, IEEE Computer Society Press, March 1995, 3–14.
[4] K. Ali, S. Manganaris and R. Srikant, Partial Classification Using Association Rules, in: Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97), CA, USA, AAAI Press, August 1997, 115–118.
[5] M.-L. Antonie and O. Zaiane, Text Document Categorization by Term Association, in: Proc. of the 2002 IEEE Int. Conf. on Data Mining (ICDM'02), December 2002, 19–26.
[6] J. Ayres, J. Gehrke, T. Yiu and J. Flannick, Sequential Pattern Mining Using Bitmaps, in: Proc. of the 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD'02), July 2002.
[7] E. Baralis, S. Chiusano and P. Garza, On support thresholds in associative classification, in: Proc. of the 2004 ACM Symposium on Applied Computing (SAC'04), ACM Press, March 2004, 553–558.
[8] E. Baralis and P. Garza, Majority Classification by Means of Association Rules, in: Proc. of the 7th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD'03), Springer, September 2003, 35–46.
[9] P. Clark and T. Niblett, The CN2 Induction Algorithm, Machine Learning 3 (1989), 261–283.
[10] W. Cohen, Fast Effective Rule Induction, in: Proc. of the 12th Int. Conf. on Machine Learning (ICML'95), CA, USA, Morgan Kaufmann, July 1995, 115–123.
[11] B. Cremilleux and J.F. Boulicaut, Simplest rules characterizing classes generated by delta-free sets, in: Proc. of the 22nd BCS SGAI Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), Springer, December 2002, 33–46.
[12] G. Dong, X. Zhang, L. Wong and J. Li, CAEP: Classification by aggregating emerging patterns, in: Proc. of the 2nd Int. Conf. on Discovery Science (DS'99), December 1999, 30–42.
[13] S. Hettich and S.D. Bay, The UCI KDD Archive [http://kdd.ics.uci.edu], University of California, Irvine, Department of Information and Computer Science, 1999.
[14] M. Iwayama and T. Tokunaga, Cluster-based text categorization: a comparison of category search strategies, in: Proc. of SIGIR-95, 18th ACM Int. Conf. on Research and Development in Information Retrieval, ACM Press, 1995, 273–281.
[15] S. Jaillet, A. Laurent, M. Teisseire and J. Chauché, Order and mess in text categorization: why using sequential patterns to classify, in: Proc. of the 3rd Workshop on Mining Temporal and Sequential Data, in conjunction with the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (TDM'04/KDD'04), WA, USA, August 2004.
[16] D. Janssens, G. Wets, T. Brijs, K. Vanhoof and G. Chen, Adapting the CBA algorithm by means of intensity of implication, in: Proc. of the 1st Int. Conf. on Fuzzy Information Processing Theories and Applications, China, March 2003, 397–403.
[17] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proc. of ECML-98, 10th European Conf. on Machine Learning, Chemnitz, Germany, Springer, 1998, 137–142.
[18] B. Lent, R. Agrawal and R. Srikant, Discovering Trends in Text Databases, in: Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97), CA, USA, AAAI Press, August 1997, 227–230.
[19] W. Li, J. Han and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, in: Proc. of the 2001 IEEE Int. Conf. on Data Mining (ICDM'01), CA, USA, November 2001, 369–376.
[20] Y. Li and A. Jain, Classification of text documents, The Computer Journal 41(8) (1998), 537–546.
[21] B. Liu, W. Hsu and Y. Ma, Integrating Classification and Association Rule Mining, in: Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), NY, USA, AAAI Press, August 1998, 80–86.
[22] B. Liu, Y. Ma and C.-K. Wong, Improving an Association Rule Based Classifier, in: Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD'00), France, Springer, September 2000, 504–509.
[23] B. Liu, Y. Ma and C.-K. Wong, Classification Using Association Rules: Weaknesses and Enhancements, in: Data Mining for Scientific and Engineering Applications, V. Kumar, ed., Kluwer Academic, 2001.
[24] M. Maron, Automatic indexing: an experimental inquiry, Journal of the ACM 8 (1961), 404–417.
[25] F. Masseglia, P. Poncelet and M. Teisseire, Incremental mining of sequential patterns in large databases, Data and Knowledge Engineering 46(1) (2003).
[26] F. Masseglia, P. Poncelet and M. Teisseire, Pre-processing Time Constraints for Efficiently Mining Generalized Sequential Patterns, in: Proc. of the 11th Int. Symposium on Temporal Representation and Reasoning (TIME'04), IEEE Computer Society Press, July 2004, 87–95.
[27] D. Meretakis and B. Wuthrich, Extending Naive Bayes Classifiers Using Long Itemsets, in: Proc. of the 5th Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), CA, USA, August 1999, 165–174.
[28] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu, PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, in: Proc. of the Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, 2001, 215–226.
[29] J. Quinlan, C4.5 – Programs for Machine Learning, Morgan Kaufmann, CA, USA, 1993.
[30] C.J. van Rijsbergen, Information Retrieval, 2nd edition, Butterworths, 1979.
[31] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[32] G. Salton, C. Yang and C. Yu, A theory of term importance in automatic text analysis, Journal of the American Society for Information Science 36 (1975), 33–44.
[33] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34(1) (2002), 1–47.
[34] M. Shimbo, T. Yamasaki and Y. Matsumoto, Automatic classification of sentences in the MEDLINE abstracts: a case study of the power of word sequence features, in: Proc. of the 6th Sanken (ISIR) Int. Symposium, Osaka, Japan, 2003, 135–138.
[35] R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, in: Proc. of the 5th Int. Conf. on Extending Database Technology (EDBT'96), September 1996, 3–17.
[36] M. Takechi, T. Tokunaga, Y. Matsumoto and H. Tanaka, Feature selection in categorizing procedural expressions, in: Proc. of the 6th Int. Workshop on Information Retrieval with Asian Languages (IRAL'2003), Sapporo, Japan, July 2003.
[37] K. Wang, S. Zhou and Y. He, Growing Decision Trees on Support-less Association Rules, in: Proc. of the 6th Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), MA, USA, ACM Press, August 2000, 265–269.
[38] P.-C. Wong, W. Cowley, H. Foote, E. Jurrus and J. Thomas, Visualizing Sequential Patterns for Text Mining, in: Proc. of the 2000 IEEE Symposium on Information Visualization (INFOVIS'00), UT, USA, IEEE Computer Society Press, October 2000, 105–114.
[39] Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval 1(1/2) (1999), 69–90.