Sequential patterns for text categorization
S. Jaillet, A. Laurent and M. Teisseire
LIRMM-CNRS – Université Montpellier 2, 161 rue Ada, 34392 Montpellier Cedex 5 France
E-mail: {jaillet,laurent,teisseire}@lirmm.fr
Received 22 June 2005
Revised 25 August 2005
Accepted 12 November 2005
Abstract. Text categorization is a well-known task based essentially on statistical approaches using neural networks, Support
Vector Machines and other machine learning algorithms. Texts are generally considered as bags of words without any order.
Although these approaches have proven to be efficient, they do not provide users with comprehensive and reusable rules about
their data. Such rules are, however, very important for users to describe trends in the data they have to analyze. In this
framework, an association-rule based approach has been proposed by Bing Liu (CBA). We propose, in this paper, to extend this
approach by using sequential patterns in the SPaC method (Sequential Patterns for Classification) for text categorization. Taking
order into account allows us to represent the succession of words through a document without complex and time-consuming
representations and treatments such as those performed in natural language and grammatical methods. The original method we
propose here consists of mining sequential patterns in order to build a classifier. We experimentally show that our proposal is relevant and that it compares favorably with other methods. In particular, our method outperforms CBA and provides better results than SVM on some corpora.
Keywords: Text mining, categorization, sequential patterns, SPaC
1. Introduction
Automatic text classification goes back at least to the 1960s [24]. But with the growing volume of
available digital documents, automatic classification has been extensively addressed through research in
the past few years to define efficient and scalable methods [33,39]. In this domain, two distinct types of
approaches have been proposed: supervised and unsupervised classification. In supervised classification
(also known as categorization), categories are defined by an expert, while they are automatically learned
in the other case (also called clustering) [14,31].
In this setting, the goal is to define a function which associates texts with categories. The learning
step involves automatically defining this function using a training set. This training set consists of texts
for which the category is known. Then this function can be used to associate a category (e.g. politics or
sport) to a new text that has never been processed. The more accurate the automatic decision, the better
the classifier.
Currently, the best classifiers are mostly based on the statistical text representation TF-IDF (Term Frequency, Inverse Document Frequency) [31] and machine learning algorithms such as neural networks
or Support Vector Machines (SVM). However, most of these methods do not provide understandable
descriptions of the extracted knowledge. In order to cope with this problem, an association-rule based
approach was first proposed by Bing Liu (CBA) [21], which has subsequently been enhanced in [5,8,12,
16,19,37], etc.
All these methods consider each text as a so-called bag of words where no order between words is
taken into account for categorization. This textual representation has proven to be useful and almost
as efficient as complex representations which require time-consuming methods like syntactic analysis.
It is thus interesting to investigate methods that take order into account while remaining scalable. For
this purpose, we study sequential patterns. Sequential Patterns aim at discovering temporal relationships
between facts embedded in a database. In a market basket analysis problem, sequential patterns express rules such as: a customer who bought a TV together with a DVD player later bought a recorder. In this framework, the databases considered consist of customer transactions, recording the items bought by each customer (client) at given dates.
In this paper, we thus propose to extend the CBA method by taking the order into account. In our SPaC
(Sequential Patterns for Categorization) approach, the order is considered by using sequential patterns
instead of association rules. Sequential patterns have three main advantages in this framework: first, they provide understandable rules (contrary to SVM, Rocchio, naive Bayes, etc.); secondly, they allow trend analysis, as shown in [18]; thirdly, they extract patterns that are more precise and informative than association rules.
In the original text classification method using sequential patterns we propose here, sentences are
distinguished and ordered in each text. This means that the text is considered as being an ordered list of
sentences. Each sentence is considered as being an unordered set of words. If we compare the market
basket analysis problem to our approach, then a text plays the role of a client; the sentences from a text
play the role of all the transactions for this text; the position of the sentence within the text plays the role
of the date; and the set of words from a sentence plays the role of a list of items bought.
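This mapping translates directly into code. Below is a minimal Python sketch of the transformation; the sentence splitting and tokenization are deliberately naive placeholders, since the paper does not prescribe them:

```python
def text_to_data_sequence(text):
    """One text -> one data sequence: the text plays the role of the
    client, each sentence is a transaction dated by its position in the
    text, and the words of the sentence are the items bought."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [(position, frozenset(sentence.lower().split()))
            for position, sentence in enumerate(sentences)]

print(text_to_data_sequence("Wheat prices rose. Corn followed wheat."))
# [(0, frozenset({'wheat', 'prices', 'rose'})),
#  (1, frozenset({'corn', 'followed', 'wheat'}))]
```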
Experiments show that sequential pattern-based classification with SPaC is very efficient, particularly
when Support Vector Machines do not perform well. Our approach is not only evaluated using accuracy,
but also using the precision and recall measures merged into the Fβ-measure [30]. These measures have
indeed proven to be more relevant for comparing text classification methods [33].
The paper is organized as follows. Section 2 presents the background of the problem addressed by
introducing sequential patterns and textual representations. Section 3 details existing methods that deal
with text mining with “frequent patterns” and “sequential patterns”. Section 4 details our method based
on Sequential Patterns (SPaC). Section 5 shows that our method performs well on datasets in French and
English. Finally, Section 6 summarizes the paper and presents future work.
2. Problem statement
First, we introduce the categorization problem. Secondly, we formulate the concept of sequence
mining by summarizing the formal description of the problem introduced in [3] and extended in [35].
2.1. Textual representation and categorization
Text categorization is the task of assigning a boolean value to each pair (document, category). For
instance, text categorization is used to automatically determine whether a text belongs to the politics
or sport category. In order to build such automatic classifiers, a textual database is considered. In this
database, the class to which each text belongs is known. The textual database is partitioned into two
databases. The first sub-database is a training set and the second one is a test set used in order to evaluate
the classifier quality.
In usual methods, texts are represented as bags of words [33], meaning that the order is not considered.
Each document is represented by a vector where each component is a word weighted by a numerical
value. The most widely used weight is TF-IDF (Term Frequency – Inverse Document Frequency) [31]. For a word w, we have:

tfidf(w) = tf(w) · log(N / df(w))

where tf(w) is the number of occurrences of w in the document, df(w) is the number of documents containing w, and N is the total number of documents. The weight tfidf(w) thus represents the relative importance of the word in the document.
These vectors describing documents are used to extract knowledge from the training set with common algorithms such as k-nearest neighbors, SVM or naive Bayes.
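As an illustration, here is a minimal TF-IDF computation following the formula above (naive whitespace tokenization assumed):

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """For each document, weight each word w by tf(w) * log(N / df(w))."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # df(w): number of documents containing w.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf(w): occurrences of w in this document
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = ["wheat prices rose", "corn followed wheat"]
print(tfidf_vectors(docs)[0])  # 'wheat' gets weight 0: it appears in all docs
```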
2.2. Mining sequential patterns
Let DB be a set of customer transactions where each transaction T consists of customer-id, transaction
time and a set of items involved in the transaction.
Let I = {i1, i2, ..., im} be a set of literals called items. An itemset is a non-empty set of items. A sequence s is a set of itemsets ordered according to their timestamps, denoted by <s1 s2 ... sp> where each sj, j ∈ 1..p, is an itemset. An n-sequence is a sequence of n items (i.e. of length n). For example, let us consider a given customer who purchased items 1, 2, 3, 4, 5 according to the following sequence: s = <(1) (2,3) (4) (5)>. This means that, apart from items 2 and 3 which were purchased together, i.e. during the same transaction, the items of the sequence were bought separately. s is a 5-sequence.
A sequence <s1 s2 ... sp> is a sub-sequence of another sequence <s'1 s'2 ... s'm> if there exist integers i1 < i2 < ... < ip such that s1 ⊆ s'i1, s2 ⊆ s'i2, ..., sp ⊆ s'ip. For example, the sequence s' = <(2) (5)> is a sub-sequence of s because (2) ⊆ (2,3) and (5) ⊆ (5). However, <(2) (3)> is not a sub-sequence of s, since items 2 and 3 were bought together within a single transaction, not one after the other.
All the transactions of a given customer, grouped together and sorted by increasing date, form a data sequence. The support of a sequence s, supp(s), counts its occurrences in DB; however, a data sequence contributes at most once to the support of s, even if it contains several occurrences of s. In other words, the support of a sequence is defined as the fraction of distinct data sequences that contain s. In order to decide whether a sequence is frequent or not, a minimum support value (minSupp) is specified by the user: a sequence s is said to be frequent if supp(s) ≥ minSupp.
Given a database of customer transactions, the problem of sequential pattern mining is to find all sequences whose support is greater than a specified threshold (minimum support) [28].
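For illustration, the sub-sequence test and the support computation translate directly into Python (sequences represented as lists of sets):

```python
def is_subsequence(sub, seq):
    """Test whether `sub` = <s1 ... sp> is a sub-sequence of `seq`:
    each itemset of `sub` must be included in a distinct itemset of
    `seq`, in the same order."""
    i = 0
    for itemset in sub:
        while i < len(seq) and not itemset <= seq[i]:
            i += 1
        if i == len(seq):
            return False
        i += 1  # the next itemset must match strictly later in `seq`
    return True

def support(sub, database):
    """Fraction of the data sequences of `database` containing `sub`
    (each data sequence is counted at most once)."""
    return sum(is_subsequence(sub, s) for s in database) / len(database)

s = [{1}, {2, 3}, {4}, {5}]
print(is_subsequence([{2}, {5}], s))  # True
print(is_subsequence([{2}, {3}], s))  # False: 2 and 3 occur together
```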
Sequential patterns are usually extracted from a database built on the following scheme: date, client,
items. For instance, we consider the database of client purchases in a supermarket, as shown in Table 1.
Each line (transaction, tuple) from this table corresponds to the set of items bought by the client at the
corresponding date.
In this example, Peter has bought the items 1, 2, 3, 4, 5 in the sequence <(1)(2,3)(4)(5)>, meaning
that he first bought 1, then he bought 2 together with 3, then he bought 4 and finally he bought 5.
3. Related work
Text mining has been widely investigated [1,4,18,33]. In this section, we focus on text classification
and frequent patterns since our method is based on rules.
Table 1
Database of purchases

Client   Date       Items
Peter    04/01/12   TV (1)
Martin   04/02/28   Chocolate (5)
Peter    04/03/02   DVD Player (2), Camera (3)
Peter    04/03/12   Printer (4)
Peter    04/04/26   Chocolate (5)
3.1. Classification based on associations: the CBA method
In [21] the authors propose CBA: a text categorization method based on association rules. Contrary
to C4.5 [29], CN2 [9], or RIPPER [10], which use heuristic search to learn a subset of the regularities
in data to build a classifier, CBA is based on exhaustive search and aims at finding all rules respecting a
minsup value.
CBA consists of two parts, a rule generator (CBA-RG), which is based on the well-known Apriori
algorithm [2], and a classifier builder (CBA-CB), which is based on generated rules.
3.1.1. CBA-RG
In this first step, each assignment <text, category> is represented by a ruleitem defined by ρ = <condset, Ci>, where condset is a set of items (in data mining, condset is also called an itemset) and Ci is a class label. Each ruleitem ρ is equivalent to a rule of the type condset → Ci, whose support and confidence are defined by:

sup(ρ) = (#texts from Ci matching condset) / (#texts in D)

conf(ρ) = (#texts from Ci matching condset) / (#texts in D matching condset)
Ruleitems that satisfy the minimum support are called frequent ruleitems. If two ruleitems have the same condset, only the one with the highest confidence is chosen as a possible rule (PR); if some ruleitems have the same condset and the same confidence, the PR is chosen randomly among them. The set of PRs is thus the subset of the frequent ruleitems determined by these two constraints. The set of class association rules (CARs) then consists of all PRs whose confidence is greater than a minimum confidence: CARs are therefore the ruleitems that satisfy both the minimum support and the minimum confidence levels.
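These two measures translate directly into code. A minimal Python sketch, assuming the training set is given as a list of (set of words, class label) pairs (a simplification of the actual CBA data structures):

```python
def ruleitem_measures(condset, category, texts):
    """Support and confidence of the ruleitem condset -> category,
    following the two formulas above. `texts` is a list of
    (set_of_words, class_label) pairs standing for the training set D."""
    matching = [label for words, label in texts if condset <= words]
    in_class = sum(label == category for label in matching)
    support = in_class / len(texts)
    confidence = in_class / len(matching) if matching else 0.0
    return support, confidence

D = [({"wheat", "corn"}, "agri"), ({"wheat", "gold"}, "metal"),
     ({"wheat", "corn", "grain"}, "agri")]
print(ruleitem_measures({"wheat", "corn"}, "agri", D))  # (0.666..., 1.0)
```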
In these approaches, frequent patterns are extracted using a single minimum support threshold. However, categories are not always equi-distributed. It is thus not relevant to consider such a single value.
Choosing a relevant minimum support threshold is crucial so that frequent patterns will be relevant for
the categorization task. A high support will indeed prevent the system from finding frequent patterns
for a small category, while a low support will lead to generation of a huge number of rules, which is not
interesting because it will result in overfitting.
Several works define a multiple minimum support strategy (msCBA) [16,22]. In these approaches, ruleitems are extracted using one minimum support per category. The minimum
support level of each category is defined according to the distribution frequency of each category and the
user-defined minimum support threshold:
minSupCi = minSupuser × freqDistr(Ci)

where freqDistr(Ci) = (#texts from Ci) / (#texts).
3.1.2. CBA-CB
Once all CARs are generated, they are ordered according to the total order described below.
Definition 1. Let ri and rj be two classification rules (CARs). ri ≺ rj if:
– conf(ri) > conf(rj);
– or conf(ri) = conf(rj) and supp(ri) > supp(rj);
– or conf(ri) = conf(rj) and supp(ri) = supp(rj) and ri was generated before rj.
Let R be the set of CARs and D the training data. The basic idea of the algorithm is to choose a set of high-precedence rules in R to cover D. The categorizer is thus represented by a list of rules ri ∈ R ordered according to the total order defined above (Definition 1). We thus have:

< (r1, r2, ..., rk), Ci >

(where Ci is the target category and ri one of the associated rules).
Each rule is then tested over D. If a rule does not improve the accuracy of the classifier, then this rule and the following ones are discarded from the list.
Once the categorizer has been built, each ordered rule ri ∈ R is tested on each new text to classify. As soon as the condset part of a rule is supported by the text, the text is assigned to the target class of the rule. If no rule applies, the text is assigned to a default class.
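The total order of Definition 1 can be realized as a simple sort key. A minimal sketch, assuming each CAR is reduced to a (confidence, support, generation rank) triple:

```python
# Highest confidence first, then highest support, then earliest generation.
rules = [(0.90, 0.10, 2), (0.90, 0.20, 1), (0.95, 0.05, 3)]
rules.sort(key=lambda r: (-r[0], -r[1], r[2]))
print(rules)  # [(0.95, 0.05, 3), (0.9, 0.2, 1), (0.9, 0.1, 2)]
```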
3.2. Enhancements and other approaches
Many other research studies have used association rules for classification. In [16], the authors replace confidence with the intensity of implication when sorting the rules to build the classifier. According to the authors, this measure is powerful when classes are not equally distributed and for low minsup values.
In [23], the authors integrate the CBA method with other methods such as decision trees, naive Bayes,
RIPPER, etc. to increase the classification score.
In [8], the authors investigate L3, a method for association-rule classification. Contrary to CBA-CB, which takes only one rule into account, they propose to determine the category of a text by considering several rules that are combined using majority voting. In order to cope with the huge number of rules, the authors propose pruning the classification rules during extraction using χ², as done in [19]. But contrary to most pruning strategies, L3 performs a "lazy" pruning in order to eliminate only "harmful" rules and not "useful knowledge". In L3, maxrules stands for the maximum number of rules used to classify new cases. Moreover, rules are separated into two levels in order to increase the classification accuracy. L3 has also been enhanced by considering a separate minimum support threshold for each category in [7].
In [4], association rules are used for partial classification. Partial classification means that the classifier
does not cover all cases. In particular, this work is interesting when dealing with missing values.
CAEP [12], LB [27], ADT [37] and CMAR [19] are other existing classification systems using association rules. LB and CAEP are based on rule aggregation rather than rule selection. The particularity of ADT is to prune rules with low support, which are considered meaningless, as in [16]. To avoid overfitting rules, ADT also uses a learning strategy based on a decision tree. The advantage of CMAR is its categorization policy and its data structure, which allow a large number of extracted rules to be stored.
All of these methods, however, are hampered by the same problem: except for msCBA and L3, they use a single minsup value. This limitation leads to overlooking minority classes or overfitting majority classes (depending on the chosen minsup value). Moreover, most of them are based on an Apriori-like method to extract association rules, and the number of generated rules increases dramatically when a low support has to be used. Apart from relatively small numerical datasets from the UCI archives [13], most of these methods are thus unusable for classification tasks that require a low minsup (as in text categorization).
ARC-CB [5] proposes a solution for multi-classification (i.e. a text is associated with one or more classes), but no comparison with other association-rule-based classifiers is performed to highlight the impact of the method.
In the text mining framework, the authors of [18,38] propose to use sequential patterns. In [38], the proposal is based on two methods. The first method is based on the visualization of word occurrences in order to detect sequential patterns. The second one relies on classical methods to extract sequential patterns. However, the authors do not propose a method to classify texts using sequential patterns. Moreover, the texts considered are associated with a date: the corpus consists of 1,170 articles collected over 6 years. This makes it very different from our proposal and more difficult to apply, since texts are rarely associated with a date.
In [18], the authors demonstrate how sequential patterns are useful for text mining. Sequential patterns
are used in order to extract trends from textual databases.
In [34,36], the authors propose to use sequential patterns for categorization. However, this approach does not exploit the full power of sequential patterns, since the patterns considered consist of lists of items, whereas we aim at considering lists of itemsets. Each element of their patterns is indeed composed of a single morpheme (or n-gram), whereas it would be interesting to consider patterns whose elements may be more complex (sets of words) and automatically composed.
We thus propose an original method based on sequential patterns for classification. We argue that this
method is able to deal with order in texts without being time-consuming. The next section details our
approach. Section 5 shows that our method obtains good results compared to others.
4. Sequential patterns for classification: the SPaC method
In this paper, we propose an original method (SPaC) for text classification based on sequential
patterns [15]. This method consists of two steps. In the first step, we build sequential patterns from texts.
In the second step, sequential patterns are used to classify texts.
Hereafter, we use the notations introduced in Table 3.
4.1. From texts to sequential patterns
Each text is a set of words. Our method is based on sequential pattern mining. Texts are represented as ordered sets of words using the TF-IDF representation. Each text is thus considered as being the
equivalent of a client. The text consists of a set of sentences. Each sentence is associated with a date
(its position in the text). Finally the set of words contained in a sentence corresponds to the set of
items purchased by the client in the market basket analysis framework. Table 2 summarizes the two
terminologies.
This representation is coupled with a stemming step and a stop-list. The stemming step involves replacing each word by its root (stem). The stop-list prevents the system from learning from noisy words such as "the" or "a".
Some words are discarded by considering the entropy of each stem over the corpus. This method
eliminates words that could skew the classifier since they are not discriminant enough. Moreover, this
method allows us to apply low supports in the sequential pattern discovery without deteriorating the
results. For this purpose, a user-defined threshold is considered. For each word w, we consider its
entropy H(w) over all classes Ci, defined as:

H(w) = −ΣCi [ p(w) · p(Ci|w) · log p(Ci|w) + (1 − p(w)) · p(Ci|w̄) · log p(Ci|w̄) ]
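A direct transcription of this computation may be sketched as follows, assuming texts are given as (set of stems, category) pairs; reading the text, we discard the words whose entropy exceeds the user-defined threshold, i.e. the least discriminant ones:

```python
import math

def word_entropy(word, texts):
    """H(w) over all classes, as defined above. `texts` is a list of
    (set_of_stems, category) pairs."""
    n = len(texts)
    with_w = [c for stems, c in texts if word in stems]
    without_w = [c for stems, c in texts if word not in stems]
    p_w = len(with_w) / n  # p(w)
    h = 0.0
    for ci in {c for _, c in texts}:
        if with_w:
            p = with_w.count(ci) / len(with_w)        # p(Ci | w)
            if p > 0:
                h -= p_w * p * math.log(p)
        if without_w:
            q = without_w.count(ci) / len(without_w)  # p(Ci | not w)
            if q > 0:
                h -= (1 - p_w) * q * math.log(q)
    return h

def filter_words(vocabulary, texts, threshold):
    """Keep only the words discriminant enough for classification."""
    return {w for w in vocabulary if word_entropy(w, texts) <= threshold}
```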
In SPaC, sequential patterns are extracted using a multiple minimum support strategy, as done in msCBA. This means that a different support is applied for each category Ci.
In our approach, the training set is divided into n training sets, one for each category. Texts are thus
grouped depending on their category. Sequential pattern mining algorithms are applied separately on
these n databases using the corresponding minimum supports.
For each category, frequent sequential patterns are computed and their supports stored. The support of a frequent pattern is the fraction of texts containing the sequence of words.
Definition 2. Let <s1 ... sp> be a sequence. The support of <s1 ... sp> is defined as:

supp(<s1 ... sp>) = (#texts matching <s1 ... sp>) / (#texts)
Contrary to msCBA, minimum supports are defined automatically in the following way:
(1) the minimum support is set at the lowest value, i.e. one text (for example, if the training set contains
200 texts, the minsup is set to 0.5%).
(2) if the mining step provides more than X rules, the process is started again with a higher minsup value, i.e. requiring more texts (for example, if the training set contains 200 texts, the minsup is increased by 0.5%).
The use of SPAM [6] to find sequential patterns makes the training step extremely fast. Indeed this step, which extracts all sequential patterns from the training set, takes only a few minutes on a standard desktop computer (Pentium IV 2.4 GHz, 520 MB of memory). The limit on the number of rules (i.e. X) is detailed in the experiments (Section 5).
Algorithm 1 describes SPaC sequential pattern generation. The SPAM algorithm is used through the SPMining() function in order to find all frequent sequences in the transactional databases (DB) [6].
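The automatic minsup definition described above amounts to a simple control loop around the miner. A sketch, where sp_mining is a hypothetical stand-in for the SPAM-based SPMining() call of Algorithm 1 (not reproduced here):

```python
def mine_with_auto_minsup(category_texts, sp_mining, max_rules=3000):
    """Start from an absolute support of one text and raise it until the
    miner returns at most `max_rules` sequential patterns; the relative
    minsup actually used is returned together with the patterns."""
    minsup_texts = 1
    while True:
        patterns = sp_mining(category_texts, minsup_texts)
        if len(patterns) <= max_rules:
            return patterns, minsup_texts / len(category_texts)
        minsup_texts += 1  # e.g. +0.5% for a 200-text training set
```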
For instance, the following frequent patterns have been extracted from the “Purchasing-Logistics”
category of our French database:
< (cacao) (ivoir) (abidjan) >
< (blé soja) (maï) >
< (soj) (blé lespin victor) (maï soj) (maï) (grain soj) (soj) >

The first sequential pattern means that some texts contain the word cacao, then ivoir (ivory), then abidjan, in three different sentences. The second one means that some texts contain the words blé and soja in the same sentence, followed by maï, the stem of maïs (in French, blé means wheat and maïs stands for corn). The third one means that the word maï occurs in two successive sentences before the word grain.
Experiments have led us to consider a threshold that eliminates about 5 to 10% of the words (in accordance with Zipf's law [32]). Note that sequential patterns take multiple occurrences of the same word in the text into account, contrary to association rules. Moreover, some frequent co-occurrences can be identified with sequential pattern mining.
Table 2
Application of the sequential pattern terminology to textual data

Usual databases            Textual databases
client               ↔     text
item                 ↔     word
items/transaction    ↔     sentence (set of words)
date                 ↔     position of the sentence
Table 3
Notations

Notation                Meaning
C = {C1, ..., Cn}       set of n categories.
Ci ∈ C                  a given category.
minSupCi                user-defined minimum support for category Ci.
T                       set of texts.
TCi ⊆ T                 set of texts belonging to category Ci.
TTrain = {(Ci, TCi)}    training set, constituted by a set of texts associated with their category.
SEQ                     set of sequences found for category Ci, customer c at time t.
SP                      table of sequential patterns.
RuleSP                  table of tuples (spj, Ci, confi,j) corresponding to the sequence spj, the category Ci and the confidence confi,j of the rule spj → Ci.
4.2. From sequential patterns to categories
Once sequential patterns have been extracted for each category, the goal is to derive a categorizer from
the obtained patterns.
This is done by computing, for each category, the confidence of each associated sequential pattern. A rule γ is generated in the following way:

γ : <s1 ... sp> → Ci

where <s1 ... sp> is a sequential pattern for category Ci. This rule means that if a text contains s1, then s2, ..., then sp, it will belong to category Ci. Each rule is associated with its confidence level, indicating the extent to which the sequential pattern is characteristic of this category:

conf(γ) = (#texts from Ci matching <s1 ... sp>) / (#texts matching <s1 ... sp>)
Rules are sorted by their confidence level and the size of the associated sequence. When a new text has to be classified, a simple categorization policy is applied: among the rules supported by the text, the K rules with the best confidence levels are fired, and the text is assigned to the class that obtains the majority among these K rules. This method is the same as the majority voting of [8]. If two categories obtain the same score, a random choice is made; this prevents the system from always choosing the same category. The SPaC classifier step (SPaC-C) is described in Algorithm 2.
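A compact sketch of this policy, reusing the is_subsequence test from the sketch in Section 2.2 and assuming rules are given as (pattern, category, confidence) triples already sorted as described:

```python
import random
from collections import Counter

def classify(text_sequence, sorted_rules, k=10):
    """Assign the majority class among the K best supported rules;
    ties are broken at random. Returns None if no rule applies."""
    fired = [cat for pattern, cat, _ in sorted_rules
             if is_subsequence(pattern, text_sequence)][:k]
    if not fired:
        return None
    votes = Counter(fired)
    best = max(votes.values())
    return random.choice([c for c, v in votes.items() if v == best])
```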
5. Experiments
Experiments are conducted on three databases. The first two are the well-known English corpora 20 Newsgroups (with 30% of the texts used for training) and Reuters [13]. The third one is a real database of French texts (news), with 8,239 texts divided into 28 categories; for this corpus, a training set of 33% was used.
Experiments compare our approach to the results obtained with CBA and SVM. Sequential patterns are mined using SPAM [6]. Table 4 details these results. Comparisons are based on the Fβ measure [33], which combines recall and precision into a global evaluation. The Fβ measure is more relevant than accuracy, since accuracy does not take into account the case where a text is not classified: a classifier that classifies no text (and thus never makes an error) could reach an accuracy of nearly 100%! Accuracy was taken as the reference measure in [8,20]; however, for the reasons presented above, we argue that it is not as relevant as relying on recall and precision, as mentioned in [33].
Table 4
Comparison of SPaC, msCBA and SVM tested on three different corpora

          French news             Reuters                 20 Newsgroups
          SPaC    msCBA   SVM     SPaC    msCBA   SVM     SPaC    msCBA   SVM
F1M       0.461   0.367   0.485   0.322   0.082   0.500   0.452   0.423   0.423
F1µ       0.497   0.401   0.486   0.694   0.679   0.840   0.494   0.436   0.455
Acc.      0.963   0.956   0.969   0.992   0.992   0.996   0.946   0.941   0.941
#Rules    31060   315     −       80985   640     −       19938   642     −
Table 5
Results of SPaC on the French news corpus

Minsup allowed   Acc.    F1M     F1µ     #Rules   Time
1%               0.963   0.461   0.497   31060    441 s
3%               0.963   0.455   0.495   29931    414 s
4%               0.963   0.441   0.488   12227    271 s
5%               0.962   0.435   0.479   5328     221 s
6%               0.962   0.433   0.477   3019     208 s
12%              0.957   0.352   0.390   394      181 s
21%              0.958   0.262   0.320   63       181 s
49%              0.963   0.018   0.033   1        180 s
Definition 3. The Fβ measure is defined as follows:

Fβ = ((β² + 1) · πi · ρi) / (β² · πi + ρi)

where ρi stands for the recall and πi for the precision; this measure is computed for each class Ci.

Definition 4. Precision and recall are defined as follows:

πi = TPi / (TPi + FPi),   ρi = TPi / (TPi + FNi)

where TPi, FPi and FNi stand, respectively, for the number of texts of class Ci which are correctly classified (True Positives), the number of texts mistakenly put in class Ci (False Positives), and the number of texts of class Ci mistakenly put in a different class (False Negatives).

Accuracy is defined as follows:

Definition 5. Accuracy:

Accuracyi = (TPi + TNi) / (TPi + TNi + FPi + FNi)
In order to evaluate the classifier over all classes, we consider micro-averaging (µ) and macro-averaging (M) [33,39], defined as follows:

Definition 6. Macro-averaging and micro-averaging:

π̂M = (1/|C|) · Σi π̂i,   ρ̂M = (1/|C|) · Σi ρ̂i

π̂µ = (Σi TPi) / (Σi (TPi + FPi)),   ρ̂µ = (Σi TPi) / (Σi (TPi + FNi))

where the sums range over the |C| classes.
Table 6
Results of SPaC on the 20 Newsgroups corpus

Minsup allowed   Acc.    F1M     F1µ     #Rules   Time
1%               0.946   0.452   0.494   19938    594 s
3%               0.946   0.446   0.491   13246    480 s
4%               0.945   0.428   0.478   9441     446 s
6%               0.942   0.386   0.441   4950     400 s
12%              0.942   0.275   0.348   1209     250 s
21%              0.943   0.148   0.231   32       178 s
49%              0.942   0.050   0.079   2        177 s

[Fig. 1. Results of SPaC on the French news corpus: F1 micro and #Rules as a function of the allowed minsup.]
[Fig. 2. Results of SPaC on the 20 Newsgroups corpus: F1 micro and #Rules as a function of the allowed minsup.]
Micro-averaging gives the same importance to each document, contrary to macro-averaging which
computes the average class by class (thus placing more importance on small categories).
In this paper, we consider that recall and precision have the same importance. We thus have: β = 1
for the Fβ measure.
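These evaluation measures translate directly into code. A minimal sketch, assuming per-class (TP, FP, FN) counts are available and denominators are non-zero:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta of Definition 3; beta = 1 weights precision and recall equally."""
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

def macro_micro(per_class):
    """`per_class` is a list of (TP, FP, FN) tuples, one per category.
    Returns ((macro_p, macro_r), (micro_p, micro_r)) as in Definition 6."""
    n = len(per_class)
    macro_p = sum(tp / (tp + fp) for tp, fp, _ in per_class) / n
    macro_r = sum(tp / (tp + fn) for tp, _, fn in per_class) / n
    tp = sum(t for t, _, _ in per_class)
    fp = sum(f for _, f, _ in per_class)
    fn = sum(f for _, _, f in per_class)
    return (macro_p, macro_r), (tp / (tp + fp), tp / (tp + fn))
```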
Table 7
Results of SPaC on the Reuters corpus

Minsup allowed   Acc.    F1M     F1µ     #Rules   Time
1%               0.992   0.322   0.694   80985    196 s
2%               0.992   0.307   0.691   79556    172 s
13%              0.991   0.283   0.619   27931    147 s
30%              0.989   0.208   0.545   6929     45 s
50%              0.987   0.216   0.432   791      22 s
92%              0.986   0.154   0.148   45       19 s
99%              0.986   0.020   0.030   23       16 s
SVM results are obtained with SVMLight [17] using a linear kernel (more complex kernels did not provide better results). The supports chosen here are the values providing the best results while remaining computable (supports that are too low lead to a very time-consuming application). SPaC is applied with K = 10 rules taken into account when classifying new documents. The automatic minsup definition is done by limiting the mining process to X = 3000 rules per category: if SPAM exceeds this limit, the mining process is started again with a higher minsup value. The experiments showed that a larger limit did not provide significant improvement during the evaluation step.
In this work, we argue that obtaining understandable knowledge is as important as (or even more important than) getting the most accurate classifier. For this reason, comparisons focus primarily on CBA and SPaC. CBA is tested with the msCBA version of the algorithm, using the best results we obtained when testing different supports. The results show that SPaC always outperforms CBA. This is due to the fact that SPaC is able to use more specific rules to classify. SPaC is similar to SVM on French texts, and performs better than SVM on the 20 Newsgroups database. Nevertheless, SVM remains the best classifier on the Reuters corpus. But this corpus has particularities: 2 categories (out of 90) contain two thirds of the training set. Moreover, Table 7 shows that a few rules with very high support (>90%) alone provide a large part of the score.
Up to this point, the automatic definition of minsup in SPaC produces many rules compared to CBA. In the following experiments, we reduce the number of categorization rules for SPaC by increasing the starting minsup value (in the automatic minsup definition process).
Tables 5, 6 and 7, together with Figs 1, 2 and 3, show the number of rules used for categorization and the associated performance in terms of the F1µ measure. These experiments show the stability of F1µ, even when many rules are pruned. On the other hand, SPaC probably comes close to its best performance with X = 3,000, since improving it further would require an exponential number of rules. CBA, for its part, cannot increase the number of generated rules to improve its performance, since it is based on a hard rule-pruning strategy.
Moreover, at the cost of a slight drop in performance, the number of generated sequential patterns becomes small enough to be usable by an expert, for trend analysis, or for manually tuning the classifier.
In Fig. 4, we compare the results obtained for the F1 measure as a function of the number of rules considered for the choice of the class. Experiments have shown that K = 10 provides good results (similar to [8], where the best results correspond to maxrules = 9). Experiments are done (1) with order: the sequential pattern (sequence) must be supported by the text; and (2) without order: the sub-sequences may be supported by the text in an unspecified order, i.e. the corresponding sentences occur in the document but not necessarily in the order given by the sequential pattern. We note that the results are always better when order is taken into account.
[Fig. 3. Results of SPaC on the Reuters corpus: F1 micro and #Rules as a function of the allowed minsup.]
[Fig. 4. SPaC: F1 micro-average as a function of the number of rules K considered, with order and without order.]
6. Conclusion and further work
In this paper, we address the problem of text categorization using sequential patterns. In our framework,
texts are represented by TF-IDF vectors, and each category is associated with a set of sequential patterns.
When classifying new data, a text is matched to a category depending on the number of sequential patterns
involved. The corresponding category is determined using majority voting. Even if SVM have proven
to be efficient in such a task, we argue that it is very important to provide users with understandable
knowledge about their data. In this framework, sequential patterns are well-adapted. They provide rules
that are used for classification. We show that this approach is efficient and relevant, in particular when
SVM do not perform well.
Moreover, the method we propose is simple and adaptable to drifting concepts since it is possible to
update sequential patterns without re-running the whole process, using incremental sequential pattern mining [25]. This possibility is of great importance for text categorization, especially for the automatic analysis of news, which is a fast-moving and highly variable area. This is thus a first step towards an On-Line Classification Process (OLCP) using sequential patterns.
Future work includes applying our approach to different languages in order to determine how important order is for each language. We are also working on the automatic definition of the best number of rules to take into account (the K parameter of our method). Our approach may further be enhanced by mining generalized sequential patterns [3]; this framework allows us to integrate time constraints, as shown in [26]. Finally, we aim at integrating multi-level sequential patterns, as proposed in [7]: very specific rules can thus be kept without damaging the classifier performance. In this framework, we argue that it is interesting to build a very compact set of rules (rules whose left-hand part is as short as possible). For this purpose, we aim at extending studies on δ-free sets [11] to sequential patterns.
References
[1] R. Agrawal, R.J. Bayardo, Jr. and R. Srikant, Athena: Mining-Based Interactive Management of Text Databases, in Proc. of the 7th Int. Conf. on Extending Database Technology (EDBT'00), Springer, March 2000, 365–379.
[2] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, in: Proc. of the 20th Int. Conf. on Very Large Data Bases (VLDB'94), J.B. Bocca, M. Jarke and C. Zaniolo, eds, Morgan Kaufmann, 1994, pp. 487–499.
[3] R. Agrawal and R. Srikant, Mining Sequential Patterns, in Proc. of the 11th Int. Conf. on Data Engineering (ICDE'95), Taipei, Taiwan, March 1995, IEEE Computer Society Press, 3–14.
[4] K. Ali, S. Manganaris and R. Srikant, Partial Classification Using Association Rules, in Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97), CA, USA, August 1997, AAAI Press, 115–118.
[5] M.-L. Antonie and O. Zaiane, Text Document Categorization by Term Association, in Proc. of the 2002 IEEE Int. Conf. on Data Mining (ICDM'02), December 2002, 19–26.
[6] J. Ayres, J. Gehrke, T. Yiu and J. Flannick, Sequential Pattern Mining Using Bitmaps, in Proc. of the 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD'02), July 2002.
[7] E. Baralis, S. Chiusano and P. Garza, On support thresholds in associative classification, in Proc. of the 2004 ACM Symposium on Applied Computing (SAC'04), ACM Press, March 2004, 553–558.
[8] E. Baralis and P. Garza, Majority Classification by Means of Association Rules, in Proc. of the 7th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD'03), Springer, September 2003, 35–46.
[9] P. Clark and T. Niblett, The CN2 Induction Algorithm, Machine Learning 3 (March 1989), 261–283.
[10] W. Cohen, Fast Effective Rule Induction, in Proc. of the 12th Int. Conf. on Machine Learning (ICML'95), CA, USA, July 1995, Morgan Kaufmann, 115–123.
[11] B. Cremilleux and J.F. Boulicaut, Simplest rules characterizing classes generated by delta-free sets, in Proc. of the 22nd BCS SGAI Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), Springer, December 2002, 33–46.
[12] G. Dong, X. Zhang, L. Wong and J. Li, CAEP: Classification by aggregating emerging patterns, in Proc. of the 2nd Int. Conf. on Discovery Science (DS'99), December 1999, 30–42.
[13] S. Hettich and S.D. Bay, The UCI KDD Archive [http://kdd.ics.uci.edu], University of California, Department of Information and Computer Science, Irvine, CA, 1999.
[14] M. Iwayama and T. Tokunaga, Cluster-based text categorization: a comparison of category search strategies, in Proc. of SIGIR-95, 18th ACM Int. Conf. on Research and Development in Information Retrieval, ACM Press, 1995, 273–281.
[15] S. Jaillet, A. Laurent, M. Teisseire and J. Chauché, Order and mess in text categorization: Why using sequential patterns to classify, in Proc. of the 3rd Workshop on Mining Temporal and Sequential Data, in conjunction with the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (TDM'04/KDD'04), WA, USA, August 2004.
[16] D. Janssens, G. Wets, T. Brijs, K. Vanhoof and G. Chen, Adapting the CBA-algorithm by means of intensity of implication, in Proc. of the 1st Int. Conf. on Fuzzy Information Processing Theories and Applications, China, March 2003, 397–403.
[17] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in Proc. of ECML-98, 10th European Conf. on Machine Learning, Chemnitz, Germany, 1998, Springer Verlag, Heidelberg, 137–142.
[18] B. Lent, R. Agrawal and R. Srikant, Discovering Trends in Text Databases, in Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97), CA, USA, August 1997, AAAI Press, 227–230.
[19] W. Li, J. Han and J. Pei, CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, in Proc. of the 2001 IEEE Int. Conf. on Data Mining (ICDM'01), CA, USA, November 2001, 369–376.
[20] Y. Li and A. Jain, Classification of text documents, The Computer Journal 41(8) (1998), 537–546.
[21] B. Liu, W. Hsu and Y. Ma, Integrating Classification and Association Rule Mining, in Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), NY, USA, August 1998, AAAI Press, 80–86.
[22] B. Liu, Y. Ma and C.-K. Wong, Improving an Association Rule Based Classifier, in Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD'00), France, September 2000, Springer, 504–509.
[23] B. Liu, Y. Ma and C.-K. Wong, Classification Using Association Rules: Weaknesses and Enhancements, in: Data Mining for Scientific and Engineering Applications, V. Kumar, ed., Kluwer Academic, 2001.
[24] M. Maron, Automatic indexing: An experimental inquiry, Journal of the ACM 8 (1961), 404–417.
[25] F. Masseglia, P. Poncelet and M. Teisseire, Incremental mining of sequential patterns in large databases, Data and Knowledge Engineering 46(1) (2003).
[26] F. Masseglia, P. Poncelet and M. Teisseire, Pre-processing Time Constraints for Efficiently Mining Generalized Sequential Patterns, in Proc. of the 11th Int. Symposium on Temporal Representation and Reasoning (TIME'04), IEEE Computer Society Press, July 2004, 87–95.
[27] D. Meretakis and B. Wuthrich, Extending Naive Bayes Classifiers Using Long Itemsets, in Proc. of the 5th Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), CA, USA, August 1999, 165–174.
[28] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu, PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, in Proc. of the Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, 2001, 215–226.
[29] J. Quinlan, C4.5 – Programs for Machine Learning, Morgan Kaufmann, CA, USA, 1993.
[30] C.J. Van Rijsbergen, Information Retrieval, Butterworths, second edition, 1979.
[31] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[32] G. Salton, C. Yang and C. Yu, A theory of term importance in automatic text analysis, Journal of the American Society for Information Science 36 (1975), 33–44.
[33] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34(1) (2002), 1–47.
[34] M. Shimbo, T. Yamasaki and Y. Matsumoto, Automatic classification of sentences in the MEDLINE abstracts: A case study of the power of word sequence features, in Proc. of the 6th Sanken (ISIR) Int. Symposium, Osaka, Japan, 2003, 135–138.
[35] R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, in Proc. of the 5th Int. Conf. on Extending Database Technology (EDBT'96), September 1996, 3–17.
[36] M. Takechi, T. Tokunaga, Y. Matsumoto and H. Tanaka, Feature selection in categorizing procedural expressions, in Proc. of the 6th Int. Workshop on Information Retrieval with Asian Languages (IRAL'2003), Sapporo, Japan, July 2003.
[37] K. Wang, S. Zhou and Y. He, Growing Decision Trees on Support-less Association Rules, in Proc. of the 6th Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), MA, USA, August 2000, ACM Press, 265–269.
[38] P.-C. Wong, W. Cowley, H. Foote, E. Jurrus and J. Thomas, Visualizing Sequential Patterns for Text Mining, in Proc. of the 2000 IEEE Symposium on Information Visualization (INFOVIS'00), UT, USA, October 2000, IEEE Computer Society Press, 105–114.
[39] Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval Journal 1(1/2) (1999), 69–90.