An Efficient Algorithm for Incremental Mining of Association Rules
Chin-Chen Chang
Department of Information Engineering and
Computer Science, Feng Chia University,
Taichung, Taiwan, R.O.C
Department of Computer Science and
Information Engineering, National Chung
Cheng University, Chiayi, Taiwan, R.O.C
ccc@cs.ccu.edu.tw
Yu-Chiang Li
Jung-San Lee
Department of Computer Science and
Information Engineering, National Chung
Cheng University, Chiayi, Taiwan, R.O.C
{lyc, ljs91}@cs.ccu.edu.tw
sub-problems (1) to find all frequent itemsets, and (2) to
use these frequent itemsets to generate association rules.
The first sub-problem importantly governs the overall
performance of the mining process. After frequent
itemsets have been recognized, the corresponding
association rules may be derived easily [3, 4]. Other
efficient approaches have also been presented, including
for example those in [1, 2, 6, 10, 11, 16, 18].
Existing mining methods cannot be applied to mine a
publication-like database efficiently.
A publication
database is also a transaction database where each item
involves an individual exhibition period [12]. Consider a
publication database in Fig. 1, items A, B and C are
exhibited from 1996 to 2004. However, item F is exhibited
from 2002 to 2004. Traditional mining techniques ignore
the exhibition period of each item and use a unique
minimum support threshold. This measure is unfair for
new products. Therefore, the PPM algorithm has been
proposed to solve this problem [12].
Abstract
Incremental algorithms can manipulate the results of
earlier mining to derive the final mining output in various
businesses. This study proposes a new algorithm, called
the New Fast UPdate algorithm (NFUP) for efficiently
incrementally mining association rules from large
transaction database. NFUP is a backward method that
only requires scanning incremental database. Rather than
rescanning the original database for some new generated
frequent itemsets in the incremental database, we
accumulate the occurrence counts of newly generated
frequent itemsets and delete infrequent itemsets obviously.
Thus, NFUP need not rescan the original database and to
discover newly generated frequent itemsets. NFUP has
good scalability in our simulation.
1. Introduction
Recent developments in information science have
resulted in the rapid accumulation of enormous amounts of
data. Consequently, efficiently managing very large
databases and quickly retrieving useful information are
very important. Data mining or knowledge discovery
techniques are normally used to discover useful
information in data warehouses. They have come to
represent an important field research and have been applied
extensively to several areas [7], including financial
analysis, market research, industrial retail and decision
support.
Mining association rules is the core task of numerous
data mining techniques. As the amount of data increases,
designing an efficient mining algorithm becomes
increasingly urgent; accordingly, two of the main issues
concerning data mining are therefore studied extensively
herein. One is the design of algorithms for mining rules or
patterns. The other is the design of algorithms to update
and maintain rules, called incremental mining.
The most celebrated algorithm of the first type is the
Apriori [3, 4] algorithm. The Apriori algorithm solves two
Figure 1. Each item has an individual exhibition
period
In the real world where large amounts of data grow
steadily, some old association rules can become useless,
and new databases may give rise to some implicitly valid
patterns or rules. Hence, updating rules or patterns is also
important. The FUP algorithm is well known in related to
1
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
confidence is expressed in the form confidence(X Y) =
support(X Y)/support(X).
Given the user-assigned
minimum support (minSup) and minimum confidence
(minConf) thresholds for the transaction database DB, the
mining of association rules is to find all rules whose
support and the confidence, in which are greater than the
two respective minimum thresholds. An itemset is called a
frequent itemset when its support is no less than the minSup
threshold; otherwise, it is an infrequent itemset.
this issue [8]. A simple method for solving the updating
problem is to reapply the mining algorithm to the entire
database, but this approach is time-consuming. The FUP
algorithm reuses information from old frequent itemsets to
improve its performance. Several other approaches to
incremental mining have been proposed [5, 9, 11, 13, 14,
15, 17, 19, 20].
Although many mining techniques for discovering
frequent itemsets and associations have been presented, the
process of updating frequent itemsets remains troublesome
for incremental databases. The mining of incremental
databases is more complicated than the mining of static
transaction databases, and may lead to some severe
problems, such as the combination of frequent itemsets
occurrence counts in the original database with the new
transaction database, or the rescanning of the original
database to check whether the itemsets remain frequent
while new transactions are added.
Earlier incremental algorithms have focused on
reducing the number of scanning on original database
while it is updated. However, they require the original
database to be rescanned at least once in many situations.
This work presents a novel algorithm NFUP for
incremental mining, which is based on FUP. This
algorithm can discover latest rules and does not need to
rescan the original database. This work focuses on the
generation of frequent itemsets in incremental
publication-like database.
The rest of this paper is organized as follows. Section 2
introduces some work on association rules. Then, Section
3 describes the proposed method (NFUP).
The
experimental results of NFUP will be presented in Section
4. Conclusions are finally drawn in Section 5.
2.1 Apriori Algorithm
The Apriori algorithm concentrates primarily on the
discovery of frequent itemsets according to a user-defined
minSup [4]. The algorithm relies on the fact that an itemset
could be frequent only when each of its subset is frequent;
otherwise, the itemset is infrequent. In the first pass, the
Apriori algorithm constructs and counts all 1-itemsets. (A
k-itemset is an itemset that includes k items.) After it has
found all frequent 1-itemsets, the algorithm joins the
frequent 1-itemsets with each other to form candidate
2-itemsets. Apriori scans the transaction database and
counts the candidate 2-itemsets to determine which of the
2-itemsets are frequent. The other passes are made
accordingly. Frequent (k - 1)-itemsets are joined to form
k-itemsets whose first k-1 items are identical. If k t 3,
Apriori prunes some of the k-itemsets; of these, (k –
1)-itemsets have at least one infrequent subset. All
remaining k-itemsets constitute candidate k-itemsets. The
process is reiterated until no more candidates can be
generated.
Example 2.1 Consider the database presented in Table 1
with a minimum support requirement is 50%. The database
includes 11 transactions. Accordingly, the supports of the
frequent itemsets are at least six. The first column “TID”
includes the unique identifier of each transaction, and the
“Items” column lists the set of items of each transaction.
Let Ck be the set of candidate k-itemsets and Fk be the set of
frequent k-itemsets. In the first pass, the database is
scanned to count C1. If the support count of a candidate
exceeds or equals six, then the candidate is added to F1.
The outcome is shown in Figure 2. Then, F1 f F1 forms C2
(Apriori-gen function is used to generate C2 [4]). After the
database has been scanned for a second time, Apriori
examines which itemset of C2 exceeds the predetermined
threshold. Moreover, C3 is generated from F2 as follows.
Figure 2 presents two frequent 2-itemsets with identical
first item, such as {BC} and {BE}. Then, Apriori tests
whether the 2-itemset {CE} is frequent. Since {CE} is a
frequent itemset, all the subsets of {BCE} are frequent.
Thus, {BCE} is a candidate 3-itemset, or {BCE} must be
pruned. Apriori stops to look for frequent itemsets when
no candidate 4-itemset can be joined from F3. Apriori
scans the database k times when candidate k-itemsets are
generated.
2. Related Work
In 1993, Agrawal et al. first defined the mining of
association rules in databases [3]. They considered the
example following example; 60% of transactions in which
bread is purchased are also transactions in which milk is
purchased. The formal statement is as follows [3]. Let I =
{i1, i2,… , im} represent the set of literals, called items. The
symbol T represents an arbitrary transaction, which is a set
of items (itemset) such that T I. Each transaction has a
unique identifier, TID. Let DB be a database of
transactions. Assume X is an itemset; a transaction T
contains X if and only if X T. An association rule applies
in the form X Y, where X I, Y I and X Y= I (For
example, I = {ABCDE}, X = {AC}, Y = {BE}). An
association rule X Y has two properties, support and
confidence. When s% of transactions in DB contain X Y,
the support of the rule X Y is s%. If some of the
transactions in DB contain X and, and c% also contain Y,
then the confidence in the rule X Y is c%. In general, the
2
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
rescanned. The process is reiterated until all frequent
itemsets have been found. In the worst case, FUP does not
reduce the number of the original database must be
scanned.
Table 1. An example of a transaction database
TID
001
002
003
004
005
006
007
008
009
010
011
Pass 1
Scan
DB
==>
Pass 2
F1 f F1
==>
C2
{AB}
{AC}
{AE}
{BC}
{BE}
{CE}
Scan
DB
==>
Items
ACDE
ACD
BCE
ABCE
ABE
BCE
ABE
BCDE
ABCD
CEF
ABCF
Table 2. Four scenarios associated with an
itemset in DB+
db
DB
Frequent
itemset
Infrequent
itemset
C1 Count
F1 Count
{A}
7
{A}
7
{B}
8
{B}
8
{C}
9
==> {C}
9
{D}
4
{E}
8
{E}
8
{F}
2
f
F2
& Prune
==>
Case 1:
Frequent
Case 3:
Case 2:
Case 4:
Infrequent
In 1997, Cheung et al. described the FUP2 algorithm,
which is a more general incremental technique than FUP.
FUP2 is efficient not only on developing of a database but
also on trimming data [9]. Tomas et al. proposed another
method to accelerate the incremental mining by
maintaining the negative border [19]. In 1999, Ayan et al.
proposed the UWEP (Update With Early Pruning) method
that employs a dynamic look-ahead strategy to update
existing frequent itemsets for detecting and removing
itemsets that are infrequent in the updated database [5]. In
2001, Lee et al. proposed the SWF approach [13]. SWF
partitions databases into several partitions, and applies a
filtering threshold in each partition to generate candidate
itemsets. In 2002, Veloso et al. described the ZigZag
algorithm which uses tidlist and computes maximal
frequent itemsets in the updated database to avoid the
generation of many unnecessary candidates [20].
C2 Count
F2 Count
{AB}
5
{BC} 6
{AC}
5
{BE} 6
{AE}
4
==> {CE} 6
{BC}
6
{BE}
6
{CE}
6
Scan
C3
C3 Count
{BCE} DB {BCE} 4
==>
==>
Infrequent
itemset
2.3 Other Incremental Algorithms
Pass 3
F2
Frequent
itemset
F3 Count
{}
Figure 2. Application of Apriori algorithm
3. New Fast Update Method (NFUP)
The key idea behind of previous incremental mining
techniques is to reduce the number of times that databases
need to be scanned. Although those techniques may avoid
some unnecessary scanning, they do rescan the original
database. The original database is normally much larger
than the incremental database. Therefore, scanning the
original database is time-consuming. This study proposes
a new fast update algorithm (NFUP) for incremental
mining of association rules. NFUP does not require the
rescanning of the original database to detect new frequent
itemsets or delete invalidate itemsets.
2.2 Fast Update Algorithm (FUP)
After a database has been updated, some existing rules
are no longer important and new rules may be introduced.
In 1996, Cheung et al. proposed the FUP algorithm to
efficiently generate associations in the updated database
[8]. The FUP algorithm relies on Apriori and considers
only these newly added transactions. Let db be a set of new
transactions and DB+ be the updated database (including all
transactions of DB and db). An itemset X is either frequent
or infrequent in DB or db. Therefore, X has four
possibilities, as shown in Table 2. In the first pass, FUP
scans db to obtain the occurrence count of each 1-itemset.
Since the occurrence counts of Fk in DB are known in
advance, the total occurrence count of arbitrary X is easily
calculated if X is in Case 2. If X is unfortunately in Case 3,
DB must be rescanned. Similarly, the next pass scans db to
count the candidate 2-itemsets of db. If necessary, DB is
3.1 NFUP Algorithm
FUP rescans the original database when at least one
candidate is in Case 3. In many situations, new information
is more important than old information, such as in
publication database, stock transactions, grocery markets,
3
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
or web-log records. Consequently, a frequent itemset in
the incremental database is also important even if it is
infrequent in the updated database.
To mine new interesting rules in updated database,
NFUP partitions the incremental database logically
according to unit time interval (month, quarter or year, for
example). For each item, assume that the ending time of
exhibition period is identical.
NFUP progressively
accumulates the occurrence count of each candidate
according to the partitioning characteristics. The latest
information is at the last partition of incremental database.
Therefore, NFUP scans each partition backward, namely,
the last partition is scanned first and the first partition is
scanned last. As in the preceding section, the original
transaction database is denoted as DB, where db indicates
the incremental portion, and DB+ signifies the updated
database. The frequent set of itemsets of DB is known in
advance. The new transaction database db includes n unit
time intervals. Logically, db can be divided into n portions
and each portion is called a partition (db =
P1 P2 , ..., Pn where Pn denotes the partition n). Let
dbm, n represent the continuous time interval from partition
Pm to partition Pn, where n t m t 1 and n N. Namely,
dbm, n = Pm Pm+1 , ..., Pn-1 Pn. The NFUP algorithm
is an Apriori-like algorithm. The final set of frequent
itemsets consists of the three following types.
(1) Į set: frequent itemsets in DB+,
(2) ȕ set: frequent itemsets in dbm, n (m d n), but
infrequent in dbm-1, n, and
(3) r set: frequent itemsets in dbm, m, but infrequent in
dbm+1, n.
Figure 3. Process of NFUP for Pm
The pseudo-code of NFUP algorithm is presented as
follows.
Input: (1) DB+: updated database that contain DB (original
database) with size |DB| and the incremental
n
database db1,n with size |db1,n| = | ¦m 1 Pm |, (2) Fk: set
of all frequent k-itemsets in DB, where k = 1, 2,…, h,
and (3) s: minSup.
Output: (1) FĮ (Į set): frequent itemsets in DB+, (2) Fȕ (ȕ
set): frequent itemsets in dbm, n (m d n), but
infrequent in dbm-1, n or infrequent in DB+ (if m = 1),
and (3) Fr (r set): frequent itemsets in dbm, m, but
infrequent in dbm+1, m+1.
Procedure:
// Cmk is the candidate k-itemsets in partition m
// FĮk is the set with k items in FĮ; Fȕk is the set with k items
in Fȕ; Frk is the set with k items in Fr;
// FDB is the set of frequent itemsets in DB
01 FĮ := I ; Fȕ := I ; Fr := I ;
02 for m := n to 1 {
03 k:=1;
04 SubProcedure();
05 for k := 2 to h {
06
Cmk := Apriori-gen(FĮk-1);
07
SubProcedure();
08 }
09 }
10 forall X FĮ FDB {
11 if X FĮ && X FDB {
12
F(X).start = 0 ;
13
F(X).count += FDB(X).count; }
{
14 elseif X FĮ && X FDB
15
F(X).type = ȕ;
}
16 elseif X FDB && X FĮ Fȕ Fr {
17
F(X).type = r;
}
18 }
19 return FĮ, Fȕ, and Fr;
For dbn, n (Pn), the process starts at 1-itemsets. Each
candidate or frequent itemset has three attributes.
(1) X.count: includes the occurrence count in current
partition,
(2) X.start: includes the partition number of the
corresponding starting partition when X becomes an
element of frequent set, and
(3) X.type: denotes one of the three types Į, ȕ, and r. In
the beginning, set Į, set ȕ, and set r are empty.
After Pn has been scanned, all frequent 1-itemsets are
added into the Į set. Each frequent 1-itemset is joined to
form 2-itemset candidates. In Pn, the process is performed
like that of Apriori. NFUP is applied to the next partition
Pn-1 whenever no more candidate k-itemsets can be
generated in Pn. The occurrence count of each candidate in
Pn-1 is known after Pn-1 is scanned. In each partition, NFUP
determines which candidate k-itemset will become an
element of Į, ȕ or r set and identifies from which partition
the k-itemset becomes frequent. After P1 is scanned, the
occurrence count is accumulated with that of DB. Figure 3
graphically depicts NFUP process for Pm, where FĮ is the Į
set, Fȕ is the ȕ set, and Fr is the r set.
4
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
to be scanned. In the beginning, these frequent k-itemsets
in P2 belong to the Į set. Table 5 lists the three types of
frequent set. The column “Start” states the identity of the
start partition. After P1 is scanned, the Į set consists of five
entries. In Į set, the value of start partition of each frequent
k-itemset is modified to be one. Four k-itemsets are
removed to the ȕ set because their total occurrence counts
are less than the threshold (6 * 50% = 3). Furthermore, the
three itemsets {E}, {BE}, and {CE} are presented in r set.
Although {BE} is frequent in db1, 2, it is infrequent in P2, as
marked by “*”. Finally, NFUP adds the occurrence counts
of frequent itemsets in DB to the corresponding frequent
itemsets of the Į set. In Table 5, the final Į set remains
three entries {A}, {B}, and {C}. The ȕ set and r set
increase two and one entries, respectively.
In Line 1, Į, ȕ and r sets are initialized to empty. In
Lines 4 and 7, SubProcedure() is applied to scan a portion
of db and to compute the support number of each candidate.
At the same time, it determines which candidate should be
added to a proper frequent set or be ignored. The
Apriori-gen function takes frequent (k-1)-itemsets to
generate k-itemset candidates. The prune step is included
in the Apriori-gen function. Some of the k-itemsets are
deleted; of these, at least one (k-1) sub-itemset is not in
FĮk-1. From Line 10 to Line 18, the final support value of
each frequent itemset of Į set needs to be added its
occurrence count in DB. Otherwise, they are the elements
of ȕ set. The pseudo-code of SubProcedure() is presented
as follows.
Table 3. Transaction database
SubProcedure():
// the k-th pass in each partition
01 if Cmk I { break; }
02 if m == n {
03
forall X Cmk { F(X).count = 0; } }
04 forall X Cmk { X.count :=0; }
05 forall T Pm { // scan Pm , T:transaction
06
forall X (subset of T) {
07
if X Cmk { X.count++; }
}
08 }
09 forall X Cmk {
10
if X FĮk {
11
if X.count t s*|Pm| {
12
F(X).start = m ; F(X).count += X.count; }
13
else {
14
if F.count + F(X).count t s*|dbm,n| {
15
F(X).count += X.count; }
16
else {
17
FĮk := FĮk - {X}; Fȕk := Fȕk {X};
18
F(X).count += X.count;
19
F(X).type := ȕ; } }
}
20
elseif X Cmk – FĮk – Fȕk – Frk{
21
if X.count t s*|Pm| {
22
F(X).count := X.count;
23
if m == n { F(X).type := Į ;}
24
else { Frk := Frk {X}; F(X).type := r ;}
25
} }
26 }
Date Partition TID Transaction
DB 001 A C D E
1999
002 A C D
|
003 B C E
2001
004 A B C E
005 A B E
2002
P1
006 B C E
007 A B E
008 B C D E
2003
P2
009 A B C D
010 C E F
011 A B C F
P
Table 4. Frequent itemsets in each partition
P2
P1
DB
Itemset Count Itemset Count Itemset Count
{A} 2
{B} 3
{A} 4
{B} 2
{C} 2
{B} 3
{C} 3
{E} 3
{C} 4
{F} 2
{BC} 2
{E} 4
{AB} 2
{BE} 3
{AC} 3
{AC} 2
{CE} 2
{AE} 3
{BC} 2
{BCE} 2
{BE} 3
{CF} 2
{CE} 3
{ABC} 2
P
As shown in Table 5 from 1999 to 2003, the three
publications or products ({A}, {B} and {C}) are very
popular because their start partitions are zero. {AB} and
{BC} are the two elements of the ȕ set and P1 is their start
partition. Thus, from 2002 to 2003, the two combinations
of products are interesting. {AE} is in the r set and the start
partition is zero. Hence, {AE} is very popular from 1999 to
2001. However, {AE} is no longer interesting from 2002
to 2003. The itemset {E} is frequent in traditional
incremental rules mining such as FUP, while {E} is in the r
set of the mining result of NFUP. Although the occurrence
Example 3.1 Consider the transaction database presented
in Table 3 with a minimum support requirement is 50%.
The original database contains five transactions and that six
incremental transactions are divided into two portions {P1,
P2}. DB corresponds to the time granularity from 1999 to
2001. P1 and P2 correspond to the time granularities 2002
and 2003, respectively. All frequent itemsets of DB are
known in advance, and presented in Table 4. Initially, the
three types of frequent set are null. Without loss of
generality, suppose DB is the partition zero (P0). All the
frequent k-itemsets in Pm are also list in Table 4. P2 is first
5
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
count of {E} is eight in DB+, {E} is no longer frequent in
the year 2003.
100,000, P = 1 to 4, |T| = 10, |I| = 4, |L|=2000, and N = 1000.
The notation Tx.Iy.Dz.dm.Pn denotes an updated database
DB+, where |T| = x, |I| = y, |D| = z, |d| = m, and P = n. Figure
4 shows the running time for P = 1, 2 and 4. The minSup
threshold is decreased from 1.2% to 0.2%. NFUP has the
best performance when the partition number of db is one.
Table 5. Frequent itemsets generated
incrementally by NFUP
After P2 is scanned After P1 is scanned Final frequent sets
Į set Start Count Į set Start Count Į set Start Count
{A} 2
2
{A} 1
3
{A} 0
7
{B} 2
2
{B} 1
5
{B} 0
8
{C} 2
3
{C} 1
5
{C} 0
9
{F} 2
2
{AB} 1
3
{AB} 2
2
{BC} 1
4
ȕ set Start Count
{AC} 2
2
{F} 2
2
{BC} 2
2
ȕ set Start Count {AB} 1
3
{CF} 2
2
{F} 2
2
{AC} 2
2
{ABC} 2
2
{AC} 2
2
{BC} 1
4
{CF} 2
2
{CF} 2
2
{ABC} 2
2
{ABC} 2
2
r set Start Count
{E} 1
3
*{BE} 1
3
{CE} 1
2
Table 6. Parameters
|D| Number of transactions in DB
|d| Number of transactions in db
P Partition number of the incremental db
|T| Mean size of the transactions
|I| Mean size of the maximal potentially frequent
itemsets
|L| Number of maximal potentially frequent
itemsets
N Number of items
r set Start Count
{E} 1
3
{AE} 0
3
{BE} 1
3
{CE} 1
2
To test the scalability with the number of transactions of
db, the number of partitions of db is set to 2. Figure 5
presents the results. Consider the two minimum support
thresholds 0.2% and 0.4%, the running time slightly
increases with the growth of db’s size. Thus, NFUP shows
good scalability.
Figure 6 shows the scalability with the number of
transactions of DB, where the incremental database is also
divided into two partitions. The curves of 0.2% and 0.4%
minimum support thresholds are very flat. Therefore, the
running time of NFUP is irrelevant to the number of
transactions of DB.
Running time (sec).
NFUP needs not rescan the original database since the Į
set is frequent in the updated database DB+, and the ȕ set
provides important rules in dbm,n. Determining all frequent
itemsets in DB+ as Apriori or FUP, it requires the
rescanning of the original database only once to check the ȕ
set and r set. However, this rescanning phase is
unnecessary because all frequent itemsets in DB+ are the
subset of Į ȕ r. Furthermore, all the itemsets
contained in ȕ set or r set are interesting. To determine all
frequent itemsets in DB+ as Apriori or FUP, the important
rule could be deleted by pruning some frequent itemsets
from the ȕ set or r set.
The set (Į ȕ r) is the superset of the mining result
of Apriori in DB+. The reason is that if DB is divided into n
partitions, then the frequent itemset X must appear as a
frequent itemset in least one of the n partitions [18].
T10.I4.D100k.d100k.Pn
40
30
n=1
20
n=2
10
n=4
0
0.2
4. Experimental Results
All the experiments were performed on a 1.5GHz
Pentium IV PC with 640 MB of main memory, running
under Windows 2000 professional. The algorithm was
coded in Visual C++ 6.0. The synthetic datasets employed
in our experiments are generated by using the same
technique introduced in [4]. Table 6 is the parameters of
the synthetic data generation program.
We generate the transaction database DB+ with size
|DB+|, where the first |DB| is the original database and the
next |db| (|DB+| - |DB|) is the incremental portion. The
result is shown in Figure 4, where |D| = 100,000, |d| =
0.4 0.6 0.8 1
1.2
Minimum Suppport(%)
Figure 4. Running time of NFUP for n = 1, 2 and 4
6
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
Running tine (sec).
References
T10.I4.D100k.dm .P2
25
[1] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, “A tree
projection algorithm for generation of frequent itemsets,”
Journal of Parallel and Distributed Computing, Vol. 61, No.
3, pp. 350-361, 2000.
[2] C. C. Aggarwal and P. S. Yu, “Mining associations with the
collective strength approach,” IEEE Trans. Knowledge and
Data Engineering, Vol. 13, No. 6, pp. 863-873, 2001.
[3] R. Agrawal, T. Imielinski, and A. Swami, “Mining
association rules between sets of items in large databases” In
Proc. 1993 ACM SIGMOD Intl. Conf. on Management of
Data, Washington, D.C., pp. 207-216, May 1993.
[4] R. Agrawal and R. Srikant, “Fast algorithms for mining
association rules,” In Proc. 20th Intl. Conf. on Very Large
Data Bases, Santiago, Chile, pp. 487-499, Sep. 1994.
[5] N. F. Ayan, A. U. Tansel, and E. Arkun, “An efficient
algorithm to update large itemsets with early pruning,” In
Proc. 5th ACM SIGKDD Intl. Conf. on Knowledge Discovery
and Data Mining, San Diego, CA, pp. 287-291, Aug. 1999.
[6] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, “Dynamic
itemset counting and implication rules for market basket data,”
In Proc. 1997 ACM SIGMOD Intl. Conf. on Management of
Data, Tucson, AZ, pp. 255-264, May 1997.
[7] M. S. Chen, J. Han, and P. S. Yu, “Data mining: An overview
from a database perspective,” IEEE Trans. Knowledge Data
Engineering, Vol. 8 No. 6, pp. 866-883, 1996.
[8] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong,
“Maintenance of discovered association rules in large
databases: an incremental updating technique,” In Proc. 12th
Intl. Conf. on Data Engineering, New Orleans, LA, pp.
106-114, Feb. 1996.
[9] D. W. Cheung, S. D. Lee, and B. Kao, “A general incremental
technique for maintaining discovered association rules,” In
Proc. 5th Intl. Conf. on Database Systems for Advanced
Applications, Melbourne, Australia, pp. 185-194, Apr. 1997.
[10] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without
candidate generation,” In Proc. 2000 ACM-SIGMOD Intl.
Conf. on Management of Data, Dallas, TX, pp. 1-12, May
2000.
[11] T. P. Hong, C. Y. Wang, and Y. H. Tao, “A new incremental
data mining algorithm using pre-large itemsets,” Intelligent
Data Analysis, Vol. 5, No. 2, pp. 111-129, 2001.
[12] C. H. Lee, M. S. Chen, and C. R. Lin, “Progressive partition
miner: An efficient algorithm for mining general temporal
association rules,” IEEE Trans. Knowledge Data Engineering,
Vol. 15, No. 4, pp. 1004-1017, 2003.
[13] C. H. Lee, C. R. Lin, and M. S. Chen, “Sliding-window
filtering: An efficient algorithm for incremental mining,” In
Proc. 10th Intl Conf. on Information and Knowledge
Management, Atlanta, GA, pp. 263-270, Nov. 2001.
[14] K. K. Ng and W. Lam, “Updating of association rules
dynamically,” In Proc. Intl. Symposium on Database
Applications in Non-Traditional Environments, Kyoto, Japan,
pp. 84-91, Nov. 1999.
[15] T. Y. Ng, M. L. Wong, and P. Bao, “Incremental mining of
association patterns on compressed data,” In Proc. Joint 9th
IFSA World Congress and 20th NAFIPS Intl. Conf.,
Vancouver, Canada, pp. 441-446, Jul. 2001.
[16] J. S. Park, M. S. Chen, and P. S. Yu, “An effective
hash-based algorithm for mining association rules,” In Proc.
20
15
0.20%
10
0.40%
5
0
20
40
60
80
100
|db | transaction number (k )
Figure 5. Scalability with the transaction number
of db
T10.I4.Dz .d100k.P2
Running time(sec).
25
20
15
0.20%
10
0.40%
5
0
100
200 400 600 800 1000
|DB | transaction number (k )
Figure 6. Scalability with the transaction number
of DB
5. Conclusions
In the real world, databases are periodically and
continually updated. Therefore, mining must be repeated.
Valid patterns and rules must to be efficiently generated.
Incremental mining must usually involve the original
database and the new added transactions. Scanning the
original database is very expensive, so the proposed
method outperforms others by avoiding the rescanning of
the original database.
This investigation has presented a new method, NFUP,
for incremental mining. NFUP does not require the
rescanning of the original database and can determine new
frequent itemsets at the latest time intervals. The proposed
method uses information available from a following
partition to avoid the rescanning of the original database; it
requires only the incremental database to be scanned. In
reality, the transaction number of the incremental database
is very small in contrast to the original database. The
running time of NFUP rises almost in direct proportion
with the transaction number of the incremental database.
Accordingly, NFUP is suited frequently updated databases.
In the future, the authors will consider the extension of the
NFUP algorithm to sequence rules.
7
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
1995 ACM-SIGMOD Intl. Conf. on Management of Data, San
Jose, CA, pp. 175-186, May 1995.
[17] N. L. Sarda and N. V. Srinivas, “An adaptive algorithm for
incremental mining of association rules,” In Proc. 9th Intl.
Workshop on Database and Expert Systems Applications,
Vienna, Austria, pp. 240-245, Aug. 1998.
[18] A. Savasere, E. Omiecinski, and S. Navathe, “An efficient
algorithm for mining association rules in large databases,” In
Proc. 21st Conf. on Very Large Data Bases, Zurich,
Switzerland, pp. 432-444, Sep. 1995.
[19] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, “An
efficient algorithm for the incremental updating of association
rules in large database,” In Proc. 3rd Intl. Conf. on Data
Mining and Knowledge Discovery, Newport Beach, CA, pp.
263-266, Aug. 1997.
[20] A. A. Veloso, W. Meira Jr., M. B. de Carvalho, B. Pôssas, S.
Parthasarathy, and M. J. Zaki, “Mining frequent itemsets in
evolving databases,” In Proc. 2nd SIAM Intl. Conf. on Data
Mining, Arlington, VA, Apr. 2002.
8
Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA’05)
1097-8585/05 $20.00 © 2005 IEEE
View publication stats