Mining Frequent Patterns From Very High Dimensional Data: A Top-Down Row Enumeration Approach
based mining algorithm. To show its effectiveness, we design an algorithm, called TD-Close, to mine a complete set of frequent closed patterns, and compare it with two bottom-up search algorithms, Carpenter [5] and FPclose [15]. Here are the main contributions of this paper:

(1) A top-down search method and a novel row-enumeration tree are proposed to take advantage of the pruning power of the minimum support threshold. Our experiments and analysis show that this cuts down the search space dramatically. This is critical for mining high dimensional data, because the dataset is usually big, and without pruning the huge search space, one has to generate a very large set of candidate itemsets for checking.

(2) A new method, called closeness-checking, is developed to check efficiently and effectively whether a pattern is closed. Unlike other existing closeness-checking methods, it needs to scan neither the mining data set nor the result set, and it is easy to integrate with the top-down search process. The correctness of this method is demonstrated by both theoretical proof and experimental results.

(3) An algorithm using the above two methods is designed and implemented to discover a complete set of frequent closed patterns. Experimental results show that this algorithm is more efficient and uses less memory than the bottom-up search styled algorithms Carpenter and FPclose.

The remainder of the paper is organized as follows. In Section 2, we present some preliminaries and the mining task. In Section 3, we describe our top-down search strategy and compare it with the bottom-up search strategy. We present the new algorithm in Section 4 and conduct an experimental study in Section 5. Finally, we review related work in Section 6 and conclude the study in Section 7.

2 Preliminaries

A table T is a triple (S, I, R), where S = {r1, r2, ..., rn} is a set of rows, each identified by a row id (rid) ri, I = {i1, i2, ..., im} is a set of items, and R ⊆ S × I is a relation. For an ri ∈ S and an ij ∈ I, (ri, ij) ∈ R denotes that ri contains ij, or ij is contained by ri.

Let TT be the transposed table of T, in which each row corresponds to an item ij and consists of the set of rids that contain ij in T. For clarity, we call each row of TT a tuple. Table TT is a triple (I, S, R), where R ⊆ S × I is a relation. For a rid ri ∈ S and an item ij ∈ I, (ri, ij) ∈ R denotes that ij contains ri, or ri is contained by ij.

Example 2.1 (Table and transposed table) Table 2.1 shows an example table T with 4 attributes (columns): A, B, C and D. The corresponding transposed table TT is shown in Table 2.2. For simplicity, we use the number i (i = 1, 2, ..., n) instead of ri to represent each rid. In order to describe our search strategy and mining algorithm clearly, we need to define an order on these rows. In this paper, we use the numerical order of rids, i.e., a row j is greater than a row k if j > k.

Let the minimum support threshold (denoted minsup) be set to 2. All tuples whose number of rids is less than minsup are deleted from TT. The table TT shown in Table 2.2 has already been pruned by minsup. This kind of pruning will be further explained in the following sections.

In this paper we aim to discover the set of frequent closed patterns. Some concepts related to it are defined as follows.

Table 2.1 An example table T

ri   A    B    C    D
1    a1   b1   c1   d1
2    a1   b1   c2   d2
3    a1   b1   c1   d2
4    a2   b1   c2   d2
5    a2   b2   c2   d3

Table 2.2 Transposed table TT of T

itemset   rowset
a1        1, 2, 3
a2        4, 5
b1        1, 2, 3, 4
c1        1, 3
c2        2, 4, 5
d2        2, 3, 4
Definition 2.1 Given an itemset I and a rowset S, define

r(I) = { ri ∈ S | ∀ij ∈ I, (ri, ij) ∈ R }
i(S) = { ij ∈ I | ∀ri ∈ S, (ri, ij) ∈ R }

Based on these definitions, we define C(I) as the closure of an itemset I, and C(S) as the closure of a rowset S, as follows:

C(I) = i(r(I))
C(S) = r(i(S))

Note that Definition 2.1 is applicable to both table T and table TT.

Definition 2.2 (Closed itemset and closed rowset) An itemset I is called a closed itemset iff I = C(I). Likewise, a rowset S is called a closed rowset iff S = C(S).

Definition 2.3 (Frequent itemset and large rowset) Given an absolute, user-specified threshold minsup, an itemset I is called frequent if |r(I)| ≥ minsup, and a rowset S is called large if |S| ≥ minsup, where |r(I)| is called the support of itemset I and minsup is called the minimum support threshold. |S| is called the size of rowset S, and minsup is accordingly also called the minimum size threshold. Further, an itemset I is called a frequent closed itemset if it is both closed and frequent. Likewise, a rowset S is called a large closed rowset if it is both closed and large.

Example 2.2 (Closed itemset and closed rowset) In Table 2.1, for the itemset {b1, c2}, r({b1, c2}) = {2, 4} and i({2, 4}) = {b1, c2, d2}, so C({b1, c2}) = {b1, c2, d2}. Therefore, {b1, c2} is not a closed itemset. If minsup = 2, it is a frequent itemset. In Table 2.2, for the rowset S = {1, 2}, i({1, 2}) = {a1, b1} and r({a1, b1}) = {1, 2, 3}, so C(S) = {1, 2, 3}. Rowset {1, 2} is therefore not a closed rowset, but {1, 2, 3} apparently is.

Mining task: Originally, we want to find, from table T, all of the frequent closed itemsets that satisfy the minimum support threshold minsup. After transposing T into the transposed table TT, the mining task becomes finding, from table TT, all of the large closed rowsets that satisfy the minimum size threshold minsup.
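To make these operators concrete, the following minimal Python sketch implements the transposition step and the r, i, and closure operators on the example table T. The sketch and all helper names are our illustration, not the paper's implementation; it reproduces Table 2.2 and Example 2.2.

# Illustrative sketch of Definition 2.1 and the closure operators on Table 2.1.
from collections import defaultdict

TABLE = {1: {"a1", "b1", "c1", "d1"}, 2: {"a1", "b1", "c2", "d2"},
         3: {"a1", "b1", "c1", "d2"}, 4: {"a2", "b1", "c2", "d2"},
         5: {"a2", "b2", "c2", "d3"}}

def transpose(table, minsup):
    """Build TT: item -> rowset, pruning tuples with support below minsup."""
    tt = defaultdict(set)
    for rid, items in table.items():
        for item in items:
            tt[item].add(rid)
    return {item: rids for item, rids in tt.items() if len(rids) >= minsup}

def r_of(itemset):
    """r(I): rids of the rows that contain every item of I."""
    return {rid for rid, items in TABLE.items() if itemset <= items}

def i_of(rowset):
    """i(S): items common to every row of S."""
    rows = [TABLE[rid] for rid in rowset]
    return set.intersection(*rows) if rows else set()

# C(I) = i(r(I)) and C(S) = r(i(S)); Example 2.2 and Table 2.2 check out:
assert transpose(TABLE, 2)["c2"] == {2, 4, 5}
assert i_of(r_of({"b1", "c2"})) == {"b1", "c2", "d2"}   # {b1, c2} is not closed
assert r_of(i_of({1, 2})) == {1, 2, 3}                  # {1, 2} is not closed
assert r_of(i_of({1, 2, 3})) == {1, 2, 3}               # {1, 2, 3} is closed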
3 Top-down Search Strategy

Before presenting our top-down search strategy, we first look at the bottom-up search strategy used by the previous mining algorithms [5, 6, 7, 8]. For simplicity, we use Carpenter as a representative of this group of algorithms, since they use the same kind of search strategy.

3.1 Bottom-up Search Strategy

Figure 3.1 shows a row enumeration tree that uses the bottom-up search strategy. By bottom-up we mean that along every search path, we search the row enumeration space from small rowsets to large ones: first single rows, then 2-rowsets, ..., and finally n-rowsets. Both depth-first and breadth-first search of this tree belong to this search strategy.

In Figure 3.1, each node represents a rowset. Our mining task is to discover all of the large closed rowsets, so the main constraint for mining is the size of a rowset. Since this constraint is monotonic with respect to the bottom-up search order, it is hard to prune the row enumeration search space early. For example, suppose minsup is set to 3; although obviously none of the nodes in the first two levels below the root can satisfy this constraint, these nodes still need to be checked [5, 6, 7, 8]. As a result, as minsup increases, the time needed to complete the mining process cannot decrease rapidly. This limits the applicability of this kind of algorithm in real situations.

Figure 3.1 Bottom-up row enumeration tree

In addition, the memory cost of this kind of bottom-up search is also big. Take Carpenter as an example. Similar to several other algorithms [6, 7, 8], Carpenter uses a pointer list to point to each tuple belonging to an x-conditional transposed table. For a table with n rows, the maximum number of different levels of pointer lists that need to remain in memory is n, although among them the first (minsup − 1) levels of pointer lists do not contribute to the final result.

These observations motivate the proposal of our method.

3.2 Top-down Search Strategy

Almost all of the frequent pattern mining algorithms that deal with the data set without transposition use the anti-monotonicity property of minsup to speed up the mining process. For a transposed data set, the minsup
constraint maps to the size of a rowset. In order to stop further search once minsup can no longer be satisfied, we propose to exploit a top-down search strategy instead of the bottom-up one. To do so, we design a row enumeration tree that follows this strategy, shown in Figure 3.2. Contrary to the bottom-up search strategy, the top-down search strategy means that along each search path the rowsets are checked from large ones to small ones.

In Figure 3.2, each node represents a rowset. We define the level of the root node as 0, so the highest level for a data set with n rows is (n − 1).

It is easy to see from Figure 3.2 that for a table with n rows, if the user-specified minimum support threshold is minsup, we do not need to search any rowsets at levels greater than (n − minsup) of the row enumeration tree. For example, suppose minsup = 3; for the data set shown in Table 2.1, we can stop further search at level 2, because the rowsets represented by nodes at levels 3 and 4 cannot contribute to the set of frequent patterns at all.

Figure 3.2 Top-down row enumeration tree

For each node of this tree we maintain a conditional table, called an x-excluded transposed table; for adjacent nodes, the corresponding tables are called parent table and child table respectively. Following is the definition of the x-excluded transposed table.

Definition 3.1 (x-excluded transposed table) Given a rowset x = {ri1, ri2, ..., rik} ordered so that ri1 > ri2 > ... > rik, a minimum support threshold minsup, and its parent table TT|p, an x-excluded transposed table TT|x is a table in which each tuple contains only rids less than any of the rids in x, and at the same time contains all of the rids greater than any of the rids in x. Rowset x is called an excluded rowset.

Example 3.1 (x-excluded transposed table) For the transposed table TT shown in Table 2.2, two of its x-excluded transposed tables, TT|54 and TT|4, are shown in Tables 3.1 and 3.2 respectively, assuming minsup = 2.

Table 3.1 shows the x-excluded transposed table TT|54, where x = {5, 4}. In this table, each tuple contains only rids that are less than 4, and contains at least two such rids, since minsup is 2. Since the largest rid in the original data set is 5, there are no other rids that the tuples are required to contain. The procedure to get this table is shown in Example 3.2.

Table 3.2 is the x-excluded transposed table TT|4, where x = {4}. Its parent table is the table shown in Table 2.2. Each tuple in TT|4 must contain rid 5, as it is greater than 4, and must in the meantime contain at least one rid less than 4, as minsup is set to 2. As a result, among the tuples in Table 2.2, only those containing rid 5 can be candidate tuples of TT|4, and only tuples a2 and c2 satisfy this condition. But tuple a2 does not satisfy minsup after rid 4 is excluded, so only one tuple is left in TT|4. Note that although the current size of tuple c2 in TT|4 is 1, its actual size is 2, since it contains rid 5, which is not listed explicitly in the table.

Table 3.1 TT|54

itemset   rowset
a1        1, 2, 3
b1        1, 2, 3
c1        1, 3
d2        2, 3

Table 3.2 TT|4

itemset   rowset
c2        2

In general, an x-excluded transposed table TT|x can be obtained from its parent table by the following steps:

(1) Select as candidate tuples the tuples of the parent table that contain all of the rids greater than ri1.
(2) For each tuple obtained in the first step, keep only the rids less than rik.
(3) Get rid of the tuples containing fewer than (minsup − j) rids, where j is the number of rids in S greater than ri1.

The reason for the operation in step 3 will be given in Section 4. Note that the original transposed table corresponds to TT|∅, where ∅ is the empty rowset.

Figure 3.3 shows the excluded row enumeration tree corresponding to the row enumeration tree shown in Figure 3.2. This tree shows the parent-child relationship between the excluded rowsets.

Figure 3.3 Excluded row enumeration tree
Example 3.2 (Procedure to get an x-excluded transposed table) Take TT|54 as an example; here are the steps to get it. Table TT|5, shown in Table 3.3, is its parent table.

Table 3.3 TT|5

itemset   rowset
a1        1, 2, 3
b1        1, 2, 3, 4
c1        1, 3
c2        2, 4
d2        2, 3, 4

(1) Each tuple in table TT|5 is a candidate tuple of TT|54, as there is no rid greater than 5 in the original data set.
(2) After excluding the rids not less than 4, the table is as shown in Table 3.4.
(3) Since tuple c2 contains only one rid, it does not satisfy minsup and is thus pruned from TT|54. We then get the final TT|54 shown in Table 3.1.

Table 3.4 TT|54 without pruning

itemset   rowset
a1        1, 2, 3
b1        1, 2, 3
c1        1, 3
c2        2
d2        2, 3

From Definition 3.1 and the above procedure we can see that, due to the minsup threshold, the x-excluded transposed tables become smaller and smaller, so the search space shrinks rapidly.
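As an illustration, the following Python sketch materializes Definition 3.1 directly from the full transposed table TT rather than incrementally from a parent table; the direct construction and all names are our own simplification, but the sketch reproduces Tables 3.1 and 3.2.

# Illustrative construction of TT|x from the full transposed table (item -> rowset).
TT = {"a1": {1, 2, 3}, "a2": {4, 5}, "b1": {1, 2, 3, 4},
      "c1": {1, 3}, "c2": {2, 4, 5}, "d2": {2, 3, 4}}

def excluded_table(tt, x, n, minsup):
    lo = min(x)
    # Rids above min(x) that are not excluded remain implicitly contained.
    required = {r for r in range(lo + 1, n + 1) if r not in x}
    result = {}
    for item, rids in tt.items():
        if not required <= rids:                 # step 1: candidate tuples only
            continue
        explicit = {r for r in rids if r < lo}   # step 2: keep rids below min(x)
        if len(explicit) >= minsup - len(required):   # step 3: size pruning
            result[item] = explicit
    return result

assert excluded_table(TT, {5, 4}, 5, 2) == {
    "a1": {1, 2, 3}, "b1": {1, 2, 3}, "c1": {1, 3}, "d2": {2, 3}}  # Table 3.1
assert excluded_table(TT, {4}, 5, 2) == {"c2": {2}}                # Table 3.2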
As for the memory cost, in order to compare with Carpenter, we also use a pointer list to simulate the x-excluded transposed table. What is different is that this pointer list keeps track of rowsets from the end of each tuple of TT, and we also split it according to the current rid. We do not discuss the details of the implementation due to space limitations. What is clear, however, is that when we stop searching at level (n − minsup), we do not need to spend any memory on the excluded transposed tables corresponding to nodes at levels greater than (n − minsup), and we can release the memory used for the nodes along the current search path. Therefore, compared to Carpenter, our method saves more memory. This is also demonstrated in our experimental study, as Carpenter often runs out of memory before completion.

4 Algorithm

To mine frequent closed itemsets from high dimensional data using the top-down search strategy, we design an algorithm, TD-Close, and compare it with the corresponding bottom-up based algorithm, Carpenter. In this section, we first present our new closeness-checking method and then describe the new algorithm.

4.1 Closeness-Checking Method

To avoid generating all the frequent itemsets during the mining process, it is important to perform the closeness checking as early as possible. Thus we have developed an efficient closeness-checking method, based on the following lemmas.
Lemma 4.1 Given an itemset I and a rowset S, the following two equations hold:

r(I) = r(i(r(I)))  (1)
i(S) = i(r(i(S)))  (2)

Proof. Since r(I) is the set of rows that share a given set of items, and i(S) is the set of items common to a set of rows, according to these semantics and Definition 2.1, the two equations are obviously correct.

Lemma 4.2 In transposed table TT, a rowset S is closed iff it can be represented by an intersection of a set of tuples, that is,

∃I such that S = r(I) = ∩j r({ij}), where ij ∈ I and I = {i1, i2, ..., il}.

Proof. First, we prove that if S is closed then S = r(I) holds. If S is closed, then according to Definition 2.2, S = C(S) = r(i(S)); we can always set I = i(S), so S = r(I) holds. Next we prove that if S = r(I) holds then S is closed. If S = r(I) holds for some itemset I, then according to Definition 2.1, C(S) = r(i(S)) = r(i(r(I))). Based on Lemma 4.1 we have r(i(r(I))) = r(I) = S, so C(S) = S holds. Therefore S is closed.

Lemma 4.3 Given a rowset S, in transposed table TT, consider every tuple ij containing S, that is, every ij ∈ i(S). If S ≠ ∩j r({ij}), where ij ranges over i(S), then S is not closed.

Proof. For every tuple ij ∈ i(S), S ⊆ r({ij}), so S ⊆ ∩j r({ij}) apparently holds. If S ≠ ∩j r({ij}), then S ⊂ ∩j r({ij}), which means that there exists at least one more rid, say y, such that S ∪ {y} ⊆ ∩j r({ij}) = r(i(S)). So S ⊂ r(i(S)) = C(S). Therefore S is not closed.

Lemmas 4.2 and 4.3 are the basis of our closeness-checking method. In order to speed up this kind of checking, we add some additional information to each x-excluded transposed table: the third column of the tables shown in Tables 4.1 and 4.3. This so-called skip-rowset of a tuple is a set of rids that keeps track of the rids that have been excluded from that tuple in all of its parent tables. When two tuples in an x-excluded transposed table have the same rowset, they are merged into one tuple, and the intersection of the two corresponding skip-rowsets becomes the skip-rowset of the merged tuple.

Example 4.1 (Skip-rowset and merge of an x-excluded transposed table) In Example 3.2, when we got TT|54 from its parent TT|5, we excluded rid 4 from tuples b1 and d2 respectively. The skip-rowsets of these two tuples in TT|5 were empty, as they do not contain rid 5 in TT|∅. Therefore, the skip-rowset of each of these two tuples in TT|54 is {4}. Table 4.1 shows TT|54 with this kind of skip-rowset.

In Table 4.1, the first two tuples have the same rowset, {1, 2, 3}. After merging these two tuples, the table becomes Table 4.2. The skip-rowset of the merged tuple becomes empty, because the intersection of an empty set and any other set is still empty. An empty intersection means that the current rowset is the result of an intersection of tuples. When it is time to output a rowset, its skip-rowset is checked: if it is empty, then by Lemma 4.2 the rowset must be a closed rowset.

Table 4.1 TT|54 with skip-rowset

itemset   rowset    skip-rowset
a1        1, 2, 3
b1        1, 2, 3   4
c1        1, 3
d2        2, 3      4

Table 4.2 TT|54 after merge

itemset   rowset    skip-rowset
a1 b1     1, 2, 3
c1        1, 3
d2        2, 3      4

Table 4.3 TT|4 with skip-rowset

itemset   rowset    skip-rowset
c2        2         4
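The merge-and-intersect step can be sketched in a few lines of Python; the tuple representation and all names are ours, not the paper's. An empty skip-rowset licenses output as closed; a tuple with a non-empty skip-rowset is simply not output at the current node.

# Illustrative merge of tuples with identical rowsets, intersecting their
# skip-rowsets (Example 4.1). A tuple is (itemset, rowset, skip_rowset).
def merge_tuples(tuples):
    merged = {}
    for items, rowset, skip in tuples:
        key = frozenset(rowset)
        if key in merged:
            old_items, old_skip = merged[key]
            merged[key] = (old_items | items, old_skip & skip)  # intersect skips
        else:
            merged[key] = (set(items), set(skip))
    return [(items, set(key), skip) for key, (items, skip) in merged.items()]

tt54 = [({"a1"}, {1, 2, 3}, set()),      # Table 4.1
        ({"b1"}, {1, 2, 3}, {4}),
        ({"c1"}, {1, 3}, set()),
        ({"d2"}, {2, 3}, {4})]

for items, rowset, skip in merge_tuples(tt54):
    status = "closed, may output" if not skip else "defer"
    print(sorted(items), sorted(rowset), status)
# {a1, b1} with rowset {1, 2, 3} now has an empty skip-rowset, as in Table 4.2.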
4.2 The TD-Close Algorithm

Based on the top-down search strategy and the closeness-checking method, we design an algorithm, called TD-Close, to mine all of the frequent closed patterns from table T.

Figure 4.1 shows the main steps of the algorithm. It begins with the transposition operation that transforms table T into the transposed table TT. Then, after initializing the set of frequent closed patterns FCP to the empty set and excludedSize to 0, the subroutine TopDownMine is called to process each x-excluded transposed table and find all of the frequent closed itemsets. The general processing order of rowsets is equivalent to a depth-first search of the row enumeration tree shown in Figure 3.2.
Subroutine TopDownMine takes an x-excluded transposed table and two other variables, cMinsup and excludedSize, as parameters, and checks each candidate rowset of the x-excluded transposed table to see if it is closed. Candidate rowsets are those large rowsets that occur at least once in table TT. Parameter cMinsup is a dynamically changing minimum support threshold, as indicated in step 5, and excludedSize is the size of rowset x. There are five main steps in this subroutine, which are explained one by one as follows.

Algorithm TD-Close
Input: Table T, and minimum support threshold, minsup
Output: A complete set of frequent closed patterns, FCP
Method:
1. Transform T into transposed table TT
2. Initialize FCP = ∅ and excludedSize = 0
3. Call TopDownMine(TT|∅, minsup, excludedSize)

Subroutine TopDownMine(TT|x, cMinsup, excludedSize)
Method:
1. Pruning 1: if excludedSize >= (n − minsup), return.
2. Pruning 2: if the size of TT|x is 1, output the corresponding itemset if the rowset is closed, and then return.
3. Pruning 3: derive TT|xy and TT'|x, where y is the largest rid among the rids in tuples of TT|x,
   TT'|x = {tuple ti | ti ∈ TT|x and ti contains y},
   TT|xy = {tuple ti | ti ∈ TT|x, and if ti contains y, the size of ti must be greater than cMinsup}.
   Note that we delete y from the tuples of both TT|xy and TT'|x.
4. Output: add to FCP the itemset corresponding to each rowset in TT|xy with the largest size, say k, and ending with rid k.
5. Recursive calls:
   TopDownMine(TT|xy, cMinsup, excludedSize + 1)
   TopDownMine(TT'|x, cMinsup − 1, excludedSize)

Figure 4.1 Algorithm TD-Close

In step 1, we apply pruning strategy 1 to stop processing the current excluded transposed table.

Pruning strategy 1: If excludedSize is equal to or greater than (n − minsup), then there is no need for any further recursive calls of TopDownMine.

ExcludedSize is the number of rids excluded from the current table. If it is not less than (n − minsup), the size of each rowset in the current transposed table must be less than minsup, so these rowsets cannot possibly become large.

In step 2, we apply pruning strategy 2 to stop further recursive calls.

Pruning strategy 2: If an x-excluded transposed table contains only one tuple, it is not necessary to make any further recursive calls to deal with its child transposed tables.

The reason for this pruning strategy is apparent. Suppose the rowset corresponding to this tuple is S. From the itemset point of view, no child transposed table of the current table can produce an itemset different from the one corresponding to rowset S. From the rowset point of view, each rowset Si corresponding to a child transposed table of S is a subset of S, and Si cannot be closed, because r(i(Si)) ⊇ S holds, and therefore Si ⊂ r(i(Si)) holds.

Of course, before returning according to pruning strategy 2, the current rowset S might be a closed rowset, so if its skip-rowset is empty, we need to output it first.

Example 4.2 (Pruning strategy 2) Table TT|4, shown in Table 4.3, contains only one tuple. After we check this tuple to see if it is closed (it is apparently not closed, as its skip-rowset is not empty), we do not need to produce any child excluded transposed table from it. That is, in the excluded row enumeration tree shown in Figure 3.3, all of the child nodes of node {4} are pruned. This is because no subset of the current rowset can be closed anymore, since it is already contained in a larger rowset.

Step 3 derives from TT|x two child excluded transposed tables, TT|xy and TT'|x, where y is the largest rid among all rids in tuples of TT|x. These two tables correspond to a partition of the current table TT|x: TT|xy is the sub-table without y, and TT'|x is the sub-table in which every tuple contains y. Since every rowset that will be derived from TT'|x must contain y, we delete y from TT'|x and at the same time decrease cMinsup by 1. Pruning strategy 3 is applied to shrink table TT|xy.

Pruning strategy 3: Each tuple t containing rid y in TT|x is deleted from TT|xy if the size of t (that is, the number of rids t contains) equals cMinsup.

Example 4.3 (Pruning strategy 3) Suppose we have just finished dealing with table TT|54, shown in Table 4.2, and we need to create TT|543 with cMinsup being 2. Then, according to pruning strategy 3, tuples c1 and d2 are pruned from TT|543, because after excluding rid 3 from these two tuples, their sizes become less than cMinsup, although they currently satisfy the minsup threshold. As a result, there is only one tuple, {a1b1}, left in TT|543, as shown in Table 4.4.
Table 4.4 TT|543

itemset   rowset   skip-rowset
a1 b1     1, 2     3

We can get these two tables, TT|xy and TT'|x, by the following steps. First, for each tuple t in TT|x containing rid y, delete y from it and copy it to TT'|x. Then check whether the size of t is now less than cMinsup: if not, keep t and in the meantime put y into its skip-rowset; otherwise remove t from TT|x. Finally, TT|x becomes TT|xy, and at the same time TT'|x is obtained.
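A compact Python sketch of this split (our own rendering, with a table as a list of (itemset, rowset, skip-rowset) triples and implicit rids left to the caller) reproduces Table 4.4 and the child table TT'|54:

# Illustrative derivation of the two child tables from TT|x (step 3).
def split_table(table, cminsup):
    y = max(r for _, rows, _ in table for r in rows)   # largest rid present
    tt_xy, tt_x_prime = [], []
    for items, rows, skip in table:
        if y in rows:
            rest = rows - {y}
            tt_x_prime.append((items, rest, set(skip)))  # y kept implicitly
            if len(rest) >= cminsup:                     # pruning strategy 3
                tt_xy.append((items, rest, skip | {y}))  # y moves to the skip-rowset
        else:
            tt_xy.append((items, set(rows), set(skip)))
    return tt_xy, tt_x_prime, y

tt54 = [({"a1", "b1"}, {1, 2, 3}, set()),   # Table 4.2
        ({"c1"}, {1, 3}, set()),
        ({"d2"}, {2, 3}, {4})]
tt543, tt54_prime, y = split_table(tt54, 2)
assert y == 3
assert tt543 == [({"a1", "b1"}, {1, 2}, {3})]   # Table 4.4: c1 and d2 pruned
# tt54_prime (rid 3 implicit) is then processed with cMinsup decreased to 1.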
Step 4 is the major output step. If there is a tuple in TT|xy with an empty skip-rowset whose size, say k, is the largest among the tuples of TT|xy and which contains rid k, then output it. Since this tuple has the largest size among all of the tuples in TT|xy, there is no other tuple that can contribute to it, and therefore it can be output. For example, in table TT|54, shown in Table 4.2, the itemset a1b1 corresponding to rowset {1, 2, 3} can be output, as its size is 3, which is larger than the sizes of the other two tuples, and it also contains the largest rid, 3.

Step 5 makes two recursive calls to deal with TT|xy and TT'|x respectively.
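To make the overall search concrete, here is a deliberately simplified, brute-force Python sketch of top-down row enumeration with pruning strategy 1, using Lemma 4.2 (S is closed iff S = r(i(S))) for closeness checking in place of the skip-rowset machinery. Everything here is illustrative, not the paper's optimized implementation.

# Brute-force top-down row enumeration on the running example (minsup = 2).
TABLE = {1: {"a1", "b1", "c1", "d1"}, 2: {"a1", "b1", "c2", "d2"},
         3: {"a1", "b1", "c1", "d2"}, 4: {"a2", "b1", "c2", "d2"},
         5: {"a2", "b2", "c2", "d3"}}
MINSUP, N = 2, len(TABLE)

def i_of(rowset):
    rows = [TABLE[r] for r in rowset]
    return set.intersection(*rows) if rows else set()

def r_of(itemset):
    return {r for r, items in TABLE.items() if itemset <= items}

results = set()

def top_down(kept, y, excluded):
    # Pruning strategy 1 (boundary adapted because this sketch checks
    # rowsets at the leaves): once more than n - minsup rids are excluded,
    # no rowset on this branch can still be large.
    if excluded > N - MINSUP:
        return
    if y == 0:                       # every rid decided: `kept` is the rowset
        items = i_of(kept)
        # Closed iff kept == r(i(kept)) (Lemma 4.2); skip the empty itemset.
        if len(kept) >= MINSUP and items and kept == r_of(items):
            results.add((frozenset(items), frozenset(kept)))
        return
    top_down(kept | {y}, y - 1, excluded)   # keep rid y: larger rowsets first
    top_down(kept, y - 1, excluded + 1)     # then exclude rid y

top_down(set(), N, 0)
assert (frozenset({"a1", "b1"}), frozenset({1, 2, 3})) in results
assert len(results) == 8   # the 8 frequent closed itemsets of Table 2.1

On this 5-row example the sketch visits every rowset of size at least minsup; TD-Close avoids such exhaustive enumeration by carrying the x-excluded tables and skip-rowsets down the same tree.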
5 Experimental Study

In this section, we study the performance of the algorithm TD-Close. Experiments were conducted on both synthetic data sets and real microarray data sets. Carpenter has already shown much better performance than column enumeration-based algorithms such as CLOSET [10] and CHARM [11], so we only compare our algorithm with Carpenter and with another column enumeration-based algorithm, FPclose, which won the FIMI'03 best implementation award [15, 16]. All experiments were performed on a PC with a Pentium-4 1.5 GHz CPU, 1 GB of RAM, and a 30 GB hard disk. All of the runtimes plotted in the figures include both computation time and I/O time.

For algorithm FPclose, we downloaded the source of the implementation from the FIMI repository [17]. For algorithm Carpenter, we implemented it to the best of our knowledge according to paper [5] and its subsequent papers [6, 7, 8], and we also improved it by using a faster closeness-checking method, backward pruning, which is used in several other algorithms dealing with microarray data [6, 8]. The original closeness-checking method is that, before outputting each newly found itemset, we must check whether it has already been found: if not, output it; otherwise, discard it. This method is slower than the backward pruning method, because an itemset from a very high dimensional data set usually contains a large number of items and is not in a specific order, so comparison against a large number of large itemsets takes a long time. This is also the reason why some algorithms proposed after Carpenter use the backward checking method. However, even with the backward pruning strategy, when dealing with every rowset Carpenter still needs to scan the corresponding conditional transposed table to find the largest common rowset, and also needs to scan the original transposed table to do the backward pruning, so it requires many scans of the transposed table. In addition, to make the comparison fair, we use a flat table to represent the transposed table TT instead of other data structures, such as FP-tree, that might speed up the search to some extent.

5.1 Synthetic Data Sets

In order to test the performance of our top-down strategy based algorithm TD-Close with respect to several aspects, we use synthetic data sets first. Figures 5.1 to 5.8 illustrate the change of running time as minsup decreases for data sets with different numbers of dimensions and tuples and different cardinalities. We use D#T#C# to denote a specific dataset, where D# stands for dimension (the number of attributes of the data set), T# for the number of tuples, and C# for cardinality (the number of values per dimension, or attribute). All data are generated randomly. In these experiments, D# ranges from 4000 to 10000, T# varies among 100, 150 and 200, and C# varies among 8, 10 and 12.

To test the performance of the three algorithms with respect to the number of dimensions, we created data sets with 4000, 6000, 8000 and 10000 dimensions respectively. Figures 5.1 to 5.4 show the effect of changing dimensionality on the runtime of the three algorithms.

We can see from Figure 5.1 that as minsup decreases, the runtime of all three algorithms increases, and TD-Close is the fastest among them. The runtimes of TD-Close and Carpenter grow rather slowly, but when minsup reaches 10, the runtime of FPclose increases dramatically. This is because when minsup reaches 10, the number of frequent single items increases dramatically. Since FPclose is a column enumeration-based algorithm, the increase in the number of frequent items leads to an explosion of the number of frequent itemsets that need to be checked one by one. On the other hand, row enumeration-based algorithms such as TD-Close and Carpenter search the row combination space, and the number of rows influences their runtime much more than the number of frequent items does. The reason that TD-Close uses less time than Carpenter is that TD-Close can prune the search space much more than Carpenter can, and can stop searching much earlier.

Figures 5.2 to 5.4 indicate the same trend as Figure 5.1. What is different is that we
cannot get all of the runtimes for Carpenter and FPclose on some datasets. For example, for dataset D6000T100C10, we cannot get the runtime for Carpenter when minsup is less than 13, nor for FPclose when minsup is less than 11, either because the algorithm cannot run to the end in reasonable time (within several hours) or because a memory error occurred during running. The same situation also occurs in some of the other experiments.

[Figures 5.1 to 5.6 plot runtime (s) against minsup for the three algorithms; most captions were lost in extraction.]

Figure 5.4 Runtime for D10000T100C10

Figure 5.5 Runtime for D4000T150C10
At relatively high values of minsup, FPclose runs a little faster than TD-Close. However, when minsup becomes relatively small, TD-Close runs much faster than FPclose. The reason is the same as explained above: when minsup is high, FPclose can cut the itemset space down to a very small size using the minsup threshold, so the search time is limited. But once the number of frequent items becomes large, the itemset search space increases exponentially, while the number of rows remains relatively stable.

In the above two groups of experiments, the cardinality of each data set is set to 10, which means that each dimension has 10 distinct values. To test the performance with respect to different cardinalities, two other datasets were created, with 12 distinct values and 8 distinct values per dimension respectively. The number of dimensions is 4000 and the number of tuples is 100. The experimental results are shown in Figures 5.7 and 5.8.

Figure 5.7 Runtime for D4000T100C12

Figure 5.8 Runtime for D4000T100C8

Figure 5.7 shows that when minsup becomes small, for example less than 9, both TD-Close and Carpenter spend less time than FPclose. When minsup reaches 6, the runtimes of TD-Close and Carpenter become almost the same. This is because when minsup becomes very low, the part of the search space that can be pruned by minsup becomes small, so the time saved by this kind of pruning in TD-Close becomes less.

Compared with data set D4000T100C12, data set D4000T100C8 is relatively denser: with the same minsup threshold, the latter produces more frequent patterns. Figure 5.8 shows almost the same behavior of TD-Close and FPclose, but obviously both take much more time than on the former data set for the same minsup threshold. This is also the reason why we cannot get the exact runtime of Carpenter on it.

From these experimental results, one can see that for some relatively high minsup values, FPclose is a little faster than TD-Close. But in those cases the runtime of both TD-Close and FPclose is usually less than one minute, or only several minutes, so the difference is not that significant. However, when minsup becomes low, TD-Close clearly outperforms both FPclose and Carpenter.

5.2 Real Data Sets

Besides testing on the synthetic data sets, we also tested our algorithm on three real microarray data sets: clinical data on ALL-AML leukemia, lung cancer and breast cancer [5, 6]. We take the corresponding training data sets for our experiments. ALL-AML has 38 rows and 7129 dimensions, Lung Cancer has 32 rows and 12533 dimensions, and Breast Cancer has 78 rows and 24481 dimensions. Before the frequent pattern mining algorithms are applied, the data sets are discretized using the equi-depth binning method. To test the performance on different cardinalities, we produce two sets of discretized data sets, one with five bins per dimension and another with 10 bins per dimension.
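For reference, equi-depth (equal-frequency) binning assigns roughly the same number of values to each bin per dimension. The following is a minimal sketch of this discretization (our illustration, assuming numeric values, not the paper's preprocessing code):

# Illustrative equi-depth discretization of one dimension into k bins.
def equi_depth_bins(values, k):
    """Map each value to one of k bins so bins hold ~equal numbers of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)   # bin = scaled rank of the value
    return bins

# Six expression values, three bins: two values per bin.
print(equi_depth_bins([0.2, 0.5, 0.1, 0.9, 0.4, 0.7], 3))   # [0, 1, 0, 2, 1, 2]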
The first group of experiments is done on these three data sets with 5 bins per dimension. Figures 5.9 to 5.11 show the runtimes of TD-Close, Carpenter, and FPclose at different minsup values. Note that the y-axes in these figures are on a logarithmic scale, and we plot the figures as minsup increases so that the pruning power of minsup can be seen clearly.

Figure 5.9 shows that as minsup increases, the runtimes of both TD-Close and FPclose decrease dramatically, while the runtime of Carpenter remains relatively stable. This is because Carpenter still needs to search the rowset space in which the size of each rowset is less than minsup, although such rowsets obviously cannot satisfy minsup. This figure also indicates that among the three algorithms, TD-Close is the fastest. We did not get the runtime of FPclose when minsup is
7, because when minsup is 8 it already spends a much longer time than the other two algorithms.

[Figures 5.9 to 5.11: runtime (s) vs. minsup for TD-Close, Carpenter, and FPclose on the real data sets.]

For the row enumeration algorithms TD-Close and Carpenter, only rowsets are checked instead of itemsets, and the number of rowsets does not change that dramatically.

For the breast cancer data set, we cannot get the runtime for FPclose when minsup is equal to or less than 16 because of a memory error (it runs out of memory). Similarly, Carpenter cannot run to completion when [...]

6 Related Work

[...] algorithms can use the minsup threshold to stop further searching the supersets of an itemset once the itemset does not satisfy the minsup threshold. However, for very high dimensional data sets, since n becomes very large, [...]
based classification rules. Farmer also searches the row enumeration tree in depth-first order. In [7], algorithm Cobbler is proposed to find frequent closed itemsets by integrating the row enumeration method with the column enumeration method. It shows high performance in experiments on a data set with high dimensionality and a relatively large number of rows. In [8], an algorithm is proposed to find the top-k classification rules for each row of a microarray data set. All of these algorithms aim to facilitate the mining of frequent patterns by searching the row enumeration space, and they all search that space in a bottom-up style.

7 Conclusions

In this paper we propose a top-down search strategy for mining frequent closed patterns from very high dimensional data such as microarray data. Existing algorithms, such as Carpenter and several other related algorithms, adopt a bottom-up fashion to search the row enumeration space, which makes the pruning power of the minimum support threshold (minsup) very weak and therefore results in a long mining process, even for high minsup, and a high memory cost. To solve this problem, based on our top-down search strategy, a top-down style row enumeration method and an effective closeness-checking method are proposed. A new algorithm, TD-Close, is designed and implemented for mining a complete set of frequent closed itemsets from high dimensional data. Several pruning strategies are developed to speed up the search. Both analysis and experimental study show that these methods are effective and useful. Future work includes integrating the top-down row enumeration method and the column enumeration method for frequent pattern mining from datasets that are both long and deep, and mining classification rules based on association rules using the top-down search strategy.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pp. 487-499, Sept. 1994.
[2] C. Niehrs and N. Pollet. Synexpression groups in eukaryotes. Nature, 402: 483-487, 1999.
[3] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. of the 8th Intl. Conf. Intelligent Systems for Molecular Biology, 2000.
[4] J. Yang, H. Wang, W. Wang, and P. S. Yu. Enhanced biclustering on gene expression data. In Proc. of the 3rd IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Washington, DC, Mar. 2003.
[5] F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. J. Zaki. CARPENTER: Finding closed patterns in long biological datasets. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003.
[6] G. Cong, A. K. H. Tung, X. Xu, F. Pan, and J. Yang. FARMER: Finding interesting rule groups in microarray datasets. In Proc. 23rd ACM Int. Conf. Management of Data, 2004.
[7] F. Pan, A. K. H. Tung, G. Cong, and X. Xu. COBBLER: Combining column and row enumeration for closed pattern discovery. In Proc. 2004 Int. Conf. on Scientific and Statistical Database Management (SSDBM'04), Santorini Island, Greece, June 2004, pp. 21-30.
[8] G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. In Proc. 24th ACM Int. Conf. Management of Data, 2005.
[9] C. Creighton and S. Hanash. Mining gene expression databases for association rules. Bioinformatics, 19, 2003.
[10] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), pp. 11-20, Dallas, TX, May 2000.
[11] M. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed association rule mining. In Proc. of 2002 SIAM Data Mining Conf., 2002.
[12] J. Han and J. Pei. Mining frequent patterns by pattern growth: methodology and implications. KDD Exploration, 2, 2000.
[13] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. 7th Int. Conf. Database Theory (ICDT'99), Jan. 1999.
[14] J. Li and L. Wong. Identifying good diagnostic genes or gene groups from gene expression data by using the concept of emerging patterns. Bioinformatics, 18:725-734, 2002.
[15] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In Proc. of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, Nov. 2003.
[16] http://fimi.cs.helsinki.fi/fimi03/
[17] http://fimi.cs.helsinki.fi/
[18] M. J. Zaki. Scalable algorithms for association mining. IEEE Trans. Knowledge and Data Engineering, 12:372-390, 2000.
[19] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.