An Implementation of the FP-growth Algorithm
Christian Borgelt
Department of Knowledge Processing and Language Engineering
School of Computer Science, Otto-von-Guericke-University of Magdeburg
Universitätsplatz 2, 39106 Magdeburg, Germany
borgelt@iws.cs.uni-magdeburg.de
  a d f       d: 8     d a
  a c d e     b: 7     d c a e
  b d         c: 5     d b
  b c d       a: 4     d b c
  b c         e: 3     b c
  a b d       f: 2     d b a
  b d e       g: 1     d b e
  b c e g              b c e
  c d f                d c
  a b d                d b a

Table 1: Transaction database (left), item frequencies (middle), and reduced transaction database with items in transactions sorted descendingly w.r.t. their frequency (right).

[Figure 1 (diagram not reproduced): FP-tree for the (reduced) transaction database shown in Table 1.]

Of course, this is not the only way in which the initial FP-tree can be built. At first sight it may seem to be more natural to build it by inserting transaction after transaction into an initially empty FP-tree, creating the necessary nodes for each new transaction. Indeed, such an approach even has the advantage that the transaction database need not be loaded in a simple form (for instance, as a list of integer arrays) into main memory. Since only one transaction is processed at a time, only the FP-tree representation and one new transaction are in main memory. This usually saves space, because an FP-tree is often a much more compact representation of a transaction database.

Nevertheless I decided against such a representation for the following reasons: in order to build a prefix tree by sequentially adding transactions, one needs pointers from parent nodes to child nodes, so that one can descend in the tree according to the items present in the transaction. However, this is highly disadvantageous. As we will see later on, the further processing of an FP-tree, especially the main operation of projecting it, does not need such parent-to-child pointers in my implementation, but rather child-to-parent pointers. Since each node in an FP-tree (with the exception of the roots) has exactly one parent, this, in principle, makes it possible to work with nodes of constant size. If, however, we have to accommodate an array of child pointers per node, the nodes either have variable size or are unnecessarily large (because we have pointers that are not needed), rendering the memory management much less efficient.
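To illustrate this argument, a constant-size node that stores only a child-to-parent pointer might be declared as in the following C sketch. The field names and the exact choice of fields are assumptions made for this example (the level-list successor and the auxiliary pointer anticipate the projection operation discussed later); they are not claimed to match the actual implementation.

```c
/* Sketch of a constant-size FP-tree node: instead of an array of child
 * pointers it stores a single child-to-parent pointer, so every node
 * has the same size.  Field names are illustrative only.              */
typedef struct fpnode {
    int            item;    /* item identifier                             */
    int            cnt;     /* number of transactions counted in this node */
    struct fpnode *parent;  /* child-to-parent pointer (NULL at the top)   */
    struct fpnode *succ;    /* next node on the same level (level list)    */
    struct fpnode *aux;     /* auxiliary pointer used during projection    */
} FPNODE;
```

Because every node occupies the same amount of memory, such nodes can be handed out by a very simple allocator, which is what the specialized memory management described below exploits.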
[Figure 2 (diagram not reproduced): the example FP-tree of Figure 1 together with the “shadow” FP-tree created when it is projected with the first method.]

…memory objects. The idea is to allocate larger arrays (with several thousand elements) of these objects and to organize the elements into a “free” list (i.e., a list of available memory blocks of equal size). With such a system, allocating and deallocating FP-tree nodes gets very efficient: the former retrieves (and removes) the first element of the free list, the latter adds the node to deallocate at the beginning of the free list. As experiments showed, introducing this specialized memory management led to a considerable speed-up.
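As a concrete illustration of such a free-list scheme, the following C sketch manages equally sized memory blocks in the way described above: an array of several thousand blocks is allocated at once and its elements are chained into a free list; allocation unlinks the first element, deallocation pushes a block back onto the front. The names, the block count, and the block layout are assumptions for this example, not taken from the actual implementation, and releasing the arrays at the end is omitted.

```c
#include <stdlib.h>

#define BLKCNT 4096                  /* number of blocks allocated per array */

typedef union block {                /* a block is either a free-list link   */
    union block *next;               /* or, while in use, node memory        */
    char         mem[32];            /* payload; must be >= sizeof(node)     */
} BLOCK;

static BLOCK *freelist = NULL;       /* list of available blocks */

void *node_alloc (void)              /* get one node-sized memory block */
{
    BLOCK *b;
    if (!freelist) {                 /* if no free block is left,        */
        BLOCK *a = malloc(BLKCNT * sizeof(BLOCK));
        int    i;                    /* allocate a new array and chain   */
        if (!a) return NULL;         /* its elements into the free list  */
        for (i = 0; i < BLKCNT-1; i++) a[i].next = a+i+1;
        a[BLKCNT-1].next = NULL;
        freelist = a;
    }
    b = freelist;                    /* unlink and return the first */
    freelist = b->next;              /* element of the free list    */
    return b;
}

void node_free (void *p)             /* return a node-sized memory block */
{
    ((BLOCK*)p)->next = freelist;    /* push the block onto the front */
    freelist = (BLOCK*)p;            /* of the free list              */
}
```

Both operations are a couple of pointer assignments, which is what makes creating and discarding the many nodes of projected FP-trees so cheap.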
The first method is illustrated in Figure 2 for the example FP-tree shown in Figure 1. The red arrows show the flow of the processing and the blue “shadow” FP-tree is the created projection. In an outer loop, the lowest level of the FP-tree, that is, the list of nodes corresponding to the projection item, is traversed. For each node of this list, the parent pointers are followed to traverse all ancestors up to the root. Each encountered ancestor is copied and linked from its original (this is what the auxiliary pointer in each node, which was mentioned above, is needed for). During the copying, the parent pointers of the copies are set, the copies are also organized into level lists, and a sum of the counter values in each node is computed in head elements for these lists (these head elements are omitted in Figure 2).

Note that the counters in the copied nodes are determined only from the counters in the nodes on the deepest level, which are propagated upwards, so that each node receives the sum of its children. Note also that due to this we cannot stop following the chain of ancestors at a node that has already been copied, even though it is clear that in this case all ancestors higher up in the FP-tree must already have been copied. The reason is that one has to update the number of transactions in the copies, adding the counter value from the current branch to all copies of the ancestors on the path to the root. This is what the second projection method tries to improve upon.

In a second traversal of the same branches, carried out in exactly the same manner, the copies are detached from their originals (the auxiliary pointers are set to null), which yields the independent projected FP-tree shown in Figure 3. This …

The second projection method also traverses, in an outer loop, the deepest level of the FP-tree. However, it does not follow the chain of parent pointers up to the root, but only copies the parent of each node, not its higher ancestors. In doing so, it also copies the parent pointers of the original FP-tree nodes, thus making it possible to find the ancestors in later steps. These later steps consist in traversing the levels of the (partially constructed) “shadow” FP-tree (not the levels of the original one!) from bottom to top. On each level the parents of the copied nodes (which are nodes in the original tree) are determined and copied, and the parent pointers of the copies are set. That is, instead of branch by branch, the FP-tree is rather constructed level by level (even though in each step nodes on several levels may be created). The advantage of this method over the one described above is that, for branches that share a path close to the root, this common path has to be traversed only once (as the counters for all branches are summed before they are passed to the next higher level). However, the experiments reported below show that the first method is superior in practice. As it seems, the additional effort needed for temporarily setting another parent etc. more than outweighs the advantage of the better combination of the counter values.

5. PRUNING A PROJECTED FP-TREE
After we have obtained an FP-tree of a projected database, we may carry out an additional pruning step in order to simplify the tree, thus speeding up projections. I got this idea from [4], which introduces pruning techniques in a slightly different context than pure frequent item set mining (suffice it to say that there are additional constraints). One of these techniques, however, can nevertheless be used here, …
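To make the first (branch-by-branch) projection method described above more concrete, the following C sketch illustrates the idea. The node layout, all names, and the use of the item identifier as a level index are assumptions for this illustration and do not reproduce the author's code; updating the counter sums in the level-list head elements and all error handling are omitted.

```c
#include <stdlib.h>

/* Illustrative node layout (cf. the node sketch given earlier);
 * field names are assumptions, not the actual implementation.   */
typedef struct fpnode {
    int            item;     /* item identifier (used as level index here) */
    int            cnt;      /* number of transactions                     */
    struct fpnode *parent;   /* child-to-parent pointer (NULL at the top)  */
    struct fpnode *succ;     /* next node in the same level list           */
    struct fpnode *aux;      /* link from an original node to its copy     */
} FPNODE;

/* Project w.r.t. the item whose (deepest) level list starts at 'deepest'.
 * 'levels' receives the level lists of the created "shadow" FP-tree.      */
static void project (FPNODE *deepest, FPNODE **levels)
{
    FPNODE *node, *anc, *copy, *below;

    for (node = deepest; node; node = node->succ) {  /* outer loop over the */
        int cnt = node->cnt;           /* deepest level; counter of branch  */
        below   = NULL;                /* copy made one level further down  */
        for (anc = node->parent; anc; anc = anc->parent) {
            copy = anc->aux;           /* get the copy of this ancestor     */
            if (copy)                  /* if it already exists, only add    */
                copy->cnt += cnt;      /* the counter of the current branch */
            else {                     /* otherwise create a copy, link it  */
                copy = malloc(sizeof(FPNODE));   /* from its original via   */
                copy->item   = anc->item;        /* the auxiliary pointer,  */
                copy->cnt    = cnt;              /* and add it to its       */
                copy->parent = NULL;             /* level list              */
                copy->aux    = NULL;
                copy->succ   = levels[anc->item];
                levels[anc->item] = copy;
                anc->aux     = copy;
            }
            if (below) below->parent = copy;  /* set the parent pointer of  */
            below = copy;                     /* the copy one level below   */
        }
    }
    /* second traversal of the same branches: detach the copies from
     * their originals, yielding an independent projected FP-tree     */
    for (node = deepest; node; node = node->succ)
        for (anc = node->parent; anc; anc = anc->parent)
            anc->aux = NULL;
}
```

The second method would differ only inside the inner loop: it copies just the immediate parent of each node and completes the higher levels afterwards, level by level, as described above.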
[Diagrams not reproduced: an FP-tree (cf. Figure 3) and result plots of log(time/s) over support (cf. Figures 5 to 9).]
6. EXPERIMENTAL RESULTS
Figures 5 to 9 show, each for one of the five data sets, the decimal logarithm of the execution time over different (absolute) minimum support values. The solid black line refers to the implementation of the FP-growth algorithm described here, the dotted black line to the version that uses the alternative projection method. The grey lines represent the corresponding results for Apriori (solid line), Eclat (dashed line), and Relim (dotted line).¹

Among these implementations, all of which are highly optimized, FP-growth clearly performs best. With the exception of the artificial dataset T10I4D100K, on which it is beaten by a considerable margin by Relim, and of higher support values on BMS-Webview-1, where Relim also performs slightly better (presumably because it does not need to construct a prefix tree), FP-growth is the clear winner. Only on chess can Eclat come sufficiently close to be called competitive.

The second projection method for FP-trees (dotted black line) generally fares worse, although there is not much difference between the two methods on chess and mushroom. This is a somewhat surprising result, because there are good reasons to believe that the second projection method may be able to yield better results than the first. I plan to examine this issue in more detail in the future.

7. CONCLUSIONS
In this paper I described an implementation of the FP-growth algorithm, which contains two methods for efficiently projecting an FP-tree, the core operation of the FP-growth algorithm. As the experimental results show, this implementation clearly outperforms Apriori and Eclat, even in highly optimized versions. However, the performance of the two projection methods, especially why the second is sometimes much slower than the first, needs further investigation.

8. PROGRAM
The implementation of the FP-growth algorithm described in this paper (Windows™ and Linux™ executables as well as the source code, distributed under the LGPL) can be downloaded free of charge at

http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html

At this URL my implementations of Apriori, Eclat, and Relim are also available, as well as a graphical user interface (written in Java) for finding association rules with Apriori.
9. REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proc. Conf. on Management of Data, 207–216. ACM Press, New York, NY, USA 1993
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast Discovery of Association Rules. In: [7], 307–328
[3] C.L. Blake and C.J. Merz. UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, University of California at Irvine, CA, USA 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
[4] F. Bonchi and B. Goethals. FP-Bonsai: the Art of Growing and Pruning Small FP-trees. Proc. 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’04, Sydney, Australia), 155–160. Springer-Verlag, Heidelberg, Germany 2004
[5] C. Borgelt. Efficient Implementations of Apriori and Eclat. Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL). CEUR Workshop Proceedings 90, Aachen, Germany 2003. http://www.ceur-ws.org/Vol-90/
[6] C. Borgelt. Recursion Pruning for the Apriori Algorithm. Proc. 2nd IEEE ICDM Workshop on Frequent Item Set Mining Implementations (FIMI 2004, Brighton, United Kingdom). CEUR Workshop Proceedings 126, Aachen, Germany 2004. http://www.ceur-ws.org/Vol-126/
[7] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds. Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press, Cambridge, MA, USA 1996
[8] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. Proc. Conf. on the Management of Data (SIGMOD’00, Dallas, TX). ACM Press, New York, NY, USA 2000
[9] R. Kohavi, C.E. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 Organizers’ Report: Peeling the Onion. SIGKDD Explorations 2(2):86–93, 2000
[10] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast Discovery of Association Rules. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD’97), 283–296. AAAI Press, Menlo Park, CA, USA 1997
[11] Synthetic Data Generation Code for Associations and Sequential Patterns. Intelligent Information Systems, IBM Almaden Research Center. http://www.almaden.ibm.com/software/quest/Resources/index.shtml
¹ Relim is described in a sibling paper that has also been submitted to this workshop.