
Lecture 5: FP-Growth Algorithm

FP-Growth Algorithm
• Uses a Frequent Pattern Tree (FP-tree) as a compact representation of the input database
• The root node is labelled null
• Each other node comprises an item name, a support counter, a node-link pointer (to the next node in the tree carrying the same item) and a parent-link pointer (to the parent of the current node)
• A header table is constructed to serve as an index to the FP-tree: each entry consists of a frequent item name, the item's support count and a pointer to the head of that item's node-link chain

• Only two scans of the database are required to construct the FP-tree

• Frequent itemsets are derived from the FP-tree (a sketch of these structures follows)
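A minimal Python sketch of the node and header-table structures described above; the class and variable names (FPNode, header_table) are illustrative assumptions, not from the lecture:

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item name (None for the null root)
        self.count = 1          # support counter
        self.parent = parent    # parent-link pointer
        self.children = {}      # child nodes, keyed by item name
        self.node_link = None   # next node carrying the same item

root = FPNode(None, None)       # root node labelled null

# Header table: item -> [support count, head of node-link chain]
header_table = {}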


A review of a tree structure

[Figure: a generic tree, with the root node at the top]
FP-Growth Algorithm
Phase 1: FP-tree construction
1. Construct an FP-tree with a header table
Phase 2: Mining frequent itemsets
1. Mine frequent itemsets from the FP-tree by pattern fragment growth
Phase 1: FP-Growth Algorithm
• FP-tree construction algorithm (a code sketch follows this list)
1. Collect individual items, count their supports and obtain a list of frequent items;
2. Sort the items in descending order of support and save them to header table L;
3. Create the root node for tree T and label it null;
4. For each transaction t do the following:
a) Sort t in the same order as in L;
b) For each item of the transaction, in that order:
• if a node for the item already exists, increase its counter value by 1; otherwise create a new node for the item with a default counter value of 1;
• amend the node-link and parent-link pointer references accordingly.
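A minimal sketch of this construction in Python, reusing the hypothetical FPNode class above; build_fp_tree is an illustrative name, not from the lecture:

from collections import Counter

def build_fp_tree(transactions, min_sup):
    # First scan: count supports and keep the frequent items
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    # Header table L, in descending order of support (ties broken by item)
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    header = {i: [frequent[i], None] for i in order}

    root = FPNode(None, None)
    # Second scan: insert each transaction along the sorted order
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=order.index)
        node = root
        for item in items:
            if item in node.children:      # node already exists: count + 1
                node.children[item].count += 1
            else:                          # otherwise create a new node
                child = FPNode(item, node)
                node.children[item] = child
                # amend the node-link chain in the header table
                child.node_link = header[item][1]
                header[item][1] = child
            node = node.children[item]
    return root, header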
FP-Growth Algorithm
• First database scan – determine frequent 1-itemsets (min sup = 2); a small code sketch follows the tables

Transactions:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Candidate 1-itemsets C:
Items  Support
1      2
2      3
3      3
4      1
5      3

Frequent 1-itemsets F (sorted in descending order of support):
Items  Support
2      3
3      3
5      3
1      2

Header Table
Items  Head of Node-link
2
3
5
1
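The first scan is straightforward to express in code; a small sketch using the example transactions (variable names are illustrative):

from collections import Counter

tx = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
counts = Counter(i for t in tx for i in t)    # C: {1: 2, 3: 3, 4: 1, 2: 3, 5: 3}
F = sorted((i for i in counts if counts[i] >= 2),
           key=lambda i: (-counts[i], i))     # [2, 3, 5, 1], as in the table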
FP-Growth Algorithm
• Second scan of database – building the FP-tree and linking the header table
Insert TID 100 = {1, 3, 4}; item 4 is infrequent, so the sorted items (by L) are 3, 1:

null
 └─ 3:1
      └─ 1:1

The header table's node-links for 3 and 1 now point to these new nodes.
FP-Growth Algorithm
• Building the FP-tree (continued)
Insert TID 200 = {2, 3, 5}; sorted: 2, 3, 5 – a new branch from the root:

null
 ├─ 3:1
 │    └─ 1:1
 └─ 2:1
      └─ 3:1
           └─ 5:1
FP-Growth Algorithm
• Building the FP-tree (continued)
Insert TID 300 = {1, 2, 3, 5}; sorted: 2, 3, 5, 1 – the transaction shares the 2-3-5 path, so those counters are incremented and a new node 1:1 is appended:

null
 ├─ 3:1
 │    └─ 1:1
 └─ 2:2
      └─ 3:2
           └─ 5:2
                └─ 1:1
FP-Growth Algorithm
• Building the FP-tree (continued)
Insert TID 400 = {2, 5}; sorted: 2, 5 – the counter of 2 is incremented and a new node 5:1 is created under it. The completed FP-tree (a code snippet reproducing it follows):

null
 ├─ 3:1
 │    └─ 1:1
 └─ 2:3
      ├─ 3:2
      │    └─ 5:2
      │         └─ 1:1
      └─ 5:1

Header Table (each entry's node-link chains all nodes of that item)
Items  Head of Node-link
2      → 2:3
3      → 3:1 → 3:2
5      → 5:2 → 5:1
1      → 1:1 → 1:1
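Running the hypothetical build_fp_tree sketch from earlier on these transactions reproduces this tree:

tx = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
root, header = build_fp_tree(tx, min_sup=2)
# root has children 3:1 and 2:3, matching the diagram above
print({i: root.children[i].count for i in root.children})   # {3: 1, 2: 3}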
FP-Growth Algorithm
• FP-Growth for frequent itemsets (a code sketch follows this list)
Start from the bottom of the header table and work towards the top; for each item in the header table:
1. Create a frequent itemset from the item and its total support;
2. Obtain the subtree with the item as its leaves;
3. If the resulting tree consists of a single path:
a) Generate all combinations of the items along the path;
b) Attach the suffix to each combination to form a frequent itemset;
c) Set the support to the lowest count among the nodes involved.
Otherwise:
a) Determine the conditional pattern base of the suffix;
b) Construct a conditional FP-tree for the item;
c) Repeat the process from step 2.
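A Python sketch of this mining phase, continuing the hypothetical FPNode and build_fp_tree helpers from the construction sketch. The single-path case of step 3 is handled implicitly here: recursing on a single-path conditional tree enumerates all combinations of its items, with the supports coming from the node counts:

def prefix_paths(header, item):
    # Conditional pattern base: prefix paths in the tree co-occurring
    # with `item`, gathered by walking the item's node-link chain
    paths = []
    node = header[item][1]
    while node is not None:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        paths.append((path[::-1], node.count))
        node = node.node_link
    return paths

def fp_growth(root, header, min_sup, suffix=()):
    # Work from the bottom of the header table towards the top
    results = {}
    for item in reversed(list(header)):
        new_suffix = (item,) + suffix
        results[new_suffix] = header[item][0]   # the item and its total support
        # Conditional pattern base, expanded by path counts for simplicity
        base = [p for path, c in prefix_paths(header, item)
                for p in [path] * c]
        # Conditional FP-tree for the item; repeat the process on it
        cond_root, cond_header = build_fp_tree(base, min_sup)
        if cond_header:
            results.update(fp_growth(cond_root, cond_header,
                                     min_sup, new_suffix))
    return results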
Terminology
• Suffix
• Given a string pattern “ABGED”:
• D is a suffix, and ABGE is the prefix of D;
• ED is a suffix, and ABG is the prefix of ED.

• Conditional pattern base – the set of prefix paths in the FP-tree co-occurring with the suffix pattern.
Terminology
• Conditional FP-tree – an FP-tree constructed from the conditional pattern base.
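Using the hypothetical prefix_paths helper from the mining sketch, the conditional pattern base of item 5 in the running example can be read off directly:

print(prefix_paths(header, 5))   # e.g. [([2], 1), ([2, 3], 2)]

Each pair is a prefix path co-occurring with 5, together with its count.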
FP-Growth Algorithm
• FP-Growth (Example) – suffix item 1 (bottom of the header table)

Prefix paths ending in 1, taken from the completed FP-tree:

null
 ├─ 3:1
 │    └─ 1:1
 └─ 2:3
      └─ 3:2
           └─ 5:2
                └─ 1:1

Conditional pattern base of 1: {(3): 1, (2 3 5): 1}
Conditional FP-tree for 1: null → 3:2 (only item 3 reaches min sup)
Frequent itemsets generated: ({1}, 2), ({1,3}, 2)
FP-Growth Algorithm
• FP-Growth (Example) – suffix item 5

Prefix paths ending in 5:

null
 └─ 2:3
      ├─ 3:2
      │    └─ 5:2
      └─ 5:1

Conditional pattern base of 5: {(2 3): 2, (2): 1}
Conditional FP-tree for 5: null → 2:3 → 3:2 (a single path, so all combinations are generated)
Frequent itemsets generated: ({5}, 3), ({2,5}, 3), ({3,5}, 2), ({2,3,5}, 2)
FP-Growth Algorithm
• FP-Growth (Example) – suffix item 3

Prefix paths ending in 3 (counts adjusted to the occurrences of 3):

null
 ├─ 3:1
 └─ 2:2
      └─ 3:2

Conditional pattern base of 3: {(2): 2}
Conditional FP-tree for 3: null → 2:2
Frequent itemsets generated: ({3}, 3), ({2,3}, 2)
FP-Growth Algorithm
• FP-Growth (Example) – suffix item 2 (top of the header table)

Item 2 has no prefix paths (its node hangs directly off the root), so only a single node remains:

null
 └─ 2:3

Frequent itemset generated: ({2}, 3)

All frequent itemsets found (a code check follows):
F = {({1}, 2), ({1,3}, 2), ({5}, 3), ({3,5}, 2), ({2,5}, 3), ({2,3,5}, 2), ({3}, 3), ({2,3}, 2), ({2}, 3)}
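As a quick check, running the construction and mining sketches from earlier on the four example transactions reproduces this set (using the tuple representation of the sketch):

tx = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
root, header = build_fp_tree(tx, min_sup=2)
print(fp_growth(root, header, min_sup=2))
# e.g. {(1,): 2, (3, 1): 2, (5,): 3, (3, 5): 2, (2, 3, 5): 2, (2, 5): 3, ...}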
FP-Growth Algorithm
• Strengths and Weaknesses
✓ No need to scan the database many times (only twice)
✓ Compact representation of the database in the FP-tree
✓ Very good when the FP-tree is dense without many branches
✓ Supports for frequent itemsets are derived from the supports in the FP-tree nodes
✓ Some tests show it to be several times faster than Apriori
✗ Not very good when the database is sparse (too many branches in the FP-tree)
✗ Resource implications of recursive solutions
✗ Much more complex than Apriori, and therefore quite difficult to implement
Evaluating Association Rule Quality
• Some Evaluators
• Support, coverage, and confidence: strength of posterior association
• Support and confidence alone may not be sufficient
• Lift measure:

lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y) = support(X ∪ Y) / (support(X) × support(Y))

where support is expressed as a fraction

• Lift = 1: X and Y are independent; Lift > 1: the occurrence of Y is positively related to the occurrence of X; Lift < 1: the occurrence of Y is related to the absence of X.
e.g. {2,3} ⇒ {5}: confidence = 100% and sup({5}) = 75%
Lift({2,3} ⇒ {5}) = 100/75 ≈ 1.33 (a sketch of this calculation follows)
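A small illustration of the lift calculation, computing fractional supports from the four example transactions (the function name is illustrative):

def lift(transactions, X, Y):
    n = len(transactions)
    # support of an itemset as the fraction of transactions containing it
    sup = lambda s: sum(s <= set(t) for t in transactions) / n
    return sup(X | Y) / (sup(X) * sup(Y))

tx = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(lift(tx, {2, 3}, {5}))   # 0.5 / (0.5 * 0.75) ≈ 1.33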
Association Rule Mining in Practice
• Process of an Association Mining Project
1. Objective definition (association between what?)
2. Data preparation (getting the relevant attributes)
3. Identifying transactions and items
4. Setting parameters
5. Conducting the mining
6. Evaluating association rule quality and strength
7. Rule filtering and selection (a real problem at present)
Association Rule Mining in Practice
• Making Association Rule Mining Efficient
• Exponential time complexity
• Combinatorial nature of the problem: big search space
• Algorithm improvement: limited success so far
• Implementation techniques: limited scale of improvement through use of good data structures
• Sampling the database: reducing the number of transactions to be used for discovery of rules
• Parallel processing: hardware solution – sharing the workload among multiple processors
Association Rule Mining in Practice
• Interactive Discovery
• Roles of data miners in parameter setting
• Set initial minimum support and minimum confidence thresholds
• May fine-tune thresholds during the discovery phase based on understanding of the rules discovered and the number of rules found
• Roles of data miners in constraint specification
• Left-hand-side item specification in X
e.g. what other items do customers buy if they buy a T-shirt?
• Right-hand-side item specification in Y
e.g. what items lead to the sales of T-shirts?
• Item specification in both X and Y
e.g. verify the truth of Jeans ⇒ T-shirt
Association Rule Mining in Practice
• Various Boolean Rules
• Rules leading to a course of action (e.g. banana ⇒ tomato)
• Rules having no course of action (i.e. rules with inexplicable associations – meanings that are not easy to explain)
• Rules that state known knowledge (e.g. link_to_homepage ⇒ link_to_main_picture: the main picture is ON the homepage)
• Anonymous rules (no transaction IDs identified) vs. signature rules (transaction IDs identified)
Summary
• Boolean association rules state associations between the presence of certain items and the presence of some other items
• Association rules are measured and evaluated in terms of support, confidence and lift
• The most significant part of association rule mining is the discovery of frequent itemsets
• The Apriori algorithm for finding frequent itemsets follows a create-and-test greedy approach. The FP-Growth algorithm offers an alternative by traversing an FP-tree without generating candidate itemsets
• The Apriori approach is also used for generating rule expressions
• Association rule mining faces practical problems and issues
Useful Further References

• Read Chapter 8 of Data Mining Techniques and Applications
• Tan, P.-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapter 6
• Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, Chapter 5
