Fast Algorithms For Mining Association Rules: Milan Garg Rohit Das Sarthak Mittal

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 16

Fast Algorithms for Mining Association Rules

Rakesh Agarwal Ramakrishnan Srikant

Milan Garg
Rohit Das
Sarthak Mittal
Introduction

Bar Code technology has brought in a lot of scope for retail organizations to
store large amounts of data which is referred to as basket data. We use
association mining rules over basket data to find most frequently occurring
rules over the dataset. Eg People who buy tires also get car services done.

Finding such rules helps a lot for attached mail and cross market services. The
databases involved are very large and hence we need fast algorithms for this
task.
Problem Statement
Given a database of transactions , find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction.

TID Items
{Diaper} -> {Beer}
1 Bread,Milk

2 Bread,Diaper,Beer,Eggs {Milk,Bread} -> {Eggs,Coke}

3 Milk,Diaper,Beer,Coke {Beer,Bread} -> {Milk}

4 Bread,Milk,Diaper,Beer

5 Bread,Milk,Diaper,Coke
Motivation

Walmart has lots of data about point of sales

Data on each user

Basket of items that has been bought during each visit.

Similarly, phone companies have information available about phone


calls made at a particular time to a location

How can the above information be leveraged for better logistics,


better placement of items on shelves etc.
L1 = {large 1-itemsets};

for( k=2; Lk-1; k++) do begin

Ck = apriori-gen(Lk-1); // New candidates

Algorithm for forall transactions t D do // Counting support

Apriori Ct = subset(Ck, t); // Candidates in t

forall candidates c Ct do

c.count++;

end

Lk = {c Ck | c.count minsup}

End

Answer = k Lk;
Flowchart for Apriori Algorithm

Get Frequent Generate Candidate Get Frequent Item


Start
Items Item set sets

No

Yes Generate
End Generate Strong
set = NULL
Rules
Apriori Candidate Generation

Takes in Lk-1 and returns Ck. Candidate Generation (Join)

Two steps:

Join large itemsets Lk-1 with Lk-1.

Prune out all itemsets in joined result


which contain a (k-1)subset not found in
Lk-1.
Candidate Generation (Example)

L3 C4 C4

{Beer, Butter, Coke} Join {Beer, Butter, Coke, Cheese} Prune {Beer, Butter, Coke, Cheese}

{Beer, Butter, Cheese} {Beer, Coke, Cheese, Maggi}

{Beer, Coke, Cheese}

{Beer, Coke, Maggi}

{Butter, Coke, Cheese}


Hash-tree Example
Depth
Beer Butter 1
L3

{Beer, Butter, Coke}

{Beer, Butter, Cheese} Butter Coke {Butter, Coke,


2
Cheese}
{Beer, Coke, Cheese}

{Beer, Coke, Maggi} {Beer, Butter, Coke} {Beer, Coke, Cheese}


{Beer, Butter, Cheese} {Beer, Coke, Maggi} 3
{Butter, Coke, Cheese}
AprioriTid

Does not use the transactions in the database for counting itemset
support.

Instead stores transactions as sets of possible large itemsets, Ck.

Each member of Ck is of the form:

< TID, {Xk}> , Xk is a possible large itemset


AprioriTid Example

TID Items TID Set of Itemsets Itemset Suport

100 134 100 {{1} ,{3} ,{4}} {1} 2

200 235 200 {{2} ,{3} ,{5}} {2} 3

300 1235 300 {{1} ,{2} ,{3} ,{5}} {3} 3

400 25 400 {{2} ,{5}} {5} 3


AprioriTid Example
Itemsets TID Set of Itemsets
Itemset Suport
{12} 100 {{1 3}}
{1 3} 2
{13} 200 {{2 3} ,{2 5} ,{3 5}}
{2 3} 2
{15} 300 {{1 2} ,{1 3} ,{1 5} {2
{2 5} 3
3} ,{2 5} ,{3 5}}
{23}
{3 5} 2
400 {{2 5}}
{25}

{35}
TID Set-of-Items ItemSet Support
Itemset 200 {{2 3 5}} {2 3 5} 2
{2 3 5} 300 {{2 3 5}}
AprioriHybrid

It is not necessary to use the same algorithm for all


the passes.

Combine the two algorithms!

Start with Apriori

When Ck can fit in main memory switch to


AprioriTID
Apriori Advantages/Disadv
Advantages: Disadvantages:
This algorithm has less It requires many scans of
memory consumption. database.
Easy implementation. It allows only a single
It uses Apriori property for minimum support threshold.
pruning therefore, itemsets It is favourable only for
left for further support small database.
checking remain less. It explains only the presence
or absence of an item in the
database.
Conclusion

The performance gap increased with the problem size


and ranged from a factor of three for small problems
to more than an order of magnitude for large
problems

The algorithms presented in the paper have been


implemented on several data repositories and were
found to give consistent results.
Thanks!

You might also like