CH 03 Frequent Pattern Mining 2021


Frequent Pattern Mining

Association Rule Problem

• Given a database of transactions, find all the
  association rules that hold in it.

Applications
• Market Basket Analysis: given a database of customer
  transactions, where each transaction is a set of items, the goal
  is to find groups of items that are frequently purchased
  together.
• Telecommunication (each customer is a transaction
  containing the set of phone calls)
• Credit Cards / Banking Services (each card/account is a
  transaction containing the set of the customer's payments)
• Medical Treatments (each patient is represented as a
  transaction containing the ordered set of diseases)
• Basketball-Game Analysis (each game is represented as a
  transaction containing the ordered set of ball passes)
Association Rule Definitions

· I = {i1, i2, ..., in}: the set of all items
· Transaction T: a set of items such that T ⊆ I
· Transaction Database D: a set of transactions
· A transaction T ⊆ I contains an itemset X ⊆ I
  if X ⊆ T
· An Association Rule is an implication of the
  form X ⇒ Y, where X, Y ⊆ I
Association Rule Definitions
Frequent pattern:
A pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set.

· The support s of an itemset X is the percentage of transactions
  in the transaction database D that contain X.
· The support of the rule X ⇒ Y in the transaction database D is
  the support of the itemset X ∪ Y in D.
· The confidence of the rule X ⇒ Y in the transaction database D
  is the ratio of the number of transactions in D that contain X ∪ Y
  to the number of transactions that contain X in D.
Association Rule Problem

• Given:
― a set I of all the items;
― a database D of transactions;
― minimum support s;
― minimum confidence c;
• Find:
― all association rules X ⇒ Y with minimum
  support s and minimum confidence c.
Problem Decomposition

1. Find all sets of items that have minimum
   support (frequent itemsets).
2. Use the frequent itemsets to generate the
   desired rules.
Problem Decomposition
Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only
2-itemset that satisfies the minimum support.

Frequent Itemset    Support
{Shoes}             75%
{Shirt}             50%
{Jacket}            50%
{Shoes, Jacket}     50%

If the minimum confidence is 50%, then the only two rules generated
from this 2-itemset that have confidence greater than 50% are:
Shoes ⇒ Jacket    Support = 50%, Confidence = 66%
Jacket ⇒ Shoes    Support = 50%, Confidence = 100%
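These numbers are easy to check programmatically. Below is a minimal
Python sketch (the helper names are our own, not from any library)
that computes the support and confidence values in the table above:

transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

def support(itemset, db):
    """Fraction of transactions in db containing every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"Shoes", "Jacket"}, transactions))        # 0.5
print(confidence({"Shoes"}, {"Jacket"}, transactions))   # 0.666...
print(confidence({"Jacket"}, {"Shoes"}, transactions))   # 1.0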
The Apriori Algorithm
• Frequent Itemset Property:
  Any subset of a frequent itemset is frequent.
  The Apriori algorithm uses frequent itemsets to generate
  association rules. It is based on the observation that every
  subset of a frequent itemset must itself be frequent. A frequent
  itemset is an itemset whose support is at least a given threshold
  (the minimum support).
• Contrapositive:
  If an itemset is not frequent, none of its
  supersets are frequent.
Frequent Itemset Property
The Apriori Algorithm
• Lk: set of frequent itemsets of size k (with min support)
• Ck: set of candidate itemsets of size k (potentially frequent
  itemsets)

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in
        Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
return ∪k Lk;
The Apriori Algorithm — Example 1
Database D

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

A dataset D has 4 transactions. Let the minimum support be 50% and
the minimum confidence be 80%. Find all the frequent itemsets and
generate the association rules using the Apriori algorithm.

Min support count = min support threshold × total no. of transactions
                  = (50/100) × 4
                  = 2
The Apriori Algorithm — Example (min support = 50%)

Scan D → C1:               L1 (support ≥ 2):
itemset  sup               itemset  sup
{1}      2                 {1}      2
{2}      3                 {2}      3
{3}      3                 {3}      3
{4}      1                 {5}      3
{5}      3

Join L1 → C2, scan D:      L2 (support ≥ 2):
itemset  sup               itemset  sup
{1 2}    1                 {1 3}    2
{1 3}    2                 {2 3}    2
{1 5}    1                 {2 5}    3
{2 3}    2                 {3 5}    2
{2 5}    3
{3 5}    2

Join L2 → C3 = {2 3 5}; scan D → L3: {2 3 5} with sup 2.
Therefore the largest frequent itemset is
L = {2, 3, 5}.
Now, for strong association rules:
Generate the non-empty proper subsets S of L:
S = {2}, {3}, {5}, {2,3}, {2,5}, {3,5}
and form the rules S ⇒ (L − S).
For example: {2} ⇒ {3,5}.

S ⇒ (L−S)       Support   Confidence   Confidence (%)
{2} ⇒ {3,5}     2         2/3          66.66
{3} ⇒ {2,5}     2         2/3          66.66
{5} ⇒ {2,3}     2         2/3          66.66
{2,3} ⇒ {5}     2         2/2          100
{2,5} ⇒ {3}     2         2/3          66.66
{3,5} ⇒ {2}     2         2/2          100

Minimum confidence threshold = 80%.
Therefore the strong association rules are:
{2,3} ⇒ {5}
{3,5} ⇒ {2}
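Assuming the third-party mlxtend library is available, Example 1 can
be checked end to end. The calls below follow mlxtend's documented
apriori/association_rules usage (exact signatures may vary slightly
across versions); treat this as an illustrative sketch, not part of
the original slides:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]  # database D
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(dataset), columns=te.columns_)

# Frequent itemsets with min support 50%
freq = apriori(df, min_support=0.5, use_colnames=True)

# Strong rules with min confidence 80%: expect {2,3}=>{5} and {3,5}=>{2}
rules = association_rules(freq, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])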
Apriori Example 2
• Consider the following database with minimum support
  count = 60%. Find all the frequent itemsets using the Apriori
  algorithm and also generate strong association rules if the
  minimum confidence = 50%.

TID   Items Bought
T1    M, O, N, K, E, Y
T2    D, O, N, K, E, Y
T3    M, A, K, E
T4    M, U, C, K, Y
T5    C, O, O, K, I, E

Hint: O is bought 4 times in total, but it occurs in just 3
transactions.
Min support count = min support threshold × total no. of transactions
Min support count = (60/100) × 5
Min support count = 3

Step 1: Generate C1
Step 2: Generate L1 from C1
Step 3: Generate C2 from L1 by the join step
Step 4: Generate L2 from C2 by the prune step
Step 5: Generate C3 from L2 by the join step
Step 6: Generate L3 from C3 by the prune step
For strong association rules:

L = {O, K, E}

Generate the non-empty proper subsets of L:
S = {O}, {K}, {E}, {O,K}, {O,E}, {K,E}

Generate the association rules S ⇒ (L − S).

Step 7: Find the association rules with min confidence = 50%

Association Rule   Support   Confidence   Confidence (%)
O,K ⇒ E            3         3/3          100%
O,E ⇒ K            3         3/3          100%
K,E ⇒ O            3         3/4          75%
O ⇒ K,E            3         3/3          100%
K ⇒ O,E            3         3/5          60%
E ⇒ O,K            3         3/4          75%

All the association rules have confidence of more than 50%.
Therefore all the rules are strong association rules.
Apriori Example 3
• Consider the following database with minimum support
  count = 50%. Find all the frequent itemsets using the Apriori
  algorithm and also generate strong association rules if the
  minimum confidence = 50%.

TID   Items Bought
1     A, B, D
2     A, D
3     A, C
4     B, D, E, F
Apriori Example 4
• Find the frequent itemsets and the strong association rules.

TID   Items
1     A, B, D
2     B, C, D
3     A, B
4     B, D
5     A, B, C

• min support = 30%
• min confidence = 75%
How to Generate Candidates
Input: Li-1 : set of frequent itemsets of size i-1
Output: Ci : set of candidate itemsets of size i

Ci = ∅;
for each itemset J in Li-1 do
    for each itemset K in Li-1 s.t. K ≠ J do
        if i-2 of the elements in J and K are equal then
            if all subsets of K ∪ J are in Li-1 then
                Ci = Ci ∪ {K ∪ J};
return Ci;
Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}

• Generating C4 from L3
– abcd from abc and abd
– acde from acd and ace

• Pruning:
– acde is removed because ade is not in L3

• C4={abcd}
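A compact Python version of this join-and-prune procedure (a sketch
under our own naming), reproducing the L3 → C4 example above:

from itertools import combinations

def generate_candidates(L_prev, k):
    """L_prev: set of frozensets of size k-1; returns candidate k-itemsets."""
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            union = a | b
            if len(union) == k:  # a and b agree on k-2 items
                # Prune: every (k-1)-subset must already be frequent
                if all(frozenset(s) in L_prev
                       for s in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(generate_candidates(L3, 4))  # {frozenset({'a','b','c','d'})}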
Example of Discovering Rules
Let us consider the 3-itemset {I1, I2, I5}:
I1 ∧ I2 ⇒ I5
I1 ∧ I5 ⇒ I2
I2 ∧ I5 ⇒ I1
I1 ⇒ I2 ∧ I5
I2 ⇒ I1 ∧ I5
I5 ⇒ I1 ∧ I2
Discovering Rules

for each frequent itemset I do
    for each non-empty proper subset C of I do
        if (support(I) / support(I − C) ≥ minconf) then
            output the rule (I − C) ⇒ C,
            with confidence = support(I) / support(I − C)
            and support = support(I)
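In Python, the same loop can be written with itertools. The sketch
below (our own naming) reproduces the strong rules of Example 1 when
run with its support counts:

from itertools import combinations

def discover_rules(itemset, support, min_conf):
    """support maps frozensets to counts; yields (lhs, rhs, confidence)
    for every rule lhs => rhs meeting min_conf."""
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for rhs in combinations(items, r):
            rhs = frozenset(rhs)
            lhs = items - rhs
            conf = support[items] / support[lhs]
            if conf >= min_conf:
                yield lhs, rhs, conf

# Usage with the support counts from Example 1's tables:
support = {frozenset(s): c for s, c in [
    ({2}, 3), ({3}, 3), ({5}, 3),
    ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2),
    ({2, 3, 5}, 2)]}
for lhs, rhs, conf in discover_rules({2, 3, 5}, support, 0.8):
    print(set(lhs), "=>", set(rhs), f"{conf:.0%}")  # {2,3}=>{5}, {3,5}=>{2}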
Example of Discovering Rules

TID    List of Item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Let us consider the 3-itemset {I1, I2, I5} with a support count of 2
(2/9 ≈ 22.2%). Let us generate all the association rules from this
itemset:

I1 ∧ I2 ⇒ I5    confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2    confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1    confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5    confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5    confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2    confidence = 2/2 = 100%

The frequent 3-itemsets with support count 2 are
{I1, I2, I3} and {I1, I2, I5}.

Frequent Itemset   Association Rule   Sup-count   Confidence   Confidence %
{I1, I2, I5}       I1,I5 ⇒ I2         2           2/2          100
                   I2,I5 ⇒ I1         2           2/2          100
                   I5 ⇒ I1,I2         2           2/2          100
Apriori Advantages/Disadvantages
• Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Disadvantages:
– Assumes transaction database is memory
resident.
– Requires many database scans.
What is FP Growth?
• FP-Growth stands for Frequent Pattern Growth
• It is a scalable technique for mining
  frequent patterns in a database
FP Growth
• FP-Growth improves on Apriori to a great extent
• Frequent itemset mining is possible
  without candidate generation
• Only two scans of the database are needed

BUT HOW?
FP Growth
• Simply a two-step procedure
  – Step 1: Build a compact data structure called
    the FP-tree
    • Built using two passes over the data set.
  – Step 2: Extract frequent itemsets directly
    from the FP-tree

Note: Assume min support = 2


FP Growth
• Now let's consider the following transaction table:

TID    List of Item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
FP Growth
• Now we will build an FP-tree of that
  database.
• Within each transaction, items are considered in
  descending order of their support count.
FP Growth

Item   Support count
I2     7
I1     6
I3     6
I4     2
I5     2

Transactions with items reordered by descending support count:

TID    List of Item_IDs
T100   I2, I1, I5
T200   I2, I4
T300   I2, I3
T400   I2, I1, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I2, I1, I3, I5
T900   I2, I1, I3
For transaction I2, I1, I5:

null
└─ I2:1
   └─ I1:1
      └─ I5:1
For transaction I2, I4:

null
└─ I2:2
   ├─ I1:1
   │  └─ I5:1
   └─ I4:1
For transaction I2, I3:

null
└─ I2:3
   ├─ I1:1
   │  └─ I5:1
   ├─ I4:1
   └─ I3:1
For transaction I2, I1, I4:

null
└─ I2:4
   ├─ I1:2
   │  ├─ I5:1
   │  └─ I4:1
   ├─ I4:1
   └─ I3:1
For transaction I1, I3:

null
├─ I2:4
│  ├─ I1:2
│  │  ├─ I5:1
│  │  └─ I4:1
│  ├─ I4:1
│  └─ I3:1
└─ I1:1
   └─ I3:1
For transaction I2, I3:

null
├─ I2:5
│  ├─ I1:2
│  │  ├─ I5:1
│  │  └─ I4:1
│  ├─ I4:1
│  └─ I3:2
└─ I1:1
   └─ I3:1
For transaction I1, I3:

null
├─ I2:5
│  ├─ I1:2
│  │  ├─ I5:1
│  │  └─ I4:1
│  ├─ I4:1
│  └─ I3:2
└─ I1:2
   └─ I3:2
For transaction I2, I1, I3, I5:

null
├─ I2:6
│  ├─ I1:3
│  │  ├─ I5:1
│  │  ├─ I4:1
│  │  └─ I3:1
│  │     └─ I5:1
│  ├─ I4:1
│  └─ I3:2
└─ I1:2
   └─ I3:2
For transaction I2, I1, I3:

null
├─ I2:7
│  ├─ I1:4
│  │  ├─ I5:1
│  │  ├─ I4:1
│  │  └─ I3:2
│  │     └─ I5:1
│  ├─ I4:1
│  └─ I3:2
└─ I1:2
   └─ I3:2
To facilitate tree traversal, an item header table is built so that
each item points to its occurrences in the tree via a chain of
node-links.

Item   Support count   Node-links (into the final tree)
I2     7               I2:7
I1     6               I1:4 → I1:2
I3     6               I3:2 → I3:2 → I3:2
I4     2               I4:1 → I4:1
I5     2               I5:1 → I5:1
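The two database passes described above are straightforward to
implement. Below is a minimal FP-tree construction sketch in Python
(class and function names are our own); the tie-break on equal
support counts is lexicographic, which happens to match the slides'
order I2, I1, I3, I4, I5:

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}  # item -> Node

def build_fp_tree(transactions, min_support):
    """Two-pass FP-tree construction. Returns (root, header_table),
    where header_table maps each frequent item to its node-link list."""
    # Pass 1: count item supports
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    root, header = Node(None, None), {i: [] for i in order}
    # Pass 2: insert each transaction, items in descending support order
    for t in transactions:
        path = [i for i in order if i in t]
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)  # extend the node-link chain
            else:
                child.count += 1
            node = child
    return root, header

dataset = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
           ["I1","I3"], ["I2","I3"], ["I1","I3"],
           ["I1","I2","I3","I5"], ["I1","I2","I3"]]
root, header = build_fp_tree(dataset, min_support=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# {'I2': 7, 'I1': 6, 'I3': 6, 'I4': 2, 'I5': 2}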
FP Growth
• FP-tree construction over!
  Now we need to find the conditional pattern
  base and the conditional FP-tree for each
  item.
Frequent Patterns Generated

Item  Conditional Pattern Base       Conditional FP-tree    Frequent Patterns Generated
I5    {I2,I1: 1}, {I2,I1,I3: 1}      {I2:2, I1:2}           {I2,I5: 2}, {I1,I5: 2}, {I2,I1,I5: 2}
I4    {I2,I1: 1}, {I2: 1}            {I2:2}                 {I2,I4: 2}
I3    {I2,I1: 2}, {I2: 2}, {I1: 2}   {I2:4, I1:2}, {I1:2}   {I2,I3: 4}, {I1,I3: 4}, {I2,I1,I3: 2}
I1    {I2: 4}                        {I2:4}                 {I2,I1: 4}
I2    (ignored, as there is no branch)
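Assuming the mlxtend library again (an illustrative sketch, not part
of the original slides), the same frequent patterns can be mined
without building the tree by hand; min support is 2 of 9
transactions, i.e. 2/9:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

dataset = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"],
           ["I1","I2","I4"], ["I1","I3"], ["I2","I3"],
           ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(dataset), columns=te.columns_)
# min support count 2 out of 9 transactions
print(fpgrowth(df, min_support=2/9, use_colnames=True))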
Introduction to Market Basket Analysis
• Def: Market Basket Analysis (Association
  Analysis) is a mathematical modeling
  technique based upon the theory that if you
  buy a certain group of items, you are likely to
  buy another group of items.
• It is used to analyze customer purchasing
  behavior and helps in increasing sales and
  maintaining inventory by focusing on
  point-of-sale transaction data.
• Given a dataset, the Apriori algorithm
  identifies frequent product baskets and
  product associations.
Constraint-Based Mining:
• A data mining process finds thousands of rules
  from a given dataset, most of which are unrelated
  or uninteresting.
• Users have a good sense of which direction of
  mining may lead to interesting patterns.
• Users specify such intuition or expectations as
  constraints to confine the search space.
• This strategy is known as constraint-based mining.
The constraints can include the following:
• Knowledge type constraint:
  – Specify the type of knowledge to be mined,
    such as association or correlation.
• Data constraint:
  – Specify the set of task-relevant data,
    using SQL-like queries.
  – e.g., find product pairs sold together in stores in Chicago
    in Dec. 2002.
• Dimension/level constraint:
  – Specify the desired dimensions of the data, or the levels of
    the concept hierarchies, to be mined,
  – e.g., in relevance to region, price, brand, customer
    category.
• Rule (or pattern) constraint:
  – Specify the form of rules to be mined (metarule or
    rule template),
  – e.g., small sales (price < $10) trigger big sales (sum >
    $200).
• Interestingness constraint:
  – Specify thresholds such as support and confidence,
  – e.g., strong rules: min_support ≥ 3%, min_confidence ≥ 60%.
Mining Various Kinds of Association
Rules

• Mining multilevel association

• Mining multidimensional association

• Mining quantitative association

• Mining interesting correlation patterns

Multilevel Association Rule

• If an association rule relates values at
  multiple levels of abstraction, it is called a
  multilevel association rule.

Concept hierarchy:
Electronics
├─ Computer
│  ├─ Desktop
│  └─ Laptop
├─ Printer
│  ├─ Inkjet
│  └─ Dot matrix
└─ TV
   ├─ LCD
   └─ Plasma

buys(X, "computer") ⇒ buys(X, "inkjet printer")
Multidimensional Association Rule

• If an association rule involves more than one
  attribute (predicate), it is called a
  multidimensional association rule.

Example:

age(X, "20..29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")

The above example contains three predicates (attributes):
age, occupation, and buys; hence it is a
multidimensional association rule.
