Unit-5: Concept Description and Association Rule Mining
Unit-5: Concept Description and Association Rule Mining
Unit-5: Concept Description and Association Rule Mining
1
Outline
What is concept description?
Market basket analysis
Association Rule Mining
Generating Rules
Improved apriori algorithm
Incremental ARM (Association Rule Mining)
Associative Classification
Rule Mining
2
Concept description
Data mining can be classified into two categories: descriptive data
mining and predictive data mining.
Descriptive data mining describes the data set in a concise and
summative manner and presents interesting general properties
of the data.
Predictive data mining analyzes the data in order to construct one
or a set of models, and attempts to predict the behavior of new
data sets.
Database is usually storing the large amounts of data in great
detail. However users often like to view sets of summarized data
in concise, descriptive terms.
3
Concept description (Cont..)
The simplest kind of descriptive data mining is called
concept description.
A concept usually refers to a collection of data such as
frequent_buyers, graduate_students etc.
As a data mining task, concept description is not a simple
enumeration (number of things done one by one) of the data.
Concept description generates descriptions for characterization
and comparison of the data it is also called class description.
4
Concept description (Cont..)
Characterization provides a concise and brief summarization of
the data.
While concept or class comparison (also known as discrimination)
provides discriminations (inequity) comparing two or more
collections of data.
Example
• Given the ABC Company database, for example, examining individual
customer transactions.
• Sales managers may prefer to view the data generalized to higher levels,
such as summarized by customer groups according to geographic regions,
frequency of purchases per group and customer income.
5
Market basket analysis
Market Basket Analysis is a modelling technique.
It is based on, if you buy a certain group of items, you are more (or
less) likely to buy another group of items.
For example, if you are in a store and you buy a car then you are
more likely to buy insurance at the same time than somebody
who didn't buy insurance.
The set of items a customer buys is referred to as an itemset.
Market basket analysis seeks to find relationships between
purchases (Items).
E.g. IF {Car, Accessories} THEN {Insurance}
{Car, Accessories} {Insurance}
6
Market basket analysis (Cont..)
{Car, Accessories} {Insurance}
The probability that a customer will buy car without an
accessories is referred as the support for rule.
The conditional probability that a customer will purchase
Insurance is referred to as the confidence.
7
Association rule mining
Given a set of transactions, we need rules that will predict the
occurrence of an item based on the occurrences of other items in
the transaction.
Market-Basket transactions
TID Items
1 Bread, Milk Example of Association Rules
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke {Chocolate} → {Pepsi},
4 Bread, Milk, Chocolate, Pepsi {Milk, Bread} → {Eggs, Coke},
5 Bread, Milk, Chocolate, Coke {Pepsi, Bread} → {Milk}
8
Association rule mining (Cont..)
Itemset TID Items
Support
• Fraction of transactions that contain an itemset
o E.g. s({Milk, Bread, Chocolate}) = 2/5
Frequent Itemset
• An itemset whose support is greater than or equal to a minimum support
threshold
9
Association rule mining (Cont..)
Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
o E.g.: {Milk, Chocolate} → {Pepsi}
Rule Evaluation
• Support (s)
o Fraction of transactions that contain both X and Y
• Confidence (c)
o Measures how often items in Y appear in transactions that contain X
10
Association rule mining (Cont..)
TID Items
1 Bread, Milk
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke
4 Bread, Milk, Chocolate, Pepsi
5 Bread, Milk, Chocolate, Coke
Example:
Find support & confidence for {Milk, Chocolate} ⇒ Pepsi
= c = 67
11
Association rule mining - Example
TID Items
Calculate Support & Confidence:
1 Bread, Milk {Milk, Chocolate} → {Pepsi}
2 Bread, Chocolate, Pepsi, Eggs {Milk, Pepsi} → {Chocolate}
3 Milk, Chocolate, Pepsi, Coke {Chocolate, Pepsi} → {Milk}
4 Bread, Milk, Chocolate, Pepsi {Pepsi} → {Milk, Chocolate}
{Chocolate} → {Milk, Pepsi}
5 Bread, Milk, Chocolate, Coke {Milk} → {Chocolate, Pepsi}
Answer
Support (s) : 0.4
{Milk, Chocolate} → {Pepsi} c = 0.67
{Milk, Pepsi} → {Chocolate} c = 1.0
{Chocolate, Pepsi} → {Milk} c = 0.67
{Pepsi} → {Milk, Chocolate} c = 0.67
{Chocolate} → {Milk, Pepsi} c = 0.5
{Milk} → {Chocolate, Pepsi} c = 0.5
12
Association rule mining (Cont..)
A common strategy adopted by many association rule
mining algorithms is to decompose the problem into two
major subtasks:
1. Frequent Itemset Generation
• The objective is to find all the item-sets that satisfy the
minimum support threshold.
• These itemsets are called frequent itemsets.
2. Rule Generation
• The objective is to extract all the high-confidence rules
from the frequent itemsets found in the previous step.
• These rules are called strong rules.
13
Apriori algorithm
Purpose: The Apriori Algorithm is an influential algorithm for
mining frequent itemsets for Boolean association rules.
Key Concepts:
• Frequent Itemsets:
The sets of item which has minimum support (denoted by Li for ith-Itemset).
• Apriori Property:
Any subset of frequent itemset must be frequent.
• Join Operation:
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 itself.
14
Apriori algorithm (Cont..)
Find the frequent itemsets
• The sets of items that have minimum support and a subset of a frequent
itemset must also be a frequent itemset (Apriori Property).
• E.g. if {AB} is a frequent itemset, both {A} and {B} should be a frequent
itemset.
• Use the frequent item sets to generate association rules.
The Apriori Algorithm : Pseudo code
• Join Step: Ck is generated by joining Lk-1with itself
• Prune Step: Any (k-1) itemset that is not frequent cannot be a subset of a
frequent k-itemset
15
Apriori algorithm steps (Cont..)
Step 1:
• Start with itemsets containing just a single item (Individual items).
Step 2:
• Determine the support for itemsets.
• Keep the itemsets that meet your minimum support threshold and
remove itemsets that do not support minimum support.
Step 3:
• Using the itemsets you have kept from Step 1, generate all the possible
itemset combinations.
Step 4:
• Repeat steps 1 & 2 until there are no more new itemsets.
16
Apriori algorithm - Pseudo code (Cont..)
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
L1= {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
Increment the count of all candidates in C k+1
That are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
17
Apriori algorithm - Example Minimum Support = 2
18
Apriori algorithm - Example Minimum Support = 2
Rules generation
Association Rule Support Confidence Confidence (%)
2^35 2 2/2 = 1 100 %
3^52 2 2/2 = 1 100 %
2^53 2 2/3 = 0.66 66%
2=3^5 2 2/3 = 0.66 66%
3=2^5 2 2/3 = 0.66 66%
5=2^3 2 2/3 = 0.66 66%
19
Improve apriori’s efficiency
Hash-based itemset counting: A k-itemset whose corresponding hashing
bucket count is below the threshold cannot be frequent.
Transaction reduction: A transaction that does not contain any frequent
k-itemset is useless in subsequent scans.
Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB.
Sampling: Mining on a subset of given data, lower support threshold + a
method to determine the completeness.
Dynamic itemset counting: Add new candidate itemsets only when all of
their subsets are estimated to be frequent.
20
Incremental Mining of Association Rules
It is noted that analysis of past transaction data can provide very
valuable information on customer buying behavior, and thus improve
the quality of business decisions.
With the increasing use of the record-based databases whose data is
being continuously added, updated, deleted etc.
Examples of such applications include Web log records, stock market
data, grocery sales data, transactions in e-commerce, and daily
weather/traffic records etc.
In many applications, we would like to mine the transaction database for
a fixed amount of most recent data (say, data in the last 12 months).
21
Incremental Mining of Association Rules
Mining is not a one-time operation, a naive approach to solve the
incremental mining problem is to re-run the mining algorithm on the
updated database.
Mining
Data
Feedback Business
Strategy
22
FP-Growth Algorithm
The FP-Growth Algorithm is proposed by Han.
It is an efficient and scalable method for mining the complete set
of frequent patterns.
Using prefix-tree structure for storing information about frequent
patterns named frequent-pattern tree (FP-tree).
Once an FP-tree has been constructed, it uses a recursive divide-
and-conquer approach to mine the frequent item sets.
23
FP-Growth Algorithm (Cont..)
Building the FP-Tree
1. Scan data to determine the support count of each item.
• Infrequent items are discarded, while the frequent items are sorted in
decreasing support counts.
2. Make a second pass over the data to construct the FP-tree.
• As the transactions are read, before being processed, their items are sorted
according to the above order.
24
FP-Growth Algorithm - Example
FP-Tree Generation Step:1 Step:2
TID Transactions Freq. 1-Itemsets. Transactions with items sorted based on
Min_Sup 2 frequencies, and ignoring the infrequent
1 ABCEFO Transactions items.
2 ACG A:8 ACEBF
3 EI C:8 ACG
4 ACDEG E:8 E
G:5 ACEGD
5 ACEGL
B:2 ACEG
6 EJ
D:2 E
7 ABCEFP F:2 ACEBF
8 ACD Remaining all ACD
O,I,J,L,P,M & N
9 ACEGM is with ACEG
min_sup = 1
10 A C E G N ACEG
25
FP-Tree after reading 1st transaction
ACEBF Header null
ACG
A:8 A:1
E
C:8
ACEGD E:8
C:1
ACEG G:5
E B:2
D:2 E:1
ACEBF
F:2
ACD
B:1
ACEG
ACEG F:1
26
FP-Tree after reading 2nd transaction
ACEBF Header null
ACG
A:8 A:2
E C:8
ACEGD E:8
C:2
ACEG G:5
E B:2
E:1 G:1
D:2
ACEBF
F:2
ACD B:1
ACEG
ACEG F:1
27
FP-Tree after reading 3rd transaction
ACEBF Header null
ACG
A:8
E C:8
A:2 E:1
ACEGD E:8
C:2
ACEG G:5
E B:2
E:1 G:1
D:2
ACEBF
F:2
ACD B:1
ACEG
ACEG F:1
28
FP-Tree after reading 4th transaction
ACEBF Header null
ACG
A:8
E A:3 E:1
C:8
ACEGD E:8
C:3
ACEG G:5
E B:2
E:2 G:1
D:2
ACEBF
F:2
ACD B:1
G:1
ACEG
ACEG F:1 D:1
29
FP-Tree after reading 5th transaction
ACEBF Header null
ACG
A:8
E A:4 E:1
C:8
ACEGD E:8
C:4
ACEG G:5
E B:2
E:3 G:1
D:2
ACEBF
F:2
ACD B:1
G:2
ACEG
ACEG F:1 D:1
30
FP-Tree after reading 6th transaction
ACEBF Header null
ACG
A:8
E A:4 E:2
C:8
ACEGD E:8
ACEG C:4
G:5
E B:2
E:3 G:1
D:2
ACEBF
F:2
ACD B:1
G:2
ACEG
ACEG F:1 D:1
31
FP-Tree after reading 7th transaction
ACEBF Header null
ACG
A:8
E A:5 E:2
C:8
ACEGD E:8
ACEG C:5
G:5
E B:2
E:4 G:1
D:2
ACEBF
F:2
ACD B:2
G:2
ACEG
ACEG F:2 D:1
32
FP-Tree after reading 8th transaction
ACEBF Header null
ACG
A:8
E A:6 E:2
C:8
ACEGD E:8
ACEG C:6
G:5
E B:2
E:4 G:1 D:1
ACEBF D:2
F:2
ACD B:2
G:2
ACEG
ACEG F:2 D:1
33
FP-Tree after reading 9th transaction
ACEBF Header null
ACG
A:8
E A:7 E:2
C:8
ACEGD E:8
ACEG C:7
G:5
E B:2
E:5 G:1 D:1
ACEBF D:2
F:2
ACD G:3
B:2
ACEG
ACEG F:2 D:1
34
FP-Tree after reading 10th transaction
ACEBF Header null
ACG
A:8
E A:8 E:2
C:8
ACEGD E:8
ACEG C:8
G:5
E B:2
E:6 G:1 D:1
ACEBF D:2
F:2
ACD G:4
B:2
ACEG
ACEG F:2 D:1
35
FP-Growth algorithm - Example
Minimum Support
>= 2 FP-Tree
TID Items
null
1 AB
2 BCD Header B:8 A:2
3 ACDE Item Support
A:5 C:3 C:1 D:1
4 ADE B 8
5 ABC A 7 C:3 D:1 D:1 E:1 D:1 E:1
6 ABCD C 7
7 BC D 5 D:1 E:1
8 ABC E 3
9 ABD
10 BCE
36
FP-Growth Example (Try it) Minimum Support = 3
37
FP-Growth Example - Answer
FP-Tree Construction
null
Header Table
f:4 c:1
Item frequency
f 4
c 4 c:3 b:1 b:1
a 3
b 3 a:3 p:1
m 3
p 3 m:2 b:1
p:2 m:1
38
Thank you
39