Unit-5: Concept Description and Association Rule Mining

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 39

Unit-5

Concept Description and Association Rule Mining

1
Outline
 What is concept description?
 Market basket analysis
 Association Rule Mining
 Generating Rules
 Improved apriori algorithm
 Incremental ARM (Association Rule Mining)
 Associative Classification
 Rule Mining

2
Concept description
 Data mining can be classified into two categories: descriptive data
mining and predictive data mining.
 Descriptive data mining describes the data set in a concise and
summative manner and presents interesting general properties
of the data.
 Predictive data mining analyzes the data in order to construct one
or a set of models, and attempts to predict the behavior of new
data sets.
 Database is usually storing the large amounts of data in great
detail. However users often like to view sets of summarized data
in concise, descriptive terms.

3
Concept description (Cont..)
The simplest kind of descriptive data mining is called
concept description.
 A concept usually refers to a collection of data such as
frequent_buyers, graduate_students etc.
 As a data mining task, concept description is not a simple
enumeration (number of things done one by one) of the data.
 Concept description generates descriptions for characterization
and comparison of the data it is also called class description.

4
Concept description (Cont..)
 Characterization provides a concise and brief summarization of
the data.
 While concept or class comparison (also known as discrimination)
provides discriminations (inequity) comparing two or more
collections of data.
 Example
• Given the ABC Company database, for example, examining individual
customer transactions.
• Sales managers may prefer to view the data generalized to higher levels,
such as summarized by customer groups according to geographic regions,
frequency of purchases per group and customer income.

5
Market basket analysis
 Market Basket Analysis is a modelling technique.
 It is based on, if you buy a certain group of items, you are more (or
less) likely to buy another group of items.
 For example, if you are in a store and you buy a car then you are
more likely to buy insurance at the same time than somebody
who didn't buy insurance.
 The set of items a customer buys is referred to as an itemset.
 Market basket analysis seeks to find relationships between
purchases (Items).
 E.g. IF {Car, Accessories} THEN {Insurance}
{Car, Accessories}  {Insurance}

6
Market basket analysis (Cont..)
{Car, Accessories}  {Insurance}
 The probability that a customer will buy car without an
accessories is referred as the support for rule.
 The conditional probability that a customer will purchase
Insurance is referred to as the confidence.

7
Association rule mining
 Given a set of transactions, we need rules that will predict the
occurrence of an item based on the occurrences of other items in
the transaction.
 Market-Basket transactions
TID Items
1 Bread, Milk Example of Association Rules
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke {Chocolate} → {Pepsi},
4 Bread, Milk, Chocolate, Pepsi {Milk, Bread} → {Eggs, Coke},
5 Bread, Milk, Chocolate, Coke {Pepsi, Bread} → {Milk}

8
Association rule mining (Cont..)
 Itemset TID Items

• A collection of one or more items 1 Bread, Milk


o E.g. : {Milk, Bread, Chocolate} 2 Bread, Chocolate, Pepsi, Eggs
• k-itemset 3 Milk, Chocolate, Pepsi, Coke
An itemset that contains k items 4 Bread, Milk, Chocolate, Pepsi
 Support count (σ) 5 Bread, Milk, Chocolate, Coke
• Frequency of occurrence of an itemset
o E.g. σ({Milk, Bread, Chocolate}) = 2

 Support
• Fraction of transactions that contain an itemset
o E.g. s({Milk, Bread, Chocolate}) = 2/5

 Frequent Itemset
• An itemset whose support is greater than or equal to a minimum support
threshold

9
Association rule mining (Cont..)
 Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
o E.g.: {Milk, Chocolate} → {Pepsi}

 Rule Evaluation
• Support (s)
o Fraction of transactions that contain both X and Y
• Confidence (c)
o Measures how often items in Y appear in transactions that contain X

10
Association rule mining (Cont..)
TID Items
1 Bread, Milk
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke
4 Bread, Milk, Chocolate, Pepsi
5 Bread, Milk, Chocolate, Coke

Example:
Find support & confidence for {Milk, Chocolate} ⇒ Pepsi
   
= c = 67

11
Association rule mining - Example
TID Items
Calculate Support & Confidence:
1 Bread, Milk {Milk, Chocolate} → {Pepsi}
2 Bread, Chocolate, Pepsi, Eggs {Milk, Pepsi} → {Chocolate}
3 Milk, Chocolate, Pepsi, Coke {Chocolate, Pepsi} → {Milk}
4 Bread, Milk, Chocolate, Pepsi {Pepsi} → {Milk, Chocolate}
{Chocolate} → {Milk, Pepsi}
5 Bread, Milk, Chocolate, Coke {Milk} → {Chocolate, Pepsi}

Answer
Support (s) : 0.4
{Milk, Chocolate} → {Pepsi} c = 0.67
{Milk, Pepsi} → {Chocolate} c = 1.0
{Chocolate, Pepsi} → {Milk} c = 0.67
{Pepsi} → {Milk, Chocolate} c = 0.67
{Chocolate} → {Milk, Pepsi} c = 0.5
{Milk} → {Chocolate, Pepsi} c = 0.5
12
Association rule mining (Cont..)
 A common strategy adopted by many association rule
mining algorithms is to decompose the problem into two
major subtasks:
1. Frequent Itemset Generation
• The objective is to find all the item-sets that satisfy the
minimum support threshold.
• These itemsets are called frequent itemsets.
2. Rule Generation
• The objective is to extract all the high-confidence rules
from the frequent itemsets found in the previous step.
• These rules are called strong rules.

13
Apriori algorithm
 Purpose: The Apriori Algorithm is an influential algorithm for
mining frequent itemsets for Boolean association rules.
 Key Concepts:
• Frequent Itemsets:
The sets of item which has minimum support (denoted by Li for ith-Itemset).
• Apriori Property:
Any subset of frequent itemset must be frequent.
• Join Operation:
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 itself.

14
Apriori algorithm (Cont..)
 Find the frequent itemsets
• The sets of items that have minimum support and a subset of a frequent
itemset must also be a frequent itemset (Apriori Property).
• E.g. if {AB} is a frequent itemset, both {A} and {B} should be a frequent
itemset.
• Use the frequent item sets to generate association rules.
 The Apriori Algorithm : Pseudo code
• Join Step: Ck is generated by joining Lk-1with itself
• Prune Step: Any (k-1) itemset that is not frequent cannot be a subset of a
frequent k-itemset

15
Apriori algorithm steps (Cont..)
 Step 1:
• Start with itemsets containing just a single item (Individual items).
 Step 2:
• Determine the support for itemsets.
• Keep the itemsets that meet your minimum support threshold and
remove itemsets that do not support minimum support.
 Step 3:
• Using the itemsets you have kept from Step 1, generate all the possible
itemset combinations.
 Step 4:
• Repeat steps 1 & 2 until there are no more new itemsets.

16
Apriori algorithm - Pseudo code (Cont..)
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
L1= {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
Increment the count of all candidates in C k+1
That are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
17
Apriori algorithm - Example Minimum Support = 2

TID Items C1 ItemSet Min. Sup L1 ItemSet Min. Sup


100 134 {1} 2 {1} 2
200 235 Scan D
{2} 3 {2} 3
300 1235 {3} 3 {3} 3
400 25 {4} 1 {5} 3
{5} 3
C2 Itemset Min. Sup C2 Itemset
L2 ItemSet Min. Sup
{1 2} 1 {1 2}
{1 3} 2
{1 3} 2 Scan D {1 3}
{2 3} 2
{1 5} 1 {1 5}
{2 5} 3
{2 3} 2 {2 3}
{3 5} 2
{2 5} 3 {2 5}
{3 5} 2 {3 5}

18
Apriori algorithm - Example Minimum Support = 2

ItemSet Min. Sup ItemSet Min. Sup


L2 {1 3} 2 C3 {1 2 3} 1
L3 Items Sup
{2 3} 2 Scan D {1 2 5} 1
{2 3 5} 2
{2 5} 3 {1 3 5} 1
{3 5} 2 {2 3 5} 2

Rules generation
Association Rule Support Confidence Confidence (%)
2^35 2 2/2 = 1 100 %
3^52 2 2/2 = 1 100 %
2^53 2 2/3 = 0.66 66%
2=3^5 2 2/3 = 0.66 66%
3=2^5 2 2/3 = 0.66 66%
5=2^3 2 2/3 = 0.66 66%

19
Improve apriori’s efficiency
 Hash-based itemset counting: A k-itemset whose corresponding hashing
bucket count is below the threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain any frequent
k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB.
 Sampling: Mining on a subset of given data, lower support threshold + a
method to determine the completeness.
 Dynamic itemset counting: Add new candidate itemsets only when all of
their subsets are estimated to be frequent.

20
Incremental Mining of Association Rules
 It is noted that analysis of past transaction data can provide very
valuable information on customer buying behavior, and thus improve
the quality of business decisions.
 With the increasing use of the record-based databases whose data is
being continuously added, updated, deleted etc.
 Examples of such applications include Web log records, stock market
data, grocery sales data, transactions in e-commerce, and daily
weather/traffic records etc.
 In many applications, we would like to mine the transaction database for
a fixed amount of most recent data (say, data in the last 12 months).

21
Incremental Mining of Association Rules
 Mining is not a one-time operation, a naive approach to solve the
incremental mining problem is to re-run the mining algorithm on the
updated database.
Mining

Data

Feedback Business
Strategy

22
FP-Growth Algorithm
 The FP-Growth Algorithm is proposed by Han.
 It is an efficient and scalable method for mining the complete set
of frequent patterns.
 Using prefix-tree structure for storing information about frequent
patterns named frequent-pattern tree (FP-tree).
 Once an FP-tree has been constructed, it uses a recursive divide-
and-conquer approach to mine the frequent item sets.

23
FP-Growth Algorithm (Cont..)
 Building the FP-Tree
1. Scan data to determine the support count of each item.
• Infrequent items are discarded, while the frequent items are sorted in
decreasing support counts.
2. Make a second pass over the data to construct the FP­-tree.
• As the transactions are read, before being processed, their items are sorted
according to the above order.

24
FP-Growth Algorithm - Example
FP-Tree Generation Step:1 Step:2
TID Transactions Freq. 1-Itemsets. Transactions with items sorted based on
Min_Sup  2 frequencies, and ignoring the infrequent
1 ABCEFO Transactions items.
2 ACG A:8 ACEBF
3 EI C:8 ACG
4 ACDEG E:8 E
G:5 ACEGD
5 ACEGL
B:2 ACEG
6 EJ
D:2 E
7 ABCEFP F:2 ACEBF
8 ACD Remaining all ACD
O,I,J,L,P,M & N
9 ACEGM is with ACEG
min_sup = 1
10 A C E G N ACEG
25
FP-Tree after reading 1st transaction
ACEBF Header null
ACG
A:8 A:1
E
C:8
ACEGD E:8
C:1
ACEG G:5
E B:2
D:2 E:1
ACEBF
F:2
ACD
B:1
ACEG
ACEG F:1

26
FP-Tree after reading 2nd transaction
ACEBF Header null
ACG
A:8 A:2
E C:8
ACEGD E:8
C:2
ACEG G:5
E B:2
E:1 G:1
D:2
ACEBF
F:2
ACD B:1
ACEG
ACEG F:1

27
FP-Tree after reading 3rd transaction
ACEBF Header null
ACG
A:8
E C:8
A:2 E:1

ACEGD E:8
C:2
ACEG G:5
E B:2
E:1 G:1
D:2
ACEBF
F:2
ACD B:1
ACEG
ACEG F:1

28
FP-Tree after reading 4th transaction
ACEBF Header null
ACG
A:8
E A:3 E:1
C:8
ACEGD E:8
C:3
ACEG G:5
E B:2
E:2 G:1
D:2
ACEBF
F:2
ACD B:1
G:1
ACEG
ACEG F:1 D:1

29
FP-Tree after reading 5th transaction
ACEBF Header null
ACG
A:8
E A:4 E:1
C:8
ACEGD E:8
C:4
ACEG G:5
E B:2
E:3 G:1
D:2
ACEBF
F:2
ACD B:1
G:2
ACEG
ACEG F:1 D:1

30
FP-Tree after reading 6th transaction
ACEBF Header null
ACG
A:8
E A:4 E:2
C:8
ACEGD E:8
ACEG C:4
G:5
E B:2
E:3 G:1
D:2
ACEBF
F:2
ACD B:1
G:2
ACEG
ACEG F:1 D:1

31
FP-Tree after reading 7th transaction
ACEBF Header null
ACG
A:8
E A:5 E:2
C:8
ACEGD E:8
ACEG C:5
G:5
E B:2
E:4 G:1
D:2
ACEBF
F:2
ACD B:2
G:2
ACEG
ACEG F:2 D:1

32
FP-Tree after reading 8th transaction
ACEBF Header null
ACG
A:8
E A:6 E:2
C:8
ACEGD E:8
ACEG C:6
G:5
E B:2
E:4 G:1 D:1
ACEBF D:2
F:2
ACD B:2
G:2
ACEG
ACEG F:2 D:1

33
FP-Tree after reading 9th transaction
ACEBF Header null
ACG
A:8
E A:7 E:2
C:8
ACEGD E:8
ACEG C:7
G:5
E B:2
E:5 G:1 D:1
ACEBF D:2
F:2
ACD G:3
B:2
ACEG
ACEG F:2 D:1

34
FP-Tree after reading 10th transaction
ACEBF Header null
ACG
A:8
E A:8 E:2
C:8
ACEGD E:8
ACEG C:8
G:5
E B:2
E:6 G:1 D:1
ACEBF D:2
F:2
ACD G:4
B:2
ACEG
ACEG F:2 D:1

35
FP-Growth algorithm - Example
Minimum Support
>= 2 FP-Tree
TID Items
null
1 AB
2 BCD Header B:8 A:2
3 ACDE Item Support
A:5 C:3 C:1 D:1
4 ADE B 8
5 ABC A 7 C:3 D:1 D:1 E:1 D:1 E:1
6 ABCD C 7
7 BC D 5 D:1 E:1
8 ABC E 3
9 ABD
10 BCE

36
FP-Growth Example (Try it) Minimum Support = 3

TID Items Bought Frequent Items


(sorted order)
100 FACDGIMP FCAMP
200 ABCFLMO FCABM
300 BFHJOW FB
400 BCKSP CBP
500 AFCELPMN FCAMP

37
FP-Growth Example - Answer
FP-Tree Construction

null
Header Table
f:4 c:1
Item frequency
f 4
c 4 c:3 b:1 b:1
a 3
b 3 a:3 p:1
m 3
p 3 m:2 b:1

p:2 m:1

38
Thank you

39

You might also like