CSE 385 - Data Mining and Business Intelligence - Lecture 02
BUSINESS INTELLIGENCE -
LECTURE 02
Dr. Mahmoud Mounir
mahmoud.mounir@cis.asu.edu.eg
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 6 —
◼ Basic Concepts
◼ Evaluation Methods
◼ Summary
3
Association Rule Mining
◼ Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Juice, Eggs
3   | Milk, Diaper, Juice, Coke
4   | Bread, Milk, Diaper, Juice
5   | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Juice}
{Milk, Bread} → {Eggs, Coke}
{Juice, Bread} → {Milk}

Implication means co-occurrence, not causality!
4
Definition: Frequent Itemset
◼ Itemset
◼ A collection of one or more items
◼ Example: {Milk, Bread, Diaper}
◼ k-itemset
◼ An itemset that contains k items
◼ Support count (σ)
◼ Frequency of occurrence of an itemset
◼ E.g. σ({Milk, Bread, Diaper}) = 2
◼ Support (relative support)
◼ Fraction of transactions that contain an itemset
◼ E.g. s({Milk, Bread, Diaper}) = 2/5
◼ Frequent Itemset
◼ An itemset whose support is greater than or equal to a minsup threshold

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Juice, Eggs
3   | Milk, Diaper, Juice, Coke
4   | Bread, Milk, Diaper, Juice
5   | Bread, Milk, Diaper, Coke
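The two support notions above can be sketched in a few lines of Python on the slide's five market-basket transactions. This is an illustrative sketch; the helper names are my own, not from any library.

```python
# Support count (sigma) and relative support for the slide's five
# market-basket transactions. Helper names are illustrative only.
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Juice", "Eggs"},
    {"Milk", "Diaper", "Juice", "Coke"},
    {"Bread", "Milk", "Diaper", "Juice"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): how many transactions contain every item of X."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, T))  # 2
print(support(X, T))        # 0.4
```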
Definition: Association Rule
◼ Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Juice}
◼ Rule Evaluation Metrics
– Support (s)
◆ Fraction of transactions that contain both X and Y
– Confidence (c)
◆ Measures how often items in Y appear in transactions that contain X

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Juice, Eggs
3   | Milk, Diaper, Juice, Coke
4   | Bread, Milk, Diaper, Juice
5   | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Juice}

s = σ(Milk, Diaper, Juice) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Juice) / σ(Milk, Diaper) = 2/3 ≈ 0.67
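The two formulas above can be checked directly on the same data. A minimal sketch (my own helper names, not library code):

```python
# Support and confidence of {Milk, Diaper} -> {Juice} on the slide's
# five transactions; sketch only, helper names are illustrative.
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Juice", "Eggs"},
    {"Milk", "Diaper", "Juice", "Coke"},
    {"Bread", "Milk", "Diaper", "Juice"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(itemset <= t for t in T)

X, Y = {"Milk", "Diaper"}, {"Juice"}
s = sigma(X | Y) / len(T)    # = 2/5 = 0.4
c = sigma(X | Y) / sigma(X)  # = 2/3
print(s, round(c, 2))        # 0.4 0.67
```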
6
Association Rule Mining Task
◼ Given a set of transactions T, the goal of association rule
mining is to find all rules having
◼ support ≥ minsup threshold
◼ confidence ≥ minconf threshold
◼ Brute-force approach:
◼ List all possible association rules
◼ Compute the support and confidence for each rule
◼ Prune rules that fail the minsup and minconf thresholds
➔ Computationally prohibitive!
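To see why the brute-force approach is prohibitive, we can count the possible rules. A standard counting result (not stated on the slide) gives R = 3^d − 2^(d+1) + 1 rules over d items:

```python
# Number of possible association rules over d items:
# R = 3^d - 2^(d+1) + 1 (standard counting result).
# Even d = 6 items already yield 602 candidate rules.
def num_rules(d):
    return 3**d - 2**(d + 1) + 1

print(num_rules(6))  # 602
```

For a realistic catalog with thousands of items, R is astronomically large, which is why support-based pruning is needed.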
7
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?— Juice and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
8
Mining Association Rules

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Juice, Eggs
3   | Milk, Diaper, Juice, Coke
4   | Bread, Milk, Diaper, Juice
5   | Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Juice} (s=0.4, c=0.67)
{Milk, Juice} → {Diaper} (s=0.4, c=1.0)
{Diaper, Juice} → {Milk} (s=0.4, c=0.67)
{Juice} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Juice} (s=0.4, c=0.5)
{Milk} → {Diaper, Juice} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Juice}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
9
Mining Association Rules
◼ Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent
itemset, where each rule is a binary partitioning of
a frequent itemset
[Figure: itemset lattice over items A–E, from the 1-itemsets through the 3-itemsets]
12
Reducing Number of Candidates
◼ Apriori principle:
◼ If an itemset is frequent, then all of its subsets must also be frequent
◼ Conversely, if an itemset is found to be infrequent, all of its supersets can be pruned
[Figure: itemset lattice over A–E; once an itemset is found to be infrequent, all of its supersets, up to ABCDE, are pruned]
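The Apriori principle rests on anti-monotonicity of support: a subset is contained in at least as many transactions as its superset. A quick illustrative check on the earlier five-transaction data (my own sketch):

```python
# Anti-monotonicity check: the support of any subset is >= the support
# of the itemset itself. Sketch on the earlier 5-transaction data.
from itertools import combinations

T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Juice", "Eggs"},
    {"Milk", "Diaper", "Juice", "Coke"},
    {"Bread", "Milk", "Diaper", "Juice"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(x):
    return sum(x <= t for t in T)

X = {"Milk", "Diaper", "Juice"}
for k in range(1, len(X)):
    for sub in map(set, combinations(X, k)):
        assert sigma(sub) >= sigma(X)  # Apriori principle holds
print("every proper subset of X is at least as frequent as X")
```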
14
Why Is Freq. Pattern Mining Important?
◼ Broad applications
15
Basic Concepts: Frequent Patterns
16
Basic Concepts: Association Rules

Tid | Items bought
10  | Juice, Nuts, Diaper
20  | Juice, Coffee, Diaper
30  | Juice, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

◼ Find all the rules X → Y with minimum support and confidence
◼ support, s: probability that a transaction contains X ∪ Y
◼ confidence, c: conditional probability that a transaction
having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: {Juice}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Juice, Diaper}:3
◼ Association rules: (many more!)
◼ Juice → Diaper (60%, 100%)
◼ Diaper → Juice (60%, 75%)
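The two rules above can be verified on the Tid 10-50 data. A minimal sketch (illustrative helper names):

```python
# Checking the slide's two rules on its Tid 10-50 transactions.
T = [
    {"Juice", "Nuts", "Diaper"},
    {"Juice", "Coffee", "Diaper"},
    {"Juice", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def sigma(x):
    return sum(x <= t for t in T)

def rule(X, Y):
    """Return (support, confidence) of the rule X -> Y."""
    s = sigma(X | Y) / len(T)
    c = sigma(X | Y) / sigma(X)
    return s, c

print(rule({"Juice"}, {"Diaper"}))   # (0.6, 1.0)
print(rule({"Diaper"}, {"Juice"}))   # (0.6, 0.75)
```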
17
Closed Patterns and Max-Patterns
◼ Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
◼ Min_sup = 1.
◼ What is the set of closed itemsets?
◼ <a1, …, a100>: 1
◼ <a1, …, a50>: 2
◼ What is the set of max-patterns?
◼ <a1, …, a100>: 1
◼ What is the set of all patterns?
◼ All non-empty subsets of {a1, …, a100}: far too many to enumerate!
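Exactly how many patterns is "far too many"? With min_sup = 1, every non-empty subset of {a1, …, a100} is frequent, since each one occurs in the first transaction. The count is therefore 2^100 − 1:

```python
# With min_sup = 1, every non-empty subset of <a1, ..., a100> is
# frequent, so the number of (frequent) patterns is 2^100 - 1.
n_patterns = 2**100 - 1
print(n_patterns)  # a 31-digit number
```

This is why the two closed patterns (or the single max-pattern) are such a valuable compressed representation of the full answer.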
18
Computational Complexity of Frequent Itemset
Mining
◼ How many itemsets can potentially be generated in the worst case?
◼ The number of frequent itemsets to be generated is sensitive to the
minsup threshold
◼ When minsup is low, there exist potentially an exponential number of
frequent itemsets
◼ The worst case: M^N where M: # distinct items, and N: max length of
transactions
◼ The worst-case complexity vs. the expected probability
◼ Ex. Suppose Walmart has 10^4 kinds of products
◼ The chance to pick up one product: 10^-4
◼ The chance to pick up a particular set of 10 products: ~10^-40
◼ What is the chance for this particular set of 10 products to be frequent
10^3 times in 10^9 transactions?
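A back-of-the-envelope answer to the slide's question, using a Markov bound (my choice of tool, not from the slide): with probability ~10^-40 per transaction, the expected number of hits in 10^9 transactions is 10^-31, so the chance of 10^3 or more hits is at most 10^-34; effectively zero.

```python
# Expected hits and a Markov bound for the slide's question.
from fractions import Fraction

p = Fraction(1, 10**40)   # chance a transaction contains the 10-item set
expected = 10**9 * p      # expected hits in 10^9 transactions = 1/10^31
bound = expected / 10**3  # Markov: P(X >= 1000) <= E[X] / 1000
print(expected, bound)
```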
19
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Evaluation Methods
◼ Summary
20
Scalable Frequent Itemset Mining Methods
◼ Apriori: a candidate generation-and-test approach
◼ Mining with vertical data format
21
The Downward Closure Property and Scalable
Mining Methods
◼ The downward closure property of frequent patterns
◼ Any subset of a frequent itemset must be frequent
◼ e.g., if {juice, diaper, nuts} is frequent, so is {juice, diaper}
◼ i.e., every transaction having {juice, diaper, nuts} also
contains {juice, diaper}
◼ Scalable mining methods:
◼ Apriori (Agrawal & Srikant @VLDB’94)
◼ Frequent pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD’00)
◼ Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
22
Apriori: A Candidate Generation & Test Approach
23
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

C1 (after 1st scan):
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{D}     | 1
{E}     | 3

L1:
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{E}     | 3

C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):
Itemset | sup
{A, B}  | 1
{A, C}  | 2
{A, E}  | 1
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

L2:
Itemset | sup
{A, C}  | 2
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that
        are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
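The pseudocode can be turned into a short runnable sketch. This is my own minimal implementation (not the book's code), run on the TDB example with Supmin = 2; besides L1 and L2 shown above, it also finds the 3-itemset {B, C, E} with support 2.

```python
# Minimal level-wise Apriori: generate C_{k+1} from L_k, prune candidates
# whose k-subsets are not all frequent, then count supports in one pass.
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    def count(c):
        return sum(c <= t for t in transactions)
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    frequent = {}
    k = 1
    while L:
        frequent.update({s: count(s) for s in L})
        # join step: unions of two frequent k-itemsets that have size k+1
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c for c in candidates if count(c) >= min_sup}
        k += 1
    return frequent

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(TDB, 2)
print(freq[frozenset({"B", "C", "E"})])  # 2
```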
25
Implementation of Apriori
27
The Apriori Algorithm—An Example
Min Support = 2
Confidence = 70%
28
Rule Generation
◼ Given a frequent itemset L, find all non-empty subsets f ⊂ L such
that f → L − f satisfies the minimum confidence requirement
◼ If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
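Enumerating every binary partition of a frequent itemset gives 2^k − 2 candidate rules for a k-itemset (excluding the empty antecedent and empty consequent). A sketch that reproduces the 14 rules above:

```python
# Enumerate all candidate rules lhs -> (L - lhs) for a frequent itemset.
from itertools import combinations

L = frozenset("ABCD")
rules = [(set(lhs), set(L - frozenset(lhs)))
         for k in range(1, len(L))
         for lhs in combinations(sorted(L), k)]
print(len(rules))  # 14, i.e. 2^4 - 2
```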
32
Rule Generation for Apriori Algorithm
[Figure: lattice of rules generated from {A,B,C,D}, from ABCD ⇒ { } down
through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D; a low-confidence rule lets us
prune every rule below it in the lattice]
34
Impact of Support-Based Pruning

TID | Items
1   | Bread, Milk
2   | Juice, Bread, Diaper, Eggs
3   | Juice, Coke, Diaper, Milk
4   | Juice, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Juice  | 3
Diaper | 4
Eggs   | 1

With Minimum Support = 2:
If every subset is considered: 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56

With Minimum Support = 3:
If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 4 = 16
36
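The counts on this slide are binomial sums; the pruned count 6 + 6 + 4 corresponds to keeping only the four items that meet the support threshold of 3 (Bread, Milk, Juice, Diaper) when forming pairs and triples. That mapping is my reading of the slide; the arithmetic itself is easy to check:

```python
# The candidate counts quoted above, via binomial coefficients.
from math import comb

brute_3 = sum(comb(6, k) for k in range(1, 4))  # 6 + 15 + 20 = 41
brute_4 = sum(comb(6, k) for k in range(1, 5))  # 41 + 15 = 56
pruned = 6 + comb(4, 2) + comb(4, 3)            # 6 + 6 + 4 = 16
print(brute_3, brute_4, pruned)  # 41 56 16
```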
Factors Affecting Complexity of Apriori
◼ Choice of minimum support threshold
◼ lowering support threshold results in more frequent itemsets
◼ this may increase number of candidates and max length of
frequent itemsets
◼ Dimensionality (number of items) of the data set
◼ More space is needed to store support count of itemsets
◼ if number of frequent itemsets also increases, both computation
and I/O costs may also increase
◼ Size of database
◼ run time of algorithm increases with number of transactions
◼ Average transaction width
◼ transaction width increases the max length of frequent itemsets
◼ number of subsets in a transaction increases with its width,
increasing computation time for support counting
39
Compact Representation of Frequent Itemsets
◼ For three groups of 10 items each (A1-A10, B1-B10, C1-C10), the number of
frequent itemsets is 3 × Σk=1..10 (10 choose k)
[Figure: itemset lattice over A–E with the maximal itemsets marked and the
border separating frequent from infrequent itemsets]
42
What are the Maximal Frequent Itemsets in this Data?
(A1-A10)
(B1-B10)
(C1-C10)
43
An illustrative example
[Figure: a transaction × item matrix (transactions 1-10), stepped through over
four slides to identify the maximal frequent itemsets; the matrix cells are
not recoverable from this extraction]
47
Another illustrative example
[Figure: another transaction × item matrix; not recoverable from this extraction]
48
Closed Itemset
◼ An itemset X is closed if none of its immediate supersets
has the same support as the itemset X.
◼ X is not closed if at least one of its immediate supersets
has the same support count as X.
49
Closed Itemset
◼ An itemset X is closed if none of its immediate supersets
has the same support as X.

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset | Support
{A}     | 4
{B}     | 5
{C}     | 3
{D}     | 4
{A,B}   | 4
{A,C}   | 2
{A,D}   | 3
{B,C}   | 3
{B,D}   | 4
{C,D}   | 3

Itemset   | Support
{A,B,C}   | 2
{A,B,D}   | 3
{A,C,D}   | 2
{B,C,D}   | 3
{A,B,C,D} | 2
50
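The closed itemsets of this example can be derived mechanically from the definition. A sketch (my own code) that enumerates every itemset occurring in the data and keeps those with no same-support immediate superset:

```python
# Derive the closed itemsets of the slide's five transactions:
# an itemset is closed iff no immediate superset has the same support.
from itertools import combinations

T = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
     {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = sorted({i for t in T for i in t})

def sigma(x):
    return sum(set(x) <= t for t in T)

present = {frozenset(c)
           for k in range(1, len(items) + 1)
           for c in combinations(items, k) if sigma(c) > 0}
closed = {x for x in present
          if not any(sigma(x | {i}) == sigma(x) for i in items if i not in x)}
print(sorted("".join(sorted(x)) for x in closed))
# ['AB', 'ABCD', 'ABD', 'B', 'BCD', 'BD']
```

For instance, {A} (support 4) is not closed because its superset {A,B} also has support 4, while {B} (support 5) is closed.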
Maximal vs Closed Itemsets

TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

[Figure: the itemset lattice from null up to ABCDE, with each node annotated
by the ids of the transactions that contain it, e.g. AB: 1,2 and CD: 2,4;
itemsets such as ABCDE are not supported by any transaction]
51
Maximal Frequent vs Closed Frequent Itemsets
[Figure: the same lattice with the closed and maximal frequent itemsets marked]
# Closed frequent = 9
# Maximal frequent = 4
52
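The counts "closed frequent = 9" and "maximal frequent = 4" can be reproduced on the TID data above. A minimal sketch, assuming minsup = 2 (the threshold is not stated on this slide):

```python
# Reproduce the closed-frequent and maximal-frequent counts for the
# TID data (1: ABC, 2: ABCD, 3: BCE, 4: ACDE, 5: DE).
from itertools import combinations

T = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C", "E"},
     {"A", "C", "D", "E"}, {"D", "E"}]
items = sorted({i for t in T for i in t})
minsup = 2  # assumed, not given on the slide

def sigma(x):
    return sum(set(x) <= t for t in T)

freq = {frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k) if sigma(c) >= minsup}
# closed: no immediate superset with the same support
closed = {x for x in freq
          if not any(sigma(x | {i}) == sigma(x) for i in items if i not in x)}
# maximal: no immediate superset that is itself frequent
maximal = {x for x in freq
           if not any(x | {i} in freq for i in items if i not in x)}
print(len(closed), len(maximal))  # 9 4
```

Note that every maximal frequent itemset is also closed, so the maximal itemsets are a compact but lossy summary, while the closed itemsets preserve all support counts.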
What are the Closed Itemsets in this Data?
(A1-A10)
(B1-B10)
(C1-C10)
53
Example 1
[Figure: a transaction × item matrix over two slides; {C,D} has support 2,
and the ✓ marks it as a closed itemset]
55
Example 2
[Figure: a transaction × item matrix over two slides; {C,D}, {C,E}, {D,E}
and {C,D,E} each have support 2, and the ✓ marks {C,D,E} as the closed
itemset, since each of its proper subsets has the same support as {C,D,E}]
57
Example 3
[Figure: transaction × item matrix; not recoverable from this extraction]
Example 4
[Figure: transaction × item matrix; not recoverable from this extraction]
59
Maximal vs Closed Itemsets
[Figure: Venn diagram; maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets]
60
Example question
◼ Given the following transaction data sets (dark cells indicate presence of an item
in a transaction) and a support threshold of 20%, answer the following
questions
62
Ref: Basic Concepts of Frequent Pattern Mining
63
Ref: Apriori and Its Improvements
◼ R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94
◼ H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
◼ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
◼ J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
◼ H. Toivonen. Sampling large databases for association rules. VLDB'96
◼ S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
◼ S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: Alternatives and implications. SIGMOD'98
64
Ref: Depth-First, Projection-Based FP Mining
◼ R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation
of frequent itemsets. J. Parallel and Distributed Computing, 2002.
◼ G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
FIMI'03
◼ B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining
implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov. 2003
◼ J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00
◼ J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
Projection. KDD'02
◼ J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
◼ J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets. KDD'03
65
Ref: Vertical Format and Row Enumeration Methods
66
Ref: Mining Correlations and Interesting Rules
◼ S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97.
◼ M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94.
◼ R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic, 2001.
◼ C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98.
◼ P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. KDD'02.
◼ E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
◼ T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371-
397, 2010
67