Lecture No - 4,5,6
Dept of CSE
Dronacharya Group of Institutions
Mining multidimensional association rules from transactional databases
Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples:
Rule form: Body → Head [support, confidence].
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) → grade(x, A) [1%, 75%]
Data Mining: Concepts and Techniques
An itemset is a set of items.
E.g., X = {milk, bread, cereal} is an itemset.
A transaction t contains an itemset X in I if X ⊆ t.
An association rule is an implication of the form:
X → Y, where X, Y ⊆ I, and X ∩ Y = ∅.
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
Transaction data
Assume:
minsup = 30%
minconf = 80%
An example frequent itemset:
{Chicken, Clothes, Milk}
[sup = 3/7]
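The support and confidence computations can be sketched in Python. This is a minimal illustration, not the lecture's code; the transaction list is an assumption chosen so that sup({Chicken, Clothes, Milk}) = 3/7, matching the example above.

```python
# Minimal sketch (not from the lecture): computing support and confidence.
# The transaction list is an assumption, chosen so that
# sup({Chicken, Clothes, Milk}) = 3/7 as in the example.

transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """confidence(lhs -> rhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"Chicken", "Clothes", "Milk"}, transactions))        # 3/7
print(confidence({"Clothes"}, {"Chicken", "Milk"}, transactions))   # 1.0
```

Under these assumed transactions, the rule Clothes → Chicken, Milk has confidence 100% and would pass minconf = 80%.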
Association rules from the itemset:
E.g., the quantity of each item purchased and the price paid.
An itemset can also be seen as a conjunction of items (or a predicate).
An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the min_sup threshold.
If an itemset satisfies minimum support, then it is a frequent itemset.
Consider a database, D, consisting of 9 transactions.
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent. How?
For example, take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
Now take {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we must remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result for pruning.
The transactions in D are then scanned to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
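The pruning step just described can be sketched as follows. The function name `prune` is my own; L2 lists the 2-itemsets referenced in the example, and the joined C3 candidates are an illustrative assumption.

```python
from itertools import combinations

def prune(candidates, L_prev, k):
    """Apriori prune: keep a k-itemset only if every one of its
    (k-1)-subsets is frequent, i.e. appears in L_prev."""
    L_prev = set(L_prev)
    return [c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))]

L2 = [frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]]
C3_joined = [frozenset(c) for c in [("I1", "I2", "I3"), ("I1", "I2", "I5"),
                                    ("I1", "I3", "I5"), ("I2", "I3", "I4"),
                                    ("I2", "I3", "I5"), ("I2", "I4", "I5")]]

C3 = prune(C3_joined, L2, 3)
print(C3)  # only {I1,I2,I3} and {I1,I2,I5} survive
```

{I2, I3, I5} is dropped exactly as in the walkthrough, because its subset {I3, I5} is not in L2.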
Procedure:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s → (l − s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
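A sketch of this procedure applied to l = {I1, I2, I5}. The counts for {I1}, {I2}, {I5} and {I1, I2, I5} are the ones used in the lecture's confidence computations; the pairwise counts and min_conf = 70% are assumptions for illustration.

```python
from itertools import combinations

# sc = support_count. Pairwise counts are assumed for illustration.
sc = {frozenset({"I1", "I2", "I5"}): 2,
      frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
      frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
      frozenset({"I2", "I5"}): 2}

def rules_from(l, sc, min_conf):
    """Output s -> (l - s) for every nonempty proper subset s of l
    with sc[l] / sc[s] >= min_conf."""
    out = []
    for r in range(1, len(l)):
        for s in map(frozenset, combinations(l, r)):
            conf = sc[l] / sc[s]
            if conf >= min_conf:
                out.append((set(s), set(l - s), conf))
    return out

for lhs, rhs, conf in rules_from(frozenset({"I1", "I2", "I5"}), sc, 0.70):
    print(sorted(lhs), "->", sorted(rhs), f"conf = {conf:.0%}")
```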
R4: I1 → I2 ^ I5
Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 → I1 ^ I5
Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 → I1 ^ I2
Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%. R6 is selected.
Dataset T (minsup = 0.5):

TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5
Advantages
Disadvantages:
Assumes the transaction database is memory resident.
Requires many database scans.
Worked example (min_sup = 40%, min_conf = 70%):
Step 1 yields L1; Step 2 yields C2; Step 3 yields L2; C3 = Null; Steps 5 and 6 generate the rules (min_sup = 40%, min_conf = 70%).
Some rules need additional analysis, like Milk → Beer.
Some rules are unbelievable, like Diaper → Beer.
Note this example could contain unrealistic results because of its small data.
Mining multidimensional association rules from transactional databases
Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
DHP, like Apriori, alternates two phases: generate candidate set, then count support.
Given items i1, i2, ..., in, each pair (x, y) is hashed to a bucket # according to the hash function.
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E
C1:
Itemset  Sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  Sup
{A}      2
{B}      3
{C}      3
{E}      3
TID   2-itemsets
100   {A C} {A D} {C D}
200   {B C} {B E} {C E}
300   {A B} {A C} {A E} {B C} {B E} {C E}
400   {B E}
[Hash-table figure: each 2-itemset above is hashed into a bucket; e.g., bucket 0 holds {A E} (count 1), one bucket holds {B C} twice, one holds {B E} three times, and {A B}, {A C}, {C D}, {A C} fall into the remaining buckets.]
C2 of Apriori (from L1*L1): {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
Resulted C2 of DHP (keeping only pairs whose bucket count # reaches min_sup): {A C}, {B C}, {B E}, {C E}
August 18, 2015
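The DHP filtering idea can be sketched on the 4-transaction dataset above. The hash function `h` and the bucket count are arbitrary illustrations, not the lecture's actual hash, so the exact surviving C2 may differ; what always holds is that no truly frequent pair is filtered out.

```python
from itertools import combinations

# Sketch of DHP's hash filtering of 2-itemset candidates.
transactions = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}
NBUCKETS = 4
min_sup = 2

def h(pair):
    x, y = pair
    return (ord(x) * 26 + ord(y)) % NBUCKETS  # illustrative hash function

# Pass 1: while scanning transactions, hash every 2-itemset into a
# bucket and increment that bucket's counter.
bucket = [0] * NBUCKETS
for items in transactions.values():
    for pair in combinations(sorted(items), 2):
        bucket[h(pair)] += 1

# Candidate generation: a pair from L1*L1 is kept only if its bucket
# count reaches min_sup (a bucket count below min_sup proves the pair
# itself cannot be frequent).
L1 = ["A", "B", "C", "E"]
C2 = [p for p in combinations(L1, 2) if bucket[h(p)] >= min_sup]
print(C2)
```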
Vertical data layout: instead of {TID: itemset}, it stores {item: TID_set} (a TID-list per item).
Example TID-lists:
B:  {1, 2, 5, 7, 8, 10}
AB: {1, 5, 7, 8}

3 traversal approaches: top-down, bottom-up and hybrid.
Advantage: very fast support counting.
Disadvantage: intermediate tid-lists may become too large for memory.
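Support counting in the vertical layout reduces to set intersection, which can be sketched as follows. B's TID-list is from the example above; A's list is an assumption chosen so that the AB list matches.

```python
# ECLAT-style support counting: the support of an itemset is the size
# of the intersection of its items' TID-lists.

tidlist = {
    "A": {1, 4, 5, 7, 8, 9},   # assumed for illustration
    "B": {1, 2, 5, 7, 8, 10},  # from the example above
}

tid_AB = tidlist["A"] & tidlist["B"]  # intersect the two TID-lists
print(sorted(tid_AB), "support =", len(tid_AB))  # [1, 5, 7, 8] support = 4
```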
Mining multilevel association rules from transactional databases
Rules are mined at each level of abstraction.
Advantages:
Uses a reduced minimum support threshold at lower levels: the deeper the level of abstraction, the smaller the corresponding threshold.
Discretization is performed prior to mining and is referred to as static discretization.
Hence this approach is referred to as mining multidimensional association rules using static discretization of quantitative attributes.
Mining multidimensional association rules from transactional databases
The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A) · P(B); otherwise, itemsets A and B are dependent and correlated as events.
The lift between A and B can be measured as:
lift(A, B) = P(A ∪ B) / (P(A) · P(B))
This is equivalent to:
lift(A, B) = P(B|A) / P(B), or confidence(A => B) / support(B).
In other words, it assesses the degree to which the occurrence of one lifts the occurrence of the other.
Example:
If A = Computer Games and B = Videos then, given the market conditions, the sale of games is said to increase or lift the likelihood of the sale of videos by a factor of the value returned by the equation.
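The lift formula can be checked numerically. The transaction counts below are assumptions for illustration, not the lecture's figures.

```python
# Numeric check of lift(A, B) = P(A ∪ B) / (P(A) * P(B))
# with assumed counts for the computer-games / videos example.

n = 10000      # total transactions (assumed)
games = 6000   # transactions containing computer games
videos = 7500  # transactions containing videos
both = 4000    # transactions containing both

lift = (both / n) / ((games / n) * (videos / n))
print(round(lift, 3))  # 0.889: less than 1, so negatively correlated
```

A lift above 1 would mean buying games lifts the likelihood of buying videos; with these assumed counts it is below 1.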
Data constraint: specify the task-relevant data, via SQL-like queries.
Dimension/level constraints: specify the desired dimensions of the data, e.g. in relevance to region, price, brand, customer category.
Interestingness constraints: specify thresholds on statistical measures.