New Association Rule

Data Mining:
Association
Mining Association Rules
in Large Databases
 Association rule mining
 Mining single-dimensional Boolean association
rules from transactional databases
 Mining multilevel association rules from
transactional databases
 Mining multidimensional association rules from
transactional databases and data warehouse
 From association mining to correlation analysis
 Constraint-based association mining
 Summary
What Is Association Mining?
 Association rule mining:
 Finding frequent patterns, associations,

correlations, or causal structures among sets of
items or objects in transaction databases,
relational databases, and other information
repositories.
 Applications:
 Basket data analysis, cross-marketing, catalog
design, loss-leader analysis, clustering,
classification, etc.
Example:
Computer=>antivirus_software [support=2%, confidence=60%]
A support of 2% for above rule

means that 2% of all the
transactions under analysis show
that computer and antivirus software
are purchased together.
A confidence of 60% means that

60% of the customers who
purchased a computer also bought
Association Rule: Basic Concepts
 Given: (1) database of transactions, (2) each

transaction is a list of items (purchased by a
customer in a visit)
 Find: all rules that correlate the presence of one set

of items with that of another set of items
 E.g., 98% of people who purchase tires and auto

accessories also get automotive services done
Association Rule Mining: A
Road Map
 Boolean vs. quantitative associations (Based
on the types of values handled)
 buys(x, “SQLServer”) ^ buys(x, “DMBook”)

® buys(x, “DBMiner”) [0.2%, 60%]
 age(x, “30..39”) ^ income(x, “42..48K”) ®

buys(x, “PC”) [1%, 75%]
Mining Association Rules in
Large Databases
 Summary
Mining Association Rules—An Example
Transaction ID Items Bought Min. support 50%

2000 A,B,C Min. confidence 50%
1000 A,C
4000 A,D Frequent Itemset Support
{A} 75%
5000 B,E,F
{B} 50%
{C} 50%
For rule A  C: {A,C} 50%
support = support({A C}) = 50%
confidence = support({A C})/support({A}) =
66.6%
The Apriori principle:
Any subset of a frequent itemset must be
Mining Frequent
Itemsets: the Key Step
 Find the frequent itemsets: the sets of
items that have minimum support
 A subset of a frequent itemset must also be a
frequent itemset
 i.e., if {AB} is a frequent itemset, both {A} and
{B} should be a frequent itemset
 Iteratively find frequent itemsets with
cardinality from 1 to k (k-itemset)
 Use the frequent itemsets to generate
association rules.
The Apriori Algorithm
 Join Step: C is generated by joining L with
k k-1
itself
 Prune Step: Any (k-1)-itemset that is not
frequent cannot be a subset of a frequent k-
itemset
 Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
The Apriori Algorithm — Example
Database D itemset sup.

L1 itemset sup.
TID Items C1 {1} 2 {1} 2
100 134 {2} 3 {2} 3
200 235 Scan D {3} 3 {3} 3
300 1235 {4} 1 {5} 3
400 25 {5} 3
C2 itemset sup C2 itemset
L2 itemset sup {1 2} 1 Scan D {1 2}
{1 3} 2 {1 3} 2 {1 3}
{2 3} 2 {1 5} 1 {1 5}
{2 3} 2 {2 3}
{2 5} 3
{2 5} 3 {2 5}
{3 5} 2
{3 5} 2 {3 5}
C3 itemset Scan D L3 itemset sup
{2 3 5} {2 3 5} 2
The Apriori Algorithm — Example
 Example 2:
How to Generate Candidates?
 Suppose the items in Lk-1 are listed in an
order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2,
p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
How to Count Supports of
Candidates?
 Why counting supports of candidates a problem?
 The total number of candidates can be very
huge
 One transaction may contain many
candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of
itemsets and counts
 Interior node contains a hash table
 Subset function: finds all the candidates
Example of Generating
Candidates
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
Methods to Improve Apriori’s
Efficiency
 Hash-based itemset counting: A k-itemset whose

corresponding hashing bucket count is below the
threshold cannot be frequent
 Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans
 Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
 Dynamic itemset counting: add new candidate itemsets
only when all of their subsets are estimated to be frequent
Visualization of Association Rule Using Plane Graph
Large Databases
 Summary
Multiple-Level Association
Rules
Food
 Items often form
hierarchy. milk bread
 Items at the lower level
are expected to have skim 2% wheat white
lower support.
 Rules regarding itemsets Fraser Sunset
at
appropriate levels could TID Items
be quite useful. T1 {111, 121, 211, 221}
 Transaction database can T2 {111, 211, 222, 323}
be encoded based on T3 {112, 122, 221, 411}
dimensions and levels T4 {111, 121}
 We can explore shared T5 {111, 122, 211, 221, 413}
Mining Multi-Level
Associations
 A top_down, progressive deepening approach:
 First find high-level strong rules:
milk ® bread [20%, 60%].
 Then find their lower-level “weaker” rules:
2% milk ® wheat bread [6%, 50%].
 Variations at mining multiple-level association rules.
 Level-crossed association rules:
2% milk ® Wonder wheat bread
 Association rules with multiple, alternative
hierarchies:
2% milk ® Wonder bread
Multi-level Association: Uniform
Support vs. Reduced Support
 Uniform Support: the same minimum support for all
levels
 + One minimum support threshold. No need to examine
itemsets containing any item whose ancestors do not have
minimum support.
 – Lower level items do not occur as frequently. If support
threshold
 too high  miss low level associations
 too low  generate too many high level associations
 Reduced Support: reduced minimum support at
lower levels
 There are 4 search strategies:
 Level-by-level independent
 Level-cross filtering by k-itemset
 Level-cross filtering by single item
 Controlled level-cross filtering by single item
Uniform Support
Multi-level mining with uniform support
Level 1 Milk
min_sup = 5%
[support = 10%]
Level 2 2% Milk Skim Milk

min_sup = 5% [support = 6%] [support = 4%]
Back
Reduced Support
Multi-level mining with reduced support
Level 1 Milk
min_sup = 5%
[support = 10%]
Level 2 2% Milk Skim Milk

min_sup = 3% [support = 6%] [support = 4%]
Back
Multi-level Association:
Redundancy Filtering
 Some rules may be redundant due to

“ancestor” relationships between items.
 Example
 milk  wheat bread [support = 8%, confidence =
70%]
 2% milk  wheat bread [support = 2%, confidence =
72%]
 We say the first rule is an ancestor of the
second rule.
 A rule is redundant if its support is close to
the “expected” value, based on the rule’s
Multi-Level Mining:
Progressive Deepening
 A top-down, progressive deepening
approach:
 First mine high-level frequent items:
milk (15%), bread (10%)
 Then mine their lower-level “weaker”
frequent itemsets:
2% milk (5%), wheat bread
(4%)
 Different min_support threshold across
multi-levels lead to different algorithms:
 If adopting the same min_support across
multi-levels
then toss t if any of t’s ancestors is infrequent.
 If adopting reduced min_support at lower
Progressive Refinement of
Data Mining Quality
 Why progressive refinement?
 Mining operator can be expensive or cheap, fine
or rough
 Trade speed with quality: step-by-step
refinement.
 Superset coverage property:
 Preserve all the positive answers—allow a
positive false test but not a false negative test.
 Two- or multi-step mining:
 First apply rough/cheap operator (superset
coverage)
Large Databases
 Summary
Multi-Dimensional
Association: Concepts
 Single-dimensional rules:
buys(X, “milk”)  buys(X, “bread”)
 Multi-dimensional rules:  2 dimensions or predicates
 Inter-dimension association rules (no repeated predicates)
age(X,”19-25”)  occupation(X,“student”)  buys(X,“coke”)
 hybrid-dimension association rules (repeated predicates)
age(X,”19-25”)  buys(X, “popcorn”)  buys(X, “coke”)
 Categorical Attributes
 finite number of possible values, no ordering among values
 Quantitative Attributes
 numeric, implicit ordering among values
Large Databases
 Summary
Interestingness
Measurements
 Objective measures
Two popular measurements:
 support; and
 confidence
 Subjective measures (Silberschatz &

Tuzhilin, KDD95)
A rule (pattern) is interesting if
 it is unexpected (surprising to the user);
and/or
 actionable (the user can do something
with it)
Criticism to Support and
Confidence
 Example 1: (Aggarwal & Yu, PODS98)
 Among 5000 students
 3000 play basketball
 3750 eat cereal
 2000 both play basket ball and eat cereal
 play basketball  eat cereal [40%, 66.7%] is misleading
because the overall percentage of students eating cereal
is 75% which is higher than 66.7%.
 play basketball  not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence
basketball not basketball sum(row)
cereal 2000 1750 3750
not cereal 1000 250 1250
sum(col.) 3000 2000 5000
Criticism to Support and
Confidence (Cont.)
 Example 2:
 X and Y: positively X 1 1 1 1 0 0 0 0
correlated, Y 1 1 0 0 0 0 0 0
 X and Z, negatively related Z 0 1 1 1 1 1 1 1
 support and confidence of
X=>Z dominates
 We need a measure of
Rule Support Confidence
dependent or A B)
P(correlated X=>Y 25% 50%
corr
eventsA, B 
P( A) P( B) X=>Z 37.50% 75%

Other Interestingness Measures:
Interest
 Interest (correlation, lift)P( A  B)
P ( A) P ( B )
 taking both P(A) and P(B) in consideration
 P(A^B)=P(B)*P(A), if A and B are independent
events
 A and B negatively correlated, if the value is less
than 1; otherwise A and B positively
Itemset Support correlated
Interest
X 1 1 1 1 0 0 0 0
X,Y 25% 2
Y 1 1 0 0 0 0 0 0 X,Z 37.50% 0.9
Z 0 1 1 1 1 1 1 1 Y,Z 12.50% 0.57
Large Databases
 Summary
Constraint-Based Mining
 Interactive, exploratory mining giga-bytes of data?
 Could it be real? — Making good use of constraints!
 What kinds of constraints can be used in mining?
 Knowledge type constraint: classification, association,
etc.
 Data constraint: SQL-like queries
 Find product pairs sold together in Vancouver in
Dec.’98.
 Dimension/level constraints:
 in relevance to region, price, brand, customer
category.
 Rule constraints
 small sales (price < $10) triggers big sales (sum >
$200).
 Interestingness constraints:
 strong rules (min_support  3%, min_confidence 
Large Databases
 Summary

New Association Rule

Uploaded by

Copyright:

Available Formats

New Association Rule

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

New Association Rule

Uploaded by

Copyright:

Available Formats

Data Mining:

 Association rule mining:

 Finding frequent patterns, associations,

Computer=>antivirus_software [support=2%, confidence=60%]

A support of 2% for above rule

A confidence of 60% means that

 Given: (1) database of transactions, (2) each

 Find: all rules that correlate the presence of one set

 E.g., 98% of people who purchase tires and auto

 buys(x, “SQLServer”) ^ buys(x, “DMBook”)

 age(x, “30..39”) ^ income(x, “42..48K”) ®

Transaction ID Items Bought Min. support 50%

Database D itemset sup.

 L3={abc, abd, acd, ace, bcd}

 Hash-based itemset counting: A k-itemset whose

Level 2 2% Milk Skim Milk

Level 2 2% Milk Skim Milk

 Some rules may be redundant due to

 Subjective measures (Silberschatz &

You might also like