Clustering and Association Rule
Definition
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Type of data in clustering analysis
• Interval-scaled variables: for interval attributes, the
differences between values are meaningful, i.e., a unit of
measurement exists. (+, -)
– Examples: calendar dates, temperatures in Celsius or Fahrenheit
• Nominal: The values of a nominal attribute are just different
names, i.e., nominal attributes provide only enough
information to distinguish one object from another. (=, ≠)
– Examples: ID numbers, eye color, zip codes
• Ratio variables:
• For ratio variables, both differences and ratios
are meaningful. (*, /)
– Examples: temperature in Kelvin, length, time,
counts
• Binary variables: have only two states, e.g., 0/1 or yes/no; they may be
symmetric (both states equally important) or asymmetric.
• Variables of mixed types: a single database may contain variables of
several of the above types (see "Dissimilarity between Variables of Mixed
Types" below).
Properties of Attribute Values
• The type of an attribute depends on which operations are meaningful for its
values: distinctness (=, ≠), order (<, >), addition (+, -), and multiplication (*, /).
Dissimilarity Matrix
• Dissimilarity matrix: pairwise dissimilarities between the n objects, stored in
an n-by-n table:

\begin{pmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{pmatrix}
• Dissimilarity/Similarity metric: similarity is expressed in terms of a
distance function d(i, j), which is typically a metric.
• d(i, j) is the measured difference or dissimilarity between objects i and j.
• In general, d(i, j) is a nonnegative number that is close to 0 when
objects i and j are highly similar ("near" each other) and grows larger
the more they differ.
• Minkowski distance:

d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional
data objects, and q is a positive integer (a small code sketch follows the properties below).
- Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)   (symmetry)
• d(i, j) ≤ d(i, k) + d(k, j)   (triangle inequality)
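As a minimal sketch of the Minkowski distance above, assuming the objects are plain Python sequences of numeric attribute values (the function name minkowski_distance is ours, not from the slides):

```python
def minkowski_distance(x, y, q=2):
    """Minkowski distance between two p-dimensional objects x and y.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# d(i, j) is nonnegative, d(i, i) = 0, and the function is symmetric.
print(minkowski_distance([1, 2], [4, 6], q=2))  # 5.0 (Euclidean)
print(minkowski_distance([1, 2], [4, 6], q=1))  # 7.0 (Manhattan)
```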
• Nominal variables: dissimilarity by simple matching,

d(i, j) = \frac{p - m}{p}

where m is the number of variables on which i and j match and p is the total
number of variables.
• Ordinal variables: map each rank r_{if} ∈ {1, …, M_f} onto [0, 1] via

z_{if} = \frac{r_{if} - 1}{M_f - 1}

and then compute the dissimilarity as for interval-scaled variables.
• Ratio-scaled variables: apply a logarithmic transformation y_{if} = log(x_{if});
y_{if} can then be treated as interval-valued.
– Alternatively, treat them as continuous ordinal data and treat their rank as
interval-scaled.
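A small illustration of the rank scaling z_{if} = (r_{if} - 1)/(M_f - 1) and the simple-matching dissimilarity (p - m)/p; the function names are illustrative, not from the slides:

```python
def rank_to_interval(rank, M):
    """Map an ordinal rank in {1, ..., M} onto [0, 1]."""
    return (rank - 1) / (M - 1)

def simple_matching_dissimilarity(obj_i, obj_j):
    """(p - m) / p for two nominal attribute vectors of equal length p."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

print(rank_to_interval(2, M=3))                                    # 0.5
print(simple_matching_dissimilarity(["red", "A"], ["blue", "A"]))  # 0.5
```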
Dissimilarity between Variables of Mixed Types
• A database may contain all six types of variables. The per-variable
dissimilarities are combined as a weighted average:

d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

where \delta_{ij}^{(f)} = 0 if x_{if} or x_{jf} is missing (or if x_{if} = x_{jf} = 0 and
variable f is asymmetric binary), and \delta_{ij}^{(f)} = 1 otherwise.
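A hedged sketch of the mixed-type formula above, under the simplifying assumption that every attribute is either nominal or interval-scaled (already rescaled to [0, 1]) and that None marks a missing value; asymmetric binary attributes are not handled here:

```python
def mixed_dissimilarity(x_i, x_j, types):
    """Combine per-attribute dissimilarities for objects of mixed attribute types.

    x_i, x_j : lists of attribute values (None = missing)
    types    : list of "nominal" or "interval" tags, one per attribute
               (interval values are assumed to be pre-scaled to [0, 1])
    """
    num, den = 0.0, 0.0
    for a, b, t in zip(x_i, x_j, types):
        if a is None or b is None:          # delta = 0 for missing values
            continue
        if t == "nominal":
            d = 0.0 if a == b else 1.0      # simple matching
        else:                               # interval-scaled, already in [0, 1]
            d = abs(a - b)
        num += d                            # delta = 1 otherwise
        den += 1.0
    return num / den if den else 0.0

print(mixed_dissimilarity(["red", 0.2, None], ["blue", 0.6, 0.9],
                          ["nominal", "interval", "interval"]))  # (1.0 + 0.4) / 2 = 0.7
```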
• A partitioning method such as k-means terminates when there is no change in the centroids.
Hierarchical Clustering
• Grouping data objects into a tree of clusters.
• Agglomerative (bottom-up strategy)
• Divisive (top-down strategy)
Example
• In agglomerative clustering, at each step the two clusters with the least
dissimilarity (minimum Euclidean distance) are merged.
• In divisive hierarchical clustering, the process starts with one cluster
containing all of the objects.
• A cluster is then split where the dissimilarity is largest (maximum Euclidean
distance between its points).
[Figure: agglomerative nesting (AGNES) merges the objects a step at a time, bottom-up, while divisive analysis (DIANA) splits the same data top-down, in the reverse order of steps.]
Agglomerative Clustering
• Keep merging the two closest clusters (single link: minimum distance) and updating the distance matrix.

      A     B     C     D,F   E
A     0.00  0.71  5.66  3.20  4.24
B     0.71  0.00  4.95  2.50  3.54
C     5.66  4.95  0.00  2.24  1.41
D,F   3.20  2.50  2.24  0.00  1.00
E     4.24  3.54  1.41  1.00  0.00

• The two closest clusters, D and F, have already been merged (shown as D,F above).
The next closest pair is A and B (distance 0.71), so they are merged and the matrix is updated below.
Example: Agglomerative Clustering III
A,B C D,F E
A,B 0.00 4.95 2.50 3.54
C 4.95 0.00 2.24 1.41
D,F 2.50 2.24 0.00 1.00
E 3.54 1.41 1.00 0.00
Example: Agglomerative Clustering IV
A,B C (D,F),E
A,B 0.00 4.95 2.50
C 4.95 0.00 1.41
(D,F),E 2.50 1.41 0.00
            A,B    ((D,F),E),C
A,B         0.00   2.50
((D,F),E),C 2.50   0.00
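The merge sequence above can be reproduced with a short single-link (minimum-distance) sketch; since the slides give only the distance matrix, not the point coordinates, the sketch works directly on that matrix:

```python
# Single-link agglomerative clustering on a precomputed distance matrix.
# Labels and distances are the ones from the first matrix above (D and F already merged).
labels = ["A", "B", "C", "D,F", "E"]
dist = {
    ("A", "B"): 0.71, ("A", "C"): 5.66, ("A", "D,F"): 3.20, ("A", "E"): 4.24,
    ("B", "C"): 4.95, ("B", "D,F"): 2.50, ("B", "E"): 3.54,
    ("C", "D,F"): 2.24, ("C", "E"): 1.41,
    ("D,F", "E"): 1.00,
}

def d(a, b):
    return dist.get((a, b), dist.get((b, a)))

while len(labels) > 1:
    # find the closest pair of clusters (minimum distance)
    x, y = min(((a, b) for i, a in enumerate(labels) for b in labels[i + 1:]),
               key=lambda p: d(*p))
    merged = f"({x}),({y})"
    print(f"merge {x} and {y} at distance {d(x, y):.2f}")
    # single link: distance to the new cluster = min of distances to its members
    for c in labels:
        if c not in (x, y):
            dist[(merged, c)] = min(d(x, c), d(y, c))
    labels = [c for c in labels if c not in (x, y)] + [merged]
# Output: merges A,B (0.71); D,F with E (1.00); then C (1.41); then the last two clusters (2.50).
```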
• Average-link (group average) distance between two clusters:

\mathrm{Dist}(Cluster_i, Cluster_j) = \frac{\sum_{c_i \in Cluster_i} \sum_{c_j \in Cluster_j} \mathrm{dist}(c_i, c_j)}{|Cluster_i|\,|Cluster_j|}
Strengths and Limitations
• Min (single link)
– Strength: can handle non-elliptical shapes
• Max (complete link)
– Strength: less susceptible to noise and outliers
• Group Average
– Strength: less susceptible to noise and outliers
– Limitation: biased towards globular clusters
Dendrogram
• A dendrogram is a tree diagram that records the sequence of merges (or splits)
and the distance at which each merge occurs.
• Average link: in this method, the distance between two clusters is the average
of all pair-wise distances between the data points in the two clusters.
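A minimal sketch of the average-link distance just defined, assuming the data points are coordinate tuples and Euclidean distance is used for the pair-wise distances:

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

def average_link(cluster_i, cluster_j):
    """Average of all pair-wise distances between points of the two clusters."""
    total = sum(dist(p, q) for p, q in product(cluster_i, cluster_j))
    return total / (len(cluster_i) * len(cluster_j))

c1 = [(0.0, 0.0), (0.0, 1.0)]
c2 = [(3.0, 0.0)]
print(average_link(c1, c2))  # (3.0 + sqrt(10)) / 2 ≈ 3.08
```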
Association Rules: Example
TID  Items_bought
9    R, A, M
10   R, M
11   R, N
12   A, V, W

Frequent Itemsets   Support
{R}                 75%
{A}                 50%
{M}                 50%
{R, M}              50%

For rule R ⇒ M:
support = support({R} ∪ {M}) = 50%
confidence = support({R} ∪ {M}) / support({R}) = 66.6%
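The support and confidence figures above can be verified with a tiny sketch over the four transactions (TIDs 9 to 12):

```python
transactions = {
    9: {"R", "A", "M"},
    10: {"R", "M"},
    11: {"R", "N"},
    12: {"A", "V", "W"},
}

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

# Rule R => M
sup = support({"R", "M"})            # support({R} ∪ {M})
conf = sup / support({"R"})          # support({R} ∪ {M}) / support({R})
print(f"support = {sup:.0%}, confidence = {conf:.1%}")  # support = 50%, confidence = 66.7%
```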
Mining Association Rules:
What We Need to Know
TID   Items_bought
T100  {M, O, N, K, E, Y}
T200  {D, O, N, K, E, Y}
T300  {M, A, K, E}
T400  {M, U, C, K, Y}
T500  {C, O, O, K, I, E}
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
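A compact, runnable rendering of the pseudo-code above (a sketch, not the original implementation): candidates are generated by self-joining the current frequent itemsets and pruned with the Apriori property, support is counted by brute-force subset tests rather than a hash-tree, and min_support is taken as an absolute transaction count. The sample database is the one from the table above.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}
    frequent, k = {}, 1
    while Lk:
        for c in Lk:
            frequent[c] = sum(c <= t for t in transactions)
        # candidate (k+1)-itemsets: join Lk with itself, then prune by the
        # Apriori property (every k-subset of a candidate must be frequent)
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in Ck1 if sum(c <= t for t in transactions) >= min_support}
        k += 1
    return frequent

db = [{"M","O","N","K","E","Y"}, {"D","O","N","K","E","Y"},
      {"M","A","K","E"}, {"M","U","C","K","Y"}, {"C","O","O","K","I","E"}]
# prints the frequent itemsets with their support counts (11 itemsets for this data)
for itemset, count in sorted(apriori(db, min_support=3).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)
```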
Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk-1 (Apriori assumes the items within each itemset are listed in lexicographic order).
• The join Lk-1 * Lk-1 is performed, where two members of Lk-1 are joinable if their first (k-2)
items are in common.
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
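The self-join-and-prune example above can be checked directly; in this sketch, itemsets are written as sorted tuples so the lexicographic ordering assumed by the join step is explicit:

```python
from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]

# Step 1 (self-join): two k-itemsets are joinable if their first k-1 items agree.
candidates = [x + (y[-1],) for x in L3 for y in L3
              if x[:-1] == y[:-1] and x[-1] < y[-1]]
print(candidates)  # [('a','b','c','d'), ('a','c','d','e')]

# Step 2 (prune): drop candidates that have an infrequent k-subset.
L3_set = set(L3)
C4 = [c for c in candidates if all(s in L3_set for s in combinations(c, 3))]
print(C4)          # [('a','b','c','d')]  -- acde is pruned because ade is not in L3
```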
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Method (for candidate support counting):
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
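A hedged sketch of the support-counting idea: instead of a real hash-tree, the candidates are indexed in an ordinary Python dict, and enumerating each transaction's k-subsets plays the role of the subset function:

```python
from itertools import combinations

def count_candidates(candidates, transactions, k):
    """Count, for each candidate k-itemset, how many transactions contain it."""
    counts = {frozenset(c): 0 for c in candidates}    # dict standing in for a hash-tree
    for t in transactions:
        for subset in combinations(sorted(t), k):     # "subset function"
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

C2 = [{"O", "K"}, {"K", "E"}, {"M", "O"}]
db = [{"M","O","N","K","E","Y"}, {"D","O","N","K","E","Y"}, {"M","A","K","E"}]
print(count_candidates(C2, db, k=2))  # {O,K}: 2, {K,E}: 3, {M,O}: 1
```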
Challenges of Frequent Pattern Mining
• Challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates
Bottleneck of Frequent-pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning
and generates lots of candidates
– To find the frequent itemset i1 i2 … i100:
• # of scans: 100
• # of candidates: \binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30} (checked in the sketch after this list)
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?
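A quick check of the candidate-count arithmetic quoted in the list above:

```python
from math import comb

# binomial(100, 1) + binomial(100, 2) + ... + binomial(100, 100) = 2^100 - 1
total = sum(comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.2e}")  # 1.27e+30
```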
Mining Frequent Patterns Without
Candidate Generation
• Frequent pattern growth or FP-growth
[Example figures: the E-conditional FP-tree and the M-conditional FP-tree.]
A Special Case: Single Prefix Path in FP-tree
[Figure: an FP-tree r1 whose upper part is a single prefix path {} → a1:n1 → a2:n2 → a3:n3 can be decomposed into that single prefix-path part plus the lower, branching part (b1:m1, c1:k1, c2:k2, c3:k3); the two parts are mined separately and the results are concatenated.]
Mining Frequent Patterns With FP-trees
• Idea: Frequent pattern growth
– Recursively grow frequent patterns by pattern and
database partition
• Method
– For each frequent item, construct its conditional pattern-
base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-
tree
– Until the resulting FP-tree is empty, or it contains only one
path—single path will generate all the combinations of its
sub-paths, each of which is a frequent pattern
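A compact FP-growth sketch following the method above. To keep it short, it recurses directly on conditional pattern bases (each pattern base standing in for the corresponding conditional FP-tree) instead of materializing the node-and-link tree structure; it is a simplified illustration, not FP-growth as implemented in any particular library.

```python
from collections import Counter

def fp_growth(pattern_base, min_support, suffix=frozenset()):
    """Mine frequent itemsets by recursing on conditional pattern bases.

    pattern_base : list of (frozenset_of_items, count) pairs; initially the
                   transaction database with a count of 1 per transaction.
    Yields (itemset, support_count) pairs, each itemset exactly once.
    """
    counts = Counter()
    for items, c in pattern_base:
        for i in items:
            counts[i] += c
    # locally frequent items, ordered from least to most frequent
    frequent = [i for i, c in sorted(counts.items(), key=lambda x: (x[1], x[0]))
                if c >= min_support]
    rank = {item: pos for pos, item in enumerate(frequent)}

    for item in frequent:
        new_suffix = suffix | {item}
        yield new_suffix, counts[item]
        # conditional pattern base of `item`: transactions containing it,
        # restricted to items that rank above `item` in the local order
        conditional = []
        for items, c in pattern_base:
            if item in items:
                kept = frozenset(i for i in items if i in rank and rank[i] > rank[item])
                if kept:
                    conditional.append((kept, c))
        if conditional:
            yield from fp_growth(conditional, min_support, new_suffix)

db = [{"M","O","N","K","E","Y"}, {"D","O","N","K","E","Y"},
      {"M","A","K","E"}, {"M","U","C","K","Y"}, {"C","O","O","K","I","E"}]
for itemset, support in fp_growth([(frozenset(t), 1) for t in db], min_support=3):
    print(set(itemset), support)
```

On this sample database with min_support = 3, the output matches the frequent itemsets found by the Apriori sketch earlier.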
Why Is FP-Growth the Winner?
• Divide-and-conquer:
– decompose both the mining task and DB according to the frequent
patterns obtained so far
– leads to focused search of smaller databases
• Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic operations are counting local frequent items and building sub FP-trees; no
pattern search and matching