Clustering
Applications include
• Grouping of online news articles
• Biology: classification of the plant and animal kingdoms given their features
• Marketing: customer segmentation based on customer data and past records
• Recognizing communities in social networks
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an earth observation database
• City planning: identifying groups of houses according to their house type, value and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Climate: understanding the earth's climate by finding patterns in atmospheric and ocean data
• Economic science: market research
What is a natural grouping among these objects?
Clustering is subjective.
Similarity is hard to define, but "we know it when we see it."
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the
universe of possible objects. The distance
(dissimilarity) between O1 and O2 is a real number
denoted by D(O1,O2)
Example: depending on the measure chosen, the distance between the strings "Peter" and "Piotr" might come out as 0.23, 3 or 342.7.
Aspects of clustering
• A clustering algorithm
  ○ Partitional clustering
  ○ Hierarchical clustering
k-means clustering
▷ One of the simplest unsupervised learning algorithms for solving the clustering problem.
▷ Start by choosing k points arbitrarily as the "centres" of the clusters, one for each cluster.
▷ Then associate each of the given data points with the nearest centre.
▷ Now take the average of the data points associated with each centre and replace that centre with the average; this is done for each of the centres.
▷ Repeat the process until the centres converge to some fixed points.
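As a rough illustration of the procedure just described, here is a minimal NumPy sketch of k-means on 2-D points (the function name, the random initialisation and the synthetic data are choices made for this example, not part of the notes):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Choose k points arbitrarily as the initial centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Associate each data point with the nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each centre with the average of its associated points.
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # Stop when the centres converge to fixed points.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

# Example usage on a small synthetic dataset with two well-separated blobs.
if __name__ == "__main__":
    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
    centres, labels = kmeans(X, k=2)
    print(centres)
```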
k-means clustering – Problem 1
(The problem data, the step-by-step assignment of points to the nearest centre, and the recalculated cluster centres at each iteration were shown as figures and tables in the worked example.)
k-means clustering – Algorithm
Disadvantages of k-means clustering
• The learning algorithm requires a priori specification of the number of clusters.
• The final clusters depend on the initial selection of centroids (seed points).
• Different representations of the data give different results (e.g. Cartesian versus polar coordinates).
• It cannot be applied directly to categorical data.
• Random selection of the initial cluster centres is not efficient.
• Euclidean distance can unequally weight underlying factors.
• It has problems with outliers.
Hierarchical clustering
A dendrogram of the dataset {a, b, c, d, e} showing the distances (heights) of the clusters at different levels.
Methods for hierarchical clustering
(Figure: agglomerative clustering (AGNES) merges the singletons {a}, {b}, {c}, {d}, {e} step by step into {a, b}, {d, e}, {c, d, e} and finally {a, b, c, d, e}; divisive clustering (DIANA) performs the same steps in the reverse order, splitting {a, b, c, d, e} back into singletons.)
Measures of distance between groups of data points
Measure of dissimilarity
▷ Complete linkage clustering.
  ○ d(A, B) = max{d(x, y) : x ∈ A, y ∈ B}.
  ○ The method is also known as farthest-neighbour clustering.
Algorithm for agglomerative hierarchical clustering
Problem
Given the dataset {a, b, c, d, e} and the distance matrix below, use agglomerative hierarchical clustering with complete linkage and construct the dendrogram.
Solution
1. Dataset: {a, b, c, d, e}. Initial clustering (singleton sets) C1: {a}, {b}, {c}, {d}, {e}.
2. The following table gives the distances between the various clusters in C1:

        {a}   {b}   {c}   {d}   {e}
  {a}     0     9     3     6    11
  {b}     9     0     7     5    10
  {c}     3     7     0     9     2
  {d}     6     5     9     0     8
  {e}    11    10     2     8     0
▷ In the above table, the minimum distance is
the distance between the clusters {c} and {e}.
▷ Also d({c}, {e}) = 2.
▷ We merge {c} and {e} to form the cluster {c,
e}.
▷ The new set of clusters C2: {a}, {b}, {d}, {c, e}.
▷ Let us compute the distance of {c, e} from
other clusters.
○ d({c, e}, {a}) = max{d(c, a), d(e, a)} = max{3, 11} = 11.
○ d({c, e}, {b}) = max{d(c, b), d(e, b)} = max{7, 10} = 10.
○ d({c, e}, {d}) = max{d(c, d), d(e, d)} = max{9, 8} = 9.
▷ The following table gives the distances between the various clusters in C2:

          {a}   {b}   {d}   {c, e}
  {a}       0     9     6       11
  {b}       9     0     5       10
  {d}       6     5     0        9
  {c, e}   11    10     9        0
▷ In the above table, the minimum distance is the
distance between the clusters {b} and {d}.
▷ Also d({b}, {d}) = 5.
▷ We merge {b} and {d} to form the cluster {b, d}.
▷ The new set of clusters C3: {a}, {b, d}, {c, e}.
▷ Let us compute the distance of {b, d} from other clusters.
  ○ d({b, d}, {a}) = max{d(b, a), d(d, a)} = max{9, 6} = 9.
  ○ d({b, d}, {c, e}) = max{d(b, c), d(b, e), d(d, c), d(d, e)} = max{7, 10, 9, 8} = 10.
▷ The following table gives the distances between the various clusters in C3:

            {a}   {b, d}   {c, e}
  {a}         0        9       11
  {b, d}      9        0       10
  {c, e}     11       10        0
▷ In the above table, the minimum distance is
the distance between the clusters {a} and {b,
d}.
▷ Also d({a}, {b, d}) = 9.
▷ We merge {a} and {b, d} to form the cluster {a,
b, d}.
▷ The new set of clusters C4: {a, b, d}, {c, e}
▷ Only two clusters are left. We merge them to
form a single cluster containing all data
points.
▷ We have d({a, b, d}, {c, e})
= max{d(a, c), d(a, e), d(b, c), d(b, e), d(d, c),
d(d, e)}
= max{3, 11, 7, 10, 9, 8} = 11
Dendrogram for the data given: the clusters {c, e}, {b, d}, {a, b, d} and {a, b, c, d, e} are formed at heights 2, 5, 9 and 11 respectively.
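The merge sequence and heights obtained above can be reproduced with a short complete-linkage sketch. The distance matrix below is the one implied by the computations in the worked example; the function names are ours.

```python
def complete_linkage(d, A, B):
    """Distance between clusters A and B: the largest cross-pair distance."""
    return max(d[x][y] for x in A for y in B)

def agglomerative(points, d):
    """Repeatedly merge the two closest clusters under complete linkage."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        pairs = [(complete_linkage(d, A, B), A, B)
                 for i, A in enumerate(clusters) for B in clusters[i + 1:]]
        dist, A, B = min(pairs, key=lambda t: t[0])
        clusters = [C for C in clusters if C not in (A, B)] + [A | B]
        merges.append((set(A), set(B), dist))
    return merges

# Pairwise distances implied by the worked example above.
d = {'a': {'a': 0, 'b': 9, 'c': 3, 'd': 6, 'e': 11},
     'b': {'a': 9, 'b': 0, 'c': 7, 'd': 5, 'e': 10},
     'c': {'a': 3, 'b': 7, 'c': 0, 'd': 9, 'e': 2},
     'd': {'a': 6, 'b': 5, 'c': 9, 'd': 0, 'e': 8},
     'e': {'a': 11, 'b': 10, 'c': 2, 'd': 8, 'e': 0}}

for A, B, dist in agglomerative('abcde', d):
    print(f"merge {A} and {B} at height {dist}")
# Expected merges: {c},{e} at 2; {b},{d} at 5; {a},{b,d} at 9; {a,b,d},{c,e} at 11
```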
Divisive method
Divisive Analysis (DIANA)
Problem
Given the dataset {a, b, c, d, e} and the following distance matrix, construct a dendrogram by the divisive analysis algorithm.
(The distance matrix and the intermediate steps of the solution are not reproduced here.)
Repeat the process by taking Ci = {a, c, e} and Cj = ∅.
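The DIANA procedure itself is not spelled out in the text above. As a rough sketch of the usual splinter-group formulation (the function and variable names are ours, and the stopping rule is the standard one rather than taken from the slides), a single split step might look like this:

```python
def diana_split(cluster, d):
    """Split one cluster Ci by peeling off a splinter group Cj.

    Usual splinter-group formulation: Cj starts empty; first move the
    object with the largest average dissimilarity to the rest of Ci,
    then keep moving objects that are, on average, closer to Cj than
    to the remaining members of Ci.
    """
    Ci, Cj = list(cluster), []

    def avg_dist(x, group):
        others = [y for y in group if y != x]
        return sum(d[x][y] for y in others) / len(others) if others else 0.0

    while len(Ci) > 1:
        # Positive score means x is further from its own group than from Cj.
        scores = {x: avg_dist(x, Ci) - (avg_dist(x, Cj) if Cj else 0.0) for x in Ci}
        x_best = max(scores, key=scores.get)
        if Cj and scores[x_best] <= 0:
            break  # no object prefers the splinter group any more
        Ci.remove(x_best)
        Cj.append(x_best)
    return Ci, Cj

# Hypothetical usage, assuming d is a dict-of-dicts distance matrix:
# Ci, Cj = diana_split(['a', 'b', 'c', 'd', 'e'], d)
```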
Density-Based Clustering
Disadvantages
1) DBSCAN fails in the case of clusters of varying density.
2) It does not work well with high-dimensional data.
3) It still requires the user to supply MinPts and ε.
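The parameters in disadvantage 3 are the two inputs every DBSCAN run needs. A minimal sketch using scikit-learn (the library choice and the synthetic data are assumptions for this illustration, not part of the notes):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points that should come out as noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2)),
               rng.uniform(-2.0, 7.0, size=(5, 2))])

# eps plays the role of ε and min_samples the role of MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points
```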
Finite mixtures
• Probabilistic clustering algorithms model the data using a mixture of distributions
• Each cluster is represented by one distribution
• The distribution governs the probabilities of attribute values in the corresponding cluster
• They are called finite mixtures because only a finite number of clusters is represented
• Usually the individual distributions are normal (Gaussian) distributions
• The distributions are combined using cluster weights
Model
For a two-cluster mixture with clusters A and B:

Pr[A | x] = Pr[x | A] Pr[A] / Pr[x] = f(x; μ_A, σ_A) p_A / Pr[x]

f(x; μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
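A direct transcription of the two formulas above into code (the parameter values in the example call are made up purely to show the usage):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """f(x; μ, σ) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posterior_A(x, p_A, mu_A, sd_A, mu_B, sd_B):
    """Pr[A | x] for a two-cluster mixture; Pr[x] is the sum over both clusters."""
    num_A = p_A * gaussian_pdf(x, mu_A, sd_A)
    num_B = (1 - p_A) * gaussian_pdf(x, mu_B, sd_B)
    return num_A / (num_A + num_B)

# Hypothetical parameters, just to show the call.
print(posterior_A(x=1.0, p_A=0.6, mu_A=0.0, sd_A=1.0, mu_B=5.0, sd_B=1.0))
```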
The EM algorithm
• EM algorithm: expectation-maximization algorithm
• Generalization of k-means to probabilistic setting
• Similar iterative procedure:
1. Calculate cluster probability for each instance
(expectation step)
2. Estimate distribution parameters based on the
cluster probabilities (maximization step)
• Cluster probabilities are stored as instance weights
More on EM
• Estimating parameters from weighted instances:
μ_A = (w1 x1 + w2 x2 + ... + wn xn) / (w1 + w2 + ... + wn)

σ_A² = (w1 (x1 − μ_A)² + w2 (x2 − μ_A)² + ... + wn (xn − μ_A)²) / (w1 + w2 + ... + wn)

• Log-likelihood (summed over all instances i):

Σ_i log(p_A Pr[xi | A] + p_B Pr[xi | B])
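Putting the E-step, the M-step and the weighted estimates above together, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture (the initialisation, stopping threshold and use of scipy.stats.norm are choices made for this illustration):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iters=200, tol=1e-6, seed=0):
    """EM for a mixture of two 1-D Gaussians A and B."""
    rng = np.random.default_rng(seed)
    # Arbitrary initial guesses for the parameters.
    mu_A, mu_B = rng.choice(x, size=2, replace=False)
    sd_A = sd_B = x.std()
    p_A = 0.5
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Expectation step: cluster probabilities, stored as instance weights.
        dens_A = p_A * norm.pdf(x, mu_A, sd_A)
        dens_B = (1 - p_A) * norm.pdf(x, mu_B, sd_B)
        w_A = dens_A / (dens_A + dens_B)
        w_B = 1 - w_A
        # Maximization step: weighted means, variances and cluster weight.
        mu_A, mu_B = np.average(x, weights=w_A), np.average(x, weights=w_B)
        sd_A = np.sqrt(np.average((x - mu_A) ** 2, weights=w_A))
        sd_B = np.sqrt(np.average((x - mu_B) ** 2, weights=w_B))
        p_A = w_A.mean()
        # Stop when the log-likelihood saturates.
        ll = np.log(dens_A + dens_B).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return (mu_A, sd_A), (mu_B, sd_B), p_A

# Example: data drawn from two Gaussians.
x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 300)])
print(em_two_gaussians(x))
```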
The EM algorithm
• Uses of the EM algorithm:
  • It can be used to fill in missing data in a sample.
  • It can be used as the basis of unsupervised learning of clusters.
  • It can be used to estimate the parameters of a Hidden Markov Model (HMM).
  • It can be used to discover the values of latent variables.
• Advantages of the EM algorithm:
  • The likelihood is guaranteed to increase with each iteration.
  • The E-step and M-step are often easy to implement for many problems.
  • Solutions to the M-step often exist in closed form.
• Disadvantages of the EM algorithm:
  • It has slow convergence.
  • It converges only to a local optimum.
  • It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).