
Module 6

CS 467 - Machine Learning


Syllabus

▷ Unsupervised Learning - Clustering Methods -


K-means, Expectation-Maximization Algorithm,
Hierarchical Clustering Methods, Density-based
Clustering

2
Clustering

▷ Clustering or cluster analysis is the task of


grouping a set of objects in such a way that
▷ objects in the same group (called a cluster)
are more similar (in some sense) to each
other than to those in other groups (clusters).
▷ Useful for:
○ Automatically organizing data
○ Understanding hidden structure in data
○ Pre-processing for further analysis

3
Applications include
• Grouping of news online
• Biology: Classification of plant and animal kingdom given their features
• Marketing: Customer segmentation based on customer data and past records
• Recognize communities in social network
• Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation
database
• City-planning: Identifying groups of houses according to their house type, value,
and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults
• Climate: understanding the Earth's climate; finding patterns in atmospheric and ocean data
• Economic Science: market research
What is a natural grouping among
these objects?

Clustering is subjective

Figure: the same set of objects can be grouped in different ways, e.g. Simpson's family vs. school employees, or females vs. males.


What is Similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of
features. (Webster's Dictionary)

Similarity is hard to define, but "we know it when we see it". The real meaning of
similarity is a philosophical question; we will take a more pragmatic approach.
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the
universe of possible objects. The distance
(dissimilarity) between O1 and O2 is a real number
denoted by D(O1,O2)

Example: the distance between "Peter" and "Piotr" might be 0.23, 3, or 342.7, depending entirely on the distance measure chosen.
Aspects of clustering
• A clustering algorithm
• Partitional clustering
• Hierarchical clustering

• A distance (similarity, or dissimilarity) function


• Clustering quality
• Inter-cluster distance → maximized
• Intra-cluster distance → minimized
• The quality of a clustering result depends on the algorithm,
the distance function, and the application
Types of Clustering
11

Major Clustering Approaches


• Partitioning approach:
• Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
• Typical methods: k-means, k-medoids
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data (or objects) using some
criterion
• Typical methods: DIANA (Divisive Analysis), AGNES (Agglomerative Nesting),
BIRCH, CHAMELEON
• Density-based approach:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue
Partitional

▷ simply a division of the set of data objects into


non-overlapping subsets (clusters) such that each
data object is in exactly one subset.
▷ E.g., k-means

12
k-means clustering
▷ one of the simplest unsupervised learning
algorithms for solving the clustering problem.
▷ start by choosing k points arbitrarily as the
“centres” of the clusters, one for each cluster.
▷ then associate each of the given data points with
the nearest centre.
▷ now replace each centre with the average (mean)
of the data points associated with it; this is done
for each of the centres.
▷ repeat the process until the centres converge to
some fixed points.

13
k-means clustering – Problem 1

▷ The distance between the points (x1, x2) and
(y1, y2) will be calculated using the familiar
distance formula of elementary analytical
geometry:

$d((x_1, x_2), (y_1, y_2)) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$

14
Solution

1. In the problem, the required number of


clusters is 2 and we take k = 2.
2. We choose two points arbitrarily as the initial
cluster centres. Let us choose arbitrarily

3. We compute the distances of the given data


points from the cluster centers.

15
16
4. The cluster centres are recalculated as
follows:

5. We compute the distances of the given data


points from the new cluster centers.
17
18
6. The cluster centres are recalculated as
follows:

7. We compute the distances of the given data


points from the new cluster centers.

19
8. The cluster centres are recalculated as
follows:

9. This divides the data into two clusters as


follows

20
10. The cluster centres are recalculated as follows:

11. These are identical to the cluster centres calculated
in Step 8. So there will be no reassignment of data
points to different clusters and hence the
computations are stopped here.
12. Conclusion: The k-means clustering algorithm with k
= 2 applied to the dataset yields the following
clusters and the associated cluster centres:

21
k-means clustering - Algorithm

22
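The algorithm summarised on this slide can be written compactly in code. Below is a minimal Python sketch of k-means (a hedged illustration under simple assumptions, not the course's reference implementation); it assumes the data points are coordinate tuples and uses the Euclidean distance, as in the worked example above:

```python
import math
import random

def kmeans(points, k, max_iters=100):
    """Minimal k-means sketch; points is a list of coordinate tuples."""
    # Step 1: choose k points arbitrarily as the initial cluster centres.
    centres = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: associate each data point with its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [math.dist(p, c) for c in centres]
            clusters[distances.index(min(distances))].append(p)
        # Step 3: replace each centre with the mean of its assigned points.
        new_centres = []
        for cluster, old in zip(clusters, centres):
            if cluster:
                new_centres.append(tuple(sum(v) / len(cluster) for v in zip(*cluster)))
            else:
                new_centres.append(old)  # keep the old centre if nothing was assigned
        # Step 4: stop when the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters
```

With k = 2 this reproduces the behaviour traced in Steps 1-12 of the worked example: the centres are recomputed until two successive iterations produce identical centres.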
Disadvantages of k-means clustering
• The learning algorithm requires a priori specification of the number of
clusters
• The final clusters depend on the initial selection of centroids (seed
points)
• With different representations of the data, we get different results
(Cartesian and polar coordinate representations)
• Cannot be applied to categorical data
• Random selection of the initial cluster centres is not efficient
• Euclidean distance can unequally weight underlying factors
• Problems with outliers
CS583, Bing Liu, UIC 24

Weaknesses of k-means: Problems with outliers
Hierarchical clustering
▷ is a method of cluster analysis which seeks to
build a hierarchy of clusters (or groups) in a
given dataset.
▷ The hierarchical clustering produces clusters
in which the clusters at each level of the
hierarchy are created by merging clusters at
the next lower level.
▷ The decision regarding whether two clusters
are to be merged or not is taken based on the
measure of dissimilarity between the clusters.
25
Hierarchical clustering -Dendrograms

▷ A dendrogram is a tree diagram used to


illustrate the arrangement of the clusters
produced by hierarchical clustering.

26
A dendrogram of the
dataset {a,b, c, d, e}
showing the distances
(heights) of the clusters
at different levels

27
Methods for hierarchical clustering

▷ There are two methods for the hierarchical


clustering of a dataset.
▷ Agglomerative method (AGNES)
○ or the bottom-up method
▷ Divisive method (DIANA)
○ the top-down method

28
Figure: AGNES (agglomerative) works bottom-up from Step 0 to Step 4, merging the singletons {a}, {b}, {c}, {d}, {e} into {a, b}, {d, e}, then {c, d, e}, and finally {a, b, c, d, e}; DIANA (divisive) performs the same sequence of splits top-down, reading the steps in the reverse direction.

29
Measures of distance between groups of data
points

▷ Let A and B be two groups of observations


and let x and y be arbitrary data points in A
and B respectively.
▷ Let d(x, y) denote the distance between x and y.
▷ d(A,B) is the distance between the groups A
and B.

30
Measure of dissimilarity
▷ Complete linkage clustering
○ d(A, B) = max{d(x, y) : x ∈ A, y ∈ B}.
○ The method is also known as farthest neighbour clustering.

▷ Single linkage clustering
○ d(A, B) = min{d(x, y) : x ∈ A, y ∈ B}.
○ The method is also known as nearest neighbour clustering.

▷ Average linkage clustering
○ d(A, B) = (1 / (|A| |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y),
where |A| and |B| are respectively the numbers of elements
in A and B.
○ It is also known as UPGMA (Unweighted Pair Group Method
with Arithmetic Mean).
31
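As a quick illustration, here is a small Python sketch of the three linkage measures, assuming the point-to-point distances are available through a function d(x, y) supplied by the caller (a hypothetical helper, e.g. a dictionary lookup or a Euclidean distance):

```python
from itertools import product

def complete_linkage(A, B, d):
    """Farthest neighbour: d(A, B) = max{d(x, y) : x in A, y in B}."""
    return max(d(x, y) for x, y in product(A, B))

def single_linkage(A, B, d):
    """Nearest neighbour: d(A, B) = min{d(x, y) : x in A, y in B}."""
    return min(d(x, y) for x, y in product(A, B))

def average_linkage(A, B, d):
    """UPGMA: the average of all pairwise distances between A and B."""
    return sum(d(x, y) for x, y in product(A, B)) / (len(A) * len(B))
```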
Figure: complete linkage clustering and single linkage clustering.

32
Algorithm for agglomerative hierarchical clustering

Given a set of N items to be clustered and an N × N distance


matrix, required to construct a hierarchical clustering of the
data using the agglomerative method.

Step 1. Start by assigning each item to its own cluster, so that


we have N clusters, each containing just one item. Let the
distances between the clusters equal the distances between
the items they contain.
Step 2. Find the closest pair of clusters and merge them into a
single cluster, so that now we have one less cluster.
Step 3. Compute distances between the new cluster and each
of the old clusters.
Step 4. Repeat Steps 2 and 3 until all items are clustered into
a single cluster of size N.
33
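These four steps, with complete linkage as the inter-cluster distance, can be sketched in a few lines of Python. The dist dictionary below is an assumption, reconstructed from the pairwise distances quoted in the worked example that follows; the rest is a minimal illustration, not a reference implementation:

```python
from itertools import combinations, product

# Pairwise distances (assumed symmetric), reconstructed from the values
# quoted in the worked example on the following slides.
dist = {('a', 'b'): 9, ('a', 'c'): 3, ('a', 'd'): 6, ('a', 'e'): 11,
        ('b', 'c'): 7, ('b', 'd'): 5, ('b', 'e'): 10,
        ('c', 'd'): 9, ('c', 'e'): 2, ('d', 'e'): 8}

def d(x, y):
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

def complete_linkage(A, B):
    return max(d(x, y) for x, y in product(A, B))

def agglomerative(items, linkage):
    # Step 1: each item starts in its own singleton cluster.
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters and merge them.
        A, B = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        height = linkage(A, B)
        clusters = [c for c in clusters if c not in (A, B)] + [A | B]
        merges.append((set(A), set(B), height))
        # Step 3: distances to the new cluster are recomputed by the
        # linkage function on the next pass through the loop.
    return merges  # Step 4: the loop ends when one cluster of size N remains.

for A, B, h in agglomerative(['a', 'b', 'c', 'd', 'e'], complete_linkage):
    print(f"merge {sorted(A)} and {sorted(B)} at height {h}")
```

Run on the reconstructed distances, this prints the same merge order and heights as the worked example: {c, e} at 2, {b, d} at 5, {a} with {b, d} at 9, and the final merge at 11.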
Problem
Given the dataset {a,b,c,d, e} and the following distance
matrix, construct a dendrogram by complete-linkage
hierarchical clustering using the agglomerative method.

34
Solution

▷ The complete-linkage clustering uses the
"maximum formula", that is, the following
formula to compute the distance between two
clusters A and B:

d(A, B) = max{d(x, y) : x ∈ A, y ∈ B}
35
1. Dataset : {a, b, c, d, e}.
Initial clustering (singleton sets) C1: {a},
{b}, {c}, {d}, {e}.
2. The following table gives the distances
between the various clusters in C1:

36
▷ In the above table, the minimum distance is
the distance between the clusters {c} and {e}.
▷ Also d({c}, {e}) = 2.
▷ We merge {c} and {e} to form the cluster {c,
e}.
▷ The new set of clusters C2: {a}, {b}, {d}, {c, e}.

37
▷ Let us compute the distance of {c, e} from
other clusters.
○ d({c, e}, {a}) = max{d(c, a), d(e, a)} = max{3, 11} = 11.
○ d({c, e}, {b}) = max{d(c, b), d(e, b)} = max{7, 10} = 10.
○ d({c, e}, {d}) = max{d(c, d), d(e, d)} = max{9, 8} = 9.
▷ The following table gives the distances
between the various clusters in C2.

38
▷ In the above table, the minimum distance is the
distance between the clusters {b} and {d}.
▷ Also d({b}, {d}) = 5.
▷ We merge {b} and {d} to form the cluster {b, d}.
▷ The new set of clusters C3: {a}, {b, d}, {c, e}.

39
▷ Let us compute the distance of {b, d} from other
clusters.
○ d({b, d}, {a}) = max{d(b, a), d(d, a)} = max{9, 6} = 9.
○ d({b, d}, {c, e}) = max{d(b, c), d(b, e), d(d, c), d(d, e)}
= max{7, 10, 9, 8} = 10.

40
▷ In the above table, the minimum distance is
the distance between the clusters {a} and {b,
d}.
▷ Also d({a}, {b, d}) = 9.
▷ We merge {a} and {b, d} to form the cluster {a,
b, d}.
▷ The new set of clusters C4: {a, b, d}, {c, e}

41
▷ Only two clusters are left. We merge them to
form a single cluster containing all data
points.
▷ We have d({a, b, d}, {c, e})
= max{d(a, c), d(a, e), d(b, c), d(b, e), d(d, c),
d(d, e)}
= max{3, 11, 7, 10, 9, 8} = 11

42
Dendrogram for the data given

43
Divisive method

▷ The divisive method starts at the top and, at each
level, recursively splits one of the existing clusters
at that level into two new clusters.
▷ If there are N observations in the dataset, then
the divisive method will produce N − 1 levels in
the hierarchy.
▷ The split is chosen to produce two new groups
with the largest “between-group dissimilarity”.
▷ Each nonterminal node has two daughter nodes.
▷ The two daughters represent the two groups
resulting from the split of the parent.

44
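As a rough illustration of how a single split can be chosen, here is a hedged Python sketch of the splinter-group heuristic used by DIANA-style divisive clustering: the seed is the point with the largest average dissimilarity to the rest of the cluster, and further points are moved over while they are, on average, closer to the splinter group than to the points left behind. The pairwise dissimilarity function d(x, y) is assumed to be supplied by the caller:

```python
def diana_split(cluster, d):
    """One DIANA-style split of a cluster into two groups (sketch)."""
    def avg(x, group):
        return sum(d(x, y) for y in group) / len(group) if group else 0.0

    remaining = set(cluster)
    # Seed the splinter group with the point having the largest
    # average dissimilarity to all other points in the cluster.
    seed = max(remaining, key=lambda x: avg(x, remaining - {x}))
    splinter = {seed}
    remaining.remove(seed)
    moved = True
    while moved and len(remaining) > 1:
        moved = False
        for x in list(remaining):
            # Move x if it is, on average, closer to the splinter group
            # than to the points it would leave behind.
            if avg(x, splinter) < avg(x, remaining - {x}):
                splinter.add(x)
                remaining.remove(x)
                moved = True
    return splinter, remaining
```

Repeating such splits on the cluster with the largest diameter at each level produces the N − 1 levels of the divisive hierarchy described above.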
45
DIvisive ANAlysis (DIANA)

46
47
Problem
Given the dataset {a, b, c, d, e} and the following distance
matrix, construct a dendrogram by the divisive analysis
algorithm.

48
49
50
Repeat the process by taking Ci = {a, c, e} and Cj = ∅

51
52
DENSITY BASED CLUSTERING

• Clustering based on density (local cluster criterion), such as


density-connected points

• Major features:

• Discover clusters of arbitrary shape


• Handle noise
• Need density parameters as termination condition
54

Density-Based Clustering

• In density-based clustering, each cluster has a considerably
higher density of points than the region outside the cluster
• Objects in the sparse areas (that are required to separate clusters)
are usually considered to be noise and border points
• Density-Based Spatial Clustering of Applications with Noise
(DBSCAN) is the most widely used density-based algorithm
55

Terminology and notations


• Let ε (epsilon) be some constant distance and let p be an arbitrary
data point. The ε-neighbourhood of p is the set
Nε(p) = {q : d(p, q) ≤ ε}

• We choose some number m0 to define points of "high density":
we say that a point p is a point of high density if Nε(p) contains
at least m0 points.
• We define a point p as a core point if Nε(p) has more than m0
points.
• We define a point p as a border point if Nε(p) has fewer than
m0 points, but p is in the ε-neighbourhood of a core point.
• A point which is neither a core point nor a border point is called
a noise point.
57
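These definitions translate directly into code. Below is a small Python sketch (an illustration under the assumptions above, not a full DBSCAN implementation) that labels each point of a dataset as core, border or noise given eps and m0:

```python
import math

def label_points(points, eps, m0):
    """Label each point (a coordinate tuple) as 'core', 'border' or 'noise'."""
    # ε-neighbourhood of each point; here the point itself is counted,
    # which is an assumption the slides leave open.
    neighbours = {p: [q for q in points if math.dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbours[p]) > m0}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = 'core'
        elif any(q in core for q in neighbours[p]):
            labels[p] = 'border'   # not dense itself, but inside a core point's neighbourhood
        else:
            labels[p] = 'noise'
    return labels
```

The full DBSCAN algorithm then grows clusters by connecting core points that lie in each other's ε-neighbourhoods and attaching border points to them; noise points are left unassigned.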

DBSCAN: Core, Border and Noise Points

Figure: original points and the resulting point types (core, border and noise) for Eps = 10, MinPts = 4.


Advantages

1) Does not require a priori specification of the number of clusters.
2) Able to identify noise data while clustering.
3) The DBSCAN algorithm is able to find arbitrarily sized and
arbitrarily shaped clusters.

Disadvantages
1) The DBSCAN algorithm fails in the case of clusters of varying density.
2) Does not work well in the case of high-dimensional data.
3) Still requires the user to supply MinPts and ε
61


Finite mixtures
• Probabilistic clustering algorithms model the data using a
mixture of distributions
• Each cluster is represented by one distribution
• The distribution governs the probabilities of attribute
values in the corresponding cluster
• They are called finite mixtures because there is only a
finite number of clusters being represented
• Usually the individual distributions are normal distributions
• Distributions are combined using cluster weights
69

A two-class mixture model data


A 51 B 62 B 64 A 48 A 39 A 51
A 43 A 47 A 51 B 64 B 62 A 48
B 62 A 52 A 52 A 51 B 64 B 64
B 64 B 64 B 62 B 63 A 52 A 42
A 45 A 51 A 49 A 43 B 63 A 48
A 42 B 65 A 48 B 65 B 64 A 41
A 46 A 48 B 62 B 66 A 48
A 45 A 49 A 43 B 65 B 64
A 45 A 46 A 40 A 46 A 48

model

μA = 50, σA = 5, pA = 0.6;  μB = 65, σB = 2, pB = 0.4


70

Using the mixture model


• The probability of an instance x belonging to cluster A is:

$\Pr[A \mid x] = \dfrac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \dfrac{f(x;\mu_A,\sigma_A)\,p_A}{\Pr[x]}$

where $f(x;\mu,\sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

• The likelihood of an instance given the clusters is:

$\Pr[x \mid \text{the distributions}] = \sum_i \Pr[x \mid \text{cluster}_i]\,\Pr[\text{cluster}_i]$
71
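For the two-class model above (μA = 50, σA = 5, pA = 0.6; μB = 65, σB = 2, pB = 0.4), the cluster probability of an instance can be computed directly. A minimal Python sketch (function names are illustrative):

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density f(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def responsibility(x, mu_a=50, sigma_a=5, p_a=0.6, mu_b=65, sigma_b=2, p_b=0.4):
    """Pr[A | x] for the two-component mixture with the parameters from the slide."""
    like_a = normal_pdf(x, mu_a, sigma_a) * p_a   # f(x; muA, sigmaA) * pA
    like_b = normal_pdf(x, mu_b, sigma_b) * p_b   # f(x; muB, sigmaB) * pB
    return like_a / (like_a + like_b)             # divide by Pr[x]

print(responsibility(55))   # an instance at 55 is far more likely to belong to A
```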

Learning the clusters


• Assume we know that there are k clusters
• To learn the clusters we need to determine their
parameters
• I.e. their means and standard deviations

• We actually have a performance criterion: the likelihood of


the training data given the clusters
• Fortunately, there exists an algorithm that finds a local
maximum of the likelihood
72

Expectation-Maximization (EM Algorithm)


• Maximum likelihood estimation is an approach to density estimation for
a dataset by searching across probability distributions and their
parameters.
• Maximum likelihood becomes intractable if there are variables that
interact with those in the dataset but were hidden or not observed,
so-called latent variables.
• The expectation-maximization algorithm is an approach for
performing maximum likelihood estimation in the presence of latent
variables.
• The EM algorithm does this by first estimating the values of the latent
variables, then optimizing the model, and then repeating these two steps
until convergence.
• The EM algorithm is an effective and general approach and is most
commonly used for density estimation with missing data, for example in
clustering algorithms such as the Gaussian Mixture Model.
73

The EM algorithm
• EM algorithm: expectation-maximization algorithm
• Generalization of k-means to probabilistic setting

• Similar iterative procedure:


1. Calculate cluster probability for each instance
(expectation step)
2. Estimate distribution parameters based on the
cluster probabilities (maximization step)

• Cluster probabilities are stored as instance weights


74

The EM algorithm

Step 1. Initialize the parameters to be estimated.

Step 2. Expectation step (E-step)


Take the expected value of the complete data given the observation
and the current parameter estimate, say, θj .
This is a function of θ and θj , say, Q(θ, θj).

Step 3. Maximization step (M-step)


Find the value of θ that maximizes the function Q(θ, θj).

Step 4. Repeat Steps 2 and 3 until the parameter values or the


likelihood function converge.
75
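Putting the E-step and M-step together for the two-component, one-dimensional Gaussian mixture used in these slides, a minimal sketch might look as follows (the initial parameter values, the convergence tolerance, and the function names are arbitrary assumptions, not the course's reference code):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(data, mu_a, sigma_a, mu_b, sigma_b, p_a=0.5, max_iters=100, tol=1e-6):
    """EM sketch for a mixture of two 1-D Gaussians A and B."""
    prev_ll = -math.inf
    for _ in range(max_iters):
        # E-step: cluster probability (instance weight) w_i = Pr[A | x_i].
        w = []
        for x in data:
            a = p_a * normal_pdf(x, mu_a, sigma_a)
            b = (1 - p_a) * normal_pdf(x, mu_b, sigma_b)
            w.append(a / (a + b))
        # M-step: re-estimate the parameters from the weighted instances.
        sw = sum(w)
        mu_a = sum(wi * x for wi, x in zip(w, data)) / sw
        mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / (len(data) - sw)
        sigma_a = math.sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / sw)
        sigma_b = math.sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / (len(data) - sw))
        p_a = sw / len(data)
        # Stop when the log-likelihood saturates.
        ll = sum(math.log(p_a * normal_pdf(x, mu_a, sigma_a)
                          + (1 - p_a) * normal_pdf(x, mu_b, sigma_b)) for x in data)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu_a, sigma_a, mu_b, sigma_b, p_a
```

Each pass corresponds to Steps 2 and 3 above; in practice one would also guard against degenerate components (σ → 0) and poor initialization.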

• Usage of EM algorithm –
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used for the purpose of estimating the parameters of
Hidden Markov Model (HMM).
• It can be used for discovering the values of latent variables.

• Advantages of EM algorithm –
• It is always guaranteed that likelihood will increase with each
iteration.
• The E-step and M-step are often easy to implement for many
problems.
• Solutions to the M-step often exist in closed form.

76

• Disadvantages of EM algorithm –
• It has slow convergence.
• It converges to a local optimum only.
• It requires both the forward and backward probabilities
(numerical optimization requires only the forward probability).
77

More on EM
• Estimating parameters from weighted instances:

$\mu_A = \dfrac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$

$\sigma_A^2 = \dfrac{w_1 (x_1 - \mu)^2 + w_2 (x_2 - \mu)^2 + \dots + w_n (x_n - \mu)^2}{w_1 + w_2 + \dots + w_n}$

• Procedure stops when the log-likelihood saturates

• Log-likelihood (increases with each iteration; we wish it to
be largest):

$\sum_i \log\bigl(p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B]\bigr)$
78
