Unit 5 - Cluster Analysis

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

https://genuinenotes.

com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

Cluster analysis
A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract
objects into classes of similar objects is called clustering.
Clustering techniques apply when there is no class to be predicted but rather when the instances
are to be divided into natural groups. These clusters presumably reflect some mechanism at work
in the domain from which instances are drawn, a mechanism that causes some instances to bear a
stronger resemblance to each other than they do to the remaining instances. Clustering naturally
requires different techniques to the classification and association learning.
Clustering differs from classification in that there is no target variable for clustering. The clustering
task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering
algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters,
where the similarity of the records within the cluster is maximized, and the similarity to records
outside this cluster is minimized. By definition this is an example of unsupervised learning.

Applications
Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into small number of groups for additional analysis
and marketing activities
Information Retrieval: The World Wide Web consists of billions of Web pages, and the results
of a query to a search engine can return thousands of pages. Clustering can be used to group these
search results into a small number of clusters, each of which captures a particular aspect of the
query. For instance, a query of "movie" might return Web pages grouped into categories such as
reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (sub-
clusters), producing a hierarchical structure that further assists a user's exploration of the query
results.
Climate: Understanding the Earth's climate requires finding patterns in the atmosphere and ocean.
Cluster analysis has been applied to find patterns in the atmospheric pressure of polar region and
areas of the ocean that have significant impact on land climate.
Biology: used to derive plant and animal taxonomies, categorize genes with similar functionality,
and gain insight into structures inherent in populations.

Arjun Lamichhane 1
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

Types of clustering algorithm.


In general, the major clustering methods can be classified into the following categories.

1. Partitioning methods
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k ≤ n. Given the number of partitions (k) to
construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation
technique that attempts to improve the partitioning by moving objects from one group to another.
The algorithms that use such methods are k-means algorithm, k-medoids algorithm etc.

2. Hierarchical Methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. If we
permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested
clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is
the union of its children (subclusters), and the root of the tree is the cluster containing all the
objects. A hierarchical method can be classified as being either agglomerative or divisive, based
on how the hierarchical decomposition is formed.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be
undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry
about a combinatorial number of different choices. However, such techniques cannot correct
erroneous decisions.

3. Density based methods


Most partitioning methods cluster objects based on the distance between objects. Such methods
can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary
shapes. Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density (number of objects or
data points) in the “neighborhood” exceeds some threshold; that is, for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary
shape. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters
according to a density-based connectivity analysis.

Arjun Lamichhane 2
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

4. Grid-based methods
Grid-based methods quantize the object space into a finite number of cells that form a grid
structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized
space). The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in each
dimension in the quantized space.

5. Model-based methods:
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data
to the given model. A model-based algorithm may locate clusters by constructing a density
function that reflects the spatial distribution of the data points. It also leads to a way of
automatically determining the number of clusters based on standard statistics, taking “noise” or
outliers into account and thus yielding robust clustering methods.
Only the first three methods are of interest to us.

Partitioning algorithm
Partitioning algorithm divides the data into k groups, which together satisfy the following
requirements:
 Each group must contain at least one object, and
 Each object must belong to exactly one group.
These heuristic clustering methods work well for finding spherical-shaped clusters in small to
medium-sized databases. The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects within a cluster are “similar,”
whereas the objects of different clusters are “dissimilar” in terms of the data set attributes.

K-means Clustering algorithm


K-means defines a cluster in terms of a centroid, which is usually the mean of a group of points.
We first choose K initial centroids, where K is a user specified parameter, namely, the number of
clusters desired. Each point is then assigned to the closest centroid, and each collection of points
assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points
assigned to the cluster. We repeat the assignment and update steps until no point changes clusters,
or equivalently, until the centroids remain the same.

Arjun Lamichhane 3
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

K-means is formally described by following algorithm:

Algorithm

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:

1. Arbitrarily choose k objects from D as the initial cluster centers;


2. Repeat
2.1 Form k clusters by assigning each point to its closest centroid.
2.2 Recompute the centroid of each cluster.
3. Until Centroids do not change.

The operation of K-means is illustrated in Figure 1, which shows how, starting from three
centroids, the final clusters are found in three assignment-update steps. These figure shows the
centroids at the start of the iteration and the assignment of the points to those centroids. The
centroids are indicated by the "+" symbol; all points belonging to the same cluster have the same
outlines encircled by dotted curves. In the first step, shown in Figure 1(a), points are assigned to
the initial centroids (chosen randomly), which are all in the larger group of points. For this

Figure 1 Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by ‘+’)

example, we use the mean as the centroid. After points are assigned to a centroid, the centroid is
then updated. In the second step, points are assigned to the updated centroids, and the centroids

Arjun Lamichhane 4
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

are updated again which are shown in Figures 1(b) and 1(c). When the K-means algorithm
terminates in Figure 1(c), because no more changes occur, the centroids have identified the natural
groupings of points.
Example 1.
D={2, 3, 4, 10, 11, 12, 20, 25, 30}
Divide into three clusters
---Refer class notes---
Example 2.
Generate cluster from following data set using K-means algorithm (take k= 2 and consider up to 2
iterations)
T1: Bread, Jelly, Butter
T2: Bread, Butter
T3: Bread, Milk, Butter
T4: Coke, Bread
T5: Coke, Milk
--- Refer Class Notes---
Strength
K-means is simple and can be used for a wide variety of data types. It works well when the clusters
are compact clouds that are rather well separated from one another. The method is relatively
scalable and efficient in processing large data sets, even though multiple runs are often performed.
Weakness
K-means also has trouble clustering data that contains outliers. Outlier detection and removal can
help significantly in such situations. The k-means method is not suitable for discovering clusters
with nonconvex shapes or clusters of very different size. Finally, K-means is restricted to data for
which there is a notion of a center (centroid) so it might not be applicable in some applications,
such as when data with categorical attributes are involved.

Hierarchical Clustering
Hierarchical clustering techniques are a second important category of clustering methods. As with
K-means, these approaches are relatively old compared to many clustering algorithms, but they
still enjoy widespread use.

Arjun Lamichhane 5
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

A hierarchical clustering method works by grouping data objects into a tree of clusters.
Hierarchical clustering methods can be further classified as either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-
down (splitting) fashion. The quality of a pure hierarchical clustering method suffers from its
inability to perform adjustment once a merge or split decision has been executed. That is, if a
particular merge or split decision later turns out to have been a poor choice, the method cannot
backtrack and correct it.
In general, there are two types of hierarchical clustering methods:
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest
pair of clusters. This requires defining a notion of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton
clusters of individual points remain. In this case, we need to decide which cluster to split at each
step and how to do the splitting.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a
dendrogram, which displays both the cluster-subcluster relationships and the order in which the
clusters were merged (agglomerative view) or split (divisive view).

Agglomerative Clustering
Many agglomerative hierarchical clustering techniques are variations on a single approach: starting
with individual points as clusters, successively merge the two closest clusters until only one cluster
remains. In general, agglomerative algorithms are more frequently used in real world applications
than divisible methods.
Algorithm
1. Compute the proximity matrix, if necessary.
2. Repeat
2.1. Merge the closest two clusters.
2.2. Update the proximity matrix to reflect the proximity between the new cluster and the
original clusters.
3. Until Only one cluster remains.

Arjun Lamichhane 6
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

Defining inter-cluster similarity

Figure 2 Types of inter-cluster similarity

The key operation of Algorithm for agglomerative clustering is the computation of the proximity
(distance) between two clusters. For example, many agglomerative hierarchical clustering
techniques, such as MIN, MAX, and Group Average, come from a graph-based view of clusters.
MIN defines cluster proximity as the proximity between the closest two points that are in different
clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes.
Alternatively, MAX takes the proximity between the farthest two points in different clusters to be
the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets
of nodes. The group average technique, defines cluster proximity to be the average pairwise
proximities (average length of edges) of all pairs of points from different clusters.
Example: Perform clustering using Agglomerative algorithm for the following:
Points: A (1, 1), B (1.5, 1.5), C (5, 5) D (3, 4), E (4, 4), F (3, 3.5)
Step 1. Compute distance between all the points.
𝑑𝐴,𝐵 = √(1 − 1.5)2 + (1 − 1.5)2 = √0.5 = 0.71

𝑑𝐷,𝐹 = √(3 − 3)2 + (4 − 3.5)2 = √0.25 = 0.50


Similarly computing all distances, we obtain proximity matrix as:
A B C D E F
A 0.00 0.71 5.66 3.61 4.24 3.20
B 0.71 0.00 4.95 2.92 3.54 2.50
C 5.66 4.95 0.00 2.24 1.41 2.50
D 3.61 2.92 2.24 0.00 1.00 0.50
E 4.24 3.54 1.41 1.00 0.00 1.12
F 3.20 2.50 2.50 0.50 1.12 0.00

Arjun Lamichhane 7
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

Step 2: Merge two closest points (Hence D, F)

A B C DF E
A 0.00
B 0.71 0.00
C 5.66 4.95 0.00
DF ? ? ? 0.00
E 4.24 3.54 1.41 ? 0.00
To compute proximity of cluster DF to other point we use MIN method. (Any method like MAX
or average can be used but the final cluster may differ in such case.)
𝑑𝐷,𝐹→𝐴 = min(𝑑𝐷𝐴 , 𝑑𝐹𝐴 ) = min(3.61, 3.20) = 3.20
𝑑𝐷,𝐹→𝐵 = min(𝑑𝐷𝐵 , 𝑑𝐹𝐵 ) = min(2.92, 2.50) = 2.50
𝑑𝐷,𝐹→𝐶 = min(𝑑𝐷𝐶 , 𝑑𝐹𝐶 ) = min(2.24, 2.50) = 2.24
𝑑𝐷,𝐹→𝐸 = min(𝑑𝐷𝐸 , 𝑑𝐹𝐸 ) = min(1.00, 1.12) = 1.00
So the table will be as:
A B C DF E
A 0.00
B 0.71 0.00
C 5.66 4.95 0.00
D F 3.20 2.50 2.24 0.00
E 4.24 3.54 1.41 1.00 0.00

Step 3: Merge A and B

AB C DF E
A B 0.00
C ? 0.00
DF ? 2.24 0.00
E ? 1.41 1.00 0.00
Now, compute proximity of AB to other points and clusters
𝑑𝐴,𝐵→𝐶 = min(𝑑𝐴𝐶 , 𝑑𝐵𝐶 ) = min(5.66, 4.95) = 4.95
𝑑𝐴,𝐵→𝐷,𝐹 = min(𝑑𝐴𝐷 , 𝑑𝐴𝐹 , 𝑑𝐵𝐷 , 𝑑𝐵𝐹 ) = min(3.61, 3.20, 2.92, 2.50) = 2.50

Arjun Lamichhane 8
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

𝑑𝐴,𝐵→𝐸 = min(𝑑𝐴,𝐸 , 𝑑𝐵𝐸 ) = min(4.24, 3.54) = 3.54


AB C DF E
A B 0.00
C 4.95 0.00
D F 2.50 2.24 0.00
E 3.54 1.41 1.00 0.00

Step 4: Merge DF and E

AB C DFE
AB 0.00
C 4.95 0.00
DFE ? ? 0.00
On computing the distances we get
𝑑𝐷,𝐹,𝐸→𝐴,𝐵 = min(𝑑𝐷𝐴 , 𝑑𝐷𝐵 , 𝑑𝐹𝐴 , 𝑑𝐹𝐵 , 𝑑𝐸𝐴 , 𝑑𝐸𝐵 ) = min(3.61, 2.92, 3.20, 2.50, 4.24, 3.54)
= 2.50
𝑑𝐷,𝐹,𝐸→𝐶 = min(𝑑𝐷𝐶 , 𝑑𝐹𝐶 , 𝑑𝐸𝐶 ) = min(2.24, 2.50, 1.41) = 1.41
AB C DFE
AB 0.00
C 4.95 0.00
D F E 2.50 1.41 0.00

Step 5: Merge DFE and C

AB DFEC
AB 0.00
DFEC ? 0.00

𝑑𝐷,𝐹,𝐸,𝐶→𝐴,𝐵 = min(𝑑𝐷,𝐴 , 𝑑𝐷,𝐵 , 𝑑𝐹,𝐴 , 𝑑𝐹,𝐵 , 𝑑𝐸,𝐴 , 𝑑𝐸,𝐵 , 𝑑𝐶,𝐴 , 𝑑𝐶,𝐵 )


= min(3.61, 2.92, 3.20, 2.50, 4.24, 3.54, 5.66, 4.95) = 2.50
So,

Arjun Lamichhane 9
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

AB DFEC
AB 0.00
D F E C 2.50 0.00

Step 6: Finally merge A B and D F E C to get all points in one single cluster.

The dendogram representing all above clusters is as follows:

Divisive Method
A divisive algorithm starts from the entire set of samples X and divides it into a partition of subsets,
then divides each subset into smaller sets, and so on until each object forms a cluster on its own or
until it satisfies certain termination conditions, such as a desired number of clusters is obtained or
the diameter of each cluster is within a certain threshold.. Thus, a divisive algorithm generates a
sequence of partitions that is ordered from a coarser one to a finer one. In this case, we need to
decide which cluster to split at each step and how to do the splitting. For a large data set, this
method is computationally expensive.
Strength
Such algorithms are typically useful for the application such as creation of a taxonomy which
requires a hierarchy of clusters. Also, there have been some studies that suggest that these
algorithms can produce better-quality clusters.
Weakness

Arjun Lamichhane 10
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

Agglomerative hierarchical clustering algorithms are expensive in terms of their computational


and storage requirements. The hierarchical clustering method, though simple, often encounters
difficulties regarding the selection of merge or split points. The fact that all merges are final can
also cause trouble for noisy, high-dimensional data, such as document data.

Density based clustering


To discover clusters with arbitrary shape, density-based clustering methods have been developed.
These typically regard clusters as dense regions of objects in the data space that are separated by
regions of low density (representing noise). DBSCAN grows clusters according to a density-based
connectivity analysis. Density-based clustering locates regions of high density that are separated
from one another by regions of low density. DBSCAN is a simple and effective density-based
clustering algorithm that illustrates a number of important concepts that are important for any
density-based clustering approach.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise


DBSCAN algorithm grows regions with sufficiently high density into clusters and discovers
clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of
density-connected points.
In the center-based approach, density is estimated for a particular point in the data set by counting
the number of points within a specified radius (Eps or ɛ) of that point. This includes the point itself.
The size of epsilon neighborhood (ε) and the minimum points in a cluster (m) are user defined
parameters.
The basic ideas of density-based clustering involve a number of new definitions.
Core points:
These points are in the interior of a density-based cluster. A point is a core point if the number of
points within a given neighborhood around the point as determined by the distance function and a
user specified distance Eps (or ɛ), exceeds a certain threshold, MinPts. In figure 3, if we suppose
minpts=7, A is a core point because its neighborhood contains seven points (including itself).
Border points:
A border point is not a core point, but falls within the neighborhood of a core point. In figure 3, B
is a border point because its neighborhood doesn’t satisfy minpts threshold but contains points
which are within neighborhood of another core point i.e. A.

Arjun Lamichhane 11
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

Noise points: noise point is any point that is neither a core point nor a border point. In figure 3, C
is a noise point because it is neither core nor border point.

Figure 3 Core, Border and Noise points

Directly density reachable points:

Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is
within the ɛ-neighborhood of q, and q is a core object.

Density reachable points

An object p is density-reachable from object q with respect to ɛ and MinPts in a set of objects, D,
if there is a chain of objects p1, . . . , pn, where p1 = q and pn = p such that pi+1 is directly density-
reachable from pi with respect to ɛ and MinPts.

Density connected points

An object p is density-connected to object q with respect to ɛ and MinPts in a set of objects, D, if


there is an object such that both p and q are density-reachable from o with respect to ɛ and MinPts.

For example:

Figure 4 Density reachability and density connectivity in density-based clustering.

Arjun Lamichhane 12
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

Consider Figure 4 for a given ɛ represented by the radius of the circles. let MinPts = 3. Based on
the above definitions:

Of the labeled points, m, p, o and r are core objects because each is in a ɛ-neighborhood containing
at least three points.

q is directly density-reachable from m and m is directly density-reachable from p

q is density-reachable from p because q is directly density-reachable from m and m is directly


density-reachable from p. However, p is not density-reachable from q because q is not a core
object. Similarly, r and s are density-reachable from o, and o is density-reachable from r.

o, r, and s are all density-connected.

Algorithm

1. Label all points as core, border, or noise points.


2. Eliminate noise points,
3. Put an edge between all core points that are within Eps (or ɛ) of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points.

Strength

Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and
can handle clusters of arbitrary shapes and sizes. Thus, DBSCAN can find many clusters that could
not be found using K-means.

Weakness

DBSCAN has trouble when the clusters have widely varying densities. It also has trouble with
high-dimensional data because density is more difficult to define for such data.

Arjun Lamichhane 13
https://genuinenotes.com
Data Mining and Data Warehousing Unit 5: Cluster Analysis

References

[1] J. Han and K. Micheline, Data Mining: Concepts and Techniques, San Francisco: Elsevier Inc., 2006.

[2] P.N. Tan, M. Steinbach and V. Kumar, INTRODUCTION TO DATA MINING, New York: PEARSON
Addison Wesley, 2006.

[3] D. T. LAROSE, DISCOVERING KNOWLEDGE IN DATA An Introduction to Data Mining, New Jersey:
John Wiley & Sons, Inc., 2005.

Arjun Lamichhane 14

You might also like