DM UNIT-5 NOTES
Cluster Analysis: Basic Concepts and Methods - Cluster Analysis, Partitioning Methods, Hierarchical Methods, and Evaluation of Clustering
Hierarchical clustering
Hierarchical clustering is a data mining method that groups similar data points into clusters by creating a hierarchical structure of clusters. At each step, identify the two closest clusters and merge them; these steps continue until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. This hierarchy is represented graphically by a dendrogram, a tree-like diagram that records the sequences of merges or splits.
A hierarchical method can be classified as being either agglomerative or divisive.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a
separate group. It successively merges the objects or groups close to one another, until all the groups are
merged into one (the topmost level of the hierarchy), or a termination condition holds.
The divisive approach, also called the top-down approach, starts with all the objects in the same
cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is
in one cluster, or a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone.
A density-based method clusters objects based on the notion of density. It grows clusters either according to
the density of neighborhood objects (e.g., in DBSCAN) or according to a density function (e.g., in DENCLUE).
OPTICS is a density-based method that generates an augmented ordering of the data’s clustering structure.
A grid-based method first quantizes the object space into a finite number of cells that form a grid structure,
and then performs clustering on the grid structure. STING is a typical example of a grid-based method based on
statistical information stored in grid cells. CLIQUE is a grid-based and subspace clustering algorithm.
The k-means objective is to minimize the within-cluster sum of squared errors,
E = Σ_{i=1}^{k} Σ_{p∈Ci} dist(p, ci)²,
where E is the sum of the squared error for all objects in the data set; p is the point in space representing a given object; and ci is the centroid of cluster Ci (both p and ci are multidimensional).
“How does the k-means algorithm work?”
The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. It
proceeds as follows.
First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center. For
each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the
Euclidean distance between the object and the cluster mean.
The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the
new mean using the objects assigned to the cluster in the previous iteration.
All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue
until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in
the previous round.
The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative
relocation
The k-means procedure:
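As an illustration of the procedure described above, here is a minimal NumPy sketch (the function name kmeans, the toy data set, and parameter values such as max_iter are illustrative assumptions, not taken from these notes):

import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects from D as the initial cluster centers.
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster whose center is closest (Euclidean distance).
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each cluster mean using the objects assigned to it in this iteration.
        new_centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop when the assignment (and hence the means) no longer changes.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example usage on a tiny 2-D data set with k = 2.
D = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
labels, centers = kmeans(D, k=2)
print(labels, centers)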
The time complexity of the k-means algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. Therefore, the method is relatively scalable and efficient in processing large data sets.
There are several variants of the k-means method. These can differ in the selection of the initial k-means, the
calculation of dissimilarity, and the strategies for calculating cluster means.
The k-means method can be applied only when the mean of a set of objects is defined. This may not be the case
in some applications such as when data with nominal attributes are involved.
The k-modes method is a variant of k-means, which extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes. It uses new dissimilarity measures to deal with nominal objects and a frequency-based method to update the modes of clusters.
Example: Clustering by k-means partitioning. Consider a set of objects located in 2-D space, as depicted in the following figure. Let k = 2, that is, the user would like the objects to be partitioned into two clusters.
Drawbacks of k-means:
The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes.
Moreover, it is sensitive to noise and outlier data points because a small number of such data can
substantially influence the mean value.
“How can we make the k-means algorithm more scalable?”
One approach to making the k-means method more efficient on large data sets is to use a good-sized set
of samples in clustering.
Another is to employ a filtering approach that uses a spatial hierarchical data index to save costs when
computing means.
A third approach explores the microclustering idea, which first groups nearby objects into
“microclusters” and then performs k-means clustering on the microclusters.
How to choose the value of "K number of clusters" in K-means Clustering?
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. For 3 clusters, the WCSS value is calculated as
WCSS = Σ_{p∈Cluster1} dist(p, c1)² + Σ_{p∈Cluster2} dist(p, c2)² + Σ_{p∈Cluster3} dist(p, c3)²,
where ci is the centroid of cluster i.
To find the optimal number of clusters, the elbow method follows these steps:
• It runs K-means clustering on the given dataset for different K values (e.g., K ranging from 1 to 10) and calculates the WCSS value for each K.
• It plots a curve of the calculated WCSS values against the number of clusters K.
• The point of sharp bend in the plot, where the curve starts to look like an arm, is considered the best value of K.
• Because the graph shows a sharp bend that looks like an elbow, the technique is known as the elbow method; a sketch that computes and plots such a curve is given below.
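A hedged sketch of the elbow method using scikit-learn (assuming scikit-learn and matplotlib are installed); KMeans exposes the WCSS of a fitted model through its inertia_ attribute, and the synthetic data from make_blobs are purely illustrative:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):                       # K values from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                 # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()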
The k-medoids method instead minimizes the absolute error,
E = Σ_{i=1}^{k} Σ_{p∈Ci} dist(p, oi),
where E is the sum of the absolute error for all objects p in the data set, and oi is the representative object of Ci. This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error.
The Partitioning Around Medoids (PAM) algorithm is a popular realization of k-medoids clustering. It
tackles the problem in an iterative, greedy way. Like the k-means algorithm, the initial representative objects
(called seeds) are chosen arbitrarily. We consider whether replacing a representative object by a non-
representative object would improve the clustering quality. All the possible replacements are tried out.
The iterative process of replacing representative objects by other objects continues until the quality of the
resulting clustering cannot be improved by any replacement.
This quality is measured by a cost function of the average dissimilarity between an object and the
representative object of its cluster.
PAM, a k-medoids partitioning algorithm:
Specifically, let o1,...,ok be the current set of representative objects (i.e., medoids).
To determine whether a non-representative object, denoted by o random, is a good replacement for a current
medoid oj (1 ≤ j ≤ k),
we calculate the distance from every object p to the closest object in the set {o1,...,oj−1,orandom,oj+1,...,ok}, and
use the distance to update the cost function.
The reassignments of objects to {o1,...,oj−1,orandom,oj+1,...,ok} are simple. Suppose object p is currently
assigned to a cluster represented by medoid oj .
Do we need to reassign p to a different cluster if oj is being replaced by orandom?
Object p needs to be reassigned to either orandom or some other cluster represented by oi (i ≠ j), whichever is the closest.
Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost function. Therefore, the cost function calculates the difference in absolute-error value if a current representative object is replaced by a nonrepresentative object. The total cost of swapping is the sum of the costs incurred by all nonrepresentative objects; if the total cost is negative, oj is replaced by orandom, since this reduces the actual absolute error E.
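To make the swap evaluation concrete, here is a small illustrative sketch (the helper names total_cost and swap_cost, the toy data, and the chosen indices are assumptions, not from these notes); it measures how the absolute error E changes when one medoid is replaced by a non-representative object:

import numpy as np

def total_cost(D, medoid_idx):
    # Absolute error E: each object contributes its distance to the closest medoid.
    dists = np.linalg.norm(D[:, None, :] - D[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def swap_cost(D, medoids, j, o_random):
    # Cost of replacing medoid j by the non-representative object o_random.
    # A negative value means the swap reduces the absolute error and should be kept.
    candidate = list(medoids)
    candidate[j] = o_random
    return total_cost(D, candidate) - total_cost(D, medoids)

# Example usage on a tiny 2-D data set with k = 2 medoids (given as indices into D).
D = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
medoids = [0, 3]
print(swap_cost(D, medoids, j=1, o_random=4))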
Hierarchical Methods:
Agglomerative versus Divisive Hierarchical Clustering
Distance Measures in Algorithmic Methods
BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
Probabilistic Hierarchical Clustering
Hierarchical Methods:
A hierarchical clustering method works by grouping data objects into a hierarchy or “tree” of clusters.
Representing data objects in the form of a hierarchy is useful for data summarization and visualization.
A hierarchical method creates a hierarchical decomposition of the given set of data objects. The method can
be classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical
decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical
agglomeration can be improved by analyzing object linkages at each hierarchical partitioning (e.g., in
Chameleon), or by first performing microclustering (that is, grouping objects into “microclusters”) and then
operating on the microclusters with other clustering techniques such as iterative relocation (as in BIRCH).
For example, as the manager of human resources at AllElectronics, you may organize your employees into major groups such as executives, managers, and staff, and then further partition these groups into smaller subgroups.
Figure-2 : Dendrogram representation for hierarchical clustering of data objects {a,b,c,d, e}.
DIANA, the divisive method, proceeds in the contrasting way. All the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
When an algorithm uses the minimum distance, dmin(Ci,Cj) = min{|p − p′| : p ∈ Ci, p′ ∈ Cj}, to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm.
Example :
Single versus complete linkages. Let us apply hierarchical clustering to the data set of Figure 10.8(a). Figure
10.8(b) shows the dendrogram using single linkage. Figure 10.8(c) shows the case using complete linkage,
where the edges between clusters {A,B,J,H} and {C,D,G,F,E} are omitted for ease of presentation. This example shows that using single linkage we can find hierarchical clusters defined by local proximity, whereas complete linkage tends to find clusters that emphasize global closeness.
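A minimal sketch of single versus complete linkage using SciPy's hierarchical clustering utilities (the 2-D points are illustrative and are not the points A-J of Figure 10.8):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, method in zip(axes, ["single", "complete"]):
    Z = linkage(X, method=method)   # "single" uses dmin, "complete" uses dmax between clusters
    dendrogram(Z, ax=ax)            # draw the corresponding dendrogram
    ax.set_title(method + " linkage")
plt.show()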
Given a cluster of n d-dimensional points, the clustering feature (CF) of the cluster is the 3-tuple CF = <n, LS, SS>, where LS is the linear sum of the n points and SS is the square sum of the n points. A clustering feature is essentially a summary of the statistics for the given cluster. Using a clustering feature, we can easily derive many useful statistics of a cluster. For example, the cluster's centroid, x0, radius, R, and diameter, D, are
x0 = LS / n,
R = sqrt( Σ_{i=1}^{n} (xi − x0)² / n ),
D = sqrt( Σ_{i=1}^{n} Σ_{j=1}^{n} (xi − xj)² / (n(n − 1)) ).
Here, R is the average distance from member objects to the centroid, and D is the average pairwise distance within a cluster. Both R and D reflect the tightness of the cluster around the centroid.
Summarizing a cluster using the clustering feature can avoid storing the detailed information about individual
objects or points. Instead, we only need a constant size of space to store the clustering feature. This is the key to
BIRCH efficiency in space. Moreover, clustering features are additive. That is, for two disjoint clusters, C1 and
C2, with the clustering features CF1 = <n1,LS1,SS1> and CF2 = <n2,LS2,SS2>, respectively, the clustering feature for the cluster formed by merging C1 and C2 is simply
CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>.
Example: Clustering feature. Suppose there are three points, (2,5), (3,2), and (4,3), in a cluster, C1. The clustering feature of C1 is
CF1 = <3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)> = <3, (9,10), (29,38)>.
Suppose that C1 is disjoint from a second cluster, C2, where CF2 = <3, (35,36), (417,440)>. The clustering feature of a new cluster, C3, that is formed by merging C1 and C2, is derived by adding CF1 and CF2. That is,
CF3 = <3 + 3, (9 + 35, 10 + 36), (29 + 417, 38 + 440)> = <6, (44,46), (446,478)>.
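The example above can be checked with a short sketch that computes clustering features and adds them component-wise (the helper name clustering_feature is illustrative):

import numpy as np

def clustering_feature(points):
    points = np.asarray(points, dtype=float)
    n = len(points)
    LS = points.sum(axis=0)           # linear sum of the points
    SS = (points ** 2).sum(axis=0)    # square sum of the points (per dimension, as in the example)
    return n, LS, SS

CF1 = clustering_feature([(2, 5), (3, 2), (4, 3)])       # -> (3, [9, 10], [29, 38])
CF2 = (3, np.array([35, 36]), np.array([417, 440]))

# Additivity: the CF of the merged cluster C3 is the component-wise sum of CF1 and CF2.
CF3 = (CF1[0] + CF2[0], CF1[1] + CF2[1], CF1[2] + CF2[2])
print(CF3)                            # -> (6, [44, 46], [446, 478])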
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
An example is shown in Figure. By definition, a nonleaf node in a tree has descendants or “children.” The nonleaf
nodes store sums of the CFs of their children, and thus summarize clustering information about their children.
Parameters of the BIRCH Algorithm:
threshold: the maximum diameter (or radius) that a sub-cluster in a leaf node of the CF tree may have; a new point is absorbed into the closest leaf sub-cluster only if the merged sub-cluster still satisfies this threshold, otherwise a new sub-cluster is started.
branching_factor : This parameter specifies the maximum number of CF sub-clusters in each node
(internal node).
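A minimal usage sketch of these parameters with scikit-learn's Birch implementation (the synthetic data and the specific parameter values are illustrative):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# threshold bounds the size of a leaf sub-cluster; branching_factor bounds the CF entries per node.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)
print(labels[:10])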
Chameleon decides the similarity between each pair of clusters, Ci and Cj, based on their relative interconnectivity, RI(Ci,Cj), and their relative closeness, RC(Ci,Cj). The relative interconnectivity between Ci and Cj is their absolute interconnectivity, normalized with respect to the internal interconnectivity of the two clusters:
RI(Ci,Cj) = |EC{Ci,Cj}| / ( (|ECCi| + |ECCj|) / 2 ),
where EC{Ci,Cj} is the edge cut, as previously defined, for a cluster containing both Ci and Cj. Similarly, ECCi (or ECCj) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly equal parts.
The relative closeness, RC(Ci,Cj), between a pair of clusters, Ci and Cj, is the absolute closeness between Ci and Cj, normalized with respect to the internal closeness of the two clusters, Ci and Cj. It is defined as
RC(Ci,Cj) = SEC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|))·SECCi + (|Cj| / (|Ci| + |Cj|))·SECCj ),
where SEC{Ci,Cj} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and SECCi (or SECCj) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj).
Probabilistic Hierarchical Clustering:
Probabilistic hierarchical clustering aims to overcome some of the disadvantages of algorithmic hierarchical clustering methods by using probabilistic models to measure distances between clusters. A probabilistic hierarchical clustering method can adopt the agglomerative clustering framework, but it uses probabilistic models to measure the distance between clusters.
A probabilistic model is an unsupervised technique that helps us solve density estimation or “soft” clustering
problems.
In probabilistic clustering, data points are clustered based on the likelihood that they belong to a particular
distribution.
The Gaussian Mixture Model (GMM) is the one of the most commonly used probabilistic clustering methods.
Generative models are statistical models that assume the observed data are generated by some underlying probability distribution; by estimating this distribution from the data, they can generate new data whose distribution is similar to the original one.
For example, when we conduct clustering analysis on a set of marketing surveys, we assume that the surveys
collected are a sample of the opinions of all possible customers. Here, the data generation mechanism is a
probability distribution of opinions with respect to different customers, which cannot be obtained directly and
completely. The task of clustering is to estimate the generative model as accurately as possible using the
observed data objects to be clustered.
In practice, we can assume that the data generative models adopt common distribution functions, such as
Gaussian distribution or Bernoulli distribution, which are governed by parameters. The task of learning a
generative model is then reduced to finding the parameter values for which the model best fits the observed
data set.
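As a concrete illustration of fitting such a generative model, here is a short sketch using scikit-learn's GaussianMixture (the synthetic data and the choice of three components are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)
hard_labels = gmm.predict(X)         # most likely mixture component for each object
soft_labels = gmm.predict_proba(X)   # probability of belonging to each component ("soft" clustering)
print(hard_labels[:5])
print(soft_labels[:5].round(3))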
Fig: Merging clusters in probabilistic hierarchical clustering: (a) Merging clusters C1 and C2 leads to an
increase in overall cluster quality, but merging clusters (b) C3 and (c) C4 does not.
Measuring clustering tendency (the Hopkins statistic), step 2: Sample n points, q1, . . . , qn, uniformly from D. For each qi (1 ≤ i ≤ n), we find the nearest neighbor of qi in D − {qi}, and let yi be the distance between qi and its nearest neighbor in D − {qi}. That is,
yi = min{ dist(qi, v) : v ∈ D, v ≠ qi }.
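A short sketch of this step, computing the yi distances with scikit-learn's NearestNeighbors (the random data set D and the sample size n are illustrative assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
D = rng.random((200, 2))                             # illustrative data set D
n = 10
q_idx = rng.choice(len(D), size=n, replace=False)    # sample n points q1..qn uniformly from D

nn = NearestNeighbors(n_neighbors=2).fit(D)
dist, _ = nn.kneighbors(D[q_idx])                    # the closest neighbor of qi is qi itself,
y = dist[:, 1]                                       # so yi is the distance to the second closest
print(y)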
Cluster homogeneity: This requires that the purer the clusters in a clustering are, the better the clustering. Suppose that ground truth says that the objects in a data set, D, can belong to categories L1,...,Ln. Consider clustering C1, wherein a cluster C ∈ C1 contains objects from two categories, Li and Lj (1 ≤ i < j ≤ n). Also consider clustering C2, which is identical to C1 except that C is split into two clusters containing the objects in Li and Lj, respectively. A clustering quality measure, Q, respecting cluster homogeneity should give a higher score to C2 than to C1, that is, Q(C2,Cg) > Q(C1,Cg), where Cg is the ground-truth clustering.
Cluster completeness: Cluster completeness is the counterpart of cluster homogeneity. It requires that if, according to ground truth, two data objects belong to the same category, then they should be assigned to the same cluster. Cluster completeness is high when objects of the same category end up in the same cluster.
Rag bag: In some situations, there are a few objects that cannot be merged with the objects of any particular category (miscellaneous or "other" objects). The quality of a clustering with respect to such objects is measured by the rag bag criterion: a heterogeneous object should be put into a rag bag (miscellaneous) category rather than into an otherwise pure cluster.
Small cluster preservation: If a small category is split into small pieces in a clustering, those pieces can easily become noise, and the small category can then no longer be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces.
Many clustering quality measures satisfy some of these four criteria. Here, we introduce the BCubed precision
and recall metrics, which satisfy all four criteria.
BCubed evaluates the precision and recall for every object in a clustering on a given data set according to
ground truth.
The precision of an object indicates how many other objects in the same cluster belong to the same category
as the object.
The recall of an object reflects how many objects of the same category are assigned to the same cluster.
Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering C is given by
Correctness(oi, oj) = 1 if oi and oj are in the same cluster if and only if they belong to the same category, and 0 otherwise.
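Based on the description above, here is an illustrative sketch of BCubed precision and recall averaged over all objects (the function name bcubed and the small example labels are assumptions; the object itself is counted, as in the standard BCubed definition):

def bcubed(cluster_labels, category_labels):
    n = len(cluster_labels)
    precisions, recalls = [], []
    for i in range(n):
        same_cluster = [j for j in range(n) if cluster_labels[j] == cluster_labels[i]]
        same_category = [j for j in range(n) if category_labels[j] == category_labels[i]]
        correct = [j for j in same_cluster if category_labels[j] == category_labels[i]]
        precisions.append(len(correct) / len(same_cluster))  # same-category share of oi's cluster
        recalls.append(len(correct) / len(same_category))    # share of oi's category in its cluster
    return sum(precisions) / n, sum(recalls) / n

# Example usage with a ground-truth category per object and a clustering to evaluate.
categories = ["L1", "L1", "L1", "L2", "L2"]
clusters = [0, 0, 1, 1, 1]
print(bcubed(clusters, categories))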
For an object o, let a(o) be the average distance between o and all the other objects in the cluster to which o belongs, and let b(o) be the minimum average distance from o to the objects of each cluster to which o does not belong. The silhouette coefficient of o is then
s(o) = (b(o) − a(o)) / max{a(o), b(o)}.
The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
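A minimal sketch of computing silhouette values with scikit-learn (the synthetic data and the use of KMeans to produce a clustering are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

print(silhouette_samples(X, labels)[:5])   # per-object silhouette values in [-1, 1]
print(silhouette_score(X, labels))         # mean silhouette over all objects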