UNIT 2 DMW


UNIT 2

Clustering paradigms; Partitioning algorithms like K-Medoid, CLARA, CLARANS; Hierarchical clustering, DBSCAN, BIRCH, CURE; categorical clustering algorithms, STIRR, ROCK, CACTUS.

Cluster Analysis:
 The process of grouping a set of physical or abstract objects
into classes of similar objects is called clustering.
 A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the
objects in other clusters.
 A cluster of data objects can be treated collectively as one
group and so may be considered as a form of data
compression.
 Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.

Major Clustering Methods:


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

Partitioning Methods:
Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements:
 Each group must contain at least one object, and
 Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.

Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of
the given set of data objects. A hierarchical method can be
classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.
 The agglomerative approach, also called the bottom-up
approach, starts with each object forming a separate group. It
successively merges the objects or groups that are close to one
another, until all of the groups are merged into one or until a
termination condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step
(merge or split) is done, it can never be undone. This rigidity is
useful in that it leads to smaller computation costs by not
having to worry about a combinatorial number of different
choices.

There are two approaches to improving the quality of hierarchical clustering:
 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.

Density-based methods:
 Most partitioning methods cluster objects based on the
distance between objects. Such methods can find only
spherical-shaped clusters and encounter difficulty at
discovering clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
Grid-Based Methods:
 Grid-based methods quantize the object space into a finite
number of cells that form a grid structure.
 All of the clustering operations are performed on the grid
structure i.e., on the quantized space. The main advantage
of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent
only on the number of cells in each dimension in the
quantized space.
 STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis and is both grid-based and density-based.

Model-Based Methods:
 Model-based methods hypothesize a model for each of the
clusters and find the best fit of the data to the given model.
 A model-based algorithm may locate clusters by constructing
a density function that reflects the spatial distribution of the
data points.
 It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.
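A mixture of Gaussians is one common instance of such a model-based method. The following is a minimal sketch, assuming scikit-learn and NumPy are available and using synthetic data purely for illustration; it is not prescribed by the text above.

import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic groups of points, used only to illustrate the idea.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                  rng.normal(5.0, 1.0, (100, 2))])

# Hypothesize a model (a mixture of 2 Gaussian components) and fit it to the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
labels = gmm.predict(data)   # cluster assignment for each object
print(gmm.means_)            # fitted cluster centers
print(gmm.bic(data))         # statistics such as BIC help choose the number of clusters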

Tasks in Data Mining:


 Clustering High-Dimensional Data
 Constraint-Based Clustering

Clustering High-Dimensional Data:


 It is a particularly important task in cluster analysis
because many applications require the analysis of
objects containing a large number of features or
dimensions.
 For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions.
 Clustering high-dimensional data is challenging due to the curse of dimensionality.
 Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data.
 CLIQUE and PROCLUS are two influential subspace
clustering methods, which search for clusters in
subspaces of the data, rather than over the entire data
space.
 Frequent pattern–based clustering, another clustering
methodology, extracts distinct frequent patterns
among subsets of dimensions that occur frequently. It
uses such patterns to group objects and generate
meaningful clusters.

Constraint-Based Clustering:
 It is a clustering approach that performs clustering
by incorporation of user-specified or application-
oriented constraints.
 A constraint expresses a user’s expectation or
describes properties of the desired clustering
results, and provides an effective means for
communicating with the clustering process.
 Various kinds of constraints can be specified, either by a user or as per application requirements. Examples include spatial clustering in the presence of obstacles and clustering under user-specified constraints.
 In addition, semi-supervised clustering employs pairwise constraints in order to improve the quality of the resulting clustering.

Classical Partitioning Methods:


The most well-known and commonly used partitioning methods are
 k-Means Method
 k-Medoids Method
Centroid-Based Technique:
The K-Means Method: The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in regard to the mean value of the
objects in a cluster, which can be viewed as the cluster’s centroid or
center of gravity.
The k-means algorithm proceeds as follows.
 First, it randomly selects k of the objects, each of which initially
represents a cluster mean or center.
 Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.
 It then computes the new mean for each cluster.
 This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σi=1..k Σp∈Ci |p − mi|²

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.

The k-means partitioning algorithm:

The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
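The following is a minimal sketch of this procedure in Python/NumPy; the function name kmeans, the max_iter safeguard, and the lack of empty-cluster handling are simplifying assumptions of this sketch, not part of the algorithm definition above.

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k objects as the initial cluster means.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the mean of each cluster (empty clusters not handled here).
        new_means = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the means no longer change (criterion has converged).
        if np.allclose(new_means, means):
            break
        means = new_means
    # Square-error criterion E: sum of squared distances of objects to their cluster mean.
    e = sum(((points[labels == i] - means[i]) ** 2).sum() for i in range(k))
    return labels, means, e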
The k-Medoids Method:
The k-means algorithm is sensitive to outliers because an object with
an extremely large value may substantially distort the distribution of
data. This effect is particularly exacerbated due to the use of the
square-error function.
Instead of taking the mean value of the objects in a cluster as a
reference point, we can pick actual objects to represent the clusters,
using one representative object per cluster. Each remaining object is
clustered with the representative object to which it is the most
similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

E = Σj=1..k Σp∈Cj |p − oj|

where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
 The initial representative objects are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved.
 This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
 To determine whether a nonrepresentative object, o_random, is a good replacement for a current representative object, o_j, the following four cases are examined for each of the nonrepresentative objects.
The k-Medoids Algorithm:
The k-medoids algorithm for partitioning based on medoid or
central objects.
The k-medoids method is more robust than k-means in the
presence of noise and outliers, because a medoid is less
influenced by outliers or other extreme values than a mean.
However, its processing is more costly than the k-means
method.

Hierarchical Clustering Methods:


 A hierarchical clustering method works by grouping data
objects into a tree of clusters.
 The quality of a pure hierarchical clustering method suffers
from its inability to perform adjustment once a merge or split
decision has been executed. That is, if a particular merge or
split decision later turns out to have been a poor choice, the
method cannot backtrack and correct it.

Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.

Agglomerative hierarchical clustering:


 This bottom-up strategy starts by placing each object in its own
cluster and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single cluster or
until certain termination conditions are satisfied.
 Most hierarchical clustering methods belong to this category.
They differ only in their definition of intercluster similarity.
Divisive hierarchical clustering:
 This top-down strategy does the reverse of agglomerative
hierarchical clustering by starting with all objects in one cluster.
 It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.
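As an illustration of the agglomerative (bottom-up) case, here is a minimal sketch using SciPy's hierarchical clustering routines; the single-linkage choice, the toy points, and the cut into 2 clusters are assumptions of this example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2], [1, 3], [5, 8], [6, 8], [9, 1]])

# Each point starts as its own cluster; the closest pair of clusters is merged
# repeatedly until a single cluster remains, producing the full merge tree.
merge_tree = linkage(points, method="single")

# Cutting the tree gives a flat clustering, here with 2 clusters as the
# termination condition.
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)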

Constraint-Based Cluster Analysis:


Constraint-based clustering finds clusters that satisfy user-
specified preferences or constraints. Depending on the nature of the
constraints, constraint-based clustering may adopt rather different
approaches. There are a few categories of constraints.
 Constraints on individual objects: We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by pre-processing, after which the problem reduces to an instance of unconstrained clustering (see the sketch after this list).
 Constraints on the selection of clustering parameters: A user
may like to set a desired range for each clustering parameter.
Clustering parameters are usually quite specific to the given
clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε (the radius) and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine-tuning and processing are usually not considered a form of constraint-based clustering.
 Constraints on distance or similarity functions: We can specify
different distance or similarity functions for specific attributes
of the objects to be clustered, or different distance measures
for specific pairs of objects. When clustering sportsmen, for
example, we may use different weighting schemes for height,
body weight, age, and skill level. Although this will likely change
the mining results, it may not alter the clustering process per
se. However, in some cases, such changes may make the
evaluation of the distance function nontrivial, especially when
it is tightly intertwined with the clustering process.
 User-specified constraints on the properties of individual
clusters: A user may like to specify desired characteristics of the
resulting clusters, which may strongly influence the clustering
process.
 Semi-supervised clustering based on partial supervision: The
quality of unsupervised clustering can be significantly improved
using some weak form of supervision. This may be in the form
of pairwise constraints (i.e., pairs of objects labeled as
belonging to the same or different cluster). Such a constrained
clustering process is called semi-supervised clustering.
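As a small illustration of the first category (constraints on individual objects), the sketch below applies the user-specified price constraint as a pre-processing filter and then runs ordinary k-means on the remaining objects; the field layout, prices, coordinates, and the use of scikit-learn are all assumptions of this example.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical properties: (price, latitude, longitude).
properties = np.array([
    [2_500_000, 40.71,  -74.00],
    [1_800_000, 40.73,  -73.99],
    [  300_000, 40.65,  -73.95],
    [3_200_000, 34.05, -118.24],
    [2_900_000, 34.07, -118.20],
])

# Constraint on individual objects: cluster only mansions worth over a million dollars.
luxury = properties[properties[:, 0] > 1_000_000]

# The problem now reduces to unconstrained spatial clustering of the surviving objects.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(luxury[:, 1:])
print(labels)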

OR
K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects in one cluster have similar traits; a group of n objects is broken down into k clusters based on their similarities.
Two statisticians, Leonard Kaufman and Peter J. Rousseeuw, came up with this method. This section explains what K-Medoids do, their applications, and the difference between K-Means and K-Medoids.
K-Medoids is an unsupervised method that works with unlabelled data. It is an improved variant of the K-Means algorithm, designed mainly to reduce sensitivity to outlier data. Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.
The partitioning will be carried on such that:

1. Each cluster must have at least one object


2. An object must belong to only one cluster

Here is a small recap of K-Means clustering. In the K-Means algorithm, given the value of k and unlabelled data:

1. Choose k random points (data points from the data set or some other points). These points are also called "centroids" or "means".
2. Assign all the data points in the data set to the closest centroid by applying a distance formula such as Euclidean distance or Manhattan distance.
3. Choose new centroids by calculating the mean of all the data points in each cluster, and go back to step 2.
4. Continue until no data point changes its cluster between two iterations. (A usage sketch follows these steps.)
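These steps are what library implementations carry out internally; the following is a minimal usage sketch with scikit-learn's KMeans (the toy data and parameter values are assumptions of this example).

import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [8, 8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)  # the final centroids ("means")
print(km.labels_)           # cluster assignment after convergence
print(km.inertia_)          # within-cluster sum of squared distances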

The problem with the K-Means algorithm is that it does not handle outlier data well. An outlier is a point very different from the rest of the points. Outlier points either form clusters of their own or pull a cluster's mean (and hence its boundary) far away from the bulk of the points. Hence, K-Means clustering is highly affected by outlier data.

K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances
to other data points is minimal.

(or)

A Medoid is a point in the cluster from which dissimilarities with all the other
points in the clusters are minimal.

Instead of centroids as reference points, as in the K-Means algorithm, the K-Medoids algorithm takes a medoid as the reference point of each cluster.
There are three types of algorithms for K-Medoids clustering:

1. PAM (Partitioning Around Medoids)
2. CLARA (Clustering Large Applications)
3. CLARANS (Clustering Large Applications based upon RANdomized Search)

PAM is the most powerful of the three algorithms but has the disadvantage of time complexity. The following K-Medoids clustering is performed using PAM. In the later parts, we'll see what CLARA and CLARANS are.

Algorithm:
Given the value of k and unlabelled data (a code sketch follows these steps):

1. Choose k random points from the data and assign these k points to k clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign each point to the cluster with the nearest medoid.
3. Calculate the total cost (the sum of the distances from all the data points to their medoids).
4. Select a random non-medoid point as a new medoid and swap it with a previous medoid; repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous medoid, make the new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than that with the previous medoid, undo the swap and repeat step 4.
7. Repeat until no swap changes the classification of the data points.
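Here is a minimal from-scratch sketch of these steps, using the Manhattan distance as in the worked example that follows; the function name pam, the max_swaps stopping rule, and the random swap selection are simplifying assumptions of this sketch.

import numpy as np

def pam(points, k, max_swaps=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    # Step 1: choose k random points as the initial medoids.
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(medoid_idx):
        # Steps 2-3: assign every point to its nearest medoid (Manhattan distance)
        # and sum the distances.
        dists = np.abs(points[:, None, :] - points[medoid_idx][None, :, :]).sum(axis=2)
        return dists.min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_swaps):
        # Step 4: swap a random medoid with a random non-medoid point.
        i = int(rng.integers(k))
        candidate = int(rng.choice([p for p in range(n) if p not in medoids]))
        trial = medoids.copy()
        trial[i] = candidate
        trial_cost = total_cost(trial)
        # Steps 5-6: keep the swap only if the total cost decreases, otherwise undo it.
        if trial_cost < cost:
            medoids, cost = trial, trial_cost
    # Step 7 is approximated here by a fixed number of attempted swaps.
    return points[medoids], cost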

Here is an example to make the theory clear. The example uses a small set of data points, indexed 0 to 4; the scatter plot of these points is omitted here. If k is given as 2, we need to break the data points down into 2 clusters.

1. Initial medoids: M1(1, 3) and M2(4, 9).
2. Calculate distances using the Manhattan distance: |x1 - x2| + |y1 - y2|.
   Cluster 1: point 0; Cluster 2: points 1, 3.
   Total cost: (5) + (5 + 7) = 17.
3. Try a random non-medoid point (5, 4) as a new medoid, giving M1(5, 4) and M2(4, 9).
   Cluster 1: points 2, 3; Cluster 2: point 1.
   Total cost: (5 + 5) + 5 = 15, which is less than the previous cost, so the swap is kept and (5, 4) becomes the new medoid.
4. Try a random non-medoid point (7, 7) as a new medoid, giving M1(5, 4) and M2(7, 7).
   Cluster 1: point 4; Cluster 2: points 0, 2.
   Total cost: (5) + (5 + 10) = 20, which is greater than the previous cost, so the swap is undone.
5. After the remaining candidate swaps are examined (not reproduced here), the final medoids are M1(5, 4) and M2(7, 7), with Cluster 1: point 2, Cluster 2: points 3, 4, and a total cost of 12.
Limitation of PAM:

Time complexity: O(k * (n - k)²)

Possible combinations for every node: k * (n - k)

Cost for each computation: (n - k)

Total cost: k * (n - k)²

Hence, PAM is suitable and recommended to be used for small data sets.

CLARA:

CLARA (Clustering Large Applications) − The difference between the PAM and CLARA algorithms is that the latter is based upon sampling. Only a small portion of the real data is chosen as a representative of the data, and medoids are chosen from this sample using PAM.

The idea is that if the sample is selected in a fairly random manner, then it correctly represents the whole dataset, and therefore the representative objects (medoids) chosen will be similar to those chosen from the whole dataset. CLARA draws several samples and outputs the best clustering out of these samples. CLARA can deal with larger datasets than PAM. The complexity of each iteration now becomes O(kS² + k(n − k)), where S is the size of the sample.

OR
It is an extension of PAM to support medoid clustering for large data sets. This algorithm selects data samples from the data set, applies PAM on each sample, and outputs the best clustering out of these samples. This is more efficient than PAM on large data. We should ensure that the selected samples aren't biased, as they affect the clustering of the whole data.
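A minimal sketch of this sampling idea, reusing the pam() sketch given earlier (so the helper name, the number of samples, and the sample size are all assumptions of this example): PAM is run on each random sample, and the medoids that give the lowest total cost over the whole data set are kept.

import numpy as np

def clara(points, k, num_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(num_samples):
        # Draw a small random sample intended to represent the whole data set.
        idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
        medoids, _ = pam(points[idx], k)  # run the PAM sketch on the sample only
        # Evaluate the sampled medoids against the entire data set.
        dists = np.abs(points[:, None, :] - medoids[None, :, :]).sum(axis=2)
        cost = dists.min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost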

CLARANS:

The CLARANS algorithm combines features of both PAM and CLARA: it searches only a subset of the dataset, but it does not confine itself to any fixed sample at any given time. While CLARA draws a fixed sample at each phase of the search, CLARANS draws a sample with some randomness in every phase of the search.

The clustering process can be presented as searching a graph where each node is a possible solution, i.e., a set of k medoids. The clustering obtained after replacing a single medoid is called a neighbor of the current clustering.

OR
This algorithm selects a sample of neighbors to examine instead of selecting samples from the data set. In every step, it examines randomly chosen neighbors of the current node. The time complexity of this algorithm is O(n²), and it is regarded as the most efficient of the medoid-based algorithms.
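A minimal sketch of this randomized neighbor search (a node is a set of k medoids, and a neighbor is the same set with one medoid replaced); the parameter names num_local and max_neighbor follow the usual description of CLARANS, while the Manhattan-distance cost and the toy stopping rules are assumptions of this sketch.

import numpy as np

def clarans(points, k, num_local=2, max_neighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)

    def cost(medoid_idx):
        dists = np.abs(points[:, None, :] - points[medoid_idx][None, :, :]).sum(axis=2)
        return dists.min(axis=1).sum()

    best_medoids, best_cost = None, np.inf
    for _ in range(num_local):                  # several restarts from random nodes
        current = list(rng.choice(n, size=k, replace=False))
        current_cost = cost(current)
        examined = 0
        while examined < max_neighbor:
            # Draw a random neighbor: replace one medoid with a random non-medoid.
            neighbor = current.copy()
            neighbor[int(rng.integers(k))] = int(
                rng.choice([p for p in range(n) if p not in current]))
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:    # move to the better neighbor
                current, current_cost = neighbor, neighbor_cost
                examined = 0
            else:
                examined += 1
        if current_cost < best_cost:
            best_medoids, best_cost = list(current), current_cost
    return points[best_medoids], best_cost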

DBSCAN
The DBSCAN algorithm can efficiently cluster densely grouped points into one cluster. It can identify regions of high local density among large datasets. DBSCAN can handle outliers very effectively. An advantage of DBSCAN over the K-Means algorithm is that the number of clusters need not be known beforehand in the case of DBSCAN.

The DBSCAN algorithm depends upon two parameters, epsilon and minPoints.

Epsilon is the radius of the neighborhood considered around each data point.

minPoints is the minimum number of points required within that radius for the data point to become a core point.

The epsilon neighborhood, pictured as a circle in two dimensions, extends naturally to higher dimensions.

Working of DBSCAN Algorithm

In the DBSCAN algorithm, a circle with radius epsilon is drawn around each data point, and the data point is classified as a Core Point, Border Point, or Noise Point. A data point is classified as a core point if it has at least minPoints data points within its epsilon radius. If it has fewer than minPoints points but lies within the epsilon radius of a core point, it is known as a Border Point, and if it is neither a core point nor a border point, it is considered a Noise Point.

Let us understand the working through an example (the figure referred to below is omitted here).

In the figure, point A has no other points inside its epsilon (e) radius, hence it is a Noise Point. Point B has minPoints (= 4) points within its epsilon radius, thus it is a Core Point. The remaining marked point has only 1 point (fewer than minPoints) within its radius but lies in the neighborhood of a core point, hence it is a Border Point.

Steps Involved in DBSCAN Algorithm.

 First, all the points within epsilon radius of each point are found, and the core points are identified as those with a number of neighbors greater than or equal to minPoints.
 Next, for each core point not yet assigned to a cluster, a new cluster is created.
 All the points density-connected to the core point are found and assigned to the same cluster. Two points are called density-connected if there is a core point that has both of them within epsilon distance (possibly through a chain of such core points).
 Then all the points in the data are iterated over, and the points that do not belong to any cluster are marked as noise. (A usage sketch follows.)
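A minimal usage sketch of these steps with scikit-learn's DBSCAN; the eps and min_samples values and the toy data are assumptions of this example.

import numpy as np
from sklearn.cluster import DBSCAN

data = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# eps corresponds to epsilon (the neighborhood radius) and min_samples to
# minPoints in the description above.
db = DBSCAN(eps=3, min_samples=2).fit(data)
print(db.labels_)                 # cluster index per point; -1 marks noise
print(db.core_sample_indices_)    # indices of the core points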

BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a hierarchical clustering algorithm that is designed to handle large datasets efficiently. The algorithm builds a tree-like structure of clusters by recursively partitioning the data into sub-clusters until a stopping criterion is met.

BIRCH uses two main data structures to represent the clusters: the Clustering Feature (CF) and the CF tree. A CF summarizes the statistical properties of a set of data points, while the CF tree represents the hierarchy of sub-clusters.

BIRCH clustering has three main steps −

 Initialization − BIRCH constructs an empty tree structure and sets the maximum number of CF entries that can be stored in a node.
 Clustering − BIRCH reads the data points one by one and adds them to the tree structure. If a suitable CF entry is already present in a node, BIRCH updates that CF with the new data point; if there is no suitable CF in the node, BIRCH creates a new CF for the data point. BIRCH then checks whether the number of CFs in the node exceeds the maximum allowed. If it does, BIRCH splits the node, partitioning its CF entries into new sub-clusters.
 Refinement − BIRCH refines the tree structure by merging sub-clusters that are similar according to a distance metric.

OR

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of clustering the larger dataset. BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present.

Before we implement BIRCH, we must understand two important terms: the Clustering Feature (CF) and the CF tree.

Clustering Feature (CF): BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the squared sum of the data points in the cluster. It is possible for a CF entry to be composed of other CF entries.

CF Tree: The CF tree is the actual compact representation we have been speaking of so far. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of the CF entries in the child nodes. The number of entries a node can hold is bounded, and whether a new point can be absorbed into an existing leaf sub-cluster is controlled by a threshold value.

Parameters of the BIRCH algorithm (as used in scikit-learn's Birch implementation; see the usage sketch below):
 threshold : the maximum radius a sub-cluster in a leaf node of the CF tree may have after absorbing a new point; if absorbing the point would exceed this radius, a new sub-cluster is started.
 branching_factor : this parameter specifies the maximum number of CF sub-clusters in each node (internal node).
 n_clusters : the number of clusters to be returned after the entire BIRCH algorithm is complete, i.e., the number of clusters after the final clustering step. If set to None, the final clustering step is not performed and the intermediate clusters are returned.
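A minimal usage sketch of BIRCH with scikit-learn's Birch estimator, using the three parameters described above; the parameter values and the synthetic data are assumptions of this example.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(data)
print(birch.labels_[:10])              # cluster assignment for the first few points
print(len(birch.subcluster_centers_))  # number of CF sub-clusters in the tree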

CURE

CURE (Clustering Using REpresentatives)

 It is a hierarchical clustering technique that adopts a middle ground between the centroid-based and the all-points extremes. Hierarchical clustering is a type of clustering that starts with single-point clusters and keeps merging clusters until the desired number of clusters is formed.
 It can identify both spherical and non-spherical clusters.
 It is useful for discovering groups and identifying interesting distributions in the underlying data.
 Instead of using a single centroid point, as in most data mining algorithms, CURE uses a set of well-scattered representative points for efficiently handling the clusters and eliminating the outliers.
 Idea: A random sample, say of size s, is drawn from the given data. This random sample is divided into, say, p partitions of size s/p each. Each partition is partially clustered into, say, s/(pq) clusters. Outliers are discarded/eliminated from these partially clustered partitions. The partially clustered partitions are then clustered again, and finally the data on disk is labelled with the resulting clusters.

 Procedure (a code sketch of the representative-point idea follows):
1. Select the number of representative points, say c.
2. Choose c well-scattered points in a cluster.
3. Shrink these scattered points towards the centroid by a fraction.
4. These points are used as representatives of the clusters in the Dmin (minimum distance) cluster-merging approach. In this approach, the distance between two clusters is the minimum distance between a representative point of one cluster and a representative point of the other; the pair of clusters with the least such distance is merged.
5. After every such merge, new representative points are selected to represent the new cluster.
6. Cluster merging stops when the target number of clusters, say k, is reached.
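A minimal sketch of steps 2-4 for a single cluster: pick c well-scattered points (here via a farthest-point heuristic, an assumption of this sketch), shrink them towards the centroid by a factor alpha, and define the Dmin distance between two clusters from their representatives.

import numpy as np

def representatives(cluster_points, c=4, alpha=0.5):
    centroid = cluster_points.mean(axis=0)
    # Pick c well-scattered points greedily (farthest-point heuristic).
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster_points)):
        dists = np.min([np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(dists)])
    # Shrink the scattered points towards the centroid to dampen the effect of outliers.
    return np.array([r + alpha * (centroid - r) for r in reps])

def dmin(reps_a, reps_b):
    # Minimum distance between any pair of representative points of two clusters;
    # the pair of clusters with the smallest Dmin is merged next.
    return np.min(np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2))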

Categorical clustering algorithms: STIRR, ROCK, CACTUS

For details of these algorithms, see: https://www.slideshare.net/slideshow/clustering-in-data-miningpdf/252172468#20
