Unit 4: DMW


III BCA

Clustering

Unit 4: Clustering

Clustering is the process of grouping a set of data objects into multiple groups
or clusters so that objects within a cluster are highly similar to one another,
yet very dissimilar to objects in other clusters.

• Clustering as a data mining tool has its roots in many application areas such
as biology, security, business intelligence, and Web search.

Cluster Analysis
Cluster analysis, or simply clustering, is the process of partitioning a set of data
objects (or observations) into subsets.

• Each subset is a cluster, such that objects in a cluster are similar to one
another, yet dissimilar to objects in other clusters.
• The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
• Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
• In business intelligence, clustering can be used to organize a large number of
customers into groups, where customers within a group share strongly similar
characteristics.

Requirements for Cluster Analysis

The following are typical requirements of clustering in data mining:

❖ Scalability: Clustering on only a sample of a given large data set may lead to
biased results. Therefore, highly scalable clustering algorithms are needed.
❖ Ability to deal with different types of attributes: Recently, more and more
applications need clustering techniques for complex data types such as graphs,
sequences, images, and documents.

❖ Discovery of clusters with arbitrary shape: Many clustering algorithms determine
clusters based on Euclidean or Manhattan distance measures. Algorithms based on such
distance measures tend to find spherical clusters with similar size and density.
However, a cluster could be of any shape.

❖ Requirements for domain knowledge to determine input parameters: Many clustering
algorithms require users to provide domain knowledge in the form of input parameters,
such as the desired number of clusters.
❖ Ability to deal with noisy data: Most real-world data sets contain outliers or
missing, unknown, or erroneous data. Clustering algorithms can be sensitive to such
noise and may produce poor-quality clusters. Therefore, we need clustering methods
that are robust to noise.
❖ Constraint-based clustering: Real-world applications may need to perform clustering
under various kinds of constraints.
❖ Interpretability and usability: Users want clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied in with specific
semantic interpretations and applications.
❖ High dimensionality: A database or a data warehouse can contain several dimensions
or attributes.
❖ Incremental clustering and insensitivity to the order of input records: It is
important to develop incremental clustering algorithms and algorithms that are
insensitive to the order of input.

The following are orthogonal aspects with which clustering methods can be compared:

The partitioning criteria: In some methods, all the objects are partitioned so that
no hierarchy exists among the clusters; other methods partition data objects
hierarchically, where clusters can be formed at different semantic levels.

Separation of clusters: Some methods partition data objects into mutually exclusive
clusters, while in other methods an object may belong to more than one cluster.

Similarity measure: Some methods determine the similarity between two objects by the
distance between them. Such a distance can be defined on Euclidean space, a road
network, a vector space, or any other space.

Clustering space: Many clustering methods search for clusters within the entire given
data space, while subspace clustering methods search for clusters only within
subspaces of the data.


Clustering Methods
The major fundamental clustering methods can be classified into the following
categories: partitioning methods, hierarchical methods, density-based methods, and
grid-based methods.

Partitioning methods:
The simplest and most fundamental version of cluster analysis is partitioning,
which organizes the objects of a set into several exclusive groups or clusters.

• To keep the problem specification concise, we can assume that the number
of clusters is given as background knowledge. This parameter is the
starting point for partitioning methods.
• Formally, given a data set, D, of n objects, and k, the number of clusters to
form, a partitioning algorithm organizes the objects into k partitions (k ≤
n), where each partition represents a cluster.
• The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects
within a cluster are “similar” to one another and “dissimilar” to objects in
other clusters in terms of the data set attributes.
• A commonly used partitioning method is k-means.

K-Means Clustering

K-Means clustering is an unsupervised learning algorithm which groups the unlabeled
dataset into different clusters.

Working of K-Means

The K-Means algorithm, a principal player in partitioning methods of data mining,
operates through a series of clear steps that move from basic data grouping to
detailed cluster analysis.


• Initialization − Specify the number of clusters 'K' to be created. This is an
integral step for the successful execution of the K-Means algorithm.
• Random Centroid Selection − In this phase, 'K' centroids are chosen
randomly, where 'K' denotes the pre-defined number of clusters.
• Assign Objects to Nearest Cluster − The algorithm then assigns each object
in the dataset to its nearest centroid based on distance measures such as
Euclidean or Manhattan distance.
• Re-calculate Centroids − Once all objects are assigned, the position of the
'K' centroids is re-calculated. This is done by computing the mean value of all
objects within each cluster.
• Repeat Steps 3 and 4 − These two steps are iteratively repeated until there
is no change in the clusters, meaning objects stay within the same cluster
during consecutive iterations.
• Stop Criterion − The process stops when there is no switch of data points
between two different groups or clusters and centroids remain static.
Example:
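A minimal sketch of the steps above, assuming a small synthetic two-dimensional
dataset; the helper function below uses plain NumPy and is illustrative rather than
an optimized implementation.

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    # Steps 1-2: choose k initial centroids at random from the data points
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each object to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: re-calculate each centroid as the mean of the objects assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy dataset: two clearly separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
labels, centroids = k_means(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
print(centroids)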


Hierarchical Clustering
• Hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters.
• In Hierarchical Clustering, the aim is to produce a hierarchical series of nested
clusters.
• A diagram called a dendrogram (a tree-like diagram that records the sequences of
merges or splits) graphically represents this hierarchy; it is an inverted tree that
describes the order in which objects are merged (bottom-up view) or clusters are
split (top-down view).

• Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.

• For instance, the general group of staff can be further divided into subgroups
of senior officers, officers, and trainees. All these groups form a hierarchy. We
can easily summarize or characterize the data that are organized into a
hierarchy, which can be used to find, say, the average salary of managers and
of officers.

A hierarchical clustering method can be either agglomerative or divisive, depending
on whether the hierarchical decomposition is formed in a bottom-up (merging) or
top-down (splitting) fashion.

❖ AGNES (AGglomerative NESting) is an agglomerative hierarchical clustering method
and DIANA (DIvisive ANAlysis) is a divisive hierarchical clustering method; both can
be applied to a data set of objects such as {A, B, C, D, E, F}.

1. Agglomerative Hierarchical Clustering
• An agglomerative hierarchical clustering method uses a bottom-up
strategy.
• It typically starts by letting each object form its own cluster and
iteratively merges clusters into larger and larger clusters, until all the
objects are in a single cluster or certain termination conditions are
satisfied.
• The single cluster becomes the hierarchy’s root.
• For the merging step, it finds the two clusters that are closest to each
other (according to some similarity measure), and combines the two to
form one cluster. Because two clusters are merged per iteration, where
each cluster contains at least one object, an agglomerative method
requires at most n iterations (see the sketch below).

2. Divisive Hierarchical Clustering


• A divisive hierarchical clustering method employs a top-down
strategy.
• It starts by placing all objects in one cluster, which is the hierarchy’s
root.
• It then divides the root cluster into several smaller subclusters, and
recursively partitions those clusters into smaller ones.
• The partitioning process continues until each cluster at the lowest level
is coherent enough—either containing only one object, or the objects
within a cluster are sufficiently similar to each other.
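As an illustrative sketch of the agglomerative (bottom-up) strategy described above,
assuming SciPy is available (the notes do not prescribe a library, and single linkage
is chosen here only for illustration), the merge order can be computed for a handful
of labelled points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Six labelled 2-D objects, loosely in the spirit of the {A, B, C, D, E, F} example
points = {"A": (1.0, 1.0), "B": (1.5, 1.2), "C": (5.0, 5.0),
          "D": (5.5, 5.3), "E": (9.0, 1.0), "F": (9.2, 1.4)}
X = np.array(list(points.values()))

# Bottom-up merging: single linkage merges the two closest clusters at each step
Z = linkage(pdist(X), method="single")
print(Z)  # each row records one merge: (cluster i, cluster j, distance, new size)

# Cut the hierarchy into 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(points, labels)))  # e.g. {'A': 1, 'B': 1, 'C': 2, 'D': 2, 'E': 3, 'F': 3}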


Density-Based Methods

To discover clusters with arbitrary shape, density-based clustering methods have been
developed. These typically regard clusters as dense regions of objects in the data
space that are separated by regions of low density (representing noise).

Density-based clustering uses two parameters:

• ɛ: The radius of the neighborhood around a data point p.
• minPts: The minimum number of data points required in a neighborhood to define a
cluster.

❖ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• It finds core objects, that is, objects that have dense neighborhoods. It
connects core objects and their neighborhoods to form dense regions as
clusters.

• The algorithm identifies noise points that do not belong to any cluster and
ignores them in the cluster formation process.
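A minimal sketch assuming scikit-learn is available; eps corresponds to the radius ɛ
and min_samples to minPts described above, and the data and parameter values are made
up for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be flagged as noise
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
              [9.0, 0.0]])

# eps = neighborhood radius (ɛ), min_samples = minimum points per neighborhood (minPts)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # noise points are labelled -1, e.g. [0 0 0 1 1 1 -1]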


❖ OPTICS
• OPTICS stands for Ordering Points To Identify the Clustering Structure.
• It produces an ordering of the database with respect to its density-based
clustering structure.
• This cluster ordering contains information equivalent to density-based clusterings
obtained over a wide range of parameter settings.
• OPTICS methods are beneficial for both automatic and interactive cluster
analysis, including determining an intrinsic clustering structure.
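A brief sketch using scikit-learn's OPTICS class (an assumed choice of library, with
toy data and parameter values chosen only for illustration); the cluster ordering and
reachability values it produces can be inspected directly.

import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
              [9.0, 0.0]])

# Build the ordering; clusters are then extracted DBSCAN-style at eps = 0.5
opt = OPTICS(min_samples=3, cluster_method="dbscan", eps=0.5).fit(X)
print(opt.ordering_)      # the order in which points were processed
print(opt.reachability_)  # reachability distance of each point (inf for the first)
print(opt.labels_)        # clusters extracted from the ordering; -1 marks noise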

❖ DENCLUE: Clustering Based on Density Distribution Functions


• Density estimation is a core issue in density-based clustering methods. DENCLUE
(DENsity-based CLUstEring) is a clustering method based on a set of density
distribution functions.

Step 1: Density Estimation

In this step, we estimate densities for all data points using Gaussian
kernels with different bandwidths (h). This process results in an n-
dimensional probability distribution function (PDF), where n represents
the number of dimensions/features present in our dataset.

Step 2: Attraction Basin Identification

After estimating densities for all data points, we need to identify
attraction basins, the regions of points whose densities climb towards the
same local density maximum (density attractor), because these attractors are
the potential cluster centers. We use gradient ascent (hill climbing) to
identify these attraction basins.

Step 3: Cluster Assignment

Once we have identified all attraction basins, we assign each data point
to its nearest basin using a distance metric such as Euclidean distance.
This process results in clusters of varying sizes and shapes.
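A highly simplified sketch of the three steps, assuming Gaussian kernels with a
single fixed bandwidth h and a basic fixed-step hill-climbing loop; real DENCLUE
implementations are considerably more elaborate.

import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 5.1], [4.9, 4.8]])
h = 0.5  # kernel bandwidth

def density(p):
    # Step 1: kernel density estimate at p (sum of Gaussian kernels centred on the data)
    return np.exp(-np.sum((X - p) ** 2, axis=1) / (2 * h ** 2)).sum()

def hill_climb(p, step=0.05, iters=200):
    # Step 2: move p uphill along the density gradient towards its density attractor
    for _ in range(iters):
        w = np.exp(-np.sum((X - p) ** 2, axis=1) / (2 * h ** 2))
        grad = ((X - p) * w[:, None]).sum(axis=0) / h ** 2
        p = p + step * grad
    return p

print(round(density(np.array([1.0, 1.0])), 3))  # density is high near the first group

# Step 3: points whose attractors (nearly) coincide are assigned to the same cluster
attractors = np.array([hill_climb(p) for p in X])
labels = np.unique(np.round(attractors), axis=0, return_inverse=True)[1]
print(labels)  # e.g. [0 0 0 1 1 1]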


Grid-Based Methods
The grid-based clustering approach uses a multiresolution grid data structure.
It quantizes the object space into a finite number of cells that form a grid structure
on which all of the operations for clustering are performed.

❖ STING: Statistical Information Grid

STING is a grid-based multiresolution clustering technique in which the embedding
spatial area of the input objects is divided into rectangular cells.

• The space can be divided in a hierarchical and recursive way.
• Several levels of such rectangular cells correspond to different levels of
resolution and form a hierarchical structure: each cell at a high level is
partitioned to form a number of cells at the next lower level.

How STING works:


Step 1: Determine a layer to begin with.
Step 2: For each cell of this layer, it calculates the confidence interval or
estimated range of probability that this cell is relevant to the query.
Step 3: From the interval calculated above, it labels the cell as relevant or not
relevant.
Step 4: If this layer is the bottom layer, go to point 6; otherwise, go to point 5.

Step 5: It goes down the hierarchy structure by one level. Go to point 2 for those
cells that form the relevant cells of the higher-level layer.
Step 6: If the specification of the query is met, go to point 8, otherwise go to
point 7.
Step 7: Retrieve those data that fall into the relevant cells and do further
processing. Return the result that meets the requirement of the query. Go to
point 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the
requirement of the query. Go to point 9.
Step 9: Stop or terminate.

• STING offers several advantages:

1) The grid-based computation is query-independent, because the statistical
information stored in each cell represents summary information of the data in the
grid cell, independent of the query.
2) The grid structure facilitates parallel processing and incremental updating.
3) The method's efficiency is a major advantage.
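A minimal sketch of the underlying idea (not the full multiresolution query procedure
above): the space is divided into rectangular cells at a single resolution level and
summary statistics are precomputed per cell, so a query can be answered from the
stored summaries without rescanning the raw data. All sizes and values below are
assumed for illustration.

import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(1000, 2))   # 1000 points in a 100 x 100 spatial area

cells_per_side = 4                                      # one level: a 4 x 4 grid
cell_ids = (X // (100 / cells_per_side)).astype(int)    # grid cell of each point

# Precompute per-cell summary statistics (count, mean), independently of any query
stats = {}
for cx in range(cells_per_side):
    for cy in range(cells_per_side):
        mask = (cell_ids[:, 0] == cx) & (cell_ids[:, 1] == cy)
        stats[(cx, cy)] = {"count": int(mask.sum()),
                           "mean": X[mask].mean(axis=0) if mask.any() else None}

# A query such as "which cells contain more than 70 points?" only touches the summaries
relevant = [cell for cell, s in stats.items() if s["count"] > 70]
print(relevant)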

❖ CLIQUE (CLustering In QUEst)

• It is a simple grid-based method for finding density-based clusters in subspaces.
• CLIQUE partitions each dimension into non-overlapping intervals, thereby
partitioning the entire embedding space of the data objects into cells.
• It uses a density threshold to identify dense cells and sparse ones. A cell is
dense if the number of objects mapped to it exceeds the density threshold.
• The main strategy behind CLIQUE for identifying a candidate search space
uses the monotonicity of dense cells with respect to dimensionality. This is
based on the Apriori property used in frequent pattern and association rule
mining.
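A sketch of the first CLIQUE step in a two-dimensional space, with assumed interval
and threshold values: each dimension is partitioned into equal-width intervals,
points are counted per cell, and cells whose counts exceed the density threshold are
kept as dense cells.

import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
# Synthetic data: a dense blob around (20, 80) embedded in uniform background noise
X = np.vstack([rng.normal([20, 80], 2, size=(200, 2)),
               rng.uniform(0, 100, size=(300, 2))])

intervals = 10     # partition each dimension into 10 non-overlapping intervals
threshold = 25     # density threshold: minimum number of points for a dense cell

cells = (X // (100 / intervals)).astype(int)              # grid cell of every point
counts = Counter(tuple(map(int, row)) for row in cells)   # points mapped to each cell
dense_cells = {c for c, n in counts.items() if n > threshold}
print(dense_cells)  # the cells around index (2, 8), i.e. near the blob at (20, 80)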

Evaluation of clustering
Cluster evaluation assesses the feasibility of clustering analysis on a data set and the
quality of the results generated by a clustering method. The major tasks of clustering
evaluation include the following:

1. Assessing clustering tendency.

2. The number of clusters in a data set.
3. Measuring clustering quality.

1. Assessing clustering tendency:

In this task, for a given data set, we assess whether a non-random structure exists
in the data. Blindly applying a clustering method to a data set will return clusters;
however, the clusters mined may be misleading. Clustering analysis on a data set is
meaningful only when there is a non-random structure in the data.
• We can try to measure the probability that the data set is generated by a
uniform data distribution.
• This can be achieved using statistical tests for spatial randomness.
• The Hopkins Statistic is a spatial statistic that tests the spatial randomness of
a variable as distributed in a space.
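A brief sketch of one common formulation of the Hopkins statistic (the exact
formulation is assumed here, not taken from the notes): it compares nearest-neighbour
distances from randomly generated uniform points with those from sampled data points,
giving values near 0.5 for uniform data and values approaching 1 when the data have a
clustering tendency.

import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tree = cKDTree(X)
    # u: distances from m uniformly generated points to their nearest data point
    uniform_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = tree.query(uniform_pts, k=1)[0]
    # w: distances from m sampled data points to their nearest *other* data point
    sample = X[rng.choice(n, size=m, replace=False)]
    w = tree.query(sample, k=2)[0][:, 1]   # k=2 because the closest hit is the point itself
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
uniform_data = rng.uniform(0, 10, size=(300, 2))
clustered_data = np.vstack([rng.normal(2, 0.3, size=(150, 2)),
                            rng.normal(8, 0.3, size=(150, 2))])
print(hopkins(uniform_data))    # close to 0.5
print(hopkins(clustered_data))  # noticeably larger than 0.5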

2. The number of clusters in a data set:

A few algorithms, such as k-means, require the number of clusters in a data set as a
parameter. Moreover, the number of clusters can be regarded as an interesting and
important summary statistic of a data set.
• Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
• The elbow method is based on the observation that increasing the number
of clusters can help to reduce the sum of within-cluster variance of each
cluster. This is because having more clusters allows one to capture finer
groups of data objects that are more similar to each other.
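A sketch of the elbow heuristic, assuming scikit-learn: the within-cluster sum of
squares (inertia) is computed for a range of k values, and the "elbow" where the
curve stops dropping sharply suggests a reasonable number of clusters. The data below
are synthetic and used only for illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated groups, so the elbow should appear around k = 3
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of objects to their cluster centroids
    print(k, round(km.inertia_, 1))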

3. Measuring clustering quality:

After applying a clustering method on a data set, we want to assess how good
the resulting clusters are. A number of measures can be used.

❖ Extrinsic Methods

Cluster homogeneity: This requires that the purer the clusters in a clustering are,
the better the clustering.

Cluster completeness: This requires that a clustering assign objects belonging to the
same category (according to ground truth) to the same cluster (see the sketch after
this list).

Small cluster preservation: The small cluster preservation criterion states
that splitting a small category into pieces is more harmful than splitting a
large category into pieces.
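When ground-truth category labels are available, extrinsic scores such as homogeneity
and completeness can be computed directly; a minimal sketch assuming scikit-learn,
with made-up labels.

from sklearn.metrics import homogeneity_score, completeness_score

truth    = [0, 0, 0, 1, 1, 1]   # ground-truth categories
clusters = [0, 0, 1, 1, 1, 1]   # labels produced by some clustering

print(homogeneity_score(truth, clusters))    # 1.0 only if every cluster is pure
print(completeness_score(truth, clusters))   # 1.0 only if each category stays in one cluster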

❖ Intrinsic Methods

Intrinsic methods evaluate a clustering by examining how well the clusters are
separated and how compact the clusters are. Many intrinsic methods take advantage of
a similarity metric between objects in the data set.
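A widely used intrinsic measure of this kind is the silhouette coefficient, which
compares each object's average distance to its own cluster with its distance to the
nearest other cluster; a brief sketch assuming scikit-learn, with synthetic data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(5, 0.5, size=(100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Scores range from -1 to 1; values near 1 indicate compact, well-separated clusters
print(silhouette_score(X, labels))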

Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available. There are
also measures that score clusterings and thus can compare two sets of clustering
results on the same data set.
