Unit4 DMW
Clustering
Unit 4: clustering
Clustering is the process of grouping a set of data objects into multiple groups
or clusters so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters.
• Clustering as a data mining tool has its roots in many application areas such
as biology, security, business intelligence, and Web search.
Cluster Analysis
Cluster analysis or simply clustering is the process of partitioning a set of data
objects (or observations) into subsets.
• Each subset is a cluster, such that objects in a cluster are similar to one
another, yet dissimilar to objects in other clusters.
• The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
• Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
• In business intelligence, clustering can be used to organize a large number of
customers into groups, where customers within a group share strongly similar
characteristics.
❖ Scalability: Clustering on only a sample of a given large data set may lead to
biased results. Therefore, highly scalable clustering algorithms are needed.
❖ Ability to deal with different types of attributes:
Recently, more and more applications need clustering techniques for complex
data types such as graphs, sequences, images, and documents.
The following are orthogonal aspects with which clustering methods can be
compared:
Similarity measure: Some methods determine the similarity between two objects
by the distance between them. Such a distance can be defined on Euclidean space, a
road network, a vector space, or any other space.
Clustering space: Many clustering methods search for clusters within the entire
given data space.
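As a minimal sketch of a distance-based similarity measure, the snippet below computes two common distances between objects represented as numeric tuples. The function names are illustrative, and the Manhattan distance is included only as an assumed, rough stand-in for distances on a grid-like road network.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

A smaller distance is read as higher similarity, so the choice of distance function directly shapes which objects end up in the same cluster.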
Clustering Methods
The major fundamental clustering methods can be classified into the following
categories:
Partitioning methods:
The simplest and most fundamental version of cluster analysis is partitioning,
which organizes the objects of a set into several exclusive groups or clusters.
• To keep the problem specification concise, we can assume that the number
of clusters is given as background knowledge. This parameter is the
starting point for partitioning methods.
• Formally, given a data set, D, of n objects, and k, the number of clusters to
form, a partitioning algorithm organizes the objects into k partitions (k ≤
n), where each partition represents a cluster.
• The clusters are formed to optimize an objective partitioning criterion,
such as a dissimilarity function based on distance, so that the objects
within a cluster are similar to one another and dissimilar to objects in
other clusters.
K-Means clustering
Working of K-Means
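The notes do not spell out the steps here, so the following is only a sketch of the commonly taught k-means loop (assign each object to its nearest centroid, recompute centroids, repeat until assignments stop changing). The random initialisation, the `max_iter` and `seed` parameters, and the handling of empty clusters are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # assumption: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # no change: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # assumption: keep old centroid if empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label. Different initial centroids can lead to different results on less well-separated data.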
Hierarchical clustering
• Hierarchical clustering method works by grouping data objects into a
hierarchy or “tree” of clusters.
• In Hierarchical Clustering, the aim is to produce a hierarchical series of nested
clusters.
• A dendrogram, a tree-like diagram that records the sequence of merges or
splits, graphically represents this hierarchy. It is drawn as an inverted
tree that shows the order in which objects are merged (bottom-up view) or
clusters are broken up (top-down view).
• For instance, the general group of staff can be further divided into subgroups
of senior officers, officers, and trainees. All these groups form a hierarchy. We
can easily summarize or characterize the data that are organized into a
hierarchy, which can be used to find, say, the average salary of managers and
of officers.
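The bottom-up view can be sketched as follows. This is an illustrative single-linkage version, not necessarily the variant intended in these notes: the linkage rule, the `target_k` stopping parameter, and the function name are assumptions. The returned merge history is the information a dendrogram would display.

```python
import math

def agglomerative(points, target_k=1):
    """Bottom-up single-linkage clustering.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]   # start: one cluster per object
    merges = []
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # record what was merged
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                               # j > i, so index i is unaffected
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, target_k=2)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

Cutting the merge sequence at different points yields coarser or finer groupings, which is what makes the result a hierarchy of nested clusters rather than a single partition.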
Density-Based Methods
• The algorithm identifies noise points that do not belong to any cluster and
ignores them in the cluster formation process.
❖ OPTICS
• OPTICS stands for Ordering Points To Identify the Clustering Structure.
• It produces a meaningful ordering of the database with respect to its
density-based clustering structure.
• This cluster ordering contains information equivalent to the density-based
clusterings obtained over a broad range of parameter settings.
• OPTICS methods are beneficial for both automatic and interactive cluster
analysis, including determining an intrinsic clustering structure.
In this step, we estimate the density at every data point using Gaussian
kernels with different bandwidths (h). This process yields a probability
density function (PDF) over the n-dimensional data space, where n is the
number of dimensions (features) in the dataset.
Once we have identified all attraction basins, we assign each data point
to its nearest basin using a distance metric such as Euclidean distance.
This process results in clusters of varying sizes and shapes.
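The density-estimation step described above can be sketched with a plain Gaussian kernel density estimate. This covers only the density computation, not the search for attraction basins; the function name and the single fixed bandwidth per call are assumptions for illustration.

```python
import math

def gaussian_kde(data, h):
    """Return a function estimating density from Gaussian kernels of bandwidth h.

    data is a list of equal-length numeric tuples.
    """
    n = len(data)
    d = len(data[0])
    # Normalising constant for a product Gaussian kernel in d dimensions.
    norm = n * (h * math.sqrt(2 * math.pi)) ** d

    def density(x):
        total = 0.0
        for p in data:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2 * h * h))
        return total / norm

    return density

data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
for h in (0.5, 1.0, 2.0):          # trying several bandwidths
    f = gaussian_kde(data, h)
    print(h, f((0.1, 0.0)), f((2.5, 2.5)))
```

A small bandwidth gives a spiky estimate with many local maxima, while a large one smooths neighbouring groups together, which is why the bandwidth choice affects how many clusters emerge.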
Grid-Based Methods
The grid-based clustering approach uses a multiresolution grid data structure.
It quantizes the object space into a finite number of cells that form a grid structure
on which all of the operations for clustering are performed.
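A minimal sketch of the quantization idea: each object is mapped to the cell that contains it, and later operations work on per-cell counts instead of individual objects. The uniform `cell_size` and the `min_points` density threshold are hypothetical parameters chosen for illustration, and only a single grid resolution is shown.

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Map each point to an integer cell index and count the points per cell."""
    counts = Counter()
    for p in points:
        cell = tuple(int(x // cell_size) for x in p)
        counts[cell] += 1
    return counts

def dense_cells(counts, min_points):
    """Cells holding at least min_points objects (hypothetical threshold)."""
    return {cell for cell, c in counts.items() if c >= min_points}

points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (3.5, 3.5)]
counts = grid_cells(points, cell_size=1.0)
print(dict(counts))            # {(0, 0): 3, (3, 3): 1}
print(dense_cells(counts, 2))  # {(0, 0)}
```

Because the work after quantization depends on the number of cells rather than the number of objects, processing time is largely independent of the data set size.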
Evaluation of clustering
Cluster evaluation assesses the feasibility of clustering analysis on a data set and the
quality of the results generated by a clustering method. The major tasks of clustering
evaluation include the following:
After applying a clustering method on a data set, we want to assess how good
the resulting clusters are. A number of measures can be used.
❖ Extrinsic Methods
Cluster homogeneity. This requires that the more pure the clusters in a
clustering are, the better the clustering.
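One simple way to put a number on cluster homogeneity, assuming ground-truth labels are available, is purity: the fraction of objects that carry the majority ground-truth class of their cluster. Purity is used here as an illustrative measure; it is not named in these notes, and the function name is an assumption.

```python
from collections import Counter

def purity(cluster_labels, truth_labels):
    """Fraction of objects matching the majority ground-truth class of their cluster."""
    by_cluster = {}
    for c, t in zip(cluster_labels, truth_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(cluster_labels)

truth = ["a", "a", "b", "b"]
print(purity([0, 0, 1, 1], truth))  # 1.0
print(purity([0, 0, 0, 1], truth))  # 0.75
```

Purity alone rewards putting every object in its own cluster, so it is usually read alongside other criteria rather than as a sole score.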
❖ Intrinsic Methods
Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available. There are
also measures that score clusterings and thus can compare two sets of clustering
results on the same data set.