Unit 4 Clustering
Hence, iteratively, we are splitting the data, which was once grouped as a
single large cluster, into “n” number of smaller clusters to which the data
points now belong.
It must be taken into account that this algorithm is highly “rigid” when
splitting clusters: once a split has been made in an iteration, it cannot be
undone.
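A rough sketch of this top-down idea is given below, assuming scikit-learn and NumPy are available; the helper name divisive_clustering is illustrative, not a library function, and each two-way split simply uses a 2-means step. In line with the “rigidity” described above, a split once made is never revisited.

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters=4, seed=0):
    # Start with the whole dataset as one big cluster of indices.
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Pick the largest remaining cluster and split it into two.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[members])
        clusters.append(members[km.labels_ == 0])
        clusters.append(members[km.labels_ == 1])
    # Assign a final label per point; earlier splits are never undone.
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels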
1.2 Agglomerative Approach
The agglomerative approach is quite the contrary of the divisive one: each of
the “N” data points initially forms its own single-member cluster, so the data
starts out comprised of “N” clusters. We iteratively combine these numerous
“N” clusters into a smaller number of clusters, say “k” clusters, and assign
the data points to each of these clusters accordingly. This is a bottom-up
approach, and it also uses a termination logic when combining the clusters.
This logic can be a number-based criterion (no more clusters beyond this
point), a distance criterion (clusters should not be too far apart to be
merged), or a variance criterion (the increase in the variance of the cluster
being merged should not exceed a threshold, as in the Ward method).
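A brief sketch of this bottom-up merging is shown below, assuming scikit-learn is available and using toy data: the number-based criterion is expressed through n_clusters and the Ward variance criterion through linkage="ward"; a distance criterion could be expressed through distance_threshold instead.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).normal(size=(100, 2))   # toy 2-D data

# Merge bottom-up until only 5 clusters remain, using the Ward criterion.
agg = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = agg.fit_predict(X)

# Alternatively, stop merging once clusters are too far apart (distance criterion):
# AgglomerativeClustering(n_clusters=None, distance_threshold=10.0)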
2. Partitioning Clustering
Pre-defining the number of clusters at the initial stage is the most crucial
yet most complicated step of this clustering approach. Despite the drawback,
it is a widely used clustering approach for exploring and optimizing large
datasets. The K-Means algorithm lies in this category.
3. Density-Based Clustering
The clusters formed vary in arbitrary shapes and sizes and contain a
maximum degree of homogeneity due to similar density. This clustering
approach handles the noise and outliers in the datasets effectively.
When performing most clustering, we make two major assumptions: the data
is devoid of any noise, and the shape of the clusters formed is purely
geometrical (circular or elliptical).
The fact is, data always has some extent of inconsistency (noise) which
cannot be ignored. Added to that, we must not limit ourselves to a fixed
cluster shape; it is desirable to allow arbitrary shapes so that no data
points are ignored. These are the areas where density-based algorithms
have proven their worth!
4. Distribution-Based Clustering
Until now, the clustering techniques as we know them are based on
either proximity (similarity/distance) or composition (density). There is a
family of clustering algorithms that take a totally different metric into
consideration – probability.
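As a minimal sketch of this probability-based family, the snippet below assumes scikit-learn's GaussianMixture and toy data: each cluster is modelled as a Gaussian distribution, and every point receives both a hard assignment and soft membership probabilities.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))    # toy 2-D data

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most probable component for each point
soft_probs = gmm.predict_proba(X)   # membership probability per component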
5. Fuzzy Clustering
Fuzzy clustering generalizes the partition-based clustering method by
allowing a data object to be a part of more than one cluster. The process
uses a weighted centroid based on the spatial probabilities.
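A minimal NumPy sketch of the fuzzy c-means idea follows; the function name fuzzy_c_means, the fuzzifier m, and the iteration count are illustrative choices, not taken from the text. Each point holds a membership weight in every cluster, and the centroids are membership-weighted means, echoing the weighted centroid mentioned above.

import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]   # weighted centroids
        # Distances from every point to every centroid (small epsilon avoids /0).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        u = 1.0 / d ** (2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)        # renormalize memberships
    return centers, u

X = np.random.default_rng(1).normal(size=(150, 2))   # toy 2-D data
centers, memberships = fuzzy_c_means(X)              # each row of memberships sums to 1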
1. K-Means clustering
2. Mini batch K-Means clustering algorithm
3. Mean Shift
4. Divisive Hierarchical Clustering
5. Hierarchical Agglomerative clustering
6. Gaussian Mixture Model
7. DBSCAN
8. OPTICS
9. BIRCH Algorithm
1. K-Means clustering
K-Means is a partition-based clustering technique that uses the Euclidean
distance between points as the criterion for cluster formation. Assuming
there are ‘n’ data objects, K-Means groups them into a predetermined ‘k’
number of clusters.
Each cluster is allocated a cluster center, and these centers are placed as
far apart from one another as possible. Every incoming data point is placed
in the cluster with the closest cluster center. This process is repeated until
all the data points have been assigned to a cluster. Once all the data points
are covered, the cluster centers (centroids) are recalculated.
With these ‘k’ new centroids, the same data points are regrouped around
their nearest new centroid. The ‘k’ centroid values and locations may keep
changing from iteration to iteration, and this loop continues until the
cluster centers no longer change, in other words, until the centroids do not
move anymore. The algorithm aims to minimize the objective function

J = Σ_{j=1..k} Σ_{i=1..n} || x_i^(j) − c_j ||^2

where || x_i^(j) − c_j ||^2 is the squared Euclidean distance between a data
point x_i^(j) assigned to cluster j and the cluster center c_j.
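A short sketch of this loop, assuming scikit-learn's KMeans and toy data: points are reassigned and centroids recomputed until the centroids stop moving (within tol) or max_iter is reached, and inertia_ reports the value of the objective function J.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))    # toy 2-D data

km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0).fit(X)
labels = km.labels_                  # cluster assignment for every point
centroids = km.cluster_centers_      # final, no-longer-moving centroids
objective = km.inertia_              # value of the objective function J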
4. DBSCAN
DBSCAN – Density-Based Spatial Clustering of Applications with Noise
identifies discrete groups in data. The algorithm aims to cluster the data
as contiguous regions having high point density. Each cluster is
separated from the others by points of low density. In simpler words, the
cluster covers the data points that satisfy the density criterion, which is a
minimum number of data objects within a given radius.
Core: A point having at least minPts points within distance eps of itself.
Border: A point having at least one core point within distance eps.
Noise: A point having fewer than minPts points within distance eps; it is
neither a core nor a border point.
minPts: The threshold number of points that must lie within the radius for
a region to be considered dense.
eps (ε): The distance measure used to decide whether a point lies in the
neighborhood of another data point.
Reachability: A density notion: a point is reachable from another point if
it lies within distance eps of it.
Connectivity: A transitivity-based chaining approach that establishes
whether a point is located in a particular cluster.
For implementing DBSCAN, we begin by defining two important parameters:
a radius parameter eps (ϵ) and a minimum number of points within that
radius (m, i.e. minPts).
The algorithm starts with a random data point that has not been visited
before, and its neighborhood is marked out according to ϵ.
If this neighborhood contains at least the m minimum points, cluster
formation begins and the point is marked as “visited”; if not, it is labeled
as “noise” for that iteration, a label that may change later.
Starting from a random point, all the points in the space are visited. The
neighborhood of each point is computed using the distance epsilon ε to
determine whether it is a core point or an outlier. If it is a core point, a
cluster is formed around it.
The cluster is expanded by adding the reachable points within the
proximity radius.
These steps are repeated until all points either join a cluster or are
labeled as outliers, as sketched below.
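The compact sketch below assumes scikit-learn's DBSCAN and toy data: eps is the radius ε and min_samples the minimum point count m; points that never satisfy the density criterion keep the label -1 (noise).

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))    # toy 2-D data

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                         # cluster index per point, -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
core_points = db.core_sample_indices_       # indices of the core points found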
5. BIRCH Algorithm
Balanced Iterative Reducing and Clustering using Hierarchies, or BIRCH, is
a clustering technique used for very large datasets. It is a fast algorithm
that scans the entire dataset in a single pass. It is dedicated to solving the
issues of large-dataset clustering by focusing on densely occupied regions
and creating a precise summary of them.
BIRCH works within any provided amount of memory and minimizes the I/O
complexity. The algorithm only processes metric attributes, meaning
attributes with no categorical values, whose values can be represented by
explicit coordinates in a Euclidean space. The main parameters of the
algorithm are the CF tree and the threshold.
CF tree: The Clustering Feature tree is a tree in which each leaf node
consists of a sub-cluster. Each entry in a CF tree holds a pointer to a
child node, and the CF entry of a node is the sum of the CF entries in its
child nodes.
Threshold: The maximum size (radius) a sub-cluster held in a leaf node may
have; a new point is absorbed into an existing sub-cluster only if this
limit is not exceeded.
Steps of the BIRCH Algorithm
Step 1 - Building the Clustering Feature (CF) tree: building small and
dense regions (sub-clusters) out of the large dataset. Optionally, in Phase
2, the CF tree is condensed into an even smaller CF tree.
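A small usage sketch, assuming scikit-learn's Birch and toy data: the CF tree is built in a single pass over the data (threshold limits the radius of each leaf sub-cluster, branching_factor the entries per node), and the leaf sub-clusters are then condensed into the final n_clusters groups.

import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(10_000, 2))   # larger toy dataset

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)               # single-pass CF-tree build + clustering
cf_summary = birch.subcluster_centers_      # centers of the condensed sub-clusters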
Applications of Clustering
Clustering is applied in various fields to prepare data for subsequent
machine learning processes. Some of the applications of clustering are as
follows.