CLUSTERING
Keywords: Clustering, Density-Based, Fuzzy, Grid-Based, Hierarchical, K-Means, Partitioning, STING, Wavelet.
--------------------------------------------------------------------------------------------------------------------------------------------------
1. INTRODUCTION
Data mining [1] refers to extracting information from large amounts of data. It can also be termed "Knowledge Discovery from Data" (KDD), which is an essential step in the process of knowledge discovery. Data mining is performed in four steps: assemble the data, apply data mining tools to the datasets, interpret and evaluate the results, and apply the results [5].

Fig. 1: Steps of Data Mining Process

Data mining practice has four important tasks: anomaly detection, association, classification and clustering. Anomaly detection is the recognition of unusual data records that may be interesting, or of data errors that require further investigation. Association rule learning is the process of finding relationships between variables. Classification [2] is the task of generalizing a known structure so that it can be applied to new data.

In data mining, the data is mined using two learning approaches: supervised learning and unsupervised learning.

Supervised Learning
Supervised learning [3] discovers patterns in the data that relate data attributes to a target (class) attribute. These patterns are then utilized to predict the values of the target attribute in future data instances.

Unsupervised Learning
In unsupervised learning [35], there is no a priori output. The data have no target attribute; the data are explored to find intrinsic structures in them.

2. CLUSTERING
Clustering [4] is a significant task in data analysis and data mining applications. The process of organizing objects into groups whose elements are similar under some criterion is called clustering. It is usually performed when no information is available concerning the membership of data items in predefined classes. For this reason, clustering is traditionally seen as part of unsupervised learning. It is useful for the study of inter-relationships among a collection of patterns, by organizing them into homogeneous clusters. It is called unsupervised learning because no a priori labeling of patterns is available to categorize others and to infer the cluster structure of the whole data. Intra-connectivity is a measure of the density within a cluster; high intra-connectivity indicates a good cluster arrangement, because the instances grouped in the same cluster are highly dependent on each other. Inter-connectivity is a measure of the connectivity between distinct clusters; in a good clustering it is low, so that individual clusters are largely independent of each other.
Each object xi, i = 1, 2, ..., N, is assigned to one of the k clusters, j = 1, 2, ..., k, where d(xi, mj) is the distance between data point i and cluster center j. The mean of the objects in each cluster is then calculated as the new cluster center:

mi = (1 / Ni) Σx∈Ci x,   i = 1, 2, ..., k

where Ni is the number of samples of the current cluster i.

3.1.2 K-Medoids
It is also called PAM (Partitioning Around Medoids, 1987). The k-means algorithm [1] is sensitive to outliers because an object with an extremely large value may significantly alter the distribution of the data. This effect is particularly exacerbated by the use of the square-error function. Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in the cluster.

3.1.3 CLARANS
Clustering Large Applications based on RANdomized Search - CLARANS [11] (Ng and Han, 1994). It combines sampling techniques with PAM. The clustering process can be presented as searching a graph in which every node is a potential solution, that is, a set of k medoids. The clustering obtained after replacing a medoid is called a neighbor of the current clustering. In this clustering process, a graph is searched in which each node is represented by a set of k medoids, and two nodes are neighbors if they differ by one medoid. Each node has k(T - k) neighbors, where T is the total number of objects. The computational complexity is O((β + numlocal)(T - k)) based on the number of partitions per object, or O((β + numlocal)k(T - k)) based on the number of distance calculations, where β is the number of test moves between nodes and numlocal is the number of local searches performed.

The algorithm steps are
• Randomly choose k medoids.
• Randomly consider swapping one of the medoids with a non-medoid object.
• If the cost of the new configuration is lower, repeat step 2 with the new solution.
• If the cost is higher, repeat step 2 with a different non-medoid object, unless a limit has been reached.
• Compare the solutions and keep the best.
• Return to step 1 unless a limit has been reached (set to the value of 2).
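To make the randomized search above concrete, the following is a minimal Python sketch of a CLARANS-style medoid search; the Euclidean cost function, the maxneighbor/numlocal limits and the synthetic data are illustrative assumptions rather than the exact settings of [11].

```python
import numpy as np

def clustering_cost(data, medoid_idx):
    """Total distance of every object to its nearest medoid."""
    dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def clarans(data, k, numlocal=2, maxneighbor=50, seed=0):
    """CLARANS-style randomized search over sets of k medoids."""
    rng = np.random.default_rng(seed)
    n = len(data)
    best_medoids, best_cost = None, np.inf
    for _ in range(numlocal):                      # restart from several random nodes
        current = rng.choice(n, size=k, replace=False)
        current_cost = clustering_cost(data, current)
        tried = 0
        while tried < maxneighbor:                 # examine random neighbors of the node
            pos = rng.integers(k)                  # medoid to swap out
            candidate = rng.integers(n)            # non-medoid to swap in
            if candidate in current:
                continue
            neighbor = current.copy()
            neighbor[pos] = candidate
            neighbor_cost = clustering_cost(data, neighbor)
            if neighbor_cost < current_cost:       # move to the better neighbor, reset counter
                current, current_cost = neighbor, neighbor_cost
                tried = 0
            else:
                tried += 1
        if current_cost < best_cost:               # keep the best local minimum found
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in ((0, 0), (3, 3), (0, 4))])
    medoids, cost = clarans(data, k=3)
    print("medoid rows:", medoids, "cost:", round(cost, 2))
```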
3.1.4. Fuzzy K-Means
It is an extension of k-means. Fuzzy K-Means [12] allows data points to be assigned to more than one cluster. Each data point has a degree of membership (or probability) of belonging to each cluster. In fuzzy clustering, each point has a degree of belonging to the clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster. For each point x we have a coefficient uk(x) giving the degree of being in the k-th cluster. Usually, the sum of those coefficients is defined to be 1:

Σk uk(x) = 1.

With fuzzy k-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:

centerk = Σx uk(x)^m x / Σx uk(x)^m.

The degree of belonging is related to the inverse of the distance to the cluster center,

uk(x) = 1 / d(centerk, x),

and then the coefficients are normalized and fuzzified with a real parameter m > 1 so that their sum is 1. So

uk(x) = 1 / Σj (d(centerk, x) / d(centerj, x))^(2/(m-1)).

For m equal to 2, this is equivalent to normalizing the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means. The fuzzy k-means algorithm is very similar to the k-means algorithm.

The algorithm steps are
• Choose a number of clusters.
• Assign coefficients randomly to each point for being in the clusters.
• Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
   - Compute the centroid for each cluster, using the formula above.
   - For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well.
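As an illustration of the update equations above, the following is a compact Python sketch of the fuzzy k-means iteration; the fuzzifier m = 2, the convergence threshold and the random test data are assumptions made for the example.

```python
import numpy as np

def fuzzy_kmeans(data, k, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Fuzzy k-means: soft memberships u[i, j] of point i in cluster j."""
    rng = np.random.default_rng(seed)
    n = len(data)
    u = rng.random((n, k))
    u /= u.sum(axis=1, keepdims=True)          # memberships of each point sum to 1
    for _ in range(max_iter):
        w = u ** m
        centers = (w.T @ data) / w.sum(axis=0)[:, None]          # membership-weighted means
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # u[i, j] = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        new_u = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(new_u - u).max() < eps:      # converged: memberships barely change
            u = new_u
            break
        u = new_u
    return centers, u

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = np.vstack([rng.normal((0, 0), 0.3, (40, 2)), rng.normal((3, 3), 0.3, (40, 2))])
    centers, u = fuzzy_kmeans(pts, k=2)
    print(np.round(centers, 2))
```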
3.1.5. K-Modes
The K-means clustering algorithm cannot cluster categorical data because of the dissimilarity measure it uses. The K-modes clustering algorithm [14] is based on the K-means paradigm but removes the numeric-data limitation while preserving its efficiency. The K-modes algorithm extends the K-means paradigm to cluster categorical data by removing the limitation imposed by K-means through the following modifications:
• Using a simple matching dissimilarity measure or the Hamming distance for categorical data objects.
• Replacing the means of clusters by their modes.

The algorithm steps are
• Insert the first K objects into K new clusters.
• Calculate the initial K modes for the K clusters.
• Repeat {
   For (each object O) {
      Calculate the similarity between object O and the modes of all clusters.
      Insert object O into the cluster C whose mode is the least dissimilar to object O.
   }
   Recalculate the cluster modes so that the cluster similarity between mode and objects is maximized.
  } until (no or few objects change clusters).

Let X, Y be two categorical objects described by m categorical attributes [13]. The simple dissimilarity measure between X and Y is defined by the total number of mismatches of the corresponding attribute values of the two objects. The smaller the number of mismatches, the more similar the two objects. Formally,

d(X, Y) = Σj=1..m δ(xj, yj),

where δ(xj, yj) = 0 if xj = yj and δ(xj, yj) = 1 if xj ≠ yj.
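A small Python sketch of the matching dissimilarity and mode-update steps described above; the toy categorical records and the fixed number of iterations are assumptions made for the example.

```python
from collections import Counter

def dissimilarity(x, y):
    """Simple matching dissimilarity: number of attribute positions that differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def mode_of(cluster):
    """Column-wise most frequent value of the objects in a cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(objects, k, iters=10):
    """Minimal k-modes: assign each object to the least dissimilar mode, then recompute modes."""
    modes = [objects[i] for i in range(k)]           # seed modes with the first K objects
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for obj in objects:
            j = min(range(k), key=lambda c: dissimilarity(obj, modes[c]))
            clusters[j].append(obj)
        modes = [mode_of(c) if c else modes[j] for j, c in enumerate(clusters)]
    return modes, clusters

if __name__ == "__main__":
    records = [("red", "small", "round"), ("red", "small", "oval"),
               ("blue", "large", "square"), ("blue", "large", "round"),
               ("red", "medium", "round"), ("blue", "large", "square")]
    modes, clusters = k_modes(records, k=2)
    print(modes)
```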
• These points are used as representatives and, at each step of the algorithm, the two clusters with the closest pair of representatives are merged (dmin).
• After each merging, another c points are selected from the original representatives of the previous clusters to represent the new cluster.
• Cluster merging stops when the target k clusters are found.
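To illustrate the representative-based merging rule (dmin) in the steps above, here is a small Python sketch; the Euclidean distance and the toy representative sets are assumptions, since the preceding part of the CURE description is not shown in this excerpt.

```python
import numpy as np

def dmin(reps_a, reps_b):
    """Distance between two clusters: closest pair of their representative points."""
    diffs = reps_a[:, None, :] - reps_b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min()

def closest_pair(clusters):
    """Return the indices of the two clusters whose representatives are nearest."""
    best, pair = np.inf, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = dmin(clusters[i], clusters[j])
            if d < best:
                best, pair = d, (i, j)
    return pair, best

if __name__ == "__main__":
    # Each cluster is summarized by a few scattered representative points.
    clusters = [np.array([[0.0, 0.0], [0.5, 0.2]]),
                np.array([[0.9, 0.1], [1.2, 0.3]]),
                np.array([[5.0, 5.0], [5.4, 4.8]])]
    (i, j), d = closest_pair(clusters)
    print(f"merge clusters {i} and {j} (dmin = {d:.2f})")
```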
3.2.4 CHAMELEON
It [21] is a newly developed agglomerative hierarchical clustering algorithm based on the nearest-neighbor graph, in which an edge is eliminated if both vertices are not within the closest points related to each other. It [7] uses dynamic modeling. It is based on two phases:
• Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters.
• Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters.

The algorithm steps are
• In the first step, the connectivity graph is divided into a set of sub-clusters with the minimal edge cut.
• Each subgraph should contain enough nodes for effective similarity computation. Combining the relative interconnectivity and the relative closeness makes it possible to explore the characteristics of the potential clusters.
• These small subsets are merged, and thus we get the ultimate clustering solutions.

Here, the relative interconnectivity (or closeness) is obtained by normalizing the sum of weights (or average weight) of the edges connecting the two clusters over the internal connectivity (or closeness) of the clusters. Fig. 6 provides an overview of the overall approach used by Chameleon to find the clusters in a data set.

Fig. 6: CHAMELEON Process
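The following Python sketch shows, under simplifying assumptions, how a relative interconnectivity and a relative closeness between two sub-clusters can be computed from a weighted adjacency matrix; the internal-connectivity term is approximated by the total (or average) weight of the edges inside each sub-cluster rather than by the bisection-based definition used in the full algorithm.

```python
import numpy as np

def cross_edges(w, a, b):
    """Weights of edges that connect sub-cluster a with sub-cluster b."""
    return w[np.ix_(a, b)].ravel()

def internal_edges(w, a):
    """Weights of edges inside sub-cluster a (upper triangle only)."""
    sub = w[np.ix_(a, a)]
    iu = np.triu_indices(len(a), k=1)
    return sub[iu]

def relative_interconnectivity(w, a, b):
    """Cut weight between a and b, normalized by their average internal weight."""
    cut_ab = cross_edges(w, a, b).sum()
    internal = 0.5 * (internal_edges(w, a).sum() + internal_edges(w, b).sum())
    return cut_ab / internal if internal > 0 else 0.0

def relative_closeness(w, a, b):
    """Average cross-edge weight normalized by the average internal edge weight."""
    cross = cross_edges(w, a, b)
    inner = np.concatenate([internal_edges(w, a), internal_edges(w, b)])
    cross_nz, inner_nz = cross[cross > 0], inner[inner > 0]
    if len(cross_nz) == 0 or len(inner_nz) == 0:
        return 0.0
    return cross_nz.mean() / inner_nz.mean()

if __name__ == "__main__":
    # Symmetric weighted adjacency matrix of a tiny 6-node sparse graph.
    w = np.zeros((6, 6))
    for i, j, wt in [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.7),
                     (3, 4, 0.9), (4, 5, 0.8), (2, 3, 0.4)]:
        w[i, j] = w[j, i] = wt
    a, b = [0, 1, 2], [3, 4, 5]
    print(round(relative_interconnectivity(w, a, b), 3),
          round(relative_closeness(w, a, b), 3))
```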
3.2.5 BIRCH
BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies. This method [22] has been designed so as to minimize the number of I/O operations. It incrementally and dynamically forms clusters using incoming multi-dimensional metric data points, trying to produce the best-quality clustering within the available memory and time constraints. It is the first clustering algorithm proposed in the database area to handle "noise". This method introduces two concepts, the clustering feature and the clustering feature tree (CF tree), which are used to summarize cluster representations. These structures help the clustering method achieve good speed and scalability in large databases. BIRCH incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:

Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.

Given n d-dimensional data objects or points in a cluster, we can define the centroid x0, radius R, and diameter D of the cluster as follows:

x0 = (Σi xi) / n
R = sqrt( Σi (xi - x0)^2 / n )
D = sqrt( Σi Σj (xi - xj)^2 / (n(n - 1)) )

where R is the average distance from the member objects to the centroid, and D is the average pairwise distance within a cluster. Both R and D reflect the tightness of the cluster around the centroid.

Clustering Feature
A clustering feature (CF) is a three-dimensional vector summarizing information about clusters of objects. Given n d-dimensional objects or points in a cluster, {xi}, the CF of the cluster is defined as

CF = (n, LS, SS)

where n is the number of points in the cluster, LS is the linear sum of the n points, and SS is the square sum of the data points.

Clustering Feature Tree
• Each non-leaf node has at most B entries.
• Each leaf node has at most L CF entries, each of which satisfies threshold T.
• Node size is determined by the dimensionality of the data space and the input parameter P (page size).

Fig. 7: CF Tree
Clustering features are sufficient for calculating all of the measurements that are needed for making clustering decisions in BIRCH. BIRCH thus utilizes storage efficiently by employing the clustering features to summarize information about the clusters of objects, thereby bypassing the need to store all objects.
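A brief Python sketch of the clustering-feature idea: the CF triple is maintained incrementally, two CFs can be merged by simple addition, and the centroid and radius are recovered from the triple alone, which is why the raw points never need to be kept. The class and variable names are illustrative, not BIRCH's actual implementation.

```python
import numpy as np

class CF:
    """Clustering feature: (n, LS, SS) for a set of d-dimensional points."""
    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)   # linear sum of the points
        self.ss = 0.0           # square sum of the points

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        """CFs are additive, so two sub-clusters can be combined in O(d)."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Average distance of the members from the centroid, from (n, LS, SS) only."""
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

if __name__ == "__main__":
    cf = CF(d=2)
    for p in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
        cf.add(p)
    print(cf.n, cf.centroid(), round(cf.radius(), 3))
```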
3.2.6 ROCK
ROCK - RObust Clustering using linKs. It [24][25] performs agglomerative hierarchical clustering and explores the concept of links for data with categorical attributes. The various notions used are defined below:

Links - The number of common neighbours between two objects.

Neighbors - If the similarity between two points exceeds a certain similarity threshold (θ), they are neighbours, i.e., if similarity(A, B) ≥ θ then the two points A and B are neighbours, where similarity is a similarity function and θ is a user-specified threshold.

Criterion Function - The objective is to maximize the criterion function to obtain good-quality clusters. By maximizing we mean maximizing the sum of links of intra-cluster point pairs while minimizing the sum of links of inter-cluster point pairs:

El = Σi=1..k ni · Σ(pq, pr ∈ Ci) link(pq, pr) / ni^(1+2f(θ)).

Goodness Measure - While performing clustering, the motive of using the goodness measure is to maximize the criterion function and to identify the best pair of clusters to be merged at each step of ROCK.

The algorithm steps are
• Computation of Similarity: compute the similarity for each pair of queries (A, B) using a measure such as
  Similarity(A, B) = |A ∩ B| / |A ∪ B|.
• Computation of Adjacency Matrix: compute the adjacency matrix (A) using a similarity threshold θ (e.g., θ = 0.4), i.e., if similarity(A, B) ≥ θ then 1, else 0.
• Computation of Links: compute the link matrix by multiplying the adjacency matrix by itself, i.e., A x A, to find the number of links.
• Calculation of Goodness Measure: the goodness measure for each pair of documents is calculated using the following function:
  g(Ci, Cj) = link[Ci, Cj] / ( (ni + nj)^(1+2f(θ)) - ni^(1+2f(θ)) - nj^(1+2f(θ)) ),
  where f(θ) = (1 - θ)/(1 + θ).
• Merge the two documents with the highest similarity (goodness measure).
• When no more entries exist in the goodness-measure table, stop the algorithm, resulting in k clusters and noise (if any); otherwise go to step 4.
• The groups of documents, i.e., the clusters, are the output.
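The link computation in the steps above can be made concrete with a few lines of Python; the example transactions, the threshold θ = 0.4 and the use of sets for the Jaccard similarity are assumptions made for illustration.

```python
import numpy as np

def jaccard(a, b):
    """Similarity(A, B) = |A ∩ B| / |A ∪ B| for two sets of categorical items."""
    return len(a & b) / len(a | b) if a | b else 0.0

def link_matrix(transactions, theta=0.4):
    """Adjacency from the similarity threshold, then links = A x A."""
    n = len(transactions)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and jaccard(transactions[i], transactions[j]) >= theta:
                adj[i, j] = 1
    return adj @ adj      # entry (i, j) counts the common neighbours of i and j

def goodness(links, ni, nj, theta=0.4):
    """Goodness measure for merging two clusters of sizes ni and nj."""
    f = (1 - theta) / (1 + theta)
    expected = (ni + nj) ** (1 + 2 * f) - ni ** (1 + 2 * f) - nj ** (1 + 2 * f)
    return links / expected

if __name__ == "__main__":
    txns = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
            {"e", "f"}, {"e", "f", "g"}]
    L = link_matrix(txns)
    print(L)
    print(round(goodness(L[0, 1], 1, 1), 3))   # goodness of merging singletons 0 and 1
```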
3.3 Density-Based Clustering
Density-based clusters [26] are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points. This approach requires just two parameters and is mostly insensitive to the ordering of the database. The quality of density-based clustering depends on the distance measure used in the function. It does not require one to specify the number of clusters in the data a priori.
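Since the individual density-based algorithms are not shown in this excerpt, here is a hedged, minimal Python sketch of the canonical two-parameter member of this family, DBSCAN (eps and MinPts), to make the "two parameters" and "noise points" ideas concrete; the parameter values and the data are illustrative.

```python
import numpy as np

def dbscan(points, eps=0.5, min_pts=4):
    """Minimal DBSCAN: labels[i] = cluster id, or -1 for noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    labels = np.full(n, -1)        # -1 means noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(np.where(dist[i] <= eps)[0])
        if len(neighbors) < min_pts:
            continue               # i stays noise unless later reached as a border point
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id      # border or core point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = np.where(dist[j] <= eps)[0]
            if len(j_neighbors) >= min_pts: # j is a core point: expand through it
                queue.extend(k for k in j_neighbors if not visited[k])
        cluster_id += 1
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal((0, 0), 0.2, (50, 2)),
                     rng.normal((3, 3), 0.2, (50, 2)),
                     rng.uniform(-2, 5, (10, 2))])   # a few scattered noise points
    labels = dbscan(pts, eps=0.4, min_pts=5)
    print("clusters found:", labels.max() + 1, "noise points:", (labels == -1).sum())
```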
A grid-based clustering approach is concerned not with the data points but with the value space that surrounds the data points. In general, a typical grid-based clustering algorithm consists of the following five basic steps:

• Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
• Calculating the cell density for each cell.
• Sorting the cells according to their densities.
• Identifying cluster centers.
• Traversal of neighbor cells.
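A minimal, self-contained Python sketch of these five generic steps, under assumed values for the cell size and the density threshold; the concrete algorithms described below refine this scheme considerably.

```python
import numpy as np
from collections import defaultdict

def grid_clusters(points, cell_size=0.5, density_threshold=5):
    """Generic grid-based clustering sketch over 2-D points."""
    # 1. Create the grid structure: map every point to an integer cell index.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple((p // cell_size).astype(int))].append(idx)
    # 2. Calculate the cell densities and 3. sort the dense cells by density.
    dense = {c for c, members in cells.items() if len(members) >= density_threshold}
    ordered = sorted(dense, key=lambda c: len(cells[c]), reverse=True)
    # 4./5. Identify cluster centers and traverse neighboring dense cells.
    labels, assigned = {}, set()
    cluster_id = 0
    for seed in ordered:
        if seed in assigned:
            continue
        stack = [seed]
        while stack:
            c = stack.pop()
            if c in assigned or c not in dense:
                continue
            assigned.add(c)
            labels[c] = cluster_id
            cx, cy = c
            stack.extend((cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        cluster_id += 1
    return labels, cells

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal((0, 0), 0.3, (100, 2)), rng.normal((4, 4), 0.3, (100, 2))])
    labels, cells = grid_clusters(pts)
    print("dense cells per cluster:", np.bincount(list(labels.values())))
```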
The merits and demerits are as follows.

Merits
• Fast processing time.
• Good cluster quality.
• No distance computations.
• Clustering is performed on summaries and not on individual objects; the complexity is usually O(number of populated grid cells) and not O(number of objects).
• Easy to determine which clusters are neighboring.

Demerits
• All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.
• Shapes are limited to unions of grid cells.

The important algorithms in this method include STING, WaveCluster and CLIQUE. These are explained in the following sections.
3.4.1 STING
STING [6] - STatistical INformation Grid. It is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells (using latitude and longitude) and a hierarchical structure is employed. Several levels of such rectangular cells correspond to different levels of resolution. Figure 12 shows the hierarchical structure of STING clustering.

The algorithm steps are
• Each cell at a higher level is partitioned into child cells (clusters) at the next lower level.
• A cell at level i corresponds to the union of its children at level i + 1.
• Each cell (except the leaves) has 4 children, and each child corresponds to one quadrant of the parent cell.
• Statistical information regarding the attributes in each grid cell is precomputed and stored.
• Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells.
• For each cell, there are attribute-independent parameters and attribute-dependent parameters (attribute-independent parameter: count; attribute-dependent parameters: M, the mean of all values in the cell; S, the standard deviation of all values in the cell; Min, the minimum value of the attribute in the cell; Max, the maximum value of the attribute in the cell).
• The distribution type of the attribute's values in the cell: normal, uniform, exponential or none.
• The value of the distribution may either be assigned by the user or obtained by hypothesis tests such as the χ2 test.
• When data are loaded into the database, the parameters count, m, s, min and max of the bottom-level cells are calculated directly.
• Irrelevant cells are removed, and this process is repeated until the bottom layer is reached.

Fig. 12: A Hierarchical Structure of STING Clustering
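The bottom-up parameter computation described above (higher-level cell statistics derived from their four children) can be sketched as follows in Python; the Cell fields mirror the count/mean/standard-deviation/min/max parameters listed in the text, while the aggregation formulas are the standard ones and not specific to the STING paper.

```python
import math
from dataclasses import dataclass

@dataclass
class Cell:
    count: int      # attribute-independent parameter
    mean: float     # M
    std: float      # S
    vmin: float     # Min
    vmax: float     # Max

def parent_from_children(children):
    """Compute a parent cell's statistical parameters from its 4 child cells."""
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # E[x^2] of the parent from each child's count, mean and standard deviation.
    ex2 = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    std = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return Cell(count=n,
                mean=mean,
                std=std,
                vmin=min(c.vmin for c in children),
                vmax=max(c.vmax for c in children))

if __name__ == "__main__":
    quadrants = [Cell(10, 2.0, 0.5, 1.0, 3.0), Cell(20, 4.0, 1.0, 2.0, 6.0),
                 Cell(5, 1.0, 0.2, 0.8, 1.5), Cell(15, 3.0, 0.7, 1.5, 4.5)]
    print(parent_from_children(quadrants))
```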
3.4.2 Wave Cluster
WaveCluster [34] is a multi-resolution clustering approach which applies a wavelet transform to the feature space. A wavelet transform is a signal-processing technique that decomposes a signal into different frequency sub-bands. Given a set of spatial objects oi, 1 ≤ i ≤ N, the goal of the algorithm is to detect clusters and assign labels to the objects based on the cluster that they belong to. The main idea in WaveCluster is to transform the original feature space by applying a wavelet transform and then find the dense regions in the new space. It yields sets of clusters at different resolutions and scales, which can be chosen based on the user's needs.

The algorithm steps are
• The multidimensional data objects' feature vectors are given as input.
• The feature space is quantized, and the objects are assigned to the cells.
• The wavelet transform is applied to the quantized feature space.
• At different levels, the connected components in the sub-bands of the transformed feature space are found.
• Labels are assigned to the cells.
• The lookup table is created.
• The objects are mapped to the clusters.

Fig. 13: A sample of two-dimensional feature space.

Figure 13 shows a sample of a two-dimensional feature space, where each point in the image represents the feature values of one object in the spatial data set.
Fig. 14: Multiresolution of the feature space in Figure 13 at (a) scale 1 (high resolution); (b) scale 2 (medium resolution); and (c) scale 3 (low resolution).
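To illustrate the quantize-transform-connect pipeline in the steps above, here is a small Python sketch that uses a single level of 2-D Haar averaging as a stand-in for the wavelet transform and a simple flood fill for the connected components; the grid size, density threshold and data are assumptions, and this is not the full multi-band WaveCluster procedure.

```python
import numpy as np

def quantize(points, grid=32, lo=0.0, hi=8.0):
    """Step 2: count how many objects fall in each cell of a grid over the feature space."""
    counts = np.zeros((grid, grid))
    idx = np.clip(((points - lo) / (hi - lo) * grid).astype(int), 0, grid - 1)
    for i, j in idx:
        counts[i, j] += 1
    return counts

def haar_approximation(counts):
    """Step 3 (simplified): one level of 2-D Haar averaging = low-frequency sub-band."""
    return (counts[0::2, 0::2] + counts[1::2, 0::2] +
            counts[0::2, 1::2] + counts[1::2, 1::2]) / 4.0

def connected_components(dense):
    """Step 4: label 4-connected components of the dense cells."""
    labels = np.zeros(dense.shape, dtype=int)
    current = 0
    for start in zip(*np.where(dense)):
        if labels[start]:
            continue
        current += 1
        stack = [start]
        while stack:
            i, j = stack.pop()
            if not (0 <= i < dense.shape[0] and 0 <= j < dense.shape[1]):
                continue
            if not dense[i, j] or labels[i, j]:
                continue
            labels[i, j] = current
            stack += [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal((2, 2), 0.3, (200, 2)), rng.normal((6, 6), 0.3, (200, 2))])
    approx = haar_approximation(quantize(pts))
    labels = connected_components(approx > 1.0)   # dense cells in the low-resolution band
    print("clusters found at this scale:", labels.max())
```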
[7]. … "clustering algorithms", TJMCS, Vol. 5, No. 3 (2012), 229-240.
[8]. S. Anitha Elavarasi, Dr. J. Akilandeswari, Dr. B. Sathiyabhama, "A Survey on Partition Clustering Algorithms", International Journal of Enterprise Computing and Business Systems.
[9]. Pritesh Vora, Bhavesh Oza, "A Survey on K-mean Clustering and Particle Swarm Optimization", International Journal of Science and Modern Engineering (IJISME).
[10]. Deepti Sisodia, Lokesh Singh, "Clustering Techniques: A Brief Survey of Different Clustering Algorithms", International Journal of Latest Trends in Engineering and Technology (IJLTET).
[11]. Shruti Aggrwal and Prabhdip Kaur, "Survey of Partition Based Clustering Algorithm Used for Outlier Detection", International Journal for Advance Research in Engineering and Technology.
[12]. Ramandeep Kaur & Gurjith Singh Bhathal, "A Survey of Clustering Techniques", International Journal on Computer Science and Engineering, Vol. 02, No. 09, 2010, 2976-2980.
[13]. Zengyou He, "Approximation Algorithms for K-Modes Clustering".
[14]. Shehroz S. Khan and Dr. Shri Kant, "Computation of Initial Modes for K-modes Clustering Algorithm using Evidence Accumulation", IJCAI-07, 2784.
[15]. Literature Survey: Data Clustering Algorithms and Soft Computing, Chapter 2.
[16]. Swati Harkanth, Prof. B. D. Phulpagar, "A Survey on Clustering Methods and Algorithms", International Journal of Computer Science and Information Technologies, Vol. 4 (5), 2013, 687-691.
[17]. Gagandeep Kaur, "Clustering".
[18]. Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis, "On Clustering Validation Techniques", Journal of Intelligent Information Systems.
[19]. http://www.stat.wmich.edu/wang/561/classnotes/Grouping/Cluster.pdf
[20]. Xuanqing Chu and Zhuofu Song, "CURE: An Efficient Clustering Algorithm for Large Databases".
[21]. Rui Xu and Donald Wunsch II, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[22]. Bill Andreopoulos, "Literature Survey of Clustering Algorithms".
[23]. Hierarchical Clustering - Wikipedia.
[24]. Ashwina Tyagi and Sheetal Sharma, "Implementation of ROCK Clustering Algorithm for the Optimization of Query Searching Time", International Journal on Computer Science and Engineering.
[25]. Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes".
[26]. Anoop Kumar Jain & Prof. Satyam Maheswari, "Survey of Recent Clustering Techniques in Data Mining", International Journal of Computer Science and Management Research.
[27]. Lior Rokach & Oded Maimon, "Clustering Methods".
[28]. Oliver Sampson, "Parallel DBSCAN".
[29]. Dr. Masoud Yaghini, "Cluster Analysis: Density-Based Methods", Spring 2010.
[30]. Pooja Batra Nagpal and Priyanka Ahlawat Mann, "Comparative Study of Density based Clustering Algorithms", International Journal of Computer Applications (0975-8887), Volume 27, No. 11.
[31]. David Breitkreutz and Kate Casey, "Clusters: A Comparison of Partitioning and Density-Based Algorithms and a Discussion of Optimizations".
[32]. Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander, "OPTICS: Ordering Points To Identify the Clustering Structure".
[33]. Wei Cheng, Wei Wang and Sandra Batista, "Data Clustering: Algorithms and Applications", Chapter 1.
[34]. Gholamhosein Sheikholeslami, Surojit Chatterjee, Aidong Zhang, "WaveCluster: A Wavelet-Based Clustering Approach for Spatial Data in Very Large Databases", The VLDB Journal (2000) 8: 289-304.
[35]. http://sisla06.samsi.info/jpal/mult1031.pdf