1120pm - 85.epra Journals 8308
ABSTRACT
Clustering is the process of arranging comparable data elements into groups. Cluster analysis is one of the
most frequently used analytical techniques in data mining, and the choice of clustering algorithm directly
influences the clustering results. This study examines several types of clustering algorithms, such as
k-means, and compares their advantages and disadvantages. It also highlights concerns with clustering
algorithms, such as time complexity and accuracy, with the aim of giving better outcomes in a variety of
environments. The outcomes are described in terms of big datasets. The focus of the study is on clustering
algorithms in the WEKA data mining tool. Clustering divides a large data set into smaller groups, or
clusters; it is an unsupervised approach that may be used to analyze big datasets with many attributes. It
is a data-modeling technique that provides a clear picture of the data. Two clustering methods, k-means
and hierarchical clustering, are explained in this survey and analyzed with the WEKA tool on different
data sets.
KEYWORDS: data clustering, WEKA, k-means, hierarchical clustering
Partitioning methods
Fuzzy clustering
Hierarchical clustering
Density-based clustering
Model-based clustering

II. LITERATURE REVIEW
1. Manish Verma, Mauly Srivastava, Neha …, “A Comparative Study of Various Clustering Algorithms
in Data Mining” [5]
The authors compared different clustering techniques to determine which algorithm gives the best
performance. K-means was observed to be faster than all the other algorithms considered in that paper,
and K-means and EM gave better results than hierarchical clustering on huge data sets.

2. U. Kaymak and M. Setnes, “Extended fuzzy clustering algorithms” [6]
The authors use a fuzzy clustering algorithm to divide a dataset into clusters. They discuss several
issues with fuzzy algorithms, such as the number and shape of clusters, the division of data patterns,
and choosing the number of clusters in the data. Enhanced versions of the fuzzy algorithms are given and
their properties illustrated. Examples show that the enhanced algorithms do not require any additional
input from the user and can determine the partition of the data on their own.

3. Karthikeyan B., Dipu Jo George, G. Manikandan, Tony Thomas, “A comparative study on k-means
clustering and agglomerative hierarchical clustering” [7]
The authors carried out a comparative study to determine the better-suited algorithm between K-means
and agglomerative hierarchical clustering. They concluded that k-means is best used for larger datasets,
with minimal runtime and memory change rate, and that agglomerative hierarchical clustering is best
suited to smaller data sets because of its minimum overall memory consumption.

4. S. H. Sastry, P. Babu and M. S. Prasada, “Analysis & Prediction of Sales Data in SAP-ERP System
using Clustering Algorithms” [8]
The authors used clustering techniques to identify differences in product sales and to identify and
compare sales over a specific period. The demand for steel products is cyclical and depends on numerous
variables such as customer profile, price, limits and expense issues. The authors analyzed sales data
with clustering algorithms such as K-Means and EM (expectation maximization) and uncovered many
interesting patterns useful for improving sales revenue and achieving higher sales volumes. K-Means and
EM (partitioning methods) are better suited to assessing sales data than density-based methods.

5. Soumi Ghosh, S. K. Dubey, “Comparative Analysis of K-Means…..” [9]
The paper compares two clustering techniques, centroid-based K-Means and representative object-based
Fuzzy C-Means (FCM). The analysis is a performance evaluation of how efficiently the two algorithms
generate their outputs. The results show that the efficiency of FCM is close to that of K-means, but its
computation time is longer because of the fuzzy measure calculations involved.

6. M. Venkat Reddy, M. Vivekananda, RUVN Satish [10]
The researchers identify an efficient clustering technique by comparing divisive and agglomerative
hierarchical clustering with K-means. The outcome of the paper was that agglomerative clustering
combined with k-means is the practical choice for achieving a high degree of accuracy. Divisive
clustering with k-means also works efficiently when each cluster is fixed, i.e. when the initial
centroids are taken as a fixed number for each cluster rather than selected at random.

7. N. Sharma, “Comparison the various clustering algorithms of weka tools” [11]
The authors compare and contrast different clustering algorithms, all implemented with the Weka tool,
to determine which algorithm is more appropriate and efficient. The algorithms include DBSCAN, EM,
Farthest First, OPTICS, and K-Means. The study shows the benefits and drawbacks of each method and
finds that k-means is the simplest of the algorithms and the fastest to use with large datasets.
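
The observation in item 5 that Fuzzy C-Means needs more computation than K-means because of its fuzzy measure calculations can be illustrated with the standard FCM membership formula. The following is only a minimal sketch, not code from any of the reviewed papers: the function name `fcm_memberships` and the fuzzifier value `m = 2.0` are assumptions chosen for illustration. Where k-means only picks the nearest center for each point, this update computes a full row of membership degrees for every point.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Return one membership row per point; each row sums to 1.

    Uses the standard FCM update u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)),
    where d_ij is the distance from point i to center j.
    The fuzzifier m = 2.0 is an illustrative assumption.
    """
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # The point lies exactly on a center: give it full membership there
            # (shared equally if several centers coincide with the point).
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            rows.append([h / total for h in hits])
            continue
        # General case: every membership depends on the distances to ALL centers.
        row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]
        rows.append(row)
    return rows

points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
memberships = fcm_memberships(points, centers)
```

For the middle point, which is three times closer to the first center than to the second, the memberships come out as 0.9 and 0.1, so the point belongs mostly, but not exclusively, to the first cluster.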
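Advantages
Simple: Easy to understand and to implement.
Efficient: Time complexity is O(t.k.n), where t is the number of iterations, k the number of clusters
and n the number of data points, so it is very efficient to work with huge data sets.
Requires an input from the user.

[Figure: hierarchical clustering, with two branches: agglomerative hierarchical and divisive hierarchical]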
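
The O(t.k.n) bound quoted above matches the usual cost of k-means, where each of the t iterations compares all n points against all k centers. The sketch below is a plain, textbook-style version written for illustration, not the Weka implementation; the function name `kmeans`, the random initialization and the `seed` parameter are assumptions. It counts distance evaluations so the t.k.n relationship can be checked directly, and it takes k as an argument, which is the user input the list refers to.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of coordinate tuples.

    Returns (labels, centers, iterations, distance_evals).
    Initial centers are k randomly sampled points (an illustrative choice).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    distance_evals = 0
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assignment step: n points times k centers distance evaluations.
        new_labels = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            distance_evals += k
            new_labels.append(dists.index(min(dists)))
        # Stop once the assignment no longer changes.
        if iteration > 1 and new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centers, iteration, distance_evals

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, iters, evals = kmeans(data, k=2)
```

After the run, `evals` equals `iters * len(data) * 2`, which is the t.k.n product for this small data set.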
VI. WEKA TOOL
Weka is freely available on the Internet and comes with a new data mining document that describes and
thoroughly explains all of the techniques that are included. Applications based on the Weka class
libraries may operate on any computer with a Web browser, allowing users to apply machine learning
algorithms to their own data, independent of the computer's platform. We used Weka version 3.8.5 in
this work to examine the accuracy and speed of the simple K-means and hierarchical clustering
algorithms on pre-given datasets.
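
The text does not say how clustering accuracy was computed. One common approach when a dataset carries class labels is to compare each cluster against those labels. The sketch below assumes a simple majority-vote mapping, where every cluster is credited with its most frequent class; this is a simplification chosen for illustration and is not necessarily the procedure Weka or the authors used. The function name `majority_accuracy` and the sample labels are hypothetical.

```python
from collections import Counter

def majority_accuracy(cluster_ids, class_labels):
    """Percentage of instances whose class equals the majority class of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Each cluster contributes the count of its most frequent class.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return 100.0 * correct / len(class_labels)

# Hypothetical cluster assignments and class labels for six instances.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["pos", "pos", "neg", "neg", "neg", "neg"]
accuracy = majority_accuracy(cluster_ids, class_labels)
```

Here five of the six instances agree with the majority class of their cluster, so the score is about 83.3 percent.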
Fig. 2 shows the running time when both algorithms are applied to the same datasets.
Figure 2. Running time vs. datasets (series: k-means, hierarchical; datasets: diabetes, hypothyroid)
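
A running-time comparison of this kind can be reproduced in outline with a small timing harness. The sketch below is an assumption-laden illustration, not the setup used for Figure 2: `single_linkage` is a naive agglomerative routine written for this example (not Weka's hierarchical clusterer), and `time_clusterer` is a hypothetical helper. The naive merge loop re-scans all pairs of points at every merge, which is one reason hierarchical clustering tends to run slower than k-means as data sets grow.

```python
import math
import time

def single_linkage(points, k):
    """Naive agglomerative clustering: merge the two closest clusters until k remain.

    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # b > a, so removing cluster b does not shift the position of cluster a.
        clusters[a].extend(clusters.pop(b))
    return clusters

def time_clusterer(cluster_fn, data, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        cluster_fn(data)
        best = min(best, time.perf_counter() - start)
    return best

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = single_linkage(data, 2)
seconds = time_clusterer(lambda d: single_linkage(d, 2), data)
```

Passing a different clustering function to `time_clusterer` with the same data gives a like-for-like comparison on one machine; absolute timings will differ from those reported for Weka.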
[Chart: k-means vs. hierarchical on the diabetes and hypothyroid datasets; vertical axis 0 to 100]
IX. REFERENCES
1. D. Karaboga and C. Ozturk, “A novel clustering
approach: Artificial Bee Colony (ABC)
algorithm,” Applied Soft Computing, vol. 11, no. 1,
pp. 652–657, 2011.
2. J. Senthilnath, S. N. Omkar and V. Mani,
“Clustering using firefly algorithm: performance
study,” Swarm and Evolutionary Computation, vol.
1, no. 3, pp. 164–171, 2011.
3. M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On
clustering validation techniques,” J. Intell. Inf.
Syst., vol. 17, no. 2–3, pp. 107–145, 2001.
4. K. Wang, B. Wang, and L. Peng, “CVAP:
validation for cluster analyses,” Data Sci. J., vol. 8,
pp. 88–93, 2009.
5. M. Verma, M. Srivastava, N. Chack, A. K. Diswar,
and N. Gupta, “A Comparative Study of Various
Clustering Algorithms in Data Mining,” Int. J. Eng.
Res. Appl.
6. U. Kaymak and M. Setnes, “Extended fuzzy
clustering algorithms,” ERIM Report Series
Reference No. ERS-2001-51-LIS, 2000.
7. B. Karthikeyan, D. J. George, G. Manikandan, and
T. Thomas, “A comparative study on k-means
clustering and agglomerative hierarchical
clustering,” Int. J. Emerg. Trends Eng. Res., vol. 8,
no. 5