
ISSN (Online): 2455-3662

EPRA International Journal of Multidisciplinary Research (IJMR) - Peer Reviewed Journal


Volume: 7 | Issue: 8 | August 2021|| Journal DOI: 10.36713/epra2013 || SJIF Impact Factor 2021: 8.047 || ISI Value: 1.188

A COMPARATIVE ANALYSIS OF K-MEANS AND HIERARCHICAL CLUSTERING

Aastha Gupta 1, Himanshu Sharma 2, Anas Akhtar 3

1,2,3 Jagan Institute of Management Studies, Sec-5, Rohini

Article DOI: https://doi.org/10.36713/epra8308 | DOI No: 10.36713/epra8308

ABSTRACT
Clustering is the process of arranging comparable data elements into groups. Clustering analysis is one of the most frequently used data mining analytical techniques, and the clustering algorithm's strategy has a direct influence on the clustering results. This study examines several types of clustering algorithms, such as the k-means algorithm, and compares and contrasts their advantages and disadvantages. The paper also highlights concerns with clustering algorithms, such as time complexity and accuracy, in order to give better outcomes in a variety of environments. The outcomes are described in terms of big datasets. The focus of this study is on clustering algorithms with the WEKA data mining tool. Clustering is the process of dividing a big data set into small groups or clusters; it is an unsupervised approach that may be used to analyze big datasets with many characteristics, and a data-modeling technique that provides a clear image of the data. Two clustering methods, k-means and hierarchical clustering, are explained in this survey, together with their analysis using the WEKA tool on different data sets.
KEYWORDS: data clustering, WEKA, k-means, hierarchical clustering

I. INTRODUCTION

Clustering is a vital part of data mining, and it is also one of the hottest topics of science in recent times. It is a technology that examines the logical or physical relationships between data and divides the data set into many clusters, each of which is made up of data that are similar in nature.

Data clustering is a process in which we group together entities with similar characteristics. Clustering quality depends on the similarity metric and how it is implemented. The clustering's main aim is to find a collection of patterns, points, connections, or objects from a natural grouping. Clustering is one of the most remarkable data mining techniques. Based on some rules, data may be classified into several classes or clusters, resulting in great similarity among data objects of the same class and substantial differences among data objects of other classes. [1]

Clustering is a method for logically categorizing raw data and looking for hidden patterns in large datasets. It is the act of grouping data into fragmented clusters so that data in one cluster matches data in another, while data in other clusters varies. Clustering is a common data analysis approach for identifying homogenous groups of objects based on attribute values. Data clustering has many different real-life applications such as image segmentation, data analysis, machine learning, search engines, document retrieval, object recognition and evaluation, computational economics, libraries, and insurance studies.

Clustering algorithms are effective meta-learning tools for assessing the information generated by modern applications, and clustering methods are widely employed in a variety of applications: data organization and categorization, data modelling, and data compression. When selecting a clustering algorithm, consider whether it can scale to your dataset. Machine learning datasets can contain millions of instances, but not all clustering algorithms scale well, since several clustering algorithms compute the similarity of all pairs of examples.

Clustering approaches are used to classify groups of related data in multivariate data sets. There are a variety of clustering methods, including:


• Partitioning methods
• Fuzzy clustering
• Hierarchical clustering
• Density-based clustering
• Model-based clustering

II. LITERATURE REVIEW

1. Manish Verma, Mauly Srivastava, Neha …, "A Comparative Study of Various Clustering Algorithms in Data Mining" [5]
The authors compared different clustering techniques, with the aim of identifying the algorithm that gives the best performance. It was observed that K-means is faster than all the other algorithms mentioned in the paper, and that K-means and EM give better results than hierarchical clustering when working on huge data sets.

2. U. Kaymak and M. Setnes, "Extended fuzzy clustering algorithms" [6]
The authors use a fuzzy clustering algorithm to divide a dataset into clusters. Some of the issues of using fuzzy algorithms were discussed, such as the number and shape of clusters, the division of data patterns, and choosing the number of clusters in the data. Enhanced versions of fuzzy c-means were given and their properties illustrated. Examples were used to show that the enhanced algorithms do not require any additional input from the user and can determine the partition of the data on their own.

3. Karthikeyan B., Dipu Jo George, G. Manikandan, Tony Thomas, "A comparative study on k-means clustering and agglomerative hierarchical clustering" [7]
The authors have done a comparative study to determine the best-suited algorithm among K-Means and Agglomerative Hierarchical Clustering. It was concluded that k-means is best used for larger datasets, with minimal runtime and memory change rate, and that the agglomerative hierarchical clustering technique is best suited for smaller data sets because of its minimal overall memory consumption.

4. S. H. Sastry, P. Babu and M. S. Prasada, "Analysis & Prediction of Sales Data in SAP-ERP System using Clustering Algorithms" [8]
The authors of this paper used clustering procedures to recognize contrasts in product sales and to identify and compare sales over a specific period. Demand for steel products is cyclical and depends on numerous variables such as customer profile, price, discounts, and cost issues. The authors explored sales data with clustering algorithms such as K-Means and EM (expectation maximization), which uncovered many interesting patterns useful for improving sales income and achieving higher sales volumes. K-Means and EM (partitioning procedures) are better suited to evaluating sales data in comparison with density-based procedures.

5. Soumi Ghosh, S. K. Dubey, "Comparative Analysis of K-Means….." [9]
The paper compares two clustering techniques: centroid-based K-Means and representative-object-based Fuzzy C-Means. The analysis is based on a performance evaluation of these algorithms in terms of how efficiently outputs are generated. The results of this comparative research show that the efficiency of FCM is somewhat close to that of K-means; however, its computation time is longer than that of K-means because fuzzy measure calculations are involved.

6. M. Venkat Reddy, M. Vivekananda, RUVN Satish [10]
The researchers identified an efficient clustering technique by comparing Divisive and Agglomerative Hierarchical Clustering with K-means. The outcome of the paper was that agglomerative clustering along with k-means is the practical choice to achieve a high degree of accuracy. Divisive clustering with k-means also functions efficiently where each cluster is fixed, i.e., where the initial centroids are taken in a fixed number for each cluster rather than by random selection.

7. N. Sharma, "Comparison the various clustering algorithms of weka tools" [11]
The authors have compared and contrasted different clustering algorithms, all implemented with the Weka tool. The purpose of their research is to determine which algorithm is more appropriate and efficient. DBSCAN, EM, Farthest First, OPTICS, and K-Means are among the algorithms considered. They demonstrate the benefits and drawbacks of each method; based on their study, they found that the k-means clustering algorithm is the simplest of the algorithms and the fastest for use with large datasets.


III. CLUSTERING PROCESS

The analytical processes required in cluster analysis have been established in the literature based on the basic paradigm of Knowledge Discovery in Databases. Figure 1 depicts the steps involved in the clustering process. [3]

1. Feature selection
This stage is about choosing characteristics for cluster analysis. Because the class labels aren't predefined in cluster analysis, there's a good chance of picking features that are irrelevant or inconsequential, and removing non-essential information improves clustering results. The process of determining the most effective subset of the original characteristics to employ in clustering is known as feature selection; the application of one or more transformations of the input features to create new salient characteristics is known as feature extraction. To get an adequate collection of characteristics to employ in clustering, one or both of these strategies can be applied.

2. Clustering algorithm
The choice of a clustering algorithm influences the clusters obtained from the data. The results obtained from clustering algorithms rest on assumptions that depend on the properties of the data set (geometry and density distribution) and on input parameter values, since the class labels are not specified. A good clustering algorithm can recognize clusters regardless of their structure.

3. Cluster validation
Cluster validation is an assessment of the clusters generated: the clusters are checked to determine that they are of satisfactory quality and match the desired clusters. External, internal, and relative indices can all be used to test clusters. The clusters generated by the algorithm are assessed in this stage, and visualizing the clusters is a useful way to rapidly double-check the cluster results. [4] One internal index is sketched in code after Fig. 1.

4. Result Analysis
The clusters produced from the initial set of data are analyzed to gain a better understanding of them and to guarantee that the attributes of the clusters are obtained. Integration of expert evaluations with additional experimental findings and analysis might also help to broaden the interpretation.

Fig 1. Clustering Process
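As one concrete example of an internal validity index for step 3, the sketch below computes the mean silhouette coefficient from scratch. This is an illustration only, not the validation method used in this paper; the class name, data, and labels are invented for the example. Values near 1 indicate compact, well-separated clusters.

public class SilhouetteSketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s); // Euclidean distance
    }

    // x: data points, label: cluster index per point, k: number of clusters
    static double meanSilhouette(double[][] x, int[] label, int k) {
        int n = x.length;
        double total = 0;
        for (int i = 0; i < n; i++) {
            double[] sum = new double[k];
            int[] cnt = new int[k];
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                sum[label[j]] += dist(x[i], x[j]);
                cnt[label[j]]++;
            }
            if (cnt[label[i]] == 0) continue;          // singleton cluster: conventionally s = 0
            double a = sum[label[i]] / cnt[label[i]];  // cohesion: mean distance within own cluster
            double b = Double.MAX_VALUE;               // separation: mean distance to nearest other cluster
            for (int c = 0; c < k; c++)
                if (c != label[i] && cnt[c] > 0) b = Math.min(b, sum[c] / cnt[c]);
            total += (b - a) / Math.max(a, b);
        }
        return total / n;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 1}, {1.2, 0.9}, {5, 5}, {5.1, 4.8}};
        int[] labels = {0, 0, 1, 1};                      // a clean two-cluster assignment
        System.out.println(meanSilhouette(x, labels, 2)); // close to 1: well separated
    }
}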


IV. CLUSTERING BENCHMARKING CRITERION

The comparative strengths and limitations of each algorithm in relation to the three-dimensional [3-D] characteristics of large data should be analyzed by particular criteria for the evaluation of large-data clustering methods, namely Volume, Velocity, and Variety.

The efficiency to manage a large amount of data is called the volume of a clustering process. The following criteria are taken into account while choosing a good clustering algorithm for the Volume property:
i) the dataset's size,
ii) dealing with high dimensionality, and
iii) managing noisy data.

The capability to handle various sorts of data is referred to as the variety of a clustering process. The following criteria are taken into account while choosing a good clustering algorithm for the Variety property:
i) the dataset type;
ii) the shape of clusters.

The speed of an algorithm over massive data is referred to as the velocity of a clustering process. The following criteria are taken into account while choosing a good clustering procedure for the Velocity property:
i) the algorithm's complexity;
ii) the algorithm's run-time performance.

V. COMPARATIVE ANALYSIS

V.I. K-Means
The K-means clustering algorithm is a commonly used, well-known data clustering technique, useful for extracting meaningful information from a large database. It is used in a variety of applications, including information retrieval and computer vision. K-means clustering divides n data points into k clusters, allowing for the grouping of comparable data points. It is an iterative strategy: each point is assigned to the cluster with the closest centroid, and the centroid of each cluster is then recalculated as the average of its members. A minimal sketch of this iteration appears after the lists below.

Advantages
• Simple: easy to understand and to implement.
• Efficient: time complexity is O(t·k·n), very efficient with huge data sets.

Disadvantages
• Requires the number of clusters (k) as input from the user.
• K-Means may be computationally faster only if the value of K is small.
• Can only be used if the mean is known.
• Not suitable for high-dimensional data.
• Sensitive to noise/outliers. [12]
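A minimal from-scratch sketch of the k-means loop just described (the class name, toy data, and seed are illustrative assumptions, not the paper's setup):

import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s; // squared Euclidean distance is enough for comparisons
    }

    static int[] cluster(double[][] x, int k, int maxIter, long seed) {
        int n = x.length, d = x[0].length;
        Random rnd = new Random(seed);
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++)
            c[j] = x[rnd.nextInt(n)].clone(); // random initial centroids (real code avoids duplicates)
        int[] label = new int[n];
        for (int it = 0; it < maxIter; it++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {      // assignment step: nearest centroid
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist(x[i], c[j]) < dist(x[i], c[best])) best = j;
                if (label[i] != best) { label[i] = best; changed = true; }
            }
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {      // update step: centroid = mean of members
                count[label[i]]++;
                for (int f = 0; f < d; f++) sum[label[i]][f] += x[i][f];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0)              // empty clusters keep their old centroid
                    for (int f = 0; f < d; f++) c[j][f] = sum[j][f] / count[j];
            if (!changed) break;               // converged: no point changed cluster
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 1}, {1.2, 0.8}, {5, 5}, {5.1, 4.9}, {9, 1}, {8.8, 1.2}};
        System.out.println(Arrays.toString(cluster(x, 3, 100, 42)));
    }
}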

V.II. Hierarchical clustering
A hierarchical method creates a hierarchical representation of a set of data items; dendrograms are made using the tree of clusters. Sibling clusters split the points covered by their shared parent, whereas child clusters exist in every cluster node. Hierarchical algorithms are a typical clustering approach that can be helpful for a range of data mining tasks. A hierarchical clustering technique creates a succession of clusterings in which each grouping is nested into the clustering behind it.

Advantages
• Applicable to all attribute types.
• Easy at handling similarity data.
• Small groups are formed, making analysis and comprehension simpler.
• The number of clusters is not pre-defined, so the user can dynamically select clusters.
• Conceptually simple.

Disadvantages
• Cluster merging/splitting is a permanent process.
• It is impossible to correct erroneous judgments afterwards.
• Divisive techniques can be time-consuming to compute.
• Methods aren't always (necessarily) scalable when dealing with huge datasets.
• A termination/readout condition is required.

Hierarchical clustering can be divided into two sub-categories: agglomerative hierarchical clustering and divisive hierarchical clustering.


I. Agglomerative Hierarchical clustering
The bottom-up approach is often referred to as the agglomerative approach, since it begins with each object forming a separate group. It then keeps merging the nearest objects or groups until all the groups are combined into one, or until the termination condition is met. The aim of the agglomerative clustering technique is to group together objects with similar characteristics. [14] A rough sketch of this merging loop follows below.

II. Divisive Hierarchical clustering
The divisive clustering method, on the other hand, works from the top down, starting with a single cluster at the top and dividing it as it moves toward the bottom. It usually starts with all of the objects in one cluster; then, through the application of k-means clustering, a cluster is divided into smaller clusters. This proceeds until the termination condition is met, in the limit with every object in its own cluster. [13]
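A rough from-scratch sketch of the agglomerative merging loop. Single linkage is an assumption made purely for illustration (Weka's HierarchicalClusterer supports several link types), and the class name and data are invented; divisive clustering runs the process in the opposite direction, splitting instead of merging.

import java.util.ArrayList;
import java.util.List;

public class AgglomerativeSketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // single linkage: cluster distance = distance of the closest pair of members
    static double linkage(List<double[]> c1, List<double[]> c2) {
        double best = Double.MAX_VALUE;
        for (double[] p : c1)
            for (double[] q : c2) best = Math.min(best, dist(p, q));
        return best;
    }

    static List<List<double[]>> cluster(double[][] x, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : x) {                 // every point starts as its own cluster
            List<double[]> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);
        }
        while (clusters.size() > k) {          // repeatedly merge the two closest clusters
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj)); // merging is permanent
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 1}, {1.1, 0.9}, {5, 5}, {5.2, 5.1}, {9, 9}};
        cluster(x, 2).forEach(c -> System.out.println("cluster of size " + c.size()));
    }
}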

VI. WEKA TOOL
Weka is freely available on the Internet and comes with data mining documentation that describes and thoroughly explains all of the techniques included. Applications based on the Weka class libraries may operate on any computer with a Web browser, allowing users to apply machine learning algorithms to their own data, independent of the computer's platform. We used Weka version 3.8.5 in this work to examine the accuracy and speed of the simple K-means and hierarchical clustering algorithms on the given datasets.
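Such an experiment can also be driven from Weka's Java API rather than the Explorer GUI. The following is a hedged sketch under the assumptions that the ARFF file sits at the placeholder path data/diabetes.arff, that the class label is the last attribute, and that k = 3 with an arbitrary seed:

import weka.clusterers.HierarchicalClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class WekaClusteringDemo {
    public static void main(String[] args) throws Exception {
        // load an ARFF dataset; "data/diabetes.arff" is a placeholder path
        Instances data = DataSource.read("data/diabetes.arff");

        Remove rm = new Remove();          // drop the class label before clustering
        rm.setAttributeIndices("last");
        rm.setInputFormat(data);
        Instances train = Filter.useFilter(data, rm);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);              // k = 3, as in the experiment below
        km.setSeed(10);                    // arbitrary seed for reproducibility
        long t0 = System.currentTimeMillis();
        km.buildClusterer(train);
        System.out.println("k-means time (ms): " + (System.currentTimeMillis() - t0));

        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setNumClusters(3);
        t0 = System.currentTimeMillis();
        hc.buildClusterer(train);
        System.out.println("hierarchical time (ms): " + (System.currentTimeMillis() - t0));
    }
}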

VII. EXPERIMENT

Various datasets with known clusterings are available in the UCI collection of machine learning databases for testing the accuracy and efficiency of the simple k-means and hierarchical clustering algorithms. The Diabetes and Hypothyroid datasets, along with a brief explanation of the datasets utilized in the experimental evaluation, are used in this study. [11]

Table 1 lists some features of the test datasets: the number of attributes and the number of instances in each dataset.


Table 1. Description of Data Sets

Datasets        No. of Attributes    No. of Instances
Diabetes        9                    768
Hypothyroid     30                   3772

Table 2. Clustering Results for Data Sets

Datasets        k-means running    Hierarchical clustering    k-means       Hierarchical clustering
                time (sec)         running time (sec)         Accuracy %    Accuracy %
Diabetes        0.06               2.14                       51.692        65.104
Hypothyroid     0.16               5.74                       69.64         93.24

Table 2 shows the clustering results for k = 3 clusters.
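Accuracy figures of this kind correspond to Weka's classes-to-clusters evaluation, which maps each cluster to its majority class and counts incorrectly clustered instances. A sketch of that evaluation, following Weka's documented recipe (the file path is a placeholder, and the class is assumed to be the last attribute):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterAccuracyDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);           // class = last attribute

        Remove rm = new Remove();                               // train without the class
        rm.setAttributeIndices("" + (data.classIndex() + 1));
        rm.setInputFormat(data);
        Instances train = Filter.useFilter(data, rm);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);
        km.buildClusterer(train);

        // classes-to-clusters: Weka maps each cluster to its majority class
        // and reports the incorrectly clustered instances
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}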

Fig 2 shows the running time when both algorithms are applied on the same datasets.

[Figure 2. Running time v/s Datasets: bar chart of running time (sec) for k-means and hierarchical clustering on the diabetes and hypothyroid datasets]

[Figure 3. Accuracy v/s Datasets: bar chart of accuracy (%) for k-means and hierarchical clustering on the diabetes and hypothyroid datasets]


VIII. CONCLUSION

The K-means method performs well in clustering huge data sets, and its performance improves as the number of clusters grows. For categorical data, a hierarchical algorithm was employed; in view of its complexity, a new approach was applied that assigns rank values to each categorical attribute using K-means, in which the categorical data is first transformed to numeric form by assigning rank values to each categorical attribute. The K-means algorithm performs better than the hierarchical clustering algorithm. The RMSE lowers as the number of clusters rises, and the performance of the K-means method improves as the RMSE drops. When clustering noisy data, all of the methods exhibit some uncertainty. When a large dataset is used, the quality of all algorithms improves dramatically. The K-means algorithm is extremely sensitive to dataset noise; this noise makes it difficult for the algorithm to group data into appropriate clusters, and thus has an impact on the method's outcome. When working with large datasets, the K-means method outperforms conventional clustering algorithms while still producing high-quality clusters.

IX. REFERENCES
1. D. Karaboga and C. Ozturk, "A novel clustering approach: Artificial Bee Colony (ABC) algorithm," Applied Soft Computing, vol. 11, no. 1, (2011), pp. 652-657.
2. J. Senthilnath, S. N. Omkar and V. Mani, "Clustering using firefly algorithm: performance study," Swarm and Evolutionary Computation, vol. 1, no. 3, (2011), pp. 164-171.
3. M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On clustering validation techniques," J. Intell. Inf. Syst., vol. 17, no. 2-3, pp. 107-145, 2001.
4. K. Wang, B. Wang, and L. Peng, "CVAP: validation for cluster analyses," Data Sci. J., vol. 8, pp. 88-93, 2009.
5. M. Verma, M. Srivastava, N. Chack, A. K. Diswar, and N. Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," Int. J. Eng. Res. Appl.
6. U. Kaymak and M. Setnes, "Extended fuzzy clustering algorithms," ERIM Report Series Reference No. ERS-2001-51-LIS, (2000).
7. B. Karthikeyan, D. J. George, G. Manikandan, and T. Thomas, "A comparative study on k-means clustering and agglomerative hierarchical clustering," Int. J. Emerg. Trends Eng. Res., vol. 8, no. 5.
8. S. H. Sastry, P. Babu and M. S. Prasada, "Analysis & Prediction of Sales Data in SAP-ERP System using Clustering Algorithms," arXiv preprint arXiv:1312.2678, (2013).
9. S. Ghosh and S. K. Dubey, "Comparative Analysis of K-Means and Fuzzy C-Means Algorithms," International Journal of Advanced Computer Science and Applications, vol. 4, no. 4, 2013.
10. M. V. Reddy, M. Vivekananda, and R. U. V. N. Satish, "Divisive Hierarchical Clustering with K-means and Agglomerative Hierarchical Clustering."
11. N. Sharma, A. Bajpai and R. Litoruya, "Comparison the various clustering algorithms of weka tools," International Journal of Emerging Technology and Advanced Engineering, vol. 2, no. 5, (2012) May.
12. A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. P. Patel, and A. Tiwari, "A Review of Clustering Techniques and Developments."
13. K. Wang, B. Wang, and L. Peng, "CVAP: validation for cluster analyses," Data Sci. J., vol. 8, pp. 88-93, 2009.
14. N. Erman, A. Korosec, and J. Suklan, "Performance of selected agglomerative hierarchical clustering methods."
