International Journal of Computer Applications (0975 – 8887)
Volume 65– No.11, March 2013
A New Efficient Approach towards k-means
Clustering Algorithm
Pallavi Purohit
Department of Information Technology, Medi-caps Institute of Technology, Indore
Ritesh Joshi
Department of Master of Computer Application, Medi-caps Institute of Technology, Indore
ABSTRACT
K-means clustering algorithms are widely used in many practical applications. The original k-means algorithm selects initial centroids and medoids randomly, which affects the quality of the resulting clusters and sometimes generates unstable and empty clusters, which are meaningless. The original k-means algorithm is also computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters and the number of iterations.
The new approach to the k-means algorithm eliminates these deficiencies of the existing k-means. It first computes the k initial centroids systematically, with k specified by the user, and then produces better and more effective clusters without sacrificing accuracy. It generates stable clusters, reduces the mean square error and improves the quality of clustering. We also apply our algorithm to the evaluation of students' academic performance, to support effective decision making by student counselors.
Keywords
Cluster analysis, Centroids, K-means.
1. INTRODUCTION
Unsupervised learning is the branch of machine learning whose purpose is to give machines the ability to find hidden structure within data. Typical tasks in unsupervised learning include discovering the "natural" clusters present in the data, finding a meaningful low-dimensional representation of the data, and explicitly learning a probability function that represents the true distribution of the data. The clustering problem is a classical problem in the database, knowledge discovery, artificial intelligence and theoretical literature, and is used to find groups of similar records in very large datasets [6]. Given a training data set, the goal of a clustering algorithm is to group similar data points in the same cluster while putting dissimilar data points in different clusters.
Clustering is used in a wide variety of fields: biology, statistics, pattern recognition, information retrieval, machine learning, psychology, and data mining. For example, it is used to group related documents for browsing, to find genes and proteins that have similar functionality, to find similarities in medical image databases, or as a means of data compression. Clustering is an important branch of pattern recognition, and it aims at modeling fuzzy (i.e., ambiguous) unlabeled patterns efficiently [1].
There are a number of clustering methods, which can be classified into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods [10]. Each of these methods handles some issues related to clustering, but there is no single universal clustering algorithm that can handle all of them [9]. With regard to the problem of
partitioning N objects into k classes, finding the best clustering is an NP-hard problem. It is a well-known fact that the standard k-means algorithm gets easily trapped in a local minimum.
The rest of this paper is organized as follows. Section 2 describes the procedure of cluster analysis. Section 3 describes the advantages and limitations of the existing k-means algorithm. Section 4 presents the new approach, a variation of k-means. Section 5 discusses a performance study of the existing k-means and the proposed variation. Finally, Section 6 presents the conclusion and future work, and Section 7 lists the references.

2. PROCEDURE OF CLUSTER ANALYSIS
Cluster analysis is mainly divided into four basic steps, as shown in Figure 1 [3].
Figure 1. Clustering procedure steps
2.1 Feature Selection or Extraction
Feature selection is the process of discovering the attributes of a dataset that are most relevant to the data mining task. It is a commonly used and powerful technique for reducing the dimensionality of a problem to a more manageable size. Feature extraction, in contrast, applies transformations to the original features to generate new and useful ones. It does not remove any of the original attributes from further consideration, and it is best suited to datasets where most of the dimensions are relevant to the clustering task but many are highly correlated or redundant. Ideally, features should be useful in distinguishing patterns belonging to different clusters, immune to noise, and easy to extract and interpret [2].
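As a concrete illustration (ours, not part of the original paper; it assumes scikit-learn and a small hypothetical NumPy matrix), the following sketch performs feature extraction with PCA, which builds new features as linear combinations of the originals without discarding any original dimension from consideration:

    import numpy as np
    from sklearn.decomposition import PCA

    # Toy dataset: 6 samples with 4 correlated attributes (hypothetical values).
    X = np.array([
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [6.3, 3.3, 6.0, 2.5],
        [5.8, 2.7, 5.1, 1.9],
        [6.7, 3.1, 5.6, 2.4],
        [5.0, 3.4, 1.5, 0.2],
    ])

    # Feature extraction: project the 4 correlated attributes onto
    # 2 new, uncorrelated components that retain most of the variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (6, 2)
    print(pca.explained_variance_ratio_)  # variance captured by each component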
2.2 Clustering Algorithm Design and Selection
In this step, the proximity (similarity or dissimilarity) measure and the criterion function are selected. The proximity measure greatly affects the resulting clusters; almost all clustering algorithms are explicitly or implicitly tied to some definition of proximity. Once the proximity measure is chosen, a criterion function is selected so that the clustering problem becomes a well-defined mathematical optimization (e.g. minimizing a squared error function). Many clustering algorithms have been developed to solve different issues in a variety of fields, but no clustering algorithm can be universally applied to all problems. It is therefore important to carefully select and design a clustering algorithm that suits the characteristics of the specific problem.
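As an illustration of a criterion function (our sketch, assuming NumPy; the names and values are hypothetical), the following evaluates the squared-error criterion for a given assignment of points to centroids:

    import numpy as np

    def squared_error(X, centroids, labels):
        """Sum of squared Euclidean distances of each point
        to the centroid of its assigned cluster (the SSE criterion)."""
        diffs = X - centroids[labels]   # per-point offset from its centroid
        return float(np.sum(diffs ** 2))

    # Tiny hypothetical example: 4 points, 2 clusters.
    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
    centroids = np.array([[0.0, 0.5], [5.5, 5.0]])
    labels = np.array([0, 0, 1, 1])
    print(squared_error(X, centroids, labels))  # 1.0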
2.3 Cluster Validation
It is difficult to tell whether the clusters generated are meaningful or merely an artifact of the algorithm. Every clustering algorithm divides the given dataset into a number of partitions, regardless of whether any structure actually exists in the data. Moreover, different clustering algorithms generate different results on the same dataset, and some algorithms even generate different results for different parameter settings or different orderings of the input data. There must therefore be evaluation standards and criteria that give the user a degree of confidence in the clustering results derived from an algorithm.
There are three kinds of validation criteria [5]:
External indices: based on prior knowledge and used as a standard to validate clustering solutions.
Internal indices: independent of prior knowledge; they examine the clustering structure directly from the original data (see the sketch after this list).
Relative criteria: compare different clustering structures to decide which one best reveals the characteristics of the objects.
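As an example of an internal index (our sketch, assuming scikit-learn; the data are hypothetical), the silhouette coefficient scores a clustering using only the data and the assigned labels, with no ground truth required:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    # Two hypothetical Gaussian blobs.
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Internal index: silhouette in [-1, 1]; higher means tighter,
    # better-separated clusters.
    print(silhouette_score(X, labels))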
2.4 Result Interpretation
The goal of the clustering algorithm is to extract the important hidden information from the original dataset and to provide the user with meaningful insights. The result should be easily interpretable and usable by the user. Figure 1 shows a feedback pathway because the clustering process may iterate several times to find the optimal solution, the optimal parameter values, or the appropriate features.
3. REVIEW OF EXISTING K-MEANS CLUSTERING
3.1 Distance Calculation
The distance between two points is taken as a common metric
to assess the similarity among the components of a
population. The most commonly used distance measure is the
Euclidean metric which defines the distance between two
points p = (p1, p2,…) and q = (q1,q2, ….) as
d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2} \qquad (1)
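A direct transcription of Eq. (1) (our sketch, assuming NumPy):

    import numpy as np

    def euclidean(p, q):
        """Euclidean distance of Eq. (1): square root of summed squared differences."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sqrt(np.sum((p - q) ** 2)))

    print(euclidean([0, 0], [3, 4]))  # 5.0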
3.2 Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e. every incoming object's similarity is compared against the initiator. The initiator is called the cluster seed.
3.3 Existing K-means
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results [2]. The better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point the method re-calculates k new centroids as the barycenters of the clusters resulting from the previous step. After these k new centroids have been obtained, a new binding is made between the same data set points and the nearest new centroid, and the procedure loops. As a result of this loop, the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more. Finally, the algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is [9]
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (2)

where \| x_i^{(j)} - c_j \|^2 is a chosen distance measure between a data point x_i^{(j)} and the cluster centre c_j, and J is an indicator of the distance of the n data points from their respective cluster centers [4].
The algorithm is composed of the following steps (a brief code sketch follows the list):
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
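The following is a minimal sketch of these four steps (ours, not the authors' code; it assumes NumPy and, as in the standard algorithm, uses randomly chosen initial centroids):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Standard k-means: random initial centroids, then alternate
        assignment (step 2) and centroid update (step 3) until stable (step 4)."""
        rng = np.random.default_rng(seed)
        # Step 1: pick k distinct data points as initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its cluster
            # (keep the old centroid if a cluster becomes empty).
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Step 4: stop when centroids no longer move.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels

    # Usage on two hypothetical blobs:
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
    centroids, labels = kmeans(X, k=2)
    print(centroids)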
4. PROPOSED ALGORITHM
In the proposed k-means algorithm, two main tasks are performed to obtain a better result. Instead of selecting the initial centroids randomly, the initial centroids are determined systematically so as to produce stable clusters. The algorithm calculates the Euclidean distance between each pair of data points and selects the two data points between which the distance is the shortest, forming a data-point set that contains these two points; these points are then deleted from the population. The nearest remaining data point is then repeatedly found and added to this set. The number of elements in each set is decided systematically by the size of the initial population and the number of clusters.
Our proposed algorithm is as follows (a code sketch of the seeding phase, steps 1-8, appears after the list):
1. Set p = 1.
2. Compute the distance between each data point and all other data points in the set D.
3. Find the closest pair of data points in the set D and form a data-point set Ap (1 <= p <= k) which contains these two data points; delete these two data points from the set D.
4. Find the data point in D that is closest to the data-point set Ap; add it to Ap and delete it from D.
5. Repeat step 4 until the number of data points in Ap reaches (n/k+1).
6. If p < k+1, then p = p+1; find another pair of data points in D between which the distance is the shortest, form another data-point set Ap and delete them from D; go to step 4.
7. For each data-point set Ap (1 <= p <= k+1), find the arithmetic mean Cp (1 <= p <= k+1) of the vectors of the data points in Ap.
8. Select the object nearest to each Cp (1 <= p <= k+1) as an initial centroid.
9. Compute the distance d(di, cj) of each data point di (1 <= i <= n) to all the centroids cj (1 <= j <= k+1).
10. For each data point di, find the closest centroid cj and assign di to cluster j.
11. Set ClusterId[i] = j; // j: id of the closest cluster
12. Set Nearest_Dist[i] = d(di, cj).
13. For each cluster j (1 <= j <= k+1), recalculate the centroids.
14. Repeat:
15. For each data point di:
  15.1 Compute its distance from the centroid of its present nearest cluster.
  15.2 If this distance is less than or equal to the present nearest distance, the data point stays in that cluster; else:
    15.2.1 For every centroid cj (1 <= j <= k+1), compute the distance d(di, cj); end for.
    15.2.2 Assign the data point di to the cluster with the nearest centroid cj.
    15.2.3 Set ClusterId[i] = j.
    15.2.4 Set Nearest_Dist[i] = d(di, cj); end for.
16. For each cluster j (1 <= j <= k+1), recalculate the centroids; until the convergence criteria is met.
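The following is our illustrative sketch of the seeding phase (steps 1-8), not the authors' code. It assumes NumPy, forms k sets of roughly n/k nearest points (the paper's steps use k+1 sets; we simplify to k for clarity), and measures closeness to a set by the distance to the set's mean:

    import numpy as np

    def systematic_seeds(X, k):
        """Steps 1-8: grow k point sets from closest pairs, then return
        the data point nearest each set's mean as an initial centroid."""
        remaining = list(range(len(X)))
        target = len(X) // k                 # approximate set size n/k
        seeds = []
        for _ in range(k):
            pts = np.asarray([X[i] for i in remaining])
            idx = np.array(remaining)
            # Steps 3/6: closest pair among the remaining points.
            D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
            np.fill_diagonal(D, np.inf)
            a, b = np.unravel_index(D.argmin(), D.shape)
            group = [idx[a], idx[b]]
            remaining = [i for i in remaining if i not in group]
            # Steps 4-5: repeatedly absorb the nearest remaining point.
            while len(group) < target and remaining:
                mean = X[group].mean(axis=0)
                nearest = min(remaining, key=lambda i: np.linalg.norm(X[i] - mean))
                group.append(nearest)
                remaining.remove(nearest)
            # Steps 7-8: seed = data point closest to the group mean.
            mean = X[group].mean(axis=0)
            seeds.append(X[group][np.argmin(np.linalg.norm(X[group] - mean, axis=1))])
        return np.array(seeds)

    # These seeds then replace the random initialization of standard k-means.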
5. PERFORMANCE STUDY
The accuracy of the model has been tested for both the existing k-means and the new approach, on datasets of the same size. Table I reports the accuracy and mean square error on the Ecoli dataset; Figure 2 charts the accuracy comparison and Figure 3 the mean square error comparison. The experiments show that accuracy increases significantly with the new approach of k-means. Table II and Figure 4 report the corresponding results on the Vehicle dataset.

Table I. Accuracy & MSE performance (Ecoli dataset)
Algorithm  | Cluster seed | Mean Square Error | Accuracy (%)
K-mean     | 891350       | 93.57             | 80.95
K-mean     | 123456       | 81.18             | 82.44
K-mean     | 456539       | 60.66             | 88.69
K-mean     | 237854       | 61.26             | 91.36
New K-mean | -            | 48.26             | 92.85

Figure 2. Accuracy performance chart (Ecoli dataset)
Figure 3. MSE comparison chart (Ecoli dataset)
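For reproducibility, the following sketch shows one common way such accuracy and MSE figures can be computed (our illustration, assuming NumPy and integer ground-truth class labels; the paper does not specify its exact evaluation code):

    import numpy as np

    def clustering_accuracy(labels, y_true):
        """Map each cluster to its majority ground-truth class,
        then score the fraction of points labeled correctly."""
        correct = 0
        for j in np.unique(labels):
            members = y_true[labels == j]
            correct += np.bincount(members).max()  # majority class count
        return correct / len(y_true)

    def mse(X, centroids, labels):
        """Mean squared distance of each point to its assigned centroid."""
        return float(np.mean(np.sum((X - centroids[labels]) ** 2, axis=1)))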
Table II. Accuracy & MSE performance (Vehicle dataset)
Algorithm  | Cluster seed | Mean Square Error | Accuracy (%)
K-mean     | 123456       | 7085.63           | 74.72
K-mean     | 347698       | 5869.40           | 85.12
K-mean     | 763451       | 5816.62           | 81.54
K-mean     | 884712       | 5816.36           | 82.28
K-mean     | 995634       | 7029.56           | 72.61
New K-mean | -            | 5849.38           | 87.86
Figure 4. Accuracy performance chart (Vehicle dataset)
6. CONCLUSION
The new approach to the classical partition-based clustering algorithm improves the execution time of k-means with no loss of clustering quality in most cases. From our results we conclude that the proposed implementation of the k-means algorithm is the better of the two. From the experiments we also observe that the proposed algorithm gives higher accuracy on dense datasets than on sparse ones.
7. REFERENCES
[1] Dechang Pi, Xiaolin Qin and Qiang Wang, "Fuzzy Clustering Algorithm Based on Tree for Association Rules", International Journal of Information Technology, Vol. 12, No. 3, 2006.
[2] Fahim A.M., Salem A.M., "Efficient enhanced k-means clustering algorithm", Journal of Zhejiang University Science, 1626-1633, 2006.
[3] Fang Yuan, Zeng Hui Meng, "A New Algorithm to Get the Initial Centroids", Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August, 1191-1193, 2004.
[4] Friedrich Leisch and Bettina Grün, "Extending Standard Cluster Algorithms to Allow for Group Constraints", Compstat 2006, Proceedings in Computational Statistics, Physica Verlag, Heidelberg, Germany, 2006.
[5] J. MacQueen, "Some methods for classification and analysis of multivariate observations", University of California, Los Angeles, 281-297.
[6] Maria Camila N. Barioni, Humberto L. Razente, Agma J. M. Traina, "An efficient approach to scale up k-medoid based algorithms in large databases", 265-279.
[7] Michael Steinbach, Levent Ertöz and Vipin Kumar, "Challenges in high dimensional data sets", International Conference of Data Management, Vol. 2, No. 3, 2005.
[8] Parsons L., Haque E., and Liu H., "Subspace clustering for high dimensional data: a review", SIGKDD Explorations Newsletter 6, 90-105, 2004.
[9] Rui Xu, Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[10] Sanjay Garg, Ramesh Chandra Jain, "Variation of k-mean Algorithm: A study for High Dimensional Large data sets", Information Technology Journal 5(6), 1132-1135, 2006.
[11] Vance Faber, "Clustering and the Continuous k-means Algorithm", Los Alamos Science; Georgian Electronics Scientific Journal: Computer Science and Telecommunication, Vol. 4, No. 3, 1994.
[12] Zhexue Huang, "A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining".
[13] Nathan Rountree, "Further Data Mining: Building Decision Trees", first presented 28 July 1999.
[14] Yang Liu, "Introduction to Rough Set Theory and Its Application in Decision Support Systems".
[15] Wei-Yin Loh, "Regression trees with unbiased variable selection and interaction detection", University of Wisconsin-Madison.
[16] S. Rasoul Safavian and David Landgrebe, "A Survey of Decision Tree Classifier Methodology", School of Electrical Engineering, Purdue University, West Lafayette, IN 47907.
[17] David S. Vogel, Ognian Asparouhov and Tobias Scheffer, "Scalable Look-Ahead Linear Regression Trees".
[18] Alin Dobra, "Classification and Regression Tree Construction", Thesis Proposal, Department of Computer Science, Cornell University, Ithaca, NY, November 25, 2002.
[19] Yinmei Huang, "Classification and regression tree (CART) analysis: methodological review and its application", The Department of Sociology, The University of Akron, Olin Hall 247, Akron, OH 44325-1905.
[20] Yan X. and Han J. (2003), "GSpan: Graph-Based Substructure Pattern Mining", Proc. 2nd IEEE Int. Conf. on Data Mining (ICDM 2003, Maebashi, Japan), 721-724, IEEE Press, Piscataway, NJ, USA.