Chapter 1 Introduction
Chapter 1 Introduction
Raw data may be gathered in large quantities from many different disciplines, but this data
is meaningless unless it is properly reasoned through to provide knowledge that is helpful.
In this project, we concentrate on clustering, one of the key data mining techniques.
In unsupervised learning, the goal is often to find hidden structures or clusters in the data
or to uncover relationships between variables. By extracting meaningful patterns,
unsupervised learning algorithms can help with tasks such as data exploration, data pre-
processing, anomaly detection, and feature engineering. The most successful unsupervised
learning approach is clustering, which is one of several strategies employed. The most used
clustering method is K-means. K-means is sensitive to out-of-the-ordinary data points, and
it occasionally creates empty clusters.We recommended a novel method to address this
issue. K-means genetic algorithm. In this study, the Genetic K-means clustering approach
is the main focus.
1
characteristics or proximity in the feature space. The objective is to identify natural
clusters within the data without any prior knowledge of the class labels. Common
clustering algorithms include k-means, hierarchical clustering, and DBSCAN.
2. Dimensionality reduction: Dimensionality reduction techniques aim to reduce the
number of variables or features in the data while preserving important information.
These methods are useful when dealing with high-dimensional data by simplifying
it and allowing for easier visualization and analysis. Principal Component Analysis
(PCA) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are popular
dimensionality reduction algorithms.
Unsupervised learning plays a crucial role in various fields, including data mining, pattern
recognition, recommender systems, and exploratory data analysis, as it helps uncover
underlying structures and insights in unlabelled datasets.
1.2.1 Clustering:
It is an unsupervised method that uses an automated method to group items with similar
properties together [5]. Statistics professionals also refer to it as categorization. Clustering
is grouping things based on similarities. The items are divided into two clusters: one for
those with similar qualities and another for those with different properties. The Manhattan
distance or the Euclidian distance are used to measure similarity. Getting low intra-cluster
and the converse for inter-cluster similarity is the main goal of clustering.
2
1.3 Types of Clustering
• Partitioning clustering
• Hierarchical clustering
• Fuzzy clustering
• Density Based clustering
• Model based clustering
3
things. It is not necessary to pre-specify the value of k or the number of clusters [7]. The
outcome of hierarchical clustering yields a tree-based structure. Its primarily of two types:
• Agglomerative Clustering
• Divisive Clustering
4
4. Hamming Distance
A straight line distance called a euclidian distance exists between two places. L2 standard
is another name for the Euclidian separation [8]. The distance between two points on a
plane with the coordinates (x, y) is known as the Euclidian separation. The formula to
calculate Euclidian distance is:
City block distance, or L1 norm are other names for Manhattan Distance. It is measured in
right-angle axes [8]. Manhattan distance is the product of diagonal and horizontal distance,
which is calculated using Pythagoras' theorem. The Manhattan Distance formula is as
follows:
5
Finding the right number of clusters and interpreting and validating consistency inside the
cluster are both aided by the Elbow approach. This approach focuses on variance %, which
depends on the value of k [8]. The value of k should be set so that addition of more clusters
does not significantly improve the modelling of the data.
Number of Clusters K
𝑛(𝑘) − 𝑥(𝑘)
𝑇(𝑘) =
max{𝑛(𝑘), 𝑥(𝑘)}
6
Number of Clusters K
Figure 1.3 Evaluation Graph of Silhouette Method [8]
1 B
Gapn (k) = ∑ log(𝑊𝑘𝑏) − log(𝑊𝑘)
𝐵 𝑏=1
7
1.5.4 Davies Bouldin Index:
It contrasts the total inside intra-cluster variance for different values of k with the values
that would be predicted under the data's null reference distribution. The ideal clusters will
be formed by the values that maximise the gap statistic.
𝑀𝑖 - centroid of cluster
𝐶𝑖, 𝑇𝑗. - the size of cluster
𝐷𝑗 -is the measure of validity of the cluster.
8
1.6.1 Flowchart of standard K-means:
9
• It requires advance knowledge of the data and a domain expert is needed to choose
the requisite value for k.
• Final result is affected more by initial selection and less by outliers.
• Reconfiguring the data yields a different result for the same data
• Only convex shaped clusters are formed
Dataset {2, 3, 4, 10, 11, 12, 20, 25, 30} and K=2
10
Sepal Length
It is a technique that comes from biological evolution and natural selection to tackle limited
and unconstrained optimization issue [10]. The genetic algorithm repeatedly adjusts a
population of individual solutions. At every step, the genetic algorithm randomly chooses
individuals from the current population to be parents which are used to create the offspring
for the next generation. Genetic algorithm may be used to tackle different optimization
challenges. Steps are as follows:
1. Initialization of Population
2. Finding Fitness value
3. Selection of chromosomes
4. Crossover
5. Mutation
1. Initialization:
In initialization step, population is defined as set of individuals. An individual is defining by a set
of variables known as Genes. In a genetic algorithm, strings are used to describe the set of genes
of an individual. To encode the genes in a chromosome binary values are used.
11
Figure 1.7 Genetic Algorithm Chromosome and Population [10]
2. Fitness Function: The fitness function calculate the ability of individual to compete
with other individuals. It present a fitness score to each individual. The probability that an
individualwill be selected for reproduction is depend on its fitness value. The formula for
fitness value depend on the type of problem .The most commonly used formula for finding
fitness value is
Smax = max(F(SINTER)/F(SINTRA))
Where Smax is the fitness function obtained by dividing the total inter cluster distance
3. Selection: The concept of selection phase is to prefer the fittest individuals and their
genes move onward to the next generation .Pairs of an individuals are selected according to
their fitnessscores. Individuals with large fitness value have more chance to be selected for
reproduction.
12
Table 1.1 Crossover operation on S1 and S3 chromosome.
S1 1 1 1 1 0 1 0 1 0 1 F=7
S3 1 1 1 0 1 1 0 1 0 1 F=7
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
Table 1.2 and 1.3 shows the crossover operation and result of chromosome S1 and S3.
Thecrossover operator is applied to produce new offspring’s with higher probability.
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
Table 1.3 and 1.4 shows the mutation of chromosome S1 and S3. In mutation the bits with
lowerprobability is flipped and increase the fitness of particular chromosome.
13
1.7.1 Flow chart of genetic algorithm:
14
Chapter 2 Literature Survey
2.1 Clustering
Clustering is an unsupervised technique that creates clusters of objects having similar
featuresusing automatic technique [1]. It is also termed as classification by statisticians. In
clustering, grouping is done based on similarity of objects .The objects with similar
properties are groupedin one cluster and with dissimilar properties into another. Similarity
is evaluated by using Euclidian distance or Manhattan distance. The main aim of clustering
is to obtain low intra- cluster similarity and high inter-cluster similarity. There are different
types of clustering techniques includes partitioning clustering, hierarchical clustering,
fuzzy clustering and density based clustering .This chapter presents literature survey of
clustering techniques for analyzingbig data and tabular comparison of these techniques is
presented.
There are distinctive kinds of partitioning clustering techniques. The most mainstream
Partioning clustering strategy is the K-means clustering presented by Macqueen in1967. In
K-means clustering every data point is grouped based on similarity. The K-means strategy
is sensitive to outliers and in some cases it frames empty clusters with large datasets. In
this we proposed another method (Hybrid K-means with Genetic algorithm) to evacuate
the downsideof k means clustering [11].
15
2.3 Unsupervised learning technique: Clustering
There is huge amount of data generated from information industry and other repositories.
This data is useless until it’s processed and extract useful knowledge from it. The
researchers analyze two elementary objectives of data mining: description and prediction.
There are different techniques used for extracting useful information from large amount of
data. Clustering is the most commonly used partitioning method for mining large datasets.
In partitioning methods K-means clustering is most efficient method butit is sensitive to
outliers and sometimes produce empty clusters with large dataset. Clustering using
optimization problem techniques: genetic algorithm, decision trees, neural network remove
the problem of partitioning methods [1].
Researchers identify two elementary objectives of data mining: description and prediction.
Prediction utilizes various existing variables in the database in order to predict the future
valuesof interest and description mainly focuses on finding various patterns that describes
the data and the subsequent presentation for individual interpretation. The relative
prominence of both description and prediction differ with respect to fundamental technique
and the application. There are several data mining techniques fulfilling these objectives:
classification mining, association rule mining and clustering using the techniques such as
genetic algorithms, decisiontree, neural networks and machine learning [2].
S.Bandyopadhyay et al. [4] designed a genetic clustering algorithm called as GKA that
classifythe pixels of satellite image. This KGA clustering algorithm is applied when number
of clustersis known a priori and crisp in nature. In this paper Genetic algorithm is used to
search clusterscenters. Floating point representation of chromosomes is used because it is
more natural and appropriate form for coding the cluster centers. The major drawback of
this paper is that the algorithm is applied only to those dataset whose k is already known.
M.Jain. et al. [7] proposed a K-means with genetic algorithm for enhancing stock prediction.
Inthis paper they enhance a stock prediction or market analysis using k means with genetic
algorithm for finding the cluster centroids. Chi square similarity is used for determining
the accuracy and the result obtained has highest accuracy then k means .The drawback of
proposedalgorithm is that they are only applied to matrix represented dataset.
C. Ordonez. [9] Presented two different approaches of the K-means clustering algorithm to
16
cluster the binary data streams. The variants used by them are incremental K-means and
scalable K-means. A proposed variant gives high quality clusters and less sensitive as
compared to k-means. The proposed incremental k-means and scalable K-means is
compared with existing k-means in terms of accuracy, confusion matrix and error rate. The
incremental and scalable k-means gives higher accuracy than existing k-means.
Mor et al. [11] proposed a genetic algorithm approach for clustering and compared the
results with k means algorithm. In this proposed approach the fitness is calculated on the
basis of intracluster and inter cluster similarity measures. The proposed algorithm has low
intra cluster distance and high inter cluster distance and also remove the drawback of local
optima. The drawback of this algorithm is that the GA algorithm not find the value of K
(number of clusters)as K is randomly chosen in this algorithm.
K.Dharmendra et al. [12] presented the efficient K-means clustering algorithm. The main
objective of K-means clustering algorithm is to divide the dataset into K number of clusters
where k is predefined or randomly selected by the analyst. The main aim of K-means
clusteringused in this paper is to minimize the within sum of square.
K.Shahroudi et al. [13] proposed an algorithm for variable selection in clustering of market
segmentation using genetic algorithm. The objective of this proposed algorithm to identify
thevariables that are optimal and remove the irrelevant variables using genetic algorithm.
Finally the result obtained have efficiently improved the outcomes based on the most
relevant techniqueof segmentation.
E.O.Hartono et al. [14] proposed an algorithm for determining a cluster centroid using
genetic algorithm. The determining an initial value of cluster centroid using genetic
algorithm providebetter results than random numbers. Fitness value is calculated using
MSE (Mean Square Error).Fitness value is 1 divided by the MSE. Lower the MSE higher
the performance. The drawback of proposed algorithm is that the evaluation of clusters is
not done and also the valueof K is not known Apriori.
P.Vats et al. [17] had discussed a comparative analysis of various clustering technique
using genetic and K-means algorithm. It uses the sample Iris dataset to perform the
17
differentclustering techniques i.e. K-means algorithm, Incremental K-means and Fuzzy C-
means. The code is implemented using matlab and Weka.In this paper they have discussed
Fuzzy C-meansgives better results as compared to K-means algorithm.
K.Kim et al. [22] proposed a recommender system using Genetic algorithm and K-means
for online shopping market. In this proposed technique the initial seed is optimized by
Genetic algorithm called GA K-means for online shopping recommender system. The
proposed algorithm results is compared with existing algorithms and it shows that
proposed algorithm improves the segmentation performance compared to existing
algorithms.
2.4 Conclusion
In this chapter literature review and comparative analysis of the recent techniques for
clustering big data is done. The various clustering techniques for analyzing big data are
compared and theirmerits and demerits are presented.
18
Chapter 3 System Design and Development
The k-means algorithm executes fast but it cannot handle non arbitrary shape, sensitive to
outliers and form empty clusters for large datasets. The hybrid Genetic K-means algorithm
can handle noise as well as non-arbitrary shape but it takes more computation time and is
19
more complex than K-Means.
• The K-means algorithm is applied with cuckoo search algorithm to remove the
shortcomings of existing K-means algorithm, but it doesn’t handle large dataset.
• In another technique k-means is modified to incremental K-means which generates better
result than k-means for numeric datasets.
3.3 Objectives
The objectives of the project are as follows:
• To study the existing K-means clustering technique in combination with Genetic
algorithm.
• To introduce an efficient clustering technique to remove the drawback of existing
clusteringalgorithm.
• To implement and validate the proposed technique on Iris dataset.
20
clustering DaviesBouldin index is used for Evaluation of clusters.
Fig 4.1 shows the procedure of proposed hybrid technique
Step 1: In the first step collection of data from UCI machine learning repository [42] for
formationof clusters.
Step 2: In Second step normalization of data is done using python IDLE as normalized data
is easyto handle.
Step 3: In the Third step Genetic algorithm for finding initial cluster centroid and after
crossoverand mutation, the result is passed to K-means for formation of clusters.
Step 4: In the last steps after formation of clusters Davies Bouldin Index is used
Evaluation ofcluster.
Fmax = max(S(DINTER)/S(DINTRA))
Where Fmax is the fitness function obtained by dividing the total inter cluster distance
21
and total intra cluster distance.
Step 3: Selection Phase
The selection of chromosomes is done based on the fitness value .The chromosomes are
selected using roulette wheel selection method or rank based selection .The chromosomes
with high fitnessvalue have high probability to be selected first.
Step 4: Crossover
Crossover is applied to chromosomes to produce new off springs. The selected
chromosomes are randomly selected to produce new off springs. Crossover is simply
reproduction of new chromosomes.
Step 5: Mutation
It is used to maintain a genetic diversity of population After Crossover Mutation is applied
which is random tweak to particular chromosome. There are various methods for mutation
includes:
• Bit Manipulation
• Random Resetting
• Swap Mutation
• Scramble Mutation
• Inversion Mutation
Fig 3.2 shows the process of Genetic algorithm. First evaluation of chromosome using
fitness function and select the chromosomes based on Roulette wheel selection .After
selection crossover is applied and lastly the result of crossover passed to mutation to
22
produce new offspring’s.
S1 1 1 1 1 0 1 0 1 0 1 F=7
S2 0 1 1 1 0 0 0 1 0 1 F=5
S3 1 1 1 0 1 1 0 1 0 1 F=7
S4 0 1 0 0 0 1 0 0 1 1 F=4
S5 1 1 1 0 1 1 1 1 0 1 F=8
S6 0 1 0 0 1 1 0 1 0 0 F=3
Table 3.1 shows the initialization of chromosomes of MaxOne problem and fitness value
of allthe chromosomes. The fitness value is calculated based on number of one’s occur in
particulariteration. After calculate the fitness value of all chromosomes the chromosomes
is arranged according to fitness value and selection is done based on roulette wheel
23
selection. The total number of one occur is 34 out of 60. Our aim is to increase the number
of one’s so we use the genetic algorithm for optimization of MaxOne problem.
S1 1 1 1 1 0 1 0 1 0 1 F=7
S3 1 1 1 0 1 1 0 1 0 1 F=7
S5 1 1 1 0 1 1 1 1 0 1 F=8
S2 0 1 1 1 0 0 0 1 0 1 F=5
S4 0 1 0 0 0 1 0 0 1 1 F=4
S6 0 1 0 0 1 1 0 1 0 0 F=3
Table 3.2 represent the arrangement of chromosomes based on fitness value. We randomly
select the chromosomes using roulette wheel selection for crossover and mutation.
2.2 Randomly select chromosomes for crossover and mutation to produce new off
springs.
24
Table 3.3 Crossover of S1 and S3 chromosome.
S1 1 1 1 1 0 1 0 1 0 1 F=7
S3 1 1 1 0 1 1 0 1 0 1 F=7
Table 3.3 represent the chromosomes S1 and S3 for crossover. The two point crossover
isperformed on chromosomes S1 and S3 to increase the probability of number of one’s.
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
Table 4.4 represent the crossover result of S1 and S3.After applying crossover on S1and
S3 thenumber of one’s increases.
Table 3.5 represent the chromosomes S2 and S4 for crossover. The two point crossover
isapplied to increase the probability of number of one’s.
S4 0 1 1 1 0 0 0 1 0 1 F=5
Table 3.6 represents the crossover result of S2 and S4.After applying crossover on S2
and S4 the number of one’s increases.
2.4 Perform crossover on S5 and S2 chromosome.
25
Table 3.7 Crossover of S5 and S6
S5 1 1 1 0 1 1 1 1 0 F=8
S6 0 1 0 0 1 1 0 0 0 F=3
Table 3.7 represent the chromosomes S5 and S6 for crossover. The two point crossover
isapplied to increase the probability of number of one’s.
S6 0 1 1 0 1 1 1 1 0 F=8
Table 3.8 represent the crossover result of S5 and S6.After applying crossover on S5 and
S6 thenumber of one’s increases.
Step 3: Perform mutation on the crossover results so that we obtain a better result.
3.1 Mutation means to change the particular bit of the chromosomes to increase the
number ofone’s.
Table 3.9 Mutation result of chromosomes
S1 1 1 1 1 0 1 0 1 0 1 F=8
S3 1 1 1 0 1 1 0 1 0 1 F=8
S5 1 1 1 0 1 1 1 1 0 1 F=5
S2 0 1 1 1 0 0 0 1 0 1 F=5
S4 0 1 0 0 0 1 0 0 1 1 F=5
S6 0 1 0 0 1 1 0 1 0 0 F=8
Conclusion: The optimization algorithm give better results when number of iteration is
more. So we have to increase the number of iteration to get better results.
26
3.7 Proposed Algorithm (Genetic K-means Algorithm)
This algorithm focus on random selection of initial seed problem of k means clustering.
Geneticalgorithm is used to select the optimum initial centroid for better results. Steps of
Genetic K- means (GAKM) is as follows:
Fig 3.3 shows the flowchart of proposed Genetic K means clustering. First initialize the
population in the form of strings and select the chromosomes based on roulette wheel
selection.After selection of chromosomes the result is passed to crossover .By applying
two point crossover to produce new offspring’s the result is passed to mutation .After
applying mutation on particular chromosome the final chromosomes is passed to K-means
clustering to perform clustering. After clustering the evaluation of clusters is done based
on Davies Bouldin index value. Higher the value of Davies Bouldin index indicates the
good clustering.
27
Steps of Proposed Algorithm:
i) Initialization:
In the initialization step of Genetic K-means clustering the parameter of the dataset is
prearranged in the strings (called chromosomes).The formation of chromosomes is the
crucial step in selecting the initial centroid. Initially K centroids are selected for K random
clusters. Initial population corresponds to Z where Z is the number of centroids that are
randomly selected from normalized dataset. Z is equal to (pop_size*K) where K is the
quantity of groups to be framed.
28
The total Intra cluster distance is:
𝑞.𝑟
Where 𝐷 is the distance between the neighboring clusters and xi and xj are the points
of the clusters m and n are the data points of the neighboring clusters.
Fmax = max(S(DINTER)/S(DINTRA))
Where Fmax is the fitness function obtained by dividing the total inter cluster distance and
total intra cluster distance.
v) Crossover Operator:
29
After the selection using rank based selection the next step is to produce off springs. The
mainlyused solution is crossover. Different types of crossover are single point crossover,
two point crossover, multiple point crossover .Single point crossover gives better result for
Integer and real datasets. Crossover operators are applied to maintain the genetic diversity.
Genetic diversity is the crucial step for the process of evolution. Crossover operator is
applied to the one of the genetic operator to maintain the genetic diversity. Different types
crossover operator are used like one point crossover, two point crossover and multiple point
crossover based on the requirement.
Formula for crossover is:
vi) Mutation:
Mutation is the genetic operator used to preserve the genetic diversity from one generation
to next generation. There were different genetic operator like bitwise operator, uniform
operator, on Uniform operator and Gaussian operator. Uniform mutation operator is used
for real and integer dataset as it gives better results. Mutation is applied so as to obtained
the better accuracyand it is applied to (Pm*pop size* u) where Pm is the probability of
population and u is the chromosome length.
30
Where 𝐴𝑖the centroid of cluster is𝐶𝑖, 𝑇𝑖 is the size of cluster and 𝑆𝑖is the measure of
validity of cluster.
Step 1: The sample Iris dataset is used for performing Genetic K-means algorithm. The
irisdataset consist of three species Setsoa, versicolor and Virginica and four dimensions
sepal length, sepal width, petal length and petal width.
Table 3.10 shows the iris dataset used for Genetic K-means clustering. The iris dataset
consist of three dimensions sepal length, sepal width and petal length. Genetic Algorithm
can be applied to iris dataset to find accuracy, precision, recall and sensitivity. Davies
31
Bouldin index is used forevaluation of clusters. The first 15 rows of iris dataset is collected
to perform genetic K-means clustering.
Table 3.11 represents the normalized iris dataset obtained by applying normalization on
unnormalized dataset. The normalized dataset give accurate results and easiness of
handling.
Step 3: After normalization of dataset:
Let assume pop size=4, k=3
Rows actually selected are X= (pop size *K) i.e. 4*3 rows are actually selected from
32
dataset. Let Y be the 12 indices returned from the dataset
Y= [4, 12, 14, 7, 9, 1, 2, 3, 5]
The rows correspond to first 3 indices represent first chromosome and the first index
represent first Centroid
Selected row
Column Number Chromosomes
indices
{0.083,0.45,0.084}
1 {4,12,14} {0.138,0.583,0.101}
{0,0.416,0.169}
{0.083,0.583,0.0677}
2 {7,9,1} {0.0277,0.375,0.0677}
{0.2222,0.625,0.0677}
{0.1666,0.41666,0.0677}
3 {2,3,5} {0.1111,0.5,0.050}
{0.19444,0.666,0.0677}
{0.1388,0.41666,0.0677}
4 {13,11,15} {0.3055,0.7083,0.084}
{0.41666,0.8333,0.0338}
Table 3.12 shows the selected row indices and chromosomes of the dataset. In column No.
1where, Ist Centroid (C1): 0.083, 0.45, 0.084 represent Ist cluster. 2nd Centroid (C2): 0.138,
0.583, 0.101 represent 2nd cluster and 3rd Centroid (C3): 0, 0.0416, and 0.169 represents 3rd
cluster.
Step 5: Calculate the Euclidean distance of point 1 {0.2222, 0.625, and 0.0677} from
dataset to cluster 1{0.083, 0.45, and 0.084}
33
Distance from cluster 1 to object 1=0.0163
Distance from cluster 2 to object 1 =1.50
Distance from cluster 3 to object 1=1.66
Table 3.13 represents the calculated distance obtained from cluster 1, cluster 2 and cluster
3 andalso the assignment of clusters.
Step 6: After calculating all the distance of clusters assignment of cluster of is shown in
table
1 1 1 1 1 1 1 1 1 1 3 2 3 3 2
34
• Generation of off springs
Offspring 1= (α *parent 1) + ((1- α)*parent2)
Offspring 1:0.02055, 0.1530, 0.1770, 0.1754, 0.1694, 0.20000, 0.01374, 0.0131, 0.0230
Offspring 2:0.02877, 0.2235, 0.2655, 0.2493, 0.2306, 0.2867, 0.0137, 0.0212, 0.0212
The offspring 1 and offspring 2 are obtained by applying crossover on selected row
indices 2and 3.
Parent 3 is selected from selected from selected row indices and chromosome table to
maintain the genetic diversity. Mutation is simply the random tweak to a particular
chromosome. It is applied to 5th chromosome of parent 3.After apply mutation on 5th
chromosome the value of chromosome changed from 0.9726 to 0.7982
35
Chapter 4 Experiments and Result Analysis
This chapter of project report will focus on implementation of the proposed hybrid
algorithm Genetic K-means using python and MATLAB.
To implement the proposed approach iris dataset is used. The dataset is collected from UCI
Machine Learning Repository [42]. The proposed approach is implemented by Python
using google collab.
The Iris dataset is a multi-dimensional dataset consisting of three classes (setsoa, versicolor
and virginica) and four dimensions (sepal length, petal length, sepal width and petal
width).This dataset consists of integer data points. K-means clustering and Genetic K-
means are implemented by using this dataset to find optimal clusters. Confusion matrix,
Accuracy, Recall and Precision are calculated.Davies Bouldin index is used for evaluation
of each cluster. The snapshots below show the implementation of K-means and Genetic K-
means on this dataset and also the calculated value of Davies Bouldin Index for evaluation
of each cluster. The Snapshots for code and results of K- means clustering is shown in
figure 4.1 and figure 4.2.
36
I. Code of K-means Clustering:
Fig 4.1 Shows the code of K-means clustering on Iris dataset. It shows implementation of
K-means clustering and uses the iris data and value of K used is 3. This code predicts the
accuracy of K- means. The predicted clusters are compared with actual class clusters and
accuracy of iris dataset is evaluated .The accuracy obtained from K-means clustering is 45
%.
37
Centroid Update Results:
Update Centroid as given in fig 5.1 generates a new centroid for cluster 1
The experiments performed on these algorithms are on the basis of many parameters that is
confusion matrix, Davies Bouldin index value ,accuracy , Recall, precision, Sensitivity and
Specificity.
38
4.2.1. Confusion Matrix of K-means Clustering
Both the algorithms that is K-Means and Genetic K-means algorithm are tested on the iris
datasets to calculate accuracy. Figure 4.3 shows the comparison of the confusion matrix
obtained by each algorithm on Iris dataset.
Predicted Class
0 1 2
0 50 0 0
Actual Class 1 0 10 34
2 0 36 20
Fig 4.3 shows the Confusion matrix obtained from K-means clustering.Accuracy Recall,
Precision, Sensitivity and Specificity are calculated by comparing the actual and predicted
results.
39
4.2.2 Confusion Matrix of Genetic K-means Clustering
Both the two algorithms that is K-Means and Genetic K-means algorithm are tested on the
iris datasets to calculate accuracy. Fig. 4.4 shows the comparison of the confusion matrix
obtained by each algorithm on Iris dataset.
Predicted Class
0 1 2
0 50 0 0
Actual Class 1 0 48 2
2 0 14 36
Fig 4.4 shows the Confusion matrix obtained from Genetic K-means clustering.Accuracy
Recall, Precision, Sensitivity and Specificity are calculated by comparing the actual and
predicted results.
• Precision:.TP/Predicted 1=48/62=77%
40
4.2.3 Test for Performance of Accuracy
Both the algorithms that is K-Means algorithm and proposed approach are tested on
iris dataset to calculate the accuracy performance. Accuracy evaluation is needed to
maintain the quality of clusters.
The formula for calculating accuracy
Where accuracy is obtained by dividing the sum of TP (True Positive) and TN (True
Negative) by thetotal sum. i.e. TP (True Positive), TN(True Negative),FP(False Positive)
and FN(False Negative).
The accuracy obtained from K-means and Genetic K-means shows that the proposed
algorithm givesbetter accuracy than K-means clustering .The accuracy obtained from all
the four datasets is using K-means and Genetic K-means is shown is table 4.1
Table 4.1 Shows the table of accuracy obtained from K-means and Genetic K-means
Algorithm. The table shows that Genetic algorithm gives higher accuracy thanK-means
algorithm.
41
4.2.4 Calculation of Intra cluster distance using K means and proposed algorithm.
Intra cluster distance is the distance between data points in the same cluster. The objective
of k-meansand proposed algorithm is to minimize the intra cluster distance. The formula for
finding intra clusterdistance is
Where m is the no of elements, 𝑥𝑖 and 𝑥𝑗 are data points of clusters and 𝐷𝑞 is the distance
between data points in same cluster.
The Intra cluster distance obtained from k means and proposed algorithm is shown in table
4.2 and table 4.3.
Table 4.2: Intra cluster distance using K-means algorithm
1 5.927
2 1.0245
3 1.086
Table 4.2 shows the table of intra-cluster distance using K-means algorithm.The intra-
cluster distanceis obtained by calculating the distance between data points in the same
cluster. The distance obtained by K-means algorithm is more than Genetic Algorithm.
2 0.0284
3 0.0861
42
Table 4.3 shows the table of intra-cluster distance using proposed algorithm. The intra-
cluster distance is obtained by calculating the distance between data points in the same
cluster. The distance obtained by proposed algorithm is less than K-means algorithm.
4.2.5 Calculation of Inter cluster distance using K means and proposed algorithm.
Inter cluster distance is the distance between data points of corresponding cluster. The main
objectiveof K-means and proposed algorithm is to maximize the inter cluster distance. The
inter cluster distanceobtained from k means and proposed algorithm is shown in table.
Formula for finding Inter cluster distance:
Where m, n are the number of elements in the 𝑞𝑡ℎand 𝑟𝑡ℎ cluster, 𝑥𝑖 and 𝑥𝑗 are the
elements inclusters and 𝐷𝑞.𝑟 is the inter cluster distance between data points.
INTER
1 1.3585
2 1.3203
3 0.3323
Table 4.4 shows the table of Inter-Cluster distance using K-means Algorithm. The table
shows the value of distance between data points of one cluster to another cluster. The
calculated value is lesser than proposed algorithm.
43
Table 4.5: Inter Cluster distance using proposed algorithm
1 1.585
2 1.4505
3 0.5823
Table 4.5 shows the table of Inter-Cluster distance using proposed Algorithm. The table
shows the value of distance between data points of one cluster to another cluster. The
calculated value is higher than the proposed algorithm.
44
Chapter 5 Conclusion and Future Scope
This chapter discusses the conclusion of the work done in the project and ends with a clear
vision of future direction which can be taken further.
5.1 Conclusion
This project introduces big data and provides background of various clustering techniques
used to analyzebig data. In this work comparative analysis of these techniques is done. A
hybrid approach based on Genetic K-Means is effective grouping of enormous information
is proposed. This approach is developed in python. The experimental results have been
gathered which shows that the proposed approach is more accurate as compared to K-
Means when tested on dataset.
5.2 Limitations
The proposed approach still has some of the following limitations.
• The proposed approach still requires the value of K, but after finding the Davies
Bouldin index valuewe will find the number of initial or desired clusters as input,
though data points has been distributed.
• The proposed approach can be applied only for those data sets which have numerical
values or attributes
• The proposed approach shows that initial value of clusters is needed as input, in future
the approach can be enhanced by finding the optimal no of cluster using best technique
so that automatic number of desired clusters is formed.
• In future, this approach can be applied for a particular real time application area by
addressing issuesinvolved and can be applied for data sets with categorical attributes.
45
vvvReferences
46
Applications 1.2 (2010): 2328.
[12] K.Shahroudi and S.Biabani "Variable selection in clustering for market segmentation
using genetic algorithms." Interdisciplinary Journal of Contemporary Research in
Business 3.6 (2011): 333-341.
[13] E.O.Hartono and D. Abdullah. "Determining a Cluster Centroid of K-Means Clustering
Using Genetic Algorithm." International Journal of Computer Science and Software
Engineering (IJCSSE) 4.6 (2015).
[14] D.X.Chang, XD. Zhang, and CW. Zheng. "A genetic algorithm with gene rearrangement
for K-meansclustering." Pattern Recognition 42.7 (2009): 1210-1222.
[15] R.Lletı, M.C.Ortiz, L.A.Sarabia, & M.S.Sánchez, (2004). “Selecting variables for k-
means cluster analysis by using a genetic algorithm that optimises the silhouettes”.
Analytica Chimica Acta, 515(1), 87-100.
[16] P.Vats, M. Mandot and A. Gosain, “A Comparative Analysis of Various Cluster Detection
Techniques for Data Mining,” Electronic Systems, Signal Processing and Computing
Technologies (ICESC), 2014International Conference on. IEEE, 2014.
[17] I.B. Saida, K. Nadjet and B. Omar “A new algorithm for data clustering based on cuckoo
search optimization,” Genetic and Evolutionary Computing, pp. 55-64. Springer, 2014.
[18] A.K. Jain, MN. Murty and P.J. Flynn, “Data clustering: a review,” ACM computing
surveys (CSUR), Vol. 31, No.3, pp. 264-323, 1999.
[19] Lu, Yi, et al. "FGKA: A fast genetic k-means clustering algorithm." Proceedings of the
2004 ACMsymposium on applied computing. ACM, 2004.
[20] A.Likas, N.Vlassis and J.J. Verbeek. "The global k-means clustering algorithm."
Pattern recognition 36.2 (2003): 451-461.
[21] K. Kim and H. Ahn. "A recommender system using GA K-means clustering in an online
shopping market." Expert systems with applications 34.2 (2008): 1200-1209.
[22] G.P. Babu and M.N. Murty. "A near-optimal initial seed value selection in k-means means
algorithm using a genetic algorithm." Pattern Recognition Letters 14.10 (1993): 763-769.
[23] D.K. Roy and L. K. Sharma. "Genetic k-Means clustering algorithm for mixed numeric
and categoricaldata sets." International Journal of Artificial Intelligence & Applications
1.2 (2010): 23-28.
[24] M.E. Celebi, H.A. Kingravi, and P. A. Vela. "A comparative study of efficient
initialization methods for the k-means clustering algorithm." Expert systems with
applications40.1 (2013): 200-210.
47