Title: K-Means Clustering Algorithm Implementation: Department of Computer Science and Engineering
Title: K-Means Clustering Algorithm Implementation: Department of Computer Science and Engineering
Title: K-Means Clustering Algorithm Implementation: Department of Computer Science and Engineering
2 Problem analysis
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters. Here K defines the number of pre-defined clusters that need to be created in the process, as if K=2,
there will be two clusters, and for K=3, there will be three clusters, and so on. It allows us to cluster the data
into different groups and a convenient way to discover the categories of groups in the unlabeled dataset on its
own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and
repeats the process until it does not find the best clusters. The value of k should be predetermined in this
algorithm.
3 Algorithm
Step 1: Select the number K to decide the number of clusters.
Step 2: Select random K points or centroids. (It can be other from the input dataset).
Step 3: Assign each data point to their closest centroid, which will form the predefined K clusters.
Step 4: Calculate the variance and place a new centroid of each cluster.
Step 5: Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster.
Step 6: If any reassignment occurs, then go to step-4 else go to FINISH.
Figure 2
As we need to find the closest cluster, so we will repeat the process by choosing a new centroid. To choose
the new centroids, we will compute the center of gravity of these centroids, and will find new centroids as in
figure 3a
Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like as in figure 3b.
From the figure 3b, we can see, one yellow point is on the left side of the line, and two blue points are right
to the line. So in figure 3c, these three points will be assigned to new centroids.
Figure 3
As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-points
(figure 4a)
We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as shown
in the figure 4b:
As we got the new centroids so again will draw the median line and reassign the data points. So, form figure
4c, we can see there are no dissimilar data points on either side of the line, which means our model is formed.
(c)
(a) (b)
Figure 4
4 Implementation in Java
1 package k_means;
2
3 import java.util.ArrayList;
4 import java.util.Random;
5 import java.util.Scanner;
6
7 /**
8 *
9 * @author Jargis Ahmed
10 */
11 class points {
12
13 int x, y, cl;
14
15 points(int x, int y) { // points on the grid (x,y)
16 this.x = x;
17 this.y = y;
18 }
19
20 void setClass(int c) { // cluster number in which this (x,y) point reside
21 cl = c;
22 }
23 }
24
25 class St {
26
27 int pt, ks; // pt is the number of points and ks is the number of
clusters
28 int mat[][];
29 points[] p, k;
30 ArrayList<points[]> fkc;
31 Random rand = new Random();
32
33 St(int points, int clusters) {
34 pt = points;
35 ks = clusters;
36 mat = new int[pt][pt]; // mat is the 2d cartesian grid or matrix
37 fkc = new ArrayList<points[]>();
38 Start();
39 }
40
41 void Start() {
42 p = new points[pt]; // create random points coordinate
43 for (int i = 0; i < pt; i++) {
44 int x = rand.nextInt(pt);
45 int y = rand.nextInt(pt);
46 p[i] = new points(x, y);
47 }
48 k = new points[ks]; // create random cluster points coordinate
49 for (int i = 0; i < ks; i++) {
50 int x = rand.nextInt(pt);
51 int y = rand.nextInt(pt);
52 k[i] = new points(x, y);
53 }
54 int i, j;
55 double min;
56 int count = 0;
57 while (true) {
58 // starting cluster allocation of point p(x,y) based on minimum
distance from each cluster points
59 for (i = 0; i < pt; i++) {
60 for (j = 0, min = 10000; j < ks; j++) {
61 double d1 = Math.sqrt(Math.pow((double) (k[j].x − p[i].x),
2) + Math.pow((double) (k[j].y − p[i].y), 2));
62 if (d1 < min) {
63 p[i].setClass(j);
64 min = d1;
65 }
66 }
67 }
68 points[] kdup = new points[ks]; // creating duplicate set of
cluster points to store starting cluster coordinates
69 for (i = 0; i < ks; i++) {
70 kdup[i] = new points(k[i].x, k[i].y);
71 }
72 // calculating mean for each points in different clusters
73 for (j = 0; j < ks; j++) {
74 int x = 0, y = 0, ci = 0;
75 for (i = 0; i < pt; i++) {
76 if (p[i].cl == j) {
77 x += p[i].x;
78 y += p[i].y;
79 ci++;
80 }
81 }
82 if (ci != 0) { // allocating the mean as new cluster point
coordinate
83 k[j].x = x / ci;
84 k[j].y = y / ci;
85 }
86 }
87 int err = 0;
88 // claculating error between previous and present cluster points
coordinates
89 for (i = 0; i < ks; i++) {
90 err += Math.abs(k[i].x − kdup[i].x) + Math.abs(k[i].y − kdup[i].
y);
91 }
92 count++;
93 // 0 error between previous and present cluster points coordinates
indicates clustring is finalized
94 if (err == 0) {
95 break;
96 }
97 }
98
99 double IntraDis;
100 for (i = 0; i < ks; i++) {
101 for (j = 0, IntraDis = 0; j < pt; j++) {
102 if (p[j].cl == i) {
103 IntraDis += Math.sqrt(Math.pow((double) (k[i].x − p[j].x),
2) + Math.pow((double) (k[i].y − p[j].y), 2));
104 }
105 }
106 System.out.println("Cluster " + (i + 1) + " Intra−distance = " +
IntraDis);
107 }
108 for (i = 0; i < pt; i++) {
109 mat[p[i].x][p[i].y] = p[i].cl + 1;
110 }
111 for (i = 0; i < ks; i++) {
112 for (j = 0; j < p.length; j++) {
113 if (p[j].cl == i) {
114 System.out.println("Point (" + p[j].x + ", " + p[j].y + ") "
+ "Cluster − " + (+p[j].cl + 1));
115 }
116 }
117
118 }
119 for (i = 0; i < k.length; i++) {
120 System.out.println("Cluster " + (i + 1) + " − (" + k[i].x + ", " + k
[i].y + ")");
121 }
122 //printGrapgh(); // print a 2D graph alike matrix showing the points
and
123 }
124 }
125
126 public class K_Means {
127
128 public static void main(String[] args) {
129 // TODO code application logic here
130 Scanner sc = new Scanner(System.in);
131 System.out.println("Insert number of points");
132 int points = sc.nextInt();
133 System.out.println("Insert number of clusters");
134 int clusters = sc.nextInt();
135 St s = new St(points, clusters);
136 }
137 }
Output:
1 Cluster 1 Intra−distance = 42.004763409818885
2 Cluster 2 Intra−distance = 16.82842712474619
3 Cluster 3 Intra−distance = 10.537319187990756
4 Cluster 4 Intra−distance = 9.930106595800748
5 Point (8, 6) Cluster − 1
6 Point (12, 10) Cluster − 1
7 Point (13, 10) Cluster − 1
8 Point (17, 4) Cluster − 1
9 Point (6, 9) Cluster − 1
10 Point (16, 1) Cluster − 1
11 Point (10, 11) Cluster − 1
12 Point (16, 6) Cluster − 1
13 Point (12, 4) Cluster − 1
14 Point (8, 0) Cluster − 2
15 Point (0, 0) Cluster − 2
16 Point (2, 2) Cluster − 2
17 Point (4, 0) Cluster − 2
18 Point (10, 0) Cluster − 2
19 Point (15, 14) Cluster − 3
20 Point (9, 15) Cluster − 3
21 Point (18, 16) Cluster − 3
22 Point (5, 16) Cluster − 4
23 Point (1, 12) Cluster − 4
24 Point (0, 18) Cluster − 4
25 Cluster 1 − (12, 6)
26 Cluster 2 − (4, 0)
27 Cluster 3 − (14, 15)
28 Cluster 4 − (2, 15)
8 Policy
Copying from internet, classmate, seniors, or from any other source is strongly prohibited. 100% marks will be
deducted if any such copying is detected for lab exercise.