K-Means Clustering Algorithm
K-Means is an iterative unsupervised learning algorithm that divides an unlabeled dataset into k different
clusters in such a way that each data point belongs to only one group of points with similar properties. The main
aim of the algorithm is to minimize the sum of distances between the data points and their
corresponding cluster centroids.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
o Let's take K=2, to group the dataset into two different clusters.
o Now, we have to choose k random points, or centroids, to form the clusters. These points
can be either points from the dataset or any other points. Here we select two points as
centroids that are not part of our dataset.
o We will assign each data point of the scatter plot to its closest centroid. For this,
we have to calculate the distance between each point and the two centroids, which amounts to
drawing the perpendicular bisector (median line) between the two centroids.
From the resulting plot, it is clear that the points on the left side of the line are nearer to the
K1 (blue) centroid, and the points on the right of the line are closer to the yellow
centroid.
o As we need to find the closest cluster, we repeat the process by choosing new
centroids. To choose the new centroids, we compute the center of gravity (mean) of the points
currently assigned to each centroid, and obtain new centroids as below:
o Next, we reassign each data point to its new nearest centroid. For this, we repeat the same
process of finding the median line between the centroids.
From the resulting plot, we can see that one yellow point now falls on the left side of the line, and two
blue points fall on the right side. So, these three points are reassigned accordingly.
o We repeat the process by again finding the center of gravity of each cluster, so the new centroids
will be as shown below:
o As we have new centroids, we again draw the median line and reassign the data points.
o This time no data point switches sides of the line, which means the assignments no longer change
and our model is formed.
As our model is ready, we can now remove the assumed centroids, and the two final
clusters are as shown below:
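The walkthrough above can be condensed into a short script. The sketch below is a minimal NumPy implementation of the assign/update loop for K=2 on two variables M1 and M2; the sample data and the initial centroids are made up purely for illustration.

import numpy as np

# toy data for two variables M1 and M2 (made-up values, for illustration only)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0],
              [1.0, 0.6], [9.0, 11.0], [8.0, 2.0], [10.0, 2.0], [9.0, 3.0]])

k = 2
# step 1: pick k initial centroids (here simply the first k points; random points also work)
centroids = X[:k].copy()

for _ in range(100):
    # step 2: assign each point to its closest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 3: move each centroid to the center of gravity (mean) of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # step 4: stop when the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
print(centroids)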
Deciding the K Value
The performance of the K-means clustering algorithm depends on how compact and well-separated the clusters it
forms are. Choosing the optimal number of clusters, however, is not trivial. There are several ways to
find the optimal number of clusters; the most common one is described below.
Elbow Method
This method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares,
which measures the total variation within the clusters. The formula to calculate the value of WCSS (for 3
clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point in Cluster1 and
its centroid C1 (and similarly for the other two terms).
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the steps below:
o It runs K-means clustering on the given dataset for different values of K (for example, K ranging
from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best
value of K.
Since the graph shows a sharp bend that looks like an elbow, the technique is known as the elbow
method.
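As a rough sketch of this procedure, using scikit-learn's KMeans (whose inertia_ attribute is exactly the WCSS value) on made-up blob data standing in for a real feature matrix X:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data standing in for a real feature matrix X
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    # inertia_ is the within-cluster sum of squares (WCSS) for this value of K
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

# look for the "elbow" in this curve to pick K
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()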
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import KMeans

df = pd.read_excel(r'D:\ML & Python\Book3.xlsx')
print(df.head())
[5 rows x 14 columns]
There are many ways to handle non-numerical data. First, we want to cycle through the
columns in the Pandas dataframe. For columns that are not numeric, we want to find their unique
elements. This can be done by simply taking a set of the column values. From there, the index of each
element within that set can serve as the new "numerical" value or "id" of the text data.
df.drop(['body', 'name'], axis=1, inplace=True)
df.fillna(0, inplace=True)
print(df.head())
[5 rows x 12 columns]
def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        # only convert columns that are not already numeric
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            # map each unique text value to an integer "id"
            unique_elements = set(df[column].values.tolist())
            for x, unique in enumerate(unique_elements):
                text_digit_vals[unique] = x
            df[column] = list(map(convert_to_int, df[column]))
    return df

df = handle_non_numerical_data(df)
print(df.head())
X = np.array(df.drop(['survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])

clf = KMeans(n_clusters=2)
clf.fit(X)

correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = clf.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X))
0.7081741787624141
(From the dataset's data dictionary: body = Body Identification Number; home.dest = Home/Destination.)
Hierarchical Clustering: Linkage Methods
The distance between two clusters is crucial for hierarchical clustering. There are
various ways to calculate the distance between two clusters, and the chosen way decides the rule for
merging clusters. These measures are called linkage methods.
1. Single Linkage: the shortest distance between the closest points of the two clusters.
2. Complete Linkage: the farthest distance between two points of two different
clusters. It is one of the popular linkage methods, as it forms tighter clusters than single
linkage.
3. Average Linkage: the linkage method in which the distances between all pairs of
points from the two clusters are added up and then divided by the total number of pairs to calculate the
average distance between the two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: the linkage method in which the distance between the centroids of the
clusters is calculated.
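As a brief illustration, scikit-learn's AgglomerativeClustering exposes most of these options through its linkage parameter; the snippet below is a minimal sketch on made-up 2-D data:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# made-up 2-D points, for illustration only
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# linkage can be 'single', 'complete', 'average' or 'ward'; centroid linkage
# is available in scipy.cluster.hierarchy.linkage(method='centroid') instead
for linkage in ('single', 'complete', 'average'):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(X)
    print(linkage, model.labels_)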
Mean-Shift Clustering
The difference between the K-Means algorithm and Mean-Shift is that the latter does not need the
number of clusters to be specified in advance, because the number of clusters is determined by the
algorithm from the data.
Mean-Shift is based on Kernel Density Estimation (KDE), which is a way to estimate the probability density
function of a random variable. KDE is a data-smoothing technique in which inferences about the population are
made from the data. It works by assigning a weight to each data point; the weight function is called a
kernel. There are many kinds of kernels: the Gaussian kernel, the rectangular kernel, the flat kernel, etc. Adding
all of these kernels together creates a density function (a probability surface).
Plotting this KDE surface over the region where the data points lie gives a surface plot whose hills can be
thought of as the kernels.
We can make the points climb uphill to the nearest peak on the KDE surface, iteratively shifting
each point until it reaches a peak. The bandwidth parameter used to build the KDE surface controls the
size of the kernels: a tall, skinny kernel corresponds to a small kernel
bandwidth, while a short, fat kernel corresponds to a large kernel
bandwidth. A small kernel bandwidth makes the KDE surface form a peak for almost every data point
(in the extreme, each point becomes its own cluster); a large kernel bandwidth, on the other hand, results in
fewer peaks and therefore fewer clusters.
Step 1 – Define a window (the bandwidth of the kernel to be used for estimation) and place the window
on a data point.
Step 2 – Find the mean of all the points within the window.
Step 3 – Move the window to the location of the mean.
Step 4 – Repeat steps 2-3 until convergence.
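A minimal sketch of Mean-Shift in scikit-learn, where estimate_bandwidth picks the window size automatically and the number of clusters falls out of the algorithm (the blob data is made up for illustration):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# toy data with 3 blobs; Mean-Shift is not told how many clusters there are
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# the bandwidth is the size of the window (kernel) used for the KDE surface
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth)
ms.fit(X)

print("Estimated number of clusters:", len(ms.cluster_centers_))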
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
where,
• P(A|B) is the posterior probability: the probability of event A given that event B is true
(has occurred). Event B is also termed the evidence.
• P(B|A) is the likelihood probability: the probability of B given event A, i.e. the probability
of the evidence B given that the hypothesis A is true.
• P(A) is the prior probability: the probability of the hypothesis before the evidence is seen.
• P(B) is the marginal probability: the probability of the evidence.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset we need to decide whether we should play or not on a particular day according to
the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Now, use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table for the weather conditions:
Weather No Yes
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
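The same toy problem can be reproduced in code. The sketch below uses scikit-learn's CategoricalNB on the weather dataset above, with the Outlook values encoded as integers; alpha is set close to zero to switch off Laplace smoothing so the result matches the hand calculation:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Outlook encoded as 0 = Rainy, 1 = Sunny, 2 = Overcast (same 14 rows as the table above)
outlook = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1, 1, 0, 2, 2]).reshape(-1, 1)
# Play encoded as 1 = Yes, 0 = No
play = np.array([1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

# alpha close to 0 disables smoothing so the result matches the manual calculation
clf = CategoricalNB(alpha=1e-10)
clf.fit(outlook, play)

# probability of [No, Yes] given that the weather is Sunny
print(clf.predict_proba([[1]]))   # approximately [[0.40, 0.60]]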
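The next snippet applies Naive Bayes with scikit-learn, but only the fitting and evaluation lines appear in the notes, so a setup step is sketched here first. Judging from the output further down (150 samples in three balanced classes of 50), the dataset is assumed to be the Iris data and the model a Gaussian Naive Bayes classifier; both are assumptions.

from sklearn import datasets, metrics
from sklearn.naive_bayes import GaussianNB

# assumption: the Iris dataset (150 samples, 3 classes) and a Gaussian Naive Bayes model
dataset = datasets.load_iris()
model = GaussianNB()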
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Output:
              precision    recall  f1-score   support

    accuracy                           0.96       150
   macro avg       0.96      0.96      0.96       150
weighted avg       0.96      0.96      0.96       150
[[50 0 0]
[ 0 47 3]
[ 0 3 47]]
Text analysis is a major application field for machine learning algorithms. However, the raw data, a
sequence of symbols (i.e. strings), cannot be fed directly to the algorithms themselves, as most of them
expect numerical feature vectors of a fixed size rather than raw text documents of variable
length.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem.
The dataset is divided into two parts, namely the feature matrix and the response/target vector.
• The feature matrix (X) contains all the vectors (rows) of the dataset, in which each vector consists
of the values of the dependent features. The number of features is d, i.e. X = (x1, x2, ..., xd).
• The response/target vector (y) contains the value of the class/group variable for each row of the
feature matrix.
The Naive Bayes Model
Given a data matrix X and a target vector y, we state our problem as:
P(y|X) = P(X|y) * P(y) / P(X)
where y is the class variable and X is a dependent feature vector with dimension d, i.e. X = (x1, x2, ..., xd),
where d is the number of variables/features of the sample.
• P(y|X) is the probability of observing the class y given the sample X.
Now the "naive" conditional independence assumption comes into play: we assume that all features
in X are mutually independent, conditional on the category y, so that:
P(y|X) = P(x1|y) * P(x2|y) * ... * P(xd|y) * P(y) / P(X)
Finally, to classify a given sample we evaluate this for all possible values of the class variable y and
pick the class with maximum probability:
y = argmax over y of P(y) * P(x1|y) * P(x2|y) * ... * P(xd|y)
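The next two print statements assume a CountVectorizer that has already been fitted on a small corpus. A minimal sketch that produces vectorizer and X is shown here; the four example sentences are an assumption, reconstructed so that the vocabulary and counts match the output below.

from sklearn.feature_extraction.text import CountVectorizer

# assumed example corpus (chosen to match the vocabulary and counts printed below)
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
# X is the document-term matrix of token counts
X = vectorizer.fit_transform(corpus)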
print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
Implementation in Python
Here we consider a multi-class (20 classes) text classification problem.
First, we will load all the necessary libraries:
import numpy as np, pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score
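The data-loading step itself is not shown in the notes; below is a minimal sketch that produces the output that follows. The variable name train_data is an assumption, while test_data matches the code used later.

# load the training and test splits of the 20 newsgroups dataset
train_data = fetch_20newsgroups(subset='train')
test_data = fetch_20newsgroups(subset='test')

print("We have {} unique classes".format(len(set(train_data.target))))
print("We have {} training samples".format(len(train_data.data)))
print("We have {} test samples".format(len(test_data.data)))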
Output:
We have 20 unique classes
We have 11314 training samples
We have 7532 test samples
So, this is a 20-class text classification problem with n_train = 11314 training samples (text
sentences) and n_test = 7532 test samples (text sentences).
The next step consists of building the Naive Bayes classifier and finally training the model. In our
example, we convert the collection of text documents (train and test sets) into a matrix of TF-IDF
features.
To implement that text transformation we use the make_pipeline function. This internally
transforms the text data, and then the model is fitted on the transformed data.
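A minimal sketch of this step, using the imports above; the variable names model and predicted_categories are chosen to match the confusion-matrix code further down, and train_data/test_data are the splits assumed earlier.

# build the pipeline: TF-IDF features followed by a Multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# train the model using the (internally transformed) training data
model.fit(train_data.data, train_data.target)

# predict the labels of the test set
predicted_categories = model.predict(test_data.data)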
The last line of code predicts the labels of the test set.
Finally, let’s build the multi-class confusion matrix to see if the model is good or if the model predicts
correctly only specific text categories.
# plot the confusion matrix
mat = confusion_matrix(test_data.target, predicted_categories)
print("The accuracy is {}".format(accuracy_score(test_data.target, predicted_categories)))
The accuracy is 0.7738980350504514
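The plot of the confusion matrix itself is not included in the notes; a minimal sketch of how mat could be visualized as a heatmap, assuming seaborn and matplotlib are available and reusing the train_data variable assumed earlier for the class names:

import matplotlib.pyplot as plt
import seaborn as sns

# heatmap of the 20x20 confusion matrix, labelled with the newsgroup names
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names)
plt.xlabel('true labels')
plt.ylabel('predicted labels')
plt.show()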
From the above confusion matrix, we can verify that the model is really good.
• It is able to correctly predict most documents in all 20 classes of the text data (most values lie on
the diagonal, and only a few are off the diagonal).
• We also notice that the highest misclassification count (off-diagonal value) is 131. The value 131
means that 131 documents that belonged to the "religion miscellaneous" category were
misclassified as belonging to the "religion christian" category.