Assignment 5


Assignment 5 (CLUSTERING AND CLASSIFICATION)

1. Explain the k-means clustering algorithm and give its advantages and disadvantages.

Ans):
K-means clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning and data science. This answer explains what the algorithm is, how it works, and
where it performs well or poorly.

It groups an unlabelled dataset into clusters and is a convenient way to discover the categories
present in the data on its own, without the need for any labelled training data.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of squared distances between each data point and the centroid of its
cluster.

The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and
repeats the process until the cluster assignments stop changing. The value of k must be chosen
before the algorithm is run.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best positions for the k centre points (centroids) through an iterative process.
o Assigns each data point to its closest centroid. The data points nearest to a particular
centroid form a cluster.

Hence each cluster contains data points with some commonalities and is well separated from the other clusters.
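
The two tasks above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the tuple-based input format, and the sample points are assumptions made for this example, and the initial centroids are supplied by the caller.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Minimal k-means sketch on a list of equal-length tuples.

    `centroids` holds the k initial centre points (an input format
    assumed for this example).
    """
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so the run has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment

# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Stopping when the assignments no longer change is one common convergence test; stopping when the centroids move less than a tolerance is another.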

Advantages of k-means:

• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence (to a local optimum, which is not necessarily the global one).
• Can warm-start the positions of centroids.
• Easily adapts to new examples.
• Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.

Disadvantages of k-means:

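• Choosing k manually.

Use a “Loss vs. Clusters” plot to find the optimal k: plot the loss against the number of clusters and
pick the k beyond which the loss stops dropping sharply.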
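
The values behind a “Loss vs. Clusters” plot can be computed as in the sketch below. The helper names (`kmeans_1d`, `loss`, `spaced_init`), the sample data, and the evenly spaced initialisation are assumptions chosen to keep the example deterministic; plotting is left out so that the code needs only the standard library.

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Plain 1-D k-means; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        new_centroids = [
            sum(c) / len(c) if c else centroids[j]  # empty cluster keeps its centroid
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def loss(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum((x - m) ** 2 for m, c in zip(centroids, clusters) for x in c)

def spaced_init(data, k):
    """Pick k starting centroids spread evenly over the sorted data (assumed heuristic)."""
    ordered = sorted(data)
    n = len(ordered)
    return [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

# Hypothetical data with three visible groups.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]
losses = [loss(*kmeans_1d(data, spaced_init(data, k))) for k in range(1, 6)]
for k, value in enumerate(losses, start=1):
    print(k, round(value, 2))
```

On this sample the loss falls steeply up to k = 3 and only slightly afterwards, which is the shape the plot is meant to reveal.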

• Being dependent on initial values.

For a low k, you can mitigate this dependence by running k-means several times with different initial
values and picking the best result. As k increases, you need advanced versions of k-means to pick
better values of the initial centroids (called k-means seeding).

• Clustering data of varying sizes and density.

k-means has trouble clustering data where clusters are of varying sizes and density. To cluster such
data, you need to generalize k-means (see the last point under Advantages).
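
The idea of running k-means several times with different initial values and keeping the best result can be sketched as follows. The function names, the sample data, the number of restarts, and the choice to draw starting centroids at random from the data are all assumptions for this illustration; seeding schemes are not shown.

```python
import random

def kmeans_1d(data, centroids, max_iter=100):
    """Plain 1-D k-means; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        new_centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def loss(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum((x - m) ** 2 for m, c in zip(centroids, clusters) for x in c)

def best_of_restarts(data, k, n_init=20, seed=0):
    """Run k-means n_init times from random starts and keep the lowest loss."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_init):
        init = rng.sample(data, k)  # k distinct data points as starting centroids
        centroids, clusters = kmeans_1d(data, init)
        runs.append((loss(centroids, clusters), centroids))
    best_loss, best_centroids = min(runs, key=lambda run: run[0])
    return best_loss, best_centroids, [run[0] for run in runs]

data = [1, 2, 3, 11, 12, 13, 21, 22, 23]  # hypothetical sample
best_loss, best_centroids, all_losses = best_of_restarts(data, k=3)
print(best_loss, sorted(best_centroids))
print(sorted(set(round(v, 2) for v in all_losses)))  # distinct outcomes seen
```

Printing the distinct losses shows whether different starts ended in different local optima, which is the dependence this bullet describes.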

• Clustering outliers.

Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored.
Consider removing or clipping outliers before clustering.
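
Clipping outliers before clustering could look like the sketch below. The 1.5 × IQR fences are a common rule of thumb and an assumption here, as are the function name `clip_outliers` and the sample values; the source does not prescribe a particular rule.

```python
import statistics

def clip_outliers(data, factor=1.5):
    """Clip values to [Q1 - factor*IQR, Q3 + factor*IQR] (assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(x, low), high) for x in data]

# Hypothetical sample: 500 is far away from every other value.
data = [2, 3, 4, 10, 11, 12, 500]
clipped = clip_outliers(data)
print(clipped)
```

After clipping, the extreme value is pulled back to the upper fence, so it no longer drags a centroid far from the bulk of the data.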

• Scaling with number of dimensions.

As the number of dimensions increases, a distance-based similarity measure converges to a constant
value between any given examples. Reduce dimensionality either by using PCA on the feature data, or
by using “spectral clustering” to modify the clustering algorithm.
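
The convergence of distances in high dimensions can be observed with a small experiment. The sketch below assumes uniformly random points in the unit cube and measures how much the farthest and nearest neighbours of one query point differ relative to the nearest distance; the function name `relative_contrast` and the sample sizes are choices made for this example.

```python
import math
import random

def relative_contrast(dim, n_points=100, seed=0):
    """(max - min) / min distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast should shrink as the dimension grows, meaning the nearest and farthest points become almost equally far away and distance stops separating clusters well.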

2. Use the k-means clustering algorithm to divide the following data into clusters.
D = {2,3,4,10,11,12,20,25,30}
K=2
M1 = 4, M2 = 12

Ans)
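
One way to work this out is to trace the algorithm iteration by iteration. The sketch below assumes absolute difference as the distance and sends ties to the lower-numbered centroid; the function name `kmeans_1d_trace` is made up for this example.

```python
def kmeans_1d_trace(data, centroids, max_iter=100):
    """1-D k-means that records (centroids, clusters) for every iteration."""
    centroids = list(centroids)
    history = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        history.append((centroids, clusters))
        new_centroids = [
            sum(c) / len(c) if c else m for c, m in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # means did not move, so the clusters are final
        centroids = new_centroids
    return history

D = [2, 3, 4, 10, 11, 12, 20, 25, 30]
history = kmeans_1d_trace(D, centroids=[4, 12])
for step, (means, clusters) in enumerate(history, start=1):
    print(step, means, clusters)
# 1 [4, 12] [[2, 3, 4], [10, 11, 12, 20, 25, 30]]
# 2 [3.0, 18.0] [[2, 3, 4, 10], [11, 12, 20, 25, 30]]
# 3 [4.75, 19.6] [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
# 4 [7.0, 25.0] [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
```

Under these assumptions the clusters stop changing at K1 = {2, 3, 4, 10, 11, 12} with mean 7 and K2 = {20, 25, 30} with mean 25.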
3. Use the k-means clustering algorithm to divide the following data into clusters.
D = {2,4,6,9,12,16,20,24,26}
K=2
M1 = 4, M2 = 12

Ans)
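
A possible way to check this by machine is shown below. The sketch assumes absolute difference as the distance, sends ties to the lower-numbered centroid, and also reports the within-cluster sum of squared distances; the function name `kmeans_1d` is made up for this example.

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Plain 1-D k-means; returns (centroids, clusters, iterations used)."""
    centroids = list(centroids)
    clusters = []
    iterations = 0
    for _ in range(max_iter):
        iterations += 1
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        new_centroids = [
            sum(c) / len(c) if c else m for c, m in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # means did not move, so the clusters are final
        centroids = new_centroids
    return centroids, clusters, iterations

D = [2, 4, 6, 9, 12, 16, 20, 24, 26]
means, clusters, iterations = kmeans_1d(D, centroids=[4, 12])
sse = sum((x - m) ** 2 for m, c in zip(means, clusters) for x in c)
print(means)     # [6.6, 21.5]
print(clusters)  # [[2, 4, 6, 9, 12], [16, 20, 24, 26]]
print(iterations, round(sse, 2))
```

Under these assumptions the means move from (4, 12) through (4, 17.83) and (5.25, 19.6) to (6.6, 21.5), giving K1 = {2, 4, 6, 9, 12} and K2 = {16, 20, 24, 26}.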

4. What is clustering? Illustrate with an example the various steps and commands
involved in performing k-means clustering in R.

Ans)
