
INTRODUCTION TO MACHINE LEARNING

NAME:GAYATHRI.K
REG NO:212223230061

Assignment: K-Means Clustering

1. Introduction to Clustering:

Clustering is a form of unsupervised learning in which data is grouped based on similarity.
Unlike supervised learning, there are no predefined labels; the algorithm identifies patterns
or groupings within the data.

K-Means clustering is a popular clustering algorithm that divides a dataset into k distinct,
non-overlapping clusters. Each cluster is defined by its centroid, and each data point is
assigned to the cluster with the nearest centroid.
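The nearest-centroid assignment can be sketched in a few lines of NumPy. The centroid and point values below are purely illustrative, not taken from any particular run:

```python
import numpy as np

# Hypothetical centroids and a single data point (illustrative values only)
centroids = np.array([[2.0, 3.0], [7.0, 7.0]])
point = np.array([3.0, 5.0])

# Euclidean distance from the point to each centroid
distances = np.linalg.norm(centroids - point, axis=1)
nearest = int(np.argmin(distances))
print(f"Distances: {distances}, assigned to cluster {nearest}")
```

Here the point (3, 5) is closer to (2, 3) than to (7, 7), so it joins the first cluster.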

2. Applications of K-Means Clustering:

1. Market Segmentation: Group customers based on purchasing behavior.
2. Image Compression: Reduce the number of colors in images by grouping similar colors.
3. Document Clustering: Group similar documents or articles for better organization.
4. Anomaly Detection: Identify outliers in datasets for fraud detection or security.

3. K-Means Algorithm Steps:

1. Initialize k centroids randomly or using a specific method.
2. Assign each data point to the nearest centroid based on a distance metric (e.g.,
Euclidean distance).
3. Update each centroid to the mean of the points assigned to its cluster.
4. Repeat steps 2 and 3 until:
○ the centroids no longer change, or
○ a predefined number of iterations is reached.
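Steps 2 and 3 above can be sketched as a single iteration function. The starting centroids below are an illustrative choice, and the sketch assumes every cluster receives at least one point:

```python
import numpy as np

def kmeans_step(data, centroids):
    """One assignment + update iteration (steps 2 and 3)."""
    # Step 2: label each point with the index of its nearest centroid
    labels = np.array([np.argmin(np.linalg.norm(p - centroids, axis=1))
                       for p in data])
    # Step 3: move each centroid to the mean of its assigned points
    # (assumes no cluster ends up empty)
    new_centroids = np.array([data[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

data = np.array([[2, 3], [3, 3], [6, 7], [8, 8], [3, 5], [7, 6]], dtype=float)
centroids = np.array([[2.0, 3.0], [8.0, 8.0]])  # illustrative starting centroids
labels, centroids = kmeans_step(data, centroids)
print(labels, centroids)
```

With these starting centroids, the points (2, 3), (3, 3), (3, 5) fall in one cluster and (6, 7), (8, 8), (7, 6) in the other, and the centroids move to the cluster means.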
4. Advantages and Disadvantages:

Advantages:

● Simple and easy to implement.
● Efficient on large datasets.
● Works well when clusters are distinct and well separated.

Disadvantages:

● Sensitive to the initial placement of centroids.
● Struggles with clusters of varying sizes and densities.
● Requires the number of clusters (k) to be specified beforehand.
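The sensitivity to initial centroid placement is commonly reduced with k-means++-style seeding, which spreads the starting centroids apart. A minimal sketch (the function name `kmeans_pp_init` is hypothetical, not part of any library):

```python
import numpy as np

def kmeans_pp_init(data, k, rng=None):
    """k-means++-style seeding: pick initial centroids far apart (sketch)."""
    rng = np.random.default_rng(rng)
    centroids = [data[rng.integers(len(data))]]  # first centroid: uniform pick
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d2
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centroids)

data = np.array([[2, 3], [3, 3], [6, 7], [8, 8], [3, 5], [7, 6]], dtype=float)
print(kmeans_pp_init(data, k=2, rng=0))
```

Because already-chosen points have distance zero, they are never picked again, so the k seeds are always distinct data points.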

5. Python Implementation:

The dataset consists of 2D points:

Points: (2, 3), (3, 3), (6, 7), (8, 8), (3, 5), (7, 6)

Code:
import numpy as np

# Example dataset
data = np.array([[2, 3], [3, 3], [6, 7], [8, 8], [3, 5], [7, 6]])
k = 2  # Number of clusters

# Initialize centroids randomly (without replacement)
centroids = data[np.random.choice(data.shape[0], k, replace=False)]

# K-Means algorithm
for _ in range(100):  # Max iterations
    clusters = [[] for _ in range(k)]
    for point in data:
        # Assign each point to the nearest centroid
        idx = np.argmin([np.linalg.norm(point - c) for c in centroids])
        clusters[idx].append(point)

    # Update centroids (keep the old centroid if a cluster is empty)
    new_centroids = [np.mean(cluster, axis=0) if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    if np.allclose(new_centroids, centroids):  # Check for convergence
        break
    centroids = new_centroids

# Display results
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster}")
print(f"Final Centroids: {centroids}")

6. Conclusion:

The K-Means algorithm is a powerful clustering technique for grouping similar data. Its simplicity
and scalability make it a popular choice in many practical applications. However, understanding
its limitations and choosing k wisely are critical for good results.
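One common way to choose k is the elbow method: compute the total within-cluster squared distance (inertia) for several values of k and look for the value where it stops dropping sharply. A minimal sketch on the assignment's six points, where the k = 2 centroids are the cluster means found above (illustrative, since a random initialization may vary):

```python
import numpy as np

def inertia(data, centroids):
    """Total squared distance from each point to its nearest centroid."""
    d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
    return d2.sum()

data = np.array([[2, 3], [3, 3], [6, 7], [8, 8], [3, 5], [7, 6]], dtype=float)
candidates = [(1, data.mean(axis=0, keepdims=True)),          # k = 1: global mean
              (2, np.array([[8/3, 11/3], [7.0, 7.0]]))]       # k = 2: cluster means
for k, centroids in candidates:
    print(f"k={k}: inertia={inertia(data, centroids):.2f}")
```

Inertia always decreases as k grows, so the goal is not the smallest value but the "elbow" where further increases in k yield only small improvements.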
