
Commit cd74a77

Create clustering.md
1 parent 3f999a6 commit cd74a77

1 file changed: 115 additions, 0 deletions

# Clustering

Clustering is an unsupervised machine learning technique that groups objects so that those in the same group (a cluster) are more similar to each other than to those in other groups. This README gives an overview of clustering: its fundamental concepts, types, algorithms, and evaluation metrics, plus a Python implementation.

## Table of Contents

1. [Introduction](#introduction)
2. [Concepts](#concepts)
3. [Types of Clustering](#types-of-clustering)
4. [Clustering Algorithms](#clustering-algorithms)
5. [Implementation](#implementation)
   - [Using Scikit-learn](#using-scikit-learn)
   - [Code Example](#code-example)
6. [Evaluation Metrics](#evaluation-metrics)
7. [Conclusion](#conclusion)
8. [References](#references)

## Introduction

Clustering finds inherent groupings in data without pre-labeled targets. It is widely used in exploratory data analysis, pattern recognition, image analysis, information retrieval, and bioinformatics.

## Concepts

### Centroid

A centroid is the center of a cluster. In k-means, for example, each cluster is represented by its centroid, which is the coordinate-wise mean of all the data points in the cluster.
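
As a small illustration, the centroid of a cluster can be computed by averaging each coordinate separately. The helper name `centroid` and the sample points are made up for this sketch, which uses only the standard library.

```python
def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    # zip(*points) groups the first coordinates together, then the second, and so on
    return [sum(coords) / n for coords in zip(*points)]


cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]  # hypothetical 2-D cluster
center = centroid(cluster)
print(center)  # [3.0, 2.0]
```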

### Distance Measure

Distance measures quantify how dissimilar two data points are. Common choices include Euclidean distance, Manhattan distance, and cosine distance (one minus the cosine similarity of the two vectors).
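
The three measures named above can be sketched in a few lines of standard-library Python. The function names are illustrative, and the cosine version assumes neither vector is all zeros.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0 (orthogonal vectors)
```

Euclidean and Manhattan distance depend on the scale of the features, while cosine distance only looks at the angle between the vectors, so the right choice depends on the data.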
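
### Inertia

Inertia measures how compact the clusters are. It is the sum of squared distances from each sample to its nearest cluster center. Lower inertia means tighter clusters, but inertia also shrinks as the number of clusters grows, so it is most useful for comparing runs with the same number of clusters.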
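
A minimal sketch of that definition, using squared Euclidean distance and made-up points and centers (the names `inertia` and `squared_distance` are illustrative):

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]  # hypothetical cluster centers
total = inertia(points, centers)
print(total)  # 4.0 (each point is at squared distance 1 from its center)
```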

## Types of Clustering

1. **Hard Clustering**: Each data point belongs to exactly one cluster.
2. **Soft Clustering (Fuzzy Clustering)**: Each data point can belong to several clusters, with a degree of membership for each.
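
The difference between the two types can be sketched for a single point and two fixed centers. The soft weights below are proportional to inverse squared distance, which is one common choice (it matches the fuzzy c-means membership formula with fuzzifier 2); the function names and data are assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_assignment(point, centers):
    """Index of the single nearest center."""
    dists = [squared_distance(point, c) for c in centers]
    return dists.index(min(dists))


def soft_assignment(point, centers, eps=1e-12):
    """Membership weights proportional to inverse squared distance; they sum to 1."""
    # eps avoids division by zero when the point coincides with a center
    weights = [1.0 / (squared_distance(point, c) + eps) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


centers = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centers
point = (1.0, 0.0)
print(hard_assignment(point, centers))  # 0
print(soft_assignment(point, centers))  # roughly [0.9, 0.1]
```

Taking the index of the largest soft membership recovers the hard assignment.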

## Clustering Algorithms

### K-Means Clustering

K-Means is a popular clustering algorithm that partitions the data into k clusters, where each data point belongs to the cluster with the nearest mean. The algorithm follows these steps:

1. Initialize k centroids, for example by picking k data points at random.
2. Assign each data point to the nearest centroid.
3. Recalculate each centroid as the mean of all data points assigned to its cluster.
4. Repeat steps 2 and 3 until the assignments stop changing or a maximum number of iterations is reached.
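
The four steps above can be sketched in plain Python. This is an illustrative implementation, not scikit-learn's: the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice to keep the old centroid when a cluster ends up empty are all assumptions of this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k distinct data points
    labels = None
    for _ in range(max_iter):
        # step 2: assign each point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # step 4: stop once assignments no longer change
            break
        labels = new_labels
        # step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)
```

Which group gets label 0 depends on the random initialization, so only the grouping itself is meaningful. K-means can also converge to different results from different starting points, which is why library implementations usually run several initializations and keep the best one.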

### Hierarchical Clustering

Hierarchical clustering builds a tree of nested clusters. There are two types:

- **Agglomerative (bottom-up)**: Starts with each data point as its own cluster and repeatedly merges the two closest clusters.
- **Divisive (top-down)**: Starts with all data points in one cluster and repeatedly splits clusters into smaller ones.
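
A minimal sketch of the agglomerative variant follows. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible definitions of "closest", and it stops at a requested number of clusters instead of building the full tree. The function name and data are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def cluster_distance(a, b):
        # single linkage: distance between the closest pair of members
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest linkage distance
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a stays valid
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters = single_linkage(points, n_clusters=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

This brute-force version recomputes every pairwise distance on each merge, so it is only suitable for small inputs.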
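
### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups points that lie close together, using two parameters: a neighborhood radius and a minimum number of points required within that radius to form a dense region. Points that do not belong to any dense region are labeled as noise. DBSCAN can find arbitrarily shaped clusters, does not need the number of clusters in advance, and is robust to noise.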
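
The idea can be sketched with a brute-force neighbor search. The parameter names `eps` and `min_samples` follow scikit-learn's naming, but the function itself, the `NOISE` constant, and the sample data are illustrative assumptions, and a point counts as its own neighbor here.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # all points within eps of point i, including i itself
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:  # j is also a core point
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (50, 50) has no neighbors within `eps`, so it stays labeled as noise.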

## Implementation

### Using Scikit-learn

Scikit-learn is a popular Python machine learning library. Its `sklearn.cluster` module provides clustering tools, including `KMeans`, `AgglomerativeClustering`, and `DBSCAN`.

### Code Example

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Load the dataset (replace the path with your own CSV file)
data = pd.read_csv('path/to/your/dataset.csv')

# Keep only numeric columns: StandardScaler and KMeans need numeric input.
# Missing values, if any, must be handled before this step.
features = data.select_dtypes(include='number')

# Standardize the features so that no single column dominates the distances
scaler = StandardScaler()
data_scaled = scaler.fit_transform(features)

# Initialize and fit the KMeans model (n_init is set explicitly for reproducibility)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(data_scaled)

# Get the cluster label assigned to each row
labels = kmeans.labels_

# Calculate the mean silhouette score (ranges from -1 to 1; higher is better)
silhouette_avg = silhouette_score(data_scaled, labels)
print("Silhouette Score:", silhouette_avg)

# Add the cluster labels to the original data
data['Cluster'] = labels

print(data.head())
```

## Evaluation Metrics

- **Silhouette Score**: Compares each point's mean distance to its own cluster with its mean distance to the nearest other cluster. It ranges from -1 to 1; higher is better.
- **Inertia (Within-cluster Sum of Squares)**: Measures the compactness of the clusters. Lower is better.
- **Davies-Bouldin Index**: Averages, over all clusters, the similarity between each cluster and the cluster most similar to it, where similarity is the ratio of within-cluster scatter to the distance between centroids. Lower is better.
- **Dunn Index**: Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. Higher is better.
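
As a rough illustration of two of these metrics, the sketch below computes the mean silhouette score and one variant of the Dunn index by brute force. The function names and data are made up. The silhouette helper assumes at least two clusters and scores singleton clusters as 0. The Dunn helper uses the closest pair of points between clusters and the largest pairwise distance within a cluster, and assumes at least one cluster has two distinct points.

```python
import math


def group_by_label(points, labels):
    """Map each label to the list of points carrying it."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    return clusters


def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = group_by_label(points, labels)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def dunn_index(points, labels):
    """Minimum inter-cluster distance divided by maximum intra-cluster distance."""
    groups = list(group_by_label(points, labels).values())
    min_between = min(
        math.dist(p, q)
        for i, g in enumerate(groups) for h in groups[i + 1:]
        for p in g for q in h
    )
    max_within = max(math.dist(p, q) for g in groups for p in g for q in g)
    return min_between / max_within


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouette(points, labels)
dunn = dunn_index(points, labels)
print(sil)   # about 0.90
print(dunn)  # 10.0
```

For real datasets, `sklearn.metrics` provides `silhouette_score` and `davies_bouldin_score`, which are likely a better fit than this brute-force sketch.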

## Conclusion

Clustering is a powerful technique for discovering structure in unlabeled data. Understanding how the algorithms differ and how to evaluate their output is crucial for selecting an appropriate method for a given problem.

## References

- [Scikit-learn Documentation](https://scikit-learn.org/stable/modules/clustering.html)
- [Wikipedia: Cluster Analysis](https://en.wikipedia.org/wiki/Cluster_analysis)
- [Towards Data Science: A Comprehensive Guide to Clustering](https://towardsdatascience.com/a-comprehensive-guide-to-clustering-9789897f8b88)
