Commit 7be03ee

Merge pull request animator#720 from Antiquely3059/main

Added Clustering

2 parents d1d5698 + 3a8ac54

File tree

2 files changed: +97 -0 lines changed

contrib/machine-learning/clustering.md (+96)

@@ -0,0 +1,96 @@

# Clustering

Clustering is an unsupervised machine learning technique that groups a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). This README provides an overview of clustering, covering its fundamental concepts, types, algorithms, and how to implement it using Python.

## Introduction

Clustering is a technique used to find inherent groupings within data without pre-labeled targets. It is widely used in exploratory data analysis, pattern recognition, image analysis, information retrieval, and bioinformatics.

## Concepts

### Centroid

A centroid is the center of a cluster. In the k-means clustering algorithm, for example, each cluster is represented by its centroid, which is the mean of all the data points in the cluster.
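
As a tiny illustration (with made-up points), the centroid is just the per-feature mean:

```python
import numpy as np

# Three made-up 2-D points assigned to the same cluster.
cluster_points = np.array([[1.0, 2.0],
                           [2.0, 3.0],
                           [3.0, 4.0]])

centroid = cluster_points.mean(axis=0)  # per-feature mean
print(centroid)  # [2. 3.]
```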

### Distance Measure

Distance measures are used to quantify the similarity or dissimilarity between data points. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity.
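
A minimal NumPy sketch of the three measures named above, on two made-up vectors:

```python
import numpy as np

# Two made-up feature vectors for illustration.
a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 1.0, 0.0])

euclidean = np.linalg.norm(a - b)  # straight-line (L2) distance
manhattan = np.abs(a - b).sum()    # city-block (L1) distance
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity

print(euclidean, manhattan, cosine_sim)
```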

### Inertia

Inertia is a metric used to assess the quality of the clusters formed. It is the sum of squared distances of samples to their nearest cluster center.
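
A small sketch on synthetic data: scikit-learn exposes a fitted KMeans model's inertia as the `inertia_` attribute, and the same value can be recomputed by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)  # synthetic 2-D data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sum of squared distances of each point to its assigned cluster center.
manual_inertia = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(km.inertia_, manual_inertia)  # the two values agree
```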

## Types of Clustering

1. **Hard Clustering**: Each data point either belongs to a cluster completely or not at all.
2. **Soft Clustering (Fuzzy Clustering)**: Each data point can belong to multiple clusters with varying degrees of membership (see the sketch after this list).
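
As a sketch of the difference on synthetic data: scikit-learn's `KMeans` produces hard assignments, while `GaussianMixture` (a soft-clustering model) exposes per-cluster membership probabilities. The data and cluster counts here are made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(42).rand(200, 2)  # synthetic 2-D data

# Hard clustering: exactly one integer label per point.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
print(hard_labels[:5])

# Soft clustering: a probability of membership in each cluster.
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
print(gmm.predict_proba(X)[:5])
```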

## Clustering Algorithms

### K-Means Clustering

K-Means is a popular clustering algorithm that partitions the data into k clusters, where each data point belongs to the cluster with the nearest mean. The algorithm follows these steps (sketched in code after this list):

1. Initialize k centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids as the mean of all data points assigned to each cluster.
4. Repeat steps 2 and 3 until convergence.
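
A minimal NumPy sketch of these four steps (an illustration, not an optimized implementation; the data and `k` are made up, and it assumes no cluster ever empties out):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means; assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).random((150, 2))  # synthetic 2-D data
labels, centroids = kmeans(X, k=3)
print(centroids)
```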

### Hierarchical Clustering

Hierarchical clustering builds a tree of clusters. There are two types (the agglomerative variant is sketched below):

- **Agglomerative (bottom-up)**: Starts with each data point as a separate cluster and iteratively merges the closest pairs of clusters.
- **Divisive (top-down)**: Starts with all data points in one cluster and iteratively splits it into smaller clusters.
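
A short sketch of the agglomerative variant using scikit-learn's `AgglomerativeClustering`, on synthetic data (the cluster count and linkage choice are made up for illustration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).rand(50, 2)  # synthetic 2-D data

# Bottom-up: each point starts as its own cluster, and the closest pairs
# are merged until only n_clusters remain. `linkage="ward"` merges the
# pair that minimizes the increase in within-cluster variance.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])
```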

### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups together points that are close to each other based on a distance measure and a minimum number of points. It can find arbitrarily shaped clusters and is robust to noise.
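
A minimal sketch using scikit-learn's `DBSCAN` on synthetic data; the `eps` and `min_samples` values are made up and would need tuning for real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)  # synthetic 2-D data

# eps is the neighborhood radius; min_samples is the minimum number of
# points required to form a dense region. Points belonging to no dense
# region are labeled -1 (noise).
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids, possibly including -1 for noise
```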

## Implementation

### Using Scikit-learn

Scikit-learn is a popular machine learning library in Python that provides tools for clustering.

### Code Example

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load dataset (assumed to contain only numeric feature columns,
# since StandardScaler requires numeric input)
data = pd.read_csv('path/to/your/dataset.csv')

# Preprocess the data: standardize features to zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Initialize and fit the KMeans model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(data_scaled)

# Get cluster labels
labels = kmeans.labels_

# Calculate the silhouette score
silhouette_avg = silhouette_score(data_scaled, labels)
print("Silhouette Score:", silhouette_avg)

# Add cluster labels to the original data
data['Cluster'] = labels
print(data.head())
```

## Evaluation Metrics

- **Silhouette Score**: Measures how similar a data point is to its own cluster compared to other clusters.
- **Inertia (Within-cluster Sum of Squares)**: Measures the compactness of the clusters.
- **Davies-Bouldin Index**: Measures the average similarity ratio of each cluster with the cluster most similar to it.
- **Dunn Index**: The ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
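
The first three metrics above can be computed with scikit-learn; the Dunn index has no scikit-learn implementation and would have to be computed by hand or taken from a third-party package. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = np.random.RandomState(0).rand(200, 3)  # synthetic 3-D data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

print("Silhouette:", silhouette_score(X, labels))          # higher is better
print("Inertia:", km.inertia_)                             # lower is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # lower is better
```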

## Conclusion

Clustering is a powerful technique for discovering structure in data. Understanding the different clustering algorithms and their evaluation metrics is crucial for selecting the appropriate method for a given problem.

contrib/machine-learning/index.md (+1)

@@ -10,4 +10,5 @@
  - [PyTorch.md](pytorch.md)
  - [Types of optimizers](Types_of_optimizers.md)
  - [Logistic Regression](logistic-regression.md)
+ - [Clustering](clustering.md)
  - [Grid Search](grid-search.md)
