Commit bb38f8f

Update K-Means_Clustering.md
1 parent aefb52e commit bb38f8f

1 file changed

+58
-48
lines changed

contrib/machine-learning/K-Means_Clustering.md

Lines changed: 58 additions & 48 deletions
@@ -1,19 +1,25 @@
11
# K-Means Clustering
22
Unsupervised Learning Algorithm for Grouping Similar Data.
3+
34
## Introduction
45
K-means clustering is a fundamental unsupervised machine learning algorithm that excels at grouping similar data points together. It's a popular choice due to its simplicity and efficiency in uncovering hidden patterns within unlabeled datasets.
6+
57
## Unsupervised Learning
68
Unlike supervised learning algorithms that rely on labeled data for training, unsupervised algorithms, like K-means, operate solely on input data (without predefined categories). Their objective is to discover inherent structures or groupings within the data.
9+
710
## The K-Means Objective
811
K-means organizes similar data points into clusters to unveil underlying patterns. Its objective is to minimize the total intra-cluster variance, also known as the squared error function.
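
For reference, the objective described above is conventionally written as follows. The symbols ($J$ for the objective, $C_j$ for the set of points in cluster $j$, $\mu_j$ for its centroid, $x_i$ for a data point) are the usual textbook notation and are an assumption here, not something taken from this file:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller $J$ means the points sit closer to the centroids of the clusters they belong to.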
912

1013
![image](assets/knm.png)
1114
## Clusters and Centroids
1215
A cluster represents a collection of data points that share similar characteristics. K-means identifies a pre-determined number (k) of clusters within the dataset. Each cluster is represented by a centroid, which acts as its central point (imaginary or real).
16+
1317
## Minimizing In-Cluster Variation
1418
The K-means algorithm strategically assigns each data point to a cluster such that the total variation within each cluster (measured by the sum of squared distances between points and their centroid) is minimized. In simpler terms, K-means strives to create clusters where data points are close to their respective centroids.
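
To make the "sum of squared distances" concrete, here is a minimal sketch that scores two different groupings of the same four points. The helper name `within_cluster_ss` and the data are made up for illustration:

```python
import numpy as np

def within_cluster_ss(points, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for j in range(k):
        members = points[labels == j]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total

# Four made-up points forming two tight pairs.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

good = np.array([0, 0, 1, 1])  # each tight pair is one cluster
bad = np.array([0, 1, 0, 1])   # each cluster mixes points from both pairs

wcss_good = within_cluster_ss(points, good, k=2)
wcss_bad = within_cluster_ss(points, bad, k=2)
print(wcss_good, wcss_bad)  # 1.0 100.0
```

The grouping that keeps nearby points together has a far lower total, which is the quantity K-means tries to drive down.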
19+
1520
## The Meaning Behind "K-Means"
1621
The "means" in K-means refers to the averaging process used to compute the centroid, essentially finding the center of each cluster.
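
As a small illustration of that averaging step, the sketch below computes the centroid of three made-up points with NumPy:

```python
import numpy as np

# Three made-up 2-D points that belong to the same cluster.
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])

# The centroid is the coordinate-wise mean of the cluster's points.
centroid = cluster_points.mean(axis=0)
print(centroid)  # [3. 5.]
```

Note that `(3, 5)` is not one of the original points, which is why a centroid may be "imaginary".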
22+
1723
## K-Means Algorithm in Action
1824
![image](assets/km_.png)
1925
The K-means algorithm follows an iterative approach to optimize cluster formation:
@@ -24,62 +30,66 @@ The K-means algorithm follows an iterative approach to optimize cluster formatio
2430
4. **Iteration Until Convergence:** Steps 2 and 3 are repeated iteratively until a stopping criterion is met. This criterion can be either:
2531
- **Centroid Stability:** No significant change occurs in the centroids' positions, indicating successful clustering.
2632
- **Reaching Maximum Iterations:** A predefined number of iterations is completed.
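
The loop and its two stopping criteria can be sketched in plain NumPy. Steps 1 to 3 fall outside this hunk, so the sketch assumes the conventional sequence (initialize centroids, assign points, update centroids). The function name `kmeans_sketch`, its parameters, and the sample data are hypothetical, and this is not how scikit-learn implements the algorithm:

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-means loop; names and defaults are illustrative only."""
    rng = np.random.default_rng(seed)
    # Initialization (assumed): use k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for iteration in range(1, max_iter + 1):
        # Assignment: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update: move each centroid to the mean of its assigned points.
        # A cluster that received no points keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stopping criterion 1: centroid stability.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Stopping criterion 2 is the loop bound itself (max_iter).
    return centroids, labels, iteration

# Made-up data: two well-separated pairs along the x-axis.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centroids, labels, n_iter = kmeans_sketch(X, k=2)
print(centroids, labels, n_iter)
```

Which cluster receives index 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.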
27-
## Code
28-
Following is a simple implementation of K-Means.
33+
34+
## Code
35+
Following is a simple implementation of K-Means.
2936

30-
31-
# Generate and Visualize Sample Data
32-
# import the necessary Libraries
33-
34-
import numpy as np
35-
import matplotlib.pyplot as plt
37+
```python
38+
# Generate and Visualize Sample Data
39+
# import the necessary Libraries
3640

37-
# Create data points for cluster 1 and cluster 2
38-
X = -2 * np.random.rand(100, 2)
39-
X1 = 1 + 2 * np.random.rand(50, 2)
40-
41-
# Combine data points from both clusters
42-
X[50:100, :] = X1
43-
44-
# Plot data points and display the plot
45-
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
46-
plt.show()
47-
48-
# K-Means Model Creation and Training
49-
from sklearn.cluster import KMeans
50-
51-
# Create KMeans object with 2 clusters
52-
kmeans = KMeans(n_clusters=2)
53-
kmeans.fit(X) # Train the model on the data
54-
55-
# Visualize Data Points with Centroids
56-
centroids = kmeans.cluster_centers_ # Get centroids (cluster centers)
57-
58-
plt.scatter(X[:, 0], X[:, 1], s=50, c='b') # Plot data points again
59-
plt.scatter(centroids[0, 0], centroids[0, 1], s=200, c='g', marker='s') # Plot centroid 1
60-
plt.scatter(centroids[1, 0], centroids[1, 1], s=200, c='r', marker='s') # Plot centroid 2
61-
plt.show() # Display the plot with centroids
62-
63-
# Predict Cluster Label for New Data Point
64-
new_data = np.array([-3.0, -3.0])
65-
new_data_reshaped = new_data.reshape(1, -1)
66-
predicted_cluster = kmeans.predict(new_data_reshaped)
67-
print("Predicted cluster for new data:", predicted_cluster)
68-
69-
### Output:
70-
Before Implementing K-Means Clustering
41+
import numpy as np
42+
import matplotlib.pyplot as plt
43+
44+
# Create data points for cluster 1 and cluster 2
45+
X = -2 * np.random.rand(100, 2)
46+
X1 = 1 + 2 * np.random.rand(50, 2)
47+
48+
# Combine data points from both clusters
49+
X[50:100, :] = X1
50+
51+
# Plot data points and display the plot
52+
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
53+
plt.show()
54+
55+
# K-Means Model Creation and Training
56+
from sklearn.cluster import KMeans
57+
58+
# Create KMeans object with 2 clusters
59+
kmeans = KMeans(n_clusters=2)
60+
kmeans.fit(X) # Train the model on the data
61+
62+
# Visualize Data Points with Centroids
63+
centroids = kmeans.cluster_centers_ # Get centroids (cluster centers)
64+
65+
plt.scatter(X[:, 0], X[:, 1], s=50, c='b') # Plot data points again
66+
plt.scatter(centroids[0, 0], centroids[0, 1], s=200, c='g', marker='s') # Plot centroid 1
67+
plt.scatter(centroids[1, 0], centroids[1, 1], s=200, c='r', marker='s') # Plot centroid 2
68+
plt.show() # Display the plot with centroids
69+
70+
# Predict Cluster Label for New Data Point
71+
new_data = np.array([-3.0, -3.0])
72+
new_data_reshaped = new_data.reshape(1, -1)
73+
predicted_cluster = kmeans.predict(new_data_reshaped)
74+
print("Predicted cluster for new data:", predicted_cluster)
75+
```
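
Because both the sample data and the centroid initialization are random, the cluster index printed for the new point can differ between runs. The sketch below is one way to make the example repeatable, assuming a seeded NumPy generator and scikit-learn's `random_state` and `n_init` parameters. The seed values are arbitrary choices, not part of the original example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Seeded generator so the sample data is identical on every run.
rng = np.random.default_rng(0)
X = np.vstack([
    -2 * rng.random((50, 2)),     # coordinates in (-2, 0]
    1 + 2 * rng.random((50, 2)),  # coordinates in [1, 3)
])

# Fixed random_state makes the centroid initialization repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster indices are arbitrary, so look up which index is the lower-left group.
lower_left = int(np.argmin(kmeans.cluster_centers_.sum(axis=1)))

predicted = int(kmeans.predict(np.array([[-3.0, -3.0]]))[0])
print("Predicted cluster:", predicted, "| lower-left cluster:", lower_left)
```

Comparing the prediction against the centroid positions, rather than against a hard-coded `0` or `1`, keeps the check valid even if the label numbering changes.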
76+
77+
### Output:
78+
Before Implementing K-Means Clustering
7179
![Before Implementing K-Means Clustering](assets/km_2.png)
7280

73-
After Implementing K-Means Clustering
74-
![After Implementing K-Means Clustering](assets/km_3.png)
81+
After Implementing K-Means Clustering
82+
![After Implementing K-Means Clustering](assets/km_3.png)
83+
84+
Predicted cluster for new data: `[0]`
7585

76-
Predicted cluster for new data: [0]
7786
## Conclusion
7887
**K-Means** works best on numeric, continuous data with a relatively small number of dimensions, and it can find groups that have not been explicitly labeled in the data. Example applications include Document Classification, Delivery Store Optimization, and Customer Segmentation.
79-
## Reference
80-
[[Survey of Machine Learning and Data Mining Techniques used in Multimedia System](https://www.researchgate.net/publication/333457161_Survey_of_Machine_Learning_and_Data_Mining_Techniques_used_in_Multimedia_System?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ)]
8188

82-
[[A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database](https://www.researchgate.net/publication/339267868_A_Clustering_Approach_for_Outliers_Detection_in_a_Big_Point-of-Sales_Database?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ)]
89+
## References
90+
91+
- [Survey of Machine Learning and Data Mining Techniques used in Multimedia System](https://www.researchgate.net/publication/333457161_Survey_of_Machine_Learning_and_Data_Mining_Techniques_used_in_Multimedia_System?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ)
92+
- [A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database](https://www.researchgate.net/publication/339267868_A_Clustering_Approach_for_Outliers_Detection_in_a_Big_Point-of-Sales_Database?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ)
8393

8494

8595
