Description
Describe the bug
Hello,
We're a team working on a clustering problem. To make the results reproducible, we set random_state=0 in KMeans(). However, when we ran the same code on the same data a week later, we got a different result. We don't know why this happened, and we have not been able to reproduce the earlier result.
Steps/Code to Reproduce
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# `data` (the features) and `demo` are our own pandas objects, not included here.
scaler = StandardScaler()
scaled_km_X = scaler.fit_transform(data)
model = KMeans(n_clusters=4, init='k-means++', random_state=0).fit(scaled_km_X)
loss = model.inertia_
data_labels = model.labels_ + 1  # shift cluster labels from 0-3 to 1-4
data_pred = pd.concat([data, demo], axis=1)
data_pred["Cluster_number"] = data_labels
data_pred
Expected Results
Expected number of observations in each cluster:
Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
212 | 65 | 242 | 312
Actual Results
Now:
Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
321 | 67 | 192 | 251
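One possible way to narrow this down is to check whether the input or the environment changed between the two runs. This is only a sketch, not a confirmed cause. It assumes synthetic data from `make_blobs` as a stand-in for our `data`. The helpers `fingerprint` and `fit_labels` are hypothetical names introduced for illustration. The sketch hashes the scaled input, fits KMeans twice with the same seed, and compares the two partitions with the adjusted Rand index, which ignores how the clusters are numbered. `n_init` is pinned explicitly because its default differs between scikit-learn releases.

```python
import hashlib

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def fingerprint(array):
    """Return a SHA-256 hex digest of the array's shape, dtype, and raw bytes."""
    arr = np.ascontiguousarray(array)
    digest = hashlib.sha256()
    digest.update(str(arr.shape).encode())
    digest.update(str(arr.dtype).encode())
    digest.update(arr.tobytes())
    return digest.hexdigest()


def fit_labels(X, seed=0):
    """Fit KMeans with explicit settings and return (labels, inertia)."""
    model = KMeans(
        n_clusters=4,
        init="k-means++",
        n_init=10,  # pinned explicitly; the default is not the same in every release
        random_state=seed,
    ).fit(X)
    return model.labels_, model.inertia_


# Assumption: synthetic stand-in for the real `data`, with the same row count (831).
X_raw, _ = make_blobs(n_samples=831, centers=4, n_features=5, random_state=42)
X = StandardScaler().fit_transform(X_raw)

# Fit twice on the same input with the same seed.
labels_a, inertia_a = fit_labels(X)
labels_b, inertia_b = fit_labels(X)

# Identical bytes give an identical digest; storing this digest alongside the
# results makes it possible to tell later whether the input has changed.
same_input = fingerprint(X) == fingerprint(X.copy())

# An adjusted Rand index of 1.0 means both runs produced the same partition,
# regardless of which integer each cluster was assigned.
ari = adjusted_rand_score(labels_a, labels_b)

# Sorted cluster sizes, comparable even if the cluster numbering differs.
sizes_a = np.sort(np.bincount(labels_a, minlength=4))
sizes_b = np.sort(np.bincount(labels_b, minlength=4))

print("input digest:", fingerprint(X))
print("same input:", same_input)
print("adjusted Rand index:", ari)
print("sorted sizes, run A:", sizes_a)
print("sorted sizes, run B:", sizes_b)
```

If the digest recorded with the earlier result differs from the current one, the input changed. If the digests match but the partitions differ, comparing the installed package versions between the two runs would be the next thing to check.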
Versions
System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True