Different cluster result with same random_state #21749

Closed
ChristyQu opened this issue Nov 23, 2021 · 3 comments

ChristyQu commented Nov 23, 2021

Describe the bug

Hello,

We're a team working on a clustering problem. To make the results reproducible, we set random_state in KMeans() to 0. However, we got a different result when we ran the same code on the same data a week later. We're unsure why this happened, and we still can't reproduce the previous result.

Steps/Code to Reproduce

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# `data` and `demo` are pandas DataFrames defined elsewhere (not shown in this issue)
scaler = StandardScaler()
scaled_km_X = scaler.fit_transform(data)
model = KMeans(n_clusters=4, init='k-means++', random_state=0).fit(scaled_km_X)
loss = model.inertia_
data_labels = model.labels_ + 1  # shift labels from 0..3 to 1..4

data_pred = pd.concat([data, demo], axis=1)
data_pred["Cluster_number"] = data_labels
data_pred

Expected Results

Expected number of observations in each cluster:

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
212 | 65 | 242 | 312

Actual Results

Now:

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
321 | 67 | 192 | 251

Versions

System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

glemaitre (Member) commented Nov 23, 2021

Are you using the same version of scikit-learn for both runs?
We fixed a couple of things in 1.0.1 that solved some reproducibility issues: #21195
However, this means that results can change between versions.
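
Since results can differ between releases, it may help to store the library versions next to the clustering output, so a later run can tell whether the environment changed. The following is only a minimal sketch: the helper name `fit_with_provenance` and the synthetic `X_demo` array are assumptions made for illustration and are not part of the original report.

```python
import json
import platform

import numpy as np
import sklearn
from sklearn.cluster import KMeans


def fit_with_provenance(X, n_clusters=4, random_state=0):
    """Fit KMeans and return cluster sizes together with the versions used.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    model = KMeans(
        n_clusters=n_clusters,
        init="k-means++",
        n_init=10,  # set explicitly so the value does not depend on library defaults
        random_state=random_state,
    ).fit(X)
    # Count how many observations fall into each cluster label 0..n_clusters-1.
    sizes = np.bincount(model.labels_, minlength=n_clusters)
    return {
        "sklearn": sklearn.__version__,
        "numpy": np.__version__,
        "python": platform.python_version(),
        "random_state": random_state,
        "cluster_sizes": sizes.tolist(),
        "inertia": float(model.inertia_),
    }


# Synthetic stand-in data (assumption); replace with the real scaled features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))

record = fit_with_provenance(X_demo)
print(json.dumps(record, indent=2))
```

Saving this record with each run makes it possible to check afterwards whether two differing results came from the same scikit-learn version.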

ChristyQu (Author) commented

I didn't pay much attention to the scikit-learn version during the first run, but I've rerun the code with version 1.0 and it produces the same result as version 1.0.1.
The issue occurred around Nov 17, which I believe was after version 1.0.1 was released (correct me if I'm wrong).

ChristyQu (Author) commented

Issue resolved: I was able to reproduce the previous result using an older version (0.22.1).
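
For anyone who needs to keep reproducing the earlier numbers, one option is to pin the version at install time (for example `pip install scikit-learn==0.22.1`) and fail early if a different version is installed. This is a rough sketch under those assumptions; `check_sklearn_version` and `EXPECTED_SKLEARN` are hypothetical names, not a scikit-learn API.

```python
import sklearn

# Version reported in this thread as reproducing the earlier result.
EXPECTED_SKLEARN = "0.22.1"


def check_sklearn_version(expected=EXPECTED_SKLEARN, installed=None):
    """Raise if the installed scikit-learn differs from the expected version.

    Hypothetical guard for illustration; `installed` defaults to the
    version of the scikit-learn package that is actually imported.
    """
    installed = sklearn.__version__ if installed is None else installed
    if installed != expected:
        raise RuntimeError(
            f"scikit-learn {installed} is installed, but results were "
            f"recorded with {expected}; clustering output may differ."
        )
    return installed


# Example call that always passes: compare the installed version with itself.
print(check_sklearn_version(expected=sklearn.__version__))
```

In a real script the call would be `check_sklearn_version()` at the top, before fitting, so that a version mismatch is reported before any results are produced.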
