Different cluster result with same random_state #21749

ChristyQu · 2021-11-23T03:27:22Z

Describe the bug

Hello,

We're a team working on a clustering problem. To make the result reproducible, we set random_state in KMeans() to 0. However, we got a different result when we ran the same code on the same data a week later. We're unsure why this happened and still can't reproduce the previous result.

Steps/Code to Reproduce

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# `data` and `demo` are defined elsewhere and are not shown in this report.
scaler = StandardScaler()
scaled_km_X = scaler.fit_transform(data)
model = KMeans(n_clusters=4, init='k-means++', random_state=0).fit(scaled_km_X)
loss = model.inertia_
data_labels = model.labels_ + 1  # 1-based cluster numbers

data_pred = pd.concat([data, demo], axis=1)
data_pred["Cluster_number"] = data_labels
data_pred

Expected Results

Expected number of observations in each cluster:

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
212 | 65 | 242 | 312

Actual Results

Now:

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
321 | 67 | 192 | 251

Versions

System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

glemaitre · 2021-11-23T11:07:31Z

Are you using the same version of scikit-learn for both runs?
We fixed a couple of things in 1.0.1 that solve some reproducibility issues: #21195
But it means that the results will change between versions.
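
One way to answer the version question for future runs is to record the installed package versions next to each result. The following is only a minimal sketch using the standard library: the helper names `snapshot_versions` and `save_run_info` and the output file are hypothetical, not part of scikit-learn, and `importlib.metadata` needs Python 3.8+ (on the Python 3.7 reported above, the `importlib_metadata` backport would be needed instead).

```python
import json
import sys
from importlib import metadata


def snapshot_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def save_run_info(path, packages=("scikit-learn", "numpy", "scipy")):
    """Write the Python version and package versions to a JSON file."""
    info = {
        "python": sys.version.split()[0],
        "packages": snapshot_versions(packages),
    }
    with open(path, "w") as fh:
        json.dump(info, fh, indent=2)
    return info
```

Calling something like `save_run_info("run_info.json")` right after fitting the model would make it possible to compare the environments of two runs later.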

ChristyQu · 2021-11-23T16:50:13Z

I didn't pay much attention to the version of sklearn in the first run, but I've run the code again using version 1.0 and it outputs the same result as version 1.0.1.
The issue we had happened around Nov 17, which I believe is after version 1.0.1 was released (correct me if I am wrong).

ChristyQu · 2021-11-24T05:28:04Z

Issue fixed: I was able to get the previous result using an older version (0.22.1).
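
Since the result depends on the installed version, a script could refuse to run when the version differs from the one that produced the reference result. This is a rough sketch under that assumption: `require_version` is a hypothetical helper, the `lookup` parameter exists only to make it easy to test, and `importlib.metadata` again assumes Python 3.8+.

```python
from importlib import metadata


def require_version(package, expected, lookup=metadata.version):
    """Return the installed version, or raise if it differs from `expected`."""
    try:
        installed = lookup(package)
    except metadata.PackageNotFoundError:
        raise RuntimeError(f"{package} is not installed") from None
    if installed != expected:
        raise RuntimeError(
            f"{package} {installed} is installed, but results were "
            f"produced with {expected}; clusters may differ"
        )
    return installed
```

For the case in this thread, the call would be `require_version("scikit-learn", "0.22.1")` at the top of the script, before fitting KMeans.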

ChristyQu added the Bug: triage label Nov 23, 2021

ChristyQu closed this as completed Nov 24, 2021
