Different cluster result with same random_state #21749

Closed
@ChristyQu

Describe the bug

Hello,

We're a team working on a clustering problem. To make the result reproducible, we set random_state=0 in KMeans(). However, we got a different result when we ran the same code on the same data a week later. We're unsure why this happened and still can't reproduce the earlier result.

Steps/Code to Reproduce

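import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# `data` and `demo` are pandas objects from our project (not included in this report).

# Standardize the features before clustering.
scaler = StandardScaler()
scaled_km_X = scaler.fit_transform(data)

# Fit KMeans with a fixed seed.
model = KMeans(n_clusters=4, init="k-means++", random_state=0).fit(scaled_km_X)
loss = model.inertia_
data_labels = model.labels_ + 1  # shift labels from 0..3 to 1..4

# Attach the cluster number to the original columns.
data_pred = pd.concat([data, demo], axis=1)
data_pred["Cluster_number"] = data_labels
data_pred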

Expected Results

Expected number of observations in each cluster:

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
212 | 65 | 242 | 312

Actual Results

Now:

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
321 | 67 | 192 | 251
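
The integer assigned to each KMeans cluster carries no meaning of its own, so comparing the two tables column by column can overstate the difference. The following is a minimal sketch, using only the counts reported above, that pairs the clusters by size rank instead. `compare_sizes` is a hypothetical helper, and rank pairing is only a heuristic: similar sizes do not prove that the same observations ended up together.

```python
# Cluster sizes copied from the two tables in this report.
expected_sizes = [212, 65, 242, 312]
actual_sizes = [321, 67, 192, 251]


def compare_sizes(before, after):
    """Pair cluster sizes by rank, ignoring label numbers (hypothetical helper)."""
    if len(before) != len(after):
        raise ValueError("both runs must have the same number of clusters")
    # Sorting removes the dependence on which integer label each cluster got.
    return [(b, a, a - b) for b, a in zip(sorted(before), sorted(after))]


pairs = compare_sizes(expected_sizes, actual_sizes)
for before, after, diff in pairs:
    print(f"{before:>4} -> {after:>4} (change {diff:+d})")

# Both runs should cover the same number of observations.
print("totals:", sum(expected_sizes), sum(actual_sizes))
```

Paired this way, the two runs differ by at most 20 observations per cluster, which suggests the label order changed in addition to a smaller shift in the assignments themselves. A stricter check would compare the per-row labels of both runs, which are not available here.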

Versions

System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True
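
The listing above describes only the current environment; nothing was recorded for the run a week earlier. As a minimal sketch, a version snapshot like the one below could be stored next to each set of results so that later runs can be compared. It assumes Python 3.8 or newer (the interpreter above is 3.7.12, where `importlib.metadata` is not in the standard library). `snapshot_versions` is a hypothetical helper, and `scikit-learn` is the distribution name for what the listing calls `sklearn`.

```python
from importlib import metadata  # standard library in Python 3.8+

# Package names taken from the "Python dependencies" list in this report.
PACKAGES = ["scikit-learn", "numpy", "scipy", "joblib", "threadpoolctl"]


def snapshot_versions(packages):
    """Return {package: version or None} for the current environment (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


snapshot = snapshot_versions(PACKAGES)
for name, version in snapshot.items():
    print(f"{name}: {version}")
```

This does not show that a version change caused the difference; it only makes that possibility checkable for future runs.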
