[Undocumented?] KMeans behavior change between v1.2.2 and v1.3.0 #30643

Closed
stu-blair opened this issue Jan 13, 2025 · 5 comments

@stu-blair

Describe the bug

When upgrading scikit-learn from 1.1.1 to 1.6.0, I noticed that my script's results changed completely, despite using the same code, data, and random seeds. I narrowed the change down to v1.3.0. I tried setting every parameter I could find, but the results still differ: KMeans returns the clusters in a different order before and after the 1.3.0 release. With the random seed set, the results are deterministic in both 1.2.2 and 1.3.0, but the ordering changes upon upgrade.

Can someone please help me identify what change causes this, and if there is any way to get the post-1.3.0 behavior to be consistent with the prior behavior?

Steps/Code to Reproduce

I ran the following code in both v1.2.2 and v1.3.0:

import sklearn
from sklearn.cluster import KMeans


def do_kmeans_test():
    # Nine one-dimensional samples
    data = [[2.614], [2.614], [-0.493], [-0.489], [-0.489],
            [4.965], [2.518], [0.518], [0.201]]
    kmeans_pca = KMeans(
        n_clusters=3, init="k-means++", random_state=1234, n_init=10,
        max_iter=300, verbose=False, tol=1e-4, copy_x=True, algorithm="lloyd",
    )
    kmeans_pca.fit(data)
    print("sklearn version:", sklearn.__version__)
    print("cluster centers:", kmeans_pca.cluster_centers_)


do_kmeans_test()

Expected Results

In v1.2.2 (and earlier versions 1.1.1 and 1.2.0) the results are:

sklearn version: 1.2.2
cluster centers: [[-0.1504]
 [ 4.965 ]
 [ 2.582 ]]

Actual Results

But in v1.3.0 (and later versions) the results are:

sklearn version: 1.3.0
cluster centers: [[ 2.582 ]
 [-0.1504]
 [ 4.965 ]]

Versions

in the 1.2.2 test:

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.8.12 (default, Jan 10 2025, 11:02:07)  [Clang 16.0.0 (clang-1600.0.26.4)]
executable: /Users/sblair/.pyenv/versions/3.8.12/bin/python
   machine: macOS-15.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 21.1.1
   setuptools: 56.0.0
        numpy: 1.19.5
        scipy: 1.10.1
       Cython: None
       pandas: 1.4.0
   matplotlib: 3.7.5
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libomp
       filepath: /Users/sblair/.pyenv/versions/3.8.12/lib/python3.8/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/sblair/.pyenv/versions/3.8.12/lib/python3.8/site-packages/numpy/.dylibs/libopenblas.0.dylib
        version: 0.3.13
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/sblair/.pyenv/versions/3.8.12/lib/python3.8/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell



in the 1.3.0 test:

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.8.12 (default, Jan 10 2025, 11:02:07)  [Clang 16.0.0 (clang-1600.0.26.4)]
executable: /Users/sblair/.pyenv/versions/3.8.12/bin/python
   machine: macOS-15.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.3.0
          pip: 21.1.1
   setuptools: 56.0.0
        numpy: 1.19.5
        scipy: 1.10.1
       Cython: None
       pandas: 1.4.0
   matplotlib: 3.7.5
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libomp
       filepath: /Users/sblair/.pyenv/versions/3.8.12/lib/python3.8/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/sblair/.pyenv/versions/3.8.12/lib/python3.8/site-packages/numpy/.dylibs/libopenblas.0.dylib
        version: 0.3.13
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/sblair/.pyenv/versions/3.8.12/lib/python3.8/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell

@stu-blair added the Bug and Needs Triage labels Jan 13, 2025
@jeremiedbb
Member

Hi @stu-blair, it's very likely due to PR #25752, which is documented in the 1.3.0 changelog. The individual results may be different because the random number generation is different, but they are the same in expectation.
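
For the two outputs reported above, one quick check is whether the centers agree once their order is ignored. The following is only a minimal sketch: the helper name `same_centers_up_to_order` is hypothetical (not part of scikit-learn), and the center values are copied from the printed results in this issue.

```python
import numpy as np


def same_centers_up_to_order(a, b, atol=1e-8):
    """Return True if the rows of a and b match after sorting them.

    Hypothetical helper for illustration only. Rows are sorted
    lexicographically (first feature is the primary key), so it is best
    suited to well-separated centers.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # np.lexsort treats the LAST key as primary, hence the reversed features.
    a_sorted = a[np.lexsort(a.T[::-1])]
    b_sorted = b[np.lexsort(b.T[::-1])]
    return bool(np.allclose(a_sorted, b_sorted, atol=atol))


# Values as printed in the report for each version.
centers_1_2_2 = [[-0.1504], [4.965], [2.582]]
centers_1_3_0 = [[2.582], [-0.1504], [4.965]]

print(same_centers_up_to_order(centers_1_2_2, centers_1_3_0))  # True
```

A comparison like this only says the fitted centers coincide; it says nothing about which integer label each cluster received.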

@jeremiedbb removed the Bug and Needs Triage labels Jan 14, 2025
@stu-blair
Author

Many thanks @jeremiedbb. Is there any way to work around this change to maintain backwards compatibility?

I see @glemaitre had a suggestion on how to maintain backwards compatibility here: #25752 (comment)

And an idea here also #27991 (comment)

Reproducibility/backwards compatibility is extremely important for scientific packages.

@jeremiedbb
Member

I don't think it's worth the maintenance burden. The results are not numerically the same but are as valid as before; the only difference is in the rng. This came from a bug fix, and I think a change of behavior like this one is acceptable when we're fixing a bug at the same time. Keeping backward compat for this kind of change would add unnecessary complexity to the code-base imo.

@betatim
Member

betatim commented Jan 17, 2025

Just to add to this, the clusters found are the same; they are just in a different order, which I think is less bad than if the centers had changed (slightly). In terms of compatibility, I'm not sure there is any meaning to the order of clusters. If you care about it, maybe you could sort the cluster centers after the fact?
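
Sorting after the fact could look roughly like the sketch below. It assumes that "canonical order" means ascending by center coordinate (first feature first), which is an arbitrary choice; `fit_kmeans_sorted` is a hypothetical wrapper, not a scikit-learn function. It only uses the public `cluster_centers_` and `labels_` attributes and remaps the labels so they stay consistent with the reordered centers.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_kmeans_sorted(data, **kmeans_kwargs):
    """Fit KMeans, then renumber clusters so the centers are in ascending order.

    Hypothetical helper for illustration; the ordering rule is an assumption.
    """
    X = np.asarray(data, dtype=float)
    km = KMeans(**kmeans_kwargs).fit(X)

    # Lexicographic order of the centers; np.lexsort uses the LAST key as
    # primary, so the feature axis is reversed to make feature 0 primary.
    order = np.lexsort(km.cluster_centers_.T[::-1])
    centers = km.cluster_centers_[order]

    # remap[old_label] = new_label, so labels keep pointing at the same cluster.
    remap = np.empty_like(order)
    remap[order] = np.arange(order.size)
    labels = remap[km.labels_]
    return centers, labels


data = [[2.614], [2.614], [-0.493], [-0.489], [-0.489],
        [4.965], [2.518], [0.518], [0.201]]
centers, labels = fit_kmeans_sorted(
    data, n_clusters=3, init="k-means++", random_state=1234, n_init=10
)
print("sorted centers:", centers.ravel())
print("remapped labels:", labels)
```

With a rule like this, the reported order no longer depends on which centroid the initialization happened to pick first, as long as both versions converge to the same set of centers.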

@lesteve
Member

lesteve commented Jan 21, 2025

> Reproducibility/backwards compatibility is extremely important for scientific packages.

As a scikit-learn maintainer, I would say that in scikit-learn we are conservative and try hard to avoid changing results silently from one major release to the next (e.g. 1.2 to 1.3). Having said this, it could well be the case that we are still changing results too often for users that have a strong focus on stability/reproducibility ...

In this particular case, I think the ship has sailed a long time ago since 1.3 was released in October 2023 and the latest release is 1.6.1 (released January 2025). It is unlikely that we are going to revert this change, so I am going to close this one.

Even if I am mostly a maintainer, I can also find myself in the position of a user and be like "but why did they change this?". The latest annoying thing I remember off the top of my head is a scipy sparse array change that affected scikit-learn; see scipy/scipy#18509 (comment) if you are really curious. Sure, I can be mildly annoyed about this kind of thing, but then I remember we are all kind of trying our best with the resources we have ...

@lesteve closed this as not planned Jan 21, 2025