[Undocumented?] KMeans behavior change between v1.2.2 and v1.3.0 #30643
Comments
Hi @stu-blair, it's very likely due to PR #25752, which is documented in the 1.3.0 changelog. The individual results may differ because the random number generation is different, but they are the same in expectation.
Many thanks @jeremiedbb. Is there any way to work around this change to maintain backward compatibility? I see @glemaitre had a suggestion on how to maintain backward compatibility here: #25752 (comment), and an idea here also: #27991 (comment). Reproducibility/backward compatibility is extremely important for scientific packages.
I don't think it's worth the maintenance burden. The results are not numerically the same but are as valid as before, just a difference in rng. This came from a bug fix, and I think a change of behavior like this one is acceptable when we're fixing a bug at the same time. Keeping backward compat for this kind of change would add unnecessary complexity to the code-base imo.
Just to add to this, the clusters found are the same; they are just in a different order. I think that is less bad than if the centers had changed (even slightly). In terms of compatibility, I'm not sure the order of clusters has any meaning. If you do care about it, maybe you could sort the cluster centers after the fact?
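A minimal sketch of the "sort the cluster centers after the fact" idea, for anyone landing here. The helper name `canonicalize_clusters` is hypothetical and not part of scikit-learn; it assumes you pass in the fitted estimator's `cluster_centers_` and `labels_` arrays. It only makes the cluster numbering independent of the order in which clusters were found, so it helps when the clusters themselves are the same across versions, as reported above. The demo uses hand-written arrays rather than real `KMeans` output.

```python
import numpy as np


def canonicalize_clusters(centers, labels):
    """Reorder clusters so that centers are sorted coordinate by coordinate.

    Hypothetical helper (not a scikit-learn API). Returns the reordered
    centers and the labels renumbered to match the new order.
    """
    centers = np.asarray(centers)
    labels = np.asarray(labels)
    # np.lexsort treats the LAST key as the primary one, so reverse the
    # columns to sort by the first coordinate, then the second, and so on.
    order = np.lexsort(centers.T[::-1])
    # remap[old_label] gives the new label of that cluster.
    remap = np.empty_like(order)
    remap[order] = np.arange(len(order))
    return centers[order], remap[labels]


# Two runs that found the same clusters but numbered them differently
# (made-up values for illustration only).
centers_a = np.array([[5.0, 5.0], [0.0, 0.0], [0.0, 3.0]])
labels_a = np.array([0, 1, 2, 1, 0])
centers_b = np.array([[0.0, 3.0], [5.0, 5.0], [0.0, 0.0]])
labels_b = np.array([1, 2, 0, 2, 1])

sorted_a, canon_a = canonicalize_clusters(centers_a, labels_a)
sorted_b, canon_b = canonicalize_clusters(centers_b, labels_b)
```

After this step both runs yield the same centers array and the same label vector. One caveat: if two centers are nearly tied on the leading coordinates, tiny floating-point differences between versions could still flip their order, so a more robust key (for example, rounding before sorting) may be needed in that case.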
As a scikit-learn maintainer, I would say that in scikit-learn we are conservative and try hard to avoid changing results silently from one major release to the next (e.g. 1.2 to 1.3). Having said this, it could well be the case that we are still changing results too often for users that have a strong focus on stability/reproducibility ... In this particular case, I think the ship sailed a long time ago, since 1.3 was released in October 2023 and the latest release is 1.6.1 (released January 2025). It is unlikely that we are going to revert this change, so I am going to close this one. Even if I am mostly a maintainer, I can also find myself in the position of a user and be like "but why did they change this?". The latest annoying thing I remember off the top of my head is a scipy sparse array change that affected scikit-learn, see scipy/scipy#18509 (comment) if you are really curious. Sure, I can be mildly annoyed about this kind of thing, but then I remember we are all kind of trying our best with the resources we have ...
Describe the bug
When upgrading scikit-learn from 1.1.1 to 1.6.0, I noticed my script's results were completely changing, despite using the same code, data, and random seeds. I narrowed this down to v1.3.0. I tried setting every parameter I could find, but the results are still different.
KMeans produces clusters in different orders before and after the 1.3.0 release. The results are deterministic if you set the random seed, in both 1.2.2 and 1.3.0, but the ordering changes upon upgrade. Can someone please help me identify what change causes this, and whether there is any way to get the post-1.3.0 behavior to be consistent with the prior behavior?
Steps/Code to Reproduce
I ran the following code in both v1.2.2 and v1.3.0
Expected Results
In v1.2.2 (and earlier versions 1.1.1 and 1.2.0) the results are:
Actual Results
But in v1.3.0 (and later versions) the results are:
Versions