More stable n_init runs for KMeans #20200
Conversation
This PR does indeed fix it on my machine, thanks! Just out of curiosity, a few questions:
1e-7 is close to the eps of float32 (~1.19e-7), so it should be fine for both float32 and float64. On the other hand, it's small enough that a new run improving the inertia by less than that is such a small improvement that we can discard it.
I wonder how it would perform in extreme cases (e.g. points that produce an inertia lower than 1e-7, where discarding a run that way may matter)?
If I understand correctly, the failure observed in #20199 happens with float64 data and computation, meaning that the (accumulated) rounding error level in our inertia implementation is orders of magnitude above the machine precision for float64. Therefore it could be that 1e-7 would not be enough to protect against a similar instability on float32 data. I think we should include a reproducer for the problem. WDYT?
I am not sure |
I changed the check to use the relative difference of the inertia, which makes it independent of its scale.
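To illustrate why that matters (a standalone sketch, not code from the PR): an absolute tolerance of 1e-7 is negligible for large inertias and dominant for tiny ones, whereas a relative tolerance behaves the same at every scale.

```python
# A new run that improves the inertia only at rounding-noise level
# should never be selected, whatever the scale of the inertia.
for best in (1e6, 1.0, 1e-8):
    new = best * (1 - 1e-9)  # improvement far below rounding noise
    absolute = new < best - 1e-7        # scale-dependent check
    relative = new < best * (1 - 1e-6)  # scale-independent check
    print(f"{best:g}: absolute={absolute}, relative={relative}")
```

Only the absolute check changes its answer with the scale: it selects the noise-level run at large scales and can never select any run when the inertia is below 1e-7.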
Right. After consideration, it may also give some inconvenient behavior to users.
@jeremiedbb That should work in most cases. The only exception I can think of is due to the nature of floating point itself, but in this case ...
I changed 1e-7 to 1e-6 to be bigger than the eps of float32, which is ~1.19e-7.
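For reference (standard NumPy, not part of the diff), the two machine epsilons being discussed:

```python
import numpy as np

print(np.finfo(np.float32).eps)  # 1.1920929e-07
print(np.finfo(np.float64).eps)  # 2.220446049250313e-16
```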
I guess it's not easy to come up with a non-regression test for this one. Assuming that it makes the originally failing common tests pass 100 out of 100 manual runs on a machine with more than 2 cores, then LGTM.
A snippet adapted from #20199 that reproduces the problem:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)
# X = X.astype('float32')
# y = y.astype('float32')

kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)
ref_cluster_centers = kmeans.fit(X, y).cluster_centers_
print('reference cluster centers:\n', ref_cluster_centers)
print('-' * 80)

for i in range(1000):
    cluster_centers = kmeans.fit(X, y).cluster_centers_
    if not np.allclose(ref_cluster_centers, cluster_centers):
        raise RuntimeError(f'differing cluster centers:\n{cluster_centers}')
```
Side-comment: maybe it is worth mentioning somewhere that the efficiency gains in KMeans come at the cost of non-deterministic results. This may have been discussed already, and/or maybe it is just me who does not know this stuff very well and was surprised by this ...
It's not only the inertia. It's actually possible, though very unlikely, to get a different clustering between 2 "identical" runs, due to a different accumulation of rounding errors caused by the parallelism.
When reading this it sounds like one can get very different inertias, but in fact they can only vary by a few machine epsilons at most. I'm not sure it's worth mentioning, since we always consider results of floating point operations that vary by eps to be the same (e.g. assert_allclose).
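As an aside, summation order alone is enough to observe this (a standalone illustration, not scikit-learn code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# The same numbers summed in a different grouping accumulate rounding
# errors differently, just like a parallel reduction would.
s_flat = np.sum(x)
s_chunked = sum(np.sum(chunk) for chunk in np.array_split(x, 8))

print(abs(s_flat - s_chunked))   # tiny but typically non-zero
print(np.finfo(np.float64).eps)  # ~2.22e-16, for comparison
```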
LGTM, I have not been able to make the tests fail either.
```python
# allow small tolerance on the inertia to accommodate for
# non-deterministic rounding errors due to parallel computation
if best_inertia is None or inertia < best_inertia * (1 - 1e-6):
```
Noting that this is just above the float32 eps, as you mentioned, might be informative here.
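For example, the comment could be extended along these lines (a suggested wording, not the PR's actual text):

```python
# allow small tolerance on the inertia to accommodate for
# non-deterministic rounding errors due to parallel computation.
# 1e-6 is just above the eps of float32 (~1.19e-7), so the relative
# tolerance remains meaningful for float32 inertias as well.
if best_inertia is None or inertia < best_inertia * (1 - 1e-6):
```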
Fixes #20199. Also related to #19403.
Fixes #20216
Sometimes a new run is selected as the best one although it yields exactly the same clustering as the best run so far, due to rounding errors in the inertia computation. This PR allows a small tolerance on the inertia so that a new run is selected only if its inertia is better beyond rounding errors.
It's hard to add a test: the issue is not triggered in the CI but has been observed locally, and it comes from rounding errors which may differ between machines. I guess that in this case the fact that it no longer appears locally is enough.
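A minimal numerical sketch of the failure mode (the inertia values are made up for illustration):

```python
# Two n_init runs converge to the same clustering, but parallel
# reductions accumulate rounding errors differently, so the computed
# inertias can differ in the last bits.
inertia_run_1 = 0.5483377360510243
inertia_run_2 = 0.5483377360510241  # same clustering, noise-level lower

# Strict comparison: run 2 "wins" on rounding noise alone, so the
# selected centers depend on the non-deterministic accumulation order.
print(inertia_run_2 < inertia_run_1)               # True
# With the relative tolerance, the difference is discarded as noise.
print(inertia_run_2 < inertia_run_1 * (1 - 1e-6))  # False
```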