More stable n_init runs for KMeans #20200

Merged
merged 4 commits into scikit-learn:main on Jun 22, 2021

Conversation

jeremiedbb
Member

@jeremiedbb jeremiedbb commented Jun 3, 2021

Fixes #20199. Also related to #19403.
Fixes #20216

Sometimes a new run is selected as the best run although it produces exactly the same clustering as the best run so far, due to rounding errors in the inertia computation. This PR allows a small tolerance on the inertia so that a new run is selected only if its inertia is better beyond rounding errors.
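
For illustration, a minimal sketch of the selection logic the PR converges on (the relative-tolerance form discussed below); the function name and structure are placeholders, not the actual scikit-learn code path:

def select_best_run(runs, rtol=1e-6):
    # runs: iterable of (labels, centers, inertia) tuples, one per n_init run.
    # A candidate replaces the current best only if its inertia improves on the
    # best one by more than the relative tolerance, so that differences caused
    # purely by rounding noise keep the first run encountered.
    best_labels = best_centers = best_inertia = None
    for labels, centers, inertia in runs:
        if best_inertia is None or inertia < best_inertia * (1 - rtol):
            best_labels, best_centers, best_inertia = labels, centers, inertia
    return best_labels, best_centers, best_inertia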

It's hard to add a test since the issue is not triggered in the CI and has only been observed locally, and it comes from rounding errors which might differ between machines. I guess that in this case the fact that it no longer appears locally is enough.

@lesteve
Member

lesteve commented Jun 3, 2021

This PR does fix it on my machine indeed, thanks!

Just out of curiosity a few questions:

  • is there a way to set the tolerance, i.e. how magic is 1e-7?
  • does it not depend on the input data type (float32 or float64)?

@jeremiedbb
Member Author

1e-7 is a little above the eps of float32, so it should be fine for both. On the other hand, it's small enough that a new run improving the inertia by less than that is such a small improvement that we can discard it.

@michalkrawczyk
Contributor

I wonder how it would perform in extreme cases (e.g. points that produce an inertia lower than 1e-7, where omitting a cluster that way may matter)?
I would like to suggest also testing with self._tol, as it would let the user decide how 'sensitive' the assessment of new clusters should be

@ogrisel
Member

ogrisel commented Jun 7, 2021

1e-7 is a little above the eps of float32 so it should be fine for both.

If I understand correctly, the failure observed in #20199 happens with float64 data and computation, meaning that the (accumulated) rounding error level in our inertia implementation is orders of magnitude above the machine precision for float64.

Therefore it could be the case that the 1e-7 would not be enough to protect against a similar instability on float32 data.

I think we should include a reproducer for the problem that KMeans()-check_transformer_data_not_an_array revealed, make it run with both 32-bit and 64-bit floats, check that we can reproduce the problem with both, and check that 1e-7 fixes the problem for both.

WDYT?

@ogrisel
Member

ogrisel commented Jun 7, 2021

I would like to suggest also testing with self._tol, as it would let the user decide how 'sensitive' the assessment of new clusters should be

I am not sure tol should be used there. It would make the influence of tol more complex to study. I would rather keep tol restricted to the convergence criterion of the main KMeans loop.

@jeremiedbb
Member Author

jeremiedbb commented Jun 7, 2021

I changed it to check the relative difference of the inertia, which makes it independent of its scale
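
To illustrate why a relative comparison makes the check independent of the inertia scale (illustrative snippet only, not the PR diff):

rtol = 1e-6
for scale in (1e-8, 1.0, 1e8):
    best_inertia = 123.456789 * scale
    rounding_noise = best_inertia * (1 - 1e-9)    # same clustering up to rounding
    real_improvement = best_inertia * (1 - 1e-3)  # genuinely better clustering
    # The noise-level candidate is rejected while the real improvement is
    # accepted, regardless of the scale of the inertia.
    print(scale,
          rounding_noise < best_inertia * (1 - rtol),      # False
          real_improvement < best_inertia * (1 - rtol))    # True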

@michalkrawczyk
Contributor

I would like to suggest also testing with self._tol, as it would let the user decide how 'sensitive' the assessment of new clusters should be

I am not sure tol should be used there. It would make the influence of tol more complex to study. I would rather keep tol restricted to the convergence criterion of the main KMeans loop.

Right. After consideration, it may also lead to some inconvenient behavior for users.

@michalkrawczyk
Contributor

I changed it to check the relative difference of the inertia, which makes it independent of its scale

@jeremiedbb That should work in most cases. The only exception I can think of comes from the nature of floats themselves, but in that case best_inertia * (1 - 1e-7) should in most cases change its value to 0, which in my opinion wouldn't cause problems

@jeremiedbb
Member Author

I changed 1e-7 to 1e-6 so that it is bigger than the eps of float32, which is ~1.19e-7
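
For reference, a quick check of the machine epsilons involved (nothing scikit-learn specific):

import numpy as np

print(np.finfo(np.float32).eps)  # ~1.1920929e-07
print(np.finfo(np.float64).eps)  # ~2.220446e-16
# 1e-6 sits comfortably above the float32 eps, so the relative tolerance
# cannot be swallowed by a single rounding step in either dtype.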

Member

@ogrisel ogrisel left a comment


I guess it's not easy to come up with a non-regression test for this one. Assuming that it makes the originally failing common test pass 100 out of 100 manual runs on a machine with more than 2 cores, then LGTM.

@lesteve
Member

lesteve commented Jun 8, 2021

A snippet adapted from #20199 that reproduces the problem (calling KMeans.fit many times can give cluster centers that are swapped because there is a tiny difference in inertia):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)
# X = X.astype('float32')
# y = y.astype('float32')

kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)

ref_cluster_centers = kmeans.fit(X, y).cluster_centers_
print('reference cluster centers:\n', ref_cluster_centers)
print('-'*80)

for i in range(1000):
    cluster_centers = kmeans.fit(X, y).cluster_centers_
    if not np.allclose(ref_cluster_centers, cluster_centers):
        raise RuntimeError(f'differing cluster centers:\n{cluster_centers}')

  • float64: it fails on main, does not fail with this PR (1000 different fits)
  • float32 (uncomment the astype lines in the snippet): it does not fail on main

@lesteve
Member

lesteve commented Jun 9, 2021

Side-comment: maybe it is worth mentioning somewhere that the efficiency gains in KMeans come at the cost of non-deterministic results. For example, kmeans.inertia_ can vary between runs on the same machine (even when setting random_state, which is the naive approach to getting deterministic results).

This may have been discussed already, and/or maybe it is just me not knowing this stuff very well and being surprised by it ...

@jeremiedbb
Member Author

It's not only the inertia. Actually it's possible, although very unlikely, to get a different clustering between 2 "identical" runs, due to a different accumulation of rounding errors caused by the parallelism.

kmeans.inertia_ can vary between runs

Reading this, it sounds like one could get very different inertia values, but in fact they can only vary by at most a few machine epsilons. I'm not sure it's worth mentioning that, since we always consider results of floating point operations that vary by eps to be the same (e.g. assert_allclose).
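
As a small illustration of how the reduction order alone can shift a floating point result by a few machine epsilons (hypothetical example, not the actual scikit-learn code path):

import numpy as np

rng = np.random.RandomState(0)
x = rng.random_sample(100000)

total_sequential = np.sum(x)
# Summing the same values in a different chunking, as a parallel backend might:
total_chunked = sum(np.sum(chunk) for chunk in np.array_split(x, 7))

rel_diff = abs(total_sequential - total_chunked) / total_sequential
print(rel_diff, np.finfo(np.float64).eps)  # the difference is typically a few eps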

Member

@jjerphan jjerphan left a comment


LGTM, I have not been able to make the tests fail either.

Comment on lines +1060 to +1062
# allow small tolerance on the inertia to accommodate for
# non-deterministic rounding errors due to parallel computation
if best_inertia is None or inertia < best_inertia * (1 - 1e-6):
Member


Noting that this is just above the float32 eps, as you mentioned, might be informative here.

@lesteve lesteve merged commit 572c2cb into scikit-learn:main Jun 22, 2021
@adrinjalali adrinjalali mentioned this pull request Sep 3, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
5 participants