More stable n_init runs for KMeans #20200
Conversation
This PR does indeed fix it on my machine, thanks! Just out of curiosity, a few questions:
1e-7 is close to the eps of float32 (~1.19e-7), so it should be fine for both float32 and float64. On the other hand, it's small enough that a new run improving the inertia by less than that is such a small improvement that we can discard it.
I wonder how it would perform in extreme cases (e.g. points that produce an inertia lower than 1e-7, where discarding a run that way may matter)?
If I understand correctly, the failure observed in #20199 happens with float64 data and computation, meaning that the (accumulated) rounding error level in our inertia implementation is orders of magnitude above the machine precision for float64. Therefore it could be that 1e-7 would not be enough to protect against a similar instability on float32 data. I think we should include a reproducer for the problem. WDYT?
I am not sure |
I changed the check to use the relative difference of the inertia, which makes it independent of its scale.
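To illustrate why that matters (a standalone sketch, not code from the PR): an absolute tolerance of 1e-7 is negligible for large inertias and dominant for tiny ones, whereas a relative tolerance behaves the same at every scale.

```python
# A new run that improves the inertia only at rounding-noise level
# should never be selected, whatever the scale of the inertia.
for best in (1e6, 1.0, 1e-8):
    new = best * (1 - 1e-9)  # improvement far below rounding noise
    absolute = new < best - 1e-7        # scale-dependent check
    relative = new < best * (1 - 1e-6)  # scale-independent check
    print(f"{best:g}: absolute={absolute}, relative={relative}")
```

Only the absolute check changes its answer with the scale: it selects the noise-level run at large scales and can never select any run when the inertia is below 1e-7.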
Right. After consideration, it may also give some inconvenient behavior to users.
@jeremiedbb That should work in most cases. The only exception I can think of is due to the nature of floating point itself, but in this case ...
I changed 1e-7 to 1e-6 to be bigger than the eps of float32, which is ~1.19e-7.
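For reference (standard NumPy, not part of the diff), the two machine epsilons being discussed:

```python
import numpy as np

print(np.finfo(np.float32).eps)  # 1.1920929e-07
print(np.finfo(np.float64).eps)  # 2.220446049250313e-16
```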
I guess it's not easy to come up with a non-regression test for this one. Assuming that it makes the originally failing common tests pass 100 out of 100 manual runs on a machine with more than 2 cores, then LGTM.
A snippet adapted from #20199 that reproduces the problem:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)
# X = X.astype('float32')
# y = y.astype('float32')

kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)
ref_cluster_centers = kmeans.fit(X, y).cluster_centers_
print('reference cluster centers:\n', ref_cluster_centers)
print('-' * 80)

for i in range(1000):
    cluster_centers = kmeans.fit(X, y).cluster_centers_
    if not np.allclose(ref_cluster_centers, cluster_centers):
        raise RuntimeError(f'differing cluster centers:\n{cluster_centers}')
```
Side-comment: maybe it is worth mentioning somewhere that the efficiency gains in KMeans come at the cost of non-deterministic results. This may have been discussed already, and/or maybe it is just me who does not know this stuff very well and was surprised by this ...
It's not only the inertia. It's actually possible, though very unlikely, to get a different clustering between 2 "identical" runs, due to a different accumulation of rounding errors caused by the parallelism.
When reading this it sounds like one can get very different inertias, but in fact they can only vary by a few machine epsilons at most. I'm not sure it's worth mentioning, since we always consider results of floating point operations that vary by eps to be the same (e.g. assert_allclose).
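As an aside, summation order alone is enough to observe this (a standalone illustration, not scikit-learn code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# The same numbers summed in a different grouping accumulate rounding
# errors differently, just like a parallel reduction would.
s_flat = np.sum(x)
s_chunked = sum(np.sum(chunk) for chunk in np.array_split(x, 8))

print(abs(s_flat - s_chunked))   # tiny but typically non-zero
print(np.finfo(np.float64).eps)  # ~2.22e-16, for comparison
```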
LGTM, I have not been able to make the tests fail either.
```python
# allow small tolerance on the inertia to accommodate for
# non-deterministic rounding errors due to parallel computation
if best_inertia is None or inertia < best_inertia * (1 - 1e-6):
```
Noting that this is just above the float32 eps, as you mentioned, might be informative here.
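For example, the comment could be extended along these lines (a suggested wording, not the PR's actual text):

```python
# allow small tolerance on the inertia to accommodate for
# non-deterministic rounding errors due to parallel computation.
# 1e-6 is just above the eps of float32 (~1.19e-7), so the relative
# tolerance remains meaningful for float32 inertias as well.
if best_inertia is None or inertia < best_inertia * (1 - 1e-6):
```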
Fixes #20199. Also related to #19403.
Fixes #20216
Sometimes a new run is selected as the best one although it yields exactly the same clustering as the best run so far, due to rounding errors in the inertia computation. This PR allows a small tolerance on the inertia so that a new run is selected only if its inertia is better beyond rounding errors.
It's hard to add a test: the issue is not triggered in the CI but has been observed locally, and it comes from rounding errors which may differ between machines. I guess that in this case the fact that it no longer appears locally is enough.
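A minimal numerical sketch of the failure mode (the inertia values are made up for illustration):

```python
# Two n_init runs converge to the same clustering, but parallel
# reductions accumulate rounding errors differently, so the computed
# inertias can differ in the last bits.
inertia_run_1 = 0.5483377360510243
inertia_run_2 = 0.5483377360510241  # same clustering, noise-level lower

# Strict comparison: run 2 "wins" on rounding noise alone, so the
# selected centers depend on the non-deterministic accumulation order.
print(inertia_run_2 < inertia_run_1)               # True
# With the relative tolerance, the difference is discarded as noise.
print(inertia_run_2 < inertia_run_1 * (1 - 1e-6))  # False
```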