FIX Improve best run detection in kmeans when n_init > 1 #21195

jeremiedbb · 2021-09-29T13:16:56Z

Instead of relying on some tolerance to take rounding errors into account, we can simply check that the clustering of the new run is the same as the clustering of the best run so far (up to a permutation of the labels).

The check is really quick: ~200ms for 100_000_000 samples, which is negligible compared to a complete run of kmeans.
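For readers following along, here is a minimal pure-Python sketch of the idea described above. It is not the code from this PR: the names `same_clustering_up_to_permutation` and `pick_best_run` are hypothetical, and the two-way mapping (which also rejects merged clusters) is an assumption made for illustration. The selection loop only shows how such a check could replace a tolerance on the inertia.

```python
def same_clustering_up_to_permutation(labels_a, labels_b, n_clusters):
    """Return True if both labelings describe the same partition.

    Hypothetical helper: labels are ints in [0, n_clusters), and two
    labelings match if one is a relabeling (permutation) of the other.
    """
    if len(labels_a) != len(labels_b):
        return False
    forward = [-1] * n_clusters   # label in a -> label in b
    backward = [-1] * n_clusters  # label in b -> label in a
    for a, b in zip(labels_a, labels_b):
        if forward[a] == -1 and backward[b] == -1:
            # First time we see this pair of labels: record the mapping.
            forward[a] = b
            backward[b] = a
        elif forward[a] != b or backward[b] != a:
            # The pair contradicts a mapping recorded earlier.
            return False
    return True


def pick_best_run(runs, n_clusters):
    """Keep the lowest-inertia run, ignoring relabelings of the current best.

    `runs` is an iterable of (labels, inertia) pairs, one per initialization.
    A run with a slightly lower inertia but the same clustering as the best
    run so far is treated as a rounding artifact and does not replace it.
    """
    best_labels, best_inertia = None, None
    for labels, inertia in runs:
        if best_inertia is None or (
            inertia < best_inertia
            and not same_clustering_up_to_permutation(
                labels, best_labels, n_clusters
            )
        ):
            best_labels, best_inertia = labels, inertia
    return best_labels, best_inertia
```

In this sketch, a second run that returns `[1, 1, 0, 0]` with an inertia lower by `1e-7` does not replace a first run that returned `[0, 0, 1, 1]`, since both describe the same partition. The check is a single pass over the labels, which fits the low cost reported above.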

The non-regression test already exists: test_k_means_fit_predict (see the linked issue). It does not fail in the CI because the effect is quite hard to trigger, so I think checking that it now passes is enough. @lesteve could you confirm that?

lesteve · 2021-09-29T14:27:13Z

It seems to fix it indeed on my machine:

on this PR: no error when running 100 times one of the failing tests like this

for i in $(seq 1 100); do
    pytest sklearn/cluster/tests/test_k_means.py -k 'test_k_means_fit_predict and csr_matrix and float32 and 4-300 and full' -q; done

on main: with this same test I get roughly 1 failure for 4 runs

lesteve · 2021-09-29T15:14:53Z

I guess it would be great to have a few tests for _is_same_clustering.

jjerphan · 2021-09-30T12:46:51Z

> It seems to fix it indeed on my machine:
>
> * on this PR: no error when running 100 times one of the failing tests like this
>
> for i in $(seq 1 100); do
>     pytest sklearn/cluster/tests/test_k_means.py -k 'test_k_means_fit_predict and csr_matrix and float32 and 4-300 and full' -q; done
>
> * on `main`: with this same test I get roughly 1 failure for 4 runs

I also observe this on my machine.

jjerphan

LGTM, thank you @jeremiedbb!

sklearn/cluster/_kmeans.py

sklearn/cluster/tests/test_k_means.py

lesteve · 2021-10-01T15:00:25Z

LGTM, one thing I am wondering is whether the changelog should go in v1.0.rst or v1.1.rst ...

jeremiedbb · 2021-10-01T15:30:35Z

> LGTM, one thing I am wondering is whether the changelog should go in v1.0.rst or v1.1.rst ...

v1.0 because the next bug fix release will be 1.0.1

lesteve · 2021-10-05T12:39:51Z

Merging, thanks!

improve heuristics for choosing new run in kmeans

c39d670

github-actions bot added cython module:cluster labels Sep 29, 2021

make flake8 happy even if he's wrong on this point

5b11849

jeremiedbb added 2 commits September 29, 2021 17:54

add test for _is_same_clustering

3ab9509

change log entry

98e55b7

jjerphan approved these changes Sep 30, 2021

sklearn/cluster/_kmeans.py (resolved)

sklearn/cluster/tests/test_k_means.py (outdated, resolved)

more sanity check

2c8d09c

Merge branch 'master' into kmeans-more-stable-n-init

ba03217

Merge branch 'main' into kmeans-more-stable-n-init

42fb8cc

lesteve changed the title ~~FIX Improve better run detection in kmeans when n_init > 1~~ FIX Improve best run detection in kmeans when n_init > 1 Oct 5, 2021

lesteve merged commit 0a88cf8 into scikit-learn:main Oct 5, 2021

jeremiedbb added this to the 1.0.1 milestone Oct 5, 2021

jeremiedbb added the "To backport" label (PR merged in master that need a backport to a release branch defined based on the milestone) Oct 5, 2021

glemaitre mentioned this pull request Oct 23, 2021

Release 1.0.1 #21404

Merged

10 tasks

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021

FIX Improve best run detection in kmeans when n_init > 1 (scikit-lear…

3e634bc

…n#21195)

glemaitre pushed a commit that referenced this pull request Oct 25, 2021

FIX Improve best run detection in kmeans when n_init > 1 (#21195)

8b8e8b2

glemaitre mentioned this pull request Nov 23, 2021

Different cluster result with same random_state #21749

Closed

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

FIX Improve best run detection in kmeans when n_init > 1 (scikit-lear…

c9cec42

…n#21195)
