TST Speed-up test_predict for KMeans #23274
Conversation
@@ -621,19 +591,28 @@ def test_score_max_iter(Estimator):
@pytest.mark.parametrize(
    "array_constr", [np.array, sp.csr_matrix], ids=["dense", "sparse"]
)
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
@pytest.mark.parametrize("init", ["random", "k-means++"])
The `init` is irrelevant to this test. I removed the parametrisation and just used `"random"`. `"k-means++"` was actually the main reason these tests were slow.
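As a rough illustration of the saving (hypothetical parameter grids mirroring the `parametrize` decorators in the diff above), each stacked `@pytest.mark.parametrize` multiplies the number of generated test cases, so dropping the two-value `init` dimension halves the runs before even accounting for the extra cost of `"k-means++"` itself:

```python
from itertools import product

# Hypothetical parameter grids mirroring the decorators in the diff;
# this is not scikit-learn code, it just counts generated test cases.
array_constrs = ["dense", "sparse"]
dtypes = ["float32", "float64"]
inits = ["random", "k-means++"]  # the dimension removed in this PR
estimators = ["KMeans/lloyd", "KMeans/elkan", "MiniBatchKMeans"]

n_before = len(list(product(array_constrs, dtypes, inits, estimators)))
n_after = len(list(product(array_constrs, dtypes, estimators)))
print(n_before, n_after)  # 24 12
```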
X, _ = make_blobs(
    n_samples=200, n_features=10, centers=10, random_state=global_random_seed
)
Slightly reduced `n_samples`; 200 is more than enough for this test.
@pytest.mark.parametrize(
    "Estimator, algorithm",
    [(KMeans, "lloyd"), (KMeans, "elkan"), (MiniBatchKMeans, None)],
)
def test_predict(Estimator, algorithm, init, dtype, array_constr):
@pytest.mark.parametrize("max_iter", [2, 100])
This is taken from the deleted test and is important to keep because it allows testing both cases: the fit stopping after reaching `max_iter`, and the fit stopping after reaching convergence.

Tested with all random seeds in commit affab9e.
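A toy sketch of why both `max_iter` values matter (a hand-rolled 1-D Lloyd loop, not scikit-learn's implementation): a small `max_iter` cuts the loop off mid-optimisation, while a large one lets it stop on convergence.

```python
# Toy 1-D Lloyd iteration (illustrative only, not scikit-learn code).
def toy_kmeans(points, centers, max_iter):
    n_iter = 0
    for _ in range(max_iter):
        n_iter += 1
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Recompute centers; keep the old one if a cluster went empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged before exhausting max_iter
            break
        centers = new_centers
    return centers, n_iter

pts = [1.0, 2.0, 4.0, 5.0, 6.0, 9.0]
_, n_cut = toy_kmeans(pts, [1.0, 2.0], max_iter=2)     # stops at max_iter
_, n_conv = toy_kmeans(pts, [1.0, 2.0], max_iter=100)  # stops on convergence
print(n_cut, n_conv)  # 2 3
```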
Thanks for the PR! I left a comment about `global_dtype` usage.
def test_predict(Estimator, algorithm, init, dtype, array_constr):
@pytest.mark.parametrize("max_iter", [2, 100])
def test_kmeans_predict(
    Estimator, algorithm, array_constr, max_iter, global_dtype, global_random_seed
This is one of the cases where we used to test for both dtypes, but now restrict to one dtype (by default). Is this the intention of #22881?
To summarize my thoughts:

- Tests that are not testing against float32 yet -> we want to test these against float32. We can use the fixture for that. Since we don't want to make the test suite too long, it's only enabled on 1 job.
- Tests that are already testing against both dtypes -> these tests already contribute to making the test suite too long, and I see no real reason why they should be tested against both dtypes in all jobs if we don't do it for the tests from the first item. So I'd be in favor of using the fixture for those as well.

To prevent surprises, we enabled the fixture in all jobs for the Azure nightly build, which is a good compromise to me.
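A minimal sketch of the mechanism under discussion (simplified; scikit-learn's actual conftest wires this through the `global_dtype` pytest fixture, and the flag name below is hypothetical): by default only float64 is exercised, and an opt-in switch on selected CI jobs widens the run to both dtypes.

```python
import os

def dtypes_to_test():
    # Hypothetical opt-in switch; scikit-learn's real conftest enables
    # the global_dtype fixture per CI job rather than via this variable.
    if os.environ.get("RUN_FLOAT32_TESTS") == "1":
        return ["float32", "float64"]
    return ["float64"]  # default: keep the test suite short
```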
LGTM
LGTM
ref #23211

The 2 tests `test_predict` and `test_k_means_fit_predict` actually check the same thing. I combined them into a single `test_kmeans_predict`. Now it takes ~1 s locally (instead of 28 s).