TST Speed-up test_predict for KMeans #23274

Merged

Conversation

@jeremiedbb (Member) commented on May 4, 2022:

ref #23211

The two tests test_predict and test_k_means_fit_predict actually check the same thing, so I combined them into a single test_kmeans_predict. It now takes ~1 s locally (instead of 28 s).
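The shared property that both deleted tests verified can be sketched roughly as follows. This is a minimal, hypothetical reconstruction, not the merged test itself; the parameter values (n_clusters, n_init, random_state) are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Well-separated blobs, similar in spirit to the test setup.
X, _ = make_blobs(n_samples=200, n_features=10, centers=10, random_state=0)

km = KMeans(n_clusters=10, init="random", n_init=1, random_state=0).fit(X)

# Invariant 1: predicting on the training data reproduces labels_.
assert np.array_equal(km.predict(X), km.labels_)

# Invariant 2: fit_predict is equivalent to fit(...).labels_
# for an identically configured estimator.
labels = KMeans(n_clusters=10, init="random", n_init=1, random_state=0).fit_predict(X)
assert np.array_equal(labels, km.labels_)
```

Since both invariants exercise the same fitted model, a single parametrized test can cover them together, which is what the merge does.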

@@ -621,19 +591,28 @@ def test_score_max_iter(Estimator):
@pytest.mark.parametrize(
"array_constr", [np.array, sp.csr_matrix], ids=["dense", "sparse"]
)
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
@pytest.mark.parametrize("init", ["random", "k-means++"])
@jeremiedbb (Member, Author) commented:

The init is irrelevant to this test, so I removed the parametrization and just used "random". "k-means++" was actually the main reason these tests were slow.

Comment on lines +604 to +606
X, _ = make_blobs(
n_samples=200, n_features=10, centers=10, random_state=global_random_seed
)
@jeremiedbb (Member, Author) commented:

Slightly reduced n_samples; 200 is plenty for this test.

@pytest.mark.parametrize(
"Estimator, algorithm",
[(KMeans, "lloyd"), (KMeans, "elkan"), (MiniBatchKMeans, None)],
)
def test_predict(Estimator, algorithm, init, dtype, array_constr):
@pytest.mark.parametrize("max_iter", [2, 100])
@jeremiedbb (Member, Author) commented:

This is taken from the deleted test and is important to keep because it exercises both stopping conditions: fit ending because max_iter is reached, and fit ending because the algorithm has converged.
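The two stopping regimes can be illustrated with a small sketch (the data and parameter values are illustrative, not the test's actual setup). With a tight max_iter the fit is cut off by the iteration budget; with a generous max_iter, Lloyd iterations on easy blobs stop at the convergence criterion well before the budget, as reported by the fitted n_iter_ attribute:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=10, centers=10, random_state=0)

# max_iter=2: the fit stops because the iteration budget is exhausted.
km_short = KMeans(
    n_clusters=10, init="random", n_init=1, max_iter=2, random_state=0
).fit(X)
assert km_short.n_iter_ <= 2

# max_iter=100: on well-separated blobs, the fit converges early instead.
km_long = KMeans(
    n_clusters=10, init="random", n_init=1, max_iter=100, random_state=0
).fit(X)
assert km_long.n_iter_ < 100
```

Parametrizing the test over both values keeps both code paths covered without needing a separate test.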

@jeremiedbb added the module:test-suite (everything related to our tests) and No Changelog Needed labels on May 4, 2022.
@jeremiedbb (Member, Author) commented:
tested with all random seeds in commit affab9e.

@jeremiedbb added the Quick Review (for PRs that are quick to review) label on May 4, 2022.
@thomasjpfan (Member) left a review:

Thanks for the PR! I left a comment about global_dtype usage.

def test_predict(Estimator, algorithm, init, dtype, array_constr):
@pytest.mark.parametrize("max_iter", [2, 100])
def test_kmeans_predict(
Estimator, algorithm, array_constr, max_iter, global_dtype, global_random_seed
@thomasjpfan (Member) commented:

This is one of the cases where we used to test both dtypes but now restrict to one dtype (by default). Is this the intention of #22881?

@jeremiedbb (Member, Author) replied:

To summarize my thoughts:

  • Tests that do not yet cover float32
    -> We want to test these against float32. We can use the fixture for that. Since we don't want to make the test suite too long, it's only enabled on one job.
  • Tests that already cover both dtypes
    -> These tests already contribute to making the test suite too long, and I see no real reason why they should be tested against both dtypes in all jobs if we don't do the same for the tests in the first category. So I'd be in favor of using the fixture for those as well.

To prevent surprises, we enabled the fixture in all jobs for the Azure nightly build, which is a good compromise to me.
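The opt-in mechanism under discussion can be sketched as a pytest_generate_tests hook plus a small helper. This is a simplified, hypothetical reconstruction of how a global_dtype-style fixture could be wired up (the real fixture lives in scikit-learn's conftest; the SKLEARN_RUN_FLOAT32_TESTS environment-variable name here is an assumption for illustration):

```python
import os
import numpy as np


def dtypes_for_run(environ):
    """Return the dtypes a global_dtype-style fixture would parametrize over."""
    dtypes = [np.float64]  # float64 is always tested
    if environ.get("SKLEARN_RUN_FLOAT32_TESTS") == "1":
        dtypes.append(np.float32)  # extra float32 leg, enabled on select jobs
    return dtypes


# Sketch of the pytest hook (in conftest.py) that would use the helper:
def pytest_generate_tests(metafunc):
    if "global_dtype" in metafunc.fixturenames:
        metafunc.parametrize("global_dtype", dtypes_for_run(os.environ))


# Default job: only float64 is exercised.
assert dtypes_for_run({}) == [np.float64]
# Dedicated / nightly job: both dtypes are exercised.
assert dtypes_for_run({"SKLEARN_RUN_FLOAT32_TESTS": "1"}) == [np.float64, np.float32]
```

The point of this design is that most CI jobs pay for one dtype per test, while one job (and the nightly build) flips the switch and runs every such test against both dtypes.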

@glemaitre (Member) left a review:

LGTM

@thomasjpfan (Member) left a review:

LGTM

@thomasjpfan thomasjpfan merged commit 9c8b89d into scikit-learn:main May 4, 2022
Labels: module:cluster · module:test-suite (everything related to our tests) · No Changelog Needed · Quick Review (for PRs that are quick to review)