TST Speed-up test_predict for KMeans #23274
Conversation
@@ -621,19 +591,28 @@ def test_score_max_iter(Estimator):
@pytest.mark.parametrize(
    "array_constr", [np.array, sp.csr_matrix], ids=["dense", "sparse"]
)
@pytest.mark.parametrize("dtype", [np.float32, np.float64])
@pytest.mark.parametrize("init", ["random", "k-means++"])
The `init` is irrelevant to this test. I removed the parametrisation and just used `"random"`. `"k-means++"` was actually the main reason these tests were slow.
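As a rough illustration of the saving (hypothetical parameter grids mirroring the `parametrize` decorators in the diff above), each stacked `@pytest.mark.parametrize` multiplies the number of generated test cases, so dropping the two-value `init` dimension halves the runs before even accounting for the extra cost of `"k-means++"` itself:

```python
from itertools import product

# Hypothetical parameter grids mirroring the decorators in the diff;
# this is not scikit-learn code, it just counts generated test cases.
array_constrs = ["dense", "sparse"]
dtypes = ["float32", "float64"]
inits = ["random", "k-means++"]  # the dimension removed in this PR
estimators = ["KMeans/lloyd", "KMeans/elkan", "MiniBatchKMeans"]

n_before = len(list(product(array_constrs, dtypes, inits, estimators)))
n_after = len(list(product(array_constrs, dtypes, estimators)))
print(n_before, n_after)  # 24 12
```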
X, _ = make_blobs(
    n_samples=200, n_features=10, centers=10, random_state=global_random_seed
)
Slightly reduced `n_samples`; 200 is more than enough for this test.
@pytest.mark.parametrize(
    "Estimator, algorithm",
    [(KMeans, "lloyd"), (KMeans, "elkan"), (MiniBatchKMeans, None)],
)
def test_predict(Estimator, algorithm, init, dtype, array_constr):
@pytest.mark.parametrize("max_iter", [2, 100])
This is taken from the deleted test and is important to keep because it allows testing both cases: the fit stopping after reaching `max_iter`, and the fit stopping after reaching convergence.

Tested with all random seeds in commit affab9e.
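A toy sketch of why both `max_iter` values matter (a hand-rolled 1-D Lloyd loop, not scikit-learn's implementation): a small `max_iter` cuts the loop off mid-optimisation, while a large one lets it stop on convergence.

```python
# Toy 1-D Lloyd iteration (illustrative only, not scikit-learn code).
def toy_kmeans(points, centers, max_iter):
    n_iter = 0
    for _ in range(max_iter):
        n_iter += 1
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Recompute centers; keep the old one if a cluster went empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged before exhausting max_iter
            break
        centers = new_centers
    return centers, n_iter

pts = [1.0, 2.0, 4.0, 5.0, 6.0, 9.0]
_, n_cut = toy_kmeans(pts, [1.0, 2.0], max_iter=2)     # stops at max_iter
_, n_conv = toy_kmeans(pts, [1.0, 2.0], max_iter=100)  # stops on convergence
print(n_cut, n_conv)  # 2 3
```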
Thanks for the PR! I left a comment about `global_dtype` usage.
def test_predict(Estimator, algorithm, init, dtype, array_constr):
@pytest.mark.parametrize("max_iter", [2, 100])
def test_kmeans_predict(
    Estimator, algorithm, array_constr, max_iter, global_dtype, global_random_seed
This is one of the cases where we used to test for both dtypes, but now restrict to one dtype (by default). Is this the intention of #22881?
To summarize my thoughts:

- Tests that are not testing against float32 yet -> we want to test these against float32. We can use the fixture for that. Since we don't want to make the test suite too long, it's only enabled on 1 job.
- Tests that are already testing against both dtypes -> these tests already contribute to making the test suite too long, and I see no real reason why they should be tested against both dtypes in all jobs if we don't do it for the tests from the first item. So I'd be in favor of using the fixture for those as well.

To prevent surprises, we enabled the fixture in all jobs for the Azure nightly build, which is a good compromise to me.
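A minimal sketch of the mechanism under discussion (simplified; scikit-learn's actual conftest wires this through the `global_dtype` pytest fixture, and the flag name below is hypothetical): by default only float64 is exercised, and an opt-in switch on selected CI jobs widens the run to both dtypes.

```python
import os

def dtypes_to_test():
    # Hypothetical opt-in switch; scikit-learn's real conftest enables
    # the global_dtype fixture per CI job rather than via this variable.
    if os.environ.get("RUN_FLOAT32_TESTS") == "1":
        return ["float32", "float64"]
    return ["float64"]  # default: keep the test suite short
```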
LGTM
LGTM
ref #23211

The 2 tests `test_predict` and `test_k_means_fit_predict` actually check the same thing. I combined them into a single `test_kmeans_predict`. Now it takes ~1 s locally (instead of 28 s).