Performance regression in pairwise_distances with the Euclidean metric on sparse data #26097
Comments
I guess it's the same reason as for KMeans.transform, explained in #26100.
I cannot reproduce the slowdown between these commits with the following script:

```python
from benchmarks.common import Benchmark
from benchmarks.metrics import PairwiseDistancesBenchmark

Benchmark.data_size = "large"
params = ("sparse", "euclidean", 4)
bench = PairwiseDistancesBenchmark()
bench.setup(*params)

def time_pairwise_distances():
    bench.time_pairwise_distances()

if __name__ == "__main__":
    time_pairwise_distances()
```

And here are the two profiles:
After some digging, I'm not so sure that it's due to the validation of
I finally found the commit responsible for the regression: it's de67a44 from PR #25598. I ran the benchmarks for this commit and the commit just before, which confirms it: https://scikit-learn.org/scikit-learn-benchmarks/#metrics.PairwiseDistancesBenchmark.time_pairwise_distances?p-representation='sparse'&p-metric='euclidean'&p-n_jobs=4&commits=de67a442

The reason for the regression is another occurrence of OpenMathLib/OpenBLAS#3187. We're running

@glemaitre, the reason you couldn't reproduce is probably that on macOS your install is such that the BLAS and scikit-learn use the same OpenMP lib. Unfortunately I don't know any easy fix for that besides reverting this commit.
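One user-side mitigation for this class of OpenMP/OpenBLAS thread-pool interaction (a sketch, not an official fix; which environment variables are honored depends on the BLAS build) is to cap the BLAS thread pool before NumPy/SciPy are imported:

```python
import os

# Must be set before numpy/scipy (and hence OpenBLAS) are loaded: they cap
# the BLAS-level thread pool so it doesn't fight scikit-learn's OpenMP
# threads for the same cores.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "4")

import numpy as np  # noqa: E402  (imported after the env is configured)

# Plain matmul still works, just with a single-threaded BLAS
x = np.ones((10, 10))
print((x @ x)[0, 0])
```

At runtime, the threadpoolctl package offers finer-grained control over the same pools, e.g. `threadpool_limits(limits=1, user_api="blas")` around a specific call.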
Thanks for the investigation. Indeed, I don't see any short-term solution beyond advising our users to install packages from conda-forge. In the longer term we will need to come back to the possibility of shipping numpy/scipy wheels that somehow rely on an OpenBLAS linked to a specific vendored OpenMP for each supported OS.
I am +1 on reverting for now. Moving forward, we'll need to be more careful when merging PRs that add OpenMP. Currently the best way to catch the issue is to run the benchmarks with SciPy installed from PyPI.
An alternative is to make row_norms able to use a float64 accumulator. That way we wouldn't have to call row_norms for each chunk in _euclidean_distances_upcast. I created a branch to try that, but since row_norms is public and used in many places it requires more work. So I think we should revert for now (I opened #26275) and possibly come back to this alternative later.
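To illustrate the idea of a float64 accumulator (a plain NumPy sketch, not the actual `row_norms` from `sklearn.utils.extmath`): the squared row norms of a float32 matrix can be accumulated in float64 without materializing an upcast copy of the data, which is what the chunked upcast currently does:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50)).astype(np.float32)

# float32 accumulation: the running sum itself rounds at float32 precision
norms_f32 = (X ** 2).sum(axis=1)

# Same reduction with a float64 accumulator; no float64 copy of X is made
norms_f64 = np.einsum("ij,ij->i", X, X, dtype=np.float64)

# Reference: upcast the whole matrix first (the memory cost being avoided)
ref = (X.astype(np.float64) ** 2).sum(axis=1)
```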
As spotted by our continuous benchmark suite, there is a more than 2x slowdown in pairwise_distances on sparse input data (for the Euclidean metric). This happened between b4afbee and fabe160.

Could it be ef5c087? I'm not sure, because there are many other commits in git log b4afbeee..fabe1606 and I did not review them all.