Performance regression in pairwise_distances with the Euclidean metric on sparse data #26097
Comments
I guess it's the same reason as for KMeans.transform, explained in #26100.
I cannot reproduce the slowdown between these commits with the following script:

```python
from benchmarks.common import Benchmark
from benchmarks.metrics import PairwiseDistancesBenchmark

Benchmark.data_size = "large"
params = ("sparse", "euclidean", 4)
bench = PairwiseDistancesBenchmark()
bench.setup(*params)

def time_pairwise_distances():
    bench.time_pairwise_distances()

if __name__ == "__main__":
    time_pairwise_distances()
```

And here are the two profiles:
After some digging, I'm not so sure that it's due to the validation of
I finally found the commit responsible for the regression: it's de67a44 from PR #25598. I ran the benchmarks for this commit and the commit just before, which confirms it: https://scikit-learn.org/scikit-learn-benchmarks/#metrics.PairwiseDistancesBenchmark.time_pairwise_distances?p-representation='sparse'&p-metric='euclidean'&p-n_jobs=4&commits=de67a442

The reason for the regression is another occurrence of OpenMathLib/OpenBLAS#3187. We're running

@glemaitre, the reason you couldn't reproduce is probably that on macOS your install is such that the BLAS and scikit-learn use the same OpenMP lib. Unfortunately I don't know any easy fix for that besides reverting this commit.
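One user-side mitigation for this class of OpenMP/OpenBLAS thread-pool interaction (a sketch, not an official fix; which environment variables are honored depends on the BLAS build) is to cap the BLAS thread pool before NumPy/SciPy are imported:

```python
import os

# Must be set before numpy/scipy (and hence OpenBLAS) are loaded: they cap
# the BLAS-level thread pool so it doesn't fight scikit-learn's OpenMP
# threads for the same cores.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "4")

import numpy as np  # noqa: E402  (imported after the env is configured)

# Plain matmul still works, just with a single-threaded BLAS
x = np.ones((10, 10))
print((x @ x)[0, 0])
```

At runtime, the threadpoolctl package offers finer-grained control over the same pools, e.g. `threadpool_limits(limits=1, user_api="blas")` around a specific call.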
Thanks for the investigation. Indeed, I don't see any short-term solution beyond advising our users to install packages from conda-forge. In the longer term we will need to come back to the possibility of shipping numpy/scipy wheels that somehow rely on an OpenBLAS linked to a specific vendored OpenMP for each supported OS.
I am +1 on reverting for now. Moving forward, we'll need to be more careful when merging PRs that add OpenMP. Currently the best way to catch the issue is to run the benchmarks with SciPy installed from PyPI.
An alternative is to make row_norms able to use a float64 accumulator. That way we wouldn't have to call row_norms for each chunk in _euclidean_distances_upcast. I created a branch to try that, but since row_norms is public and used in many places it requires more work. So I think we should revert for now (I opened #26275) and possibly come back to this alternative later.
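To illustrate the idea of a float64 accumulator (a plain NumPy sketch, not the actual `row_norms` from `sklearn.utils.extmath`): the squared row norms of a float32 matrix can be accumulated in float64 without materializing an upcast copy of the data, which is what the chunked upcast currently does:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50)).astype(np.float32)

# float32 accumulation: the running sum itself rounds at float32 precision
norms_f32 = (X ** 2).sum(axis=1)

# Same reduction with a float64 accumulator; no float64 copy of X is made
norms_f64 = np.einsum("ij,ij->i", X, X, dtype=np.float64)

# Reference: upcast the whole matrix first (the memory cost being avoided)
ref = (X.astype(np.float64) ** 2).sum(axis=1)
```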
As spotted by our continuous benchmark suite, there is a more than 2x slowdown in pairwise_distances on sparse input data (for the Euclidean metric). This happened between b4afbee and fabe160.

Could it be ef5c087? I'm not sure, because there are many other commits in git log b4afbeee..fabe1606 and I did not review them all.