Enforce threading in parallel pairwise #13310
Conversation
Full benchmark on a server with 40 threads & MKL, using:
(I had crashes with sparse manhattan for dim >= 1000 with loky & multiprocessing that I have not investigated yet.)
force-pushed from 34229aa to 6d84837
force-pushed from 6d84837 to 7361f64
Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the maximum number of threads for inner parallelism.
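For context, a rough sketch of what capping inner parallelism can look like from user code (threadpoolctl is an assumption here, not necessarily the mechanism #13297 ends up providing):

```python
import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(5000, 200)

# The outer joblib threading backend already spawns n_jobs workers, so we cap
# the BLAS thread pools to one thread each to avoid n_jobs * n_blas_threads.
with threadpool_limits(limits=1, user_api="blas"):
    D = pairwise_distances(X, metric="euclidean", n_jobs=4)
```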
Interesting results and nice benchmarks!
A few comments below.
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(
    fd(func, ret, s, X, Y[s], **kwds)
Interesting that it goes faster; I would have expected accessing and writing to the shared array to be limiting with respect to the GIL.
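For illustration, a self-contained sketch of the pattern (sizes and the `_fill_slice` helper are made up for the example): the heavy work happens inside NumPy/BLAS, which releases the GIL, and each thread writes to a disjoint column slice of the shared result, so there is no copy back to the parent and no final `np.hstack`.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import gen_even_slices

def _fill_slice(dist_func, out, col_slice, X, Y_slice):
    # Each worker writes its own block of columns in place.
    out[:, col_slice] = dist_func(X, Y_slice)

X = np.random.RandomState(0).rand(2000, 50)
Y = np.random.RandomState(1).rand(1000, 50)
n_jobs = 4

ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs)(
    delayed(_fill_slice)(euclidean_distances, ret, s, X, Y[s])
    for s in gen_even_slices(Y.shape[0], n_jobs)
)
```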
sklearn/metrics/pairwise.py
Outdated
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(
`verbose=0` is the default and could be dropped?
sklearn/metrics/pairwise.py
Outdated
@@ -1049,6 +1049,11 @@ def distance_metrics():
    return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist, mat, slice, *args, **kwargs):
maybe use more explicit names?
def _dist_wrapper(dist_func, dist_matrix, col_slice, *args, **kwargs):
OK. Keeping `slice` and not `col_slice` for now, because of my answer to your comment below.
sklearn/metrics/pairwise.py
Outdated
fd(X, Y[s], **kwds)
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
I would be curious to know whether allocating this array Fortran-ordered, for better cache locality, would matter for performance (no need to run all the benchmarks, just one typical case).
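A quick sketch of the kind of one-off check that could answer this; only the `order` argument changes between runs (sizes are illustrative):

```python
import time
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import gen_even_slices

X = np.random.RandomState(0).rand(5000, 100)

def _fill(out, col_slice):
    # With order='F' each column slice is contiguous in memory.
    out[:, col_slice] = euclidean_distances(X, X[col_slice])

for order in ("C", "F"):
    ret = np.empty((X.shape[0], X.shape[0]), dtype=X.dtype, order=order)
    tic = time.perf_counter()
    Parallel(backend="threading", n_jobs=4)(
        delayed(_fill)(ret, s) for s in gen_even_slices(X.shape[0], 4)
    )
    print(f"order={order}: {time.perf_counter() - tic:.2f}s")
```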
Nice one, but actually, is there any reason why we slice on `Y` and not on `X`? Otherwise we can swap, and keep the C order.
The thing is, in `pairwise_distances_chunked` we are chunking along `X`, and then `X_chunk` and `Y` are sent to `pairwise_distances`, which can be parallelized with `n_jobs`. So we cannot chunk both `pairwise_distances` and `pairwise_distances_chunked` in the same way.
Maybe this is something we can think about in a follow-up PR.
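To make the two levels concrete, a small usage sketch (made-up sizes): the outer level yields blocks of rows of `X`, while inside each block `pairwise_distances` may split the columns (i.e. `Y`) across `n_jobs` threads.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(20000, 64)

# Outer level: row chunks of X, sized to fit in working_memory (MiB).
# Inner level: each chunk vs. all samples, parallelized over column slices.
for D_chunk in pairwise_distances_chunked(X, working_memory=64, n_jobs=2):
    # D_chunk has shape (n_rows_in_chunk, n_samples) and can be reduced or
    # discarded before the next chunk is computed.
    print(D_chunk.shape)
```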
Agreed, but it's still an improvement I think, even without setting
Please add a what's new entry.
Overall looks good!
sklearn/metrics/pairwise.py
Outdated
@@ -1049,6 +1049,11 @@ def distance_metrics():
    return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist_func, dist_matrix, slice, *args, **kwargs):
    """Write in-place to a slice of a distance matrix"""
    dist_matrix[:, slice] = dist_func(*args, **kwargs)
OK, then maybe call it `slice_` to avoid a name collision with the built-in `slice`?
doc/whats_new/v0.21.rst
Outdated
@@ -182,6 +182,12 @@ Support for Python 3.4 and below has been officially dropped.
:mod:`sklearn.metrics`
......................

- |Efficiency| Force ``joblib`` to use the ``'threading'`` backend in
  :func:`metrics.pairwise.pairwise_distances` when ``n_jobs`` is greater than
This doesn't really explain what changed. Maybe:

- |Efficiency| Faster :func:`metrics.pairwise.pairwise_distances` with `n_jobs` > 1
  by using the `threading`, instead of the `multiprocessing`, parallel backend.

?
instead of 'loky' :)
Yeah, I know, I meant process-based parallelism generally...
Could you please reorganize the benchmark report to make it easy to contrast loky vs threading for each case?
You can even use
@zanospi if you got these results from asv and have a hard time parsing/normalizing them in a table-like format, I wrote a helper that aggregates asv benchmark results in a pandas dataframe. @jeremiedbb has it on his machine :)
(Offline so I can't check: are we doing this only for locally implemented metrics, or also for scipy and custom metrics?)
I'd say for all metrics, even scipy's ones. In the benchmarks above there's
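For example (a hypothetical call, just to illustrate that the parallel path is not limited to the metrics implemented in scikit-learn):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(500, 30)

# A scipy-provided metric and a user-defined callable both go through the
# same n_jobs > 1 code path as the scikit-learn metrics.
D_scipy = pairwise_distances(X, metric="canberra", n_jobs=2)
D_custom = pairwise_distances(X, metric=lambda a, b: np.abs(a - b).max(), n_jobs=2)
```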
LGTM. Great! Thank you!
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, order='F')
Parallel(backend="threading", n_jobs=n_jobs)(
We should probably not enforce the backend like this, but instead use the `prefer="threads"` hinting mechanism, that is, the `**_joblib_parallel_args(prefer='threads')` trick, to keep backward compat with joblib 0.11.
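A rough sketch of what that suggestion could look like (assuming the internal helper `sklearn.utils.fixes._joblib_parallel_args` from 0.21-era scikit-learn, and a simplified stand-in for the wrapper from the diff above):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import gen_even_slices
from sklearn.utils.fixes import _joblib_parallel_args  # internal helper at the time

def _dist_wrapper(dist_func, dist_matrix, slice_, *args, **kwargs):
    dist_matrix[:, slice_] = dist_func(*args, **kwargs)

X = np.random.RandomState(0).rand(1000, 20)
Y = np.random.RandomState(1).rand(800, 20)
n_jobs = 2
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, order="F")

# prefer="threads" is a soft hint: an outer parallel_backend() context
# (e.g. dask) may still override it, unlike a hard backend="threading".
Parallel(n_jobs=n_jobs, **_joblib_parallel_args(prefer="threads"))(
    delayed(_dist_wrapper)(euclidean_distances, ret, s, X, Y[s])
    for s in gen_even_slices(Y.shape[0], n_jobs)
)
```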
I was wondering about that, but say when you are using Dask, do we really want it to be able to change the backend here? I can't think of a use case where we would want a process-based backend for Euclidean distances, which is probably the most frequent case.
Other metrics might require more fine-grained treatment...
We should probably not enforce the backend like this, but instead use the `prefer="threads"` hinting mechanism.
Good catch.
I believe this should not have been merged before we had the opportunity to rerun the benchmark with the oversubscription protection currently being implemented in #13297 merged and used in this PR.
Also, as I said earlier, it's really hard to analyze the benchmark results in their current form. Making comparisons requires a lot of scrolling.
OK, I'll take the blame. I thought that @pierreglaser's first benchmarks were quite conclusive.
Yes, but @pierreglaser's benchmark had
Yes, granted. I had not seen that. I should have, given that I am aware of oversubscription problems. But I did not. Is prioritizing #13297 a way forward (to be able to control for oversubscription, and then redo the benchmarks)?
@zanospi's benchmarks cover all the interesting cases but:
I don't think it is necessary to revert the merge, as I think this PR probably already improves performance for the majority of our users.
We can see oversubscription with the threading backend on 40 cores, but it's still not slower than the loky backend. In any case, it's at least even with the loky backend. It will be better when we have the clibs helpers, but in the meantime it's still an improvement, and a really significant one for low n_jobs, e.g. a speedup instead of a slowdown :)
I agree there may be oversubscription issues, but we would indeed see those more at 20-40 CPU cores, while most users have 4-8 CPU cores at most, where this is a definite improvement. In the former case, I would argue it also goes a bit into the HPC domain, and while it would certainly be nice if we could handle oversubscription, in that case IMO it is also partly the user's responsibility to know how to use their compute server efficiently and to be aware of the different levels of parallelism going on (at least for BLAS).
Easier-to-read benchmark just comparing loky and threading backends for different n_jobs > 1 and no constraint on MKL_NUM_THREADS:
Easier to read benchmark just comparing loky and threading backends for different n_jobs > 1 and no constraint on MKL_NUM_THREADS
If I read this correctly, it's a speedup 100% of the time.
* ENH enforce threading backend in _parallel_pairwise
* OPT write inplace to the matrix in the workers
* CLN cosmetics
* CLN renaming, cosmetics
* CLN renaming
* DOC whats_new
* DOC rephrasing
This reverts commit d6b2da5.
Hi,

We are currently evaluating an upgrade from scikit-learn==0.20.4 to scikit-learn==1.0.2 and are seeing a very high performance degradation in our production setup. We've tried to reproduce it with test data. Our current assumption is that this change, merged into scikit-learn==0.21.0, introduced the problem and that it is still present for classifiers like KNeighborsClassifier. Could you please take a look and comment on why you changed the backend to force threading for distance calculations using "brute", which is our production setup? In the reproduction scenario as well as with our test data, using loky is significantly faster than multithreading.

Thanks and regards,

TL;DR: this was tested using scikit-learn==1.0.2. As you can see, whenever "threading" instead of "loky" is chosen, the performance is worse:
Reproduction Example:
Machine setup:
Full performance test:
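The reporter's actual script and numbers are not reproduced above; a hypothetical sketch of the kind of comparison described (brute-force KNN prediction timed for several n_jobs values, with made-up data sizes) could look like:

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(20000, 100)
y_train = rng.randint(0, 2, size=20000)
X_test = rng.rand(2000, 100)

for n_jobs in (1, 2, 4, 8):
    clf = KNeighborsClassifier(algorithm="brute", n_jobs=n_jobs)
    clf.fit(X_train, y_train)
    tic = time.perf_counter()
    clf.predict(X_test)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - tic:.2f}s")
```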
I see similar observations. I am surprised to find that setting multiple jobs slows down the computation significantly.
`pairwise_distances` makes a direct call to `_parallel_pairwise`, which computes the distance matrix chunk by chunk. However, it uses `loky`, the default backend of `joblib`, which generates a lot of overhead due to data communication. I propose to enforce a `threading` backend in `_parallel_pairwise`. I also discard the `np.hstack` concatenation call and write to the matrix in-place, to further optimize for speed.

Running this benchmark and limiting over-subscription by setting `MKL_NUM_THREADS` to 1 yielded these results:

@jeremiedbb @zanospi
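The linked benchmark itself is not included above; a minimal sketch of that kind of measurement (illustrative sizes and metrics; the environment variables must be set before NumPy is imported) could be:

```python
# Run as e.g.: MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 python bench_pairwise.py
import time
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(10000, 100)

for metric in ("euclidean", "cosine", "manhattan"):
    for n_jobs in (1, 2, 4):
        tic = time.perf_counter()
        pairwise_distances(X, metric=metric, n_jobs=n_jobs)
        print(f"{metric:>10} n_jobs={n_jobs}: {time.perf_counter() - tic:.2f}s")
```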