
Enforce threading in parallel pairwise #13310


Conversation

@pierreglaser (Contributor) commented Feb 27, 2019

pairwise_distances makes a direct call to _parallel_pairwise, which computes the distance matrix chunk by chunk. However, it uses loky, the default joblib backend, which incurs a lot of overhead from data communication. I propose to enforce a threading backend in _parallel_pairwise. I also discard the np.hstack concatenation call and write to the matrix in-place, to further optimize for speed.
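For reference, the proposed pattern looks roughly like this (a simplified, self-contained sketch of the change; parallel_pairwise_sketch is a stand-in name for the private _parallel_pairwise):

import numpy as np
from joblib import Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices


def _dist_wrapper(dist_func, dist_matrix, slice_, *args, **kwargs):
    # each worker writes its column block in-place: no pickling, no np.hstack
    dist_matrix[:, slice_] = dist_func(*args, **kwargs)


def parallel_pairwise_sketch(X, Y, func, n_jobs, **kwds):
    # allocate the full result once; threads can share it safely because
    # they write to disjoint column slices
    ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
    Parallel(backend="threading", n_jobs=n_jobs)(
        delayed(_dist_wrapper)(func, ret, s, X, Y[s], **kwds)
        for s in gen_even_slices(Y.shape[0], effective_n_jobs(n_jobs)))
    return ret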

Running this benchmark and limiting over-subscription by setting MKL_NUM_THREADS to 1 yielded these results:

[figure: parallel_pairwise_plot — benchmark timings]

@jeremiedbb @zanospi

@rmenuet (Contributor) commented Feb 27, 2019

Full benchmark on a server with 40 threads & MKL, using:

  • loky, multiprocessing and threading as backend
  • euclidean, cosine, correlation & manhattan
  • either dense or sparse matrices with increasing number of features
  • either unlimited num of threads per job, or only 1 thread

(I had crashes with sparse manhattan for dim >= 1000 with loky & multiprocessing that I have not investigated yet.)

=========================================================
=== joblib backend = loky, MKL_NUM_THREADS not limited ===
=========================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     646±0ms   3.99±0s   3.26±0s   3.94±0s
     dense         euclidean    746±0ms   4.29±0s   3.23±0s   3.95±0s
     dense         manhattan    33.9±0s   17.8±0s   11.1±0s   6.18±0s
     dense        correlation   15.7±0s   17.4±0s   10.6±0s   5.02±0s
     sparse          cosine     2.34±0s   3.83±0s   3.00±0s   3.63±0s
     sparse        euclidean    2.65±0s   3.90±0s   3.11±0s   3.52±0s
     sparse        manhattan    27.0±0s   17.9±0s   11.1±0s   6.01±0s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     813±0ms   4.83±0s   3.52±0s   3.82±0s
     dense         euclidean    971±0ms   4.91±0s   3.51±0s   3.71±0s
     dense         manhattan    1.13±0m   37.4±0s   21.3±0s   11.6±0s
     dense        correlation   32.9±0s   35.9±0s   20.6±0s   9.96±0s
     sparse          cosine     3.42±0s   3.80±0s   2.82±0s   3.07±0s
     sparse        euclidean    4.18±0s   4.01±0s   2.95±0s   3.21±0s
     sparse        manhattan    51.3±0s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     765±0ms   4.86±0s   3.55±0s   3.71±0s
     dense         euclidean    974±0ms   4.91±0s   3.47±0s   3.64±0s
     dense         manhattan    1.14±0m   39.0±0s   21.2±0s   12.4±0s
     dense        correlation   39.0±0s   37.8±0s   20.5±0s   9.82±0s
     sparse          cosine     3.64±0s   3.80±0s   2.84±0s   3.62±0s
     sparse        euclidean    3.82±0s   4.38±0s   3.32±0s   3.74±0s
     sparse        manhattan    51.3±0s    failed    failed    failed

====================================================================
=== joblib backend = multiprocessing, MKL_NUM_THREADS not limited ===
====================================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     658±0ms   2.99±0s   2.33±0s   1.90±0s
     dense         euclidean    763±0ms   3.04±0s   2.35±0s   1.91±0s
     dense         manhattan    33.9±0s   16.3±0s   9.60±0s   3.97±0s
     dense        correlation   15.5±0s   16.2±0s   9.60±0s   3.97±0s
     sparse          cosine     2.20±0s   2.57±0s   1.99±0s   1.70±0s
     sparse        euclidean    2.65±0s   2.81±0s   2.12±0s   1.77±0s
     sparse        manhattan    26.8±0s   16.2±0s   9.66±0s   4.00±0s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     731±0ms   4.33±0s   3.02±0s   2.34±0s
     dense         euclidean    976±0ms   4.34±0s   3.01±0s   2.12±0s
     dense         manhattan    1.13±0m   36.5±0s   20.7±0s   9.94±0s
     dense        correlation   32.6±0s   35.8±0s   20.4±0s   10.4±0s
     sparse          cosine     3.29±0s   3.11±0s   2.27±0s   1.66±0s
     sparse        euclidean    3.75±0s   3.39±0s   2.40±0s   1.78±0s
     sparse        manhattan    51.0±0s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     735±0ms   4.33±0s   3.04±0s   2.28±0s
     dense         euclidean    966±0ms   4.26±0s   3.01±0s   2.11±0s
     dense         manhattan    1.13±0m   36.6±0s   21.0±0s   10.2±0s
     dense        correlation   32.8±0s   35.8±0s   20.4±0s   10.5±0s
     sparse          cosine     3.30±0s   3.12±0s   2.29±0s   1.69±0s
     sparse        euclidean    3.73±0s   3.38±0s   2.38±0s   1.72±0s
     sparse        manhattan    51.0±0s    failed    failed    failed

==============================================================
=== joblib backend = threading, MKL_NUM_THREADS not limited ===
==============================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     709±0ms   732±0ms   878±0ms   3.29±0s
     dense         euclidean    707±0ms   889±0ms   935±0ms   3.27±0s
     dense         manhattan    33.9±0s   13.6±0s   7.56±0s   1.98±0s
     dense        correlation   15.5±0s   14.0±0s   7.78±0s   2.10±0s
     sparse          cosine     2.19±0s   1.45±0s   1.12±0s   1.66±0s
     sparse        euclidean    2.64±0s   1.57±0s   1.17±0s   1.83±0s
     sparse        manhattan    26.8±0s   13.8±0s   7.48±0s   1.79±0s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     761±0ms   921±0ms   1.20±0s   3.31±0s
     dense         euclidean    958±0ms   997±0ms   1.08±0s   3.51±0s
     dense         manhattan    1.13±0m   34.7±0s   19.1±0s   8.46±0s
     dense        correlation   32.7±0s   33.6±0s   18.6±0s   8.62±0s
     sparse          cosine     3.33±0s   1.98±0s   1.29±0s   1.62±0s
     sparse        euclidean    3.71±0s   2.08±0s   1.44±0s   1.66±0s
     sparse        manhattan    51.0±0s   26.0±0s   14.0±0s   3.03±0s

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     771±0ms   972±0ms   1.03±0s   3.31±0s
     dense         euclidean    966±0ms   1.00±0s   1.03±0s   3.51±0s
     dense         manhattan    1.13±0m   34.7±0s   19.1±0s   8.48±0s
     dense        correlation   32.8±0s   33.6±0s   18.7±0s   8.59±0s
     sparse          cosine     3.28±0s   1.99±0s   1.30±0s   1.79±0s
     sparse        euclidean    3.72±0s   2.08±0s   1.39±0s   1.93±0s
     sparse        manhattan    51.0±0s   26.0±0s   14.1±0s   3.02±0s

===============================================
=== joblib backend = loky, MKL_NUM_THREADS=1 ===
===============================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     2.38±0s   5.18±0s   4.17±0s   4.03±0s
     dense         euclidean    3.14±0s   5.13±0s   4.22±0s   4.06±0s
     dense         manhattan    36.2±0s   19.4±0s   12.3±0s   6.71±0s
     dense        correlation   17.6±0s   19.1±0s   12.1±0s   5.34±0s
     sparse          cosine     2.95±0s   4.69±0s   3.91±0s   3.85±0s
     sparse        euclidean    3.30±0s   4.62±0s   3.68±0s   3.93±0s
     sparse        manhattan    27.2±0s   18.5±0s   12.0±0s   6.58±0s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     3.53±0s   6.01±0s   3.97±0s   4.03±0s
     dense         euclidean    3.70±0s   6.02±0s   4.05±0s   3.82±0s
     dense         manhattan    1.19±0m   39.6±0s   22.8±0s   12.8±0s
     dense        correlation   35.8±0s   38.5±0s   21.4±0s   10.3±0s
     sparse          cosine     3.84±0s   4.52±0s   3.00±0s   3.20±0s
     sparse        euclidean    4.31±0s   4.39±0s   3.66±0s   3.41±0s
     sparse        manhattan    51.4±0s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     3.59±0s   6.03±0s   4.14±0s   3.94±0s
     dense         euclidean    3.77±0s   5.60±0s   4.28±0s   3.67±0s
     dense         manhattan    1.36±0m   39.3±0s   22.9±0s   12.7±0s
     dense        correlation   37.9±0s   38.3±0s   21.8±0s   9.12±0s
     sparse          cosine     3.84±0s   4.91±0s   3.74±0s   3.92±0s
     sparse        euclidean    4.26±0s   4.88±0s   3.63±0s   4.17±0s
     sparse        manhattan    51.4±0s    failed    failed    failed

==========================================================
=== joblib backend = multiprocessing, MKL_NUM_THREADS=1 ===
==========================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     2.10±0s   3.48±0s   2.66±0s   2.06±0s
     dense         euclidean    2.85±0s   3.70±0s   2.65±0s   1.96±0s
     dense         manhattan    34.0±0s   17.4±0s   10.7±0s   4.58±0s
     dense        correlation   15.4±0s   17.6±0s   10.7±0s   4.45±0s
     sparse          cosine     2.56±0s   3.17±0s   2.13±0s   1.87±0s
     sparse        euclidean    2.72±0s   3.40±0s   2.29±0s   2.14±0s
     sparse        manhattan    26.9±0s   18.0±0s   10.9±0s   4.38±0s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     3.39±0s   4.81±0s   3.42±0s   2.54±0s
     dense         euclidean    3.10±0s   4.90±0s   3.30±0s   2.41±0s
     dense         manhattan    1.13±0m   38.3±0s   22.6±0s   10.3±0s
     dense        correlation   32.8±0s   38.4±0s   21.9±0s   10.8±0s
     sparse          cosine     3.36±0s   3.96±0s   2.53±0s   1.86±0s
     sparse        euclidean    4.00±0s   4.10±0s   3.08±0s   1.86±0s
     sparse        manhattan    51.1±0s    failed    failed    failed

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     3.37±0s   4.96±0s   3.35±0s   2.57±0s
     dense         euclidean    3.10±0s   5.12±0s   3.42±0s   2.19±0s
     dense         manhattan    1.13±0m   37.7±0s   22.4±0s   10.4±0s
     dense        correlation   32.8±0s   38.4±0s   21.5±0s   11.3±0s
     sparse          cosine     3.34±0s   3.93±0s   2.46±0s   1.69±0s
     sparse        euclidean    3.87±0s   4.08±0s   2.55±0s   2.12±0s
     sparse        manhattan    51.1±0s    failed    failed    failed

====================================================
=== joblib backend = threading, MKL_NUM_THREADS=1 ===
====================================================

Samples 10k / Features 500
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     2.27±0s   1.70±0s   1.14±0s   937±0ms
     dense         euclidean    3.01±0s   1.81±0s   1.17±0s   837±0ms
     dense         manhattan    35.0±0s   14.0±0s   7.97±0s   2.40±0s
     dense        correlation   16.8±0s   14.3±0s   8.18±0s   2.55±0s
     sparse          cosine     2.59±0s   1.61±0s   1.06±0s   825±0ms
     sparse        euclidean    3.06±0s   1.85±0s   1.08±0s   689±0ms
     sparse        manhattan    27.1±0s   14.1±0s   7.82±0s   2.07±0s

Samples 10k / Features 1k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     3.44±0s   2.74±0s   1.78±0s   1.11±0s
     dense         euclidean    3.58±0s   2.79±0s   1.80±0s   904±0ms
     dense         manhattan    1.16±0m   35.2±0s   19.7±0s   8.69±0s
     dense        correlation   34.7±0s   34.2±0s   19.0±0s   9.19±0s
     sparse          cosine     3.83±0s   2.14±0s   1.37±0s   939±0ms
     sparse        euclidean    4.24±0s   2.36±0s   1.40±0s   659±0ms
     sparse        manhattan    51.4±0s   26.2±0s   14.2±0s   3.29±0s

Samples 10k / Features 2k
================ ============= ========= ========= ========= =========
--                                              n_jobs
------------------------------ ---------------------------------------
representation      metric        1         2         4         40
================ ============= ========= ========= ========= =========
     dense           cosine     3.49±0s   2.81±0s   1.79±0s   1.13±0s
     dense         euclidean    3.63±0s   2.92±0s   2.03±0s   914±0ms
     dense         manhattan    1.18±0m   34.9±0s   19.6±0s   8.46±0s
     dense        correlation   34.8±0s   34.2±0s   19.0±0s   9.29±0s
     sparse          cosine     3.78±0s   2.09±0s   1.30±0s   887±0ms
     sparse        euclidean    4.25±0s   2.34±0s   1.48±0s   784±0ms
     sparse        manhattan    51.4±0s   26.0±0s   14.0±0s   3.29±0s

@jeremiedbb (Member) commented:

ping @rth
related to #8216

@jeremiedbb (Member) commented:

Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the inner-parallelism max threads.

@rth (Member) left a review:

Interesting results and nice benchmarks!

A few comments below.

fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(
    fd(func, ret, s, X, Y[s], **kwds)
Review comment (Member):

Interesting that it goes faster; I would have expected accessing and writing to a shared array to be limited by the GIL.
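A small illustration of why this can work (a sketch, not from the PR): most of the work happens in NumPy/SciPy C routines that release the GIL, and each thread writes to its own disjoint column slice, so no locking on the shared array is needed:

import numpy as np
from joblib import Parallel, delayed

out = np.empty((2000, 2000))

def fill_block(s):
    # disjoint column slices: no two threads touch the same memory
    out[:, s] = np.random.rand(2000, s.stop - s.start)

slices = [slice(i, i + 500) for i in range(0, 2000, 500)]
Parallel(backend="threading", n_jobs=4)(delayed(fill_block)(s) for s in slices)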

# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(
Review comment (Member):

verbose=0 is the default and could be dropped?

@@ -1049,6 +1049,11 @@ def distance_metrics():
return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist, mat, slice, *args, **kwargs):
Review comment (Member):

maybe use more explicit names?

def _dist_wrapper(dist_func, dist_matrix, col_slice, *args, **kwargs):

@pierreglaser (author) replied:

OK. Keeping slice and not col_slice for now, because of my answer to your comment below.

fd(X, Y[s], **kwds)
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Review comment (Member):

I would be curious to know if allocating this array Fortran-ordered, for better cache locality, would matter for performance (no need to run all benchmarks, just one typical case).

@pierreglaser (author) replied:

Nice one! But actually, is there any reason why we slice on Y and not on X? Otherwise we could swap and keep the C order.

Review comment (Member):

The thing is, in pairwise_distances_chunked we are chunking along X, and then X_chunk and Y are sent to pairwise_distances, which can be parallelized with n_jobs. So we cannot chunk both pairwise_distances and pairwise_distances_chunked in the same way.

Maybe this is something we can think about in a follow-up PR.
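As an aside, a rough way to test the Fortran-order question from this sub-thread (a sketch, not from the PR): time a column-block assignment into a C-ordered versus an F-ordered output, since F order keeps each written column block contiguous in memory:

import numpy as np
from timeit import timeit

n = 4000
block = np.random.rand(n, n // 4)
for order in ("C", "F"):
    ret = np.empty((n, n), order=order)
    def write_block(ret=ret):
        # assign one column block, as the workers do
        ret[:, : n // 4] = block
    print("order=%s: %.3f s for 20 writes" % (order, timeit(write_block, number=20)))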

@rth (Member) commented Feb 27, 2019

Note that enforcing threading can lead to oversubscription.

Agreed, but it's still an improvement I think, even without setting OMP_NUM_THREADS, unless one is using a 20+ core machine? And in those cases, making sure OMP_NUM_THREADS is limited is probably good practice in any case.

@rth added this to the 0.21 milestone Feb 27, 2019
@rth (Member) left a review:

Please add a what's new entry.

Overall looks good!

@@ -1049,6 +1049,11 @@ def distance_metrics():
return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist_func, dist_matrix, slice, *args, **kwargs):
    """Write in-place to a slice of a distance matrix"""
    dist_matrix[:, slice] = dist_func(*args, **kwargs)
Review comment (Member):

OK, then maybe call it slice_ to avoid shadowing the built-in name?

@@ -182,6 +182,12 @@ Support for Python 3.4 and below has been officially dropped.
:mod:`sklearn.metrics`
......................

- |Efficiency| Force ``joblib`` to use the ``'threading'`` backend in
:func:`metrics.pairwise.pairwise_distances` when ``n_jobs`` is greater than
Review comment (Member):

This doesn't really explain what changed. Maybe:

- |Efficiency| Faster :func:`metrics.pairwise.pairwise_distances` with `n_jobs` > 1
  by using the `threading`, instead of the `multiprocessing`, parallel backend.

?

Review comment (Member):

instead of 'loky' :)

Review comment (Member):

Yeah, I know; I meant process-based parallelism generally...

@ogrisel (Member) commented Feb 28, 2019

Could you please reorganize the benchmark report to make it easy to contrast loky vs threading for each case?

@ogrisel (Member) commented Feb 28, 2019

You can even use plt.barh (horizontal bar plot) to put all the results in the same plot, using rich labels on the y-axis.
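Something like the following sketch, for instance (the numbers are taken from the summary table @rmenuet posts further down; only a few rows shown):

import matplotlib.pyplot as plt

labels = ["dense/cosine/n_jobs=2", "dense/manhattan/n_jobs=40", "sparse/euclidean/n_jobs=4"]
loky_times = [4.0, 6.2, 3.1]
threading_times = [0.7, 2.0, 1.2]
y = range(len(labels))
# paired horizontal bars: one color per backend, one row per configuration
plt.barh([i + 0.2 for i in y], loky_times, height=0.4, label="loky")
plt.barh([i - 0.2 for i in y], threading_times, height=0.4, label="threading")
plt.yticks(list(y), labels)
plt.xlabel("time (s)")
plt.legend()
plt.tight_layout()
plt.show()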

@pierreglaser (author) commented:
@zanospi if you got these results from asv and have a hard time parsing/normalizing them into a table-like format, I wrote a helper that aggregates asv benchmark results into a pandas dataframe. @jeremiedbb has it on his machine :)

@jnothman (Member) commented Feb 28, 2019 via email

@jeremiedbb (Member) commented:

I'd say for all metrics, even scipy's ones. In the benchmarks above there's correlation, which is a scipy metric, and threading improves that too.

@GaelVaroquaux (Member) commented:

LGTM. Great! Thank you!

@GaelVaroquaux merged commit 1ada3aa into scikit-learn:master Feb 28, 2019
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, order='F')
Parallel(backend="threading", n_jobs=n_jobs)(
Review comment (Member):

We should probably not enforce the backend like this, but instead use the prefer="threads" hinting mechanism, that is, the **_joblib_parallel_args(prefer='threads') trick, to keep backward compat with joblib 0.11.
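Concretely, the hint would look like this (a sketch; it assumes _joblib_parallel_args from sklearn.utils.fixes as it existed at the time, which drops arguments that old joblib versions do not understand):

from joblib import Parallel, delayed
from sklearn.utils.fixes import _joblib_parallel_args

# prefer="threads" is a soft hint: an outer parallel_backend context can
# still override it, unlike a hard backend="threading"
results = Parallel(n_jobs=2, **_joblib_parallel_args(prefer="threads"))(
    delayed(abs)(-i) for i in range(8))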

Review comment (Member):

I was wondering about that, but say when you are using Dask, do we really want it to be able to change the backend here? I can't think of a use case where we want a process-based backend for Euclidean distances, which is probably the most frequent case.

Other metrics might require more fine-grained treatment...

@GaelVaroquaux (Member) commented Feb 28, 2019 via email

@ogrisel (Member) commented Feb 28, 2019

Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the inner-parallelism max threads.

I believe this should not have been merged before we had the opportunity to rerun the benchmarks once the oversubscription protection currently being implemented in #13297 is merged and used in this PR.

@ogrisel (Member) commented Feb 28, 2019

Also, as I said earlier, it's really hard to analyze the benchmark results in their current form. Making comparisons requires a lot of scrolling.

@GaelVaroquaux (Member) commented:

Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the inner-parallelism max threads.

I believe this should not have been merged before we had the opportunity to rerun the benchmarks once the oversubscription protection currently being implemented in #13297 is merged and used in this PR.

OK, I'll take the blame. I thought that @pierreglaser's first benchmarks were quite conclusive.

@ogrisel
Copy link
Member

ogrisel commented Feb 28, 2019

Yes, but @pierreglaser's benchmark had MKL_NUM_THREADS=1 fixed in the environment, which:

  • is not representative of what a naive user will have;
  • makes sequential (n_jobs=1) look worse than it is;
  • hides the fact that the loky backend has built-in oversubscription protection while the threading backend does not, because it's more complicated to get right for threads.
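For reference, the kind of protection being discussed looks like this (a sketch using threadpoolctl, the standalone library that the #13297 helpers later grew into; not part of this PR):

import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.metrics import pairwise_distances

X = np.random.rand(2000, 200)
# cap BLAS threads so the n_jobs=4 threading workers do not oversubscribe CPUs
with threadpool_limits(limits=1, user_api="blas"):
    D = pairwise_distances(X, metric="euclidean", n_jobs=4)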

@GaelVaroquaux (Member) commented:

Yes, granted. I had not seen that. I should have, given that I am aware of oversubscription problems. But I did not.

Is prioritizing #13297 a way forward (to be able to control for oversubscription, and then redo the benchmarks)?

@ogrisel (Member) commented Feb 28, 2019

@zanospi's benchmarks cover all the interesting cases, but:

  • they are hard to read;
  • they should probably be re-run once this PR leverages manual oversubscription protection as implemented in [WIP] Add helpers for BLAS and OpenMP libs #13297 (or once we fix the oversubscription protection for the threading backend directly upstream in joblib, now that @tomMoral and @jeremiedbb think they have found the fix for the failing loky helpers PR tests).

I don't think it is necessary to revert the merge, as I think this PR probably already improves performance for the majority of our users.

@jeremiedbb (Member) commented:

We can see oversubscription with the threading backend on 40 cores, but it's still not slower than the loky backend; in any case, it's at least even with it. It will be better once we have the C-libs helpers, but in the meantime it's still an improvement, and a really significant one for low n_jobs, e.g. a speedup instead of a slowdown :)

@rth (Member) commented Mar 1, 2019

I agree there may be oversubscription issues, but we would indeed see those more on 20-40 CPU cores, while most users have 4-8 CPU cores at most, where this is a definite improvement.

In the former case, I would argue it also goes a bit into the HPC domain, and while it would certainly be nice if we could handle oversubscription, in that case IMO it's also partly the user's responsibility to know how to efficiently use their compute server and to be aware of the different levels of parallelism going on (at least for BLAS).

@rmenuet (Contributor) commented Mar 1, 2019

An easier-to-read benchmark comparing just the loky and threading backends for different n_jobs > 1, with no constraint on MKL_NUM_THREADS (times in seconds):

| Representation | Metric      | n_jobs | loky | threading | Ratio   |
|----------------|-------------|--------|------|-----------|---------|
| dense          | cosine      | 2      | 4    | 0.7       | 5.71429 |
| dense          | cosine      | 4      | 3.3  | 0.8       | 4.125   |
| dense          | cosine      | 40     | 3.9  | 3.3       | 1.18182 |
| dense          | euclidean   | 2      | 4.3  | 0.9       | 4.77778 |
| dense          | euclidean   | 4      | 3.2  | 0.9       | 3.55556 |
| dense          | euclidean   | 40     | 4    | 3.3       | 1.21212 |
| dense          | manhattan   | 2      | 17.8 | 13.6      | 1.30882 |
| dense          | manhattan   | 4      | 11.1 | 7.6       | 1.46053 |
| dense          | manhattan   | 40     | 6.2  | 2         | 3.1     |
| dense          | correlation | 2      | 17.4 | 14        | 1.24286 |
| dense          | correlation | 4      | 10.6 | 7.8       | 1.35897 |
| dense          | correlation | 40     | 5    | 2.1       | 2.38095 |
| sparse         | cosine      | 2      | 3.8  | 1.5       | 2.53333 |
| sparse         | cosine      | 4      | 3    | 1.1       | 2.72727 |
| sparse         | cosine      | 40     | 3.6  | 1.7       | 2.11765 |
| sparse         | euclidean   | 2      | 3.9  | 1.6       | 2.4375  |
| sparse         | euclidean   | 4      | 3.1  | 1.2       | 2.58333 |
| sparse         | euclidean   | 40     | 3.5  | 1.8       | 1.94444 |
| sparse         | manhattan   | 2      | 17.9 | 13.8      | 1.2971  |
| sparse         | manhattan   | 4      | 11.1 | 7.5       | 1.48    |
| sparse         | manhattan   | 40     | 6    | 1.8       | 3.33333 |

@GaelVaroquaux (Member) commented Mar 1, 2019 via email

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
* ENH enforce threading backend in _parallel_pairwise

* OPT write inplace to the matrix in the workers

* CLN cosmetics

* CLN renaming, cosmetics

* CLN renaming

* DOC whats_new

* DOC rephrasing
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
@schabi88 commented Feb 3, 2022

Hi,

We are currently evaluating an upgrade from scikit-learn==0.20.4 to scikit-learn==1.0.2 and are seeing a very high performance degradation in our production setup. We've tried to reproduce it with test data. Our current assumption is that this change, which was merged into scikit-learn==0.21.0, introduced the problem, and that it still persists for classifiers like KNeighborsClassifier.

Could you please take a look at this and comment on why you've changed the backend to force threading for distance calculations using "brute", which is our production setup? In the reproduction scenario, as well as with our test data, using loky is significantly faster than threading.

Thanks and regards,
Felix

TL;DR:

This was tested using scikit-learn==1.0.2. As you can see, whenever "threading" is chosen instead of "loky", the performance is worse:

Configuration--> Samples: 50000, Metric: manhattan, Algorithm: ball_tree, Backend: threading
96.892 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: ball_tree, Backend: loky
5.552 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: ball_tree, Backend: multiprocessing
5.068 seconds

Configuration--> Samples: 50000, Metric: euclidean, Algorithm: ball_tree, Backend: threading
89.467 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: ball_tree, Backend: loky
4.307 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: ball_tree, Backend: multiprocessing
4.748 seconds

Configuration--> Samples: 100000, Metric: manhattan, Algorithm: ball_tree, Backend: threading
173.201 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: ball_tree, Backend: loky
11.277 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: ball_tree, Backend: multiprocessing
11.081 seconds

Configuration--> Samples: 100000, Metric: euclidean, Algorithm: ball_tree, Backend: threading
164.040 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: ball_tree, Backend: loky
7.906 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: ball_tree, Backend: multiprocessing
8.707 seconds

Reproduction Example:

from time import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from joblib import parallel_backend


def run():
    backends = ["threading", "loky", "multiprocessing"]
    samples = [1000, 5000, 10000, 50000, 100000]
    algorithms = ["ball_tree", "brute"]
    metrics = ["euclidean", "manhattan"]
    # define the training dataset
    X, y = make_classification(n_samples=20000, n_features=50, n_informative=15, n_redundant=5, random_state=3)
    # fit the model and time fit + predict for every configuration
    for sample in samples:
        for metric in metrics:
            for algorithm in algorithms:
                for backend in backends:
                    with parallel_backend(backend=backend, n_jobs=-1):
                        X_test, y_test = make_classification(n_samples=sample, n_features=50, n_informative=15, n_redundant=5,
                                                             random_state=3)
                        start = time()
                        model = KNeighborsClassifier(algorithm=algorithm, metric=metric, n_jobs=-1)
                        model.fit(X, y)
                        model.predict(X_test)
                        end = time()
                        result = end - start
                        print("Configuration--> Samples: %d, Metric: %s, Algorithm: %s, Backend: %s" % (sample, metric, algorithm, backend))
                        print("%.3f seconds" % result)


if __name__ == '__main__':
    run()

Machine setup:

Architecture:        x86_64
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           40
NUMA node(s):        2
Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
NUMA node0 CPU(s):   0-19
NUMA node1 CPU(s):   20-39

Full performance test:

Configuration--> Samples: 1000, Metric: euclidean, Algorithm: ball_tree, Backend: threading
1.813 seconds
Configuration--> Samples: 1000, Metric: euclidean, Algorithm: ball_tree, Backend: loky
1.385 seconds
Configuration--> Samples: 1000, Metric: euclidean, Algorithm: ball_tree, Backend: multiprocessing
0.710 seconds
Configuration--> Samples: 1000, Metric: euclidean, Algorithm: brute, Backend: threading
0.749 seconds
Configuration--> Samples: 1000, Metric: euclidean, Algorithm: brute, Backend: loky
0.768 seconds
Configuration--> Samples: 1000, Metric: euclidean, Algorithm: brute, Backend: multiprocessing
0.645 seconds
Configuration--> Samples: 1000, Metric: manhattan, Algorithm: ball_tree, Backend: threading
2.009 seconds
Configuration--> Samples: 1000, Metric: manhattan, Algorithm: ball_tree, Backend: loky
0.584 seconds
Configuration--> Samples: 1000, Metric: manhattan, Algorithm: ball_tree, Backend: multiprocessing
0.799 seconds
Configuration--> Samples: 1000, Metric: manhattan, Algorithm: brute, Backend: threading
0.513 seconds
Configuration--> Samples: 1000, Metric: manhattan, Algorithm: brute, Backend: loky
0.509 seconds
Configuration--> Samples: 1000, Metric: manhattan, Algorithm: brute, Backend: multiprocessing
0.523 seconds
Configuration--> Samples: 5000, Metric: euclidean, Algorithm: ball_tree, Backend: threading
8.777 seconds
Configuration--> Samples: 5000, Metric: euclidean, Algorithm: ball_tree, Backend: loky
0.571 seconds
Configuration--> Samples: 5000, Metric: euclidean, Algorithm: ball_tree, Backend: multiprocessing
1.144 seconds
Configuration--> Samples: 5000, Metric: euclidean, Algorithm: brute, Backend: threading
2.889 seconds
Configuration--> Samples: 5000, Metric: euclidean, Algorithm: brute, Backend: loky
4.070 seconds
Configuration--> Samples: 5000, Metric: euclidean, Algorithm: brute, Backend: multiprocessing
2.842 seconds
Configuration--> Samples: 5000, Metric: manhattan, Algorithm: ball_tree, Backend: threading
9.745 seconds
Configuration--> Samples: 5000, Metric: manhattan, Algorithm: ball_tree, Backend: loky
0.799 seconds
Configuration--> Samples: 5000, Metric: manhattan, Algorithm: ball_tree, Backend: multiprocessing
1.130 seconds
Configuration--> Samples: 5000, Metric: manhattan, Algorithm: brute, Backend: threading
2.437 seconds
Configuration--> Samples: 5000, Metric: manhattan, Algorithm: brute, Backend: loky
2.392 seconds
Configuration--> Samples: 5000, Metric: manhattan, Algorithm: brute, Backend: multiprocessing
2.375 seconds
Configuration--> Samples: 10000, Metric: euclidean, Algorithm: ball_tree, Backend: threading
13.075 seconds
Configuration--> Samples: 10000, Metric: euclidean, Algorithm: ball_tree, Backend: loky
1.180 seconds
Configuration--> Samples: 10000, Metric: euclidean, Algorithm: ball_tree, Backend: multiprocessing
1.901 seconds
Configuration--> Samples: 10000, Metric: euclidean, Algorithm: brute, Backend: threading
6.452 seconds
Configuration--> Samples: 10000, Metric: euclidean, Algorithm: brute, Backend: loky
5.968 seconds
Configuration--> Samples: 10000, Metric: euclidean, Algorithm: brute, Backend: multiprocessing
5.773 seconds
Configuration--> Samples: 10000, Metric: manhattan, Algorithm: ball_tree, Backend: threading
16.369 seconds
Configuration--> Samples: 10000, Metric: manhattan, Algorithm: ball_tree, Backend: loky
1.221 seconds
Configuration--> Samples: 10000, Metric: manhattan, Algorithm: ball_tree, Backend: multiprocessing
1.785 seconds
Configuration--> Samples: 10000, Metric: manhattan, Algorithm: brute, Backend: threading
5.258 seconds
Configuration--> Samples: 10000, Metric: manhattan, Algorithm: brute, Backend: loky
5.246 seconds
Configuration--> Samples: 10000, Metric: manhattan, Algorithm: brute, Backend: multiprocessing
5.087 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: ball_tree, Backend: threading
89.467 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: ball_tree, Backend: loky
4.307 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: ball_tree, Backend: multiprocessing
4.748 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: brute, Backend: threading
29.213 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: brute, Backend: loky
28.359 seconds
Configuration--> Samples: 50000, Metric: euclidean, Algorithm: brute, Backend: multiprocessing
27.758 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: ball_tree, Backend: threading
96.892 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: ball_tree, Backend: loky
5.552 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: ball_tree, Backend: multiprocessing
5.068 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: brute, Backend: threading
24.075 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: brute, Backend: loky
23.903 seconds
Configuration--> Samples: 50000, Metric: manhattan, Algorithm: brute, Backend: multiprocessing
23.634 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: ball_tree, Backend: threading
164.040 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: ball_tree, Backend: loky
7.906 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: ball_tree, Backend: multiprocessing
8.707 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: brute, Backend: threading
58.199 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: brute, Backend: loky
57.016 seconds
Configuration--> Samples: 100000, Metric: euclidean, Algorithm: brute, Backend: multiprocessing
55.996 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: ball_tree, Backend: threading
173.201 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: ball_tree, Backend: loky
11.277 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: ball_tree, Backend: multiprocessing
11.081 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: brute, Backend: threading
50.696 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: brute, Backend: loky
49.473 seconds
Configuration--> Samples: 100000, Metric: manhattan, Algorithm: brute, Backend: multiprocessing
49.485 seconds

@RNAer commented Jun 22, 2022

I see similar behavior. I am surprised to find that setting multiple jobs slows down the computation significantly in sklearn.metrics.pairwise_distances(_chunked), whether for dense or sparse arrays, and whether using 'jaccard' or my custom metric function...
