Enforce threading in parallel pairwise #13310
Conversation
Full benchmark on a server with 40 threads & MKL, using:
(I had crashes with sparse manhattan for dim >= 1000 with loky & multiprocessing that I have not investigated yet.)
force-pushed from 34229aa to 6d84837
force-pushed from 6d84837 to 7361f64
Note that enforcing threading can lead to oversubscription. We should wait for #13297 to dynamically set the maximum number of threads for inner parallelism.
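For context, a rough sketch of what capping inner parallelism can look like from user code (threadpoolctl is an assumption here, not necessarily the mechanism #13297 ends up providing):

```python
import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(5000, 200)

# The outer joblib threading backend already spawns n_jobs workers, so we cap
# the BLAS thread pools to one thread each to avoid n_jobs * n_blas_threads.
with threadpool_limits(limits=1, user_api="blas"):
    D = pairwise_distances(X, metric="euclidean", n_jobs=4)
```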
Interesting results and nice benchmarks!
A few comments below.
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(
    fd(func, ret, s, X, Y[s], **kwds)
Interesting that it goes faster; I would have expected accessing and writing to the shared array to be limiting with respect to the GIL.
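For illustration, a self-contained sketch of the pattern (sizes and the `_fill_slice` helper are made up for the example): the heavy work happens inside NumPy/BLAS, which releases the GIL, and each thread writes to a disjoint column slice of the shared result, so there is no copy back to the parent and no final `np.hstack`.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import gen_even_slices

def _fill_slice(dist_func, out, col_slice, X, Y_slice):
    # Each worker writes its own block of columns in place.
    out[:, col_slice] = dist_func(X, Y_slice)

X = np.random.RandomState(0).rand(2000, 50)
Y = np.random.RandomState(1).rand(1000, 50)
n_jobs = 4

ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs)(
    delayed(_fill_slice)(euclidean_distances, ret, s, X, Y[s])
    for s in gen_even_slices(Y.shape[0], n_jobs)
)
```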
sklearn/metrics/pairwise.py
Outdated
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
Parallel(backend="threading", n_jobs=n_jobs, verbose=0)(
`verbose=0` is the default and could be dropped?
sklearn/metrics/pairwise.py
Outdated
@@ -1049,6 +1049,11 @@ def distance_metrics():
    return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist, mat, slice, *args, **kwargs):
maybe use more explicit names?
def _dist_wrapper(dist_func, dist_matrix, col_slice, *args, **kwargs):
OK. Keeping `slice` and not `col_slice` for now, because of my answer to your comment below.
sklearn/metrics/pairwise.py
Outdated
fd(X, Y[s], **kwds)
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype)
I would be curious to know whether allocating this array Fortran-ordered, for better cache locality, would matter for performance (no need to run all the benchmarks, just one typical case).
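A quick sketch of the kind of one-off check that could answer this; only the `order` argument changes between runs (sizes are illustrative):

```python
import time
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import gen_even_slices

X = np.random.RandomState(0).rand(5000, 100)

def _fill(out, col_slice):
    # With order='F' each column slice is contiguous in memory.
    out[:, col_slice] = euclidean_distances(X, X[col_slice])

for order in ("C", "F"):
    ret = np.empty((X.shape[0], X.shape[0]), dtype=X.dtype, order=order)
    tic = time.perf_counter()
    Parallel(backend="threading", n_jobs=4)(
        delayed(_fill)(ret, s) for s in gen_even_slices(X.shape[0], 4)
    )
    print(f"order={order}: {time.perf_counter() - tic:.2f}s")
```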
Nice one, but actually, is there any reason why we slice on `Y` and not on `X`? Otherwise we can swap, and keep the C order.
The thing is, in `pairwise_distances_chunked` we are chunking along `X`, and then `X_chunk` and `Y` are sent to `pairwise_distances`, which can be parallelized with `n_jobs`. So we cannot chunk both `pairwise_distances` and `pairwise_distances_chunked` in the same way.
Maybe this is something we can think about in a follow-up PR.
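To make the two levels concrete, a small usage sketch (made-up sizes): the outer level yields blocks of rows of `X`, while inside each block `pairwise_distances` may split the columns (i.e. `Y`) across `n_jobs` threads.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(20000, 64)

# Outer level: row chunks of X, sized to fit in working_memory (MiB).
# Inner level: each chunk vs. all samples, parallelized over column slices.
for D_chunk in pairwise_distances_chunked(X, working_memory=64, n_jobs=2):
    # D_chunk has shape (n_rows_in_chunk, n_samples) and can be reduced or
    # discarded before the next chunk is computed.
    print(D_chunk.shape)
```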
Agreed, but it's still an improvement I think, even without setting
Please add a what's new entry.
Overall looks good!
sklearn/metrics/pairwise.py
Outdated
@@ -1049,6 +1049,11 @@ def distance_metrics():
    return PAIRWISE_DISTANCE_FUNCTIONS


def _dist_wrapper(dist_func, dist_matrix, slice, *args, **kwargs):
    """Write in-place to a slice of a distance matrix"""
    dist_matrix[:, slice] = dist_func(*args, **kwargs)
OK, then maybe call it `slice_` to avoid a name collision with the built-in `slice`?
doc/whats_new/v0.21.rst
Outdated
@@ -182,6 +182,12 @@ Support for Python 3.4 and below has been officially dropped.
:mod:`sklearn.metrics`
......................

- |Efficiency| Force ``joblib`` to use the ``'threading'`` backend in
  :func:`metrics.pairwise.pairwise_distances` when ``n_jobs`` is greater than
This doesn't really explain what changed. Maybe:

- |Efficiency| Faster :func:`metrics.pairwise.pairwise_distances` with `n_jobs` > 1
  by using the `threading`, instead of the `multiprocessing`, parallel backend.

?
instead of 'loky' :)
Yeah, I know, I meant process-based parallelism generally...
Could you please reorganize the benchmark report to make it easy to contrast loky vs threading for each case?
You can even use
@zanospi if you got these results from asv and have a hard time parsing/normalizing them in a table-like format, I wrote a helper that aggregates asv benchmark results in a pandas dataframe. @jeremiedbb has it on his machine :)
(Offline so I can't check: are we doing this only for locally implemented metrics, or also for scipy and custom metrics?)
I'd say for all metrics, even scipy's ones. In the benchmarks above there's
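For example (a hypothetical call, just to illustrate that the parallel path is not limited to the metrics implemented in scikit-learn):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(500, 30)

# A scipy-provided metric and a user-defined callable both go through the
# same n_jobs > 1 code path as the scikit-learn metrics.
D_scipy = pairwise_distances(X, metric="canberra", n_jobs=2)
D_custom = pairwise_distances(X, metric=lambda a, b: np.abs(a - b).max(), n_jobs=2)
```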
LGTM. Great! Thank you!
# enforce a threading backend to prevent data communication overhead
fd = delayed(_dist_wrapper)
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, order='F')
Parallel(backend="threading", n_jobs=n_jobs)(
We should probably not enforce the backend like this, but instead use the `prefer="threads"` hinting mechanism, that is, the `**_joblib_parallel_args(prefer='threads')` trick, to keep backward compat with joblib 0.11.
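A rough sketch of what that suggestion could look like (assuming the internal helper `sklearn.utils.fixes._joblib_parallel_args` from 0.21-era scikit-learn, and a simplified stand-in for the wrapper from the diff above):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils import gen_even_slices
from sklearn.utils.fixes import _joblib_parallel_args  # internal helper at the time

def _dist_wrapper(dist_func, dist_matrix, slice_, *args, **kwargs):
    dist_matrix[:, slice_] = dist_func(*args, **kwargs)

X = np.random.RandomState(0).rand(1000, 20)
Y = np.random.RandomState(1).rand(800, 20)
n_jobs = 2
ret = np.empty((X.shape[0], Y.shape[0]), dtype=X.dtype, order="F")

# prefer="threads" is a soft hint: an outer parallel_backend() context
# (e.g. dask) may still override it, unlike a hard backend="threading".
Parallel(n_jobs=n_jobs, **_joblib_parallel_args(prefer="threads"))(
    delayed(_dist_wrapper)(euclidean_distances, ret, s, X, Y[s])
    for s in gen_even_slices(Y.shape[0], n_jobs)
)
```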
I was wondering about that, but say when you are using Dask, do we really want it to be able to change the backend here? I can't think of a use case where we would want a process-based backend for Euclidean distances, which is probably the most frequent case.
Other metrics might require more fine-grained treatment...
We should probably not enforce the backend like this, but instead use the `prefer="threads"` hinting mechanism.
Good catch.
I believe this should not have been merged before we had the opportunity to rerun the benchmark with the oversubscription protection currently being implemented in #13297 merged and used in this PR.
Also, as I said earlier, it's really hard to analyze the benchmark results in their current form. Making comparisons requires a lot of scrolling.
OK, I'll take the blame. I thought that @pierreglaser's first benchmarks were quite conclusive.
Yes, but @pierreglaser's benchmark had
Yes, granted. I had not seen that. I should have, given that I am aware of oversubscription problems. But I did not. Is prioritizing #13297 a way forward (to be able to control for oversubscription, and then redo the benchmarks)?
@zanospi's benchmarks cover all the interesting cases but:
I don't think it is necessary to revert the merge, as I think this PR probably already improves performance for the majority of our users.
We can see oversubscription with the threading backend on 40 cores, but it's still not slower than the loky backend. In any case, it's at least even with the loky backend. It will be better when we have the clibs helpers, but in the meantime it's still an improvement, and a really significant one for low n_jobs, e.g. a speedup instead of a slowdown :)
I agree there may be oversubscription issues, but we would indeed see those more at 20-40 CPU cores, while most users have 4-8 CPU cores at most, where this is a definite improvement. In the former case, I would argue it also goes a bit into the HPC domain, and while it would certainly be nice if we could handle oversubscription, in that case IMO it is also partly the user's responsibility to know how to use their compute server efficiently and to be aware of the different levels of parallelism going on (at least for BLAS).
Easier-to-read benchmark just comparing loky and threading backends for different n_jobs > 1 and no constraint on MKL_NUM_THREADS:
Easier to read benchmark just comparing loky and threading backends for different n_jobs > 1 and no constraint on MKL_NUM_THREADS
If I read this correctly, it's a speedup 100% of the time.
* ENH enforce threading backend in _parallel_pairwise
* OPT write inplace to the matrix in the workers
* CLN cosmetics
* CLN renaming, cosmetics
* CLN renaming
* DOC whats_new
* DOC rephrasing
This reverts commit d6b2da5.
Hi,

We are currently evaluating an upgrade from scikit-learn==0.20.4 to scikit-learn==1.0.2 and are seeing a very high performance degradation in our production setup. We've tried to reproduce it with test data. Our current assumption is that this change, merged into scikit-learn==0.21.0, introduced the problem and that it is still present for classifiers like KNeighborsClassifier. Could you please take a look and comment on why you changed the backend to force threading for distance calculations using "brute", which is our production setup? In the reproduction scenario as well as with our test data, using loky is significantly faster than multithreading.

Thanks and regards,

TL;DR: this was tested using scikit-learn==1.0.2. As you can see, whenever "threading" instead of "loky" is chosen, the performance is worse:
Reproduction Example:
Machine setup:
Full performance test:
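The reporter's actual script and numbers are not reproduced above; a hypothetical sketch of the kind of comparison described (brute-force KNN prediction timed for several n_jobs values, with made-up data sizes) could look like:

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(20000, 100)
y_train = rng.randint(0, 2, size=20000)
X_test = rng.rand(2000, 100)

for n_jobs in (1, 2, 4, 8):
    clf = KNeighborsClassifier(algorithm="brute", n_jobs=n_jobs)
    clf.fit(X_train, y_train)
    tic = time.perf_counter()
    clf.predict(X_test)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - tic:.2f}s")
```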
I see similar observations. I am surprised to find that setting multiple jobs slows down the computation significantly.
`pairwise_distances` makes a direct call to `_parallel_pairwise`, which computes the distance matrix chunk by chunk. However, it uses `loky`, the default backend of `joblib`, which generates a lot of overhead due to data communication. I propose to enforce a `threading` backend in `_parallel_pairwise`. I also discard the `np.hstack` concatenation call and write to the matrix in-place, to further optimize for speed.

Running this benchmark and limiting over-subscription by setting `MKL_NUM_THREADS` to 1 yielded these results:

@jeremiedbb @zanospi
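The linked benchmark itself is not included above; a minimal sketch of that kind of measurement (illustrative sizes and metrics; the environment variables must be set before NumPy is imported) could be:

```python
# Run as e.g.: MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 python bench_pairwise.py
import time
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(10000, 100)

for metric in ("euclidean", "cosine", "manhattan"):
    for n_jobs in (1, 2, 4):
        tic = time.perf_counter()
        pairwise_distances(X, metric=metric, n_jobs=n_jobs)
        print(f"{metric:>10} n_jobs={n_jobs}: {time.perf_counter() - tic:.2f}s")
```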