Euclidean pairwise_distances slower for n_jobs > 1 #8216
Parallelism is always fraught with this kind of issue. There are multiple
factors in whether it will succeed, and very often it won't. Again, you've
not varied n_features, which might be an important factor. n_jobs may be
more useful when the calculation is slow, i.e. non-euclidean, or over
sparse matrices or whatnot.
I can't remember if we experimented with multithreading for this; it might
help.
On 19 January 2017 at 22:48, Roman Yurchak wrote:

In a follow-up of issue #8213, it looks like using n_jobs > 1 in Euclidean pairwise_distances makes computations slower instead of speeding them up.

Steps to reproduce

```python
from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(99999)

n_dim = 200

for n_train, n_test in [(1000, 100000),
                        (10000, 10000),
                        (100000, 1000)]:
    print('\n# n_train={}, n_test={}, n_dim={}\n'.format(
        n_train, n_test, n_dim))
    X_train = np.random.rand(n_train, n_dim)
    X_test = np.random.rand(n_test, n_dim)
    for n_jobs in [1, 2]:
        print('n_jobs=', n_jobs, ' => ', end='')
        %timeit pairwise_distances(X_train, X_test, 'euclidean', n_jobs=n_jobs, squared=True)
```

which on a 2 core CPU returns,

```
# n_train=1000, n_test=100000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.92 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.95 s per loop

# n_train=10000, n_test=10000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.89 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.74 s per loop

# n_train=100000, n_test=1000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 2 s per loop
n_jobs= 2 => 1 loop, best of 3: 5.6 s per loop
```

While for small datasets it would make sense that parallel processing does not improve performance due to the multiprocessing overhead, these are by no means small datasets. And the compute time does not decrease when using e.g. n_jobs=4 on a 4 core CPU.

This also holds for other numbers of dimensions,

n_dim=10

```
# n_train=1000, n_test=100000, n_dim=10
n_jobs= 1 => 1 loop, best of 3: 873 ms per loop
n_jobs= 2 => 1 loop, best of 3: 4.25 s per loop
```

n_dim=1000

```
# n_train=1000, n_test=100000, n_dim=1000
n_jobs= 1 => 1 loop, best of 3: 6.56 s per loop
n_jobs= 2 => 1 loop, best of 3: 8.56 s per loop
```

Running benchmarks/bench_plot_parallel_pairwise.py also yields similar results,

[image: untitled] <https://cloud.githubusercontent.com/assets/630936/22105598/240d33bc-de45-11e6-8b68-3ce97b6d7e2d.png>

This might affect a number of estimators / metrics where pairwise_distances is used.

Versions

```
Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1
```

I also get similar results with scikit-learn 0.17.1.
Yes, I know, I tried to do a quick parallelization of it,

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances


def _naive_parallel_pairwise_distances(X, Y, metric, batch_size=100,
                                        n_jobs=1, squared=True):
    # Split X into batches of rows and compute each block in a separate job.
    queue = []
    n_samples = X.shape[0]
    for k in range(n_samples // batch_size + int(n_samples % batch_size != 0)):
        mslice = slice(k * batch_size, min((k + 1) * batch_size, n_samples))
        X_sl = X[mslice, :]
        queue.append(delayed(pairwise_distances)(X_sl, Y, metric,
                                                 n_jobs=1, squared=squared))
    res = Parallel(n_jobs=n_jobs)(queue)
    # Stack the row blocks back into an (n_samples_X, n_samples_Y) matrix.
    return np.vstack(res)
```

and this yields comparable performance, so it must be as you say.
It looks like the cost of memory copies is the issue here. For instance, if we rewrite the parallel calculations to save the results to a shared array (adapted from this joblib example <https://pythonhosted.org/joblib/parallel.html#writing-parallel-computation-results-in-shared-memory>) in this gist <https://gist.github.com/rth/2f904eaa165631cd0dba53640f47f652>, we get the results below (on an 8 core Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz),

```
# n_train=100000, n_test=1000, n_dim=1000

## Backend : pairwise_distances
n_jobs= 1 => 1 loop, best of 3: 2.63 s per loop
n_jobs= 2 => 1 loop, best of 3: 8.43 s per loop
n_jobs= 4 => 1 loop, best of 3: 9.15 s per loop
n_jobs= 8 => 1 loop, best of 3: 14.5 s per loop

## Backend : mmap_pairwise_distances
n_jobs= 1 => 1 loop, best of 3: 3.7 s per loop
n_jobs= 2 => 1 loop, best of 3: 3.43 s per loop
n_jobs= 4 => 1 loop, best of 3: 2.9 s per loop
n_jobs= 8 => 1 loop, best of 3: 2.83 s per loop
```

so while the compute time of pairwise_distances keeps increasing as you use more CPU cores, using a shared array to store the results at least shows some improvement with a larger number of CPU cores (though the scaling is not great and it still does not beat single core performance)...
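The gist is not reproduced here, but a minimal sketch of the shared-result-array approach, along the lines of the joblib example linked above (the function and variable names are illustrative, not necessarily those of the gist), could look like:

```python
import os
import tempfile

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def _fill_block(dist, X, Y, sl, metric, **kwds):
    # Each worker computes one block of columns and writes it directly into
    # the memmapped output instead of returning (and pickling) a new array.
    dist[:, sl] = pairwise_distances(X, Y[sl], metric, n_jobs=1, **kwds)


def mmap_pairwise_distances(X, Y, metric='euclidean', n_jobs=2, **kwds):
    tmp_dir = tempfile.mkdtemp()
    out = np.memmap(os.path.join(tmp_dir, 'dist.mmap'), dtype=np.float64,
                    mode='w+', shape=(X.shape[0], Y.shape[0]))
    Parallel(n_jobs=n_jobs)(
        delayed(_fill_block)(out, X, Y, sl, metric, **kwds)
        for sl in gen_even_slices(Y.shape[0], n_jobs))
    # Copy back to a regular in-memory array (cleanup of tmp_dir omitted).
    return np.array(out)
```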
I don't think you're gaining anything there by dumping X and Y: unless some optimisation is involved, there are still refs held by the calling code. However, I agree that creating memmapped storage for the result seems wise. That said, you're deleting the memmapped file before returning it, which will work (since it's an open file) on *NIX, but I'm not sure if it'll work on Windows.
You are right, it looks like joblib with the multiprocessing backend would do the automatic memmapping of the input arrays anyway if they are larger than 1 MB.

True, I should have copied the data to memory and closed the mmap file before deleting it...
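For reference, that 1 MB threshold corresponds to the default of Parallel's max_nbytes parameter, so the auto-memmapping behaviour can be made explicit while experimenting. A small sketch with illustrative sizes:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices

X = np.random.rand(5000, 200)   # ~8 MB, above the auto-memmapping threshold
Y = np.random.rand(1000, 200)

# With the multiprocessing backend, inputs larger than max_nbytes (default
# '1M') are dumped to disk and passed to the workers as read-only np.memmap
# objects instead of being pickled for every task.
blocks = Parallel(n_jobs=2, max_nbytes='1M', mmap_mode='r')(
    delayed(pairwise_distances)(X, Y[s], 'euclidean', n_jobs=1)
    for s in gen_even_slices(Y.shape[0], 2))
dist = np.hstack(blocks)
```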
@rth I wonder if this case would not run faster with the threading backend of joblib. Can you please re-run your benchmarks with this backend as a new option?
You should also run the tests on a larger machine with 16 cores to see the scalability profile when n_jobs is increased.
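A sketch of what that extra option could look like, using joblib's backend argument (the helper name is just for illustration):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def threaded_pairwise_distances(X, Y, metric='euclidean', n_jobs=2):
    # Threads share memory, so X and Y are neither copied nor memmapped,
    # and NumPy/BLAS release the GIL during the heavy computations.
    blocks = Parallel(n_jobs=n_jobs, backend='threading')(
        delayed(pairwise_distances)(X, Y[s], metric, n_jobs=1)
        for s in gen_even_slices(Y.shape[0], n_jobs))
    return np.hstack(blocks)
```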
@ogrisel I looked a bit more into this, and considered the influence of 3 parameters: the parallel backend (multiprocessing vs. threading), whether the result array is memmapped (mmap_result), and MKL_NUM_THREADS. The benchmark script can be found in this gist; I ran it on an Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz.

At least for this dataset of medium size (X array: 0.08 GB, Y array: 0.08 GB, result array: 0.8 GB), everything looks memory bound, and globally using more cores does not help. Will try testing scipy metrics (e.g. jaccard) next.
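Not the gist itself, but a rough sketch of how such a parameter grid might be driven (MKL_NUM_THREADS generally has to be set in the environment before NumPy first loads MKL, e.g. `MKL_NUM_THREADS=8 ipython pairwise_distances_benchmark.py`; the mmap_result option from the gist is left out here):

```python
import itertools
import os
import timeit

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices

print('MKL_NUM_THREADS =', os.environ.get('MKL_NUM_THREADS'))

X = np.random.rand(10000, 1000)
Y = np.random.rand(1000, 1000)

for backend, n_jobs in itertools.product(['multiprocessing', 'threading'],
                                         [1, 2, 4, 8]):
    def run():
        # One parallel euclidean pairwise_distances call for this setting.
        blocks = Parallel(n_jobs=n_jobs, backend=backend)(
            delayed(pairwise_distances)(X, Y[s], 'euclidean', n_jobs=1)
            for s in gen_even_slices(Y.shape[0], n_jobs))
        return np.hstack(blocks)

    best = min(timeit.repeat(run, number=1, repeat=3))
    print('backend={:16s} n_jobs={:2d} => {:6.2f} s'.format(backend, n_jobs, best))
```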
Parallelizing the jaccard distance works much better, because the computations take much longer and are not memory bound anymore. The choice of the parallel backend doesn't seem to matter in this case,

```
(sklearn-env) $ ipython pairwise_distances_benchmark.py False jaccard

# sparse=False, n_x=100000, n_y=1000, n_dim=1000
# X array: 0.8 GB, Y array 0.008 GB, result array 0.8 GB
# metric = jaccard

## {'mmap_result': False, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
n_jobs= 1 => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2 => 1 loop, best of 1: 2min 40s per loop
n_jobs= 4 => 1 loop, best of 1: 1min 25s per loop
n_jobs= 8 => 1 loop, best of 1: 44.4 s per loop
n_jobs= 16 => 1 loop, best of 1: 33.4 s per loop

## {'mmap_result': True, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
n_jobs= 1 => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2 => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4 => 1 loop, best of 1: 1min 18s per loop
n_jobs= 8 => 1 loop, best of 1: 49.6 s per loop
n_jobs= 16 => 1 loop, best of 1: 34.3 s per loop

## {'MKL_NUM_THREADS': 8, 'backend': 'threading'}
n_jobs= 1 => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2 => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4 => 1 loop, best of 1: 1min 17s per loop
n_jobs= 8 => 1 loop, best of 1: 42.4 s per loop
n_jobs= 16 => 1 loop, best of 1: 30.7 s per loop
```

Here is a sample script to estimate the relative compute time of different metrics,

```python
import numpy as np
from sklearn.metrics.pairwise import (pairwise_distances,
                                      PAIRWISE_DISTANCE_FUNCTIONS,
                                      PAIRWISE_BOOLEAN_FUNCTIONS)
from IPython import get_ipython

ipython = get_ipython()

n_x, n_y, n_dim = 10000, 1000, 1000

X = np.random.rand(n_x, n_dim)
Y = np.random.rand(n_y, n_dim)

print('\n# n_x={}, n_y={}, n_dim={}'.format(n_x, n_y, n_dim))

for metric in list(PAIRWISE_DISTANCE_FUNCTIONS.keys()) + PAIRWISE_BOOLEAN_FUNCTIONS:
    print(' * {:10} metric => '.format(metric), end='')
    ipython.magic("timeit -n 1 -r 1 pairwise_distances(X, Y, metric, n_jobs=1)")
```

which produces,

```
# n_x=10000, n_y=1000, n_dim=1000
 * precomputed    metric => 1 loop, best of 1: 6.22 ms per loop
 * cityblock      metric => 1 loop, best of 1: 9.98 s per loop
 * euclidean      metric => 1 loop, best of 1: 685 ms per loop
 * manhattan      metric => 1 loop, best of 1: 10.3 s per loop
 * l2             metric => 1 loop, best of 1: 721 ms per loop
 * cosine         metric => 1 loop, best of 1: 836 ms per loop
 * l1             metric => 1 loop, best of 1: 10.4 s per loop
 * dice           metric => 1 loop, best of 1: 20.2 s per loop
 * jaccard        metric => 1 loop, best of 1: 40.7 s per loop
 * kulsinski      metric => 1 loop, best of 1: 19.9 s per loop
 * matching       metric => 1 loop, best of 1: 7.08 s per loop
 * rogerstanimoto metric => 1 loop, best of 1: 17.4 s per loop
 * russellrao     metric => 1 loop, best of 1: 13.3 s per loop
 * sokalmichener  metric => 1 loop, best of 1: 17.4 s per loop
 * sokalsneath    metric => 1 loop, best of 1: 19.7 s per loop
 * yule           metric => 1 loop, best of 1: 25.4 s per loop
```

Maybe it could be worth raising a warning when pairwise_distances is used with the faster metrics ['euclidean', 'l2', 'cosine'] and n_jobs > 1, saying that using the latter option may lead to worse performance?
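A minimal sketch of what such a warning could look like (the metric list comes from the timings above; the helper and variable names are hypothetical, and the exact wording and placement would be up for discussion):

```python
import warnings

# Metrics that are cheap enough that the multiprocessing overhead typically
# outweighs any gain from n_jobs > 1 (see the timings above).
_CHEAP_METRICS = {'euclidean', 'l2', 'cosine'}


def _maybe_warn_about_n_jobs(metric, n_jobs):
    if n_jobs not in (None, 1) and metric in _CHEAP_METRICS:
        warnings.warn("Using n_jobs=%r with the fast %r metric may be slower "
                      "than n_jobs=1 because the computation is memory bound."
                      % (n_jobs, metric), UserWarning)
```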
Start with a comment in the documentation?
True, but which part of the docs? pairwise_distances calls that allow a user provided n_jobs parameter are used (directly or indirectly via NearestNeighbors) in,

- silhouette_score
- LocalOutlierFactor
- DBSCAN
- NearestNeighbors
- KNeighborsClassifier
- KNeighborsRegressor
- Isomap
- kneighbors_graph
- mean_shift
- LabelPropagation

which means that, if this issue is confirmed, using n_jobs > 1 may potentially render those estimators / functions (significantly) slower.

I imagine the impact of this is somewhat limited by the fact that NearestNeighbors would use the kd_tree or ball_tree algorithm by default (cf. #8213), even in cases when brute force with pairwise_distances would have been more efficient, but that's another issue..
That's a fair comment. I don't mind a warning if we're able to get a good idea of when to warn.
I had a look at this a few weeks ago and AFAICT the reason may be that sklearn.metrics.pairwise._parallel_pairwise does:

```python
ret = Parallel(n_jobs=n_jobs, verbose=0)(
    fd(X, Y[s], **kwds)
    for s in gen_even_slices(Y.shape[0], n_jobs))
```

For the auto memmapping to be more useful, I think it should do:

```python
ret = Parallel(n_jobs=n_jobs, verbose=0)(
    another_fd(X, Y, s, **kwds)
    for s in gen_even_slices(Y.shape[0], n_jobs))
```
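The idea being that every task then receives the full X and Y, which joblib can auto-memmap and share, and takes the Y[s] slice inside the worker instead of shipping a different chunk of Y to each task. A sketch of what another_fd could look like, with a hypothetical module-level helper standing in for scikit-learn's internals:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def _pairwise_on_slice(X, Y, s, metric, **kwds):
    # The slice is taken here, in the worker: the parent only ships X and Y
    # (memmapped once when large), not a per-task copy of Y[s].
    return pairwise_distances(X, Y[s], metric, n_jobs=1, **kwds)


def parallel_pairwise_sliced(X, Y, metric='euclidean', n_jobs=2, **kwds):
    ret = Parallel(n_jobs=n_jobs, verbose=0)(
        delayed(_pairwise_on_slice)(X, Y, s, metric, **kwds)
        for s in gen_even_slices(Y.shape[0], n_jobs))
    return np.hstack(ret)
```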
Yes. Maybe we should document this in joblib. Maybe it's time for joblib to have a set of examples that discuss such efficiency trade-offs.
Thanks @lesteve, I'll test that option. Also, would you have an opinion on why auto memmapping the output array (the results with mmap_result=True above) makes a difference?
Hmmm, not off the top of my head, I have to say.
As threading is not impacted by the GIL for those workloads, there is really no point in using multiprocessing + memmapping: that is just a waste of disk access and memory (multiprocessing + memmapping still does at least one memory copy that is not required with threading).

The fact that cosine / euclidean do not get a speed up with n_jobs > 1 is because those workloads are memory bound. It might be possible to find clever ways to block the data differently depending on the dimensions, but I think it's primarily BLAS level 1 and 2 operations that dominate, and those are memory bound.

So +1 for using the threading backend by default and issuing a warning when using n_jobs > 1 on memory bound metrics.
Thanks @ogrisel, I'll submit a PR. Need to check that threading performs as well or better for all metrics though. It is interesting that BLAS level 1 and 2 operations (dot products, matrix-vector multiplications, ...) are memory bound. I imagine this would also explain why parallelizing the predict method (e.g. in linear models) may not yield any speed up (related issues #7448, #7635)...
Yes indeed, I mean level 1 and 2: vec x vec and mat x vec are memory bound. Only mat x mat has an arithmetic intensity large enough to benefit from multiple cores. But it also depends on how it's "blocked". MKL and OpenBLAS do that for us though.
Actually, for cosine similarities and euclidean distances there is a level 3 GEMM involved. So at least this step should benefit from multicore without running into the memory bandwidth limit. But the BLAS implementation (MKL or OpenBLAS) should have no problem multithreading that part, and there is probably no point in trying to manually handle the parallelism at a coarser level for those GEMM-based distances.
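For context, the GEMM comes from expanding the squared distance, ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, where the cross term for all pairs is a single matrix-matrix product. A rough NumPy illustration (not scikit-learn's actual implementation, which additionally handles sparse input and other edge cases):

```python
import numpy as np

X = np.random.rand(1000, 200)
Y = np.random.rand(500, 200)

XX = (X * X).sum(axis=1)[:, np.newaxis]   # ||x_i||^2, shape (1000, 1)
YY = (Y * Y).sum(axis=1)[np.newaxis, :]   # ||y_j||^2, shape (1, 500)

# The cross term is a level 3 BLAS GEMM call, which MKL / OpenBLAS can
# already multithread internally.
dist_sq = XX - 2 * np.dot(X, Y.T) + YY
dist_sq = np.maximum(dist_sq, 0)          # guard against tiny negative values
dist = np.sqrt(dist_sq)
```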
Switching whether to use threading based on the choice of metric may make sense.
This is resolved by using a threading backend for pairwise_distances.