Euclidean pairwise_distances slower for n_jobs > 1 #8216
Parallelism is always fraught with this kind of issue. There are multiple
factors in whether it will succeed, and very often it won't. Again, you've
not varied n_features, which might be an important factor. n_jobs may be
more useful when the calculation is slow, i.e. non-euclidean, or over
sparse matrices or whatnot.
I can't remember if we experimented with multithreading for this; it might
help.
On 19 January 2017 at 22:48, Roman Yurchak wrote:

In a follow-up of issue #8213, it looks like using n_jobs > 1 in Euclidean pairwise_distances makes computations slower instead of speeding them up.

Steps to reproduce

```python
from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(99999)

n_dim = 200

for n_train, n_test in [(1000, 100000),
                        (10000, 10000),
                        (100000, 1000)]:
    print('\n# n_train={}, n_test={}, n_dim={}\n'.format(
        n_train, n_test, n_dim))
    X_train = np.random.rand(n_train, n_dim)
    X_test = np.random.rand(n_test, n_dim)
    for n_jobs in [1, 2]:
        print('n_jobs=', n_jobs, ' => ', end='')
        %timeit pairwise_distances(X_train, X_test, 'euclidean', n_jobs=n_jobs, squared=True)
```

which on a 2 core CPU returns,

```
# n_train=1000, n_test=100000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.92 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.95 s per loop

# n_train=10000, n_test=10000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 1.89 s per loop
n_jobs= 2 => 1 loop, best of 3: 4.74 s per loop

# n_train=100000, n_test=1000, n_dim=200
n_jobs= 1 => 1 loop, best of 3: 2 s per loop
n_jobs= 2 => 1 loop, best of 3: 5.6 s per loop
```

While for small datasets it would make sense that parallel processing does not improve performance due to the multiprocessing overhead, these are by no means small datasets. And the compute time does not decrease when using e.g. n_jobs=4 on a 4 core CPU.

This also holds for other numbers of dimensions,

n_dim=10

```
# n_train=1000, n_test=100000, n_dim=10
n_jobs= 1 => 1 loop, best of 3: 873 ms per loop
n_jobs= 2 => 1 loop, best of 3: 4.25 s per loop
```

n_dim=1000

```
# n_train=1000, n_test=100000, n_dim=1000
n_jobs= 1 => 1 loop, best of 3: 6.56 s per loop
n_jobs= 2 => 1 loop, best of 3: 8.56 s per loop
```

Running benchmarks/bench_plot_parallel_pairwise.py also yields similar results,

[image: untitled] <https://cloud.githubusercontent.com/assets/630936/22105598/240d33bc-de45-11e6-8b68-3ce97b6d7e2d.png>

This might affect a number of estimators / metrics where pairwise_distances is used.

Versions

```
Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1
```

I also get similar results with scikit-learn 0.17.1.
Yes, I know, I tried to do a quick parallelization of it,

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances


def _naive_parallel_pairwise_distances(X, Y, metric, batch_size=100,
                                        n_jobs=1, squared=True):
    # Split X into batches of rows and compute each block in a separate job.
    queue = []
    n_samples = X.shape[0]
    for k in range(n_samples // batch_size + int(n_samples % batch_size != 0)):
        mslice = slice(k * batch_size, min((k + 1) * batch_size, n_samples))
        X_sl = X[mslice, :]
        queue.append(delayed(pairwise_distances)(X_sl, Y, metric,
                                                 n_jobs=1, squared=squared))
    res = Parallel(n_jobs=n_jobs)(queue)
    # Stack the row blocks back into an (n_samples_X, n_samples_Y) matrix.
    return np.vstack(res)
```

and this yields comparable performance, so it must be as you say.
It looks like the cost of memory copies is the issue here. For instance, if we rewrite the parallel calculations to save the results to a shared array (adapted from this joblib example <https://pythonhosted.org/joblib/parallel.html#writing-parallel-computation-results-in-shared-memory>) in this gist <https://gist.github.com/rth/2f904eaa165631cd0dba53640f47f652>, we get the results below (on an 8 core Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz),

```
# n_train=100000, n_test=1000, n_dim=1000

## Backend : pairwise_distances
n_jobs= 1 => 1 loop, best of 3: 2.63 s per loop
n_jobs= 2 => 1 loop, best of 3: 8.43 s per loop
n_jobs= 4 => 1 loop, best of 3: 9.15 s per loop
n_jobs= 8 => 1 loop, best of 3: 14.5 s per loop

## Backend : mmap_pairwise_distances
n_jobs= 1 => 1 loop, best of 3: 3.7 s per loop
n_jobs= 2 => 1 loop, best of 3: 3.43 s per loop
n_jobs= 4 => 1 loop, best of 3: 2.9 s per loop
n_jobs= 8 => 1 loop, best of 3: 2.83 s per loop
```

so while the compute time of pairwise_distances keeps increasing as you use more CPU cores, using a shared array to store the results at least shows some improvement with a larger number of CPU cores (though the scaling is not great and it still does not beat single core performance)...
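The gist is not reproduced here, but a minimal sketch of the shared-result-array approach, along the lines of the joblib example linked above (the function and variable names are illustrative, not necessarily those of the gist), could look like:

```python
import os
import tempfile

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def _fill_block(dist, X, Y, sl, metric, **kwds):
    # Each worker computes one block of columns and writes it directly into
    # the memmapped output instead of returning (and pickling) a new array.
    dist[:, sl] = pairwise_distances(X, Y[sl], metric, n_jobs=1, **kwds)


def mmap_pairwise_distances(X, Y, metric='euclidean', n_jobs=2, **kwds):
    tmp_dir = tempfile.mkdtemp()
    out = np.memmap(os.path.join(tmp_dir, 'dist.mmap'), dtype=np.float64,
                    mode='w+', shape=(X.shape[0], Y.shape[0]))
    Parallel(n_jobs=n_jobs)(
        delayed(_fill_block)(out, X, Y, sl, metric, **kwds)
        for sl in gen_even_slices(Y.shape[0], n_jobs))
    # Copy back to a regular in-memory array (cleanup of tmp_dir omitted).
    return np.array(out)
```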
I don't think you're gaining anything there by dumping X and Y: unless some optimisation is involved, there are still refs held by the calling code. However, I agree that creating memmapped storage for the result seems wise. That said, you're deleting the memmapped file before returning it, which will work (since it's an open file) on *NIX, but I'm not sure if it'll work on Windows.
You are right, it looks like joblib with the multiprocessing backend would do the automatic memmapping of the input arrays anyway if they are larger than 1 MB.

True, I should have copied the data to memory and closed the mmap file before deleting it...
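For reference, that 1 MB threshold corresponds to the default of Parallel's max_nbytes parameter, so the auto-memmapping behaviour can be made explicit while experimenting. A small sketch with illustrative sizes:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices

X = np.random.rand(5000, 200)   # ~8 MB, above the auto-memmapping threshold
Y = np.random.rand(1000, 200)

# With the multiprocessing backend, inputs larger than max_nbytes (default
# '1M') are dumped to disk and passed to the workers as read-only np.memmap
# objects instead of being pickled for every task.
blocks = Parallel(n_jobs=2, max_nbytes='1M', mmap_mode='r')(
    delayed(pairwise_distances)(X, Y[s], 'euclidean', n_jobs=1)
    for s in gen_even_slices(Y.shape[0], 2))
dist = np.hstack(blocks)
```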
@rth I wonder if this case would not run faster with the threading backend of joblib. Can you please re-run your benchmarks with this backend as a new option?
You should also run the tests on a larger machine with 16 cores to see the scalability profile when n_jobs is increased.
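A sketch of what that extra option could look like, using joblib's backend argument (the helper name is just for illustration):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def threaded_pairwise_distances(X, Y, metric='euclidean', n_jobs=2):
    # Threads share memory, so X and Y are neither copied nor memmapped,
    # and NumPy/BLAS release the GIL during the heavy computations.
    blocks = Parallel(n_jobs=n_jobs, backend='threading')(
        delayed(pairwise_distances)(X, Y[s], metric, n_jobs=1)
        for s in gen_even_slices(Y.shape[0], n_jobs))
    return np.hstack(blocks)
```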
@ogrisel I looked a bit more into this, and considered the influence of 3 parameters: the parallel backend (multiprocessing vs. threading), whether the result array is memmapped (mmap_result), and MKL_NUM_THREADS. The benchmark script can be found in this gist; I ran it on an Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz.

At least for this dataset of medium size (X array: 0.08 GB, Y array: 0.08 GB, result array: 0.8 GB), everything looks memory bound, and globally using more cores does not help. Will try testing scipy metrics (e.g. jaccard) next.
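Not the gist itself, but a rough sketch of how such a parameter grid might be driven (MKL_NUM_THREADS generally has to be set in the environment before NumPy first loads MKL, e.g. `MKL_NUM_THREADS=8 ipython pairwise_distances_benchmark.py`; the mmap_result option from the gist is left out here):

```python
import itertools
import os
import timeit

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices

print('MKL_NUM_THREADS =', os.environ.get('MKL_NUM_THREADS'))

X = np.random.rand(10000, 1000)
Y = np.random.rand(1000, 1000)

for backend, n_jobs in itertools.product(['multiprocessing', 'threading'],
                                         [1, 2, 4, 8]):
    def run():
        # One parallel euclidean pairwise_distances call for this setting.
        blocks = Parallel(n_jobs=n_jobs, backend=backend)(
            delayed(pairwise_distances)(X, Y[s], 'euclidean', n_jobs=1)
            for s in gen_even_slices(Y.shape[0], n_jobs))
        return np.hstack(blocks)

    best = min(timeit.repeat(run, number=1, repeat=3))
    print('backend={:16s} n_jobs={:2d} => {:6.2f} s'.format(backend, n_jobs, best))
```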
Parallelizing the jaccard distance works much better, because the computations take much longer and are not memory bound anymore. The choice of the parallel backend doesn't seem to matter in this case,

```
(sklearn-env) $ ipython pairwise_distances_benchmark.py False jaccard

# sparse=False, n_x=100000, n_y=1000, n_dim=1000
# X array: 0.8 GB, Y array 0.008 GB, result array 0.8 GB
# metric = jaccard

## {'mmap_result': False, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
n_jobs= 1 => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2 => 1 loop, best of 1: 2min 40s per loop
n_jobs= 4 => 1 loop, best of 1: 1min 25s per loop
n_jobs= 8 => 1 loop, best of 1: 44.4 s per loop
n_jobs= 16 => 1 loop, best of 1: 33.4 s per loop

## {'mmap_result': True, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
n_jobs= 1 => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2 => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4 => 1 loop, best of 1: 1min 18s per loop
n_jobs= 8 => 1 loop, best of 1: 49.6 s per loop
n_jobs= 16 => 1 loop, best of 1: 34.3 s per loop

## {'MKL_NUM_THREADS': 8, 'backend': 'threading'}
n_jobs= 1 => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2 => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4 => 1 loop, best of 1: 1min 17s per loop
n_jobs= 8 => 1 loop, best of 1: 42.4 s per loop
n_jobs= 16 => 1 loop, best of 1: 30.7 s per loop
```

Here is a sample script to estimate the relative compute time of different metrics,

```python
import numpy as np
from sklearn.metrics.pairwise import (pairwise_distances,
                                      PAIRWISE_DISTANCE_FUNCTIONS,
                                      PAIRWISE_BOOLEAN_FUNCTIONS)
from IPython import get_ipython

ipython = get_ipython()

n_x, n_y, n_dim = 10000, 1000, 1000

X = np.random.rand(n_x, n_dim)
Y = np.random.rand(n_y, n_dim)

print('\n# n_x={}, n_y={}, n_dim={}'.format(n_x, n_y, n_dim))

for metric in list(PAIRWISE_DISTANCE_FUNCTIONS.keys()) + PAIRWISE_BOOLEAN_FUNCTIONS:
    print(' * {:10} metric => '.format(metric), end='')
    ipython.magic("timeit -n 1 -r 1 pairwise_distances(X, Y, metric, n_jobs=1)")
```

which produces,

```
# n_x=10000, n_y=1000, n_dim=1000
 * precomputed    metric => 1 loop, best of 1: 6.22 ms per loop
 * cityblock      metric => 1 loop, best of 1: 9.98 s per loop
 * euclidean      metric => 1 loop, best of 1: 685 ms per loop
 * manhattan      metric => 1 loop, best of 1: 10.3 s per loop
 * l2             metric => 1 loop, best of 1: 721 ms per loop
 * cosine         metric => 1 loop, best of 1: 836 ms per loop
 * l1             metric => 1 loop, best of 1: 10.4 s per loop
 * dice           metric => 1 loop, best of 1: 20.2 s per loop
 * jaccard        metric => 1 loop, best of 1: 40.7 s per loop
 * kulsinski      metric => 1 loop, best of 1: 19.9 s per loop
 * matching       metric => 1 loop, best of 1: 7.08 s per loop
 * rogerstanimoto metric => 1 loop, best of 1: 17.4 s per loop
 * russellrao     metric => 1 loop, best of 1: 13.3 s per loop
 * sokalmichener  metric => 1 loop, best of 1: 17.4 s per loop
 * sokalsneath    metric => 1 loop, best of 1: 19.7 s per loop
 * yule           metric => 1 loop, best of 1: 25.4 s per loop
```

Maybe it could be worth raising a warning when pairwise_distances is used with the faster metrics ['euclidean', 'l2', 'cosine'] and n_jobs > 1, saying that using the latter option may lead to worse performance?
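A minimal sketch of what such a warning could look like (the metric list comes from the timings above; the helper and variable names are hypothetical, and the exact wording and placement would be up for discussion):

```python
import warnings

# Metrics that are cheap enough that the multiprocessing overhead typically
# outweighs any gain from n_jobs > 1 (see the timings above).
_CHEAP_METRICS = {'euclidean', 'l2', 'cosine'}


def _maybe_warn_about_n_jobs(metric, n_jobs):
    if n_jobs not in (None, 1) and metric in _CHEAP_METRICS:
        warnings.warn("Using n_jobs=%r with the fast %r metric may be slower "
                      "than n_jobs=1 because the computation is memory bound."
                      % (n_jobs, metric), UserWarning)
```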
Start with a comment in the documentation?
True, but which part of the docs? pairwise_distances calls that allow a user provided n_jobs parameter are used (directly or indirectly via NearestNeighbors) in,

- silhouette_score
- LocalOutlierFactor
- DBSCAN
- NearestNeighbors
- KNeighborsClassifier
- KNeighborsRegressor
- Isomap
- kneighbors_graph
- mean_shift
- LabelPropagation

which means that, if this issue is confirmed, using n_jobs > 1 may potentially render those estimators / functions (significantly) slower.

I imagine the impact of this is somewhat limited by the fact that NearestNeighbors would use the kd_tree or ball_tree algorithm by default (cf. #8213), even in cases when brute force with pairwise_distances would have been more efficient, but that's another issue..
That's a fair comment. I don't mind a warning if we're able to get a good idea of when to warn.
I had a look at this a few weeks ago and AFAICT the reason may be that sklearn.metrics.pairwise._parallel_pairwise does:

```python
ret = Parallel(n_jobs=n_jobs, verbose=0)(
    fd(X, Y[s], **kwds)
    for s in gen_even_slices(Y.shape[0], n_jobs))
```

For the auto memmapping to be more useful, I think it should do:

```python
ret = Parallel(n_jobs=n_jobs, verbose=0)(
    another_fd(X, Y, s, **kwds)
    for s in gen_even_slices(Y.shape[0], n_jobs))
```
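The idea being that every task then receives the full X and Y, which joblib can auto-memmap and share, and takes the Y[s] slice inside the worker instead of shipping a different chunk of Y to each task. A sketch of what another_fd could look like, with a hypothetical module-level helper standing in for scikit-learn's internals:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances
from sklearn.utils import gen_even_slices


def _pairwise_on_slice(X, Y, s, metric, **kwds):
    # The slice is taken here, in the worker: the parent only ships X and Y
    # (memmapped once when large), not a per-task copy of Y[s].
    return pairwise_distances(X, Y[s], metric, n_jobs=1, **kwds)


def parallel_pairwise_sliced(X, Y, metric='euclidean', n_jobs=2, **kwds):
    ret = Parallel(n_jobs=n_jobs, verbose=0)(
        delayed(_pairwise_on_slice)(X, Y, s, metric, **kwds)
        for s in gen_even_slices(Y.shape[0], n_jobs))
    return np.hstack(ret)
```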
Yes. Maybe we should document this in joblib. Maybe it's time for joblib to have a set of examples that discuss such efficiency trade-offs.
Thanks @lesteve, I'll test that option. Also, would you have an opinion on why auto memmapping the output array (the results with mmap_result=True above) makes a difference?
Hmmm, not off the top of my head, I have to say.
As threading is not impacted by the GIL for those workloads, there is really no point in using multiprocessing + memmapping: that is just a waste of disk access and memory (multiprocessing + memmapping still does at least one memory copy that is not required with threading).

The fact that cosine / euclidean do not get a speed up with n_jobs > 1 is because those workloads are memory bound. It might be possible to find clever ways to block the data differently depending on the dimensions, but I think it's primarily BLAS level 1 and 2 operations that dominate, and those are memory bound.

So +1 for using the threading backend by default and issuing a warning when using n_jobs > 1 on memory bound metrics.
Thanks @ogrisel, I'll submit a PR. Need to check that threading performs as well or better for all metrics though. It is interesting that BLAS level 1 and 2 operations (dot products, matrix-vector multiplications, ...) are memory bound. I imagine this would also explain why parallelizing the predict method (e.g. in linear models) may not yield any speed up (related issues #7448, #7635)...
Yes indeed, I mean level 1 and 2: vec x vec and mat x vec are memory bound. Only mat x mat has an arithmetic intensity large enough to benefit from multiple cores. But it also depends on how it's "blocked". MKL and OpenBLAS do that for us though.
Actually, for cosine similarities and euclidean distances there is a level 3 GEMM involved. So at least this step should benefit from multicore without running into the memory bandwidth limit. But the BLAS implementation (MKL or OpenBLAS) should have no problem multithreading that part, and there is probably no point in trying to manually handle the parallelism at a coarser level for those GEMM-based distances.
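For context, the GEMM comes from expanding the squared distance, ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, where the cross term for all pairs is a single matrix-matrix product. A rough NumPy illustration (not scikit-learn's actual implementation, which additionally handles sparse input and other edge cases):

```python
import numpy as np

X = np.random.rand(1000, 200)
Y = np.random.rand(500, 200)

XX = (X * X).sum(axis=1)[:, np.newaxis]   # ||x_i||^2, shape (1000, 1)
YY = (Y * Y).sum(axis=1)[np.newaxis, :]   # ||y_j||^2, shape (1, 500)

# The cross term is a level 3 BLAS GEMM call, which MKL / OpenBLAS can
# already multithread internally.
dist_sq = XX - 2 * np.dot(X, Y.T) + YY
dist_sq = np.maximum(dist_sq, 0)          # guard against tiny negative values
dist = np.sqrt(dist_sq)
```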
Switching whether to use threading based on the choice of metric may make sense.
This is resolved by using a threading backend for pairwise_distances.