Skip to content

Euclidean pairwise_distances slower for n_jobs > 1 #8216

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rth opened this issue Jan 19, 2017 · 22 comments
Closed

Euclidean pairwise_distances slower for n_jobs > 1 #8216

rth opened this issue Jan 19, 2017 · 22 comments

Comments

@rth
Copy link
Member

rth commented Jan 19, 2017

In a followup of issue #8213 , it looks like using n_jobs > 1 in Eucledian pairwise_distances makes computations slower instead of speeding them up.

Steps to reproduce

from sklearn.metrics import pairwise_distances
import numpy as np

np.random.seed(99999)
n_dim = 200

for n_train, n_test in [(1000, 100000),
                        (10000, 10000),
                        (100000, 1000)]:
    print('\n# n_train={}, n_test={}, n_dim={}\n'.format(
                         n_train, n_test, n_dim))

    X_train = np.random.rand(n_train, n_dim)
    X_test = np.random.rand(n_test, n_dim)

    for n_jobs in [1, 2]:
        print('n_jobs=', n_jobs, ' => ', end='')

        %timeit pairwise_distances(X_train, X_test, 'euclidean',
                                   n_jobs=n_jobs, squared=True)

which on a 2 core CPU returns,

# n_train=1000, n_test=100000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 1.92 s per loop
n_jobs= 2  => 1 loop, best of 3: 4.95 s per loop

# n_train=10000, n_test=10000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 1.89 s per loop
n_jobs= 2  => 1 loop, best of 3: 4.74 s per loop

# n_train=100000, n_test=1000, n_dim=200

n_jobs= 1  => 1 loop, best of 3: 2 s per loop
n_jobs= 2  => 1 loop, best of 3: 5.6 s per loop

While for small datasets, it would make sens that the parallel processing would not improve performance due to the multiprocessing etc overhead, this is by no mean a small dataset. And the compute time does not decrease when using e.g. n_jobs=4 on a 4 core CPU.

This also holds for other number of dimensions,
n_dim=10

# n_train=1000, n_test=100000, n_dim=10

n_jobs= 1  => 1 loop, best of 3: 873 ms per loop
n_jobs= 2  => 1 loop, best of 3: 4.25 s per loop

n_dim=1000

# n_train=1000, n_test=100000, n_dim=1000

n_jobs= 1  => 1 loop, best of 3: 6.56 s per loop
n_jobs= 2  => 1 loop, best of 3: 8.56 s per loop

Running benchmarks/bench_plot_parallel_pairwise.py also yields similar results,
untitled

This might affect a number of estimators / metrics where pairwise_distances is used.

Versions

Linux-4.6.0-gentoo-x86_64-Intel-R-_Core-TM-_i5-6200U_CPU_@_2.30GHz-with-gentoo-2.3
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
Scikit-Learn 0.18.1

I also get similar results with scikit-learn 0.17.1

@jnothman
Copy link
Member

jnothman commented Jan 20, 2017 via email

@rth
Copy link
Member Author

rth commented Jan 20, 2017

Parallelism is always fraught with this kind of issue. There are multiple factors in whether it will succeed, and very often it won't. [..] n_jobs may be more useful when the calculation is slow, i.e. non-euclidean, or over sparse matrices or whatnot.

Yes, I know, I tried to do a quick parallellization of pariwise_distances(..., n_jobs=1),

def _naive_parallel_pairwise_distances(X, Y, metric, batch_size=100, n_jobs=1, squared=True):
    queue = []
    n_samples = X.shape[0]
    for k in range(n_samples//batch_size + int(n_samples % batch_size != 0)):
        mslice = slice(k*batch_size, min((k+1)*batch_size, n_samples))
        X_sl = X[mslice, :]
        queue.append(delayed(pairwise_distances)(X_sl, Y, metric,
                                        n_jobs=1, squared=squared))
    res = Parallel(n_jobs=n_jobs)(queue)
    return np.hstack(res)

and this yields comparable performance, so it must be as you say that n_jobs would better improve non-eucledian distances etc that take longer to compute.

Again, you've not varied n_features, which might be an important factor.

I have tried n_features in [10, 200, 1000] in my original post above (sorry called it n_dim), in all cases n_jobs > 1 makes things slower. Including for large arrays X [100000, 1000], Y [1000, 1000] which are reaching the limit of what I can compute with 8 GB RAM.

The question is that if n_jobs always makes pairwise euclidean distances slower, no matter n_samples, n_features (as that's what all the benchmarks suggest so far), then some warning should probably be printed to the user if n_jobs > 1 is used? Particularly since euclidean distance is the default (and probably the most frequent used) distance metric.

Also when benchmarks/bench_plot_parallel_pairwise.py was added by @mblondel in 3b8f54e , I imagine it aimed to demonstrate that using n_jobs > 1 could speed up Eucledian / RBF distance calculations (couldn't find any plots / PR from that time). When I run this benchmark now, it illustrates that using n_jobs > 1 makes things slower (cf images in the original post above), which doesn't really make sens to show on a benchmark...

@rth
Copy link
Member Author

rth commented Jan 20, 2017

It looks like the cost of memory copies is the issue here. For instance, if we rewrite the parallel calculations to save the results to a shared array (adapted from this joblib example) in this gist, we get the results below (on a 8 core Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz),

# n_train=100000, n_test=1000, n_dim=1000

## Backend :  pairwise_distances
n_jobs= 1  => 1 loop, best of 3: 2.63 s per loop
n_jobs= 2  => 1 loop, best of 3: 8.43 s per loop
n_jobs= 4  => 1 loop, best of 3: 9.15 s per loop
n_jobs= 8  => 1 loop, best of 3: 14.5 s per loop
## Backend :  mmap_pairwise_distances
n_jobs= 1  => 1 loop, best of 3: 3.7 s per loop
n_jobs= 2  => 1 loop, best of 3: 3.43 s per loop
n_jobs= 4  => 1 loop, best of 3: 2.9 s per loop
n_jobs= 8  => 1 loop, best of 3: 2.83 s per loop

so while the compute time of pairwise_distances keeps increasing as you use more CPU cores, using a shared array to store the results allows to at least to see some improvement with larger number of CPU cores (though the scaling is not great and it still does not beat single core performance)...

@jnothman
Copy link
Member

jnothman commented Jan 21, 2017 via email

@rth
Copy link
Member Author

rth commented Jan 23, 2017

I don't think you're gaining anything there by dumping X and Y: unless some optimisation is involved, there are still refs held by the calling code.

You are right, looks like joblib with the multiprocesssing backend would do the automatic memapping of input array anyway if they are larger than 1MB.

However, I agree that creating memmapped storage for the result seems wise. However, you're deleting the memmapped file before returning it, which will work (since it's an open file) on *NIX, but I'm not sure if it'll work on Windows.

True, I should have copied data to memory and closed the mmap file before deleting it...

@ogrisel
Copy link
Member

ogrisel commented Feb 8, 2017

@rth I wonder if this case would not run faster with the threading backend of joblib. Can you please re-run your benchmarks with this backend as a new option?

@ogrisel
Copy link
Member

ogrisel commented Feb 8, 2017

You should also run the tests on a larger machines with 16 cores to see the scalability profile when n_jobs is increased.

@rth
Copy link
Member Author

rth commented Feb 8, 2017

@ogrisel I looked a bit more into this, and considered the influence of 3 parameters,

  • parallel backend in ['multiprocessing', 'threading']
  • memmapping the output array or not
  • setting MKL_NUM_THREAD in [1, 8] to disable multithreading in numpy (when using default conda install)

this benchmarks the _parallel_pairwise functio, run for

  • dense and sparse CSR arrays
  • metrics in ['euclidean', 'cosine', 'l1'] metrics.
  • for input array sizes (n_x, n_y, n_dim) in
    [(100000, 1000, 1000),
     (10000, 10000, 1000),
     (10000, 10000, 10),
     (10000, 10000, 10000)]:
    
    for sparse arrays the n_dim is multiplied by 10 to account for lower density.

The benchmark script can be found in this gist. Here are the results for the runs, using an Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz),

Metruc\is_sparse_csr Dense Sparse CSR
euclidean http://pastebin.com/Wvn40Arh http://pastebin.com/4cMBr2xV
cosine http://pastebin.com/wX0pZiMQ
l1 in progress..

At least for this dataset of medium size (X array: 0.08 GB, Y array 0.08 GB, result array 0.8 GB), the conclusions appear to be that,

  • for all possible combination of parameters, using n_jobs=1 yields the best performance
  • the multiprocessing backend decreases performance significantly with n_jobs in this use case
  • the threading backend at best improves performance by 10-20% and doesn't outperform n_jobs=1 significantly
    * setting the number of numpy threads with MKL_NUM_THREAD doesn't seem to have an impact
    * pre-allocating the result array, and providing it to joblib so it would be memmaped appears to work worse than doing the same thing but creating the mmemaped file by hand (see comment above).

Looks like everything is memory bound, and globally using more cores does not help. Will try testing scipy metrics (e.g. jaccard) next..

@rth
Copy link
Member Author

rth commented Feb 8, 2017

Parallelizing the jacckard distance works much better, because the computations take much longer and are not memory bound anymore. The choice of the parallel backend doesn't seem to matter in this case,

(sklearn-env) [ec2-user@ip-172-31-11-89 ~]$ ipython pairwise_distances_benchmark.py False jaccard
# sparse=False, n_x=100000, n_y=1000, n_dim=1000
#   X array: 0.8 GB, Y array 0.008 GB, result array 0.8 GB
# metric = jaccard

##  {'mmap_result': False, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
 n_jobs= 1  => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2  => 1 loop, best of 1: 2min 40s per loop
n_jobs= 4  => 1 loop, best of 1: 1min 25s per loop
n_jobs= 8  => 1 loop, best of 1: 44.4 s per loop
n_jobs= 16  => 1 loop, best of 1: 33.4 s per loop

##  {'mmap_result': True, 'backend': 'multiprocessing', 'MKL_NUM_THREADS': 8}
n_jobs= 1  => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2  => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4  => 1 loop, best of 1: 1min 18s per loop
n_jobs= 8  => 1 loop, best of 1: 49.6 s per loop
n_jobs= 16  => 1 loop, best of 1: 34.3 s per loop

##  {'MKL_NUM_THREADS': 8, 'backend': 'threading'}                                                                                                          
 n_jobs= 1  => 1 loop, best of 1: 5min 7s per loop
n_jobs= 2  => 1 loop, best of 1: 2min 34s per loop
n_jobs= 4  => 1 loop, best of 1: 1min 17s per loop
n_jobs= 8  => 1 loop, best of 1: 42.4 s per loop
n_jobs= 16  => 1 loop, best of 1: 30.7 s per loop

Here is a sample script to estimate relative compute time of different metrics,

import numpy as np
from sklearn.metrics.pairwise import (pairwise_distances, PAIRWISE_DISTANCE_FUNCTIONS,
                                      PAIRWISE_BOOLEAN_FUNCTIONS)

from IPython import get_ipython
ipython = get_ipython()

n_x, n_y, n_dim = 10000, 1000, 1000

X = np.random.rand(n_x, n_dim)
Y = np.random.rand(n_y, n_dim)

print('\n# n_x={}, n_y={}, n_dim={}'.format(n_x, n_y, n_dim))

for metric in list(PAIRWISE_DISTANCE_FUNCTIONS.keys()) + PAIRWISE_BOOLEAN_FUNCTIONS:

    print(' * {:10} metric => '.format(metric), end='')
    ipython.magic("timeit -n 1 -r 1 pairwise_distances(X, Y, metric, n_jobs=1)")

which produces,

# n_x=10000, n_y=1000, n_dim=1000
 * precomputed metric => 1 loop, best of 1: 6.22 ms per loop
 * cityblock  metric => 1 loop, best of 1: 9.98 s per loop
 * euclidean  metric => 1 loop, best of 1: 685 ms per loop
 * manhattan  metric => 1 loop, best of 1: 10.3 s per loop
 * l2         metric => 1 loop, best of 1: 721 ms per loop
 * cosine     metric => 1 loop, best of 1: 836 ms per loop
 * l1         metric => 1 loop, best of 1: 10.4 s per loop
 * dice       metric => 1 loop, best of 1: 20.2 s per loop
 * jaccard    metric => 1 loop, best of 1: 40.7 s per loop
 * kulsinski  metric => 1 loop, best of 1: 19.9 s per loop
 * matching   metric => 1 loop, best of 1: 7.08 s per loop
 * rogerstanimoto metric => 1 loop, best of 1: 17.4 s per loop
 * russellrao metric => 1 loop, best of 1: 13.3 s per loop
 * sokalmichener metric => 1 loop, best of 1: 17.4 s per loop
 * sokalsneath metric => 1 loop, best of 1: 19.7 s per loop
 * yule       metric => 1 loop, best of 1: 25.4 s per loop

maybe it could be worth raising a warning when the pairwise_distances is used with faster metrics ['euclidean', 'l2', 'cosine'] and n_jobs > 1, saying that using the latter option may lead to worse performance?

@jnothman
Copy link
Member

jnothman commented Feb 8, 2017 via email

@rth
Copy link
Member Author

rth commented Feb 8, 2017

Start with a comment in the documentation?

True, but which part of the docs? pairwise_distances that allow a user provided n_jobs parameter are used (directly or indirectly via NearestNeighbors) in,

  • silhouette_score
  • LocalOutlierFactor
  • DBSCAN
  • NearestNeighbors
  • KNeighborsClassifier
  • KNeighborsRegressor
  • Isomap
  • kneighbors_graph
  • mean_shift
  • LabelPropagation

which means that, if this issue is confirmed, using n_jobs > 1 may potentially render those estimators / functions (significantly) slower.

I imagine the impact of this is somewhat limited by the fact that NearestNeighbors would use by default kd_tree or ball_tree algorithm (cf. #8213) , even in cases when brute force with pairwise_distances would have been more efficient, but that's another issue..

@jnothman
Copy link
Member

jnothman commented Feb 9, 2017 via email

@lesteve
Copy link
Member

lesteve commented Feb 9, 2017

pre-allocating the result array, and providing it to joblib so it would be memmaped appears to work worse than doing the same thing but creating the mmemaped file by hand (see comment above).

I had a look at this a few weeks ago and AFAICT the reason may be that sklearn.metrics.pairwise._parallel_pairwise does:

    ret = Parallel(n_jobs=n_jobs, verbose=0)(
        fd(X, Y[s], **kwds)
        for s in gen_even_slices(Y.shape[0], n_jobs))

For the auto memmapping to be more useful I think it should do:

    ret = Parallel(n_jobs=n_jobs, verbose=0)(
        another_fd(X, Y, s, **kwds)
        for s in gen_even_slices(Y.shape[0], n_jobs))

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Feb 9, 2017 via email

@rth
Copy link
Member Author

rth commented Feb 9, 2017

Thanks @lesteve , I'll test that option.

Also would you have an opinion why auto memmapping the output array (results with 'mmap_result': True here ) has significanly worse performance in this case than creating the memmap file by hand and providing it to joblib (results here). Though in the latter case I was using squared=True option, so maybe that has an impact. Or maybe I missed something else...

@lesteve
Copy link
Member

lesteve commented Feb 9, 2017

Also would you have an opinion why auto memmapping the output array (results with 'mmap_result': True here ) has significanly worse performance in this case than creating the memmap file by hand and providing it to joblib (results here)

Hmmm not off the top of my head I have to say.

@ogrisel
Copy link
Member

ogrisel commented Feb 9, 2017

As threading is not impacted by the GIL for those workloads, there is really no point in using multiprocessing + memmaping: that is just a waste of disk access and memory (multiprocessing + memmaping still does at least one memory copy that is not required when threading).

The fact that cosine / euclidean do not get a speed up with n_jobs=1 is because those workloads are memory bound. It might be possible to find clever ways to block the data differently depending on the dimensions but I think it's primarily BLAS level 1 and 2 operations that dominate and those are memory bound.

So +1 for using the threading backend by default and issueing a warning when using n_jobs>1 on memory bound metrics.

@rth
Copy link
Member Author

rth commented Feb 9, 2017

Thanks @ogrisel I'll submit a PR. Need to check that threading performs as well or better for all metrics though..

It is interesting that BLAS level 1 and level 2 operations (dot product, matrix vector multiplications, ..) are memory bound. I imagine this would also explain why parallelizing the predict method (e.g. in linear models) may not yield any speed up (related issues #7448, #7635)...

@ogrisel
Copy link
Member

ogrisel commented Feb 9, 2017

Yes indeed I mean level 1 and 2: vec x vec and mat x vec are memory bound. Only vec x vec is has an algorithmic intensity large enough to benefit from multi cores. But it also depends on how it's "blocked". MKL and OpenBLAS do it for us though.

@ogrisel
Copy link
Member

ogrisel commented Feb 10, 2017

Actually for cosine similarities and and euclidean distances there is a level 3 GEMM involved. So at least this step should benefit from multicore without running into the memory bandwidth limit. But the BLAS implementation (MKL or OpenBLAS) should have no problem to multithread that part and there is probably no point in trying to manually handle the parallelism at a coarser level for those GEMM based distances.

@jnothman
Copy link
Member

Switching whether to use threading bases on the choice of metric may make sense

@rth
Copy link
Member Author

rth commented Mar 1, 2019

This is resolved by using a threading backend for pariwise_distances calculations in #13310

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants