[MRG+1] ENH: Parallelize kneighbors method with multithreading #4009


Merged: 1 commit merged into scikit-learn:master on Aug 30, 2015

Conversation

@nmayorov (Contributor)

I used the simplest approach with the 'threading' joblib backend (originally 'multiprocessing'). A simple benchmark can be seen here.

This addresses issue #3536.
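
For context, here is a minimal sketch of the idea (not the PR's actual code; the data shapes, k value, and chunking are made up for illustration): the query points are split into chunks and each chunk is queried against the tree in a separate joblib thread.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X_train = rng.rand(10000, 10)
X_query = rng.rand(2000, 10)

tree = KDTree(X_train)
n_jobs = 4
chunks = np.array_split(X_query, n_jobs)

# The 'threading' backend avoids pickling the bound query method and copying
# the data; it only pays off because the tree query releases the GIL in Cython.
results = Parallel(n_jobs=n_jobs, backend='threading')(
    delayed(tree.query)(chunk, k=5) for chunk in chunks
)
dist = np.vstack([d for d, _ in results])
ind = np.vstack([i for _, i in results])
```

Threads can only overlap here if the Cython query code releases the GIL, which is what the rest of this thread discusses.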

@nmayorov changed the title from "[MRG] ENH: Parallelize neighbors search through multiprocessing #3536" to "[MRG] ENH: Parallelize neighbors search through multiprocessing" on Dec 26, 2014
@@ -8,6 +8,7 @@
# License: BSD 3 clause (C) INRIA, University of Amsterdam
import warnings
from abc import ABCMeta, abstractmethod
from multiprocessing import cpu_count
Member

unnecessary import?

@nmayorov (Contributor Author)

It's indeed unnecessary.

@jakevdp (Member) commented Dec 26, 2014

Looks like tests are failing because you're passing an unpicklable method to Parallel. I've run into this before, but have not found a good workaround. Perhaps others would have some ideas.

@nmayorov (Contributor Author)

You are right, that's the reason. Python 3 can pickle bound methods, so it works there.

Doing 'threading' parallelism would also be cool from this perspective. But there are so many things in the .pyx files that I'm not sure I will be able to release the GIL correctly. I can try, though...
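
To illustrate the pattern being discussed, here is a small hypothetical Cython sketch (the function and variable names are made up; this is not the actual binary_tree.pxi code): the per-query loop can only overlap across joblib threads if everything inside it is plain C, so the loop body is wrapped in a nogil block.

```cython
# Hypothetical sketch of a GIL-free query loop; not scikit-learn code.

cdef double _one_query(double[:, ::1] data, double[:, ::1] queries,
                       Py_ssize_t i) nogil:
    # Stands in for the real per-point tree traversal: it touches only C
    # buffers, so it can legally be declared nogil.
    cdef double acc = 0.0, d
    cdef Py_ssize_t j
    for j in range(queries.shape[1]):
        d = data[0, j] - queries[i, j]
        acc += d * d
    return acc

def query_many(double[:, ::1] data, double[:, ::1] queries, double[::1] out):
    cdef Py_ssize_t i
    with nogil:  # released for the whole loop so other threads can run
        for i in range(queries.shape[0]):
            out[i] = _one_query(data, queries, i)
```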

@nmayorov changed the title from "[MRG] ENH: Parallelize neighbors search through multiprocessing" to "[WIP] ENH: Parallelize neighbors search through multiprocessing" on Dec 26, 2014
@nmayorov (Contributor Author)

Hi, I was able to make threading parallelism work. Check out the updated benchmark:

http://nbviewer.ipython.org/gist/nmayorov/7531d9b59608ae2d76dc

If a user passes a custom distance function written in Python, they won't get any speedup because of the GIL. But I think that's a very minor limitation.

Should I continue this PR in the nogil/threading direction?

@nmayorov (Contributor Author)

I can't make it work with RadiusNeighborsClassifier and query_radius. I added with nogil around the main loop, fixed all the problems, and successfully cythonized the .pyx files. But the result is about a 10x slowdown, whereas it gains a decent speedup with multiprocessing (in Python 3).

@jakevdp (Member) commented Dec 27, 2014

Hi – can you add your commits to the branch so we can see these updates?

@nmayorov (Contributor Author)

I wasn't sure what to add, because there are different directions to go.

I pushed all the "nogil" changes. But, as I said, query_radius now works terribly with n_jobs > 1. I hope you can give some advice on how to fix it.

Note that I removed some things which you marked as "for compatibility with old numpy", because the current dependency is stated as numpy >= 1.6.2, and the comments there mention numpy 1.5 and below.


The Travis build has failed again. I guess I shouldn't have pushed these changes; they look like a huge mess. I think I'd better try to go with multiprocessing; I only need some way to pickle the query methods in Python 2.

@nmayorov (Contributor Author)

Well, I fixed the doctest and now the build passes. If you are OK with my changes in general, let's try to fix the problem with query_radius.

@@ -44,7 +44,7 @@ cdef int allocate_data(BinaryTree tree, ITYPE_t n_nodes,
ITYPE_t n_features) except -1:
"""Allocate arrays needed for the KD Tree"""
tree.node_bounds_arr = np.zeros((1, n_nodes, n_features), dtype=DTYPE)
tree.node_bounds = get_memview_DTYPE_3D(tree.node_bounds_arr)
tree.node_bounds = tree.node_bounds_arr
Member

This get_memview_DTYPE_3D function is required for compatibility with old versions of numpy. Unless we've updated our dependency minimums, it should stay.

@nmayorov (Contributor Author)

It was annotated in the code as # Numpy 1.3-1.4 compatibility utilities, and on the main page of the scikit-learn project we see "The required dependencies to build the software are NumPy >= 1.6.2, SciPy >= 0.9 and a working C/C++ compiler." So I figured it's fine to remove it.

BTW, I removed all get_memview_DTYPE functions.

Member

Got it – yeah, it's probably time to make that change. I guess it just confused me that it was bundled into the parallel enhancements. Any way you can factor that out into a separate PR? I think it would be much easier to review that way.

@nmayorov (Contributor Author)

Sure, I will do that. I just decided to show you all the changes (I understand it's pretty chaotic).

The main problem is that, after my additions, query_radius runs slower with multiple threads than the current version does (which runs in almost the same time with any number of threads).

If we figure out what to do about that, I will rework the commits.

@nmayorov (Contributor Author)

Hey guys!

Please don't review it in its current state, because it mixes several changes (some of them not working properly). I'll keep only the changes related to the kneighbors method (which were successful), then proceed to radius_neighbors, perhaps in another PR, and also factor out the cleanup of legacy code in binary_tree.pxi into a separate PR (as @jakevdp suggested).

@jakevdp (Member) commented Dec 29, 2014

Great – thanks @nmayorov. I think this is hitting on some really good ideas (and some much-needed maintenance updates), so thanks for being willing to iterate!

@nmayorov (Contributor Author)

The 3 new commits are the bare minimum needed to make kneighbors work in parallel (with an actual speed improvement). They don't affect radius_neighbors or the RadiusNeighbors classes.

@nmayorov changed the title from "[WIP] ENH: Parallelize neighbors search through multiprocessing" to "[WIP] ENH: Parallelize kneighbors method with multithreading" on Dec 29, 2014
@jakevdp (Member) commented Dec 29, 2014

This looks great. Just so I understand: the issue with Radius Neighbors is the fact that you don't know how big an array you need to store the neighbors before the query happens, yes?

@nmayorov (Contributor Author)

I can't say with absolute certainty (you would know better), but yes, it seems to be the main issue and the reason I tried to use malloc in that loop. It didn't help though (it performs the same as adding with gil), and I don't know why; perhaps frequent allocations are bad with multithreading.
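
For illustration, a hypothetical Cython sketch of the malloc-inside-the-loop pattern described here (not the actual _query_radius_single code): malloc/free can be called with the GIL released, but every iteration then hits the C allocator, which many threads can end up contending on.

```cython
# Hypothetical sketch; not scikit-learn code.
from libc.stdlib cimport malloc, free

def count_in_radius(double[:, ::1] queries, double r, long[::1] counts):
    cdef Py_ssize_t i, n_features = queries.shape[1]
    cdef double* scratch
    with nogil:
        for i in range(queries.shape[0]):
            # per-query scratch buffer allocated without holding the GIL
            scratch = <double*> malloc(n_features * sizeof(double))
            if scratch == NULL:
                with gil:
                    raise MemoryError()
            # ... fill scratch and count neighbours within radius r (omitted) ...
            counts[i] = 0
            free(scratch)
```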

@nmayorov (Contributor Author) commented Jan 1, 2015

I experimented a bit more with the query_radius method. When I removed all the if statements from _query_radius_single and ran it in count_only mode, the execution time started to decrease consistently as n_jobs increased. Otherwise it doesn't scale, i.e. there seems to be no single main problem to fix.

It's not that easy to get a performance boost with threading and nogil. I suggest 2 options:

  1. Merge this PR and leave RadiusNeighbors alone, at least for now (it's not used very often, I guess).
  2. Try to implement parallelization with multiprocessing. This approach is more robust, but we have the problem with unpicklable methods (present in Python 2). Maybe someone has suggestions on how to overcome it.

@djsutherland (Contributor)

I haven't looked at this in detail, but as a potential consumer of this change, approach 1 would be extremely useful to me, whereas approach 2 would not be useful in my case at all. (Multiprocessing copies the data way too many times and I run into memory errors.)

@jnothman (Member) commented Jan 4, 2015

(Multiprocessing copies the data way too many times and I run into memory errors.)

Could that be avoided by memmapping?
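
For illustration, a minimal sketch of the memmapping idea (file path and data shapes are made up; this is not part of the PR): the query data is written to disk once and loaded back as a read-only memory map, so worker processes share the pages through the OS instead of each receiving a pickled copy. Note that the fitted tree itself is still pickled to each worker in this sketch.

```python
import numpy as np
from joblib import dump, load, Parallel, delayed
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X_train = rng.rand(50000, 10)
X_query = rng.rand(20000, 10)

tree = KDTree(X_train)

dump(X_query, '/tmp/X_query.joblib')                  # hypothetical path
X_query_mm = load('/tmp/X_query.joblib', mmap_mode='r')

# Slices of the memmapped array are handed to the worker processes by
# reference to the on-disk file rather than by copying the data.
n_jobs = 4
bounds = np.linspace(0, X_query_mm.shape[0], n_jobs + 1).astype(int)
results = Parallel(n_jobs=n_jobs)(
    delayed(tree.query)(X_query_mm[lo:hi], k=5)
    for lo, hi in zip(bounds[:-1], bounds[1:])
)
```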

@nmayorov (Contributor Author) commented Jan 4, 2015

@dougalsutherland, what is the typical shape of X on which you do training/querying of KNeighborsClassifier?

Also, could you please run this code on your data and report what time improvements you get with n_jobs > 1?

@djsutherland (Contributor)

@jnothman memmapping requires writing the data to disk, which kind of sucks, but it's definitely an option. In the past I've used some hacks to have the forked processes inherit global data, but that's super-hacky, doesn't work on Windows, and I've run into problems where, if I fork and close too many multiprocessing pools, my program eventually starts hanging (I never really figured out whether that's related to this hack or not). In any case, a working threading-based solution exists for the main use case of this code, and in @nmayorov's benchmarks it is a decent bit faster...

@nmayorov I don't actually use KNeighborsClassifier; I do a bunch of knn-distance queries in order to compute some knn density estimation-type estimates. Currently I use flann (with cyflann) to do that, but that dependency makes distributing my code much more annoying. In my specific case, I also do post-processing right after each of the queries, to avoid having to construct a huge distance matrix and operate on that; having the knn search happen in nogil land makes it easier for that to be fast.

I'll pull out some sample data and run benchmarks sometime soon.

@nmayorov (Contributor Author) commented Jan 9, 2015

I also ran the benchmark on Mac OS X – it works well.

@jakevdp what is your opinion, can you merge it? Or what else needs to be done? Do you think it's necessary to do something for radius queries? (I personally don't think so; they are independent changes anyway.)

@jnothman (Member) commented Jan 9, 2015

I think it's fine to fix one without the other initially, but if you want to get full reviews, change the WIP in the title to MRG!

@nmayorov changed the title from "[WIP] ENH: Parallelize kneighbors method with multithreading" to "[MRG] ENH: Parallelize kneighbors method with multithreading" on Jan 9, 2015
@nmayorov (Contributor Author)

@GaelVaroquaux have you seen the benchmarks that @ogrisel ran on his machine? It's not certain how well n_jobs > 1 will work on a user's machine.

If you still want to merge, maybe a word of caution should be added?

@ogrisel (Member) commented Feb 18, 2015

@nmayorov please squash all the commits or rewrite the commit messages so that they are individually informative. Commit messages are meant to be read in the coming years when someone goes through the history of this class for debugging or refactoring purposes. Commit messages like "Updated with @ogrisel suggestions" are useless for that purpose. Instead they should summarize the "what" in one line and possibly the "why" in a subsequent paragraph when it's not obvious.

@nmayorov (Contributor Author)

@ogrisel I understand your point; I will squash the commits.

@ogrisel (Member) commented Feb 19, 2015

Thanks for the squash!

@GaelVaroquaux have you seen the benchmarks that @ogrisel ran on his machine? It's not certain how well n_jobs > 1 will work on a user's machine.

It should never be significantly worse than n_jobs=1 (at least under Python 3).

If you still want to merge, maybe a word of caution should be added?

I don't think it's necessary. It's just suboptimal, but not completely broken.

@ogrisel (Member) commented Feb 19, 2015

So what do other people think, shall we merge this? I think it's pretty harmless, and releasing the GIL is always a good thing when it does not make the code overly complex. Any further opinions, @jnothman @jakevdp @amueller @larsmans?

@GaelVaroquaux (Member) commented Feb 19, 2015 via email

@jnothman (Member)

I think we should merge this, but this is not mere GIL unlocking. I think we need to decide which backend the neighbors Parallel call should use.

@sturlamolden (Contributor)

I have read through @nmayorov's code. I do not see a lock which should have this effect on the scalability. The most likely candidates are some GIL contention in the Cython-generated C, a bad case of false sharing, or a problem in joblib's threading backend. There are some cdef functions in @nmayorov's code that have an except -1 declaration, even though they are declared nogil. I do not know what Cython does in this case. In the worst case it grabs the GIL back to call PyErr_Occurred(), which would explain the remaining contention. But as these functions never raise exceptions, the except declarations can be removed.
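
For illustration, a hypothetical Cython sketch of the two declaration styles being contrasted (not the actual binary_tree.pxi signatures): an except value on a cdef function tells Cython that the function may raise, so error-propagation checks are generated at every call site, while a plain nogil function without an except clause is called as ordinary C with no exception bookkeeping.

```cython
# Hypothetical sketch; not scikit-learn code.

cdef int count_checked(double a, double b) except -1 nogil:
    # `except -1` makes a -1 return value mean "an exception may have been
    # raised", so every caller gets an extra error check after the call.
    return 1 if a < b else 0

cdef int count_plain(double a, double b) nogil:
    # No except clause: the function is assumed never to raise, so the
    # generated C call has no exception bookkeeping at all.
    return 1 if a < b else 0
```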

@ogrisel (Member) commented Mar 3, 2015

I think we should merge this, but this is not mere GIL unlocking. I think we need to decide which backend the neighbors Parallel call should use.

Based on previous benchmarks, the multiprocessing backend is only efficient if the user takes proper care to manually memory-map both the query data and the fitted neighbors tree. It's unlikely that many users will do that. +1 for keeping the threading-based backend as the default even though it's not as scalable as we would like. I am pretty sure we could hunt down the remaining source of GIL contention, but I don't know of any high-level tool that would make that quick to do.

@jnothman (Member)

Apart from the rebase, I take it, @ogrisel, that you would give this a +1 for merge...?

@GaelVaroquaux (Member)

Looks good. 👍 to merge. The speedup is useful, and releasing the GIL is a good thing.

I am happy to try to rebase and merge right now.

@GaelVaroquaux (Member)

For reference, the speedup on my laptop (2 cores and hyperthreading):

In [12]: for n_jobs in n_jobs_all:
        knn.set_params(n_jobs=n_jobs)
        %timeit -n 3 -r 1 knn.predict(X_test)
   ....:     
3 loops, best of 1: 1min 2s per loop
3 loops, best of 1: 44.8 s per loop
3 loops, best of 1: 37 s per loop

(1, 2 and 3 jobs). So I think that this is actually pretty good.

@jnothman (Member)

But with the new auto-batching in joblib, the benchmarks above that showed threading outperformed multiprocessing are no longer applicable. Should you redo that benchmark with a multiprocessing backend just to be sure?

@GaelVaroquaux (Member)

But with the new auto-batching in joblib, the benchmarks above that showed threading outperformed multiprocessing are no longer applicable. Should you redo that benchmark with a multiprocessing backend just to be sure?

I was just about to merge and move on, as this is a clear improvement (both the current speedup and the fact that we release the GIL more often). But if you want to redo the benchmarks, I can wait a bit, although I think we should merge anyhow.

@jnothman (Member)

Go ahead

@GaelVaroquaux GaelVaroquaux merged commit 63039b2 into scikit-learn:master Aug 30, 2015
@GaelVaroquaux (Member)

I have merged this. Thank you very much!

@jnothman (Member)

Multiprocessing doesn't work in Python 2 due to BallTree pickling issues anyway, it seems. However, I'm not seeing gains nearly as good as yours with multithreading. At least I hope this helps some. But given that the gains are small, do we have any chance with radius_neighbors? Do we have any understanding of why the gains are small?

@Baschdl commented May 10, 2020

Was there any work on radius_neighbors in the meantime? With support for Python 2 dropped, parallelization with multiprocessing shouldn't be an issue anymore.

@jnothman (Member)

@Baschdl see #10887, included in scikit-learn from v0.20.

@guidocioni

Is there a reason why n_jobs was not also implemented for the BallTree and KDTree methods? I'm trying to follow the discussion across issues and PRs, but I can't really understand when this was dropped.

@sturlamolden (Contributor) commented Mar 1, 2023

Is there a reason why n_jobs was not also implemented for the BallTree and KDTree methods? I'm trying to follow the discussion across issues and PRs, but I can't really understand when this was dropped.

scipy.spatial.KDTree does this for some of the methods (those where parallelism is iterative).

For purely recursive work we initially did not include this because it requires fork-join parallelism in C++. When the code was written, C++ did not have support for this in the standard library; now it does. We made a parallel kd-tree build (unsure if it was merged into master), but the speedup is not very impressive (about 2x). So as for implementing it for the rest of the scipy.spatial.KDTree methods, the speedup from the parallel build was not very motivating.
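
For reference, the iterative-parallelism case mentioned above is exposed through the workers argument of some scipy.spatial.KDTree methods (a small usage sketch with made-up data, assuming a recent SciPy; in older releases the cKDTree argument was called n_jobs):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((100000, 3))
queries = rng.random((10000, 3))

tree = KDTree(points)
# workers=-1 uses all available CPU cores for the per-point queries
dist, idx = tree.query(queries, k=5, workers=-1)
```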
