[MRG+1] ENH: Parallelize kneighbors method with multithreading #4009


Merged: 1 commit merged into scikit-learn:master on Aug 30, 2015

Conversation

@nmayorov (Contributor)

I used the simplest approach with the 'threading' joblib backend (originally 'multiprocessing'). A simple benchmark can be seen here.

This addresses issue #3536.
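
For context, here is a minimal sketch of the idea (not the PR's actual code; the data shapes, k value, and chunking are made up for illustration): the query points are split into chunks and each chunk is queried against the tree in a separate joblib thread.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X_train = rng.rand(10000, 10)
X_query = rng.rand(2000, 10)

tree = KDTree(X_train)
n_jobs = 4
chunks = np.array_split(X_query, n_jobs)

# The 'threading' backend avoids pickling the bound query method and copying
# the data; it only pays off because the tree query releases the GIL in Cython.
results = Parallel(n_jobs=n_jobs, backend='threading')(
    delayed(tree.query)(chunk, k=5) for chunk in chunks
)
dist = np.vstack([d for d, _ in results])
ind = np.vstack([i for _, i in results])
```

Threads can only overlap here if the Cython query code releases the GIL, which is what the rest of this thread discusses.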

@nmayorov changed the title from "[MRG] ENH: Parallelize neighbors search through multiprocessing #3536" to "[MRG] ENH: Parallelize neighbors search through multiprocessing" on Dec 26, 2014
@@ -8,6 +8,7 @@
# License: BSD 3 clause (C) INRIA, University of Amsterdam
import warnings
from abc import ABCMeta, abstractmethod
from multiprocessing import cpu_count
Member

unnecessary import?

@nmayorov (Contributor Author)

It's indeed unnecessary.

@jakevdp (Member) commented Dec 26, 2014

Looks like tests are failing because you're passing an unpicklable method to Parallel. I've run into this before, but have not found a good workaround. Perhaps others would have some ideas.

@nmayorov (Contributor Author)

You are right, that's the reason. Python 3 can pickle bound methods, so it works there.

Doing 'threading' parallelism would also be cool from this perspective. But there are so many things in the .pyx files that I'm not sure I will be able to release the GIL correctly. I can try, though...
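
To illustrate the pattern being discussed, here is a small hypothetical Cython sketch (the function and variable names are made up; this is not the actual binary_tree.pxi code): the per-query loop can only overlap across joblib threads if everything inside it is plain C, so the loop body is wrapped in a nogil block.

```cython
# Hypothetical sketch of a GIL-free query loop; not scikit-learn code.

cdef double _one_query(double[:, ::1] data, double[:, ::1] queries,
                       Py_ssize_t i) nogil:
    # Stands in for the real per-point tree traversal: it touches only C
    # buffers, so it can legally be declared nogil.
    cdef double acc = 0.0, d
    cdef Py_ssize_t j
    for j in range(queries.shape[1]):
        d = data[0, j] - queries[i, j]
        acc += d * d
    return acc

def query_many(double[:, ::1] data, double[:, ::1] queries, double[::1] out):
    cdef Py_ssize_t i
    with nogil:  # released for the whole loop so other threads can run
        for i in range(queries.shape[0]):
            out[i] = _one_query(data, queries, i)
```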

@nmayorov changed the title from "[MRG] ENH: Parallelize neighbors search through multiprocessing" to "[WIP] ENH: Parallelize neighbors search through multiprocessing" on Dec 26, 2014
@nmayorov (Contributor Author)

Hi, I was able to make threading parallelism work. Check out the updated benchmark:

http://nbviewer.ipython.org/gist/nmayorov/7531d9b59608ae2d76dc

If a user passes a custom distance function written in Python, they won't get any speedup because of the GIL. But I think that's a very minor limitation.

Should I continue this PR in the nogil/threading direction?

@nmayorov (Contributor Author)

I can't make it work with RadiusNeighborsClassifier and query_radius. I added with nogil around the main loop, fixed all the problems, and successfully cythonized the .pyx files. But the result is about a 10x slowdown, whereas it gains a decent speedup with multiprocessing (in Python 3).

@jakevdp (Member) commented Dec 27, 2014

Hi – can you add your commits to the branch so we can see these updates?

@nmayorov (Contributor Author)

I wasn't sure what to add, because there are different directions to go.

I pushed all the "nogil" changes. But, as I said, query_radius now works terribly with n_jobs > 1. I hope you can give some advice on how to fix it.

Note that I removed some things which you marked as "for compatibility with old numpy", because the current dependency is stated as numpy >= 1.6.2, and the comments there mention numpy 1.5 and below.


The Travis build has failed again. I guess I shouldn't have pushed these changes; they look like a huge mess. I think I'd better try to go with multiprocessing; I only need some way to pickle the query methods in Python 2.

@nmayorov (Contributor Author)

Well, I fixed the doctest and now the build passes. If you are OK with my changes in general, let's try to fix the problem with query_radius.

@@ -44,7 +44,7 @@ cdef int allocate_data(BinaryTree tree, ITYPE_t n_nodes,
ITYPE_t n_features) except -1:
"""Allocate arrays needed for the KD Tree"""
tree.node_bounds_arr = np.zeros((1, n_nodes, n_features), dtype=DTYPE)
tree.node_bounds = get_memview_DTYPE_3D(tree.node_bounds_arr)
tree.node_bounds = tree.node_bounds_arr
Member

This get_memview_DTYPE_3D function is required for compatibility with old versions of numpy. Unless we've updated our dependency minimums, it should stay.

@nmayorov (Contributor Author)

It was annotated in the code as # Numpy 1.3-1.4 compatibility utilities, and on the main page of the scikit-learn project we see "The required dependencies to build the software are NumPy >= 1.6.2, SciPy >= 0.9 and a working C/C++ compiler." So I figured it's fine to remove it.

BTW, I removed all get_memview_DTYPE functions.

Member

Got it – yeah, it's probably time to make that change. I guess it just confused me that it was bundled into the parallel enhancements. Any way you can factor that out into a separate PR? I think it would be much easier to review that way.

@nmayorov (Contributor Author)

Sure, I will do that. I just decided to show you all the changes (I understand it's pretty chaotic).

The main problem is that, after my additions, query_radius runs slower with multiple threads than the current version does (which runs in almost the same time with any number of threads).

If we figure out what to do about that, I will rework the commits.

@nmayorov (Contributor Author)

Hey guys!

Please don't review it in its current state, because it mixes several changes (some of them not working properly). I'll keep only the changes related to the kneighbors method (which were successful), then proceed to radius_neighbors, perhaps in another PR, and also factor out the cleanup of legacy code in binary_tree.pxi into a separate PR (as @jakevdp suggested).

@jakevdp (Member) commented Dec 29, 2014

Great – thanks @nmayorov. I think this is hitting on some really good ideas (and some much-needed maintenance updates), so thanks for being willing to iterate!

@nmayorov (Contributor Author)

The 3 new commits are the bare minimum needed to make kneighbors work in parallel (with an actual speed improvement). They don't affect radius_neighbors or the RadiusNeighbors classes.

@nmayorov changed the title from "[WIP] ENH: Parallelize neighbors search through multiprocessing" to "[WIP] ENH: Parallelize kneighbors method with multithreading" on Dec 29, 2014
@jakevdp (Member) commented Dec 29, 2014

This looks great. Just so I understand: the issue with Radius Neighbors is the fact that you don't know how big an array you need to store the neighbors before the query happens, yes?

@nmayorov (Contributor Author)

I can't say with absolute certainty (you would know better), but yes, it seems to be the main issue and the reason I tried to use malloc in that loop. It didn't help though (it performs the same as adding with gil), and I don't know why; perhaps frequent allocations are bad with multithreading.
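
For illustration, a hypothetical Cython sketch of the malloc-inside-the-loop pattern described here (not the actual _query_radius_single code): malloc/free can be called with the GIL released, but every iteration then hits the C allocator, which many threads can end up contending on.

```cython
# Hypothetical sketch; not scikit-learn code.
from libc.stdlib cimport malloc, free

def count_in_radius(double[:, ::1] queries, double r, long[::1] counts):
    cdef Py_ssize_t i, n_features = queries.shape[1]
    cdef double* scratch
    with nogil:
        for i in range(queries.shape[0]):
            # per-query scratch buffer allocated without holding the GIL
            scratch = <double*> malloc(n_features * sizeof(double))
            if scratch == NULL:
                with gil:
                    raise MemoryError()
            # ... fill scratch and count neighbours within radius r (omitted) ...
            counts[i] = 0
            free(scratch)
```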

@nmayorov (Contributor Author) commented Jan 1, 2015

I experimented a bit more with the query_radius method. When I removed all the if statements from _query_radius_single and ran it in count_only mode, the execution time started to decrease consistently as n_jobs increased. Otherwise it doesn't scale, i.e. there seems to be no single main problem to fix.

It's not that easy to get a performance boost with threading and nogil. I suggest 2 options:

  1. Merge this PR and leave RadiusNeighbors alone, at least for now (it's not used very often, I guess).
  2. Try to implement parallelization with multiprocessing. This approach is more robust, but we have the problem with unpicklable methods (present in Python 2). Maybe someone has suggestions on how to overcome it.

@djsutherland (Contributor)

I haven't looked at this in detail, but as a potential consumer of this change, approach 1 would be extremely useful to me, whereas approach 2 would not be useful in my case at all. (Multiprocessing copies the data way too many times and I run into memory errors.)

@jnothman (Member) commented Jan 4, 2015

(Multiprocessing copies the data way too many times and I run into memory errors.)

Could that be avoided by memmapping?
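
For illustration, a minimal sketch of the memmapping idea (file path and data shapes are made up; this is not part of the PR): the query data is written to disk once and loaded back as a read-only memory map, so worker processes share the pages through the OS instead of each receiving a pickled copy. Note that the fitted tree itself is still pickled to each worker in this sketch.

```python
import numpy as np
from joblib import dump, load, Parallel, delayed
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X_train = rng.rand(50000, 10)
X_query = rng.rand(20000, 10)

tree = KDTree(X_train)

dump(X_query, '/tmp/X_query.joblib')                  # hypothetical path
X_query_mm = load('/tmp/X_query.joblib', mmap_mode='r')

# Slices of the memmapped array are handed to the worker processes by
# reference to the on-disk file rather than by copying the data.
n_jobs = 4
bounds = np.linspace(0, X_query_mm.shape[0], n_jobs + 1).astype(int)
results = Parallel(n_jobs=n_jobs)(
    delayed(tree.query)(X_query_mm[lo:hi], k=5)
    for lo, hi in zip(bounds[:-1], bounds[1:])
)
```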

@nmayorov (Contributor Author) commented Jan 4, 2015

@dougalsutherland, what is the typical shape of X on which you do training/querying of KNeighborsClassifier?

Also, could you please run this code on your data and report what time improvements you get with n_jobs > 1?

@djsutherland (Contributor)

@jnothman memmapping requires writing the data to disk, which kind of sucks, but it's definitely an option. In the past I've used some hacks to have the forked processes inherit global data, but that's super-hacky, doesn't work on Windows, and I've run into problems where, if I fork and close too many multiprocessing pools, my program eventually starts hanging (I never really figured out whether that's related to this hack or not). In any case, a working threading-based solution exists for the main use case of this code, and in @nmayorov's benchmarks it is a decent bit faster...

@nmayorov I don't actually use KNeighborsClassifier; I do a bunch of knn-distance queries in order to compute some knn density estimation-type estimates. Currently I use flann (with cyflann) to do that, but that dependency makes distributing my code much more annoying. In my specific case, I also do post-processing right after each of the queries, to avoid having to construct a huge distance matrix and operate on that; having the knn search happen in nogil land makes it easier for that to be fast.

I'll pull out some sample data and run benchmarks sometime soon.

@nmayorov (Contributor Author) commented Jan 9, 2015

I also ran the benchmark on Mac OS X – it works well.

@jakevdp what is your opinion, can you merge it? Or what else needs to be done? Do you think it's necessary to do something for radius queries? (I personally don't think so; they are independent changes anyway.)

@jnothman (Member) commented Jan 9, 2015

I think it's fine to fix one without the other initially, but if you want to get full reviews, change the WIP in the title to MRG!

@nmayorov changed the title from "[WIP] ENH: Parallelize kneighbors method with multithreading" to "[MRG] ENH: Parallelize kneighbors method with multithreading" on Jan 9, 2015
@nmayorov (Contributor Author)

@GaelVaroquaux have you seen the benchmarks that @ogrisel ran on his machine? It's not certain how well n_jobs > 1 will work on a user's machine.

If you still want to merge, maybe a word of caution should be added?

@ogrisel (Member) commented Feb 18, 2015

@nmayorov please squash all the commits or rewrite the commit messages so that they are individually informative. Commit messages are meant to be read in the coming years when someone goes through the history of this class for debugging or refactoring purposes. Commit messages like "Updated with @ogrisel suggestions" are useless for that purpose. Instead they should summarize the "what" in one line and possibly the "why" in a subsequent paragraph when it's not obvious.

@nmayorov (Contributor Author)

@ogrisel I understand your point; I will squash the commits.

@ogrisel (Member) commented Feb 19, 2015

Thanks for the squash!

@GaelVaroquaux have you seen the benchmarks that @ogrisel ran on his machine? It's not certain how well n_jobs > 1 will work on a user's machine.

It should never be significantly worse than n_jobs=1 (at least under Python 3).

If you still want to merge, maybe a word of caution should be added?

I don't think it's necessary. It's just suboptimal, but not completely broken.

@ogrisel (Member) commented Feb 19, 2015

So what do other people think, shall we merge this? I think it's pretty harmless, and releasing the GIL is always a good thing when it does not make the code overly complex. Any further opinions, @jnothman @jakevdp @amueller @larsmans?

@GaelVaroquaux (Member) commented Feb 19, 2015 via email

@jnothman (Member)

I think we should merge this, but this is not mere GIL unlocking. I think we need to decide which backend the neighbors Parallel call should use.

@sturlamolden (Contributor)

I have read through @nmayorov's code. I do not see a lock which should have this effect on the scalability. The most likely candidates are some GIL contention in the Cython-generated C, a bad case of false sharing, or a problem in joblib's threading backend. There are some cdef functions in @nmayorov's code that have an except -1 declaration, even though they are declared nogil. I do not know what Cython does in this case. In the worst case it grabs the GIL back to call PyErr_Occurred(), which would explain the remaining contention. But as these functions never raise exceptions, the except declarations can be removed.
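
For illustration, a hypothetical Cython sketch of the two declaration styles being contrasted (not the actual binary_tree.pxi signatures): an except value on a cdef function tells Cython that the function may raise, so error-propagation checks are generated at every call site, while a plain nogil function without an except clause is called as ordinary C with no exception bookkeeping.

```cython
# Hypothetical sketch; not scikit-learn code.

cdef int count_checked(double a, double b) except -1 nogil:
    # `except -1` makes a -1 return value mean "an exception may have been
    # raised", so every caller gets an extra error check after the call.
    return 1 if a < b else 0

cdef int count_plain(double a, double b) nogil:
    # No except clause: the function is assumed never to raise, so the
    # generated C call has no exception bookkeeping at all.
    return 1 if a < b else 0
```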

@ogrisel (Member) commented Mar 3, 2015

I think we should merge this, but this is not mere GIL unlocking. I think we need to decide which backend the neighbors Parallel call should use.

Based on previous benchmarks, the multiprocessing backend is only efficient if the user takes proper care to manually memory-map both the query data and the fitted neighbors tree. It's unlikely that many users will do that. +1 for keeping the threading-based backend as the default even though it's not as scalable as we would like. I am pretty sure we could hunt down the remaining source of GIL contention, but I don't know of any high-level tool that would make that quick to do.

@jnothman (Member)

Apart from the rebase, I take it, @ogrisel, that you would give this a +1 for merge...?

@GaelVaroquaux (Member)

Looks good. 👍 to merge. The speedup is useful, and releasing the GIL is a good thing.

I am happy to try to rebase and merge right now.

@GaelVaroquaux (Member)

For reference, the speedup on my laptop (2 cores and hyperthreading):

In [12]: for n_jobs in n_jobs_all:
        knn.set_params(n_jobs=n_jobs)
        %timeit -n 3 -r 1 knn.predict(X_test)
   ....:     
3 loops, best of 1: 1min 2s per loop
3 loops, best of 1: 44.8 s per loop
3 loops, best of 1: 37 s per loop

(1, 2 and 3 jobs). So I think that this is actually pretty good.

@jnothman (Member)

But with the new auto-batching in joblib, the benchmarks above that showed threading outperformed multiprocessing are no longer applicable. Should you redo that benchmark with a multiprocessing backend just to be sure?

@GaelVaroquaux (Member)

But with the new auto-batching in joblib, the benchmarks above that showed threading outperformed multiprocessing are no longer applicable. Should you redo that benchmark with a multiprocessing backend just to be sure?

I was just about to merge and move on, as this is a clear improvement (both the current speedup and the fact that we release the GIL more often). But if you want to redo the benchmarks, I can wait a bit, although I think we should merge anyhow.

@jnothman (Member)

Go ahead

@GaelVaroquaux GaelVaroquaux merged commit 63039b2 into scikit-learn:master Aug 30, 2015
@GaelVaroquaux (Member)

I have merged this. Thank you very much!

@jnothman (Member)

Multiprocessing doesn't work in Python 2 due to BallTree pickling issues anyway, it seems. However, I'm not seeing gains nearly as good as yours with multithreading. At least I hope this helps some. But given that the gains are small, do we have any chance with radius_neighbors? Do we have any understanding of why the gains are small?

@Baschdl commented May 10, 2020

Was there any work on radius_neighbors in the meantime? With support for Python 2 dropped, parallelization with multiprocessing shouldn't be an issue anymore.

@jnothman (Member)

@Baschdl see #10887, included in scikit-learn from v0.20.

@guidocioni

Is there a reason why n_jobs was not also implemented for the BallTree and KDTree methods? I'm trying to follow the discussion across issues and PRs, but I can't really understand when this was dropped.

@sturlamolden (Contributor) commented Mar 1, 2023

Is there a reason why n_jobs was not also implemented for the BallTree and KDTree methods? I'm trying to follow the discussion across issues and PRs, but I can't really understand when this was dropped.

scipy.spatial.KDTree does this for some of the methods (those where parallelism is iterative).

For purely recursive work we initially did not include this because it requires fork-join parallelism in C++. When the code was written, C++ did not have support for this in the standard library; now it does. We made a parallel kd-tree build (unsure if it was merged into master), but the speedup is not very impressive (about 2x). So as for implementing it for the rest of the scipy.spatial.KDTree methods, the speedup from the parallel build was not very motivating.
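
For reference, the iterative-parallelism case mentioned above is exposed through the workers argument of some scipy.spatial.KDTree methods (a small usage sketch with made-up data, assuming a recent SciPy; in older releases the cKDTree argument was called n_jobs):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((100000, 3))
queries = rng.random((10000, 3))

tree = KDTree(points)
# workers=-1 uses all available CPU cores for the per-point queries
dist, idx = tree.query(queries, k=5, workers=-1)
```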
