[MRG+1] ENH: Parallelize kneighbors method with multithreading #4009
Conversation
@@ -8,6 +8,7 @@
 # License: BSD 3 clause (C) INRIA, University of Amsterdam
 import warnings
 from abc import ABCMeta, abstractmethod
+from multiprocessing import cpu_count
unnecessary import?
It's indeed unnecessary.
Looks like tests are failing because you're passing an unpicklable method to …

You are right, that's the reason. There is some advanced mechanism added in Python 3, and thus it works there. Doing …
Hi, I was able to make threading parallelism work. Check out the updated benchmark: http://nbviewer.ipython.org/gist/nmayorov/7531d9b59608ae2d76dc If a user passes a custom (Python-level) function for distance calculation, there won't be any speedup because of the GIL, but I think that's a very minor limitation. Should I continue this PR in the nogil/threading direction?
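For illustration, here is a minimal sketch (not the PR's actual code, and using the standalone joblib package) of the chunked, thread-based dispatch being discussed: split the query points across threads, and let each thread call the GIL-releasing tree query on its chunk.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(10000, 10)
tree = KDTree(X)

n_jobs = 4
# One chunk of query points per thread; the Cython query releases
# the GIL, so the threads can actually run concurrently.
chunks = np.array_split(X, n_jobs)
results = Parallel(n_jobs=n_jobs, backend="threading")(
    delayed(tree.query)(chunk, k=5) for chunk in chunks
)
dist = np.vstack([d for d, _ in results])
ind = np.vstack([i for _, i in results])
print(dist.shape, ind.shape)  # (10000, 5) each
```

A custom Python metric would serialize these threads on the GIL, which is the limitation noted above.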
I can't make it work with …
Hi – can you add your commits to the branch so we can see these updates?
I wasn't sure what to add, because there are different directions to go. I pushed all the "nogil" changes. But, as I said, … Note that I removed some things which you marked as "for compatibility with old numpy", because the current dependency is stated as NumPy >= 1.6.2. The Travis build has failed again. I guess I shouldn't have pushed these changes; they look like a huge mess. I think I'd better try to go with …
Force-pushed from 4e29212 to 4488332.
Well, I fixed the doctest and now the build passes. If you are OK with my changes in general, let's try to fix the problem with …
@@ -44,7 +44,7 @@ cdef int allocate_data(BinaryTree tree, ITYPE_t n_nodes,
                        ITYPE_t n_features) except -1:
     """Allocate arrays needed for the KD Tree"""
     tree.node_bounds_arr = np.zeros((1, n_nodes, n_features), dtype=DTYPE)
-    tree.node_bounds = get_memview_DTYPE_3D(tree.node_bounds_arr)
+    tree.node_bounds = tree.node_bounds_arr
This `get_memview_DTYPE_3D` function is required for compatibility with old versions of numpy. Unless we've updated our dependency minimums, it should stay.
It was annotated in the code as `# Numpy 1.3-1.4 compatibility utilities`, and on the main page of the scikit-learn project we see: "The required dependencies to build the software are NumPy >= 1.6.2, SciPy >= 0.9 and a working C/C++ compiler." So I figured it's fine to remove it. BTW, I removed all the `get_memview_DTYPE` functions.
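For context, a hedged Cython sketch of why the helper became redundant: NumPy >= 1.6 implements the buffer protocol, so an ndarray can be assigned straight to a typed memoryview (the names below mirror the diff but are illustrative):

```cython
import numpy as np
cimport numpy as np

ctypedef np.float64_t DTYPE_t

# A 3-D C-contiguous typed memoryview, as used for node_bounds.
cdef DTYPE_t[:, :, ::1] node_bounds

node_bounds_arr = np.zeros((1, 4, 3), dtype=np.float64)
node_bounds = node_bounds_arr  # direct assignment; no get_memview_DTYPE_3D wrapper
```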
Got it – yeah, it's probably time to make that change. I guess it just confused me that it was bundled into the parallel enhancements. Any way you can factor that out into a separate PR? I think it would be much easier to review that way.
Sure, I will do that. I just decided to show you all the changes (I understand it's pretty chaotic). The main problem is that `query_radius` after my additions works slower with multiple threads than the current version does (and the current version runs in almost the same time with any number of threads). If we figure out what to do with that, I will change the commits.
Hey guys! I would ask you not to review it in the current state, because there are several changes (some of them not working properly). I'll keep only the changes related to `kneighbors`.
Great – thanks @nmayorov. I think this is hitting on some really good ideas (and some much needed maintenance updates), so thanks for being willing to iterate!
Force-pushed from a476ddf to b0fde39.
The 3 new commits are the bare minimum to make `kneighbors` parallel.
This looks great. Just so I understand: the issue with Radius Neighbors is the fact that you don't know how big an array you need to store the neighbors before the query happens, yes?
I can't say with absolute certainty (you should know better), but, yes, it seems to be the main issue and the reason I tried to use malloc in that loop. It didn't help though (it works the same as adding …).
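To make the preallocation point concrete (a hedged example, not code from the PR): `query_radius` returns a different number of neighbors per query point, so the output cannot be sized ahead of time the way the fixed-size `kneighbors` result can.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
tree = KDTree(X)

# One variable-length index array per query point.
ind = tree.query_radius(X[:5], r=0.2)
print([len(i) for i in ind])  # neighbor counts differ from point to point
```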
I experimented a bit more with … It's not that easy to get a performance boost with threading and nogil. I suggest 2 variants: (1) a threading-based backend relying on GIL-releasing Cython code, or (2) a multiprocessing-based backend.
I haven't looked at this in detail, but as a potential consumer of this change, approach 1 would be extremely useful to me, whereas approach 2 would not be useful in my case at all. (Multiprocessing copies the data way too many times and I run into memory errors.)
Can that be avoided by memmapping?
@dougalsutherland, what is the typical shape of …? Also, could you please run this code on your data and report what time improvements you get with …?
@jnothman memmapping requires writing the data to disk, which kind of sucks, but it's definitely an option. In the past I've used some hacks to have the forked processes inherit global data, but that's super hacky, doesn't work on Windows, and I've run into problems where, if I fork and close too many multiprocessing pools, my program eventually starts hanging (never really figured out whether that's related to the hack or not). In any case, a working threading-based solution exists for the main use case of this code, and in @nmayorov's benchmarks it is a decent bit faster.

@nmayorov I don't actually use … I'll pull out some sample data and run benchmarks sometime soon.
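A hedged sketch of the memmapping workaround under discussion (the file name is illustrative): dump the array once, then reload it memory-mapped so forked workers share pages instead of each taking a full copy.

```python
import numpy as np
import joblib

X = np.random.RandomState(0).rand(100000, 50)

# Write once, then reopen read-only as a memory map.
joblib.dump(X, "X.joblib")
X_mm = joblib.load("X.joblib", mmap_mode="r")

# X_mm behaves like an ndarray but is backed by the file on disk,
# so multiprocessing workers avoid per-process in-memory copies.
print(type(X_mm), X_mm.shape)
```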
I also ran the benchmark on Mac OS X – it works well. @jakevdp what is your opinion, can you merge it? Or what else needs to be done? Do you think it's necessary to do something for the radius queries? (I personally don't think so; they are independent changes anyway.)
I think it's fine to fix one without the other initially, but if you want to get full reviews, change the WIP in the title to MRG!
@GaelVaroquaux have you seen the benchmarks that @ogrisel ran on his machine? So it's not certain how well … If you still want to merge, maybe a word of caution should be added?
@nmayorov please squash all the commits or rewrite the commit messages so that they are individually informative. Commit messages are meant to be read in the coming years when someone goes through the history of this class for debugging or refactoring purposes. Commit messages such as "Updated with @ogrisel suggestions" are useless for that purpose. Instead they should summarize the "what" in one line, and possibly the "why" in a subsequent paragraph when it's not obvious.
@ogrisel I understand your point; I will squash the commits.
Force-pushed from 2a1b012 to 63039b2.
Thanks for the squash!
It should never be significantly worse than …

I don't think it's necessary. It's just suboptimal, but not completely broken.
So what do other people think, shall we merge this? I think it's pretty harmless and releasing the GIL is always a good thing when it does not render the code overly complex.

I absolutely agree with this last statement.
I think we should merge this, but this is not mere GIL unlocking. I think we need to decide which backend Neighbors' Parallel should use.
I have read through @nmayorov's code. I do not see a lock which should have this effect on the scalability. The most likely candidates are some GIL contention in the Cython-generated C, a bad case of false sharing, or a problem in joblib's threading backend. There are some cdef functions in @nmayorov's code that have an …
Based on previous benchmarks, the multiprocessing backend is only efficient if the user takes proper care to manually memory-map both the query data and the fitted neighbors tree. It's unlikely that many users will do that. +1 for keeping the threading-based backend as the default even though it's not as scalable as we would like. I am pretty sure we could hunt down the remaining source of GIL contention, but I don't know of any high-level tool that would make that quick to do.
Apart from rebase, I take it, @ogrisel, that you would give this a +1 for merge...?
Looks good. 👍 to merge. The speedup is useful, and releasing the GIL is a good thing. I am good to try to rebase and merge right now.
For reference, the speedup on my laptop (2 cores plus hyperthreading):

In [12]: for n_jobs in n_jobs_all:
   ....:     knn.set_params(n_jobs=n_jobs)
   ....:     %timeit -n 3 -r 1 knn.predict(X_test)
   ....:
3 loops, best of 1: 1min 2s per loop
3 loops, best of 1: 44.8 s per loop
3 loops, best of 1: 37 s per loop

(That's 1, 2 and 3 jobs.) So I think this is actually pretty good.
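A hedged reconstruction of the kind of setup behind those timings (the dataset shape and parameters are assumptions, not taken from the thread):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(50000, 30)
y_train = rng.randint(2, size=50000)
X_test = rng.rand(10000, 30)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
for n_jobs in (1, 2, 3):
    # Same pattern as the transcript: vary n_jobs, then time predict
    # (e.g. with %timeit in IPython or time.perf_counter).
    knn.set_params(n_jobs=n_jobs)
    knn.predict(X_test)
```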
But with the new auto-batching in joblib, the benchmarks above that showed threading outperformed multiprocessing are no longer applicable. Should you redo that benchmark with a multiprocessing backend just to be sure?
I was just about to merge and move on, as this is a clear improvement.

Go ahead.

I have merged this. Thank you very much!
Multiprocessing doesn't work in Python 2 due to the unpicklable-method issue discussed above.
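A hedged illustration of that Python 2 limitation as I read the thread: bound methods only became picklable in Python 3 (the "advanced mechanism" mentioned earlier), so a multiprocessing backend cannot ship something like `tree.query` to worker processes on Python 2.

```python
import pickle
import numpy as np
from sklearn.neighbors import KDTree

tree = KDTree(np.random.rand(10, 3))
try:
    pickle.dumps(tree.query)  # succeeds on Python 3, raises on Python 2
    print("bound method pickled fine")
except (pickle.PicklingError, TypeError) as exc:
    print("cannot pickle bound method:", exc)
```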
Was there any work on …?

Is there a reason why the tree construction itself is not parallelized?
For purely recursive work we initially did not include this because it requires fork-join parallelism in C++. When the code was written, C++ did not have support for this in the standard library; now it does. We have made a parallel kd-tree build (I'm unsure whether it was merged into master), but the speedup is not very impressive (about 2x). So for implementing it for the rest of …
I used the simplest approach, with the 'multiprocessing' / 'threading' joblib backends. A simple benchmark can be seen here. This addresses issue #3536.
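Finally, a hedged sketch of the user-facing effect of the merged change (parameter values are illustrative): the `n_jobs` estimator parameter controls how neighbor queries are dispatched.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(5000, 20)

# n_jobs > 1 parallelizes the kneighbors query across jobs.
nn = NearestNeighbors(n_neighbors=5, n_jobs=2).fit(X)
dist, ind = nn.kneighbors(X)
print(dist.shape, ind.shape)  # (5000, 5) each
```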