[MRG+1] Implemented parallelised version of mean_shift and test function. #4779
Conversation
In general I think the idea is good, but the interface is not how we usually do it. Please have a look at how it is done in KMeans. You should use the n_jobs parameter.
Good to know! I'll work on it in the next few days and update the pull request. In general, the speed improvement is massive, and it has already been useful in my work. Thanks!
Wait, on second thought, I am not sure what is happening. Currently the algorithm only provides support for a single seed, right? Did you change the semantics of the seeds?
What do you mean by "a single seed"? Basically I turned the "for seed in seeds" loop into parallel processes. In other words, the seeds variable is divided into chunks fed to different cores.
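A rough sketch of the change being described, as it ends up looking in the joblib-based version shown later in this thread; the helper _mean_shift_single_seed is the per-seed loop factored out into its own function (its body is not shown here):

# Serial version: process one seed after the other.
results = []
for seed in seeds:
    results.append(_mean_shift_single_seed(seed, X, nbrs, max_iter))

# Parallel version: the seeds are dispatched to several worker processes.
from joblib import Parallel, delayed  # bundled as sklearn.externals.joblib at the time
results = Parallel(n_jobs=n_jobs)(
    delayed(_mean_shift_single_seed)(seed, X, nbrs, max_iter)
    for seed in seeds)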
Never mind, I was just confused.
Changed as you requested. Is it acceptable?
@@ -13,6 +13,10 @@
# Alexandre Gramfort <alexandre.gramfort@inria.fr>
# Gael Varoquaux <gael.varoquaux@normalesup.org>

# Modified: Martino Sorbaro <martino.sorbaro@ed.ac.uk>
you can just add yourself to the authors above. Don't add a comment on the change here. Add it to whatsnew.rst.
Apart from my minor comments this looks good. Can you maybe give some timing results on, say, the digits dataset?
Minor changes committed, I'll provide info on timings later.
Hi. There is something wrong. With multiprocessing I had good results, much faster than the serial case. Using joblib I see my CPUs working at half power, with some sleeping processes, resulting in a run time that is actually longer than the serial case!
Damn. This is probably because new workers are created over and over again. So that is not good :-/ We've seen similar issues before. @ogrisel and @GaelVaroquaux do you have a good idea on moving forward? There was a joblib issue on reusing worker pools, do we want to wait for that?
The actual fix is here, I think: joblib/joblib#157. Can you try that?
I'm having trouble making my sklearn clone use other code; I'm just not used to working with complex Python package layouts, sorry. I'll try to find time to do it. Is there any reason it shouldn't work?
Well, it might not be as fast as your original code. That would be good to know.
So, a warning that this was a bit of an artisanal solution (I simply copied the relevant files into a folder and changed the imports); these are some results:

(Original sklearn)

(With joblib as in sklearn)

(With multiprocessing)

(With joblib as in the "bench-batching" branch)

I used the "digits" dataset as you suggested.
Great, thanks. So it seems to me the branch is good, or at least an improvement. Maybe @ogrisel can find the time to work on it.
break
completed_iterations += 1
#execute iterations on all seeds in parallel
all_res = Parallel(n_jobs=n_jobs)(delayed(_mean_shift_single_seed)(seed,X,nbrs,max_iter) for seed in seeds)
style: you can break the line after Parallel(n_jobs=n_jobs)(
and before delayed(...)
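For illustration, the layout suggested here would look roughly like this (same call as in the diff above, with spaces added after the commas as well):

all_res = Parallel(n_jobs=n_jobs)(
    delayed(_mean_shift_single_seed)(seed, X, nbrs, max_iter)
    for seed in seeds)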
+1 for merging joblib/joblib#157 quickly, making a release of joblib with that new optimization, and synchronizing the copy in scikit-learn externals to be able to benefit from this PR. @martinosorb, in the meantime can you please fix the style of this patch to follow PEP8? (You can use the pep8 linter to find the violations: mainly spacing after the commas.)
Actually @martinosorb could you also please benchmark this with the threading backend of joblib (no need to use the joblib/joblib#157 branch for that benchmark)? It might be the case that if the brute force method is used, parallelism could be achieved efficiently with threads thanks to numpy releasing the GIL most of the time. When using the kd-tree / ball-tree neighbors it's unfortunately very likely that the GIL is going to hurt us, as it seems that we don't release it in that part of the code yet (radius queries).
WRT using threading here, I think this would benefit from merging #4009. But it needs a rebase first.
Again, this was just a quick check (I used joblib from my sklearn, installed with pip, not from the fork). I can do it on a machine with more than 2 cores if you need. Notice that, again, it's slower using 2 cores than using 1.
The file "mean_shift_thread" is the same as my mean_shift_, but with «backend="threading"».
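For reference, a minimal sketch of what the threading variant amounts to, assuming the same helper and arguments as in the diff above (only the backend argument differs from the multiprocessing version):

from sklearn.externals.joblib import Parallel, delayed  # joblib as bundled in scikit-learn at the time

all_res = Parallel(n_jobs=n_jobs, backend="threading")(
    delayed(_mean_shift_single_seed)(seed, X, nbrs, max_iter)
    for seed in seeds)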
Alright, thanks for checking. Let's stick with the default multiprocessing backend for now, and we might re-explore the opportunity to use the threading backend if we find ways to free the GIL in the critical sections of the nearest neighbors models.
#4905 has been submitted to upgrade the embedded joblib to the new beta that includes the auto-batching feature. Once merged, we can rebase this PR on master, and it's +1 to merge this in turn on my side.
#4905 has been merged into master. @martin0258 could you please rebase this PR and check that the performance is as good as in your last benchmark?
OK, I don't know if I messed up with the files, but it's still slower on 32 processors than on 1.
Have you rebased? Based on some tests that I did on my box it seems to work much better than previously:

from sklearn.datasets import make_blobs
from sklearn.cluster import mean_shift
import time

X, y = make_blobs(n_samples=5000, n_features=30, centers=30)

print('sequential')
t0 = time.time()
mean_shift(X)
print('%0.3fs' % (time.time() - t0))

print('n_jobs=10')
t0 = time.time()
mean_shift(X, n_jobs=10)
print('%0.3fs' % (time.time() - t0))

On this branch (without rebasing on the new joblib):

With the same branch once rebased on the current master (that includes joblib 0.9.0b2):

It's not a linear speedup but it's much better.
+1 for merge on my side (with a rebase + squash, but I can do it) if @martinosorb or @amueller have no objection.
I thought I had rebased, but honestly I'm not that familiar with git, and I may also have messed up the virtualenvs. If you tried and succeeded, that's good news. Thanks!
Alright, seems good. It would be nice to have a comparison against the multiprocessing version.
I take it this should be merged...?
If this counts as a +1 from your side (i.e. you have reviewed the PR), yes, please go ahead. If not, I'll review it soon and hopefully merge it.
No, I've not reviewed, I was just trying to understand whether @amueller had intended support.
@jnothman Are you going to review it, or should I?
Sorry, I can't now.
OK, I'll review.
@@ -45,6 +46,16 @@ def test_mean_shift():
    n_clusters_ = len(labels_unique)
    assert_equal(n_clusters_, n_clusters)
There should be an additional empty line for PEP8 compliance.
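For context, PEP8 wants two blank lines between top-level functions; an illustrative layout for the test module (the second test name is hypothetical):

def test_mean_shift():
    ...


def test_parallel_mean_shift():
    ...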
I am fixing the two minor issues and merging.
Travis is green. Merging!
[MRG+1] Implemented parallelised version of mean_shift and test function.
thanks everyone :)
Thanks! I learned quite a lot about GitHub doing this.
The main iterative loop, which is run for every seed, was put into a separate, module-level function (this is basically required by multiprocessing.Pool.map()). A method, fit_parallel, was added to the main class, which allows parallel execution of the algorithm over the many seeds, but everything else is the same. It is fully backward compatible, and the changes have no effect unless the user explicitly calls the fit_parallel method.
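A minimal sketch of the design described above, assuming the per-seed loop lives in a module-level helper so that multiprocessing.Pool.map can pickle it; fit_parallel is shown as a free function for brevity, and the helper body is only a placeholder, not the real mean-shift iteration:

from functools import partial
from multiprocessing import Pool


def _mean_shift_single_seed(seed, X, nbrs, max_iter):
    # Placeholder: run the mean-shift iterations for this one seed and
    # return the resulting center; the real loop lives in mean_shift_.py.
    return seed


def fit_parallel(seeds, X, nbrs, max_iter, n_jobs=2):
    # Bind the shared arguments so each worker only receives a seed.
    worker = partial(_mean_shift_single_seed, X=X, nbrs=nbrs, max_iter=max_iter)
    with Pool(processes=n_jobs) as pool:
        # The seeds are split into chunks that are processed by the pool workers.
        return pool.map(worker, seeds)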