[MRG+1] Implemented parallelised version of mean_shift and test function. #4779


Merged
5 commits merged into scikit-learn:master on Aug 30, 2015

Conversation

martinosorb

The main iterative loop, which is run for every seed, was moved into a separate, module-level function (this is essentially required by multiprocessing.Pool.map()). A fit_parallel method was added to the main class, which allows the algorithm to be executed in parallel over the many seeds; everything else is the same. The change is fully backwards compatible: behaviour is unchanged unless the user explicitly calls fit_parallel.
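
For illustration only, here is a minimal sketch of the pattern described above; the function names (_iterate_single_seed, fit_parallel_sketch) and the exact stopping rule are assumptions for the example, not the code in this PR:

from multiprocessing import Pool

import numpy as np
from sklearn.neighbors import NearestNeighbors


def _iterate_single_seed(args):
    # Module-level function so that multiprocessing.Pool.map() can pickle it.
    seed, X, nbrs, bandwidth, max_iter = args
    mean = np.asarray(seed, dtype=float)
    n_within = 0
    for _ in range(max_iter):
        # Indices of points within one bandwidth of the current mean (flat kernel).
        i_nbrs = nbrs.radius_neighbors([mean], bandwidth,
                                       return_distance=False)[0]
        points_within = X[i_nbrs]
        n_within = len(points_within)
        if n_within == 0:
            break
        old_mean = mean
        mean = points_within.mean(axis=0)
        if np.linalg.norm(mean - old_mean) < 1e-3 * bandwidth:
            break
    return tuple(mean), n_within


def fit_parallel_sketch(X, seeds, bandwidth, max_iter=300, n_processes=4):
    # Fit the neighbors structure once, then map the per-seed loop
    # over a pool of worker processes, one task per seed.
    nbrs = NearestNeighbors(radius=bandwidth).fit(X)
    args = [(seed, X, nbrs, bandwidth, max_iter) for seed in seeds]
    with Pool(processes=n_processes) as pool:
        return pool.map(_iterate_single_seed, args)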

@amueller
Member

In general I think the idea is good, but the interface is not how we usually do it. Please have a look at how it is done in KMeans: you should use joblib.Parallel and, rather than introducing an additional method, add a constructor parameter n_jobs=1.
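
As a sketch of the requested interface (the class and helper names here are invented for the example, not scikit-learn code), the convention looks roughly like this:

import numpy as np
from joblib import Parallel, delayed


def _shift_single_seed(seed, X, bandwidth):
    # Stand-in for the real per-seed iteration: a single flat-kernel
    # update, just so the example runs end to end.
    within = X[np.linalg.norm(X - seed, axis=1) <= bandwidth]
    return within.mean(axis=0) if len(within) else np.asarray(seed, dtype=float)


class ParallelSeedsSketch:
    # Toy estimator illustrating the n_jobs-in-the-constructor convention.

    def __init__(self, bandwidth=1.0, n_jobs=1):
        self.bandwidth = bandwidth
        self.n_jobs = n_jobs  # exposed as a constructor parameter, as in KMeans

    def fit(self, X, seeds):
        # joblib dispatches one task per seed across n_jobs workers.
        self.centers_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_shift_single_seed)(seed, X, self.bandwidth)
            for seed in seeds)
        return self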

@martinosorb
Author

Good to know! I'll work on it in the next few days and update the pull request. In general, the speed improvement is massive, and it has already been useful in my work. Thanks!

@martinosorb reopened this on May 27, 2015
@amueller
Member

Wait, on second thought, I am not sure what is happening. Currently the algorithm only provides support for a single seed, right? Did you change the semantics of seeds?

@martinosorb
Author

What do you mean by “a single seed”? Basically, I turned the loop “for seed in seeds” into parallel processes. In other words, the variable seeds is divided into chunks that are fed to different cores.

@amueller
Member

Never mind, I was just confused.

@martinosorb
Author

Changed as you requested. Is it acceptable?

@@ -13,6 +13,10 @@
# Alexandre Gramfort <alexandre.gramfort@inria.fr>
# Gael Varoquaux <gael.varoquaux@normalesup.org>

# Modified: Martino Sorbaro <martino.sorbaro@ed.ac.uk>
Member

You can just add yourself to the authors above; don't add a comment on the change here. Add an entry to whatsnew.rst instead.

@amueller
Member

Apart from my minor comments this looks good. Can you maybe give some timing results on, say, digits or some other built-in dataset?

@martinosorb
Author

Minor changes committed; I'll provide info on timings later.

@martinosorb
Author

Hi, there is something wrong. With multiprocessing I had good results, much faster than the serial case. Using joblib, I see my CPUs working at half power, with some sleeping processes, resulting in a run time that is actually longer than the serial case!
I'm sorry, but I have no idea why this happens. If you spot a reason please tell me, though I understand this is not your job.

@amueller
Member

amueller commented Jun 1, 2015

Damn. This is probably because new workers are created over and over again, so that is not good :-/ We've seen similar issues before. @ogrisel and @GaelVaroquaux, do you have a good idea on how to move forward? There was a joblib issue on reusing worker pools; do we want to wait for that?

@amueller
Member

amueller commented Jun 5, 2015

The actual fix, I think, is here: joblib/joblib#157. Can you try that?

@martinosorb
Author

I'm having trouble making my sklearn clone use other code; I'm just not used to working with complex Python directory structures, sorry. I'll try to find time to do it. Is there any reason it shouldn't work?

@amueller
Member

amueller commented Jun 8, 2015

Well, it might not be as fast as your original code; that would be good to know.

@martinosorb
Author

So, be warned that this was a bit of an artisanal solution (I simply copied the relevant files into a folder and changed the imports); these are some results:

(Original sklearn)
ms_orig = origMS(bandwidth=30)
%timeit ms_orig.fit_predict(dig_img)

1 loops, best of 3: 7.33 s per loop

(With joblib as in sklearn)
ms_jl = jlMS(bandwidth=30,n_jobs=4)
%timeit ms_jl.fit_predict(dig_img)

1 loops, best of 3: 9.41 s per loop

(With multiprocessing)
ms_mp = mpMS(bandwidth=30,n_jobs=4)
%timeit ms_mp.fit_predict(dig_img)

1 loops, best of 3: 2.81 s per loop

(With joblib as in the "bench-batching" branch)
ms_jln = jlnMS(bandwidth=30,n_jobs=4)
%timeit ms_jln.fit_predict(dig_img)

1 loops, best of 3: 3.19 s per loop

I used the "digits" dataset as you suggested.

@amueller
Member

amueller commented Jun 8, 2015

Great, thanks. So it seems to me the joblib branch is good, or at least an improvement. Maybe @ogrisel can find the time to work on it.
Unfortunately that leaves us here in a bit of a stalemate. Merging what you have now would make it slower; we need to wait until the joblib branch is merged, released, and backported to scikit-learn.
Alternatively we could start using multiprocessing directly, but that doesn't seem a great option to me. @GaelVaroquaux and @ogrisel might have opinions.

break
completed_iterations += 1
#execute iterations on all seeds in parallel
all_res = Parallel(n_jobs=n_jobs)(delayed(_mean_shift_single_seed)(seed,X,nbrs,max_iter) for seed in seeds)
Member

style: you can break the line after Parallel(n_jobs=n_jobs)( and before delayed(...)
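
Concretely, the suggested wrapping of the line above (with the PEP 8 spacing requested later in the review) would look roughly like:

all_res = Parallel(n_jobs=n_jobs)(
    delayed(_mean_shift_single_seed)(seed, X, nbrs, max_iter)
    for seed in seeds)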

@ogrisel
Member

ogrisel commented Jun 19, 2015

+1 for merging joblib/joblib#157 quickly, making a release of joblib with that new optimisation, and synchronising the copy in scikit-learn's externals so that this PR can benefit from it.

@martinosorb in the meantime, can you please fix the style of this patch to follow PEP 8? (You can use the pep8 linter to find the violations: mainly spacing after the commas in the function calls and the 80-column limit.)

@ogrisel
Member

ogrisel commented Jun 19, 2015

Actually @martinosorb, could you also please benchmark this with the threading backend of joblib (no need to use the joblib/joblib#157 branch for that benchmark)? It might be the case that, if the brute-force neighbors method is used, parallelism could be achieved efficiently with threads, thanks to numpy releasing the GIL most of the time.

When using the kd-tree / ball-tree neighbors, it's unfortunately very likely that the GIL is going to hurt us, as it seems we don't release it in that part of the code yet (radius queries).

@ogrisel
Member

ogrisel commented Jun 19, 2015

W.r.t. using threading here, I think this would benefit from merging #4009, but it needs a rebase first.

@martinosorb
Author

Again, this was just a quick check (I used the joblib bundled with my pip-installed sklearn, not the one from the fork). I can do it on a machine with more than 2 cores if you need. Notice that, again, it's slower using 2 cores than using 1.

from mean_shift_ import MeanShift
ms1 = MeanShift(bandwidth=30,n_jobs=1)
ms2 = MeanShift(bandwidth=30,n_jobs=2)

%timeit ms1.fit(digits)
1 loops, best of 3: 19 s per loop
%timeit ms2.fit(digits)
1 loops, best of 3: 41.5 s per loop

from mean_shift_thread import MeanShift as MeanShiftT
ms1T = MeanShiftT(bandwidth=30,n_jobs=1)
ms2T = MeanShiftT(bandwidth=30,n_jobs=2)

%timeit ms1T.fit(digits)
1 loops, best of 3: 18.3 s per loop
%timeit ms2T.fit(digits)
1 loops, best of 3: 20 s per loop

The file "mean_shift_thread" is the same as my mean_shift_, but with «backend="threading"»

@ogrisel
Member

ogrisel commented Jun 22, 2015

Alright, thanks for checking. Let's stick with the default multiprocessing backend for now; we might re-explore the threading backend if we find ways to release the GIL in the critical sections of the nearest-neighbors models.

@ogrisel
Member

ogrisel commented Jun 29, 2015

#4905 has been submitted to upgrade the embedded joblib to the new beta that includes the auto-batching feature. Once merged, we can rebase this PR on master, and then it's +1 from my side to merge this in turn.
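
As far as I understand, the auto-batching referred to here corresponds to joblib's batch_size parameter (default 'auto' from joblib 0.9 onwards); a rough, self-contained illustration:

from joblib import Parallel, delayed


def _square(x):
    # A deliberately tiny task: without batching, dispatch overhead
    # dominates and parallelism is a net loss.
    return x * x


if __name__ == "__main__":
    # batch_size='auto' groups many short tasks into larger batches to
    # amortise the per-task dispatch overhead.
    results = Parallel(n_jobs=4, batch_size='auto')(
        delayed(_square)(i) for i in range(10000))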

@ogrisel changed the title from "Implemented parallelised version of mean_shift and test function." to "[MRG+1] Implemented parallelised version of mean_shift and test function." on Jun 29, 2015
@ogrisel
Member

ogrisel commented Jun 30, 2015

#4905 has been merged into master. @martinosorb could you please rebase this PR and check that the performance is as good as in your last benchmark?

@martinosorb
Author

OK, I don't know if I messed up the files, but it's still slower on 32 processors than on 1.

@ogrisel
Member

ogrisel commented Jul 3, 2015

Have you rebased? Based on some tests that I did on my box, it seems to work much better than previously:

from sklearn.datasets import make_blobs
from sklearn.cluster import mean_shift
import time


X, y = make_blobs(n_samples=5000, n_features=30, centers=30)


print('sequential')
t0 = time.time()
mean_shift(X)
print('%0.3fs' % (time.time() - t0))

print('n_jobs=10')
t0 = time.time()
mean_shift(X, n_jobs=10)
print('%0.3fs' % (time.time() - t0))

On this branch (without rebasing on the new joblib):

sequential
22.239s
n_jobs=10
27.784s

With the same branch once rebased on the current master (that includes joblib 0.9.0b2):

sequential
23.377s
n_jobs=10
7.049s

It's not a linear speedup but it's much better.

@ogrisel
Member

ogrisel commented Jul 3, 2015

+1 for merge on my side (with a rebase + squash, but I can do it) if @martinosorb or @amueller have no objection.

@martinosorb
Author

I thought I had rebased, but honestly I'm not that familiar with git, and I also may have messed up the virtualenvs. If you tried and it succeeded, that's good news. Thanks!

@amueller
Member

Alright, seems good. It would be nice to have a comparison against the multiprocessing version.

@jnothman
Member

I take it this should be merged...?

@GaelVaroquaux
Member

If this counts as a +1 from your side (i.e. you have reviewed the PR), then yes, please go ahead.

If not, I'll review it soon and hopefully merge it.

@jnothman
Member

No, I've not reviewed it; I was just trying to understand whether @amueller's comment was intended as support.

@GaelVaroquaux
Member

@jnothman Are you going to review it, or should I?

@jnothman
Member

Sorry, I can't now.


@GaelVaroquaux
Member

OK, I'll review.

@@ -45,6 +46,16 @@ def test_mean_shift():
n_clusters_ = len(labels_unique)
assert_equal(n_clusters_, n_clusters)
Member

There should be an additional empty line here for PEP 8 compliance.

@GaelVaroquaux
Member

I am fixing the two minor issues and merging.

@GaelVaroquaux
Member

Travis is green. Merging!

GaelVaroquaux added a commit that referenced this pull request Aug 30, 2015
[MRG+1] Implemented parallelised version of mean_shift and test function.
@GaelVaroquaux merged commit 55c32ef into scikit-learn:master on Aug 30, 2015
@amueller
Member

thanks everyone :)

@martinosorb
Author

Thanks! I learned quite a lot about github doing this.
(and from Nelle's talk in Munich this morning...)
