[MRG+1] Implemented parallelised version of mean_shift and test function. #4779


Merged
5 commits merged into scikit-learn:master on Aug 30, 2015

Conversation

martinosorb

The main iterative loop, which is run for every seed, was moved into a separate, module-level function (this is essentially required by multiprocessing.Pool.map()). A fit_parallel method was added to the main class, which allows the algorithm to be executed in parallel over the many seeds; everything else is the same. The change is fully backwards compatible: behaviour is unchanged unless the user explicitly calls fit_parallel.
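
For illustration only, here is a minimal sketch of the pattern described above; the function names (_iterate_single_seed, fit_parallel_sketch) and the exact stopping rule are assumptions for the example, not the code in this PR:

from multiprocessing import Pool

import numpy as np
from sklearn.neighbors import NearestNeighbors


def _iterate_single_seed(args):
    # Module-level function so that multiprocessing.Pool.map() can pickle it.
    seed, X, nbrs, bandwidth, max_iter = args
    mean = np.asarray(seed, dtype=float)
    n_within = 0
    for _ in range(max_iter):
        # Indices of points within one bandwidth of the current mean (flat kernel).
        i_nbrs = nbrs.radius_neighbors([mean], bandwidth,
                                       return_distance=False)[0]
        points_within = X[i_nbrs]
        n_within = len(points_within)
        if n_within == 0:
            break
        old_mean = mean
        mean = points_within.mean(axis=0)
        if np.linalg.norm(mean - old_mean) < 1e-3 * bandwidth:
            break
    return tuple(mean), n_within


def fit_parallel_sketch(X, seeds, bandwidth, max_iter=300, n_processes=4):
    # Fit the neighbors structure once, then map the per-seed loop
    # over a pool of worker processes, one task per seed.
    nbrs = NearestNeighbors(radius=bandwidth).fit(X)
    args = [(seed, X, nbrs, bandwidth, max_iter) for seed in seeds]
    with Pool(processes=n_processes) as pool:
        return pool.map(_iterate_single_seed, args)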

@amueller
Member

In general I think the idea is good, but the interface is not how we usually do it. Please have a look at how it is done in KMeans: you should use joblib.Parallel and, rather than introducing an additional method, add a constructor parameter n_jobs=1.
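
As a sketch of the requested interface (the class and helper names here are invented for the example, not scikit-learn code), the convention looks roughly like this:

import numpy as np
from joblib import Parallel, delayed


def _shift_single_seed(seed, X, bandwidth):
    # Stand-in for the real per-seed iteration: a single flat-kernel
    # update, just so the example runs end to end.
    within = X[np.linalg.norm(X - seed, axis=1) <= bandwidth]
    return within.mean(axis=0) if len(within) else np.asarray(seed, dtype=float)


class ParallelSeedsSketch:
    # Toy estimator illustrating the n_jobs-in-the-constructor convention.

    def __init__(self, bandwidth=1.0, n_jobs=1):
        self.bandwidth = bandwidth
        self.n_jobs = n_jobs  # exposed as a constructor parameter, as in KMeans

    def fit(self, X, seeds):
        # joblib dispatches one task per seed across n_jobs workers.
        self.centers_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_shift_single_seed)(seed, X, self.bandwidth)
            for seed in seeds)
        return self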

@martinosorb
Author

Good to know! I'll work on it in the next few days and update the pull request. In general, the speed improvement is massive, and it has already been useful in my work. Thanks!

@martinosorb reopened this on May 27, 2015
@amueller
Member

Wait, on second thought, I am not sure what is happening. Currently the algorithm only provides support for a single seed, right? Did you change the semantics of seeds?

@martinosorb
Author

What do you mean by “a single seed”? Basically, I turned the loop “for seed in seeds” into parallel processes. In other words, the variable seeds is divided into chunks that are fed to different cores.

@amueller
Member

Never mind, I was just confused.

@martinosorb
Author

Changed as you requested. Is it acceptable?

@@ -13,6 +13,10 @@
# Alexandre Gramfort <alexandre.gramfort@inria.fr>
# Gael Varoquaux <gael.varoquaux@normalesup.org>

# Modified: Martino Sorbaro <martino.sorbaro@ed.ac.uk>
Member

You can just add yourself to the authors above; don't add a comment on the change here. Add an entry to whatsnew.rst instead.

@amueller
Member

Apart from my minor comments this looks good. Can you maybe give some timing results on, say, digits or some other built-in dataset?

@martinosorb
Author

Minor changes committed; I'll provide info on timings later.

@martinosorb
Author

Hi, there is something wrong. With multiprocessing I had good results, much faster than the serial case. Using joblib, I see my CPUs working at half power, with some sleeping processes, resulting in a run time that is actually longer than the serial case!
I'm sorry, but I have no idea why this happens. If you spot a reason please tell me, though I understand this is not your job.

@amueller
Member

amueller commented Jun 1, 2015

Damn. This is probably because new workers are created over and over again, so that is not good :-/ We've seen similar issues before. @ogrisel and @GaelVaroquaux, do you have a good idea on how to move forward? There was a joblib issue on reusing worker pools; do we want to wait for that?

@amueller
Member

amueller commented Jun 5, 2015

The actual fix, I think, is here: joblib/joblib#157. Can you try that?

@martinosorb
Author

I'm having trouble making my sklearn clone use other code; I'm just not used to working with complex Python directory structures, sorry. I'll try to find time to do it. Is there any reason it shouldn't work?

@amueller
Member

amueller commented Jun 8, 2015

Well, it might not be as fast as your original code; that would be good to know.

@martinosorb
Author

So, be warned that this was a bit of an artisanal solution (I simply copied the relevant files into a folder and changed the imports); these are some results:

(Original sklearn)
ms_orig = origMS(bandwidth=30)
%timeit ms_orig.fit_predict(dig_img)

1 loops, best of 3: 7.33 s per loop

(With joblib as in sklearn)
ms_jl = jlMS(bandwidth=30,n_jobs=4)
%timeit ms_jl.fit_predict(dig_img)

1 loops, best of 3: 9.41 s per loop

(With multiprocessing)
ms_mp = mpMS(bandwidth=30,n_jobs=4)
%timeit ms_mp.fit_predict(dig_img)

1 loops, best of 3: 2.81 s per loop

(With joblib as in the "bench-batching" branch)
ms_jln = jlnMS(bandwidth=30,n_jobs=4)
%timeit ms_jln.fit_predict(dig_img)

1 loops, best of 3: 3.19 s per loop

I used the "digits" dataset as you suggested.

@amueller
Member

amueller commented Jun 8, 2015

Great, thanks. So it seems to me the joblib branch is good, or at least an improvement. Maybe @ogrisel can find the time to work on it.
Unfortunately that leaves us here in a bit of a stalemate. Merging what you have now would make it slower; we need to wait until the joblib branch is merged, released, and backported to scikit-learn.
Alternatively we could start using multiprocessing directly, but that doesn't seem a great option to me. @GaelVaroquaux and @ogrisel might have opinions.

break
completed_iterations += 1
#execute iterations on all seeds in parallel
all_res = Parallel(n_jobs=n_jobs)(delayed(_mean_shift_single_seed)(seed,X,nbrs,max_iter) for seed in seeds)
Member

style: you can break the line after Parallel(n_jobs=n_jobs)( and before delayed(...)
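
Concretely, the suggested wrapping of the line above (with the PEP 8 spacing requested later in the review) would look roughly like:

all_res = Parallel(n_jobs=n_jobs)(
    delayed(_mean_shift_single_seed)(seed, X, nbrs, max_iter)
    for seed in seeds)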

@ogrisel
Member

ogrisel commented Jun 19, 2015

+1 for merging joblib/joblib#157 quickly, making a release of joblib with that new optimisation, and synchronising the copy in scikit-learn's externals so that this PR can benefit from it.

@martinosorb in the meantime, can you please fix the style of this patch to follow PEP 8? (You can use the pep8 linter to find the violations: mainly spacing after the commas in the function calls and the 80-column limit.)

@ogrisel
Member

ogrisel commented Jun 19, 2015

Actually @martinosorb, could you also please benchmark this with the threading backend of joblib (no need to use the joblib/joblib#157 branch for that benchmark)? It might be the case that, if the brute-force neighbors method is used, parallelism could be achieved efficiently with threads, thanks to numpy releasing the GIL most of the time.

When using the kd-tree / ball-tree neighbors, it's unfortunately very likely that the GIL is going to hurt us, as it seems we don't release it in that part of the code yet (radius queries).

@ogrisel
Member

ogrisel commented Jun 19, 2015

W.r.t. using threading here, I think this would benefit from merging #4009, but it needs a rebase first.

@martinosorb
Author

Again, this was just a quick check (I used the joblib bundled with my pip-installed sklearn, not the one from the fork). I can do it on a machine with more than 2 cores if you need. Notice that, again, it's slower using 2 cores than using 1.

from mean_shift_ import MeanShift
ms1 = MeanShift(bandwidth=30,n_jobs=1)
ms2 = MeanShift(bandwidth=30,n_jobs=2)

%timeit ms1.fit(digits)
1 loops, best of 3: 19 s per loop
%timeit ms2.fit(digits)
1 loops, best of 3: 41.5 s per loop

from mean_shift_thread import MeanShift as MeanShiftT
ms1T = MeanShiftT(bandwidth=30,n_jobs=1)
ms2T = MeanShiftT(bandwidth=30,n_jobs=2)

%timeit ms1T.fit(digits)
1 loops, best of 3: 18.3 s per loop
%timeit ms2T.fit(digits)
1 loops, best of 3: 20 s per loop

The file "mean_shift_thread" is the same as my mean_shift_, but with «backend="threading"»

@ogrisel
Member

ogrisel commented Jun 22, 2015

Alright, thanks for checking. Let's stick with the default multiprocessing backend for now; we might re-explore the threading backend if we find ways to release the GIL in the critical sections of the nearest-neighbors models.

@ogrisel
Member

ogrisel commented Jun 29, 2015

#4905 has been submitted to upgrade the embedded joblib to the new beta that includes the auto-batching feature. Once merged, we can rebase this PR on master, and then it's +1 from my side to merge this in turn.
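
As far as I understand, the auto-batching referred to here corresponds to joblib's batch_size parameter (default 'auto' from joblib 0.9 onwards); a rough, self-contained illustration:

from joblib import Parallel, delayed


def _square(x):
    # A deliberately tiny task: without batching, dispatch overhead
    # dominates and parallelism is a net loss.
    return x * x


if __name__ == "__main__":
    # batch_size='auto' groups many short tasks into larger batches to
    # amortise the per-task dispatch overhead.
    results = Parallel(n_jobs=4, batch_size='auto')(
        delayed(_square)(i) for i in range(10000))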

@ogrisel changed the title from "Implemented parallelised version of mean_shift and test function." to "[MRG+1] Implemented parallelised version of mean_shift and test function." on Jun 29, 2015
@ogrisel
Member

ogrisel commented Jun 30, 2015

#4905 has been merged into master. @martinosorb could you please rebase this PR and check that the performance is as good as in your last benchmark?

@martinosorb
Author

OK, I don't know if I messed up the files, but it's still slower on 32 processors than on 1.

@ogrisel
Member

ogrisel commented Jul 3, 2015

Have you rebased? Based on some tests that I did on my box, it seems to work much better than previously:

from sklearn.datasets import make_blobs
from sklearn.cluster import mean_shift
import time


X, y = make_blobs(n_samples=5000, n_features=30, centers=30)


print('sequential')
t0 = time.time()
mean_shift(X)
print('%0.3fs' % (time.time() - t0))

print('n_jobs=10')
t0 = time.time()
mean_shift(X, n_jobs=10)
print('%0.3fs' % (time.time() - t0))

On this branch (without rebasing on the new joblib):

sequential
22.239s
n_jobs=10
27.784s

With the same branch once rebased on the current master (that includes joblib 0.9.0b2):

sequential
23.377s
n_jobs=10
7.049s

It's not a linear speedup but it's much better.

@ogrisel
Member

ogrisel commented Jul 3, 2015

+1 for merge on my side (with a rebase + squash, but I can do it) if @martinosorb or @amueller have no objection.

@martinosorb
Author

I thought I had rebased, but honestly I'm not that familiar with git, and I also may have messed up the virtualenvs. If you tried and it succeeded, that's good news. Thanks!

@amueller
Member

Alright, seems good. It would be nice to have a comparison against the multiprocessing version.

@jnothman
Member

I take it this should be merged...?

@GaelVaroquaux
Member

If this counts as a +1 from your side (i.e. you have reviewed the PR), then yes, please go ahead.

If not, I'll review it soon and hopefully merge it.

@jnothman
Member

No, I've not reviewed it; I was just trying to understand whether @amueller's comment was intended as support.

@GaelVaroquaux
Member

@jnothman Are you going to review it, or should I?

@jnothman
Member

Sorry, I can't now.


@GaelVaroquaux
Member

OK, I'll review.

@@ -45,6 +46,16 @@ def test_mean_shift():
n_clusters_ = len(labels_unique)
assert_equal(n_clusters_, n_clusters)
Member

There should be an additional empty line here for PEP 8 compliance.

@GaelVaroquaux
Member

I am fixing the two minor issues and merging.

@GaelVaroquaux
Member

Travis is green. Merging!

GaelVaroquaux added a commit that referenced this pull request Aug 30, 2015
[MRG+1] Implemented parallelised version of mean_shift and test function.
@GaelVaroquaux merged commit 55c32ef into scikit-learn:master on Aug 30, 2015
@amueller
Member

thanks everyone :)

@martinosorb
Author

Thanks! I learned quite a lot about github doing this.
(and from Nelle's talk in Munich this morning...)
