Idea: speed up the parallelization of CountVectorizer by adding a batch mechanism to joblib.Parallel #1401

Closed
cjauvin wants to merge 3 commits

Conversation

@cjauvin (Contributor) commented Nov 23, 2012

A couple of weeks ago, I submitted this experimental idea to the joblib mailing list, but it didn't receive much attention:


As I was studying the implementation of the sklearn CountVectorizer, my attention was drawn to a comment in the code saying that its main loop could not be efficiently parallelized with joblib:

# TODO: parallelize the following loop with joblib?

I don't know much about the internals of multiprocessing, but I imagined that there might be a tradeoff between the size of individual jobs and the number of times that a process in the pool is dispatched a new job. For instance, if the vectorizer is passed a very long list of very short documents, then it would seem possible that the dispatching overhead makes it very suboptimal. Perhaps a smaller number of longer jobs would work better in that case?

To explore this hypothesis, I had the simple idea of chaining jobs together into batches (each batch to be executed by a single process) and dispatching those instead of individual jobs. The user can then experiment with different batch sizes, trying to find the sweet spot.
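
To make the batching idea concrete, here is a minimal sketch (an editorial illustration only; the helper names, batch size, and toy per-document work are hypothetical, not the PR's actual code): consecutive documents are grouped into fixed-size batches, and one joblib job is dispatched per batch rather than per document.

from joblib import Parallel, delayed

def process_batch(batch):
    # stand-in for the real per-document work (tokenizing and counting terms)
    return [len(doc.split()) for doc in batch]

def make_batches(docs, batch_size):
    # chain consecutive documents into batches of the requested size
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

docs = ["a short document"] * 100000

# one dispatched job per batch of 500 documents, instead of one per document
per_batch = Parallel(n_jobs=4)(
    delayed(process_batch)(batch) for batch in make_batches(docs, batch_size=500)
)

# flatten the per-batch results back into a single per-document list
results = [count for batch in per_batch for count in batch]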


Today I have decided to try my idea in a more realistic setting, by implementing a minimal version of it for CountVectorizer (it's not a full solution, see caveats in the code comments). Using a 1M-line text corpus and a 24-core Linux machine, parallelizing without batches almost doubles the required time (which I gotta admit is worryingly far from the +20% figure mentioned by @larsmans; my implementation is probably not optimal in that regard), whereas I have obtained a 3.5X speedup with my batch mechanism:

https://gist.github.com/4137131

I don't know if the idea makes much sense, but I thought I would submit it here anyway, just to see what people think.

@cjauvin (Contributor, Author) commented Nov 28, 2012

Although I'd have preferred to be told why, I guess this total absence of feedback speaks for itself: my idea probably doesn't make sense.. so I hereby close the PR.

@cjauvin closed this Nov 28, 2012
@GaelVaroquaux (Member) commented:

> Although I'd have preferred to be told why, I guess this total absence of feedback speaks for itself: my idea probably doesn't make sense.. so I hereby close the PR.

Please don't. It's just lack of time. I want to look at this. It is on my todo list. I am just collapsing under work and starting to feel a beginning of a breakdown due to todo list overload.

@larsmans (Member) commented

Same story here. I'd love to review, but deadlines are approaching.

@cjauvin (Contributor, Author) commented Nov 28, 2012

In that case, I reopen it.. :-) Sorry about that.. it's a simple exploratory idea, and it can certainly wait.

@cjauvin reopened this Nov 28, 2012
@amueller (Member) commented

@larsmans deadlines? Which one?

@larsmans (Member) commented

Deliverables.

@@ -432,7 +468,7 @@ def fit(self, raw_documents, y=None):
         self.fit_transform(raw_documents)
         return self

-    def fit_transform(self, raw_documents, y=None):
+    def fit_transform(self, raw_documents, y=None, n_jobs=1, batch_size=1):
A reviewer (Contributor) commented on this diff:

n_jobs and batch_size should be set in the constructor instead. It would also be nice if you could include a docstring for both of them; it might indeed not be clear to everyone what batch_size means.
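
For illustration only (this is not code from the PR, just a sketch of the reviewer's suggestion), the two options would be stored on the estimator and described in its docstring rather than passed to fit_transform:

class CountVectorizer:
    """Simplified sketch of the suggested API, not the real sklearn class.

    Parameters
    ----------
    n_jobs : int, default=1
        Number of worker processes used to process documents in parallel.
    batch_size : int, default=1
        Number of documents chained together into a single parallel job;
        larger values reduce dispatching overhead when documents are short.
    """

    def __init__(self, n_jobs=1, batch_size=1):
        self.n_jobs = n_jobs
        self.batch_size = batch_size

    def fit_transform(self, raw_documents, y=None):
        # the parallel settings are now read from the estimator itself
        ...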

@jnothman (Member) commented

A batch mechanism is now in joblib.Parallel. Should we be incorporating it in CountVectorizer?

@cjauvin (Contributor, Author) commented Aug 16, 2015

It's funny, because I had proposed a batching mechanism idea for Joblib more than 2 years ago:

cjauvin/joblib@04846a0

It was very sketchy, and I'm really not sure that what finally got implemented (which I wasn't aware of) is even related at all..

@jnothman (Member) commented

The new batching involves dynamic sizing. Still, given a more recent PR, I suspect we're not going to get much benefit from batching CountVectorizer, but more benchmarks are needed.
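
For reference, the batching that ended up in joblib is exposed through the batch_size parameter of joblib.Parallel: the default 'auto' grows or shrinks batches dynamically based on how long recently completed tasks took, while an explicit integer fixes the number of tasks per dispatched job. A minimal usage sketch (the toy task here is illustrative, not CountVectorizer code):

from joblib import Parallel, delayed

def n_tokens(doc):
    # toy per-document task: count whitespace-separated tokens
    return len(doc.split())

docs = ["a short document"] * 100000

# batch_size='auto' lets joblib group many fast tasks into each dispatched job;
# an integer (e.g. batch_size=500) would fix the number of tasks per job instead
counts = Parallel(n_jobs=4, batch_size='auto')(
    delayed(n_tokens)(doc) for doc in docs
)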

@rth mentioned this pull request Jun 15, 2018
@rth (Member) commented Sep 26, 2018

The code of CountVectorizer has evolved a lot since this PR was made.

I am going to close this PR, given that any new attempts to parallelize this estimator will mostly have to start from scratch (while using this PR as inspiration). There is also some related discussion in dask/dask-ml#5.

Thanks everyone for contributing!

@rth closed this Sep 26, 2018