Idea: speed up the parallelization of CountVectorizer by adding a batch mechanism to joblib.Parallel #1401

Closed
cjauvin wants to merge 3 commits

Conversation

@cjauvin (Contributor) commented Nov 23, 2012

A couple of weeks ago, I submitted this experimental idea to the joblib mailing list, but it didn't receive much attention:


As I was studying the implementation of the sklearn CountVectorizer, my attention was drawn to a comment in the code saying that its main loop could not be efficiently parallelized with joblib:

# TODO: parallelize the following loop with joblib?

I don't know much about the internals of multiprocessing, but I imagined that there might be a tradeoff between the size of individual jobs and the number of times that a process in the pool is dispatched a new job. For instance, if the vectorizer is passed a very long list of very short documents, then it would seem possible that the dispatching overhead makes it very suboptimal. Perhaps a smaller number of longer jobs would work better in that case?

To explore this hypothesis, I had the simple idea of chaining jobs together into batches (each batch to be executed by a single process) and dispatching those instead of individual jobs. The user can then experiment with different batch sizes, trying to find the sweet spot.
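
To make the batching idea concrete, here is a minimal sketch (an editorial illustration only; the helper names, batch size, and toy per-document work are hypothetical, not the PR's actual code): consecutive documents are grouped into fixed-size batches, and one joblib job is dispatched per batch rather than per document.

from joblib import Parallel, delayed

def process_batch(batch):
    # stand-in for the real per-document work (tokenizing and counting terms)
    return [len(doc.split()) for doc in batch]

def make_batches(docs, batch_size):
    # chain consecutive documents into batches of the requested size
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

docs = ["a short document"] * 100000

# one dispatched job per batch of 500 documents, instead of one per document
per_batch = Parallel(n_jobs=4)(
    delayed(process_batch)(batch) for batch in make_batches(docs, batch_size=500)
)

# flatten the per-batch results back into a single per-document list
results = [count for batch in per_batch for count in batch]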


Today I have decided to try my idea in a more realistic setting, by implementing a minimal version of it for CountVectorizer (it's not a full solution, see caveats in the code comments). Using a 1M-line text corpus and a 24-core Linux machine, parallelizing without batches almost doubles the required time (which I gotta admit is worryingly far from the +20% figure mentioned by @larsmans; my implementation is probably not optimal in that regard), whereas I have obtained a 3.5X speedup with my batch mechanism:

https://gist.github.com/4137131

I don't know if the idea makes much sense, but I thought I would submit it here anyway, just to see what people think.

@cjauvin (Contributor, Author) commented Nov 28, 2012

Although I'd have preferred to be told why, I guess this total absence of feedback speaks for itself: my idea probably doesn't make sense.. so I hereby close the PR.

@cjauvin closed this Nov 28, 2012
@GaelVaroquaux (Member) commented:

> Although I'd have preferred to be told why, I guess this total absence of feedback speaks for itself: my idea probably doesn't make sense.. so I hereby close the PR.

Please don't. It's just lack of time. I want to look at this. It is on my todo list. I am just collapsing under work and starting to feel a beginning of a breakdown due to todo list overload.

@larsmans (Member) commented

Same story here. I'd love to review, but deadlines are approaching.

@cjauvin (Contributor, Author) commented Nov 28, 2012

In that case, I reopen it.. :-) Sorry about that.. it's a simple exploratory idea, and it can certainly wait.

@cjauvin reopened this Nov 28, 2012
@amueller (Member) commented

@larsmans deadlines? Which one?

@larsmans (Member) commented

Deliverables.

@@ -432,7 +468,7 @@ def fit(self, raw_documents, y=None):
         self.fit_transform(raw_documents)
         return self

-    def fit_transform(self, raw_documents, y=None):
+    def fit_transform(self, raw_documents, y=None, n_jobs=1, batch_size=1):
A reviewer (Contributor) commented on this diff:

n_jobs and batch_size should be set in the constructor instead. It would also be nice if you could include a docstring for both of them; it might indeed not be clear to everyone what batch_size means.
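
For illustration only (this is not code from the PR, just a sketch of the reviewer's suggestion), the two options would be stored on the estimator and described in its docstring rather than passed to fit_transform:

class CountVectorizer:
    """Simplified sketch of the suggested API, not the real sklearn class.

    Parameters
    ----------
    n_jobs : int, default=1
        Number of worker processes used to process documents in parallel.
    batch_size : int, default=1
        Number of documents chained together into a single parallel job;
        larger values reduce dispatching overhead when documents are short.
    """

    def __init__(self, n_jobs=1, batch_size=1):
        self.n_jobs = n_jobs
        self.batch_size = batch_size

    def fit_transform(self, raw_documents, y=None):
        # the parallel settings are now read from the estimator itself
        ...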

@jnothman (Member) commented

A batch mechanism is now in joblib.Parallel. Should we be incorporating it in CountVectorizer?

@cjauvin (Contributor, Author) commented Aug 16, 2015

It's funny, because I had proposed a batching mechanism idea for Joblib more than 2 years ago:

cjauvin/joblib@04846a0

It was very sketchy, and I'm really not sure that what finally got implemented (which I wasn't aware of) is even related at all..

@jnothman (Member) commented

The new batching involves dynamic sizing. Still, given a more recent PR, I suspect we're not going to get much benefit from batching CountVectorizer, but more benchmarks are needed.
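
For reference, the batching that ended up in joblib is exposed through the batch_size parameter of joblib.Parallel: the default 'auto' grows or shrinks batches dynamically based on how long recently completed tasks took, while an explicit integer fixes the number of tasks per dispatched job. A minimal usage sketch (the toy task here is illustrative, not CountVectorizer code):

from joblib import Parallel, delayed

def n_tokens(doc):
    # toy per-document task: count whitespace-separated tokens
    return len(doc.split())

docs = ["a short document"] * 100000

# batch_size='auto' lets joblib group many fast tasks into each dispatched job;
# an integer (e.g. batch_size=500) would fix the number of tasks per job instead
counts = Parallel(n_jobs=4, batch_size='auto')(
    delayed(n_tokens)(doc) for doc in docs
)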

@rth mentioned this pull request Jun 15, 2018
@rth (Member) commented Sep 26, 2018

The code of CountVectorizer has evolved a lot since this PR was made.

I am going to close this PR, given that any new attempts to parallelize this estimator will mostly have to start from scratch (while using this PR as inspiration). There is also some related discussion in dask/dask-ml#5.

Thanks everyone for contributing!

@rth closed this Sep 26, 2018