Idea: speed up the parallelization of CountVectorizer by adding a batch mechanism to joblib.Parallel #1401
Conversation
Although I'd have preferred to be told why, I guess this total absence of feedback speaks for itself: my idea probably doesn't make sense.. so I hereby close the PR.
Please don't. It's just lack of time. I want to look at this. It is on my
Same story here. I'd love to review, but deadlines are approaching.
In that case, I reopen it.. :-) Sorry about that.. it's a simple exploratory idea, and it can certainly wait.
@larsmans deadlines? Which one?
Deliverables.
@@ -432,7 +468,7 @@ def fit(self, raw_documents, y=None):
         self.fit_transform(raw_documents)
         return self

-    def fit_transform(self, raw_documents, y=None):
+    def fit_transform(self, raw_documents, y=None, n_jobs=1, batch_size=1):
n_jobs and batch_size should be set in the constructor instead. It would also be nice if you could include a docstring for both of them. It might indeed not be clear to everyone what the meaning of batch_size is.
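A rough sketch of what that suggestion could look like; the class name, docstrings, and body below are purely illustrative and not the actual scikit-learn code:

```python
# Hypothetical sketch of the reviewer's suggestion: n_jobs and batch_size
# become documented constructor parameters instead of fit_transform arguments.
# This is an illustration only, not the real CountVectorizer implementation.
class BatchedCountVectorizer:
    """Toy stand-in for CountVectorizer.

    Parameters
    ----------
    n_jobs : int, default 1
        Number of worker processes used to count tokens.
    batch_size : int, default 1
        Number of documents dispatched to a worker as a single job.
    """

    def __init__(self, n_jobs=1, batch_size=1):
        self.n_jobs = n_jobs
        self.batch_size = batch_size

    def fit_transform(self, raw_documents, y=None):
        # The parallel counting loop would read self.n_jobs and
        # self.batch_size here instead of taking them as arguments.
        raise NotImplementedError
```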
Batch mechanism is now in joblib.parallel. Should we be incorporating it in
It's funny, because I had proposed a batching mechanism idea for Joblib more than 2 years ago: It was very sketchy, and I'm really not sure that what finally got implemented (which I wasn't aware of) is even related at all..
The new batching involves dynamic sizing. Still, given a more recent PR, I suspect we're not going to get much benefit from batching
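For context, this is roughly what using joblib's built-in batching looks like (assuming a joblib version that exposes the batch_size argument on Parallel; the tokenize task below is just a stand-in):

```python
# Illustration of joblib's built-in batching: batch_size="auto" groups many
# cheap tasks into a single dispatch and adjusts the group size dynamically
# based on how long previous batches took to run.
from joblib import Parallel, delayed

def tokenize(doc):
    # Stand-in for a cheap per-document task.
    return doc.split()

docs = ["a short document to tokenize"] * 100000
token_lists = Parallel(n_jobs=4, batch_size="auto")(
    delayed(tokenize)(doc) for doc in docs
)
```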
The code of CountVectorizer has evolved a lot since this PR was made. I am going to close this PR, given that any new attempts to parallelize this estimator will mostly have to start from scratch (while using this PR as inspiration). There is also some related discussion in dask/dask-ml#5. Thanks everyone for contributing!
A couple of weeks ago, I submitted this experimental idea to the joblib mailing list, but it didn't receive much attention:
As I was studying the implementation of the sklearn CountVectorizer, my attention was drawn to a comment in the code saying that its main loop could not be efficiently parallelized with joblib:
scikit-learn/sklearn/feature_extraction/text.py, line 469 at commit 33a5911
I don't know much about the internals of multiprocessing, but I imagined that there might be a tradeoff between the size of individual jobs and the number of times a process in the pool is dispatched a new job. For instance, if the vectorizer is passed a very long list of very short documents, the dispatching overhead could make parallelization very suboptimal. Perhaps a smaller number of longer jobs would work better in that case?
To explore this hypothesis, I had the simple idea of chaining jobs together into batches (each executed by a single process) and dispatching those instead of individual jobs. The user can then experiment with different batch sizes, trying to find the sweet spot.
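A minimal sketch of this batching idea, assuming a simple per-document counting job; the names _count_batch and batched_counts are illustrative, and the real patch is structured differently:

```python
# Sketch of the batching idea: instead of one joblib job per document,
# documents are grouped into batches and each batch is processed by a
# single worker, reducing dispatch overhead for many small documents.
from collections import Counter
from joblib import Parallel, delayed

def _count_batch(analyzer, docs):
    # Count token occurrences for a whole batch of documents in one job.
    return [Counter(analyzer(doc)) for doc in docs]

def batched_counts(raw_documents, analyzer, n_jobs=1, batch_size=1):
    batches = [raw_documents[i:i + batch_size]
               for i in range(0, len(raw_documents), batch_size)]
    per_batch = Parallel(n_jobs=n_jobs)(
        delayed(_count_batch)(analyzer, batch) for batch in batches
    )
    # Flatten the per-batch results back into one list of per-document counts.
    return [counts for batch in per_batch for counts in batch]

# Example: counts = batched_counts(corpus, str.split, n_jobs=4, batch_size=1000)
```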
Today I decided to try my idea in a more realistic setting by implementing a minimal version of it for CountVectorizer (it's not a full solution; see the caveats in the code comments). Using a 1M-line text corpus and a 24-core Linux machine, parallelizing without batches almost doubles the required time (which, I gotta admit, is worryingly far from the +20% figure mentioned by @larsmans; my implementation is probably not optimal in that regard), whereas I have obtained a 3.5x speedup with my batch mechanism:
https://gist.github.com/4137131
I don't know if the idea makes much sense, but I thought I would submit it here anyway, just to see what people think.