Parallelize transformers? #7635
It would need to override `get_params`.
@jnothman hmm, `get_params` looks pretty gnarly... sure we can't just leave it as is?
Ideally not...
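(To illustrate why `get_params` is the sticking point here — a sketch assuming `BaseEstimator`'s usual introspection: scikit-learn discovers parameters from the `__init__` signature, so an `n_jobs` that a mixin tacks on as a class attribute is invisible to `get_params`, `set_params`, and `clone`.)

```python
# Illustrative only: an attribute added outside __init__ is not a
# scikit-learn parameter, so cloning or grid search would silently drop it.
from sklearn.feature_extraction.text import CountVectorizer

class MixedIn(CountVectorizer):
    n_jobs = 4  # class attribute, not an __init__ parameter

print("n_jobs" in MixedIn().get_params())  # False
```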
This is very closely related to #7448, which proposed the same for `predict`.

I'm not sure how this would work as a mixin. Would all transformers get it? Then it should be in `TransformerMixin`. This wouldn't work in pipelines, unless we add this to pipelines...

Why not do it as a meta-estimator? That way we don't have to mess with `get_params`.
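(A rough sketch of that meta-estimator route — a hypothetical class, not anything in scikit-learn: because the wrapped transformer is an `__init__` parameter, `BaseEstimator` provides the nested `get_params`/`set_params` for free.)

```python
import numpy as np
import scipy.sparse as sp
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, TransformerMixin

class ParallelTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that parallelizes transform over row chunks."""

    def __init__(self, transformer, n_jobs=2):
        self.transformer = transformer
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        # Fitting stays serial; only transform is fanned out.
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        results = Parallel(n_jobs=self.n_jobs)(
            delayed(self.transformer.transform)(chunk)
            for chunk in np.array_split(X, self.n_jobs))
        # Stack sparse or dense chunk results back together.
        return sp.vstack(results) if sp.issparse(results[0]) else np.vstack(results)
```

Since the wrapper is itself an estimator, it would also slot into a `Pipeline` unchanged.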
Only transformers where `transform` is stateless?
@mblondel I think it should work with stateful transformers too:

```python
from sklearn.feature_extraction.text import CountVectorizer
from joblib import Parallel, delayed
import numpy as np
import scipy.sparse as sp

def transform_parallel(self, X, n_jobs):
    # Split X into n_jobs chunks, transform each in a thread, and stack.
    transform_splits = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(self.transform)(X_split)
        for X_split in np.array_split(X, n_jobs))
    return sp.vstack(transform_splits)

# Monkey-patch the parallel transform onto CountVectorizer.
CountVectorizer.transform_parallel = transform_parallel

c = CountVectorizer()
c.fit(["this and that", "the other"])
c.transform_parallel(["this", "other", "thing", "yeah"], n_jobs=4).toarray()
# array([[0, 0, 0, 0, 1],
#        [0, 1, 0, 0, 0],
#        [0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0]])
```
I'm curious about this myself. Would love to see it happen! Any efforts towards doing so? Or is this issue stale?
I would think this is best implemented as a mixin, but I don't think you should expect it to appear in the core library.
Also, IMO this discussion should not be about whether it's possible to parallelize transformers, but rather about whether it's possible to make transformers faster by using parallelization. For instance, if we take the `CountVectorizer` example above, the parallel version comes out no faster than the serial one; so in this case we can parallelize, but we definitely don't want to. At least not in this way; there was some prior discussion about parallelizing this estimator in #1401. In this example there is no speedup with the threading backend, probably because `CountVectorizer` doesn't release the GIL and we get some chunking / concatenation overhead.

Generally, dask-ml is better suited to such parallelization tasks. For instance, I think...
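(The comment trails off, but in that spirit, here is a sketch — my assumption of the intent, using plain `dask.bag` — of letting dask handle the chunking. `HashingVectorizer` is picked because it is stateless, so each partition can be vectorized independently:)

```python
import dask.bag as db
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

texts = ["this and that", "the other", "more text here", "yet more text"]
vec = HashingVectorizer(n_features=2 ** 10)  # stateless: no fit needed

# Each partition becomes one sparse matrix; compute() collects them.
bag = db.from_sequence(texts, npartitions=2)
parts = bag.map_partitions(lambda docs: [vec.transform(list(docs))]).compute()
X = sp.vstack(parts)
```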
As raised in http://stackoverflow.com/questions/39948138/sklearn-featurehasher-parallelized/39951415, many (all?) transformers could be made parallel. Would this make sense as a `ParallelTransformerMixin`, something like the sketch below?
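(The snippet that followed was lost here; below is a minimal reconstruction of what such a mixin might look like. The names, the joblib fan-out, and the class-attribute `n_jobs` are my assumptions, not the original code.)

```python
import numpy as np
import scipy.sparse as sp
from joblib import Parallel, delayed
from sklearn.feature_extraction.text import CountVectorizer

class ParallelTransformerMixin:
    """Override transform to fan out over row-wise chunks of X."""

    n_jobs = 1  # a real version would take this in __init__, which is
                # exactly the get_params wrinkle discussed above

    def transform(self, X):
        base_transform = super().transform  # the concrete class's transform
        results = Parallel(n_jobs=self.n_jobs, backend="threading")(
            delayed(base_transform)(chunk)
            for chunk in np.array_split(X, max(self.n_jobs, 1)))
        return sp.vstack(results) if sp.issparse(results[0]) else np.vstack(results)

class ParallelCountVectorizer(ParallelTransformerMixin, CountVectorizer):
    n_jobs = 2

vec = ParallelCountVectorizer().fit(["this and that", "the other"])
X = vec.transform(["this", "other", "thing", "yeah"])  # parallel under the hood
```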