Parallelize transformers? #7635

Closed
themrmax opened this issue Oct 10, 2016 · 9 comments

Comments

@themrmax
Contributor

themrmax commented Oct 10, 2016

As raised in http://stackoverflow.com/questions/39948138/sklearn-featurehasher-parallelized/39951415, many (all?) transformers could be made parallel. Would this make sense as a ParallelTransformerMixin? Something like:

from sklearn.externals.joblib import Parallel, delayed
import numpy as np
import scipy.sparse as sp

class ParallelTransformerMixin:
    def transform_parallel(self, X, n_jobs):
        transform_splits = Parallel(n_jobs=n_jobs, backend="threading")(
            delayed(self.transform)(X_split)
            for X_split in np.array_split(X, n_jobs))

        return sp.vstack(transform_splits)
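
For illustration, a hypothetical usage sketch (ParallelFeatureHasher is a made-up name; FeatureHasher is stateless, so no fit call is needed before transforming):

from sklearn.feature_extraction import FeatureHasher

class ParallelFeatureHasher(ParallelTransformerMixin, FeatureHasher):
    pass

h = ParallelFeatureHasher(n_features=32)
# hash 1000 samples across 4 threads
X = h.transform_parallel([{'foo': 1, 'bar': 2}] * 1000, n_jobs=4)
X.shape  # (1000, 32)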
@jnothman
Member

It would need to override __init__ and get_params, but maybe...
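
To make that concrete, a minimal sketch of the constraint (hypothetical class, assuming n_jobs becomes an estimator parameter rather than a transform argument): get_params introspects the __init__ signature, so every class using the mixin would have to declare and store n_jobs itself:

from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_jobs=1):  # n_jobs must be an explicit __init__ parameter ...
        self.n_jobs = n_jobs       # ... stored under the same name for get_params

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

MyTransformer().get_params()  # {'n_jobs': 1} -- visible to clone() and Pipeline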

@themrmax
Contributor Author

themrmax commented Oct 11, 2016

@jnothman hmm, get_params looks pretty gnarly... are you sure we can't just leave n_jobs as an argument to transform?

@jnothman
Member

Ideally not...

@amueller
Member

This is very closely related to #7448 which proposed the same for predict.

I'm not sure how this would work as a mixin. Would all transformers get it? Then it should be in TransformerMixin, right?

This wouldn't work in pipelines, unless we add this to pipelines...

Why not do it as a meta-estimator? That way we don't have to mess with Pipeline.
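
A rough sketch of that meta-estimator idea (hypothetical class, not an existing scikit-learn API; it delegates fit and parallelizes transform over row-wise splits, and BaseEstimator provides get_params/set_params for free):

import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.externals.joblib import Parallel, delayed

class ParallelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        splits = Parallel(n_jobs=self.n_jobs, backend="threading")(
            delayed(self.estimator.transform)(X_split)
            for X_split in np.array_split(X, self.n_jobs))
        # stack sparse or dense chunks back together
        return sp.vstack(splits) if sp.issparse(splits[0]) else np.vstack(splits)

Because it is an ordinary estimator, it can be dropped into a Pipeline without changing Pipeline itself.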

@mblondel
Member

> many (all?) transformers could be made parallel

Only transformers where fit doesn't do anything besides input checking (i.e., stateless transformers).

@themrmax
Contributor Author

themrmax commented Oct 18, 2016

@mblondel I think it should work with stateful transformers too:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.externals.joblib import Parallel, delayed
import numpy as np
import scipy.sparse as sp

def transform_parallel(self, X, n_jobs):
    transform_splits = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(self.transform)(X_split)
        for X_split in np.array_split(X, n_jobs))

    return sp.vstack(transform_splits)

CountVectorizer.transform_parallel = transform_parallel
c = CountVectorizer()
c.fit(["thie and that", "the other"])
c.transform_parallel(['this', 'other', 'thing', 'yeah'], n_jobs=4).toarray()

# array([[0, 0, 0, 0, 0],
#        [0, 1, 0, 0, 0],
#        [0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0]])

@sukiakiumo

I'm curious about this myself. Would love to see it happen! Any efforts towards doing so? Or is this issue stale?

@jnothman
Member

jnothman commented Jun 15, 2018 via email

@rth
Member

rth commented Jun 15, 2018

Also, IMO this discussion should not be about whether it's possible to parallelize transformers, but rather about whether it's possible to make transformers faster by using parallelization.

For instance, if we take the CountVectorizer example from #7635 (comment)
and run it on the 20 newsgroups dataset using a 4-core CPU, we get:

  • fit time -> 3.39 s
  • transform time -> 3.1 s
  • transform_parallel(.. , n_jobs=4) time -> 11.7 s

so in this case we can parallelize, but we definitely don't want to. At least not in this way; there was some prior discussion about parallelizing this estimator in #1401.

In this example there is no speedup with the threading backend, probably because CountVectorizer doesn't release the GIL and we get some chunking/concatenation overhead.

Generally, dask-ml is better suited to such parallelization tasks. For instance, I think dask_ml.wrappers.ParallelPostFit solves this issue. Closing, please comment if you disagree.
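
For reference, a sketch of that dask-ml approach (assuming dask and dask-ml are installed; ParallelPostFit fits on an in-memory array as usual, then applies transform block-wise over a dask array):

import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100000, 20)
est = ParallelPostFit(estimator=StandardScaler())
est.fit(X)                           # fit runs once, in memory

X_da = da.from_array(X, chunks=(10000, 20))
Xt = est.transform(X_da)             # lazy: one task per 10000-row block
result = Xt.compute()                # blocks execute in parallel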
