Parallelize transformers? #7635

Closed
themrmax opened this issue Oct 10, 2016 · 9 comments

Comments

@themrmax
Contributor

themrmax commented Oct 10, 2016

As raised in http://stackoverflow.com/questions/39948138/sklearn-featurehasher-parallelized/39951415, many (all?) transformers could be made parallel. Would this make sense as a ParallelTransformerMixin? Something like:

from sklearn.externals.joblib import Parallel, delayed
import numpy as np
import scipy.sparse as sp

class ParallelTransformerMixin:
    def transform_parallel(self, X, n_jobs):
        transform_splits = Parallel(n_jobs=n_jobs, backend="threading")(
            delayed(self.transform)(X_split)
            for X_split in np.array_split(X, n_jobs))

        return sp.vstack(transform_splits)
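
For illustration, a hypothetical usage sketch (ParallelFeatureHasher is a made-up name; FeatureHasher is stateless, so no fit call is needed before transforming):

from sklearn.feature_extraction import FeatureHasher

class ParallelFeatureHasher(ParallelTransformerMixin, FeatureHasher):
    pass

h = ParallelFeatureHasher(n_features=32)
# hash 1000 samples across 4 threads
X = h.transform_parallel([{'foo': 1, 'bar': 2}] * 1000, n_jobs=4)
X.shape  # (1000, 32)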
@jnothman
Member

It would need to override __init__ and get_params, but maybe...
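
To make that concrete, a minimal sketch of the constraint (hypothetical class, assuming n_jobs becomes an estimator parameter rather than a transform argument): get_params introspects the __init__ signature, so every class using the mixin would have to declare and store n_jobs itself:

from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_jobs=1):  # n_jobs must be an explicit __init__ parameter ...
        self.n_jobs = n_jobs       # ... stored under the same name for get_params

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

MyTransformer().get_params()  # {'n_jobs': 1} -- visible to clone() and Pipeline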

@themrmax
Contributor Author

themrmax commented Oct 11, 2016

@jnothman hmm, get_params looks pretty gnarly... are you sure we can't just leave n_jobs as an argument to transform?

@jnothman
Member

Ideally not...

@amueller
Member

This is very closely related to #7448 which proposed the same for predict.

I'm not sure how this would work as a mixin. Would all transformers get it? Then it should be in TransformerMixin, right?

This wouldn't work in pipelines, unless we add this to pipelines...

Why not do it as a meta-estimator? That way we don't have to mess with Pipeline.
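
A rough sketch of that meta-estimator idea (hypothetical class, not an existing scikit-learn API; it delegates fit and parallelizes transform over row-wise splits, and BaseEstimator provides get_params/set_params for free):

import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.externals.joblib import Parallel, delayed

class ParallelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, estimator, n_jobs=1):
        self.estimator = estimator
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        splits = Parallel(n_jobs=self.n_jobs, backend="threading")(
            delayed(self.estimator.transform)(X_split)
            for X_split in np.array_split(X, self.n_jobs))
        # stack sparse or dense chunks back together
        return sp.vstack(splits) if sp.issparse(splits[0]) else np.vstack(splits)

Because it is an ordinary estimator, it can be dropped into a Pipeline without changing Pipeline itself.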

@mblondel
Member

> many (all?) transformers could be made parallel

Only transformers where fit doesn't do anything besides input checking (i.e., stateless transformers).

@themrmax
Contributor Author

themrmax commented Oct 18, 2016

@mblondel I think it should work with stateful transformers too:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.externals.joblib import Parallel, delayed
import numpy as np
import scipy.sparse as sp

def transform_parallel(self, X, n_jobs):
    transform_splits = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(self.transform)(X_split)
        for X_split in np.array_split(X, n_jobs))

    return sp.vstack(transform_splits)

CountVectorizer.transform_parallel = transform_parallel
c = CountVectorizer()
c.fit(["thie and that", "the other"])
c.transform_parallel(['this', 'other', 'thing', 'yeah'], n_jobs=4).toarray()

# array([[0, 0, 0, 0, 0],
#        [0, 1, 0, 0, 0],
#        [0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0]])

@sukiakiumo

I'm curious about this myself. Would love to see it happen! Any efforts towards doing so? Or is this issue stale?

@jnothman
Member

jnothman commented Jun 15, 2018 via email

@rth
Member

rth commented Jun 15, 2018

Also, IMO this discussion should not be about whether it's possible to parallelize transformers, but rather about whether it's possible to make transformers faster by using parallelization.

For instance, if we take the CountVectorizer example from #7635 (comment)
and run it on the 20 newsgroups dataset using a 4-core CPU, we get:

  • fit time -> 3.39 s
  • transform time -> 3.1 s
  • transform_parallel(.. , n_jobs=4) time -> 11.7 s

so in this case we can parallelize, but we definitely don't want to. At least not in this way; there was some prior discussion about parallelizing this estimator in #1401.

In this example there is no speedup with the threading backend, probably because CountVectorizer doesn't release the GIL and we get some chunking/concatenation overhead.

Generally, dask-ml is better suited to such parallelization tasks. For instance, I think dask_ml.wrappers.ParallelPostFit solves this issue. Closing, please comment if you disagree.
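
For reference, a sketch of that dask-ml approach (assuming dask and dask-ml are installed; ParallelPostFit fits on an in-memory array as usual, then applies transform block-wise over a dask array):

import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100000, 20)
est = ParallelPostFit(estimator=StandardScaler())
est.fit(X)                           # fit runs once, in memory

X_da = da.from_array(X, chunks=(10000, 20))
Xt = est.transform(X_da)             # lazy: one task per 10000-row block
result = Xt.compute()                # blocks execute in parallel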
