
Parallelize predict in classifiers #7448


Closed
lesshaste opened this issue Sep 17, 2016 · 31 comments

@lesshaste

lesshaste commented Sep 17, 2016

Description

Sometimes one trains a classifier on a sample and then has to run it on a massive dataset, which can be very slow. It would be great if predict and predict_proba had an n_jobs parameter so this could be done in parallel on multi-core machines, or alternatively if there were a standard copy-and-pasteable solution.

My guess is that one challenge to doing this efficiently is avoiding copying large sections of the data matrix, but other than that it should be embarrassingly parallel.

http://stackoverflow.com/questions/31449291/how-to-parallelise-predict-method-of-a-scikit-learn-svm-svc-classifier suggests an (untested) workaround.

For random forests, I am aware of https://github.com/ajtulloch/sklearn-compiledtrees, which also speeds up prediction, although it does not yet work for classification.


@jnothman
Member

You're talking about standalone prediction, rather than within CV or a pipeline or similar? Do you just want a util that splits the data into chunks, runs predict through joblib.Parallel, then stacks the results?

@lesshaste
Author

lesshaste commented Sep 17, 2016

Yes, standalone prediction is what I meant, although your more sophisticated variants will be of interest too, I am sure. Actually, there are a few things I am not sure about.

It seems that ensemble classifiers such as random forests lend themselves naturally to internal parallelisation when predicting, without having to split the data explicitly. Is it correct that this isn't in the current implementation?

For classifiers that don't lend themselves to internal parallelisation, the steps you describe sound just right. My hope is that you don't actually have to do any copying and that the splitting can be done with pointers (or views).

Potentially this would also make sense out of core for very large data sets. So the steps would be load in a large chunk, split, predict in parallel, stack the results and repeat.

@jnothman
Member

It'd be simple to put something together as a gist. I'm not sure where it fits in scikit-learn, or whether it's the recommended approach (as opposed to, say, sharding data and computation across machines; or having an input thread, multiple prediction threads, and an output thread, which doesn't work nicely with joblib.Parallel).


@lesshaste
Author

lesshaste commented Sep 18, 2016

That's great. I am completely neutral about what the underlying technical solution would be. The overall story is simply that scikit-learn already supports n_jobs in training (at least for some classifiers/regressors), and it would be great if it did the same for prediction, especially as it might not be too technically hard to do.

Computation across machines seems a distinct issue, however. We should use up the cores on a single machine before we go down that route, I would have thought.

@raghavrv
Member

@jnothman for classifiers like kNN, won't that mean serializing and reallocating the huge model n_jobs times? I am not sure how we can memmap sklearn models...

@jnothman
Member

As long as the data is stored in numpy arrays of sufficient size, joblib should automatically do the memmapping. I think. :)
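
For reference, a minimal sketch of that behaviour (made-up data, not from this thread): joblib.Parallel automatically memmaps large input arrays for process-based workers, with the size threshold controlled by its max_nbytes parameter.

import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(50000, 100)  # ~40 MB, well above the default 1 MB threshold

def block_sum(arr, start, stop):
    return arr[start:stop].sum()

# Arrays larger than max_nbytes are dumped to disk and handed to the worker
# processes as read-only memmaps instead of being pickled and copied.
sums = Parallel(n_jobs=2, max_nbytes="1M")(
    delayed(block_sum)(X, i, i + 10000) for i in range(0, len(X), 10000)
)
print(sum(sums))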

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 19, 2016 via email

@jnothman
Member

Oh, I didn't realise that! I assumed that the persistence magic was part of parallel :)

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 19, 2016 via email

@rth
Member

rth commented Sep 19, 2016

Also, a large number of predict/transform methods just compute some form of matrix-vector or matrix-matrix multiplication (e.g. most things in linear_model, I think, and in PCA, LSA, etc.). For dense arrays, if users have scipy with any reasonable BLAS (GotoBLAS, MKL), this results in multithreaded computation by default. So although the predict/transform methods don't have an n_jobs argument, a lot of them (if not most) effectively run parallel computations already. Though not with sparse arrays, I think.
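
As an aside (not part of the original comment): the number of BLAS threads can usually be capped with environment variables, set before numpy/scipy load the BLAS library. A minimal, backend-dependent sketch:

import os

# must be set before numpy is imported, so the BLAS picks them up at load time
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL

import numpy as np

X = np.random.rand(10000, 200)
W = np.random.rand(200, 50)
Y = X @ W  # now runs single-threaded regardless of the BLAS backend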

As to the remaining ones, as was said above, I would tend to agree that it's really problem-dependent, and it is hard to say in advance whether chunking is the best approach (i.e. whether the performance gains would outweigh the memory-copy cost, etc.).

Maybe it would be better to just add a section about chunking in the scaling strategies documentation, and possibly make a helper function that would wrap joblib.Parallel and efficiently split/concatenate the results (in both the dense and sparse array cases)?
This is not restricted to the predict method; for instance, HashingVectorizer can be parallelized through chunking in the same way.

@lesshaste
Author

I feel slightly bad commenting as I am not writing any of this code, but here are some thoughts in no particular order:

  1. Taking RandomForestClassifier as an example of an ensemble classifier, I think it should be reasonably straightforward to parallelize predict without using joblib at all. That is, the classifier itself is inherently parallel: it is just a collection of distinct decision trees, after all.

  2. In relation to classifiers that use scipy, which itself might be parallelized, that is very interesting, although ideally the user would always have some control over how many cores are used on the machine. This may be an issue for upstream, of course. Also, large sparse matrices are a very important use case in my experience.

  3. If we have to make a copy in memory of the relevant part of the training matrix to use joblib, that does seem less than ideal. It would be interesting to test empirically (that is, with timing) how much time is saved by running a classifier in parallel in this situation. There is also the risk of simply running out of RAM!

  4. It would be great, as a first step, to add a section about chunking with a helper function that is designed to be efficient, as @rth suggests.

@raghavrv
Member

raghavrv commented Sep 19, 2016

Taking RandomForestClassifier as an example of an ensemble classifier, I think it should be reasonably straightforward to parallelize predict without using joblib at all. That is the classifier itself is inherently parallel. It is just a collection of distinct decision trees after all.

Yes, it should be straightforward for RandomForestClassifier, I think. The apply calls can be done in parallel... Also, we need not use multiprocessing, as the GIL is released...

It's true that it depends on the classifier at hand, but more importantly, the speed benefits should be benchmarked to see if they are worth the overhead we incur because of parallelism...
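
As a rough illustration of that idea (not the actual forest implementation): the per-tree predict_proba calls can be fanned out over threads, which works because the tree prediction code releases the GIL and the threads share X without copies.

import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# dispatch the per-tree calls over 4 threads; X is shared, not copied
all_proba = Parallel(n_jobs=4, backend="threading")(
    delayed(tree.predict_proba)(X) for tree in clf.estimators_
)
proba = np.mean(all_proba, axis=0)               # average per-tree probabilities
y_pred = clf.classes_[np.argmax(proba, axis=1)]  # map back to class labels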

@lesshaste
Author

lesshaste commented Sep 19, 2016

I completely agree.

I realise this is a little tangential, but it would be awesome if https://github.com/ajtulloch/sklearn-compiledtrees were finished and then included either in scikit-learn proper or as something for http://scikit-learn.org/stable/related_projects.html. I notice there are quite frequently volunteers looking for projects... :)

@jnothman
Member

  1. If we have to make a copy in memory of the relevant part of the training matrix to use joblib that does seem less than ideal. It would be interesting to test empirically (that is with timing) how much time is saved by running a classifier in parallel in this situation. Also there is the risk of simply running out of RAM!

Parallel processing benchmarks tend to be quite system-dependent.

Taking RandomForestClassifier as an example of an ensemble classifier, I think it should be reasonably straightforward to parallelize predict without using joblib at all.

RandomForest*.predict is currently parallelised using joblib.

A basic recipe, disregarding memory consumption issues, is something like:

import numpy as np
import scipy.sparse as sp
from sklearn.externals.joblib.parallel import cpu_count, Parallel, delayed

def _predict(estimator, X, method, start, stop):
    return getattr(estimator, method)(X[start:stop])

def parallel_predict(estimator, X, n_jobs=1, method='predict', batches_per_job=3):
    if n_jobs < 0:
        # interpret negative n_jobs the way joblib does (e.g. -1 means all cores)
        n_jobs = max(cpu_count() + 1 + n_jobs, 1)  # XXX: this should really be done by joblib
    n_batches = batches_per_job * n_jobs
    n_samples = len(X)
    batch_size = int(np.ceil(n_samples / n_batches))
    parallel = Parallel(n_jobs=n_jobs)
    results = parallel(delayed(_predict)(estimator, X, method, i, i + batch_size)
                       for i in range(0, n_samples, batch_size))
    if sp.issparse(results[0]):
        return sp.vstack(results)
    return np.concatenate(results)
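
A hypothetical usage of the recipe above (made-up data and estimator):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
clf = SVC(probability=True).fit(X[:2000], y[:2000])  # train on a small sample

y_pred = parallel_predict(clf, X, n_jobs=4)                            # class labels
y_proba = parallel_predict(clf, X, n_jobs=4, method='predict_proba')   # probabilities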

@jmschrei
Member

I've written parallelization wrappers for predict/predict_proba/others in pomegranate using joblib here https://github.com/jmschrei/pomegranate/blob/master/pomegranate/parallel.pyx . Their functionality could probably be expanded in the ways that @jnothman suggested.

@jnothman
Member

Do others think this belongs in scikit-learn? In what form? If not, shall we close this issue?


@sergei3000

+1 for parallelizing predictions.
In one of my settings, prediction (it's predict_proba of a random forest classifier, to be specific) is the bottleneck, making my script take hours to complete its calculations. It would be great to have everything sped up by using all the available cores.

@rth
Member

rth commented Jul 28, 2017

+1 for parallelizing predictions.
In one of my settings, prediction (it's predict_proba of a random forest classifier, to be specific) is the bottleneck,

@sergei3000 RandomForestClassifier.predict_proba is already parallelized (controllable via the n_jobs init parameter)...

Do others think this belongs in scikit-learn? In what form? If not, shall we close this issue?

Well, most of the work can be done with joblib and sklearn.utils.gen_batches (or np.array_split),

from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import gen_batches

# assumes a fitted `estimator` and a data matrix `X` are already defined
n_jobs = 4
n_samples, n_features = X.shape
batch_size = n_samples // n_jobs

def _predict(method, X, sl):
    # apply the bound method (e.g. estimator.predict) to one slice of X
    return method(X[sl])

results = Parallel(n_jobs)(delayed(_predict)(estimator.predict, X, sl)
                           for sl in gen_batches(n_samples, batch_size))
# stack `results` with np.concatenate (dense) or scipy.sparse.vstack (sparse)

I am not sure if adding parallelization wrappers to sklearn.utils (or some other location) would be that useful compared to directly using joblib. That way, people could tune the chunk size, parallel backend, etc. without going through another layer of abstraction. In any case, maybe it would be worth adding a section to the documentation here about parallelizing as a way of scaling predictions?

Also, in the case where X is very large (and possibly on disk), I imagine out-of-core processing with dask might be more suitable,

import dask.array as da

# chunk X row-wise so that each block can be predicted independently
X_da = da.from_array(X, chunks=(batch_size, n_features))

# predict maps (n, n_features) -> (n,), hence drop_axis=1
X_da.map_blocks(estimator.predict, dtype=int, drop_axis=1).compute()

@jmschrei
Member

jmschrei commented Jul 28, 2017 via email

@sergei3000

sergei3000 commented Jul 28, 2017

RandomForestClassifier.predict_proba is already parallelized (controllable via the n_jobs init parameter)...

@rth do you mean I can do clf.predict_proba(X, n_jobs=4)? Or is it something else? There is no such parameter on the method's page:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba

@rth
Member

rth commented Jul 28, 2017

@sergei3000 No, you can do estimator = RandomForestClassifier(n_jobs=4) and it should parallelize estimator.predict_proba from what I understood.
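
A minimal illustration of that (made-up data; the n_jobs value set in the constructor is used for prediction as well as fitting):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

# n_jobs is passed to the constructor, not to predict_proba
clf = RandomForestClassifier(n_estimators=500, n_jobs=4, random_state=0)
clf.fit(X, y)
proba = clf.predict_proba(X)  # the per-tree work is spread over 4 jobs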

@sergei3000

@rth I'll check it once again, but I think I observed that the CPU only worked at a third of its full capacity while working on predict_proba, as opposed to being fully loaded while building the model with fit.

@rth
Member

rth commented Jul 29, 2017

I'll check it once again, but I think I observed that the CPU only worked at a third of its full capacity while working on predict_proba, as opposed to being fully loaded while building the model with fit.

@sergei3000 That's how things sometimes look with the threading parallel backend (RandomForestClassifier uses it). With the multiprocessing parallel backend, the CPU will be fully loaded, but that doesn't necessarily mean it will be faster; it will also spend CPU cycles just copying data around... In your use case, if performance is an issue, it can certainly be worth trying to chunk the X array and run predict in parallel (cf. the examples above) to see if it improves predict time...

@jnothman
Member

jnothman commented Jul 29, 2017 via email

@sergei3000

Many thanks to everyone for your inputs, they've been extremely helpful for my understanding.

@TomAugspurger
Contributor

For those interested and able to accept dask as a dependency, I've implemented @rth's suggestion here in dask-ml.

https://dask-ml.readthedocs.io/en/latest/auto_examples/plot_parallel_postfit.html#sphx-glr-auto-examples-plot-parallel-postfit-py has an example of a meta-estimator that gives parallel (out of core, potentially distributed across a cluster) predict / transform. The usual caveats of parallel overhead apply, so if the scikit-learn estimator already predicts in parallel and your data fits in memory, you may not want to use the meta-estimator.
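
A minimal usage sketch (made-up data; see the dask-ml documentation for details):

import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# wrap any scikit-learn estimator; fit runs as usual on the (small) training set
clf = ParallelPostFit(GradientBoostingClassifier())
clf.fit(X[:500], y[:500])

# prediction on a large (here: tiled) dask array runs block-wise in parallel
X_big = da.from_array(np.tile(X, (100, 1)), chunks=(10000, 20))
y_pred = clf.predict(X_big).compute()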

@owlas

owlas commented Mar 14, 2018

My use case is slightly different: I'd like to disable parallelisation for prediction but keep it for training (the prediction already happens in a parallel environment).

However, the solution would be the same: implement n_jobs on .predict and .fit rather than having it as a parameter of the classifier. Is there an argument against doing this?

Relevant code for parallel predict: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/forest.py#L587

@TomAugspurger
Contributor

TomAugspurger commented Mar 14, 2018 via email

@owlas

owlas commented Mar 15, 2018

Great idea; that is exactly what I'm doing, and it works fine. I suppose I'm just interested in why n_jobs is a parameter of the classifier. If it were given to the fit and predict functions, I feel it would be much clearer.

@lesteve
Member

lesteve commented Mar 16, 2018

http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects

@rth
Member

rth commented Jul 8, 2018

Closing, as dask-ml is probably better suited to handle such tasks and the corresponding functionality was implemented in dask_ml.wrappers.ParallelPostFit, as mentioned above.

Meanwhile, some estimators already support parallel predict (e.g. RandomForestClassifier).
