
Parallelize predict in classifiers #7448


Closed
lesshaste opened this issue Sep 17, 2016 · 31 comments

@lesshaste

lesshaste commented Sep 17, 2016

Description

Sometimes one trains a classifier on a sample and then has to run it on a massive dataset, which can be very slow. It would be great if predict and predict_proba had an n_jobs parameter so this could be done in parallel on multi-core machines, or alternatively if there were a standard copy-and-pasteable solution.

My guess is that one challenge to doing this efficiently is avoiding copying large sections of the data matrix, but other than that it should be embarrassingly parallel.

http://stackoverflow.com/questions/31449291/how-to-parallelise-predict-method-of-a-scikit-learn-svm-svc-classifier suggests an (untested) workaround.

For random forests, I am aware of https://github.com/ajtulloch/sklearn-compiledtrees, which also speeds up prediction, although it does not yet work for classification.


@jnothman
Member

You're talking about standalone prediction, rather than within CV or a pipeline or similar? Do you just want a util that splits the data into chunks, runs predict through joblib.Parallel, then stacks the results?

@lesshaste
Author

lesshaste commented Sep 17, 2016

Yes, standalone prediction is what I meant, although your more sophisticated variants will be of interest too, I am sure. Actually, there are a few things I am not sure about.

It seems that ensemble classifiers such as random forests lend themselves naturally to internal parallelisation when predicting, without having to split the data explicitly. Is it correct that this isn't in the current implementation?

For classifiers that don't lend themselves to internal parallelisation, the steps you describe sound just right. My hope is that you don't actually have to do any copying and that the splitting can be done with pointers (or views).

Potentially this would also make sense out of core for very large data sets. So the steps would be load in a large chunk, split, predict in parallel, stack the results and repeat.

@jnothman
Member

It'd be simple to put something together as a gist. I'm not sure where it fits in scikit-learn, or whether it's the recommended approach (as opposed to, say, sharding data and computation across machines; or having an input thread, multiple prediction threads, and an output thread, which doesn't work nicely with joblib.Parallel).


@lesshaste
Author

lesshaste commented Sep 18, 2016

That's great. I am completely neutral about what the underlying technical solution would be. The overall story is simply that scikit-learn already supports n_jobs in training (at least for some classifiers/regressors), and it would be great if it did the same for prediction, especially as it might not be too technically hard to do.

Computation across machines seems a distinct issue, however. We should use up the cores on a single machine before we go down that route, I would have thought.

@raghavrv
Member

@jnothman for classifiers like kNN, won't that mean serializing and reallocating the huge model n_jobs times? I am not sure how we can memmap sklearn models...

@jnothman
Member

As long as the data is stored in numpy arrays of sufficient size, joblib should automatically do the memmapping. I think. :)
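
For reference, a minimal sketch of that behaviour (made-up data, not from this thread): joblib.Parallel automatically memmaps large input arrays for process-based workers, with the size threshold controlled by its max_nbytes parameter.

import numpy as np
from joblib import Parallel, delayed

X = np.random.rand(50000, 100)  # ~40 MB, well above the default 1 MB threshold

def block_sum(arr, start, stop):
    return arr[start:stop].sum()

# Arrays larger than max_nbytes are dumped to disk and handed to the worker
# processes as read-only memmaps instead of being pickled and copied.
sums = Parallel(n_jobs=2, max_nbytes="1M")(
    delayed(block_sum)(X, i, i + 10000) for i in range(0, len(X), 10000)
)
print(sum(sums))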

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 19, 2016 via email

@jnothman
Member

Oh, I didn't realise that! I assumed that the persistence magic was part of parallel :)

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 19, 2016 via email

@rth
Member

rth commented Sep 19, 2016

Also, a large number of predict/transform methods just compute some form of matrix-vector or matrix-matrix multiplication (e.g. most things in linear_model, I think, and in PCA, LSA, etc.). For dense arrays, if users have scipy with any reasonable BLAS (GotoBLAS, MKL), this results in multithreaded computation by default. So although the predict/transform methods don't have an n_jobs argument, a lot of them (if not most) effectively run parallel computations already. Though not with sparse arrays, I think.
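
As an aside (not part of the original comment): the number of BLAS threads can usually be capped with environment variables, set before numpy/scipy load the BLAS library. A minimal, backend-dependent sketch:

import os

# must be set before numpy is imported, so the BLAS picks them up at load time
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL

import numpy as np

X = np.random.rand(10000, 200)
W = np.random.rand(200, 50)
Y = X @ W  # now runs single-threaded regardless of the BLAS backend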

As to the remaining ones, as was said above, I would tend to agree that it's really problem-dependent, and it is hard to say in advance whether chunking is the best approach (i.e. whether the performance gains would outweigh the memory-copy cost, etc.).

Maybe it would be better to just add a section about chunking in the scaling strategies documentation, and possibly make a helper function that would wrap joblib.Parallel and efficiently split/concatenate the results (in both the dense and sparse array cases)?
This is not restricted to the predict method; for instance, HashingVectorizer can be parallelized through chunking in the same way.

@lesshaste
Author

I feel slightly bad commenting as I am not writing any of this code, but here are some thoughts in no particular order:

  1. Taking RandomForestClassifier as an example of an ensemble classifier, I think it should be reasonably straightforward to parallelize predict without using joblib at all. That is, the classifier itself is inherently parallel: it is just a collection of distinct decision trees, after all.

  2. In relation to classifiers that use scipy, which itself might be parallelized, that is very interesting, although ideally the user would always have some control over how many cores are used on the machine. This may be an issue for upstream, of course. Also, large sparse matrices are a very important use case in my experience.

  3. If we have to make a copy in memory of the relevant part of the training matrix to use joblib, that does seem less than ideal. It would be interesting to test empirically (that is, with timing) how much time is saved by running a classifier in parallel in this situation. There is also the risk of simply running out of RAM!

  4. It would be great, as a first step, to add a section about chunking with a helper function that is designed to be efficient, as @rth suggests.

@raghavrv
Member

raghavrv commented Sep 19, 2016

Taking RandomForestClassifier as an example of an ensemble classifier, I think it should be reasonably straightforward to parallelize predict without using joblib at all. That is the classifier itself is inherently parallel. It is just a collection of distinct decision trees after all.

Yes, it should be straightforward for RandomForestClassifier, I think. The apply calls can be done in parallel... Also, we need not use multiprocessing, as the GIL is released...

It's true that it depends on the classifier at hand, but more importantly, the speed benefits should be benchmarked to see if they are worth the overhead we incur because of parallelism...
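
As a rough illustration of that idea (not the actual forest implementation): the per-tree predict_proba calls can be fanned out over threads, which works because the tree prediction code releases the GIL and the threads share X without copies.

import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# dispatch the per-tree calls over 4 threads; X is shared, not copied
all_proba = Parallel(n_jobs=4, backend="threading")(
    delayed(tree.predict_proba)(X) for tree in clf.estimators_
)
proba = np.mean(all_proba, axis=0)               # average per-tree probabilities
y_pred = clf.classes_[np.argmax(proba, axis=1)]  # map back to class labels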

@lesshaste
Author

lesshaste commented Sep 19, 2016

I completely agree.

I realise this is a little tangential, but it would be awesome if https://github.com/ajtulloch/sklearn-compiledtrees were finished and then included either in scikit-learn proper or as something for http://scikit-learn.org/stable/related_projects.html. I notice there are quite frequently volunteers looking for projects... :)

@jnothman
Member

  1. If we have to make a copy in memory of the relevant part of the training matrix to use joblib that does seem less than ideal. It would be interesting to test empirically (that is with timing) how much time is saved by running a classifier in parallel in this situation. Also there is the risk of simply running out of RAM!

Parallel processing benchmarks tend to be quite system-dependent.

Taking RandomForestClassifier as an example of an ensemble classifier, I think it should be reasonably straightforward to parallelize predict without using joblib at all.

RandomForest*.predict is currently parallelised using joblib.

A basic recipe, disregarding memory consumption issues, is something like:

import numpy as np
import scipy.sparse as sp
from sklearn.externals.joblib.parallel import cpu_count, Parallel, delayed

def _predict(estimator, X, method, start, stop):
    return getattr(estimator, method)(X[start:stop])

def parallel_predict(estimator, X, n_jobs=1, method='predict', batches_per_job=3):
    if n_jobs < 0:
        # interpret negative n_jobs the way joblib does (e.g. -1 means all cores)
        n_jobs = max(cpu_count() + 1 + n_jobs, 1)  # XXX: this should really be done by joblib
    n_batches = batches_per_job * n_jobs
    n_samples = len(X)
    batch_size = int(np.ceil(n_samples / n_batches))
    parallel = Parallel(n_jobs=n_jobs)
    results = parallel(delayed(_predict)(estimator, X, method, i, i + batch_size)
                       for i in range(0, n_samples, batch_size))
    if sp.issparse(results[0]):
        return sp.vstack(results)
    return np.concatenate(results)
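
A hypothetical usage of the recipe above (made-up data and estimator):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
clf = SVC(probability=True).fit(X[:2000], y[:2000])  # train on a small sample

y_pred = parallel_predict(clf, X, n_jobs=4)                            # class labels
y_proba = parallel_predict(clf, X, n_jobs=4, method='predict_proba')   # probabilities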

@jmschrei
Member

I've written parallelization wrappers for predict/predict_proba/others in pomegranate using joblib here https://github.com/jmschrei/pomegranate/blob/master/pomegranate/parallel.pyx . Their functionality could probably be expanded in the ways that @jnothman suggested.

@jnothman
Member

Do others think this belongs in scikit-learn? In what form? If not, shall we close this issue?


@sergei3000

+1 for parallelizing predictions.
In one of my settings, prediction (it's predict_proba of a random forest classifier, to be specific) is the bottleneck, making my script take hours to complete its calculations. It would be great to have everything sped up by using all the available cores.

@rth
Member

rth commented Jul 28, 2017

+1 for parallelizing predictions.
In one of my settings, prediction (it's predict_proba of a random forest classifier, to be specific) is the bottleneck,

@sergei3000 RandomForestClassifier.predict_proba is already parallelized (controllable via the n_jobs init parameter)...

Do others think this belongs in scikit-learn? In what form? If not, shall we close this issue?

Well, most of the work can be done with joblib and sklearn.utils.gen_batches (or np.array_split),

from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import gen_batches

# assumes a fitted `estimator` and a data matrix `X` are already defined
n_jobs = 4
n_samples, n_features = X.shape
batch_size = n_samples // n_jobs

def _predict(method, X, sl):
    # apply the bound method (e.g. estimator.predict) to one slice of X
    return method(X[sl])

results = Parallel(n_jobs)(delayed(_predict)(estimator.predict, X, sl)
                           for sl in gen_batches(n_samples, batch_size))
# stack `results` with np.concatenate (dense) or scipy.sparse.vstack (sparse)

I am not sure if adding parallelization wrappers to sklearn.utils (or some other location) would be that useful compared to directly using joblib. That way, people could tune the chunk size, parallel backend, etc. without going through another layer of abstraction. In any case, maybe it would be worth adding a section to the documentation here about parallelizing as a way of scaling predictions?

Also, in the case where X is very large (and possibly on disk), I imagine out-of-core processing with dask might be more suitable,

import dask.array as da

# chunk X row-wise so that each block can be predicted independently
X_da = da.from_array(X, chunks=(batch_size, n_features))

# predict maps (n, n_features) -> (n,), hence drop_axis=1
X_da.map_blocks(estimator.predict, dtype=int, drop_axis=1).compute()

@jmschrei
Member

jmschrei commented Jul 28, 2017 via email

@sergei3000

sergei3000 commented Jul 28, 2017

RandomForestClassifier.predict_proba is already parallelized (controllable via the n_jobs init parameter)...

@rth do you mean I can do clf.predict_proba(X, n_jobs=4)? Or is it something else? There is no such parameter on the method's page:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba

@rth
Member

rth commented Jul 28, 2017

@sergei3000 No, you can do estimator = RandomForestClassifier(n_jobs=4) and it should parallelize estimator.predict_proba from what I understood.
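
A minimal illustration of that (made-up data; the n_jobs value set in the constructor is used for prediction as well as fitting):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

# n_jobs is passed to the constructor, not to predict_proba
clf = RandomForestClassifier(n_estimators=500, n_jobs=4, random_state=0)
clf.fit(X, y)
proba = clf.predict_proba(X)  # the per-tree work is spread over 4 jobs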

@sergei3000

@rth I'll check it once again, but I think I observed that the CPU only worked at a third of its full capacity while working on predict_proba, as opposed to being fully loaded while building the model with fit.

@rth
Member

rth commented Jul 29, 2017

I'll check it once again, but I think I observed that the CPU only worked at a third of its full capacity while working on predict_proba, as opposed to being fully loaded while building the model with fit.

@sergei3000 That's how things sometimes look with the threading parallel backend (RandomForestClassifier uses it). With the multiprocessing parallel backend, the CPU will be fully loaded, but that doesn't necessarily mean it will be faster; it will also spend CPU cycles just copying data around... In your use case, if performance is an issue, it can certainly be worth trying to chunk the X array and run predict in parallel (cf. the examples above) to see if it improves predict time...

@jnothman
Member

jnothman commented Jul 29, 2017 via email

@sergei3000

Many thanks to everyone for your inputs, they've been extremely helpful for my understanding.

@TomAugspurger
Contributor

For those interested and able to accept dask as a dependency, I've implemented @rth's suggestion here in dask-ml.

https://dask-ml.readthedocs.io/en/latest/auto_examples/plot_parallel_postfit.html#sphx-glr-auto-examples-plot-parallel-postfit-py has an example of a meta-estimator that gives parallel (out of core, potentially distributed across a cluster) predict / transform. The usual caveats of parallel overhead apply, so if the scikit-learn estimator already predicts in parallel and your data fits in memory, you may not want to use the meta-estimator.
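
A minimal usage sketch (made-up data; see the dask-ml documentation for details):

import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# wrap any scikit-learn estimator; fit runs as usual on the (small) training set
clf = ParallelPostFit(GradientBoostingClassifier())
clf.fit(X[:500], y[:500])

# prediction on a large (here: tiled) dask array runs block-wise in parallel
X_big = da.from_array(np.tile(X, (100, 1)), chunks=(10000, 20))
y_pred = clf.predict(X_big).compute()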

@owlas

owlas commented Mar 14, 2018

My use case is slightly different: I'd like to disable parallelisation for prediction but keep it for training (the prediction already happens in a parallel environment).

However, the solution would be the same: implement n_jobs on .predict and .fit rather than having it as a parameter of the classifier. Is there an argument against doing this?

Relevant code for parallel predict: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/forest.py#L587

@TomAugspurger
Contributor

TomAugspurger commented Mar 14, 2018 via email

@owlas

owlas commented Mar 15, 2018

Great idea; that is exactly what I'm doing, and it works fine. I suppose I'm just interested in why n_jobs is a parameter of the classifier. If it were given to the fit and predict functions, I feel it would be much clearer.

@lesteve
Member

lesteve commented Mar 16, 2018

http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects

@rth
Member

rth commented Jul 8, 2018

Closing, as dask-ml is probably better suited to handle such tasks and the corresponding functionality was implemented in dask_ml.wrappers.ParallelPostFit, as mentioned above.

Meanwhile, some estimators already support parallel predict (e.g. RandomForestClassifier).
