Parallelize predict in classifiers #7448
You're talking about standalone prediction, rather than within CV or a pipeline or similar? Do you just want a util that splits data into chunks, runs predict through joblib.parallel, then stacks the results?
Yes, standalone prediction is what I meant, although your more sophisticated variants will be of interest too, I am sure. Actually, there are a few things I am not sure about. It seems that ensemble classifiers such as random forests lend themselves naturally to internal parallelisation when predicting, without having to split the data explicitly. Is it correct that this isn't in the current implementation? For classifiers that don't lend themselves to internal parallelisation, the steps you describe sound just right. My hope is that you don't actually have to do any copying and the splitting can be done by pointers (or views). Potentially this would also make sense out of core for very large data sets: load in a large chunk, split it, predict in parallel, stack the results, and repeat.
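On the no-copy point: basic slicing of a NumPy array returns a view rather than a copy, so chunking by row ranges should not duplicate the data itself. A minimal sketch (assuming X is a dense ndarray; sparse matrices and fancy indexing behave differently):

import numpy as np

X = np.random.rand(100000, 20)   # stand-in for a large feature matrix
chunk = X[0:25000]               # basic slicing returns a view, not a copy
print(chunk.base is X)           # True: the chunk shares X's memory
print(chunk.flags['OWNDATA'])    # False: no data was duplicated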
It'd be simple to put something together as a gist. I'm not sure where it
That's great. I am completely neutral about what the underlying technical solution would be. The overall story is simply that scikit-learn already supports n_jobs in training (at least for some classifiers/regressors), and it would be great if it did the same for prediction, especially as it might not be too technically hard to do. Computation across machines seems a distinct issue, however; we should use up the cores on a single machine before we go down that route, I would have thought.
@jnothman for classifiers like kNN, won't that mean serializing and reallocating the huge model?
As long as the data is stored in numpy arrays of sufficient size, joblib should automatically do the memmapping. I think. :)
No, it doesn't do deep inspection of the arguments. That would work once we plug in the persistence with the parallel dispatch, which might happen one day, but it isn't there.
Oh, I didn't realise that! I assumed that the persistence magic was part of parallel :)

Only a shallow one.
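The memmapping being discussed can also be done by hand rather than relying on any automatic behaviour: dump the large array once and reopen it memory-mapped, so the worker processes read the same pages instead of each receiving a pickled copy. A rough sketch (plain joblib is used here instead of the sklearn.externals.joblib copy mentioned in this thread; the path, sizes, and LogisticRegression stand-in are illustrative only):

import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression

# toy data and a fitted estimator standing in for the real model
X = np.random.rand(100000, 20)
y = (X[:, 0] > 0.5).astype(int)
clf = LogisticRegression().fit(X[:1000], y[:1000])

# persist the big array once, then reopen it memory-mapped (read-only)
joblib.dump(X, '/tmp/X.joblib')
X_mm = joblib.load('/tmp/X.joblib', mmap_mode='r')

def _predict_chunk(estimator, X, start, stop):
    # run predict on one slice of the memory-mapped array
    return estimator.predict(X[start:stop])

# each worker sees the same memory-mapped pages instead of a full pickled copy of X
preds = joblib.Parallel(n_jobs=2)(
    joblib.delayed(_predict_chunk)(clf, X_mm, i, i + 25000)
    for i in range(0, X_mm.shape[0], 25000))
y_pred = np.concatenate(preds)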
Also, a large number of predict/transform methods just compute some form of matrix-vector or matrix-matrix multiplication (e.g. most of the things in …). As to the remaining ones, as was said above, I would tend to agree that it's really problem-dependent, and it is hard to say in advance whether chunking is the best approach (i.e. whether the performance gains would outweigh the memory copy cost, etc.). Maybe it would be better to just add a section about chunking in the scaling strategies docs and possibly make a helper function that would wrap …
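To make the matrix-multiplication point concrete: for a fitted linear classifier, essentially all of the prediction cost is one dense product that NumPy hands to BLAS, which may already be multi-threaded on its own. A small sketch (LogisticRegression is only used as an example of a linear model):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(5000, 50)
y = (X[:, 0] > 0.5).astype(int)
clf = LogisticRegression().fit(X, y)

# the expensive part of decision_function/predict is this single matrix product,
# dispatched to BLAS in one call
scores = X @ clf.coef_.T + clf.intercept_
np.testing.assert_allclose(scores.ravel(), clf.decision_function(X))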
I feel slightly bad commenting, as I am not writing any of this code, but here are some thoughts in no particular order:
Yes, it should be straightforward for …

True that it depends on the classifier at hand, but more importantly the speed benefits should be benchmarked, to see if they are worth the overhead we incur because of parallelism...
I completely agree. I realise this is a little tangential, but it would be awesome if https://github.com/ajtulloch/sklearn-compiledtrees were finished and then included either in scikit-learn proper or as something for http://scikit-learn.org/stable/related_projects.html. I notice volunteers quite frequently look for projects... :)
Parallel processing benchmarks tend to be quite system-dependent.
A basic recipe, disregarding memory consumption issues, is something like:

import numpy as np
import scipy.sparse as sp
from sklearn.externals.joblib.parallel import cpu_count, Parallel, delayed


def _predict(estimator, X, method, start, stop):
    # run the requested prediction method on one slice of X
    return getattr(estimator, method)(X[start:stop])


def parallel_predict(estimator, X, n_jobs=1, method='predict', batches_per_job=3):
    if n_jobs < 0:
        # same convention as elsewhere in scikit-learn: -1 means "all cores"
        n_jobs = max(cpu_count() + 1 + n_jobs, 1)  # XXX: this should really be done by joblib
    n_batches = batches_per_job * n_jobs
    n_samples = len(X)
    batch_size = int(np.ceil(n_samples / n_batches))
    parallel = Parallel(n_jobs=n_jobs)
    results = parallel(delayed(_predict)(estimator, X, method, i, i + batch_size)
                       for i in range(0, n_samples, batch_size))
    if sp.issparse(results[0]):
        return sp.vstack(results)
    return np.concatenate(results)
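For completeness, using the parallel_predict helper defined just above would look something like this (RandomForestClassifier and the array sizes are only illustrative; any fitted estimator exposing the requested method should work):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(1000, 10)
y_train = (X_train[:, 0] > 0.5).astype(int)
X_big = np.random.rand(200000, 10)   # the large set we want to predict on

clf = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)

# chunk X_big, run predict_proba on the chunks in parallel, stack the results
proba = parallel_predict(clf, X_big, n_jobs=-1, method='predict_proba')
print(proba.shape)   # (200000, 2)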
I've written parallelization wrappers for predict/predict_proba/others in pomegranate using joblib, here: https://github.com/jmschrei/pomegranate/blob/master/pomegranate/parallel.pyx. Their functionality could probably be expanded in the ways that @jnothman suggested.
Do others think this belongs in scikit-learn? In what form? If not, shall we close this issue?
+1 for parallelizing predictions. In one of my settings, prediction (it's predict_proba of a random forest classifier, to be specific) is the bottleneck.
@sergei3000 RandomForestClassifier.predict_proba is already parallelized (https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/ensemble/forest.py#L579), controllable via the n_jobs init parameter...

Well, most of the work can be done with joblib and sklearn.utils.gen_batches:

from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import gen_batches

n_jobs = 4
n_samples, n_features = X.shape
batch_size = n_samples // n_jobs


def _predict(method, X, sl):
    # apply the given bound method (e.g. estimator.predict) to one slice of X
    return method(X[sl])


Parallel(n_jobs)(delayed(_predict)(estimator.predict, X, sl)
                 for sl in gen_batches(n_samples, batch_size))

I am not sure if adding parallelization wrappers to sklearn.utils (or some other location) would be that useful when compared to directly using joblib. That way people could tune the chunk size, parallel backend, etc. without going through another layer of abstraction. In any case, maybe it could be worth adding a section in the documentation (http://scikit-learn.org/stable/modules/scaling_strategies.html) about parallelizing as a way of scaling predictions?

Also, in the case when X is very large (and possibly on disk), I imagine out-of-core processing with dask might be more suitable:

import dask.array as da

X_da = da.from_array(X, chunks=(batch_size, n_features))
X_da.map_blocks(estimator.predict, dtype=int, drop_axis=1).compute()
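One small addition to the joblib snippet above: Parallel returns a list with one array per batch, so the pieces still need to be stacked back together. Reusing the names from that snippet (estimator is assumed to be already fitted):

import numpy as np

y_pred = np.concatenate(
    Parallel(n_jobs)(delayed(_predict)(estimator.predict, X, sl)
                     for sl in gen_batches(n_samples, batch_size)))

np.concatenate covers dense outputs; scipy.sparse.vstack would be needed if the method returns sparse matrices, as in the earlier recipe.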
+1 for Roman's solution. While it would be elegant to include an n_jobs parameter for prediction and perhaps transformation methods, as a temporary measure it shouldn't be too difficult to write your own version.
@rth do you mean I can do …

@sergei3000 No, you can do …
@rth I'll check it once again, but I think I observed that the CPU only worked at a third of its full capacity while working on predict_proba, as opposed to being fully loaded while building the model with fit.
@sergei3000 That's how things sometimes look with the threading parallel backend (RandomForestClassifier uses it). With the multiprocessing parallel backend, the CPU will be fully loaded, but that doesn't necessarily mean it will be faster; it will also spend CPU cycles just copying data around... In your use case, if performance is an issue, it can certainly be worth trying to chunk the X array and run predict in parallel (cf. the examples above) to see if it would improve prediction time...
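A rough sketch of the kind of comparison meant here, using the parallel_predict helper posted earlier in the thread (the estimator, sizes, and backend behaviour are purely illustrative):

import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(2000, 20)
y_train = (X_train[:, 0] > 0.5).astype(int)
X_big = np.random.rand(500000, 20)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X_train, y_train)

t0 = time.time()
p_direct = clf.predict_proba(X_big)    # built-in forest parallelism (threading backend)
t1 = time.time()
p_chunked = parallel_predict(clf, X_big, n_jobs=-1, method='predict_proba')  # chunked via joblib (cf. recipe above)
t2 = time.time()

print('direct predict_proba:     %.2f s' % (t1 - t0))
print('chunked parallel_predict: %.2f s' % (t2 - t1))
np.testing.assert_allclose(p_direct, p_chunked)   # same results either way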
@sergei3000 some of the slowness may derive from a bug in version 0.18. Make sure you're using the 0.19 pre-release.
Many thanks to everyone for your inputs, they've been extremely helpful for my understanding.
For those interested and able to accept dask as a dependency, I've implemented @rth's suggestion in dask-ml. https://dask-ml.readthedocs.io/en/latest/auto_examples/plot_parallel_postfit.html#sphx-glr-auto-examples-plot-parallel-postfit-py has an example of a meta-estimator that gives parallel (out-of-core, potentially distributed across a cluster) predict / transform. The usual caveats of parallel overhead apply, so if the scikit-learn estimator already predicts in parallel and your data fits in memory, you may not want to use the meta-estimator.
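If I read the linked dask-ml example correctly, the meta-estimator in question is ParallelPostFit; usage would look roughly like the following (a sketch only; the import path and details should be checked against the dask-ml docs):

import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(1000, 10)
y_train = (X_train[:, 0] > 0.5).astype(int)

# fit happens in memory as usual; only post-fit methods (predict, transform, ...)
# are applied blockwise over the dask array
clf = ParallelPostFit(estimator=RandomForestClassifier(n_estimators=50))
clf.fit(X_train, y_train)

X_big = da.random.random((1000000, 10), chunks=(100000, 10))
y_pred = clf.predict(X_big)   # lazy dask array, one task per chunk
y_pred = y_pred.compute()     # run it, locally or on a distributed cluster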
My use case is slightly different: I'd like to disable parallelisation on prediction but use it in training (the prediction already happens in a parallel environment). However, the solution would be the same: implement n_jobs on .predict and .train rather than having it as a parameter of the classifier. Is there an argument against doing this?

Relevant code for parallel predict: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/forest.py#L587
In that case, can you use `set_params(n_jobs=1)` after training?
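Concretely, that would look something like this (RandomForestClassifier as the example, since its n_jobs init parameter is also used by predict, and changing it does not require refitting):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 10)
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)  # fit in parallel
clf.fit(X, y)

clf.set_params(n_jobs=1)      # prediction will now run in a single job,
y_pred = clf.predict(X)       # e.g. inside an environment that is already parallel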
Great idea, and that is exactly what I'm doing; it works fine. I suppose I'm just interested why …
Closing, as dask-ml is probably better suited to handle such tasks and the corresponding functionality was implemented in … While other estimators already support parallel predict (e.g. …
Description
Sometimes one trains a classifier on a sample and then has to run it on a massive dataset, which can be very slow. It would be great if predict and predict_proba had an n_jobs parameter so this could be done in parallel on multi-core machines, or alternatively if there were a standard copy-and-pasteable solution.
My guess is that one challenge in doing this efficiently is avoiding copies of large sections of the input matrix, but other than that it should be embarrassingly parallel.
http://stackoverflow.com/questions/31449291/how-to-parallelise-predict-method-of-a-scikit-learn-svm-svc-classifier suggests a workaround, although it is untested.
For random forests I am aware of https://github.com/ajtulloch/sklearn-compiledtrees, which speeds up prediction as well, although it does not yet work for classification.