Run itrees in parallel during prediction. #14001
Conversation
When is this actually faster? We did this for the random forests at some point but I think we found that it's slower. Can you provide some benchmarks? |
Hey @amueller! Thanks a lot for the feedback. This is slower for small amounts of test data (1000 samples), but it still runs in around 120 ms in our tests. However, it gets faster than the single-threaded version as we increase the amount of data and the number of trees. Please find the benchmark, with some comments, in the following notebook: Let me know if I should run some more tests. |
ping @amueller :) |
@sergiopasra see how it is done here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/bagging.py#L356 |
Ok @agramfort . I'll have a look and come back after. Thanks for the feedback! |
Trees parallelized in chunks during prediction. …_compute_score_samples of iforest.py
@agramfort thanks for the feedback! I updated the PR with your suggestion of dividing the trees into chunks. I also re-ran the benchmark with this approach. You can check the notebook at https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_predict_v2.ipynb You can also compare with the first version at https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_predict.ipynb In this PR I needed to check the joblib version to make the tests pass. |
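A rough sketch of the chunked scheme being discussed, following the bagging.py pattern (all names here are illustrative, not scikit-learn's actual private API): the estimators are partitioned into `n_jobs` contiguous chunks and each worker accumulates depths over its chunk.

```python
import numpy as np
from joblib import Parallel, delayed

def partition_estimators(n_estimators, n_jobs):
    # simplified stand-in for sklearn.ensemble's _partition_estimators:
    # chunk boundaries for splitting n_estimators across n_jobs workers
    sizes = np.full(n_jobs, n_estimators // n_jobs, dtype=int)
    sizes[: n_estimators % n_jobs] += 1
    return np.concatenate([[0], np.cumsum(sizes)])

def chunk_depths(trees, X):
    # one worker: sum the depth contributions of its chunk of trees
    return sum(t(X) for t in trees)

def score_chunked(trees, X, n_jobs=2):
    starts = partition_estimators(len(trees), n_jobs)
    # threads avoid pickling in this toy sketch
    per_chunk = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(chunk_depths)(trees[starts[i]:starts[i + 1]], X)
        for i in range(n_jobs))
    return np.sum(per_chunk, axis=0)

# toy "trees": callables returning per-sample depths
trees = [lambda X, s=s: np.abs(X - s).sum(axis=1) for s in range(5)]
X = np.linspace(0.0, 1.0, 8).reshape(4, 2)
assert np.allclose(score_chunked(trees, X), sum(t(X) for t in trees))
```

The chunking keeps the number of dispatched joblib tasks equal to the number of workers instead of the number of trees, which reduces scheduling overhead.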
no more objection on my side. You'll need to add a what's new entry. maybe @albertcthomas you want to have a look? |
Thanks for the nice benchmark @sergiormpereira, I will have a look at the code before the end of the week. |
It might be good to have a test checking that the output of `score_samples` is the same when `n_jobs=1` and `n_jobs=2`, for instance in `test_iforest_parallel_regression()`. Otherwise LGTM.
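The shape of such an equality test can be sketched outside scikit-learn with a toy per-tree scorer (all names below are made up; the real test would fit `IsolationForest` and compare `score_samples` outputs across `n_jobs` settings):

```python
import numpy as np
from joblib import Parallel, delayed

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
offsets = rng.rand(8)  # one pseudo-tree per offset

def tree_depths(X, offset):
    # toy surrogate for one isolation tree's per-sample path lengths
    return np.abs(X).sum(axis=1) + offset

def compute_depths(X, n_jobs):
    results = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(tree_depths)(X, off) for off in offsets)
    return np.sum(results, axis=0)

# the property the reviewer asks to test: scores must not depend on n_jobs
assert np.allclose(compute_depths(X, n_jobs=1), compute_depths(X, n_jobs=2))
```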
sklearn/ensemble/iforest.py (Outdated)

```python
            for i in range(n_jobs))

        n_samples = X.shape[0]
        depths = np.zeros(n_samples, order="f")
```
I think you can use `depths = np.sum(par_results, axis=0)` instead of this initialization and the for loop below.
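The suggestion amounts to replacing the initialize-and-accumulate loop with a single NumPy reduction. A minimal illustration with made-up per-job results:

```python
import numpy as np

# par_results stands for the list of per-job depth arrays returned by the
# joblib call in this PR (name illustrative), each of shape (n_samples,)
par_results = [np.array([1.0, 2.0, 3.0]),
               np.array([0.5, 0.5, 0.5]),
               np.array([2.0, 1.0, 0.0])]

# loop version: initialize, then accumulate
depths = np.zeros(3, order="f")
for d in par_results:
    depths += d

# suggested one-line reduction over the list of arrays
depths_sum = np.sum(par_results, axis=0)
assert np.allclose(depths, depths_sum)  # both give [3.5, 3.5, 3.5]
```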
Totally agree! I'll include this change.
Hey @albertcthomas, thanks a lot for the feedback! Now two of the tests are failing, but they are in test_ridge.py. It looks related to the following issue: Do you think it was caused by my changes? |
Thanks @sergiormpereira! I don't think the failure is related to this PR. Could you push a new commit (or amend the last commit and force push it) to rerun CI? |
force-pushed from 8e54aea to fbdbcce
Thanks @albertcthomas! I amended the last commit and force pushed it, as you suggested. Now it passes everything. |
@agramfort I added an entry to What's New. I classified it as a Fix because the documentation said that n_jobs controls the number of parallel jobs both in training and prediction. |
ping @agramfort @amueller :) |
This provides really nice improvements for large samples! I am concerned about users that set a high |
Yeah can you include |
@thomasjpfan @amueller Yeah, I agree that a threshold can be set. I'll include it in the next days. |
Hey @thomasjpfan @amueller ! I was looking into the issue of predicting with a small number of samples. Please, have a look at the following points.
The study regarding points 1, 2, and 3 can be found in: https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_predict_samples_study.ipynb The study of point 4: https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/Parallel_IForest_VS_RF.ipynb What are your thoughts on this? |
@sergiormpereira Thank you for the benchmarks! On 2: I am okay with a threshold. I wonder if it is better to have a threshold, or document that it may be good to On 4: The parallelism in the Random Forest does not break up the samples into batches. It passes all of |
@thomasjpfan thanks for the feedback, and for the info regarding RF! On 2: I can easily implement either option. But I lean towards documenting it, for the sake of being more general and avoiding hard-coding a constant. We could also document that it may increase the memory footprint (see the next point). Perhaps @amueller can comment here, too? On 4: running the prediction in parallel with this PR increases the memory footprint by roughly a bit more than 0.5 times per parallel job, relative to the current single-threaded method. Maybe this can be improved in the future? Please have a look at the memory benchmark that I conducted: https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_memory_consumption.ipynb |
ping @amueller @thomasjpfan :) |
Any updates? Would love to see this being merged. Happy to help on any other investigations |
With the increase in memory usage, which opens #12040 back up again by using more memory than `sklearn.get_config()['working_memory']`, I'd rather not have this feature activate automatically. There is essentially a trade-off between computation time and memory usage. The way to work around this may be to parallelize one step higher, at the following level:

```python
def _compute_chunked_score_samples(self, X):
    ...
    for sl in slices:
        scores[sl] = self._compute_score_samples(X[sl], subsample_features)
```

(I am unsure if this will work) |
Hey @thomasjpfan! I was analyzing that suggestion about parallelizing higher up and, from my understanding, it will not alleviate the memory issue. At the moment, when we parallelize over the trees, we have n_jobs trees running in parallel. With your suggestion we would be parallelizing at the batch level, meaning that with large data we would have n_jobs batches being predicted in parallel, with the trees run in series within each batch. So we would still, indirectly, have n_jobs trees running in parallel. Am I correct here? One option would be to add a parameter to IsolationForest that explicitly enables parallel prediction, False by default, for instance parallel_predict=False. In the documentation we could warn about the current issues of setting it to True. What do you think? Perhaps @amueller can comment, too :) |
ping for comments :) |
ping @amueller @thomasjpfan . I really would like to move this PR forward :) |
Generally prediction can be batched over samples. This could even be achieved with a mixin, and would maintain reasonably minimal memory requirements... Rather than providing a parameter to enable parallelisation across trees, I wonder if this kind of mixin would be similarly performant. See also dask_ml.wrappers.ParallelPostFit. |
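The batching-over-samples idea could look roughly like this (a hypothetical mixin-style helper; this is not dask_ml's or scikit-learn's actual API):

```python
import numpy as np
from joblib import Parallel, delayed

def batched_parallel_predict(predict_fn, X, batch_size, n_jobs=2):
    """Split samples into batches and score each batch in parallel,
    bounding per-worker memory to one batch at a time."""
    batches = [X[i:i + batch_size] for i in range(0, len(X), batch_size)]
    scores = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(predict_fn)(b) for b in batches)
    return np.concatenate(scores)

# toy scoring function standing in for estimator.score_samples
def score(X):
    return X.sum(axis=1)

X = np.arange(10.0).reshape(5, 2)
# batched parallel scoring matches scoring all samples at once
assert np.allclose(batched_parallel_predict(score, X, batch_size=2), score(X))
```

Because each worker only ever holds one batch, peak memory stays close to the single-threaded case while still using several cores.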
+1 for parallelized predictions. In the meantime, readers may consider a parallel wrapper as a workaround:

```python
from os import sched_getaffinity

import numpy as np
from sklearn.ensemble import IsolationForest

# Fit IsolationForest
iso = IsolationForest().fit(X_train)

# Split test array into one chunk per available core
n_chunks = len(sched_getaffinity(0))
chunks = np.array_split(X_test, n_chunks)
```

Multiprocessing:

```python
from multiprocessing import Pool

# Predict in parallel
with Pool(n_chunks) as pool:
    y_score = np.concatenate(pool.map(iso.score_samples, chunks))
```

Joblib:

```python
from joblib import Parallel, delayed

# Predict in parallel
par_exec = Parallel(n_jobs=n_chunks, max_nbytes='8G')
y_score = np.concatenate(par_exec(delayed(iso.score_samples)(_X) for _X in chunks))
```

Multiprocessing is slightly faster for me, but your mileage may vary. |
Sorry for the delay. I left a suggestion with a possible way forward.
sklearn/ensemble/iforest.py

```python
        return batch_depths

        n_jobs, n_estimators, starts = _partition_estimators(
            self.n_estimators, self.n_jobs)
```
Looking at this issue again, I think we can do this:

```diff
-            self.n_estimators, self.n_jobs)
+            self.n_estimators, None)
```
which allows `joblib.parallel_backend` to control `n_jobs`. At a higher level, we can use `parallel_backend` to control `n_jobs`:

```python
with parallel_backend("loky", n_jobs=6):
    iso.score_samples(X)
```
Details: seeing that `_partition_estimators` uses `effective_n_jobs` (scikit-learn/sklearn/ensemble/_base.py, line 209 in 8d295fb):

```python
n_jobs = min(effective_n_jobs(n_jobs), n_estimators)
```

`effective_n_jobs` queries the configuration as follows:

```python
with parallel_backend("loky", n_jobs=4):
    print(effective_n_jobs(None))
# 4

# default is 1
print(effective_n_jobs(None))
# 1
```
@adrinjalali I see you marked this as stalled and help-wanted. I would be interested in helping to finish this off, as I think this is an important performance gap within the tree/ensemble module. From my understanding, the remaining work is just configuring the right internal API using Thomas's suggestion, probably updating the unit tests, and then running a few benchmarks to verify this works as intended. |
@adam2392 would be very nice if you could then open a PR to supersede this work. |
This has been done in #25186 |
Reference Issues/PRs
#14000
What does this implement/fix? Explain your changes.
Isolation Forest is executed in parallel during fitting, but during prediction it runs single-threaded.
In this PR, I parallelised the execution during prediction, more precisely in the _compute_score_samples method. Previously, each iTree was called in sequence in a for loop. I created an auxiliary internal function that executes each tree, so that it can be run in parallel; I used joblib for the parallelisation.
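The change described above could be sketched like this (all names are illustrative, not the PR's actual code): the per-tree body of the scoring loop is hoisted into a helper that joblib can dispatch, and the per-tree depths are then summed.

```python
import numpy as np
from joblib import Parallel, delayed

def single_tree_depths(tree_fn, X):
    # in the real code this would traverse one fitted iTree;
    # here tree_fn is any callable returning per-sample depths
    return tree_fn(X)

def compute_score_samples_parallel(trees, X, n_jobs=2):
    # threads avoid pickling in this toy sketch
    per_tree = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(single_tree_depths)(t, X) for t in trees)
    return np.sum(per_tree, axis=0)

# toy trees and data
trees = [lambda X, s=s: np.abs(X - s).sum(axis=1) for s in (0.0, 0.5, 1.0)]
X = np.zeros((4, 2))
sequential = sum(t(X) for t in trees)  # the pre-PR for-loop result
assert np.allclose(compute_score_samples_parallel(trees, X), sequential)
```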
Any other comments?
I ran the tests for Isolation Forest and they passed.