
Run itrees in parallel during prediction. #14001


Closed

Conversation

sergiormpereira

Reference Issues/PRs

#14000

What does this implement/fix? Explain your changes.

Isolation Forest is executed in parallel during fitting, but during prediction it runs single-threaded.

In this PR, I parallelised the execution during prediction, more precisely in the _compute_score_samples method. Each ITree was being called in sequence, using a for loop. I created an auxiliary internal function that executes each tree, and this function can be run in parallel. I used joblib for parallelisation.
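Roughly, the pattern is the following (a simplified sketch rather than the exact PR code; the helper name is illustrative and the scoring omits the average-path-length correction used for the real anomaly score):

import numpy as np
from joblib import Parallel, delayed


def _tree_depths(tree, X):
    # Illustrative per-tree helper: number of nodes each sample traverses in
    # one fitted iTree (the real score also adds a leaf-size correction).
    node_indicator = tree.decision_path(X)
    return np.ravel(node_indicator.sum(axis=1))


def parallel_depths(estimators, X, n_jobs=None):
    # One joblib task per tree, then sum the per-tree depths over all trees.
    per_tree = Parallel(n_jobs=n_jobs)(
        delayed(_tree_depths)(tree, X) for tree in estimators
    )
    return np.sum(per_tree, axis=0)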

Any other comments?

I ran the tests for Isolation Forest and they passed.

@amueller
Member

When is this actually faster? We did this for the random forests at some point but I think we found that it's slower. Can you provide some benchmarks?

@sergiormpereira
Author

sergiormpereira commented Jun 4, 2019

Hey @amueller! Thanks a lot for the feedback.

This is slower for small amounts of test data (1000 samples), but it still runs in around 120 ms in our tests. However, we can see that it gets faster than running single-threaded as we increase the amount of data and the number of trees.

Please, find the benchmark in the following notebook, with some comments:
https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_predict.ipynb
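For a quick local check, a timing harness along these lines reproduces the comparison (a sketch only, not the notebook code; it assumes this PR's branch, where n_jobs is also honoured at prediction time, and the shapes are arbitrary):

import time

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.randn(10_000, 50)
X_test = rng.randn(100_000, 50)

for n_jobs in (1, 4):
    iso = IsolationForest(n_estimators=300, n_jobs=n_jobs, random_state=0).fit(X_train)
    tic = time.perf_counter()
    iso.score_samples(X_test)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - tic:.2f} s")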

Let me know if I should run some more tests.

@sergiormpereira
Author

ping @amueller :)

@agramfort
Member

@sergiopasra see how it is done here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/bagging.py#L356
_partition_estimators allows parallelizing over batches of estimators, which leads to fewer threads and is faster
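For context, the chunked pattern looks roughly like this (a sketch; _depths_for_chunk is an illustrative worker, and the _partition_estimators import path is the private one inside scikit-learn, which has moved between versions):

import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble._base import _partition_estimators  # private helper


def _depths_for_chunk(trees, X):
    # Illustrative worker: score a whole chunk of trees sequentially, so only
    # n_jobs tasks are created instead of one task per tree.
    depths = np.zeros(X.shape[0])
    for tree in trees:
        depths += np.ravel(tree.decision_path(X).sum(axis=1))
    return depths


def parallel_depths_chunked(estimators, X, n_jobs):
    n_jobs, _, starts = _partition_estimators(len(estimators), n_jobs)
    chunk_depths = Parallel(n_jobs=n_jobs)(
        delayed(_depths_for_chunk)(estimators[starts[i]:starts[i + 1]], X)
        for i in range(n_jobs)
    )
    return np.sum(chunk_depths, axis=0)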

@sergiormpereira
Author

Ok @agramfort . I'll have a look and come back after. Thanks for the feedback!

@sergiormpereira
Author

sergiormpereira commented Jul 2, 2019

@agramfort thanks for the feedback!

I updated the PR with your suggestion of dividing the trees into chunks.

I also re-ran the benchmark with this approach. You can check the notebook at https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_predict_v2.ipynb

You can also compare with the first version in https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_predict.ipynb

In this PR I needed to check the version of joblib to make the tests pass.

@agramfort
Member

no more objections on my side. You'll need to add a what's new entry.

maybe @albertcthomas you want to have a look?

@albertcthomas
Contributor

Thanks for the nice benchmark @sergiormpereira, I will have a look at the code before the end of the week.

Contributor

@albertcthomas albertcthomas left a comment

It might be good to have a test checking that the output of score_samples is the same when n_jobs=1 and n_jobs=2, for instance in test_iforest_parallel_regression(). Otherwise LGTM.
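Something along these lines, for instance (a sketch; the test name and data are illustrative):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest


def test_iforest_score_samples_n_jobs():
    # Parallel scoring must give the same result as the sequential path.
    X, _ = make_blobs(n_samples=200, random_state=0)
    scores_seq = IsolationForest(n_jobs=1, random_state=0).fit(X).score_samples(X)
    scores_par = IsolationForest(n_jobs=2, random_state=0).fit(X).score_samples(X)
    np.testing.assert_allclose(scores_seq, scores_par)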

for i in range(n_jobs))

n_samples = X.shape[0]
depths = np.zeros(n_samples, order="f")
Contributor

I think you can use depths = np.sum(par_results, axis=0) instead of this initialization and the for loop below.

Author

Totally agree! I'll include this change.

@sergiormpereira
Author

Hey @albertcthomas, thanks a lot for the feedback!
I added both suggestions: the tests, and replacing the loop with np.sum.

Now, two of the tests are failing, but they are in test_ridge.py. It looks related to the following issue:
#14219

Do you think it was caused by my changes?

@albertcthomas
Contributor

Thanks @sergiormpereira! I don't think the failure is related to this PR. Could you push a new commit (or amend the last commit and force push it) to rerun CI?

@sergiormpereira sergiormpereira force-pushed the iforest-parallel-predict branch from 8e54aea to fbdbcce Compare July 17, 2019 16:13
@sergiormpereira
Author

Thanks @albertcthomas! I amended the last commit and force pushed it, as you suggested. Now it passes everything.

@sergiormpereira
Author

@agramfort I added an entry to What's new. I put it under Fix because the documentation says that n_jobs controls the number of parallel jobs both at training and at prediction time.

@sergiormpereira
Author

ping @agramfort @amueller :)

@thomasjpfan
Member

This provides really nice improvements for large samples!

I am concerned about users who set a high n_jobs during training and then move the model into production, where score_samples runs on <1k samples. They will experience a major performance hit.

@amueller
Member

Yeah, can you include n_samples=10? There could also be a threshold so that it only runs in parallel with enough samples, right?

@sergiormpereira
Author

@thomasjpfan @amueller Yeah, I agree that a threshold can be set. I'll include it in the next few days.

@sergiormpereira
Author

Hey @thomasjpfan @amueller! I was looking into the issue of predicting with a small number of samples. Please have a look at the following points.

  1. It appears that there is an impact both from the number of samples and from the number of features. But, in general, the best we can get with parallel prediction is around 100 ms per prediction call. This applies to, e.g., 1 sample and 10 features, or 100 samples and 500 features. In contrast, the current single-threaded prediction can achieve a best time of 30 ms per prediction call, e.g., for 1 sample with 10 features. So, my conclusion is that in this regime, the thread handling dominates the running time.

  2. Maybe we can impose a threshold. The relationship between the number of features, the number of samples, and the obtained speed-up looks non-linear. I could try to come up with a function to define the threshold, but I am afraid that it could be hardware-dependent. So, I would rather go for a simplified threshold, like >= 5000 samples to run in parallel (see the sketch after this list). This is more conservative than 1k samples, but in that regime the speed-up over the single-threaded mode is not that large anyway. What's your opinion on this?

  3. Also, the minimum runtime we can reach with the single-threaded IForest is around 30 ms, while with parallel threads it is around 100 ms. This is more than twice as slow, but as we increase the amount of data, we start saving seconds, minutes, or hours, which I think is a better saving than 70 ms. Of course, I understand that in production we may be interested in calling predict multiple times on small numbers of samples, as @thomasjpfan said.

  4. As a matter of curiosity, I also did a small test with the Boston house-prices dataset, using Random Forest regression and Isolation Forest. I observed that the RF regressor with 1 thread achieves predict times of around 3 ms, but when we start increasing the number of threads, the minimum predict call time is 100 ms. With the single-threaded Isolation Forest, in this scenario, the time for 1 sample is around 30 ms. When we use more parallel jobs during predict with Isolation Forest, we also get around 100 ms per predict call. So, I conclude that this issue is not really Isolation Forest-specific, but is more related to handling the parallel jobs.
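To make point 2 concrete, a minimal sketch of such a guard (the constant, the helper, and the simplified per-tree scoring are illustrative, not part of the PR):

import numpy as np
from joblib import Parallel, delayed

# Hypothetical cut-off below which the job-handling overhead dominates.
PARALLEL_PREDICT_MIN_SAMPLES = 5000


def _tree_depths(tree, X):
    # Simplified per-tree path lengths (the real score adds a leaf correction).
    return np.ravel(tree.decision_path(X).sum(axis=1))


def depths_maybe_parallel(estimators, X, n_jobs):
    if n_jobs in (None, 1) or X.shape[0] < PARALLEL_PREDICT_MIN_SAMPLES:
        # Small inputs: stay single-threaded to avoid the ~100 ms job overhead.
        return sum(_tree_depths(tree, X) for tree in estimators)
    per_tree = Parallel(n_jobs=n_jobs)(
        delayed(_tree_depths)(tree, X) for tree in estimators
    )
    return np.sum(per_tree, axis=0)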

The study regarding points 1, 2, and 3 can be found in: https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_predict_samples_study.ipynb

The study of point 4: https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/Parallel_IForest_VS_RF.ipynb

What are your thoughts on this?

@thomasjpfan
Member

@sergiormpereira Thank you for the benchmarks!

On 2:

I am okay with a threshold. I wonder if it is better to have a threshold, or to document that it may be good to call set_params(n_jobs=1) when the sample size is < 5000.
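If we go the documentation route, the advice amounts to something like this usage pattern (illustrative data and shapes):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train, X_small = rng.randn(100_000, 20), rng.randn(100, 20)

# Train with many jobs, but score small production batches sequentially.
iso = IsolationForest(n_estimators=500, n_jobs=-1).fit(X_train)
iso.set_params(n_jobs=1)  # avoid the parallel overhead for small inputs
scores = iso.score_samples(X_small)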

On 4:

The parallelism in IsolationForest first uses _compute_chunked_score_samples to break up the samples into batches and puts each batch through every tree in estimators_ inside _compute_score_samples. This PR parallelizes the calls to the tree methods that obtain the depths. I wonder if this PR will use too much memory, as reported in #12040.

The Random Forest does not break up the samples into batches. It passes all of X into predict_proba of each tree. The predict_proba call is parallelized in this case.
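For reference, the sample batching mentioned above amounts to something like this (a simplified sketch; the real code derives the chunk size from the working_memory config and calls the private _compute_score_samples on each batch, here replaced by the public score_samples):

import numpy as np
from sklearn.utils import gen_batches


def chunked_score_samples(iso, X, chunk_n_rows=4096):
    # Score the samples batch by batch; inside each batch, every tree is
    # applied sequentially (the loop this PR parallelizes).
    scores = np.empty(X.shape[0])
    for sl in gen_batches(X.shape[0], chunk_n_rows):
        scores[sl] = iso.score_samples(X[sl])  # public stand-in for the batch scoring
    return scores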

@sergiormpereira
Author

sergiormpereira commented Aug 6, 2019

@thomasjpfan thanks for the feedback, and info regarding RF!

On 2: I can easily implement either of them. But, somehow, I lean towards documenting it, for the sake of being more general and avoiding putting a constant in the code. We could also document that it may increase the memory footprint (see the next point). Perhaps @amueller can comment here, too?

On 4: running prediction in parallel with this PR increases the memory footprint by roughly a little more than 0.5 times the number of parallel jobs, relative to the current single-threaded method. Maybe this can be improved in the future? Please have a look at the memory benchmark that I conducted: https://github.com/TechhubLisbon/scikit-learn/blob/iforest-parallel-predict-benchmark/benchmarks/bench_isolation_forest_parallel_memory_consumption.ipynb
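For anyone who wants to reproduce that kind of measurement locally, something along these lines works (a sketch only, not the notebook code; it assumes memory_profiler is installed and this PR's branch, where n_jobs is used at prediction time):

import numpy as np
from memory_profiler import memory_usage
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train, X_test = rng.randn(10_000, 50), rng.randn(100_000, 50)

for n_jobs in (1, 4):
    iso = IsolationForest(n_estimators=300, n_jobs=n_jobs, random_state=0).fit(X_train)
    # Peak resident memory (MiB) while scoring the test set.
    peak = max(memory_usage((iso.score_samples, (X_test,), {})))
    print(f"n_jobs={n_jobs}: peak {peak:.0f} MiB")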

@sergiormpereira
Author

ping @amueller @thomasjpfan :)

@necosta

necosta commented Sep 10, 2019

Any updates? Would love to see this merged. Happy to help with any other investigations.

@thomasjpfan
Member

With the increase in memory usage, which opens #12040 back up again by using more memory than `sklearn.get_config()['working_memory']`, I would rather not have this feature activate automatically. There is essentially a trade-off between computation time and memory usage.

The way to work around this may be to parallelize one step higher at the following level:

def _compute_chunked_score_samples(self, X):
    ...
    for sl in slices:
        scores[sl] = self._compute_score_samples(X[sl], subsample_features)

(I am unsure if this will work)
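In joblib terms, that higher-level parallelization would look roughly like this (a sketch only, using the public score_samples as a stand-in for the per-batch scoring; n_jobs batches would still be processed at once):

import numpy as np
from joblib import Parallel, delayed
from sklearn.utils import gen_batches


def chunked_score_samples_parallel(iso, X, n_jobs, chunk_n_rows=4096):
    slices = list(gen_batches(X.shape[0], chunk_n_rows))
    # Each task scores one batch of samples, with all trees run in sequence.
    batch_scores = Parallel(n_jobs=n_jobs)(
        delayed(iso.score_samples)(X[sl]) for sl in slices
    )
    scores = np.empty(X.shape[0])
    for sl, batch in zip(slices, batch_scores):
        scores[sl] = batch
    return scores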

@sergiormpereira
Author

Hey @thomasjpfan! I was analyzing that suggestion about parallelizing one level higher and, from my understanding, it will not alleviate the memory issue. At the moment, when we parallelize over the trees, we have n_jobs parallel trees running. With your suggestion, we would be parallelizing at the batch level, meaning that with large data we would have n_jobs batches being predicted in parallel, with the trees run in series within each batch. So, we would still indirectly have n_jobs parallel trees running. Am I correct here?

One option we could consider is a parameter on IsolationForest that explicitly enables parallel prediction and defaults to False, for instance parallel_predict=False. In the documentation, we could warn about the current issues of setting it to True.

What do you think? Perhaps @amueller can comment, too :)

@necosta

necosta commented Dec 4, 2019

ping for comments :)

@sergiormpereira
Author

ping @amueller @thomasjpfan . I really would like to move this PR forward :)

@jnothman
Member

jnothman commented Jan 7, 2020

Generally prediction can be batched over samples. This could even be achieved with a mixin, and would maintain reasonably minimal memory requirements... Rather than providing a parameter to enable parallelisation across trees, I wonder if this kind of mixin would be similarly performant. See also dask_ml.wrappers.ParallelPostFit.
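A very rough sketch of what such a mixin could look like (class and method names are illustrative; this is not an existing scikit-learn or dask-ml class):

import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest


class BatchedParallelScoreMixin:
    """Illustrative mixin: split the samples into batches and score them in parallel."""

    def score_samples_batched(self, X, n_jobs=None, batch_size=4096):
        batches = [X[i:i + batch_size] for i in range(0, X.shape[0], batch_size)]
        scores = Parallel(n_jobs=n_jobs)(
            delayed(self.score_samples)(batch) for batch in batches
        )
        return np.concatenate(scores)


class ParallelIsolationForest(BatchedParallelScoreMixin, IsolationForest):
    pass

# Usage sketch:
# iso = ParallelIsolationForest(n_estimators=200).fit(X_train)
# scores = iso.score_samples_batched(X_test, n_jobs=4)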

@cmarmo cmarmo added the Needs Decision label Aug 12, 2020
@svenvanhal

+1 for parallelized predictions.

In the meantime, readers may consider a parallel wrapper as a workaround:

from os import sched_getaffinity

import numpy as np
from sklearn.ensemble import IsolationForest

# Fit IsolationForest
iso = IsolationForest().fit(X_train)

# Split the test array into one chunk per available CPU core
n_chunks = len(sched_getaffinity(0))
chunks = np.array_split(X_test, n_chunks)

Multiprocessing:

from multiprocessing import Pool

# Predict in parallel
with Pool(n_chunks) as pool:
    y_score = np.concatenate(pool.map(iso.score_samples, chunks))

Joblib:

from joblib import Parallel, delayed

# Predict in parallel
par_exec = Parallel(n_jobs=n_chunks, max_nbytes='8G')
y_score = np.concatenate(par_exec(delayed(iso.score_samples)(_X) for _X in chunks))

Multiprocessing is slightly faster for me, but your mileage may vary.

Base automatically changed from master to main January 22, 2021 10:51
Member

@thomasjpfan thomasjpfan left a comment

Sorry for the delay. I left a suggestion with a possible way forward.

return batch_depths

n_jobs, n_estimators, starts = _partition_estimators(
    self.n_estimators, self.n_jobs)
Member

Looking at this issue again, I think we can do this:

Suggested change
-    self.n_estimators, self.n_jobs)
+    self.n_estimators, None)

which allows joblib.parallel_backend to control n_jobs. At a higher level, we can then set n_jobs through parallel_backend:

with parallel_backend("loky", n_jobs=6):
    iso.score_samples(X)

Details: Seeing that _partition_estimators uses effective_n_jobs:

n_jobs = min(effective_n_jobs(n_jobs), n_estimators)

effective_n_jobs queries configuration as follows:

with parallel_backend("loky", n_jobs=4):
    print(effective_n_jobs(None))
# 4

# default is 1
print(effective_n_jobs(None))
# 1

@adam2392
Member

adam2392 commented Mar 8, 2024

@adrinjalali I see you marked this as stalled and help-wanted. I would be interested in possibly helping finish this off, as I think this is an important performance gap within the tree/ensemble module.

From my understanding, the remaining work is just around configuring the right internal API using Thomas's suggestion, probably updating unit tests, and then running a few benchmarks to verify this works as intended.

@adrinjalali
Member

@adam2392 it would be very nice if you could open a PR to supersede this work.

@lesteve
Member

lesteve commented Dec 18, 2024

This has been done in #25186

@lesteve lesteve closed this Dec 18, 2024