ENH Enable prediction of isolation forest in parallel #28622

Merged
merged 22 commits into from
Jul 2, 2024

Conversation

adam2392
Member

@adam2392 adam2392 commented Mar 12, 2024

Reference Issues/PRs

Fixes: #14000
Follow-up to #14001

What does this implement/fix? Explain your changes.

  • Implements a parallel function, invoked through a joblib.Parallel wrapper, that scores samples by computing the depth each tree reaches per sample
  • Lets the user control the parallelization via a joblib backend context, and adds notes to the docstring
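As a sketch of the intended usage (estimator settings and data shapes here are illustrative, not from the PR), predict-time parallelism is driven by a surrounding joblib context rather than the estimator's ``n_jobs``:

```python
import numpy as np
from joblib import parallel_backend
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
X_test = rng.normal(size=(2000, 4))

# n_jobs on the estimator controls fit-time parallelism only
clf = IsolationForest(n_estimators=50, random_state=0).fit(X_train)

# predict-time parallelism comes from the surrounding joblib context
with parallel_backend("threading", n_jobs=4):
    labels = clf.predict(X_test)
```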

A new benchmarking script I ran that shows very encouraging results.

[benchmark results plot]

Discussion of benchmarking

The four lines are isolation forests predicting on a test set with different n_jobs settings over different test-set sample sizes (x-axis). Each dot uses different n_features and contamination values, showing that the n_jobs speedup is real and not an artifact of a particular set of hyperparameters.

  • Sample size ≈ 1000: the benchmark shows a slight overhead when the number of test samples is below ~2000 (i.e. around 1000).
  • n_jobs == 1: performance matches main, as expected, when comparing the left and right plots.
  • n_jobs > 1: scaling improves significantly with n_jobs, as expected. As n_jobs increases, the left-hand plot shows prediction time decreasing, while the right-hand plot is flat because main does not predict in parallel.
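A minimal version of the kind of measurement behind these plots (sizes and n_jobs values are illustrative; absolute timings vary by machine):

```python
import time

import numpy as np
from joblib import parallel_backend
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(256, 8))
X_test = rng.normal(size=(4000, 8))

clf = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

timings = {}
for n_jobs in (1, 2, 4):
    start = time.perf_counter()
    with parallel_backend("threading", n_jobs=n_jobs):
        clf.predict(X_test)
    timings[n_jobs] = time.perf_counter() - start
# on a multi-core machine, timings tend to shrink as n_jobs grows once
# the test set is large enough to amortize the dispatch overhead
```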

Any other comments?

Possibly related, since isolation forest can now also handle NaNs: #27966

cc: @sergiormpereira @thomasjpfan @adrinjalali

Signed-off-by: Adam Li <adam2392@gmail.com>

github-actions bot commented Mar 12, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5302073. Link to the linter CI: here

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 marked this pull request as ready for review June 26, 2024 15:43
The predict method can be parallelized by setting a joblib context. This
inherently does NOT use the ``n_jobs`` parameter initialized in the class,
which is used during ``fit``. This is because predict may actually be faster
without parallelization for a small number of samples. The user can set the
Member Author

Suggested change
without parallelization for a small number of samples. The user can set the
without parallelization for a small number of test-samples of 1000 or less. The user can set the

We could mention this explicitly?

Member


I am okay with being explicit here.

Member Author


I've updated it to mention that it is better not to parallelize for 1000 samples or fewer.
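One way to encode that guidance in user code (a heuristic sketch; the helper name `predict_isoforest` and the threshold value are ours, not part of the scikit-learn API):

```python
import numpy as np
from joblib import parallel_backend
from sklearn.ensemble import IsolationForest

def predict_isoforest(model, X, n_jobs=4, parallel_threshold=1000):
    """Parallelize predict only when the test set is large enough.

    The ~1000-sample cutoff follows the benchmark discussion above
    and is a rule of thumb, not a guarantee.
    """
    if X.shape[0] <= parallel_threshold:
        return model.predict(X)  # skip joblib dispatch overhead
    with parallel_backend("threading", n_jobs=n_jobs):
        return model.predict(X)

rng = np.random.default_rng(0)
clf = IsolationForest(random_state=0).fit(rng.normal(size=(100, 3)))
small = predict_isoforest(clf, rng.normal(size=(10, 3)))
large = predict_isoforest(clf, rng.normal(size=(1500, 3)))
```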

@betatim
Member

betatim commented Jun 27, 2024

Thanks for reviving the PR. I will wait to see if any of the people you mentioned want to say something regarding "should this be done" and "how to do it". It seems reasonable to me after spending 5min thinking about it. I like using a context manager to set the parallelism at predict time, that seems like a smart idea.

One thing I struggled with is reading the plots. In particular what the difference is between the left one and the right one, and why the right hand one has so many lines but only two legend entries. Maybe you can tweak the visuals a bit or write a few sentences to help read the plots (what should I be looking for on each/what do they demonstrate/question do they answer).

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Member Author

adam2392 commented Jun 27, 2024

Hi @betatim thanks for taking a look.

I re-plotted the benchmarking result (it is from running the new benchmarking script on isolation forest predict times), and added a description of what I observed in the PR description.

tldr: predicting in parallel with n_jobs offers substantial speedups when the test-set sample size exceeds roughly 1500.

> Thanks for reviving the PR. I will wait to see if any of the people you mentioned want to say something regarding "should this be done" and "how to do it". It seems reasonable to me after spending 5min thinking about it. I like using a context manager to set the parallelism at predict time, that seems like a smart idea.

Re should this be done: the GH issue and the related outdated PR contain a discussion pointing to the fact that this should be done. The motivation is to speed up isolation forest predictions (similar to how random forest predicts in parallel). However, it is agreed that joblib introduces some overhead, and this manifests at low sample sizes, as the benchmark confirms. Thus "how this should be done" is addressed next.

Re how this should be done: The design is partly inspired by @thomasjpfan suggestions in #14001 (comment) and is related to "should this be done" because now we can control when to predict in parallel with joblib Context manager.

Lmk if these addressed your concerns!

adam2392 added 2 commits June 27, 2024 07:57
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>

from joblib import parallel_backend

with parallel_backend('loky', n_jobs=4):
    model.predict(X)  # hypothetical call; the diff elides the body
Member


Does the loky backend work with require="sharedmem"? I recall loky is a "better multiprocessing", and _parallel_compute_tree_depths uses a threading.Lock, which is a lock for multi-threading.

Member Author

@adam2392 adam2392 Jun 28, 2024


Technically the loky backend does not, but the Parallel docs indicate that require='sharedmem' handles this:

require: ‘sharedmem’ or None, default=None
Hard constraint to select the backend. If set to ‘sharedmem’, the selected backend will be single-host and thread-based even if the user asked for a non-thread based backend with [parallel_config()](https://joblib.readthedocs.io/en/latest/generated/joblib.parallel_config.html#joblib.parallel_config).

https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html

I tested it out and the results were the same. In addition, if you try

with parallel_config("loky", n_jobs=n_jobs):
with parallel_config("threading", n_jobs=n_jobs):

Inside the parallel function, because we pass require='sharedmem', you can print the active backend and it always changes to ThreadingBackend.

But I think the docstring should say `'threading'` just to be explicit and safe, so I changed that.

Member Author


For reference, see: cd185ca

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 requested a review from thomasjpfan June 28, 2024 13:06
Comment on lines 170 to 171
# df = pd.DataFrame(results)
# df.to_csv("~/bench_results_forest/pr-threading.csv", index=False)
Member


This is commented out. Could this benchmark script have two commands? One for running the benchmark and saving the results to disk, and another for plotting the results?

Member Author


I fixed this and made two commands for the benchmark.

#
main()

bench_results = Path("/Users/adam2392/bench_results_forest")
Member


Can the path be user-provided?

Member Author


I also allowed the user to pass in the paths.
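A sketch of what such a two-command CLI with user-provided paths could look like (command and flag names here are illustrative, not the actual benchmark script):

```python
import argparse
from pathlib import Path

def make_parser():
    # hypothetical CLI: one subcommand runs the benchmark, the other plots
    parser = argparse.ArgumentParser(
        description="IsolationForest predict benchmark (illustrative CLI)"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    bench = sub.add_parser("bench", help="run the benchmark, save results")
    bench.add_argument("--output", type=Path, required=True,
                       help="CSV file to write results to")

    plot = sub.add_parser("plot", help="plot previously saved results")
    plot.add_argument("--input", type=Path, required=True,
                      help="CSV file of saved results")
    return parser

args = make_parser().parse_args(["bench", "--output", "results.csv"])
```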


Signed-off-by: Adam Li <adam2392@gmail.com>
@thomasjpfan
Member

There is a merge conflict to fix.

Contributor

@OmarManzoor OmarManzoor left a comment


Thanks for the PR @adam2392. I added a few comments.

adam2392 and others added 3 commits July 1, 2024 10:58
Signed-off-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 requested a review from OmarManzoor July 1, 2024 15:01
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 requested a review from OmarManzoor July 1, 2024 16:46
Contributor

@OmarManzoor OmarManzoor left a comment


LGTM. Thanks @adam2392

@OmarManzoor OmarManzoor enabled auto-merge (squash) July 2, 2024 07:08
@OmarManzoor OmarManzoor merged commit 191f969 into scikit-learn:main Jul 2, 2024
30 checks passed
@adam2392 adam2392 deleted the itree branch July 2, 2024 12:20
snath-xoc pushed a commit to snath-xoc/scikit-learn that referenced this pull request Jul 5, 2024

Successfully merging this pull request may close these issues.

Isolation Forest executes single-threaded during prediction