ENH Enable prediction of isolation forest in parallel #28622
Conversation
Signed-off-by: Adam Li <adam2392@gmail.com>
sklearn/ensemble/_iforest.py
Outdated
    The predict method can be parallelized by setting a joblib context. This
    inherently does NOT use the ``n_jobs`` parameter initialized in the class,
    which is used during ``fit``. This is because, predict may actually be faster
    without parallelization for a small number of samples. The user can set the
Suggested change:
- without parallelization for a small number of samples. The user can set the
+ without parallelization for a small number of test-samples of 1000 or less. The user can set the
We could mention this explicitly?
I am okay with being explicit here.
I've updated it to mention that it is better not to parallelize for 1000 samples or fewer.
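The usage being discussed can be sketched as follows. This is a minimal, hedged example assuming the predict-time joblib-context support this PR adds; the estimator's own `n_jobs` only affects `fit`:

```python
from joblib import parallel_backend
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_samples=5000, n_features=8, random_state=0)
clf = IsolationForest(n_estimators=100, random_state=0).fit(X)

# Predict-time parallelism comes from the surrounding joblib context, not
# from the estimator's n_jobs. Per the docstring discussion above, for
# small test sets (roughly 1000 samples or fewer) the joblib overhead can
# outweigh the speedup, so a plain clf.predict(X) may be faster there.
with parallel_backend("threading", n_jobs=4):
    labels = clf.predict(X)

print(labels.shape)
```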
Thanks for reviving the PR. I will wait to see if any of the people you mentioned want to say something regarding "should this be done" and "how to do it". It seems reasonable to me after spending five minutes thinking about it. I like using a context manager to set the parallelism at predict time; that seems like a smart idea. One thing I struggled with is reading the plots, in particular what the difference is between the left one and the right one, and why the right-hand one has so many lines but only two legend entries. Maybe you can tweak the visuals a bit or write a few sentences to help read the plots (what should I be looking for on each, what do they demonstrate, what question do they answer).
Signed-off-by: Adam Li <adam2392@gmail.com>
Hi @betatim thanks for taking a look. I re-plotted the benchmarking result (it is from running the new benchmarking script on isolation forest predict times), and added a description of what I observed in the PR description. tldr: predicting in parallel with
Re "should this be done": the GH issue and the related outdated PR have a discussion that points to the fact that this should be done, the motivation being to speed up isolation forest predictions (similar to how random forest predicts in parallel). However, it is agreed that there is overhead when using joblib, and this manifests at low sample sizes, as confirmed by the benchmark. Thus "how this should be done" is addressed next. Re "how this should be done": the design is partly inspired by @thomasjpfan's suggestions in #14001 (comment) and is related to "should this be done" because now we can control when to predict in parallel with a joblib context manager. Lmk if these addressed your concerns!
Signed-off-by: Adam Li <adam2392@gmail.com>
sklearn/ensemble/_iforest.py
Outdated
    from joblib import parallel_backend

    with parallel_backend('loky', n_jobs=4):
Does the loky backend work with `require="sharedmem"`? I recall loky is a "better multiprocessing", and `_parallel_compute_tree_depths` is using a `threading.Lock`, which is a lock for multi-threading.
Technically the loky backend does not, but the docs on Parallel sound like it handles that:
require: ‘sharedmem’ or None, default=None
Hard constraint to select the backend. If set to ‘sharedmem’, the selected backend will be single-host and thread-based even if the user asked for a non-thread based backend with [parallel_config()](https://joblib.readthedocs.io/en/latest/generated/joblib.parallel_config.html#joblib.parallel_config).
https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html
I tested it out and the results were the same. In addition, if you try
with parallel_config("loky", n_jobs=n_jobs):
or
with parallel_config("threading", n_jobs=n_jobs):
and print the active backend inside the parallel function, it always changes to ThreadingBackend, because we use `require='sharedmem'`.
But I think the docstring should say `'threading'` just to be explicit and safe, so I changed that.
For reference, see: cd185ca
Signed-off-by: Adam Li <adam2392@gmail.com>
    # df = pd.DataFrame(results)
    # df.to_csv("~/bench_results_forest/pr-threading.csv", index=False)
This is commented out. Can this benchmark function have two commands? One for running the benchmark and saving the results to disk, and another one for plotting the results?
I fixed this and made two commands for the benchmark.
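A two-command layout like the one requested above could be sketched with `argparse` subcommands (this is purely illustrative; the program name, subcommand names, and flags are assumptions, not the script's actual interface):

```python
import argparse

def build_parser():
    # One subcommand runs the benchmark and writes a CSV; the other reads
    # a previously saved CSV and plots it. Names are hypothetical.
    parser = argparse.ArgumentParser(prog="bench_iforest.py")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="run the benchmark and save results to disk")
    run.add_argument("--output", default="bench_results.csv")

    plot = sub.add_parser("plot", help="plot previously saved results")
    plot.add_argument("--input", default="bench_results.csv")
    return parser

# Example invocation of the "run" subcommand with an explicit output path.
args = build_parser().parse_args(["run", "--output", "pr-threading.csv"])
print(args.command, args.output)
```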
    main()

    bench_results = Path("/Users/adam2392/bench_results_forest")
Can the path be user-provided?
I also allowed the user to pass in the paths.
Signed-off-by: Adam Li <adam2392@gmail.com>
There is a merge conflict to fix.
Thanks for the PR @adam2392. I added a few comments.
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
LGTM. Thanks @adam2392
Reference Issues/PRs
Fixes: #14000
Follow-up to #14001
What does this implement/fix? Explain your changes.
- A `joblib.Parallel` wrapper for scoring samples by computing the depths that the tree reaches per sample.
- A new benchmarking script I ran that shows very encouraging results.
Discussion of benchmarking
The four lines are isolation forests predicting on a test set with different `n_jobs` enabled, over different sample sizes in the test set (x-axis). Each dot has different `n_features` and `contamination`, to show that the `n_jobs` speedup actually works and is not just a random function of the different set of hyperparameters.
- Faster than `main`, as expected, if you compare the left/right plots.
- Scales with `n_jobs`, as expected: as `n_jobs` increases, the left-hand plot shows the prediction time decreases. The right-hand plot shows no difference, as `main` does not predict in parallel.

Any other comments?
Possibly related, since isolation forest can now also use NaNs: #27966
cc: @sergiormpereira @thomasjpfan @adrinjalali