ENH Enable prediction of isolation forest in parallel #28622
Conversation
Signed-off-by: Adam Li <adam2392@gmail.com>
sklearn/ensemble/_iforest.py
Outdated
    The predict method can be parallelized by setting a joblib context. This
    inherently does NOT use the ``n_jobs`` parameter initialized in the class,
    which is used during ``fit``. This is because, predict may actually be faster
    without parallelization for a small number of samples. The user can set the
Suggested change:
- without parallelization for a small number of samples. The user can set the
+ without parallelization for a small number of test-samples of 1000 or less. The user can set the
We could mention this explicitly?
I am okay with being explicit here.
I've updated it to mention that it is better not to parallelize for 1000 samples or fewer.
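The usage being discussed can be sketched as follows. This is a minimal, hedged example assuming the predict-time joblib-context support this PR adds; the estimator's own `n_jobs` only affects `fit`:

```python
from joblib import parallel_backend
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_samples=5000, n_features=8, random_state=0)
clf = IsolationForest(n_estimators=100, random_state=0).fit(X)

# Predict-time parallelism comes from the surrounding joblib context, not
# from the estimator's n_jobs. Per the docstring discussion above, for
# small test sets (roughly 1000 samples or fewer) the joblib overhead can
# outweigh the speedup, so a plain clf.predict(X) may be faster there.
with parallel_backend("threading", n_jobs=4):
    labels = clf.predict(X)

print(labels.shape)
```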
Thanks for reviving the PR. I will wait to see if any of the people you mentioned want to say something regarding "should this be done" and "how to do it". It seems reasonable to me after spending five minutes thinking about it. I like using a context manager to set the parallelism at predict time; that seems like a smart idea. One thing I struggled with is reading the plots, in particular what the difference is between the left one and the right one, and why the right-hand one has so many lines but only two legend entries. Maybe you can tweak the visuals a bit or write a few sentences to help read the plots (what should I be looking for on each, what do they demonstrate, what question do they answer).
Signed-off-by: Adam Li <adam2392@gmail.com>
Hi @betatim thanks for taking a look. I re-plotted the benchmarking result (it is from running the new benchmarking script on isolation forest predict times), and added a description of what I observed in the PR description. tldr: predicting in parallel with
Re "should this be done": the GH issue and the related outdated PR have a discussion that points to the fact that this should be done, the motivation being to speed up isolation forest predictions (similar to how random forest predicts in parallel). However, it is agreed that there is overhead when using joblib, and this manifests at low sample sizes, as confirmed by the benchmark. Thus "how this should be done" is addressed next. Re "how this should be done": the design is partly inspired by @thomasjpfan's suggestions in #14001 (comment) and is related to "should this be done" because now we can control when to predict in parallel with a joblib context manager. Lmk if these addressed your concerns!
Signed-off-by: Adam Li <adam2392@gmail.com>
sklearn/ensemble/_iforest.py
Outdated
    from joblib import parallel_backend

    with parallel_backend('loky', n_jobs=4):
Does the loky backend work with `require="sharedmem"`? I recall loky is a "better multiprocessing", and `_parallel_compute_tree_depths` is using a `threading.Lock`, which is a lock for multi-threading.
Technically the loky backend does not, but the docs on Parallel sound like it handles that:
require: ‘sharedmem’ or None, default=None
Hard constraint to select the backend. If set to ‘sharedmem’, the selected backend will be single-host and thread-based even if the user asked for a non-thread based backend with [parallel_config()](https://joblib.readthedocs.io/en/latest/generated/joblib.parallel_config.html#joblib.parallel_config).
https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html
I tested it out and the results were the same. In addition, if you try
with parallel_config("loky", n_jobs=n_jobs):
or
with parallel_config("threading", n_jobs=n_jobs):
and print the active backend inside the parallel function, it always changes to ThreadingBackend, because we use `require='sharedmem'`.
But I think the docstring should say `'threading'` just to be explicit and safe, so I changed that.
For reference, see: cd185ca
Signed-off-by: Adam Li <adam2392@gmail.com>
    # df = pd.DataFrame(results)
    # df.to_csv("~/bench_results_forest/pr-threading.csv", index=False)
This is commented out. Can this benchmark function have two commands? One for running the benchmark and saving the results to disk, and another one for plotting the results?
I fixed this and made two commands for the benchmark.
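A two-command layout like the one requested above could be sketched with `argparse` subcommands (this is purely illustrative; the program name, subcommand names, and flags are assumptions, not the script's actual interface):

```python
import argparse

def build_parser():
    # One subcommand runs the benchmark and writes a CSV; the other reads
    # a previously saved CSV and plots it. Names are hypothetical.
    parser = argparse.ArgumentParser(prog="bench_iforest.py")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="run the benchmark and save results to disk")
    run.add_argument("--output", default="bench_results.csv")

    plot = sub.add_parser("plot", help="plot previously saved results")
    plot.add_argument("--input", default="bench_results.csv")
    return parser

# Example invocation of the "run" subcommand with an explicit output path.
args = build_parser().parse_args(["run", "--output", "pr-threading.csv"])
print(args.command, args.output)
```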
    main()

    bench_results = Path("/Users/adam2392/bench_results_forest")
Can the path be user-provided?
I also allowed the user to pass in the paths.
Signed-off-by: Adam Li <adam2392@gmail.com>
There is a merge conflict to fix.
Thanks for the PR @adam2392. I added a few comments.
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
LGTM. Thanks @adam2392
Reference Issues/PRs
Fixes: #14000
Follow-up to #14001
What does this implement/fix? Explain your changes.
- A `joblib.Parallel` wrapper for scoring samples by computing the depths that the tree reaches per sample.
- A new benchmarking script I ran that shows very encouraging results.
Discussion of benchmarking
The four lines are isolation forests predicting on a test set with different `n_jobs` enabled, over different sample sizes in the test set (x-axis). Each dot has different `n_features` and `contamination`, to show that the `n_jobs` speedup actually works and is not just a random function of the different set of hyperparameters.
- Faster than `main`, as expected, if you compare the left/right plots.
- Scales with `n_jobs`, as expected: as `n_jobs` increases, the left-hand plot shows the prediction time decreases. The right-hand plot shows no difference, as `main` does not predict in parallel.

Any other comments?
Possibly related, since isolation forest can now also use NaNs: #27966
cc: @sergiormpereira @thomasjpfan @adrinjalali