-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Optimize runtime for IsolationForest #23149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
This looks reasonable to do. What does the benchmark look like for dense data?
After more profiling, it seems like the indexing operation of # only index arrays when needed
X_ = X if max_features == X.shape[1] else X[:, features]
estimator_fit(X_, y, sample_weight=curr_sample_weight) and see improvements for both sparse and dense inputs. While the problem will still exist when |
sklearn/ensemble/_bagging.py
Outdated
@@ -120,10 +135,12 @@ def _parallel_build_estimators( | |||
not_indices_mask = ~indices_to_mask(indices, n_samples) | |||
curr_sample_weight[not_indices_mask] = 0 | |||
|
|||
estimator.fit(X[:, features], y, sample_weight=curr_sample_weight) | |||
# only index arrays when needed | |||
X_ = X if max_features == X.shape[1] else X[:, features] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do this in a separate follow up PR. The check_input
change is already an improvement that can be merged by itself.
As for this change, we can only do this optimization if bootstrap_features
is False. bootstrap_features
is always False for IsolationForest
, but not for Bagging*
. (If bootstrap_features
is True, then the features are sampled with replacement which can result in repeated features.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I've reverted the changes in this PR, and will draft another PR once this got merged :)
This reverts commit ff4e429. revert last commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an entry to the change log at doc/whats_new/v1.1.rst
with tag |Efficiency|. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merging this one. The real fix for the sparse case should happen in a follow-up PR probably targeting 1.2. |
@jeremiedbb I added the "to backport" label to not forget to include this one in the 1.1.X branch before the final 1.1.0 release. |
Reference Issues/PRs
Towards #19275
What does this implement/fix? Explain your changes.
As shown here in the original issue discussion, the
check_input
argument is set to False forForest
classes, while for bagging estimators it's left as default valueTrue
, therefore we're validating the data repeatedly.The proposed fix adds a
check_input
argument to theensemble._bagging._parallel_build_estimators
, which can be set to False when the base estimator actually supports that argument during the fit process.Performance Impact
Code used for profiling:
Before (total time: 6.78s)

After (total time: 5.017s)
