ENH Optimize runtime for IsolationForest #23149

MaxwellLZH · 2022-04-17T14:26:27Z

Reference Issues/PRs

Towards #19275

What does this implement/fix? Explain your changes.

As shown in the original issue discussion, check_input is set to False for the Forest classes, while the bagging estimators leave it at its default value of True, so the input data is validated repeatedly.

The proposed fix adds a check_input argument to ensemble._bagging._parallel_build_estimators, which can be set to False when the base estimator's fit method supports that argument.
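A minimal sketch of the idea, not the code from this PR: it assumes a hypothetical helper accepts_fit_param built on the standard library's inspect.signature (scikit-learn ships a comparable utility, has_fit_parameter), and two toy estimator classes invented purely for illustration. Validation is switched off only when fit really accepts check_input.

```python
from functools import partial
from inspect import signature


def accepts_fit_param(estimator, name):
    """Hypothetical helper: True if estimator.fit has a parameter called `name`."""
    return name in signature(estimator.fit).parameters


class ValidatingEstimator:
    """Toy estimator whose fit can skip input validation."""

    def fit(self, X, y, sample_weight=None, check_input=True):
        self.validated_ = check_input  # record whether validation would run
        return self


class PlainEstimator:
    """Toy estimator whose fit has no check_input parameter."""

    def fit(self, X, y, sample_weight=None):
        self.fitted_ = True
        return self


def make_fit(estimator, check_input):
    """Return a fit callable, disabling validation only when it is supported."""
    if not check_input and accepts_fit_param(estimator, "check_input"):
        return partial(estimator.fit, check_input=False)
    return estimator.fit


X, y = [[0.0], [1.0]], [0, 1]
fast = make_fit(ValidatingEstimator(), check_input=False)(X, y)
plain = make_fit(PlainEstimator(), check_input=False)(X, y)
```

Binding the keyword with partial keeps the call site identical for both kinds of estimator, so the fitting loop does not need to branch on every iteration.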

Performance Impact

Code used for profiling:

from scipy.sparse import csc_matrix
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# the labels are discarded: IsolationForest is unsupervised
X, _ = make_classification(n_samples=50000, n_features=1000)
# sparse CSC input with sorted indices
X = csc_matrix(X)
X.sort_indices()
IsolationForest(n_estimators=10, max_samples=256, n_jobs=1).fit(X)

Before (total time: 6.78s)

After (total time: 5.017s)

thomasjpfan

Thank you for the PR!

This looks reasonable to do. What does the benchmark look like for dense data?

sklearn/ensemble/_bagging.py

MaxwellLZH · 2022-04-18T01:00:13Z

The benchmark for dense data shows a smaller improvement than for sparse input.

Before (8.29s):

After (7.92s):

MaxwellLZH · 2022-04-23T14:30:57Z

After more profiling, it seems that the indexing operation in estimator.fit(X[:, features], y) in _bagging.py::_parallel_build_estimators is quite expensive. I tried skipping the indexing when max_features equals the number of features, using the following logic:

# only index arrays when needed
X_ = X if max_features == X.shape[1] else X[:, features]
estimator_fit(X_, y, sample_weight=curr_sample_weight)

and I see improvements for both sparse and dense inputs.

However, the problem remains when max_features is not set to 1.0 for IsolationForest. Since we randomly pick a single feature to split on anyway, could we always set max_features to 1.0 for IsolationForest to get a faster runtime?

Sparse Input:
Before (4.97s)

After (0.241s)

Dense Input:
Before (7.99s)

After (5.19s)

thomasjpfan · 2022-04-23T17:27:58Z

sklearn/ensemble/_bagging.py

@@ -120,10 +135,12 @@ def _parallel_build_estimators(
                not_indices_mask = ~indices_to_mask(indices, n_samples)
                curr_sample_weight[not_indices_mask] = 0

-            estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)
+            # only index arrays when needed
+            X_ = X if max_features == X.shape[1] else X[:, features]


I would do this in a separate follow up PR. The check_input change is already an improvement that can be merged by itself.

As for this change, we can only do this optimization if bootstrap_features is False. bootstrap_features is always False for IsolationForest, but not for Bagging*. (If bootstrap_features is True, then the features are sampled with replacement which can result in repeated features.)
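The condition above can be illustrated with a small pure-Python sketch. The helper names select_columns and can_skip_indexing are hypothetical and are not part of scikit-learn. The point is that a draw with replacement can have as many entries as there are features and still repeat some columns and miss others, so comparing counts alone is not enough.

```python
def select_columns(X, features):
    """Pure-Python stand-in for X[:, features] on a list-of-rows matrix."""
    return [[row[j] for j in features] for row in X]


def can_skip_indexing(max_features, n_features, bootstrap_features):
    """Hypothetical guard: skip indexing only when every column is used exactly once."""
    return (not bootstrap_features) and max_features == n_features


X = [[10, 20, 30],
     [40, 50, 60]]
n_features = 3

# An example draw WITH replacement: three indices, but column 0 appears
# twice and column 1 is missing.
features_with_replacement = [0, 0, 2]
subset = select_columns(X, features_with_replacement)

# The selected data differs from X even though len(features) == n_features.
differs_from_X = subset != X
```

With bootstrap_features set to False and max_features equal to the number of features, each column is drawn exactly once (possibly in a different order), which is the case the guard allows.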

Got it! I've reverted the changes in this PR and will draft another PR once this one is merged :)

This reverts commit ff4e429. revert last commit

thomasjpfan

Please add an entry to the change log at doc/whats_new/v1.1.rst with tag |Efficiency|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

thomasjpfan

LGTM

ogrisel · 2022-04-29T09:12:12Z

Merging this one. The real fix for the sparse case should happen in a follow-up PR probably targeting 1.2.

ogrisel · 2022-04-29T09:13:33Z

@jeremiedbb I added the "to backport" label to not forget to include this one in the 1.1.X branch before the final 1.1.0 release.

set check_input to False for _bagging

bb8b112

github-actions bot added the module:ensemble label Apr 17, 2022

set check_input to False

7abf9c7

thomasjpfan reviewed Apr 17, 2022

MaxwellLZH added 2 commits April 18, 2022 09:09

use funtools.partial

3ce4ee4

skip array indexing when possible

ff4e429

thomasjpfan reviewed Apr 23, 2022

Revert "skip array indexing when possible"

26f7e84

thomasjpfan reviewed Apr 24, 2022

update whatsnew

66a9cd4

MaxwellLZH requested a review from thomasjpfan April 28, 2022 13:10

thomasjpfan approved these changes Apr 28, 2022

thomasjpfan added the Quick Review label Apr 28, 2022

ogrisel approved these changes Apr 29, 2022

ogrisel merged commit 767e9ae into scikit-learn:main Apr 29, 2022

ogrisel added the To backport label Apr 29, 2022

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022

ENH Optimize runtime for IsolationForest (scikit-learn#23149)

e0bf16a

MaxwellLZH mentioned this pull request Apr 30, 2022

ENH Optimize runtime for IsolationForest #23252

Merged

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request May 10, 2022

ENH Optimize runtime for IsolationForest (scikit-learn#23149)

ac24e40

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022

ENH Optimize runtime for IsolationForest (scikit-learn#23149)

77173f1

glemaitre pushed a commit that referenced this pull request Aug 5, 2022

ENH Optimize runtime for IsolationForest (#23149)

202d50f
