Skip to content

EFF Improve IsolationForest predict time #25186

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jan 24, 2023

Conversation

fsiola
Copy link
Contributor

@fsiola fsiola commented Dec 13, 2022

Reference Issues/PRs

Closes #25150

What does this implement/fix? Explain your changes.

The original implementation of _compute_score_samples had extra processing that can be solved with some caching:

  1. A call to apply and decision_path was iterating over each estimator twice
  2. decision_path returned a csr_matrix just to sum the number of indexes on each entry
  3. _average_path_length is deterministic and can be calculated and stored for caching when fitting the forest

All 3 points are solved by adding look-up variables calculated during fit.

Profile

Dense
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
import numpy as np

X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
X = X.astype(np.float32)
iso_forest = IsolationForest(n_estimators=100, max_samples=256, n_jobs=1, random_state=0).fit(X)
%prun -l 10 iso_forest.predict(X)
Original
         48690 function calls (48386 primitive calls) in 1.760 seconds

   Ordered by: internal time
   List reduced from 152 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      701    0.454    0.001    0.454    0.001 {method 'reduce' of 'numpy.ufunc' objects}
      100    0.417    0.004    0.570    0.006 {method 'decision_path' of 'sklearn.tree._tree.Tree' objects}
      100    0.236    0.002    0.236    0.002 {method 'apply' of 'sklearn.tree._tree.Tree' objects}
      101    0.189    0.002    0.199    0.002 _iforest.py:522(_average_path_length)
        1    0.144    0.144    1.755    1.755 _iforest.py:472(_compute_score_samples)
      100    0.097    0.001    0.097    0.001 {method 'reduceat' of 'numpy.ufunc' objects}
1307/1103    0.063    0.000    0.499    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
      300    0.040    0.000    0.040    0.000 {built-in method numpy.array}
      100    0.015    0.000    0.015    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
      100    0.015    0.000    0.133    0.001 _compressed.py:628(_minor_reduce)

New
         13733 function calls (13729 primitive calls) in 0.468 seconds

   Ordered by: internal time
   List reduced from 79 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.227    0.002    0.227    0.002 {method 'apply' of 'sklearn.tree._tree.Tree' objects}
      101    0.202    0.002    0.202    0.002 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.023    0.023    0.465    0.465 _iforest.py:474(_compute_score_samples)
      101    0.002    0.000    0.213    0.002 validation.py:629(check_array)
      101    0.002    0.000    0.207    0.002 validation.py:96(_assert_all_finite)
      101    0.001    0.000    0.215    0.002 base.py:453(_validate_data)
      102    0.001    0.000    0.001    0.000 validation.py:320(_num_samples)
      102    0.001    0.000    0.001    0.000 validation.py:1377(<listcomp>)
     1925    0.001    0.000    0.001    0.000 {built-in method builtins.hasattr}
      202    0.001    0.000    0.001    0.000 _ufunc_config.py:32(seterr)
Sparse
X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
X = X.astype(np.float32)
X = csc_matrix(X)
iso_forest = IsolationForest(n_estimators=100, max_samples=256, n_jobs=1, random_state=0).fit(X)
%prun -l 10 iso_forest.predict(X)
Original
         51579 function calls (51275 primitive calls) in 2.801 seconds

   Ordered by: internal time
   List reduced from 182 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.843    0.008    1.007    0.010 {method 'decision_path' of 'sklearn.tree._tree.Tree' objects}
      100    0.699    0.007    0.699    0.007 {method 'apply' of 'sklearn.tree._tree.Tree' objects}
      701    0.471    0.001    0.471    0.001 {method 'reduce' of 'numpy.ufunc' objects}
      101    0.204    0.002    0.215    0.002 _iforest.py:522(_average_path_length)
        1    0.154    0.154    2.704    2.704 _iforest.py:472(_compute_score_samples)
      100    0.101    0.001    0.101    0.001 {method 'reduceat' of 'numpy.ufunc' objects}
        1    0.079    0.079    0.079    0.079 {built-in method scipy.sparse._sparsetools.csc_tocsr}
1319/1115    0.063    0.000    0.508    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
      306    0.040    0.000    0.040    0.000 {built-in method numpy.array}
      100    0.017    0.000    0.140    0.001 _compressed.py:628(_minor_reduce)
New
         15322 function calls (15318 primitive calls) in 1.013 seconds

   Ordered by: internal time
   List reduced from 128 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.670    0.007    0.671    0.007 {method 'apply' of 'sklearn.tree._tree.Tree' objects}
      101    0.205    0.002    0.205    0.002 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.078    0.078    0.078    0.078 {built-in method scipy.sparse._sparsetools.csc_tocsr}
        1    0.025    0.025    0.918    0.918 _iforest.py:474(_compute_score_samples)
        3    0.005    0.002    0.005    0.002 {method 'copy' of 'numpy.ndarray' objects}
        1    0.003    0.003    1.012    1.012 _iforest.py:374(decision_function)
        3    0.003    0.001    0.003    0.001 {method 'astype' of 'numpy.ndarray' objects}
        1    0.002    0.002    0.926    0.926 _iforest.py:441(_compute_chunked_score_samples)
      101    0.002    0.000    0.212    0.002 validation.py:96(_assert_all_finite)
     2029    0.002    0.000    0.003    0.000 {built-in method builtins.hasattr}

Benchmark

Using benchmarks/bench_isolation_forest.py

Original
 Dataset: http - Fit time: 0.25837111473083496 - Predict time: 0.6131787300109863
 Dataset: smtp - Fit time: 0.16320514678955078 - Predict time: 0.10492515563964844
 Dataset: SA - Fit time: 0.23685908317565918 - Predict time: 2.7135839462280273
 Dataset: SF - Fit time: 0.25304603576660156 - Predict time: 1.0364980697631836
 Dataset: shuttle - Fit time: 0.23853278160095215 - Predict time: 0.6159451007843018
 Dataset: forestcover - Fit time: 0.3426859378814697 - Predict time: 5.550815105438232
New
Dataset: http - Fit time: 0.26087093353271484 - Predict time: 0.1054680347442627
Dataset: smtp - Fit time: 0.176008939743042 - Predict time: 0.026960134506225586
Dataset: SA - Fit time: 0.25487685203552246 - Predict time: 0.7704362869262695
Dataset: SF - Fit time: 0.30916595458984375 - Predict time: 0.28153276443481445
Dataset: shuttle - Fit time: 0.28773975372314453 - Predict time: 0.14754199981689453
Dataset: forestcover - Fit time: 0.3486030101776123 - Predict time: 1.9315860271453857

@fsiola fsiola changed the title ENH Improve IsolationForest predict time EFF Improve IsolationForest predict time Dec 13, 2022
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. That's a great performance improvement. Thanks for the profile reports, analysis and extensive benchmarks.

Just a few minor comments below:

)

self._average_path_length_per_tree = defaultdict()
self._decision_path_lenghts = defaultdict()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of dicts, I would rather use Python lists using the position of a tree in the self.estimators_ list as positional index to access the precomputed path lengths.

This would avoid holding extra references to the tree objects in the keys of those dicts. I suspect that those extra references add complexity to the model serialization logic, especially for third party libraries for skops.

Copy link
Contributor

@li1127217ye li1127217ye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is necessary to test memory usage.

@fsiola
Copy link
Contributor Author

fsiola commented Dec 21, 2022

I think it is necessary to test memory usage.

@li1127217ye memory usage is increased by

  • 1 extra numpy float32 for _average_path_length_max_samples
  • 2 lists of lists of numpy float32: total number of added numpy float32 is 2 * sum of number of nodes in all trees.

@fsiola
Copy link
Contributor Author

fsiola commented Dec 21, 2022

@ogrisel this change introduces a breaking change for Isolation forests fitted before it. It wont be able to do predict. How should this be handled? Keep the old predict code for models fitted before this version with warning and mark for deprecation in future version?

@fsiola fsiola requested a review from ogrisel December 23, 2022 11:50
@betatim
Copy link
Member

betatim commented Jan 16, 2023

How should this be handled? Keep the old predict code for models fitted before this version with warning and mark for deprecation in future version?

I don't think scikit-learn makes any promises related to this. For example check out #25297

@glemaitre glemaitre self-requested a review January 16, 2023 14:37
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, I propose an optimization of the code computing the node depth in cython

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a pass on the other computation. A couple of comments but I think I am fine with it.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR still looks good to me. I pushed a commit to remove the duplicated input validation.

I measured a typical 5x speed-up at prediction time on synthetic data. fit time is not significantly impacted.

I think we can remove the new Python level API since the Cython level API is already public.

fsiola and others added 3 commits January 23, 2023 22:19
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@fsiola
Copy link
Contributor Author

fsiola commented Jan 23, 2023

After all suggested changes this is the new benchmark I got

Dataset: http - Fit time: 0.25515103340148926 - Predict time: 0.09728884696960449
Dataset: smtp - Fit time: 0.17158722877502441 - Predict time: 0.01819896697998047
Dataset: SA - Fit time: 0.23712801933288574 - Predict time: 0.19017887115478516
Dataset: SF - Fit time: 0.25872802734375 - Predict time: 0.1412971019744873
Dataset: shuttle - Fit time: 0.1987318992614746 - Predict time: 0.07536911964416504
Dataset: forestcover - Fit time: 0.3157329559326172 - Predict time: 0.5752010345458984

Looks a lot better than what the PR started with.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Tim Head <betatim@gmail.com>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I am wondering if we should add a small test. I don't recall if we are testing directly the Cython Tree class or we do it indirectly through the Python API of estimators using it.

@glemaitre
Copy link
Member

Apparently, we don't. Let's rely on the iforest tests then.
Enabling auto-merge. Thanks @fsiola

@glemaitre glemaitre enabled auto-merge (squash) January 24, 2023 09:37
@glemaitre glemaitre merged commit 9956210 into scikit-learn:main Jan 24, 2023
AdarshPrusty7 added a commit to AdarshPrusty7/GSGP that referenced this pull request Mar 6, 2023
* ENH Raise NotFittedError in get_feature_names_out for MissingIndicator, KBinsDiscretizer, SplineTransformer, DictVectorizer (scikit-learn#25402)

Co-authored-by: Alex <alex.buzenet.fr@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* DOC Update date and contributors list for v1.2.1 (scikit-learn#25459)

* DOC Make MeanShift documentation clearer (scikit-learn#25305)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* Finishes boolean and arithmetic creation

* Skeleton for traditional GP

* DOC Reorder whats_new/v1.2.rst (scikit-learn#25461)

Follow-up of scikit-learn#25459

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

* FIX fix faulty test in `cross_validate` that used the wrong estimator (scikit-learn#25456)

* ENH Raise NotFittedError in get_feature_names_out for estimators that use ClassNamePrefixFeatureOutMixin and SelectorMixin (scikit-learn#25308)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* EFF Improve IsolationForest predict time (scikit-learn#25186)

Co-authored-by: Felipe Breve Siola <felipe.breve-siola@klarna.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Tim Head <betatim@gmail.com>

* MAINT refactor spectral_clustering to call SpectralClustering (scikit-learn#25392)

* TST reduce warnings in test_logistic.py (scikit-learn#25469)

* CI Build doc on CircleCI (scikit-learn#25466)

* DOC Update news footer for 1.2.1 (scikit-learn#25472)

* MAINT Validate parameter for `sklearn.cluster.cluster_optics_xi` (scikit-learn#25385)

Co-authored-by: adossantosalfam <anthony.dos_santos_alfama@insa-rouen.fr>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* MAINT Parameters validation for additive_chi2_kernel (scikit-learn#25424)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* Initial Program Creation

* CI Include linting in CircleCI (scikit-learn#25475)

* MAINT Update version number to 1.2.1 in SECURITY.md (scikit-learn#25471)

* TST Sets random_state for test_logistic.py (scikit-learn#25446)

* MAINT Remove -Wcpp warnings when compiling sklearn.decomposition._online_lda_fast (scikit-learn#25020)

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

* FIX Support readonly sparse datasets for `manhattan_distances`  (scikit-learn#25432)

* TST Add non-regression test for scikit-learn#7981

This reproducer is adapted from the one of this message:
scikit-learn#7981 (comment)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>

* FIX Support readonly sparse datasets for manhattan

* DOC Add entry in whats_new/v1.2.rst for 1.2.1

* FIX Fix comment

* Update sklearn/metrics/tests/test_pairwise.py

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

* DOC Move entry to whats_new/v1.3.rst

* Update sklearn/metrics/tests/test_pairwise.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* MAINT dynamically expose kulsinski and remove support in BallTree (scikit-learn#25417)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
closes scikit-learn#25212

* DOC Adds CirrusCI badge to readme (scikit-learn#25483)

* CI add linter display name (scikit-learn#25485)

* DOC update description of X in `FunctionTransformer.transform()`  (scikit-learn#24844)

* MAINT remove -Wcpp warnings when compiling sklearn.preprocessing._csr_polynomial_expansion (scikit-learn#25041)

* DOC more didactic example of bisecting kmeans (scikit-learn#25494)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* ENH csr_row_norms optimization (scikit-learn#24426)

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

* TST Allow callables as valid parameter regarding cloning estimator (scikit-learn#25498)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: From: Tim Head <betatim@gmail.com>

* DOC Fixes sphinx search on website (scikit-learn#25504)

* FIX make IsotonicRegression always predict NumPy arrays (scikit-learn#25500)



Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* FEA Add Gamma deviance as loss function to HGBT (scikit-learn#22409)

* FEA add gamma loss to HGBT

* DOC add whatsnew

* CLN address review comments

* TST make test_gamma pass by not testing out-of-sample

* TST compare gamma and poisson to LightGBM

* TST fix test_gamma by comparing to MSE HGBT instead of Poisson HGBT

* TST fix for test_same_predictions_regression for poisson

* CLN address review comments

* CLN nits

* CLN better comments

* TST use pytest.param with skip mark

* TST Correct conditional test parametrization mark

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

* CI Trigger CI

Builds currently fail because requests to Azure Ubuntu repository
timeout.

* DOC add comment for lax comparison with LightGBM

* CLN tuple needs trailing comma

---------

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

* MAINT Remove -Wsign-compare warnings when compiling sklearn.tree._tree (scikit-learn#25507)

* MAINT add more intuition on OAS computation based on literature (scikit-learn#23867)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* CI Allow cirrus arm tests to run with cd build commit tag (scikit-learn#25514)

* CI Upload ARM wheels from CirrusCI to nightly and staging index (scikit-learn#25513)



Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* MAINT Remove -Wcpp warnings from sklearn.utils._seq_dataset (scikit-learn#25406)

* FIX Fixes linux ARM CI on CirrusCI (scikit-learn#25536)

* DOC Fix grammatical mistake in `mixture` module (scikit-learn#25541)

* DOC add missing trailing colon (scikit-learn#25542)

* MAINT Parameters validation for sklearn.datasets.make_classification (scikit-learn#25474)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* MNT Expose allow_nan tag in bagging (scikit-learn#25506)

* MAINT Clean-up comments and rename variables in `_middle_term_sparse_sparse_{32, 64}` (scikit-learn#25449)

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

* DOC: remove incorrect statement (scikit-learn#25544)

* MAINT Parameters validation for reconstruct_from_patches_2d (scikit-learn#25384)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* MAINT Parameter validation for sklearn.metrics.d2_pinball_score (scikit-learn#25414)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Parameters validation for spectral_clustering (scikit-learn#25378)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* MAINT Parameters validation for sklearn.datasets.fetch_kddcup99 (scikit-learn#25463)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* DOC Update MLPRegressor docs (scikit-learn#25556)

Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>

* DOC Update docs for KMeans (scikit-learn#25546)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* FIX BisectingKMeans crashes randomly (scikit-learn#25563)

Fixes scikit-learn#25505

* ENH BaseLabelPropagation to accept sparse matrices (scikit-learn#19664)

Co-authored-by: Kaushik Amar Das <kaushik.amar.das@accenture.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* MAINT Remove travis ci config and related doc (scikit-learn#25562)

* DOC Add pynndescent to Approximate nearest neighbors in TSNE example (scikit-learn#25480)


Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* DOC Add docstring example to make_regression (scikit-learn#25551)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT ensure that pos_label support all possible types (scikit-learn#25317)

* MAINT Parameters validation for sklearn.metrics.f1_score (scikit-learn#25557)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* ENH Adds `class_names` to `tree.export_text` (scikit-learn#25387)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* MAINT Replace cnp.ndarray with memory views in sklearn.tree._tree (where possible) (scikit-learn#25540)

* DOC Change print format in TSNE example (scikit-learn#25569)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* FIX ColumnTransformer supports empty selection for pandas output (scikit-learn#25570)

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

* DOC fix docstring of _plain_sgd (scikit-learn#25573)

* FIX Enable setting of sub-parameters for deprecated base_estimator param (scikit-learn#25477)

* DOC Improve minor and bug-fix release processes documentation (scikit-learn#25457)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@yahoo.fr>

* MAINT Remove ReadonlyArrayWrapper from _loss module (scikit-learn#25555)

* MAINT Remove ReadonlyArrayWrapper from _loss module

* CLN Remove comments about Cython 3.0

* MAINT Remove ReadonlyArrayWrapper from _kmeans (scikit-learn#25554)

* MAINT Remove ReadonlyArrayWrapper from _kmeans

* more const and remove blas compile warnings

* CLN Adds comment about casting to non const pointers

* Update sklearn/utils/_cython_blas.pyx

* MAINT Remove ReadonlyArrayWrapper from DistanceMetric (scikit-learn#25553)

* DOC improve stop_words description w.r.t. max_df range in CountVectorizer (scikit-learn#25489)

* MAINT Removes ReadOnlyWrapper (scikit-learn#25586)

* MAINT Parameters validation for sklearn.metrics.log_loss (scikit-learn#25577)

* MAINT Adds comments and better naming into tree code (scikit-learn#25576)

* MAINT Adds comments and better naming into tree code

* CLN Use feature_values instead of Xf

* Apply suggestions from code review

Co-authored-by: Adam Li <adam2392@gmail.com>

* DOC Improve comment from review

* Apply suggestions from code review

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

---------

Co-authored-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* FIX error when deserialzing a Tree instance from a read only buffer (scikit-learn#25585)

* DOC: fix typo in California Housing dataset description (scikit-learn#25613)

* ENH: Update KDTree, and example documentation (scikit-learn#25482)

* ENH: Update KDTree, and example documentation

* ENH: Add valid metric function and reference doc

* CHG: Documentation update

Co-authored-by: Adam Li <adam2392@gmail.com>

* CHG: make valid metric property and fix doc string

* FIX: documentation, and add code example

* ENH: Change valid metric to class method, and doc

* ENH: Change valid metric class variable, and doc

* FIX: documentation error

* FIX: documentation error

* CHG: Use class method for valid metrics

* FIX: CI problems

---------

Co-authored-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

* TST Common test for checking estimator deserialization from a read only buffer (scikit-learn#25624)

* DOC fix comment in plot_logistic_l1_l2_sparsity.py (scikit-learn#25633)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* DOC Places governance in navigation bar (scikit-learn#25618)

* MAINT Check pyproject toml is consistent with min_dependencies (scikit-learn#25610)

* MAINT Check pyproject toml is consistent with min_dependencies

* CLN Make it clear that only SciPy and Cython are checked

* CLN Revert auto formatter

* MAINT Use newest NumPy C API in tree._criterion (scikit-learn#25615)

* MAINT Use newest NumPy C API in tree._criterion

* FIX Use pointer for children

* FIX Fixes check_array nonfinite checks with ArrayAPI specification (scikit-learn#25619)

* FIX Fixes check_array nonfinite checks with ArrayAPI specification

* DOC Adds PR number

* FIX Test on both cupy and numpy

* DOC Correctly docstring in StackingRegressor.fit_transform (scikit-learn#25599)

* MAINT Remove Cython compilation warnings ahead of Cython3.0 release (scikit-learn#25621)

* ENH Preserve DataFrame dtypes in transform for feature selectors (scikit-learn#25102)

* FIX report properly n_iter_ when warm_start=True (scikit-learn#25443)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* DOC fix typo in KMeans's param. (scikit-learn#25649)

* FIX use const memory views in hist_gradient_boosting predictor (scikit-learn#25650)

* DOC modified the graph for better readability (scikit-learn#25644)

* MAINT Removes upper limit on setuptools (scikit-learn#25651)

* DOC improve the `warm_start` glossary entry (scikit-learn#25523)

* DOC Update governance document for SLEP020 (scikit-learn#25663)



Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

* FIX renormalization of y_pred inside log_loss (scikit-learn#25299)

* Remove renormalization of y_pred inside log_loss

* Deprecate eps parameter in log_loss

* ENH Allows target to be pandas nullable dtypes (scikit-learn#25638)

* DOC unify usage of 'w.r.t.' (scikit-learn#25683)

* MAINT Parameters validation for metrics.max_error (scikit-learn#25679)

* MAINT Parameters validation for datasets.make_friedman1 (scikit-learn#25674)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

* MAINT Parameters validation for mean_pinball_loss (scikit-learn#25685)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

* DOC Specify behavior of None for CountVectorizer (scikit-learn#25678)

* DOC Specify behaviour of None for TfIdfVectorizer max_features parameter (scikit-learn#25676)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* MAINT Set random state for plot_anomaly_comparison (scikit-learn#25675)

* MAINT Parameters validation for cluster.mean_shift (scikit-learn#25684)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

* MAINT Parameters validation for sklearn.metrics.jaccard_score (scikit-learn#25680)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

* DOC Add the custom compiler section back (scikit-learn#25667)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* MAINT Parameters validation for precision_recall_fscore_support (scikit-learn#25681)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

* FIX Allow negative tol in SequentialFeatureSelector (scikit-learn#25664)

* MAINT Replace deprecated cython conditional compilation (scikit-learn#25654)



Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

* DOC fix formatting typo in related_projects (scikit-learn#25706)

* MAINT Parameters validation for metrics.mean_absolute_percentage_error (scikit-learn#25695)

* MAINT Parameters validation for metrics.precision_recall_curve (scikit-learn#25698)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

* MAINT Parameter Validation for metrics.precision_score (scikit-learn#25708)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>

* CI Stablize build with random_state (scikit-learn#25701)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Remove -Wcpp warnings when compiling arrayfuncs (scikit-learn#25415)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* DOC Add scikit-learn-intelex to related projects (scikit-learn#23766)

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

* ENH Support float32 in SGDClassifier and SGDRegressor (scikit-learn#25587)

* FIX Raise appropriate attribute error in ensemble (scikit-learn#25668)

* FIX Allow OrdinalEncoder's encoded_missing_value set to the cardinality (scikit-learn#25704)

* ENH Let csr_row_norms support multi-thread (scikit-learn#25598)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>

* MAINT Parameter Validation for feature_selection.chi2 (scikit-learn#25719)

Co-authored-by: jeremiedbb <jeremiedbb@yahoo.fr>

* MAINT Parameter Validation for feature_selection.f_classif (scikit-learn#25720)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Parameters validation for sklearn.metrics.matthews_corrcoef (scikit-learn#25712)

Co-authored-by: jeremiedbb <jeremiedbb@yahoo.fr>

* MAINT parameter validation for sklearn.datasets.dump_svmlight_file (scikit-learn#25726)

Co-authored-by: jeremiedbb <jeremiedbb@yahoo.fr>

* MAINT Clean dead code in build helpers (scikit-learn#25661)

* MAINT Use newest NumPy C API in metrics._dist_metrics (scikit-learn#25702)

* CI Adds permissions to workflows that use GITHUB_TOKEN (scikit-learn#25600)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

* FIX Improves error message in partial_fit when early_stopping=True (scikit-learn#25694)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* DOC Makes navbar static (scikit-learn#25688)

* MAINT Remove redundant sparse square euclidian distances function (scikit-learn#25731)

* MAINT Use float64 for accumulators in WeightVector* (scikit-learn#25721)

* API make PatchExtractor being a real scikit-learn transformer (scikit-learn#24230)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Update pyparsing.py to use bool instead of double negation (scikit-learn#25724)

* API Deprecates values in partial_dependence in favor of pdp_values (scikit-learn#21809)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* API Use grid_values instead of pdp_values in partial_dependence (scikit-learn#25732)

* MAINT remove np.product and inf/nan aliases in favor of canonical names (scikit-learn#25741)

* MAINT Parameters validation for metrics.label_ranking_loss (scikit-learn#25742)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Parameters validation for metrics.coverage_error (scikit-learn#25748)

* MAINT Parameters validation for metrics.dcg_score (scikit-learn#25749)

* MAINT replace cnp.ndarray with memory views in _fast_dict (scikit-learn#25754)

* MAINT Parameter Validation for feature_selection.f_regression (scikit-learn#25736)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Parameters validation for feature_selection.r_regression (scikit-learn#25734)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Parameter Validation for metrics.get_scorer (scikit-learn#25738)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* DOC Move allowing pandas nullable dtypes to 1.2.2 (scikit-learn#25692)

* MAINT replace cnp.ndarray with memory views in sparsefuncs_fast (scikit-learn#25764)

* MAINT parameter validation for sklearn.datasets.fetch_covtype (scikit-learn#25759)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

* MAINT Define centralized generic, but with explicit precision, types (scikit-learn#25739)

* CI Disable network when SciPy requires it (scikit-learn#25743)

* CI Open issue when arm wheel fails on CirrusCI (scikit-learn#25620)

* ENH Speed-up expected mutual information (scikit-learn#25713)

Co-authored-by: Kshitij Mathur <k.mathur68@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>

* FIX add retry mechanism to handle quotechar in read_csv (scikit-learn#25511)

* Merge Population Creation (#1)

---------

Co-authored-by: Alex Buzenet <94121450+albuzenet@users.noreply.github.com>
Co-authored-by: Alex <alex.buzenet.fr@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Adam Kania <48769688+remilvus@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Co-authored-by: Shady el Gewily <90049412+shadyelgewily-slimstock@users.noreply.github.com>
Co-authored-by: John Pangas <swiftyxswaggy@outlook.com>
Co-authored-by: Felipe Siola <fsiola@gmail.com>
Co-authored-by: Felipe Breve Siola <felipe.breve-siola@klarna.com>
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Co-authored-by: Anthony22-dev <122220081+Anthony22-dev@users.noreply.github.com>
Co-authored-by: adossantosalfam <anthony.dos_santos_alfama@insa-rouen.fr>
Co-authored-by: Xiao Yuan <yuanx749@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Rahil Parikh <75483881+rprkh@users.noreply.github.com>
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
Co-authored-by: davidblnc <40642621+davidblnc@users.noreply.github.com>
Co-authored-by: Changyao Chen <changyao.chen@gmail.com>
Co-authored-by: Nicola Fanelli <48762613+nicolafan@users.noreply.github.com>
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
Co-authored-by: partev <petrosyan@gmail.com>
Co-authored-by: ouss1508 <121971998+ouss1508@users.noreply.github.com>
Co-authored-by: ashah002 <97778401+ashah002@users.noreply.github.com>
Co-authored-by: Ahmedbgh <83551938+Ahmedbgh@users.noreply.github.com>
Co-authored-by: Pooja M <90301980+pm155@users.noreply.github.com>
Co-authored-by: Ian Thompson <ianiat11@gmail.com>
Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
Co-authored-by: SANJAI_3 <86285670+sanjail3@users.noreply.github.com>
Co-authored-by: Kaushik Amar Das <cozek@users.noreply.github.com>
Co-authored-by: Kaushik Amar Das <kaushik.amar.das@accenture.com>
Co-authored-by: Nawazish Alam <nawazishmail@gmail.com>
Co-authored-by: William M <64324808+Akbeeh@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@yahoo.fr>
Co-authored-by: JanFidor <66260538+JanFidor@users.noreply.github.com>
Co-authored-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Logan Thomas <logan.thomas005@gmail.com>
Co-authored-by: Vyom Pathak <angerstick3@gmail.com>
Co-authored-by: as-90 <88336957+as-90@users.noreply.github.com>
Co-authored-by: Marvin Krawutschke <101656586+Marvvxi@users.noreply.github.com>
Co-authored-by: Haesun Park <haesunrpark@gmail.com>
Co-authored-by: Christine P. Chai <star1327p@gmail.com>
Co-authored-by: Christian Veenhuis <124370897+ChVeen@users.noreply.github.com>
Co-authored-by: Sortofamudkip <wishyutp0328@gmail.com>
Co-authored-by: sonnivs <48860780+sonnivs@users.noreply.github.com>
Co-authored-by: Ali H. El-Kassas <aliabdelmonem234@gmail.com>
Co-authored-by: Yusuf Raji <raji.yusuf234@gmail.com>
Co-authored-by: Tabea Kossen <tabeakossen@gmail.com>
Co-authored-by: Pooja Subramaniam <poojas2086@gmail.com>
Co-authored-by: JuliaSchoepp <63353759+JuliaSchoepp@users.noreply.github.com>
Co-authored-by: Jack McIvor <jacktmcivor@gmail.com>
Co-authored-by: zeeshan lone <56621467+still-learning-ev@users.noreply.github.com>
Co-authored-by: Max Halford <maxhalford25@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: genvalen <genvalen@protonmail.com>
Co-authored-by: Shiva chauhan <103742975+Shivachauhan17@users.noreply.github.com>
Co-authored-by: Dayne <daynesorvisto@yahoo.ca>
Co-authored-by: Ralf Gommers <ralf.gommers@gmail.com>
Co-authored-by: Kshitij Mathur <k.mathur68@gmail.com>
@HarryCollins2
Copy link

Great work on PR #25186 for improving IsolationForest predict time! Considering older versions of scikit-learn are still widely used, could we think about backporting this improvement to some of the earlier versions where this performance issue might also exist?

@glemaitre
Copy link
Member

Considering older versions of scikit-learn are still widely used, could we think about backporting this improvement to some of the earlier versions where this performance issue might also exist?

The only backport that we sometimes intend are related to security fixes but not improvement. So I would say that the answer is no.

@HarryCollins2
Copy link

Considering older versions of scikit-learn are still widely used, could we think about backporting this improvement to some of the earlier versions where this performance issue might also exist?

The only backport that we sometimes intend are related to security fixes but not improvement. So I would say that the answer is no.

Thanks for your response! I understand security fixes are priority, but performance issues also significantly affect user experience. Many applications rely on matplotlib, and updating can be difficult due to dependency issues. Noting how pandas handles key performance fixes via backports (e.g., pandas/pull/52548, pandas/pull/54528), could matplotlib adopt a similar approach for critical issues? Using tools like @meeseeksmachine could minimize manual efforts if this is a major concern.

@glemaitre
Copy link
Member

Many applications rely on matplotlib, and updating can be difficult due to dependency issues.

This is one of the reason that scikit-learn depends only from SciPy and NumPy. The second thing that we try absolutely is to support old version of these dependencies (even for the optional one).

Noting how pandas handles key performance fixes via backports

The backports (at least in the example) are only backporting in the same minor version. If the efficiency is linked to a regression or is really beneficial, we do the same. I would be interested to see a backbport from one minor version to other minor version (e.g. from 2.3 to 2.2) because I am not sure that any open-source project with limited resources will intend it.

Using tools like @meeseeksmachine could minimize manual efforts if this is a major concern.

We had a look at it in the past. It implied that we needed our development flow and was putting the burden of backporting from the release manager onto all core developers. Another concern was in the way that we currently handle the backlog that would create most of the time conflict and we would need to manage the backporting by hand anyway. So we did not think it worth switching until we come up with a better way to handle our changelog (we should do something similar to the Python project). But it requires a good amount of time exploring.

@fsiola
Copy link
Contributor Author

fsiola commented Jan 11, 2024

It is possible to write a wrapper to the IsolationForest in versions < 1.3.0 to implement this performance improvement.

You would basically need to take the original IsolationForest, calculate the path lengths on __init__ and patch some methods to use the pre-calculated paths lengths. It is not a proper backport, but something you can add to your current code as needed.

@glemaitre
Copy link
Member

I see that we modify some Cython code, this could be tricky to monkey patch.

@fsiola
Copy link
Contributor Author

fsiola commented Jan 11, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improving IsolationForest predict time
7 participants