
FEA Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754


Draft · wants to merge 103 commits into main

Conversation

adam2392 (Member) commented Mar 10, 2022

Reference Issues/PRs

Closes: #20819

This is ready for review.

What does this implement/fix? Explain your changes.

Implements oblique trees/splitter as a subclass of the existing Tree/Splitter. This further allows extensibility of the sklearn Tree and Splitter to allow downstream packages that have to define more complex and exotic sampling mechanisms. The total Cython code that constitutes the logic addition is around ~950 LOC, while the rest are from unit tests, and adding the Python API.

This extends decision trees to random linear combinations of features, the oblique trees proposed by Breiman in 2001 (Forest-RC). It will enable 3rd-party packages to subclass this code and instantiate other trees that operate on "combinations" of features in some way (e.g. a weighted sum, or a kernel). The general intuition is that oblique forests (OF) can sample a more diverse set of candidate splits, enabling better generalization and robustness to high-dimensional noise. Random forests (RF) should be used if the user suspects all informative structure is aligned with the feature axes. The tradeoffs in training time, model size, and scoring time are complex: OF can fit shorter trees but must store additional data per node, and it performs extra operations whose cost is often negligible. We suggest trying OF first whenever you have the computational and storage flexibility, and RF when you have very large sample sizes and strict runtime/storage constraints.
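As a rough illustration (a plain-Python sketch with hypothetical names, not the PR's Cython implementation), an oblique split thresholds a linear combination of features instead of a single feature:

```python
def oblique_projection(X, feature_idx, weights):
    """Project each sample onto a sparse linear combination of features."""
    return [sum(w * row[j] for j, w in zip(feature_idx, weights)) for row in X]

def oblique_split(X, feature_idx, weights, threshold):
    """Partition sample indices by thresholding the projected values."""
    proj = oblique_projection(X, feature_idx, weights)
    left = [i for i, p in enumerate(proj) if p <= threshold]
    right = [i for i, p in enumerate(proj) if p > threshold]
    return left, right

# An axis-aligned split is the special case of one feature with weight 1.
X = [[1.0, 2.0], [3.0, 0.5], [0.2, 0.2]]
left, right = oblique_split(X, feature_idx=[0, 1], weights=[1.0, -1.0], threshold=0.0)
# -> left = [0, 2], right = [1]
```

This is why axis-aligned trees fall out as a special case, and why downstream packages can swap in other "combinations" (kernels, manifold projections) by changing only the projection step.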

Experiments supporting the change

At the request of Thomas Fan, Guillaume, and Olivier, we ran the following experiments.

Docs changes for education and user-guidance
  • extension of the examples/tree/plot_iris_dtc.py to include oblique trees
  • modules/tree.rst with a new section on oblique trees
  • modules/ensemble.rst with a new section on oblique forests
  • A new example in examples/ensemble/plot_oblique_axis_aligned_forests.py demonstrating cases where oblique forests outperform axis-aligned forests, on a real dataset from OpenML and on a simulated dataset

The changes to the setup.py files were necessary to compile the package in some of the CI pipelines, although compilation worked without them locally on my Mac M1 machine.

Tracking Progress on sub-items - used for keeping track of challenges that arose (optional read)

The remaining items to complete are:

  • Add unit tests
  • Refactor to use memory views as in normal splitter (https://github.com/scikit-learn/scikit-learn/pull/23273/files)
  • Fix pickling issue in FEA Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754 (comment)
  • Add feature_importances, or specify an error message if it's not allowed.
  • Add an extension of the RF-Iris dataset example done by Jong Shin to the docs, plus a cleaner synthetic dataset. Also consider another dataset, e.g. digits (8 x 8 images), where oblique forests can be shallower/smaller than RandomForest: train OF/RF with 100 trees, subsample the forest, and plot accuracy and max_depth. The goal is a few more candidate examples for the docs: one showing an intuitive visualization, and another demonstrating possible superiority on a real example (e.g. for digits, horizontally stack the real signal with permuted noisy copies). Put these in a separate notebook for now, for downstream integration into scikit-learn. End goal: educate users on the inductive bias of oblique trees and when to use them.
  • Determine if minor refactoring is needed with TreeBuilder/Tree for adding nodes (see: [MAINT] Improve extensibility of the Tree/Splitter code #22756 (comment))

Any other comments?

This PR demonstrates the end product. To simplify the merging process, I propose merging it as a series of smaller PRs.

One more note: training speed for both oblique and axis-aligned decision trees could be further improved via quantization (binning) of continuous feature values, i.e. the same idea used in histogram gradient boosting.
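A sketch of that binning idea, with equal-width bins for simplicity (real implementations such as histogram gradient boosting use quantile-based bins and handle missing values):

```python
def bin_feature(values, n_bins):
    """Quantize a continuous feature into equal-width integer bins.

    A splitter then only needs to scan n_bins candidate thresholds instead of
    one per unique value, which is the source of the training speedup.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

binned = bin_feature([0.0, 0.5, 1.0], n_bins=4)
# -> [0, 2, 3]
```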

@adam2392 adam2392 marked this pull request as draft March 10, 2022 20:33
@adam2392 adam2392 force-pushed the obliquepr branch 2 times, most recently from 5b74193 to 4a4b4bb on March 15, 2022 02:38
adam2392 (Member, Author) commented Mar 15, 2022

Linking convo as reference: neurodata#16 and specifically neurodata#16 (comment).

I am going to proceed by adding some lightweight unit tests to determine integrability, and by getting a working version that we can come back to.

Then I will try out the idea of replacing SplitRecord and friends with a cdef class (maybe even a dataclass?), which would be used to pass the data structure between node_split and add_node.
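The dataclass idea could look roughly like this; the field names are guesses mirroring the existing SplitRecord struct, not the actual proposal:

```python
from dataclasses import dataclass

@dataclass
class SplitRecord:
    """Python-level sketch of the record passed from node_split to add_node."""
    feature: int        # index of the split feature (or projection)
    threshold: float    # decision threshold on the (projected) value
    pos: int            # index where the node's samples are partitioned
    improvement: float  # impurity improvement achieved by the split

record = SplitRecord(feature=3, threshold=0.5, pos=10, improvement=0.12)
```

A class (rather than a C struct) would let subclasses such as an oblique splitter extend the record with extra fields, e.g. the projection weights.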

adam2392 and others added 4 commits March 14, 2022 23:13
Co-authored-by: Chester Huynh <chester.huynh924@gmail.com>
Co-authored-by: Parth Vora <pvora4@jhu.edu>
Co-Authored-By: Thomas Fan <thomasjpfan@gmail.com>
Co-Authored-By: Chester Huynh <chester.huynh924@gmail.com>
Co-Authored-By: Parth Vora <pvora4@jhu.edu>
adam2392 (Member, Author) commented Mar 15, 2022

Progress as of 3/16/22

I added oblique trees/forests to the existing unit-test parametrizations and they mostly pass, with one minor issue around pickling. A summary:

  • sparse dataset support, which I think should not be addressed here. How it would be implemented is an open question, and handling "categorical" data needs more thought as well. Overall, I added error messages where needed and updated tests where needed to handle testing oblique trees/forests.
  • pickling does not work on roundtrip for some reason. See below.
  • added a few new tests of performance relative to Forest-RI (i.e. DecisionTreeClassifier and RandomForestClassifier)

The commit 458220e is a stable working version with all tree/ensemble unit tests passing that also works as intended. The next step is to explore whether we can substitute a cdef class for SplitRecord/ObliqueSplitRecord, so that we can better support subclassing.

Pickling Issue Summary

The issue arises in the test sklearn/tree/tests/test_tree.py::test_min_impurity_decrease, which pickles the object and then compares metadata and performance before/after. The score comparison fails on these lines:

    score2 = est2.score(X, y)
    assert (
        score == score2
    ), "Failed to generate same score after pickling with {0}".format(name)

with score 1.0 before and then 0.97 after pickling.

I've "almost" fixed it in a74d970: all the internal data structures of the ObliqueTree match before/after pickling, but for some reason predict still produces different answers.

Update: I can fix it by setting max_depth=5, so I suspect there is a machine-precision rounding issue on this edge case of a dataset. Done in a582546.
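The suspected precision issue is easy to reproduce in isolation: a float64 value round-tripped through float32 generally changes, so a threshold comparison can flip for samples sitting exactly at the boundary. Illustrative only, not the failing code itself:

```python
import struct

def to_float32(x):
    """Round-trip a Python float (float64) through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

threshold = 0.1              # not exactly representable in binary
t32 = to_float32(threshold)  # float32 rounding lands on a different value
assert t32 != threshold      # a sample equal to t32 flips sides vs. float64
```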

Notes 3/21/22

Remove the max_depth=5 and try again:

  • look at predictions rather than the score to determine which leaf is problematic. Most likely coming from mishandling of a tie at the feature-value decision threshold(?), e.g. float32 compared to float64
  • run 1-NN within the test and see if there are duplicate feature vectors in X with different y-labels
  • rows 101 and 142 of iris are duplicated... but this might not be relevant; we'll see.
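The first debugging step (comparing per-sample predictions instead of the aggregate score) can be sketched like this, with a toy stand-in estimator since the oblique tree itself is what is under test:

```python
import pickle

class ThresholdClf:
    """Toy estimator with a single oblique-style split (hypothetical)."""
    def __init__(self, weights, threshold):
        self.weights = weights
        self.threshold = threshold

    def predict(self, X):
        return [int(sum(w * v for w, v in zip(self.weights, row)) > self.threshold)
                for row in X]

def pickle_mismatches(est, X):
    """Indices of samples whose prediction changes across a pickle roundtrip."""
    before = est.predict(X)
    after = pickle.loads(pickle.dumps(est)).predict(X)
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]

X = [[0.0, 1.0], [2.0, -1.0]]
clf = ThresholdClf(weights=[1.0, -1.0], threshold=0.5)
# pickle_mismatches(clf, X) should be [] for a correct roundtrip
```

A non-empty mismatch list pinpoints the problematic leaves directly, whereas a scalar score can hide a single flipped prediction.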

Update May 12th, 2022

Turns out Cython silently allows `for j in range(...)` even if `j` is not declared with a type(?). Weird. Fixed in 8a720ed.

The remaining issues are:

- sparse dataset support, which I could possibly split into a follow-up PR
- pickling does not work on roundtrip for some reason
- certain issues with max_leaf
adam2392 (Member, Author) commented Sep 6, 2022

@ogrisel I fixed the merge commit as discussed yesterday.

Here is a "brief" summary. For full details, see the PR description. Let me know if this is too long / detailed:

What are oblique trees: oblique trees generate splits based on combinations of features. They can sample more than n_features candidate splits per split node (up to 2^n in fact). Sampling random linear combinations of features implicitly captures the dominant singular vectors of the dataset with high probability, as described by the Johnson–Lindenstrauss lemma. Oblique trees do not necessarily need to sample linear combinations either; they can be generalized in countless ways, which opens the door to broader usage of the sklearn.tree module.
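A sketch of the Forest-RC style candidate sampling (random sparse combinations with ±1 weights; the actual splitter's parametrization may differ):

```python
import random

def sample_sparse_projection(n_features, n_nonzero, rng):
    """Draw one candidate projection: n_nonzero randomly chosen features,
    each with a random +/-1 weight, returned as {feature_index: weight}."""
    idx = rng.sample(range(n_features), n_nonzero)
    return {j: rng.choice([-1.0, 1.0]) for j in idx}

rng = random.Random(0)
proj = sample_sparse_projection(n_features=10, n_nonzero=3, rng=rng)
# three feature indices mapped to +/-1 weights; values depend on the seed
```

Because the choice of indices and weights is combinatorial, the pool of candidate splits per node is exponentially larger than the n_features axis-aligned candidates.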

When are oblique trees better than axis-aligned? Based on the geometric motivation above, oblique trees are expected to fare better when there are many noise dimensions, and/or when the optimal split hyperplanes in the data are not aligned with the feature axes.
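A toy 2D illustration of that geometry (synthetic data, not from the PR): the classes below are perfectly separated by the oblique hyperplane x0 + x1 > 0, while no single axis-aligned threshold achieves full accuracy:

```python
# Labels follow the sign of x0 + x1; each individual feature overlaps across classes.
X = [[-1.0, 0.5], [0.5, -1.0], [1.0, 0.5], [0.5, 1.0]]
y = [0, 0, 1, 1]

def oblique_predict(X):
    """One oblique split: classify by the sign of x0 + x1."""
    return [int(x0 + x1 > 0.0) for x0, x1 in X]

def best_stump_accuracy(X, y):
    """Best accuracy reachable by any single-feature threshold (decision stump)."""
    n, best = len(y), 0.0
    for j in range(len(X[0])):
        for t in (row[j] for row in X):
            for sign in (1, -1):
                pred = [int(sign * (row[j] - t) > 0) for row in X]
                best = max(best, sum(p == yi for p, yi in zip(pred, y)) / n)
    return best

# oblique_predict(X) == y, while the best axis-aligned stump reaches only 0.75
```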

When are axis-aligned trees better than oblique?
Axis-aligned trees perform better when the informative splits occur along the feature axes. In general this is the rarer scenario (analogously: what are the chances of sampling a 0 eigenvalue in a random matrix? measure zero), so oblique trees perform better on average. However, oblique trees carry a small extra computational and storage cost: sampling and storing the projection vectors for each split node. It is therefore recommended to use axis-aligned trees whenever computation and storage are critical constraints, but to also evaluate oblique trees via cross-validation whenever performance on scoring metrics is the priority.

jjerphan (Member) commented Dec 5, 2022

Hi @adam2392, could you split this PR into several small, atomic, orthogonal PRs?

There are changes like the ones of #25101 (which I think we are interested in for modularity); those are orthogonal to the introduction of ObliqueSplitter (which we still need to discuss; IMO that would be better as part of another package making use of the new structure of scikit-learn's tree Cython internals after they have been refactored). What do you think?

@jjerphan jjerphan changed the title [ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees FEA Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees Dec 5, 2022
adam2392 (Member, Author) commented Dec 5, 2022

Hi @jjerphan, yes I can, and I have a separate proposal for BaseTree, which would also improve the modularity of the Tree class.

The plan is to:

  • merge PR MAINT Introduce BaseCriterion as a base abstraction for Criterions #24678 on modularizing criterion
  • finish PR MAINT Refactor Splitter into a BaseSplitter #25101 on modularizing splitter and criterion
  • submit a PR modularizing trees. I can open a separate GH issue to lay out my thoughts and a draft PR to illustrate the code proposal.
  • refactor this PR for your review (I'm still of the opinion that basic causal trees, basic oblique trees, and quantile trees should live inside sklearn given their demonstrated importance, but I'm happy to table that discussion for the future)
  • in parallel start a scikit-contrib package for the exotic tree functionalities (manifold splits, survival trees, unsupervised trees, etc.)

If there's anything I/we can do to make the reviewing process easier and faster, please let me know. We are definitely willing to get to the final step as fast as possible :).

jjerphan (Member) commented Dec 6, 2022

What you have proposed in #22754 (comment) looks like a good plan to me!

To me, the main potential blocker to its advancing will be reviewers' time.

Signed-off-by: Adam Li <adam2392@gmail.com>
Successfully merging this pull request may close these issues.

Adding Oblique Trees (Forest-RC) to the Cythonized Tree Module
4 participants