
FEA Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754


Draft · wants to merge 103 commits into main

Conversation

adam2392 (Member) commented Mar 10, 2022

Reference Issues/PRs

Closes: #20819

This is ready for review.

What does this implement/fix? Explain your changes.

Implements oblique trees/splitter as a subclass of the existing Tree/Splitter. This further allows extensibility of the sklearn Tree and Splitter to allow downstream packages that have to define more complex and exotic sampling mechanisms. The total Cython code that constitutes the logic addition is around ~950 LOC, while the rest are from unit tests, and adding the Python API.

This extends decision trees to random linear combinations of features, the oblique trees proposed by Breiman in 2001 (Forest-RC). It will enable 3rd-party packages to subclass this code and instantiate other trees that operate on "combinations" of features in some way (e.g. a weighted sum, or a kernel). The general intuition is that oblique forests (OF) can sample a more diverse set of candidate splits, enabling better generalization and robustness to high-dimensional noise. Random forests (RF) should be used if the user suspects all informative structure is aligned with the feature axes. The tradeoffs in training time, model size, and scoring time are complex: OF can fit shorter trees but must store additional data per node, and it performs extra operations whose cost is often negligible. We suggest trying OF first whenever you have the computational and storage flexibility, and RF when you have very large sample sizes and strict runtime/storage constraints.
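As a rough illustration (a plain-Python sketch with hypothetical names, not the PR's Cython implementation), an oblique split thresholds a linear combination of features instead of a single feature:

```python
def oblique_projection(X, feature_idx, weights):
    """Project each sample onto a sparse linear combination of features."""
    return [sum(w * row[j] for j, w in zip(feature_idx, weights)) for row in X]

def oblique_split(X, feature_idx, weights, threshold):
    """Partition sample indices by thresholding the projected values."""
    proj = oblique_projection(X, feature_idx, weights)
    left = [i for i, p in enumerate(proj) if p <= threshold]
    right = [i for i, p in enumerate(proj) if p > threshold]
    return left, right

# An axis-aligned split is the special case of one feature with weight 1.
X = [[1.0, 2.0], [3.0, 0.5], [0.2, 0.2]]
left, right = oblique_split(X, feature_idx=[0, 1], weights=[1.0, -1.0], threshold=0.0)
# -> left = [0, 2], right = [1]
```

This is why axis-aligned trees fall out as a special case, and why downstream packages can swap in other "combinations" (kernels, manifold projections) by changing only the projection step.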

Experiments supporting the change

At the request of Thomas Fan, Guillaume, and Olivier, we ran the following experiments.

Docs changes for education and user-guidance
  • extension of the examples/tree/plot_iris_dtc.py to include oblique trees
  • modules/tree.rst with a new section on oblique trees
  • modules/ensemble.rst with a new section on oblique forests
  • A new example in examples/ensemble/plot_oblique_axis_aligned_forests.py demonstrating cases where oblique forests outperform axis-aligned forests, on a real dataset from OpenML and on a simulated dataset

The changes to the setup.py files were necessary to compile the package in some of the CI pipelines, although compilation worked without them locally on my Mac M1 machine.

Tracking Progress on sub-items - used for keeping track of challenges that arose (optional read)

The remaining items to complete are:

  • Add unit tests
  • Refactor to use memory views as in normal splitter (https://github.com/scikit-learn/scikit-learn/pull/23273/files)
  • Fix pickling issue in FEA Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754 (comment)
  • Add feature_importances, or specify an error message if it's not allowed.
  • Add an extension of the RF-Iris dataset example done by Jong Shin to the docs, plus a cleaner synthetic dataset. Also consider another dataset, e.g. digits (8 x 8 images), where oblique forests can be shallower/smaller than RandomForest: train OF/RF with 100 trees, subsample the forest, and plot accuracy and max_depth. The goal is a few more candidate examples for the docs: one showing an intuitive visualization, and another demonstrating possible superiority on a real example (e.g. for digits, horizontally stack the real signal with permuted noisy copies). Put these in a separate notebook for now, for downstream integration into scikit-learn. End goal: educate users on the inductive bias of oblique trees and when to use them.
  • Determine if minor refactoring is needed with TreeBuilder/Tree for adding nodes (see: [MAINT] Improve extensibility of the Tree/Splitter code #22756 (comment))

Any other comments?

This PR demonstrates the end product. To simplify the merging process, I propose merging it as a series of smaller PRs.

One more note: training speed for both oblique and axis-aligned decision trees could be further improved via quantization (binning) of continuous feature values, i.e. the same idea used in histogram gradient boosting.
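A sketch of that binning idea, with equal-width bins for simplicity (real implementations such as histogram gradient boosting use quantile-based bins and handle missing values):

```python
def bin_feature(values, n_bins):
    """Quantize a continuous feature into equal-width integer bins.

    A splitter then only needs to scan n_bins candidate thresholds instead of
    one per unique value, which is the source of the training speedup.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

binned = bin_feature([0.0, 0.5, 1.0], n_bins=4)
# -> [0, 2, 3]
```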

@adam2392 adam2392 marked this pull request as draft March 10, 2022 20:33
@adam2392 adam2392 force-pushed the obliquepr branch 2 times, most recently from 5b74193 to 4a4b4bb on March 15, 2022 02:38
adam2392 (Member, Author) commented Mar 15, 2022

Linking convo as reference: neurodata#16 and specifically neurodata#16 (comment).

I am going to proceed by adding some lightweight unit tests to determine integrability, and by getting a working version that we can come back to.

Then I will try out the idea of replacing SplitRecord and friends with a cdef class (maybe even a dataclass?), which would be used to pass the data structure between node_split and add_node.
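The dataclass idea could look roughly like this; the field names are guesses mirroring the existing SplitRecord struct, not the actual proposal:

```python
from dataclasses import dataclass

@dataclass
class SplitRecord:
    """Python-level sketch of the record passed from node_split to add_node."""
    feature: int        # index of the split feature (or projection)
    threshold: float    # decision threshold on the (projected) value
    pos: int            # index where the node's samples are partitioned
    improvement: float  # impurity improvement achieved by the split

record = SplitRecord(feature=3, threshold=0.5, pos=10, improvement=0.12)
```

A class (rather than a C struct) would let subclasses such as an oblique splitter extend the record with extra fields, e.g. the projection weights.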

adam2392 and others added 4 commits March 14, 2022 23:13
Co-authored-by: Chester Huynh <chester.huynh924@gmail.com>
Co-authored-by: Parth Vora <pvora4@jhu.edu>
Co-Authored-By: Thomas Fan <thomasjpfan@gmail.com>
Co-Authored-By: Chester Huynh <chester.huynh924@gmail.com>
Co-Authored-By: Parth Vora <pvora4@jhu.edu>
adam2392 (Member, Author) commented Mar 15, 2022

Progress as of 3/16/22

I added oblique trees/forests to the existing unit-test parametrizations and they mostly pass, with one minor issue around pickling. A summary:

  • sparse dataset support, which I think should not be addressed here. How it would be implemented is an open question, and handling "categorical" data needs more thought as well. Overall, I added error messages where needed and updated tests where needed to handle testing oblique trees/forests.
  • pickling does not work on roundtrip for some reason. See below.
  • added a few new tests of performance relative to Forest-RI (i.e. DecisionTreeClassifier and RandomForestClassifier)

The commit 458220e is a stable working version with all tree/ensemble unit tests passing that also works as intended. The next step is to explore whether we can substitute a cdef class for SplitRecord/ObliqueSplitRecord, so that we can better support subclassing.

Pickling Issue Summary

The issue arises in the test sklearn/tree/tests/test_tree.py::test_min_impurity_decrease, which pickles the object and then compares metadata and performance before/after. The score comparison fails on these lines:

    score2 = est2.score(X, y)
    assert (
        score == score2
    ), "Failed to generate same score after pickling with {0}".format(name)

with score 1.0 before and then 0.97 after pickling.

I've "almost" fixed it in a74d970: all the internal data structures of the ObliqueTree match before/after pickling, but for some reason predict still produces different answers.

Update: I can fix it by setting max_depth=5, so I suspect there is a machine-precision rounding issue on this edge case of a dataset. Done in a582546.
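The suspected precision issue is easy to reproduce in isolation: a float64 value round-tripped through float32 generally changes, so a threshold comparison can flip for samples sitting exactly at the boundary. Illustrative only, not the failing code itself:

```python
import struct

def to_float32(x):
    """Round-trip a Python float (float64) through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

threshold = 0.1              # not exactly representable in binary
t32 = to_float32(threshold)  # float32 rounding lands on a different value
assert t32 != threshold      # a sample equal to t32 flips sides vs. float64
```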

Notes 3/21/22

Remove the max_depth=5 and try again:

  • look at predictions rather than the score to determine which leaf is problematic. Most likely coming from mishandling of a tie at the feature-value decision threshold(?), e.g. float32 compared to float64
  • run 1-NN within the test and see if there are duplicate feature vectors in X with different y-labels
  • rows 101 and 142 of iris are duplicated... but this might not be relevant; we'll see.
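The first debugging step (comparing per-sample predictions instead of the aggregate score) can be sketched like this, with a toy stand-in estimator since the oblique tree itself is what is under test:

```python
import pickle

class ThresholdClf:
    """Toy estimator with a single oblique-style split (hypothetical)."""
    def __init__(self, weights, threshold):
        self.weights = weights
        self.threshold = threshold

    def predict(self, X):
        return [int(sum(w * v for w, v in zip(self.weights, row)) > self.threshold)
                for row in X]

def pickle_mismatches(est, X):
    """Indices of samples whose prediction changes across a pickle roundtrip."""
    before = est.predict(X)
    after = pickle.loads(pickle.dumps(est)).predict(X)
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]

X = [[0.0, 1.0], [2.0, -1.0]]
clf = ThresholdClf(weights=[1.0, -1.0], threshold=0.5)
# pickle_mismatches(clf, X) should be [] for a correct roundtrip
```

A non-empty mismatch list pinpoints the problematic leaves directly, whereas a scalar score can hide a single flipped prediction.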

Update May 12th, 2022

Turns out Cython silently allows `for j in range(...)` even if `j` is not declared with a type(?). Weird. Fixed in 8a720ed.

The remaining issues are:

- sparse dataset support, which I could possibly split into a follow-up PR
- pickling does not work on roundtrip for some reason
- certain issues with max_leaf
adam2392 (Member, Author) commented Sep 6, 2022

@ogrisel I fixed the merge commit as discussed yesterday.

Here is a "brief" summary. For full details, see the PR description. Let me know if this is too long / detailed:

What are oblique trees: oblique trees generate splits based on combinations of features. They can sample more than n_features candidate splits per split node (up to 2^n in fact). Sampling random linear combinations of features implicitly captures the dominant singular vectors of the dataset with high probability, as described by the Johnson–Lindenstrauss lemma. Oblique trees do not necessarily need to sample linear combinations either; they can be generalized in countless ways, which opens the door to broader usage of the sklearn.tree module.
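A sketch of the Forest-RC style candidate sampling (random sparse combinations with ±1 weights; the actual splitter's parametrization may differ):

```python
import random

def sample_sparse_projection(n_features, n_nonzero, rng):
    """Draw one candidate projection: n_nonzero randomly chosen features,
    each with a random +/-1 weight, returned as {feature_index: weight}."""
    idx = rng.sample(range(n_features), n_nonzero)
    return {j: rng.choice([-1.0, 1.0]) for j in idx}

rng = random.Random(0)
proj = sample_sparse_projection(n_features=10, n_nonzero=3, rng=rng)
# three feature indices mapped to +/-1 weights; values depend on the seed
```

Because the choice of indices and weights is combinatorial, the pool of candidate splits per node is exponentially larger than the n_features axis-aligned candidates.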

When are oblique trees better than axis-aligned? Based on the geometric motivation above, oblique trees are expected to fare better when there are many noise dimensions, and/or when the optimal split hyperplanes in the data are not aligned with the feature axes.
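A toy 2D illustration of that geometry (synthetic data, not from the PR): the classes below are perfectly separated by the oblique hyperplane x0 + x1 > 0, while no single axis-aligned threshold achieves full accuracy:

```python
# Labels follow the sign of x0 + x1; each individual feature overlaps across classes.
X = [[-1.0, 0.5], [0.5, -1.0], [1.0, 0.5], [0.5, 1.0]]
y = [0, 0, 1, 1]

def oblique_predict(X):
    """One oblique split: classify by the sign of x0 + x1."""
    return [int(x0 + x1 > 0.0) for x0, x1 in X]

def best_stump_accuracy(X, y):
    """Best accuracy reachable by any single-feature threshold (decision stump)."""
    n, best = len(y), 0.0
    for j in range(len(X[0])):
        for t in (row[j] for row in X):
            for sign in (1, -1):
                pred = [int(sign * (row[j] - t) > 0) for row in X]
                best = max(best, sum(p == yi for p, yi in zip(pred, y)) / n)
    return best

# oblique_predict(X) == y, while the best axis-aligned stump reaches only 0.75
```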

When are axis-aligned trees better than oblique?
Axis-aligned trees perform better when the informative splits occur along the feature axes. In general this is the rarer scenario (analogously: what are the chances of sampling a 0 eigenvalue in a random matrix? measure zero), so oblique trees perform better on average. However, oblique trees carry a small extra computational and storage cost: sampling and storing the projection vectors for each split node. It is therefore recommended to use axis-aligned trees whenever computation and storage are critical constraints, but to also evaluate oblique trees via cross-validation whenever performance on scoring metrics is the priority.

jjerphan (Member) commented Dec 5, 2022

Hi @adam2392, could you split this PR into several small, atomic, orthogonal PRs?

There are changes like the ones of #25101 (which I think we are interested in for modularity); those are orthogonal to the introduction of ObliqueSplitter (which we still need to discuss; IMO that would be better as part of another package making use of the new structure of scikit-learn's tree Cython internals after they have been refactored). What do you think?

@jjerphan jjerphan changed the title [ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees FEA Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees Dec 5, 2022
adam2392 (Member, Author) commented Dec 5, 2022

Hi @jjerphan, yes I can, and I have a separate proposal for BaseTree, which would also improve the modularity of the Tree class.

The plan is to:

  • merge PR MAINT Introduce BaseCriterion as a base abstraction for Criterions #24678 on modularizing criterion
  • finish PR MAINT Refactor Splitter into a BaseSplitter #25101 on modularizing splitter and criterion
  • submit a PR modularizing trees. I can open a separate GH issue to lay out my thoughts and a draft PR to illustrate the code proposal.
  • refactor this PR for your review (I'm still of the opinion that basic causal trees, basic oblique trees, and quantile trees should live inside sklearn given their demonstrated importance, but I'm happy to table that discussion for the future)
  • in parallel start a scikit-contrib package for the exotic tree functionalities (manifold splits, survival trees, unsupervised trees, etc.)

If there's anything I/we can do to make the reviewing process easier and faster, please let me know. We are definitely willing to get to the final step as fast as possible :).

jjerphan (Member) commented Dec 6, 2022

What you have proposed in #22754 (comment) looks like a good plan to me!

To me, the main potential blocker to its advancing will be reviewers' time.

Signed-off-by: Adam Li <adam2392@gmail.com>
Successfully merging this pull request may close these issues.

Adding Oblique Trees (Forest-RC) to the Cythonized Tree Module
4 participants