FEA Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754
Conversation
(force-pushed from 5b74193 to 4a4b4bb)
Linking convo as reference: neurodata#16 and specifically neurodata#16 (comment). I am going to proceed by adding some lightweight unit tests to determine integrability and getting a working version that we can come back to. Then I will try out the idea of replacing …
Co-authored-by: Thomas Fan <thomasjpfan@gmail.com>
Co-authored-by: Chester Huynh <chester.huynh924@gmail.com>
Co-authored-by: Parth Vora <pvora4@jhu.edu>
Progress as of 3/16/22: Added oblique trees/forests to the existing unit test parametrizations and they mostly work, with one minor issue on pickling. The following is a summary:
The commit 458220e is a stable working version that has all the tree/ensemble unit tests passing and also works as intended. The next step is to explore whether or not we can substitute a …

Pickling issue summary: The issue arises in the test:
Update: I can fix it by setting …

Notes 3/21/22: Remove the …
Update May 12th, 2022: Turns out Cython silently allows …
…tests. The issues still are:
- sparse dataset support, which I could possibly split into a follow-up PR?
- pickling does not work on roundtrip for some reason
- certain issues with max_leaf…
@ogrisel I fixed the merge commit as discussed yesterday. Here is a "brief" summary; for full details, see the PR description. Let me know if this is too long / detailed.

What are oblique trees? Oblique trees generate splits based on combinations of features. They can sample more than …

When are oblique trees better than axis-aligned? Based on the geometric motivation above, one would suspect that oblique trees fare better when there are many noise dimensions, and/or when there are optimal split hyperplanes in the data that are not aligned with the features themselves.

When are axis-aligned trees better than oblique? …
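The geometric intuition can be sketched with a toy example (a hedged illustration in plain numpy, not code from this PR; `best_split_accuracy` is a helper invented here). When the true boundary is `x0 + x1 > 0`, no single axis-aligned threshold separates the classes well, but one split on the combined projection does:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def best_split_accuracy(values, y):
    """Best accuracy from a single threshold on a 1-D projection
    (left partition predicted 0, right partition predicted 1)."""
    order = np.argsort(values)
    ys = y[order]
    n = len(ys)
    ones_left = np.cumsum(ys)                  # ones falling in the left partition
    total_ones = ones_left[-1]
    zeros_left = np.arange(1, n + 1) - ones_left
    correct = zeros_left + (total_ones - ones_left)
    return correct.max() / n

# best single axis-aligned split on either feature: ~75% accuracy
axis_acc = max(best_split_accuracy(X[:, 0], y), best_split_accuracy(X[:, 1], y))
# one oblique split on the combination x0 + x1 separates the data exactly
oblique_acc = best_split_accuracy(X[:, 0] + X[:, 1], y)
print(axis_acc, oblique_acc)
```

An axis-aligned tree has to approximate the diagonal boundary with a staircase of many splits, while a single oblique projection captures it in one.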
Hi @adam2392, could you split this PR into several small, atomic, and orthogonal PRs? There are changes like the ones of #25101 (which I think we are interested in for modularity) that are orthogonal to the introduction of …
Hi @jjerphan, yes I can, and I have a separate proposal for BaseTree, which would also improve modularity of the Tree class. The plan is to: …
If there's anything I/we can do to make the reviewing process easier and faster, please let me know. We are definitely willing to get to the final step as fast as possible :).
What you have proposed in #22754 (comment) looks like a good plan to me! What will potentially block it from advancing is reviewers' time.
Signed-off-by: Adam Li <adam2392@gmail.com>
Reference Issues/PRs
Closes: #20819
This is ready for review.
What does this implement/fix? Explain your changes.
Implements oblique trees/splitters as a subclass of the existing `Tree`/`Splitter`. This further allows extensibility of the sklearn `Tree` and `Splitter` to downstream packages that have to define more complex and exotic sampling mechanisms. The total Cython code that constitutes the logic addition is around ~950 LOC, while the rest comes from unit tests and additions to the Python API.

This is an extension of decision trees that generalizes to random linear combinations of features (oblique trees), proposed by Breiman in 2001. It will enable 3rd-party packages to subclass this code to instantiate other trees that work on "combinations" of features in some way (e.g. taking a weighted sum, or a kernel). The general intuition is that OF (oblique forests) can sample a more diverse set of features, enabling better generalization and robustness to high-dimensional noise. RF (random forests) should be used if the user suspects all data is aligned on the feature dimensions. The tradeoffs in computation time, model size, and scoring time are complex: OF can fit shorter trees, but requires storage of additional data, and it performs additional operations that increase computation time, though this can be negligible. We suggest trying OF first if you have the computational and storage flexibility, and using RF for now if you have very large sample sizes and very strict runtime/storage constraints.
`ObliqueSplitter` and `ObliqueTree` can subclass the existing code.

- `iris` dataset (https://gist.github.com/adam2392/78519091104cecfeb8ff796eac7e8115)

Experiments supporting the change
As requested by Thomas Fan, Guillaume, and Olivier, we ran the following experiments.
Docs changes for education and user-guidance
- `examples/tree/plot_iris_dtc.py` to include oblique trees
- `modules/tree.rst` with a new section on oblique trees
- `modules/ensemble.rst` with a new section on oblique forests
- `examples/ensemble/plot_oblique_axis_aligned_forests.py` demonstrating oblique forest superiority on a real dataset from OpenML and a simulated dataset

The changes to `setup.py` files were necessary to compile the package in some of the CI pipelines. It worked for me without them locally on my Mac M1 machine, though.

Tracking progress on sub-items - used for keeping track of challenges that arose (optional read)
The remaining items to complete are:

- `feature_importances`, or specify an error message if it's not allowed.
- `digits` (8 x 8 images) that can get shallower/smaller forests vs RandomForest; e.g. train OF/RF with 100 trees, subsample the forest, and plot accuracy and max_depth. Basically get a few more examples that are candidates for inclusion in the docs: one example showing an intuitive visualization, and another demonstrating possible superiority on a real example (e.g. digits: horizontally stack the real signal, then create permuted noisy copies and stack again). Put it into a separate notebook for now for downstream integration into scikit-learn. End goal: educate users on the inductive bias of oblique trees and when to use them.
- `TreeBuilder/Tree` for adding nodes (see: [MAINT] Improve extensibility of the Tree/Splitter code #22756 (comment))

Any other comments?
This is the PR to demonstrate the end product. To simplify the merging process, I propose merging this PR as a series of smaller PRs:
- [MAINT] Modularize Tree code and Splitter utility functions #22753
- `test_tree.py` for pickling and min_impurity_decrease #22915 (review)
- `Tree` abstractions, such as in `sklearn/tree/_classes.py` and `sklearn/tree/_tree.pyx`
Another misc. note is that oblique and decision trees could be further improved in training speed via quantization of continuous feature values in the data (i.e. the same idea used in histogram gradient boosting).
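A hedged sketch of that quantization idea (the helper below is invented for illustration and is not scikit-learn API): replace each continuous column by small integer bin codes derived from quantile cut points, so the split search only has to scan at most `n_bins` thresholds per feature instead of sorting raw floats at every node.

```python
import numpy as np

def quantize_features(X, n_bins=256):
    """Bin each continuous column into uint8 codes using quantile edges,
    similar in spirit to what histogram gradient boosting does internally."""
    X = np.asarray(X, dtype=np.float64)
    binned = np.empty(X.shape, dtype=np.uint8)
    edges = []
    for j in range(X.shape[1]):
        # n_bins - 1 interior quantile cut points -> n_bins bins per column
        cuts = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        binned[:, j] = np.searchsorted(cuts, X[:, j], side="right")
        edges.append(cuts)
    return binned, edges
```

The tree then trains on the integer codes, and the stored edges map a chosen bin threshold back to a real-valued cut for prediction.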