-
Notifications
You must be signed in to change notification settings - Fork 7
[TEST PR] Adding oblique trees (i.e. Forest-RC) to cythonized tree module #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4addedb
to
634a428
Compare
634a428
to
44f688c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prototype, I think I see what is required to make it easier for you to extend scikit-learn's trees.
In the redesign, I think methods such as tree._apply_dense
and DenseObliqueSplitter.node_split
would still need to exist to perform your custom logic + data structures.
sklearn/tree/_oblique_tree.pyx
Outdated
node_id = tree._add_oblique_node(parent, is_left, is_leaf, split.feature, | ||
split.threshold, impurity, n_node_samples, | ||
weighted_n_node_samples, | ||
split.proj_vec_weights, | ||
split.proj_vec_indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a different API than the scikit-learn. As you mentioned in office hours, to better support this use case, tree._add_node
should accept a SpiltRecord
and SplitRecord
is a class that can be subclassed. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes if SplitRecord can become a class that can be subclassed, then this becomes easier. However, I wasn't sure if the struct was there for performance, or maintenance reasons.
sklearn/tree/_oblique_tree.pyx
Outdated
(impurity <= min_impurity_decrease)) | ||
|
||
if not is_leaf: | ||
splitter.oblique_node_split(impurity, &split, &n_constant_features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same API as scikit-learn. As long as SplitRecord
can be subclassed, I think this should just work with your custom splitter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what I specced out at least initially is that almost all of the Tree
class currently in scikit-learn can be used for any "new tree" that only changes the splitter function as long as SplitRecord and the type of splitter can be subclassed.
Since the tree doesn't do anything fancy except call Splitter functions.
…version compiles, but doesn't work due to the lack of modularity in the existing tree code
Create RF vs OF benchmarking notebook
Some notes as of 3/10/22Have been working on a refactor that would make adding oblique trees to the existing codebase a lot simpler. Some notes for further improvement:
|
Reference Issues/PRs
Fixes:
What does this implement/fix? Explain your changes.
Adds cythonized oblique trees to the tree module. This is known as
Forest-RC
in the Breiman 2001 paper._oblique_tree.pxd/pyx
: This file implements i) theObliqueTree(Tree)
, which defines a few additional class members for storing the projection weight and indices and a new function for adding and oblique node and then ii) theObliqueTreeBuilder(TreeBuilder)
, which defines how to build the oblique tree._oblique_splitter.pxd/pyx
: This is the main change, which i) defines anObliqueSplitRecord
for keeping track of oblique splits, and ii) defines anObliqueSplitter(Splitter)
which gets oblique node splits and samples projection matrices, while also storing additional hyperparameters._classes.py
: Defines new Python interfaces for the Oblique trees and forestsAny other comments?
I'm not an expert in cython and c++ interplay, but I suspect that if we can "generalize" the
Node
struct to carry projection vector and weight information (not used inForest-RI
, or axis-aligned Random Forest), then much of the tree, tree building code is not even necessary. The only thing that is different at a fundamental level is the idea of asample_proj_mat
at each node of the tree, which samples sparse combinations.Another missing component currently is the implementation on sparse data, but this should be easily added in I presume.
Code That Can Be Shortened
New data structures and classes,
ObliqueSplitRecord
,ObliqueSplitter
,ObliqueTree
are defined. However, if the existingSplitRecord
,Splitter
,Tree
can be generalized, then the existing functions can just be used to build Oblique Trees too.However, I'm not sure if the scikit-learn devs would want that, rather then just replicating some code across these two cases?