Description
Summary
#24678 introduces a modularization of Criterion to allow different criteria to be used with the same classes. #25101 introduces a modularization of Splitter to allow different types of splits to be computed. Now comes the time to also modularize the `Tree` class. A good `Tree` class should enable oblique splits, causal leaf nodes (i.e. leaf nodes set differently from split nodes), quantile trees (leaf nodes set differently from split nodes) and unsupervised trees. Note that another feature of causal trees is 'honesty', which should be easier to add after this issue is resolved.
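To make "leaf nodes set differently" concrete, here is a minimal pure-Python sketch, not sklearn code. It assumes a regression leaf summarizes its training targets with a mean, while a quantile-style leaf keeps the targets so that any quantile can be read off at prediction time. The function names and the nearest-rank quantile rule are illustrative assumptions.

```python
import math
from statistics import mean


def mean_leaf_value(targets):
    """Regression-style leaf: collapse the targets into a single number."""
    return mean(targets)


def quantile_leaf_value(targets):
    """Quantile-style leaf (assumed design): keep the sorted targets."""
    return sorted(targets)


def leaf_quantile(stored_targets, q):
    """Nearest-rank quantile of the targets stored in a quantile-style leaf."""
    n = len(stored_targets)
    index = max(0, math.ceil(q * n) - 1)
    return stored_targets[index]


# Targets of the training samples that ended up in one leaf.
targets = [10.0, 1.0, 3.0, 2.0]

mean_value = mean_leaf_value(targets)         # one summary value
stored = quantile_leaf_value(targets)         # the full sorted sample
median_value = leaf_quantile(stored, 0.5)     # read a quantile later
upper_value = leaf_quantile(stored, 0.9)
```

The two leaves need different storage and a different "set leaf" step, even though the split nodes above them can be identical.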
Proposed improvement
We will have the following improvements:
- Refactor `tree._add_node()` to set the split node and leaf node differently.
- Refactor to have a 'splitptr' for `SplitRecord`, which allows for generalizations of the SplitRecord.
- Separate `Tree` into generic and abstract base functions for `BaseTree` and specific supervised axis-aligned functions for `Tree`.
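The separation listed above can be sketched in plain Python rather than Cython. This is only an illustration under stated assumptions. `BaseTree`, `Tree`, `SplitRecord` and `_add_node` are names from this issue, but their signatures here are made up. Everything else (`_set_split_node`, `_set_leaf_node`, `_goes_left`, `ObliqueSplitRecord`, `ObliqueTree`, `predict_one`) is hypothetical and is not the actual sklearn API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SplitRecord:
    """Axis-aligned split: one feature compared with a threshold."""
    feature: int
    threshold: float


@dataclass
class ObliqueSplitRecord:
    """Hypothetical generalized record: a weighted sum of features."""
    weights: tuple
    threshold: float


class BaseTree(ABC):
    """Generic node storage and traversal; the split semantics are abstract."""

    def __init__(self):
        self.nodes = []

    def _add_node(self, parent, is_left, is_leaf, split=None, value=None):
        node_id = len(self.nodes)
        self.nodes.append({"left": None, "right": None})
        if parent is not None:
            self.nodes[parent]["left" if is_left else "right"] = node_id
        # Split nodes and leaf nodes are set through separate hooks.
        if is_leaf:
            self._set_leaf_node(node_id, value)
        else:
            self._set_split_node(node_id, split)
        return node_id

    def _set_split_node(self, node_id, split):
        self.nodes[node_id]["split"] = split

    def _set_leaf_node(self, node_id, value):
        self.nodes[node_id]["value"] = value

    @abstractmethod
    def _goes_left(self, split, x):
        """Return True if sample x is routed to the left child."""

    def predict_one(self, x):
        node_id = 0
        while "value" not in self.nodes[node_id]:
            node = self.nodes[node_id]
            go_left = self._goes_left(node["split"], x)
            node_id = node["left"] if go_left else node["right"]
        return self.nodes[node_id]["value"]


class Tree(BaseTree):
    """Supervised axis-aligned tree."""

    def _goes_left(self, split, x):
        return x[split.feature] <= split.threshold


class ObliqueTree(BaseTree):
    """Hypothetical subclass that only overrides how a split is evaluated."""

    def _goes_left(self, split, x):
        projection = sum(w * xi for w, xi in zip(split.weights, x))
        return projection <= split.threshold


def build_stump(tree, split):
    """Build a root split with a left leaf of -1.0 and a right leaf of 1.0."""
    root = tree._add_node(None, True, False, split=split)
    tree._add_node(root, True, True, value=-1.0)
    tree._add_node(root, False, True, value=1.0)
    return tree


axis_tree = build_stump(Tree(), SplitRecord(feature=0, threshold=0.5))
oblique_tree = build_stump(
    ObliqueTree(), ObliqueSplitRecord(weights=(1.0, 1.0), threshold=1.0)
)
```

In this sketch a third-party subclass only has to supply its own split record and routing rule, while node bookkeeping stays in the base class. A subclass that needs different leaf contents would override `_set_leaf_node` instead.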
Once the changes are made, one should verify:
- The `tree` submodule's Cython code still builds (i.e. `make clean` and then `pip install --verbose --no-build-isolation --editable .` should not error out).
- The unit tests inside `sklearn/tree` all pass.
- The asv benchmarks do not show a performance regression: run `asv continuous --verbose --split --bench RandomForest upstream/main <new_branch_name>` and then, for a side-by-side comparison, `asv compare main <new_branch_name>`.
Reference
As discussed in #24577, I wrote up a doc on proposed improvements to the tree submodule that would:
- make it easier for 3rd party packages to subclass existing sklearn tree code and
- make it easier for sklearn itself to adopt many of the modern improvements to trees
cc: @jjerphan