Skip to content

[MAINT] Improve extensibility of the Tree/Splitter code #22756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

adam2392
Copy link
Member

Reference Issues/PRs

Fixes: #22753

What does this implement/fix? Explain your changes.

  • Moves splitter utility functions to _splitter.pxd definition file
  • Modularize Tree class to enable easier extensions

Any other comments?

N/a

@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 10, 2022

Thanks for the PR!

According to the Oblique PR, this works only if we extend the SplitRecord struct to include pointers to vectors:

vector[DTYPE_t]* proj_vec_weights # weights of the vector
vector[SIZE_t]* proj_vec_indices # indices of the features

This means the extensions only work for trees internal to scikit-learn. If external libraries want to extend trees with a new split record, they would need to adjust SplitRecord, which means vendoring scikit-learn's tree code and compile against their new SplitRecord.

From a software design point of view, I think it is strange to have fields in SplitRecord that is only used in one of the tree implementations. The regular trees would still use SplitRecord but ignore the two vector pointers, while the oblique trees uses all of them.

@adam2392
Copy link
Member Author

adam2392 commented Mar 10, 2022

According to the Oblique PR, this works only if we extend the SplitRecord struct to include pointers to vectors:

Apologies I should've elaborated.

This is able to be circumvented by defining a fusedtype. I didn't add this yet because I wasn't sure if you would prefer the soln I have implemented rn.

ctypedef fusedtype GeneralSplitRecord:
    ObliqueSplitRecord
    SplitRecord

# for Cython class functions that now pass around SplitRecord, just do the following
# and it will work because fusedtype is acceptable in function headers
cdef func(GeneralSplitRecord split, ...)

The extension to a split record that also stores the vector of indices and weights are sufficient to encompass all possible oblique trees (here we only implement and consider the one proposed by Breiman).

@adam2392
Copy link
Member Author

If you would like, I can modify the ObliquePR to reflect this is true first.

I have verified it locally.

@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 11, 2022

I think the solution is acceptable if fused types work in the context of making ObliqueTree code smaller. With my understanding of fused types, I think it will make the code more complicated for this specific use case. At a high level, its similar to using C++ templates but also with a class hierarchy. I have not explored fused type deeply, so if there is a way to use fused types that is also maintainable, I am open to it.

Here is a code snippet with an alternative approach that may work for your use case:

cdef struct SplitRecord:
    int i
    
cdef struct ObliqueRecord:
    SplitRecord a
    int k
    
# Lets say this is regular tree method
cdef set_node_regular(SplitRecord * a):
    print(a.i)
    
# Lets say this is oblique tree method
cdef set_node_oblique(SplitRecord * a):
    cdef ObliqueRecord b = (<ObliqueRecord*>(a))[0]
    print(b.a.i)
    print(b.k)
    
def main():
    cdef SplitRecord a
    a.i = 10
    
    set_node_regular(&a)
    
    cdef ObliqueRecord b
    b.a.i = 3
    b.k = 4
    
    set_node_oblique(<SplitRecord*>(&b))

set_node_regular and set_node_oblique both accept SplitRecord *, but set_node_oblique is specialized. This way ObliqueTree can subclass Tree and override set_node_values keeping the signature but casting the pointer to get the ObliqueRecord back. With this idea, BestObliqueSplitter would create ObliqueRecord and pass it up to the TreeBuilder as a SplitRecord *. When the builder hands it over to ObliqueTree.set_node_values, it will cast the record back to a ObliqueRecord. (I have not tried this out yet on your PR and you may run into other issues by doing this)

@adam2392
Copy link
Member Author

adam2392 commented Mar 11, 2022

Hi @thomasjpfan thanks for the input.

This is an interesting idea. The issue is that Tree.build defines a SplitRecord within it, so when it calls node_split, there is no ObliqueSplitRecord defined yet. I'm not sure if you can then define an ObliqueSplitRecord within ObliqueSplitter.node_split, and then set the split record to the passed in reference?

E.g. I think we would need something like this no?

# option 1 (idk if this works tho...)
def main():
    cdef SplitRecord a
    a.i = 10
    
    set_node_regular(&a)
    
    # in your suggestion, we leave build() as is, which then
    # requires us to define the obliquesplitrecord within this function
    set_node_oblique(&a)

# or option 2
def main():
    cdef SplitRecord a
    a.i = 10

    cdef ObliqueRecord b
    b.a = a
    b.k = 4

    set_node_regular(&a)
    set_node_oblique(<SplitRecord>(&b))

If option 2, then we just have to be okay defining an ObliqueSplitRecord within TreeBuilder that carries with it a SplitRecord. I don't see why that would be an issue if it is well documented. It is hidden from the user.

Alternatively, we could use tempita to just generalize the TreeBuilder to have one where it uses a SplitRecord and one with ObliqueSplitRecord within the build function?

@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 11, 2022

TreeBuilder.build can define a pointer to a SplitRecord and pass that around. Most likely one of c++'s unique_ptr or a shared_ptr.

If option 2, then we just have to be okay defining an ObliqueSplitRecord within TreeBuilder that carries with it a SplitRecord.

For me to properly evaluate, I would need to see an implementation.

Alternatively, we could use tempita to just generalize the TreeBuilder

I'm trying really hard to avoid Tempita for this use case. I find Tempita very hard to maintain.

@adam2392
Copy link
Member Author

adam2392 commented Mar 11, 2022

TreeBuilder.build can define a pointer to a SplitRecord and pass that around. Most likely one of c++'s unique_ptr or a shared_ptr.

I'm not sure I follow 100% here because it's getting into some deep c++, but this seems like the most desirable path. If what I spec out is correct, then I think this will be very clean. Thanks for answering any bad questions I have :p.

  • Would the change be to redefine Splitter and Tree functions to take in a shared_ptr rather then a *SplitRecord?
  • Then within the ObliqueSplitter and ObliqueTree, I override the functions and internally get access to the parent ObliqueSplitRecord struct?

Say it looks something like this.

# this would be the struct
cdef struct ObliqueSplitRecord:
    # Pointer for a normal split record see _splitter.pxd
    shared_ptr[SplitRecord] *split_record

    # the following are only used for oblique trees
    vector[DTYPE_t]* proj_vec_weights   # weights of the vector
    vector[SIZE_t]* proj_vec_indices    # indices of the features

# then say node_split would need to be redefined as:
cdef int node_split(self,
                        double impurity,   # Impurity of the node
                        shared_ptr[SplitRecord] split,
                        SIZE_t* n_constant_features) nogil except -1:
        cdef ObliqueSplitRecord oblique_split
        oblique_split.split_record = split
        ...

Then once it's passed from the TreeBuilder to the ObliqueTree._add_node as a shared_ptr[SplitRecord], how would you get access to the the ObliqueSplitRecord?

cdef SIZE_t _add_node(self, SIZE_t parent, bint is_left, bint is_leaf,
                          shared_ptr[SplitRecord] split_node, double impurity,
                          SIZE_t n_node_samples,
                          double weighted_n_node_samples) nogil except -1:
        # is this then correct and able to get the original oblique_record?
        cdef ObliqueRecord oblique_record = (<ObliqueRecord*>(split_node))[0]
        
        # the oblique_record here needs access to the proj_indices and proj_weights
        do_something_with(oblique_record.proj_vec_indices)

If this is what you had in mind, then I think it'll look very nice!

@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 11, 2022

Yup, that is the general idea. We can keep things extra simple by having node_split and _add_node take a raw pointer. TreeBuilder would still own the pointer using shared_ptr (maybe unique_ptr if possible) and convert the shared_ptr with get

@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 11, 2022

Although, it is unfortunate that the pruner needs to define a split record just to add a node. Maybe a new method _add_node_from_split_record that is used by the tree builder.

@adam2392
Copy link
Member Author

adam2392 commented Mar 14, 2022

Yup, that is the general idea. We can keep things extra simple by having node_split and _add_node take a raw pointer. TreeBuilder would still own the pointer using shared_ptr (maybe unique_ptr if possible) and convert the shared_ptr with get

I think node_split needs a more generic pointer. Perhaps a void*, so that regular splitter can pass in a SplitRecord*, while oblique splitter can pass in a ObliqueSplitRecord*?

Otherwise I'm having difficulty going from SplitRecord to the ObliqueSplitRecord that contains it. The main issue being, ObliqueSplitter._node_split is currently passed in a SplitRecord*, but needs to set those values and also the additional projection vector indices and weights. Then the builder does not have access to the projection vector and indices anymore when calling _add_node.

@thomasjpfan
Copy link
Member

Can ObliqueSplitter._node_split cast SplitRecord* into a ObliqueSplitRecord*? For the casting to work, the struct needs to be defined like this:

cdef struct ObliqueSplitRecord:
    SplitRecord split_record

    # the following are only used for oblique trees
    vector[DTYPE_t]* proj_vec_weights   # weights of the vector
    vector[SIZE_t]* proj_vec_indices    # indices of the features

@adam2392
Copy link
Member Author

adam2392 commented Mar 14, 2022

Can ObliqueSplitter._node_split cast SplitRecord* into a ObliqueSplitRecord*? For the casting to work, the struct needs to be defined like this:

I believe it can be, but when you use the pointer in a downstream function _add_node, you need to convert it again into a ObliqueSplitRecord*, which also contains the projection vector indices and weights.

E.g.

# this is called within TreeBuilder
cdef int node_split(self, double impurity,
                        SplitRecord* split,
                        SIZE_t* n_constant_features) nogil except -1:
        # create an ObliqueSplitRecord
        cdef ObliqueSplitRecord oblique_split = (<ObliqueSplitRecord*>(split))[0]
        # do some stuff to set the projection vector
        oblique_split.proj_vect_indices = func(...)

        ...
        # need to set the `split` pointer to what we have here
        split[0] = oblique_split.split_record

# Then within ObliqueTree, you need to be able to set the Node Values
 cdef int _set_node_values(self, SplitRecord *split_node, Node *node) nogil except -1:
        # Re-create the oblique split record that holds the original
        cdef ObliqueSplitRecord oblique_split_node = (<ObliqueSplitRecord*>(split_node))[0]
        
        # this now results in a segfault because there is nothing there
        with gil:
             print(deref(oblique_split_node.proj_vec_weights))

@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 14, 2022

you need to convert it again into a ObliqueSplitRecord*, which also contains the projection vector indices and weights.

Yes. I rather have a few conversions like this than a void *, since SplitRecord * is a little more explicit. Every time a Oblique___ object gets a SplitRecord it will do one line of casting before it can use it.

@adam2392
Copy link
Member Author

Just added some notes to my original comment. Lmk if you think I'm missing something there. The issue I'm having is getting access to the proj_vec_indices again after _split_node. Casting works fine (I think?). I just somehow lose the data that was set during _split_node.

@thomasjpfan
Copy link
Member

I opened neurodata#16 to showcase how to work around the SplitRecord issue. (Although I may have broke the underlying algorithm.)

As for this PR, I do not think it can be merged as is. Making functions cimportable can be done in a case by case basis. Refactoring trees, needs a proof of concept to make sure those are the right abstractions. In other words, oblique trees needs to demonstrate how it will use these new abstractions. After that, we can open a small PR to refactor the trees.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MAINT] Modularize Tree code and Splitter utility functions
2 participants