Skip to content

MAINT, RFC Simplify the Cython code in sklearn/tree/ by splitting the "Splitter" and "Partitioner" code #29459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
adam2392 opened this issue Jul 10, 2024 · 6 comments · Fixed by #29458

Comments

@adam2392
Copy link
Member

adam2392 commented Jul 10, 2024

Summary of the problem

There are quite a number of GH issues with the label tree (https://github.com/scikit-learn/scikit-learn/issues?page=2&q=is%3Aopen+is%3Aissue+label%3Amodule%3Atree).

However, the code is a bit hard to approach as there's so many moving parts. For example, see: #18448 (comment), where users are intimidated by the Cython codebase. A large part of the complexity comes in the splitter, which contains the most algorithmic logic. I think with the work done by @thomasjpfan recently, we were able to pull apart the idea of a "partitioner" and "splitter" in the trees.

This problem also arises because https://github.com/scikit-learn/scikit-learn/pull/29437/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fc makes the code-diff very large inside _splitter.pyx file making it also harder to review and understand what changes affect the "splitter" and what changes affect the "partitioner".

Proposed change

I further propose to split these into separate files, so it is easier to maintain, read and determine what is affecting what. In addition, we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner.

See the following PR for an example of what would change. The _splitter.pyx file would decrease by over 800 LOC, while it gets moved to _partitioner.pyx.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jul 10, 2024
@adam2392
Copy link
Member Author

Would ppl be open to this simplification?

I'm a bit biased as even from my personal experience, it was a headache reading through the entire _splitter.* files to understand what was going on. Separating the files semantically would make the code much more approachable.

Assuming there is no performance degradation.

@glemaitre
Copy link
Member

I think that I'm fine with the split. One thing that we need to do when splitting the file is to exactly explain the job of the splitter and partitioner: looking at the name one could think that they do the same thing (they cut data).

I therefore think that we need to have a kind of developer documentation on the top of the file to explain why those are split: as far I remember, we have the partitioner to avoid speed regression due to some overhead in class inheritance.

we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner

I think this something that would lead to regression and the current workaround is actually a workaround abstract class.

@thomasjpfan would know the details. Having his feedback here would be wonderful because we could use it to document the files.

@adam2392
Copy link
Member Author

adam2392 commented Jul 11, 2024

Here is my current understanding of Partitioner and Splitter (I can edit this as my understanding improves):

Splitter: handles the algorithm for determining a split node, either via a greedy, or random search over a max_features set of chosen features. This is sort of the "director" for how partitioner and criterion work together to determine, evaluate and apply the split. This is invariant to whether or not the data is sparse, or dense, as the partitioner handles how to actually separate the samples.

Partitioner: this performs the actual partitioning of the samples index array based on split points, missing values, etc. In the future is where we would partition samples to left/right as well for categorical splits.

@adam2392
Copy link
Member Author

adam2392 commented Jul 11, 2024

we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner

I think this something that would lead to regression and the current workaround is actually a workaround abstract class.

Yes IIUC that was what was described here: #25306. However, I'm wondering if the vtable lookup is still incurred if we just have an abstract base class, @final decorators on the actual SparsePartitioner, DensePartitoner and fusedtype of the partitioners used in Splitter.

Just to summarize the state of affairs. If the following is the _partitioner.pyx file:

# _partitioner.pyx
cdef class AbstractPartitioner
     cdef void abstract_methods(self):

@final
cdef class SparsePartitioner(AbstractPartitioner):
     # actual implementation
     cdef inline void abstract_methods(self):

@final
cdef class DensePartitioner(AbstractPartitioner):
     # actual implementation
     cdef inline void abstract_methods(self):

then this would get hit w/ vtable lookup(?)

# _splitterv1.pyx
cdef node_split(AbstractPartitioner partitioner, …)
      # do stuff
      partitioner.abstract_methods()

However, I'm wondering if the following _splitterv2 is hit since the fused type should allow the compiler to know the type during compilation?

# _splitterv2.pyx
ctypedef fused Partitioner:
     SparsePartitioner
     DensePartitioner

cdef node_split(Partitioner partitioner, …)
      # do stuff
      partitioner.abstract_methods()

Forgive my understanding of vtable lookup if this is naive/incorrect.

Tho its in C++, my guess is this SO post is somewhat related in terms of hitting a vtable lookup when compiling(?): https://stackoverflow.com/questions/64808615/is-a-vtable-lookup-performed-when-indirectly-calling-a-virtual-method-in-a-non-o.

@adam2392
Copy link
Member Author

adam2392 commented Jul 12, 2024

I ran some benchmarks using asv on the PR branch and the one on main, and I actually don't see any real performance diffs:

asv run --bench RandomForestClassifierBenchmark.time_fit —verbose

Partition PR (fused type)
[100.00%] ··· ================ =========
              --                 n_jobs 
              ---------------- ---------
               representation      1    
              ================ =========
                   dense        5.34±0s 
                   sparse       6.62±0s 
              ================ =========

Main 
[100.00%] ··· ================ ============
              --                  n_jobs   
              ---------------- ------------
               representation       1      
              ================ ============
                   dense        5.32±0.02s 
                   sparse        6.63±0s   
              ================ ============

Perhaps @thomasjpfan has some insight into when the vtable lookup hit was occurring? Is it only if the abstract base class has methods actually defined rather than "pass", or something else perhaps.

@thomasjpfan
Copy link
Member

Having his feedback here would be wonderful because we could use it to document the files.

The approach in #29458 should be okay when it comes to v-table lookups. I left a review in #29458 (review) with my thoughts. Overall, I do not really want to depend on Cython to have the correct v-table behavior with a base class. The alternative is to to define the partitioner interface with a comment rather than an actual base class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants