MAINT, RFC Simplify the Cython code in `sklearn/tree/` by splitting the "Splitter" and "Partitioner" code #29459

adam2392 · 2024-07-10T14:25:32Z

Summary of the problem

There are quite a number of GH issues with the label tree (https://github.com/scikit-learn/scikit-learn/issues?page=2&q=is%3Aopen+is%3Aissue+label%3Amodule%3Atree).

However, the code is a bit hard to approach as there's so many moving parts. For example, see: #18448 (comment), where users are intimidated by the Cython codebase. A large part of the complexity comes in the splitter, which contains the most algorithmic logic. I think with the work done by @thomasjpfan recently, we were able to pull apart the idea of a "partitioner" and "splitter" in the trees.

This problem also arises because https://github.com/scikit-learn/scikit-learn/pull/29437/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fc makes the code-diff very large inside _splitter.pyx file making it also harder to review and understand what changes affect the "splitter" and what changes affect the "partitioner".

Proposed change

I further propose to split these into separate files, so it is easier to maintain, read and determine what is affecting what. In addition, we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner.

See the following PR for an example of what would change. The _splitter.pyx file would decrease by over 800 LOC, while it gets moved to _partitioner.pyx.

The text was updated successfully, but these errors were encountered:

adam2392 · 2024-07-10T14:27:09Z

Would ppl be open to this simplification?

I'm a bit biased as even from my personal experience, it was a headache reading through the entire _splitter.* files to understand what was going on. Separating the files semantically would make the code much more approachable.

Assuming there is no performance degradation.

glemaitre · 2024-07-11T08:27:31Z

I think that I'm fine with the split. One thing that we need to do when splitting the file is to exactly explain the job of the splitter and partitioner: looking at the name one could think that they do the same thing (they cut data).

I therefore think that we need to have a kind of developer documentation on the top of the file to explain why those are split: as far I remember, we have the partitioner to avoid speed regression due to some overhead in class inheritance.

we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner

I think this something that would lead to regression and the current workaround is actually a workaround abstract class.

@thomasjpfan would know the details. Having his feedback here would be wonderful because we could use it to document the files.

adam2392 · 2024-07-11T12:16:26Z

Here is my current understanding of Partitioner and Splitter (I can edit this as my understanding improves):

Splitter: handles the algorithm for determining a split node, either via a greedy, or random search over a max_features set of chosen features. This is sort of the "director" for how partitioner and criterion work together to determine, evaluate and apply the split. This is invariant to whether or not the data is sparse, or dense, as the partitioner handles how to actually separate the samples.

Partitioner: this performs the actual partitioning of the samples index array based on split points, missing values, etc. In the future is where we would partition samples to left/right as well for categorical splits.

adam2392 · 2024-07-11T12:16:30Z

we can decrease the code complexity by creating an abstract base class for the DensePartitioner and SparsePartitioner

I think this something that would lead to regression and the current workaround is actually a workaround abstract class.

Yes IIUC that was what was described here: #25306. However, I'm wondering if the vtable lookup is still incurred if we just have an abstract base class, @final decorators on the actual SparsePartitioner, DensePartitoner and fusedtype of the partitioners used in Splitter.

Just to summarize the state of affairs. If the following is the _partitioner.pyx file:

# _partitioner.pyx
cdef class AbstractPartitioner
     cdef void abstract_methods(self):

@final
cdef class SparsePartitioner(AbstractPartitioner):
     # actual implementation
     cdef inline void abstract_methods(self):

@final
cdef class DensePartitioner(AbstractPartitioner):
     # actual implementation
     cdef inline void abstract_methods(self):

then this would get hit w/ vtable lookup(?)

# _splitterv1.pyx
cdef node_split(AbstractPartitioner partitioner, …)
      # do stuff
      partitioner.abstract_methods()

However, I'm wondering if the following _splitterv2 is hit since the fused type should allow the compiler to know the type during compilation?

# _splitterv2.pyx
ctypedef fused Partitioner:
     SparsePartitioner
     DensePartitioner

cdef node_split(Partitioner partitioner, …)
      # do stuff
      partitioner.abstract_methods()

Forgive my understanding of vtable lookup if this is naive/incorrect.

Tho its in C++, my guess is this SO post is somewhat related in terms of hitting a vtable lookup when compiling(?): https://stackoverflow.com/questions/64808615/is-a-vtable-lookup-performed-when-indirectly-calling-a-virtual-method-in-a-non-o.

adam2392 · 2024-07-12T12:59:42Z

I ran some benchmarks using asv on the PR branch and the one on main, and I actually don't see any real performance diffs:

asv run --bench RandomForestClassifierBenchmark.time_fit —verbose

Partition PR (fused type)
[100.00%] ··· ================ =========
              --                 n_jobs 
              ---------------- ---------
               representation      1    
              ================ =========
                   dense        5.34±0s 
                   sparse       6.62±0s 
              ================ =========

Main 
[100.00%] ··· ================ ============
              --                  n_jobs   
              ---------------- ------------
               representation       1      
              ================ ============
                   dense        5.32±0.02s 
                   sparse        6.63±0s   
              ================ ============

Perhaps @thomasjpfan has some insight into when the vtable lookup hit was occurring? Is it only if the abstract base class has methods actually defined rather than "pass", or something else perhaps.

thomasjpfan · 2024-07-15T21:28:10Z

Having his feedback here would be wonderful because we could use it to document the files.

The approach in #29458 should be okay when it comes to v-table lookups. I left a review in #29458 (review) with my thoughts. Overall, I do not really want to depend on Cython to have the correct v-table behavior with a base class. The alternative is to to define the partitioner interface with a comment rather than an actual base class.

github-actions bot added the Needs Triage Issue requires triage label Jul 10, 2024

adam2392 added module:tree cython labels Jul 10, 2024

adam2392 mentioned this issue Jul 10, 2024

MAINT Pull apart Splitter and Partitioner in the sklearn/tree code #29458

Merged

glemaitre removed the Needs Triage Issue requires triage label Jul 11, 2024

thomasjpfan closed this as completed in #29458 Jul 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT, RFC Simplify the Cython code in `sklearn/tree/` by splitting the "Splitter" and "Partitioner" code #29459

MAINT, RFC Simplify the Cython code in `sklearn/tree/` by splitting the "Splitter" and "Partitioner" code #29459

adam2392 commented Jul 10, 2024 •

edited

Loading

adam2392 commented Jul 10, 2024

glemaitre commented Jul 11, 2024

adam2392 commented Jul 11, 2024 •

edited

Loading

adam2392 commented Jul 11, 2024 •

edited

Loading

adam2392 commented Jul 12, 2024 •

edited

Loading

thomasjpfan commented Jul 15, 2024

MAINT, RFC Simplify the Cython code in sklearn/tree/ by splitting the "Splitter" and "Partitioner" code #29459

MAINT, RFC Simplify the Cython code in sklearn/tree/ by splitting the "Splitter" and "Partitioner" code #29459

Comments

adam2392 commented Jul 10, 2024 • edited Loading

Summary of the problem

Proposed change

adam2392 commented Jul 10, 2024

glemaitre commented Jul 11, 2024

adam2392 commented Jul 11, 2024 • edited Loading

adam2392 commented Jul 11, 2024 • edited Loading

adam2392 commented Jul 12, 2024 • edited Loading

thomasjpfan commented Jul 15, 2024

MAINT, RFC Simplify the Cython code in `sklearn/tree/` by splitting the "Splitter" and "Partitioner" code #29459

MAINT, RFC Simplify the Cython code in `sklearn/tree/` by splitting the "Splitter" and "Partitioner" code #29459

adam2392 commented Jul 10, 2024 •

edited

Loading

adam2392 commented Jul 11, 2024 •

edited

Loading

adam2392 commented Jul 11, 2024 •

edited

Loading

adam2392 commented Jul 12, 2024 •

edited

Loading