Skip to content

MAINT Pull apart Splitter and Partitioner in the sklearn/tree code #29458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Jul 20, 2024

Conversation

adam2392
Copy link
Member

@adam2392 adam2392 commented Jul 10, 2024

Reference Issues/PRs

Fixes: #29459

Note: this is entirely just a moving of code, nothing changes functionally, except the abstract implementation (see below). This should hopefully be pretty easy to review since no unit-tests will change, or fail.

What does this implement/fix? Explain your changes.

  • Separates the concept of Splitter and Partitioner into two separate Cython files to make the code easier to read and maintain.
  • Adds an abstract base class called BasePartitioner to limit the number of repeated function definitions. I can remove this or add as an in-line comment as suggested by @thomasjpfan. I have it currently added, so lmk what you think.

These are the summary of the code moves:

  1. sorting functions: _splitter.pyx -> _partitioner.pyx
  2. partitioner classes: _splitter.pyx -> _partitioner.pyx
  3. any moving of samples functions (e.g. shift_missing_values_to_left_if_required) moved to _partitioner.pyx
  4. (Optional) Implementation of an abstract BasePartitioner class, so the partitioner definition is handled in this abstract class. I can remove this if people want. See: MAINT Pull apart Splitter and Partitioner in the sklearn/tree code #29458 (comment)

Any other comments?

This shouldn't introduce any performance regressions as the computational tricks used to make the code the same speed are still there:

  1. DensePartitioner and SparsePartitioner are still decorated with @final, so they are not able to be subclassed and Cython will optimize the code
  2. The fused type trick within _splitter.pyx is still used to define a join of the DensePartitioner and SparsePartitioner. I'm actually not 100% sure this is needed… But I think if we use the Partitioner class in _partiitioner.pxd, this may incur performance issues via vtable lookup?

I ran some benchmarks using asv on this PR branch and the one on main, and I actually don't see that many diffs:

asv run --bench RandomForestClassifierBenchmark.time_fit —verbose

Partition PR (fused type)
[100.00%] ··· ================ =========
              —                 n_jobs 
              ---------------- ————
               representation      1    
              ================ =========
                   dense        5.34±0s 
                   sparse       6.62±0s 
              ================ =========

Main 
[100.00%] ··· ================ ============
              --                  n_jobs   
              ---------------- ——————
               representation       1      
              ================ ============
                   dense        5.32±0.02s 
                   sparse        6.63±0s   
              ================ ============

Signed-off-by: Adam Li <adam2392@gmail.com>
Copy link

github-actions bot commented Jul 10, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e34ef57. Link to the linter CI: here

adam2392 added 3 commits July 10, 2024 10:56
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 marked this pull request as ready for review July 10, 2024 21:39
@adam2392
Copy link
Member Author

Lmk if this diff is too large… I couldn't see another way to do it (e.g. incrementally), since all the functions sort of rely on each other.

I guess one thing I could do is do the "sorting" functions first, and then one partitioner at a time, but that seems kind of tedious(?)

Signed-off-by: Adam Li <adam2392@gmail.com>
adam2392 added 3 commits July 12, 2024 07:32
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Copy link
Member Author

What do you think of this refactoring proposal (no code functionality change).

cc: @thomasjpfan, @glemaitre

cdef float32_t INFINITY_32t = np.inf


cdef class Partitioner:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to your opening comment, is this suppose to be BasePartitioner?

Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.

If we want to be safe, we can comment out the interface and let it be a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to your opening comment, is this suppose to be BasePartitioner?

Yes, I should change it back.

Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.

Yes it's just to define the interface, but I noticed no runtime regression. Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?

Yea, that was my suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kay I have done that in e24cda9

# but it would have resulted in a ~10% overall tree fitting performance
# degradation caused by the overhead frequent virtual method lookups.
ctypedef fused Partitioner:
DensePartitioner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is still a fused type, I think there should not be any v-table look ups.

But to be sure, can you check the C++ code in node_split_best to see if there is a v-table look up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated C++ code shows:

static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_best(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_best(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_random(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_random(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/

so there is no v-table lookup as there is a specialized function generated for each combination of node_split_best/node_split_random and Dense/Sparse Partitioner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I reverted the BasePartitioner in e24cda9 and implemented your suggestion: #29458 (comment)

adam2392 added 6 commits July 15, 2024 21:20
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 requested review from nithish08 and glemaitre and removed request for nithish08 July 16, 2024 14:14
@adam2392 adam2392 requested a review from thomasjpfan July 16, 2024 14:14
@adam2392
Copy link
Member Author

Perhaps @adrinjalali you're interested in this too?

This will help the categorical splitting for Decision Trees be a bit easier to review #29437.

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @adam2392. I think the overall structure looks nice. Could you resolve the conflicts and incorporate uint8_t here as well?

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Copy link
Member Author

Thanks for the PR @adam2392. I think the overall structure looks nice. Could you resolve the conflicts and incorporate uint8_t here as well?

Thanks for the review @OmarManzoor! I have addressed these issues.



# Sort n-element arrays pointed to by feature_values and samples, simultaneously,
# by the values in feature_values. Algorithm: Introsort (Musser, SP&E, 1997).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be feasible to move all these sorting functions towards the end of the file so that we have the common functions in one place and the classes are present together.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this alone as this was how it was structured in _splitter.pyx. These functions end up being after their corresponding "Splitter". I don't have strong feelings though.

Perhaps @thomasjpfan has some thoughts since he's browsing the tree code often :p

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking it over, I like @OmarManzoor suggestion and moving the functions to be under SparsePartitioner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Makes sense to me. Done in e34ef57

adam2392 added 2 commits July 18, 2024 09:19
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 added Quick Review For PRs that are quick to review Waiting for Second Reviewer First reviewer is done, need a second one! labels Jul 19, 2024
adam2392 added 4 commits July 19, 2024 11:58
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@thomasjpfan thomasjpfan merged commit c3fed50 into scikit-learn:main Jul 20, 2024
30 checks passed
@adam2392 adam2392 deleted the partition branch July 20, 2024 21:19
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cython module:tree No Changelog Needed Quick Review For PRs that are quick to review Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MAINT, RFC Simplify the Cython code in sklearn/tree/ by splitting the "Splitter" and "Partitioner" code
4 participants