MAINT Pull apart Splitter and Partitioner in the sklearn/tree code #29458

adam2392 · 2024-07-10T14:23:39Z

Reference Issues/PRs

Note: this PR only moves code; nothing changes functionally except the abstract implementation (see below). It should be fairly easy to review, since no unit tests change or fail.

What does this implement/fix? Explain your changes.

Separates the concept of Splitter and Partitioner into two separate Cython files to make the code easier to read and maintain.
Adds an abstract base class called BasePartitioner to limit the number of repeated function definitions. I can remove this or add as an in-line comment as suggested by @thomasjpfan. I have it currently added, so lmk what you think.

Summary of the code moves:

sorting functions: _splitter.pyx -> _partitioner.pyx
partitioner classes: _splitter.pyx -> _partitioner.pyx
functions that move samples (e.g. shift_missing_values_to_left_if_required): _splitter.pyx -> _partitioner.pyx
(Optional) Implementation of an abstract BasePartitioner class, so the partitioner definition is handled in this abstract class. I can remove this if people want. See: #29458 (comment)

Any other comments?

This shouldn't introduce any performance regressions, since the tricks that keep the code fast are still in place:

DensePartitioner and SparsePartitioner are still decorated with @final, so they cannot be subclassed and Cython will optimize the code
The fused type trick within _splitter.pyx is still used to define a join of the DensePartitioner and SparsePartitioner. I'm actually not 100% sure this is needed… But I think if we use the Partitioner class in _partitioner.pxd, this may incur performance issues via vtable lookup?

I ran some benchmarks using asv on this PR branch and on main, and I don't see much of a difference:

asv run --bench RandomForestClassifierBenchmark.time_fit --verbose

Partition PR (fused type)
[100.00%] ··· ================ =========
              --                 n_jobs  
              ---------------- ---------
               representation      1    
              ================ =========
                   dense        5.34±0s 
                   sparse       6.62±0s 
              ================ =========

Main 
[100.00%] ··· ================ ============
              --                  n_jobs   
              ---------------- ------------
               representation       1      
              ================ ============
                   dense        5.32±0.02s 
                   sparse        6.63±0s   
              ================ ============
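To put the two asv tables above side by side, here is a minimal Python sketch that computes the relative change of the PR branch against main. The timings are copied from the tables; the function and variable names are made up for illustration, and the comparison ignores the reported ± spread.

```python
# Mean fit times in seconds, copied from the two asv tables above.
pr_times = {"dense": 5.34, "sparse": 6.62}
main_times = {"dense": 5.32, "sparse": 6.63}


def relative_change(new: float, old: float) -> float:
    """Return (new - old) / old expressed as a percentage."""
    return 100.0 * (new - old) / old


# Relative change of the PR branch with respect to main, per representation.
changes = {
    rep: relative_change(pr_times[rep], main_times[rep]) for rep in pr_times
}

for rep, pct in changes.items():
    print(f"{rep}: {pct:+.2f}% relative to main")
```

Both changes come out well below one percent in magnitude, which is consistent with the claim that the move does not affect fit time in this benchmark.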

Signed-off-by: Adam Li <adam2392@gmail.com>

github-actions · 2024-07-10T14:25:03Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e34ef57. Link to the linter CI: here

adam2392 · 2024-07-11T00:57:51Z

Lmk if this diff is too large… I couldn't see another way to do it (e.g. incrementally), since all the functions sort of rely on each other.

I guess one thing I could do is do the "sorting" functions first, and then one partitioner at a time, but that seems kind of tedious(?)

sklearn/tree/_splitter.pyx

asv_benchmarks/asv.conf.json

benchmarks/bench_tree.py

adam2392 · 2024-07-15T13:04:54Z

What do you think of this refactoring proposal (no change in code functionality)?

cc: @thomasjpfan, @glemaitre

thomasjpfan · 2024-07-15T21:16:15Z

sklearn/tree/_partitioner.pyx

+cdef float32_t INFINITY_32t = np.inf
+
+
+cdef class Partitioner:


According to your opening comment, is this supposed to be BasePartitioner?

Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.

If we want to be safe, we can comment out the interface and let it be a comment.

According to your opening comment, is this supposed to be BasePartitioner?

Yes, I should change it back.

Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.

Yes it's just to define the interface, but I noticed no runtime regression. Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?

Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?

Yea, that was my suggestion.

Kay I have done that in e24cda9

thomasjpfan · 2024-07-15T21:21:59Z

sklearn/tree/_splitter.pyx

+# but it would have resulted in a ~10% overall tree fitting performance
+# degradation caused by the overhead frequent virtual method lookups.
+ctypedef fused Partitioner:
+    DensePartitioner


Since this is still a fused type, I think there should not be any v-table look ups.

But to be sure, can you check the C++ code in node_split_best to see if there is a v-table look up?

The generated C++ code shows:

static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_best(
    struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *,
    struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *,
    struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *,
    struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *,
    struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_best(
    struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *,
    struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *,
    struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *,
    struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *,
    struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_random(
    struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *,
    struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *,
    struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *,
    struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *,
    struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_random(
    struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *,
    struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *,
    struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *,
    struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *,
    struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/

so there is no v-table lookup as there is a specialized function generated for each combination of node_split_best/node_split_random and Dense/Sparse Partitioner.
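One way to repeat this check mechanically is to scan the generated source for the fused specializations. The Python sketch below is illustrative only: the `generated` string is a shortened stand-in for the prototypes quoted above (argument lists elided), and the regular expression is an assumption about the name mangling based solely on those four names, not a documented Cython guarantee.

```python
import re

# Shortened stand-in for the generated prototypes quoted above; the argument
# lists are elided because only the function names matter for this check.
generated = """
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_best(...); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_best(...); /*proto*/
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_random(...); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_random(...); /*proto*/
"""

# Capture the specialization index and the function name at the end of the
# mangled identifier (pattern assumed from the names shown in this thread).
pattern = re.compile(
    r"__pyx_fuse_(\d+)__pyx_f_\w*?_(node_split_(?:best|random))\("
)

# Sorted list of (function name, specialization index) pairs.
specializations = sorted(
    (name, int(index)) for index, name in pattern.findall(generated)
)

for name, index in specializations:
    print(f"{name}: specialization {index}")
```

In the prototypes quoted above, index 0 takes a DensePartitioner and index 1 a SparsePartitioner, so finding both indices for both functions matches the "one specialized function per combination" observation.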

For now, I reverted the BasePartitioner in e24cda9 and implemented your suggestion: #29458 (comment)

adam2392 · 2024-07-16T14:42:57Z

Perhaps @adrinjalali you're interested in this too?

This will make the categorical splitting for decision trees (#29437) a bit easier to review.

OmarManzoor

Thanks for the PR @adam2392. I think the overall structure looks nice. Could you resolve the conflicts and incorporate uint8_t here as well?

adam2392 · 2024-07-18T12:18:54Z

Thanks for the PR @adam2392. I think the overall structure looks nice. Could you resolve the conflicts and incorporate uint8_t here as well?

Thanks for the review @OmarManzoor! I have addressed these issues.

OmarManzoor · 2024-07-18T12:34:48Z

sklearn/tree/_partitioner.pyx

+
+
+# Sort n-element arrays pointed to by feature_values and samples, simultaneously,
+# by the values in feature_values. Algorithm: Introsort (Musser, SP&E, 1997).


Would it be feasible to move all these sorting functions towards the end of the file, so that the common functions are in one place and the classes sit together?

I left this alone as this was how it was structured in _splitter.pyx. These functions end up being after their corresponding "Splitter". I don't have strong feelings though.

Perhaps @thomasjpfan has some thoughts since he's browsing the tree code often :p

Looking it over, I like @OmarManzoor's suggestion of moving the functions to be under SparsePartitioner.

Cool. Makes sense to me. Done in e34ef57
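For readers unfamiliar with the sorting helpers discussed in this thread, the quoted comment describes sorting two parallel arrays at once, keyed on feature_values. The Python sketch below illustrates only that contract. It relies on Python's built-in sort rather than introsort, and the function name is hypothetical; it is not the Cython implementation from _partitioner.pyx.

```python
def sort_together(feature_values, samples):
    """Sort both lists in place, ordered by the values in feature_values.

    Illustrative stand-in: uses Python's built-in sort, not introsort.
    """
    # Indices of feature_values in ascending order of their values.
    order = sorted(range(len(feature_values)), key=feature_values.__getitem__)
    # Apply the same permutation to both lists so the pairs stay aligned.
    feature_values[:] = [feature_values[i] for i in order]
    samples[:] = [samples[i] for i in order]


feature_values = [0.7, 0.1, 0.4]
samples = [10, 11, 12]
sort_together(feature_values, samples)
print(feature_values, samples)
```

After the call, each entry of samples still sits at the same position as the feature value it was paired with before sorting.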

OmarManzoor

Otherwise LGTM

sklearn/tree/_partitioner.pxd

sklearn/tree/_splitter.pxd

Maintainencne show

42b6427

Signed-off-by: Adam Li <adam2392@gmail.com>

github-actions bot added cython module:tree labels Jul 10, 2024

adam2392 mentioned this pull request Jul 10, 2024

MAINT, RFC Simplify the Cython code in sklearn/tree/ by splitting the "Splitter" and "Partitioner" code #29459

Closed

adam2392 added 3 commits July 10, 2024 10:56

Include in setup

d533205

Signed-off-by: Adam Li <adam2392@gmail.com>

Fix lint

454c535

Signed-off-by: Adam Li <adam2392@gmail.com>

Try some asv stuff

b2fe94e

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 marked this pull request as ready for review July 10, 2024 21:39

Update benchmarking

bee1876

Signed-off-by: Adam Li <adam2392@gmail.com>

nithish08 reviewed Jul 11, 2024

sklearn/tree/_splitter.pyx (resolved)

adam2392 added 3 commits July 12, 2024 07:32

Merging main

7352549

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'main' into partition

1d764ac

Add file headers

dec2017

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 commented Jul 15, 2024

asv_benchmarks/asv.conf.json (outdated, resolved)

adam2392 commented Jul 15, 2024

benchmarks/bench_tree.py (outdated, resolved)

Merge branch 'main' into partition

7aa1b1e

thomasjpfan reviewed Jul 15, 2024

adam2392 added 6 commits July 15, 2024 21:20

Specialize the interface

3c8f08b

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'partition' of https://github.com/adam2392/scikit-learn …

a7504f1

…into partition

Merge branch 'main' into partition

2ac5277

Merge branch 'main' into partition

fedc609

Reverse diff

929035e

Signed-off-by: Adam Li <adam2392@gmail.com>

Reverse diff

75158f2

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 requested review from nithish08 and glemaitre and removed request for nithish08 July 16, 2024 14:14

adam2392 requested a review from thomasjpfan July 16, 2024 14:14

adam2392 mentioned this pull request Jul 17, 2024

FEA Categorical split support for DecisionTree*, ExtraTree*, RandomForest* and ExtraTrees* #29437

Draft

OmarManzoor reviewed Jul 18, 2024

Merge branch 'main' into partition

2cb4ed0

adam2392 added the No Changelog Needed label Jul 18, 2024

Incorporate uint8_t in Partitioner

25134cf

Signed-off-by: Adam Li <adam2392@gmail.com>

OmarManzoor reviewed Jul 18, 2024

adam2392 added 2 commits July 18, 2024 09:19

Move another extra function

517a64d

Signed-off-by: Adam Li <adam2392@gmail.com>

Clean up imports

ca6ffb9

Signed-off-by: Adam Li <adam2392@gmail.com>

OmarManzoor approved these changes Jul 18, 2024

sklearn/tree/_partitioner.pxd (outdated, resolved)

sklearn/tree/_splitter.pxd (outdated, resolved)

Address omar comments

d77d6ed

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 added Quick Review For PRs that are quick to review Waiting for Second Reviewer First reviewer is done, need a second one! labels Jul 19, 2024

adam2392 added 4 commits July 19, 2024 11:58

Merge branch 'main' into partition

65a6318

Address thomas' comments

e24cda9

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'partition' of https://github.com/adam2392/scikit-learn …

310a476

…into partition

Move functions

e34ef57

Signed-off-by: Adam Li <adam2392@gmail.com>

thomasjpfan approved these changes Jul 20, 2024

thomasjpfan merged commit c3fed50 into scikit-learn:main Jul 20, 2024
30 checks passed

adam2392 deleted the partition branch July 20, 2024 21:19

MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024

MAINT Pull apart Splitter and Partitioner in the sklearn/tree code (s…

4c3db2f

…cikit-learn#29458) Signed-off-by: Adam Li <adam2392@gmail.com>

sebp mentioned this pull request Dec 28, 2024

scikit-learn 1.6 changed behavior of growing trees #30554

Closed


