-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT Pull apart Splitter and Partitioner in the sklearn/tree code #29458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Lmk if this diff is too large… I couldn't see another way to do it (e.g. incrementally), since all the functions sort of rely on each other. I guess one thing I could do is do the "sorting" functions first, and then one partitioner at a time, but that seems kind of tedious(?) |
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
What do you think of this refactoring proposal (no code functionality change). cc: @thomasjpfan, @glemaitre |
sklearn/tree/_partitioner.pyx
Outdated
cdef float32_t INFINITY_32t = np.inf | ||
|
||
|
||
cdef class Partitioner: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to your opening comment, is this suppose to be BasePartitioner
?
Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.
If we want to be safe, we can comment out the interface and let it be a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to your opening comment, is this suppose to be
BasePartitioner
?
Yes, I should change it back.
Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.
Yes it's just to define the interface, but I noticed no runtime regression. Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?
Yea, that was my suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kay I have done that in e24cda9
# but it would have resulted in a ~10% overall tree fitting performance | ||
# degradation caused by the overhead frequent virtual method lookups. | ||
ctypedef fused Partitioner: | ||
DensePartitioner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is still a fused type, I think there should not be any v-table look ups.
But to be sure, can you check the C++ code in node_split_best
to see if there is a v-table look up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generated C++ code shows:
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_best(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_best(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_random(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_random(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
so there is no v-table lookup as there is a specialized function generated for each combination of node_split_best/node_split_random
and Dense/Sparse Partitioner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I reverted the BasePartitioner in e24cda9 and implemented your suggestion: #29458 (comment)
Signed-off-by: Adam Li <adam2392@gmail.com>
…into partition
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Perhaps @adrinjalali you're interested in this too? This will help the categorical splitting for Decision Trees be a bit easier to review #29437. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @adam2392. I think the overall structure looks nice. Could you resolve the conflicts and incorporate uint8_t here as well?
Signed-off-by: Adam Li <adam2392@gmail.com>
Thanks for the review @OmarManzoor! I have addressed these issues. |
sklearn/tree/_partitioner.pyx
Outdated
|
||
|
||
# Sort n-element arrays pointed to by feature_values and samples, simultaneously, | ||
# by the values in feature_values. Algorithm: Introsort (Musser, SP&E, 1997). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be feasible to move all these sorting functions towards the end of the file so that we have the common functions in one place and the classes are present together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this alone as this was how it was structured in _splitter.pyx
. These functions end up being after their corresponding "Splitter". I don't have strong feelings though.
Perhaps @thomasjpfan has some thoughts since he's browsing the tree code often :p
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking it over, I like @OmarManzoor suggestion and moving the functions to be under SparsePartitioner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Makes sense to me. Done in e34ef57
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
…into partition
Signed-off-by: Adam Li <adam2392@gmail.com>
…cikit-learn#29458) Signed-off-by: Adam Li <adam2392@gmail.com>
Reference Issues/PRs
Fixes: #29459
Note: this is entirely just a moving of code, nothing changes functionally, except the abstract implementation (see below). This should hopefully be pretty easy to review since no unit-tests will change, or fail.
What does this implement/fix? Explain your changes.
BasePartitioner
to limit the number of repeated function definitions. I can remove this or add as an in-line comment as suggested by @thomasjpfan. I have it currently added, so lmk what you think.These are the summary of the code moves:
_splitter.pyx
->_partitioner.pyx
_splitter.pyx
->_partitioner.pyx
shift_missing_values_to_left_if_required
) moved to_partitioner.pyx
BasePartitioner
class, so the partitioner definition is handled in this abstract class. I can remove this if people want. See: MAINT Pull apart Splitter and Partitioner in the sklearn/tree code #29458 (comment)Any other comments?
This shouldn't introduce any performance regressions as the computational tricks used to make the code the same speed are still there:
DensePartitioner
andSparsePartitioner
are still decorated with@final
, so they are not able to be subclassed and Cython will optimize the code_splitter.pyx
is still used to define a join of theDensePartitioner
andSparsePartitioner
. I'm actually not 100% sure this is needed… But I think if we use thePartitioner
class in_partiitioner.pxd
, this may incur performance issues via vtable lookup?I ran some benchmarks using asv on this PR branch and the one on
main
, and I actually don't see that many diffs:asv run --bench RandomForestClassifierBenchmark.time_fit —verbose