-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT, RFC Simplify the Cython code in sklearn/tree/
by splitting the "Splitter" and "Partitioner" code
#29459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Would ppl be open to this simplification? I'm a bit biased as even from my personal experience, it was a headache reading through the entire Assuming there is no performance degradation. |
I think that I'm fine with the split. One thing that we need to do when splitting the file is to exactly explain the job of the splitter and partitioner: looking at the name one could think that they do the same thing (they cut data). I therefore think that we need to have a kind of developer documentation on the top of the file to explain why those are split: as far I remember, we have the partitioner to avoid speed regression due to some overhead in class inheritance.
I think this something that would lead to regression and the current workaround is actually a workaround abstract class. @thomasjpfan would know the details. Having his feedback here would be wonderful because we could use it to document the files. |
Here is my current understanding of Partitioner and Splitter (I can edit this as my understanding improves): Splitter: handles the algorithm for determining a split node, either via a greedy, or random search over a Partitioner: this performs the actual partitioning of the samples index array based on split points, missing values, etc. In the future is where we would partition samples to left/right as well for categorical splits. |
Yes IIUC that was what was described here: #25306. However, I'm wondering if the vtable lookup is still incurred if we just have an abstract base class, Just to summarize the state of affairs. If the following is the
then this would get hit w/ vtable lookup(?)
However, I'm wondering if the following
Forgive my understanding of vtable lookup if this is naive/incorrect. Tho its in C++, my guess is this SO post is somewhat related in terms of hitting a vtable lookup when compiling(?): https://stackoverflow.com/questions/64808615/is-a-vtable-lookup-performed-when-indirectly-calling-a-virtual-method-in-a-non-o. |
I ran some benchmarks using asv on the PR branch and the one on
Perhaps @thomasjpfan has some insight into when the vtable lookup hit was occurring? Is it only if the abstract base class has methods actually defined rather than "pass", or something else perhaps. |
The approach in #29458 should be okay when it comes to v-table lookups. I left a review in #29458 (review) with my thoughts. Overall, I do not really want to depend on Cython to have the correct v-table behavior with a base class. The alternative is to to define the partitioner interface with a comment rather than an actual base class. |
Summary of the problem
There are quite a number of GH issues with the label
tree
(https://github.com/scikit-learn/scikit-learn/issues?page=2&q=is%3Aopen+is%3Aissue+label%3Amodule%3Atree).However, the code is a bit hard to approach as there's so many moving parts. For example, see: #18448 (comment), where users are intimidated by the Cython codebase. A large part of the complexity comes in the splitter, which contains the most algorithmic logic. I think with the work done by @thomasjpfan recently, we were able to pull apart the idea of a "partitioner" and "splitter" in the trees.
This problem also arises because https://github.com/scikit-learn/scikit-learn/pull/29437/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fc makes the code-diff very large inside
_splitter.pyx
file making it also harder to review and understand what changes affect the "splitter" and what changes affect the "partitioner".Proposed change
I further propose to split these into separate files, so it is easier to maintain, read and determine what is affecting what. In addition, we can decrease the code complexity by creating an abstract base class for the
DensePartitioner
andSparsePartitioner
.See the following PR for an example of what would change. The
_splitter.pyx
file would decrease by over 800 LOC, while it gets moved to_partitioner.pyx
.The text was updated successfully, but these errors were encountered: