-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MNT Refactor tree splitter to use memoryviews #23273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if start_positive < end: | ||
simultaneous_sort(&Xf[start_positive], &samples[start_positive], end - start_positive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fixes a bug on main where start_positive == end
, which can lead to samples[start_positive]
being out of bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to write a non-regression test to trigger this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few tests that trigger this case causing the CI to fail before. Note these test only fail when compiled with SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES=1
, which enables bound checking on memoryviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the bug fixes would deserve a changelog entry and ideally a non-regression test.
About the use of typed memoryviews, it looks good to me. I tried to see if the new code would not leak any memory using a for loop with psutil
prints and all things good (memory usage is stable after a few iterations, as in in main
).
I am surprised that we seem to observe a small but significant speed-up with the MSE criterion. I am not sure what is causing this. Maybe the compiler can optimize things further with the C code generated by typed memory views (e.g. contiguity explicitly declared with the [::1]
notation?).
a825b50
to
2be7035
Compare
2be7035
to
fcf5675
Compare
cdef SIZE_t n_features # X.shape[1] | ||
cdef DTYPE_t* feature_values # temp. array holding feature values | ||
cdef DTYPE_t[::1] feature_values # temp. array holding feature values | ||
|
||
cdef SIZE_t start # Start position for the current node | ||
cdef SIZE_t end # End position for the current node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to change as sample_weight
in a future PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was referring to the pointer that is 2 lines below sample_weight
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @glemaitre means:
scikit-learn/sklearn/tree/_splitter.pxd
Line 61 in d247579
cdef DOUBLE_t* sample_weight |
Yes I plan to do it in the future. sample_weight
touches multiple files, so I wanted to do it in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok I see: the sample_weight attribute below is still defined as a pointer (cdef DOUBLE_t* sample_weight
) and it could also be changed to a memory view.
+1. I am fine for doing this in a later PR and merge this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small question before merging.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Fixes binary incompatability due to scikit-learn/scikit-learn#23273 ==1.1 enforces scikit-learn 1.1.0, but ~=1.1.0 allows for any version up to 1.2
This PR refactors the tree splitter to use memoryview and allows Python to manage memory.
Benchmark
Running this benchmark that compares best/random splitter and numpy/sparse input between this PR and main:
Overall, this PR has the same runtime performance as
main
for dense input. For sparse input, this PR does a little better.