ENH Use simultaneous sort in tree splitter #22868
Conversation
Thank you, @thomasjpfan!
Better performance while removing custom implementations: everything maintainers love!
Before clicking "✅ Approve": do `simultaneous_sort`, `introsort`, and `heap_sort` share the same stability?
Edit: I think addressing #22827 for `sklearn.tree` (and probably also `sklearn.ensemble`) would be preferable before subsequently refactoring this module. What do you think?
@thomasjpfan already got a maintainability award in #22365, and most of the work here was prepared by @jjerphan, so the jury needs to think about what to do.
They do not share the same stability. I ran an experiment with a small Python interface:

```cython
# tree sort
def py_sort(DTYPE_t[:] Xf, SIZE_t[:] samples):
    cdef SIZE_t n = Xf.shape[0]
    sort(&Xf[0], &samples[0], n)

# simultaneous sort
def py_simultaneous_sort(floating[:] values, ITYPE_t[:] indices):
    cdef ITYPE_t n = values.shape[0]
    simultaneous_sort(&values[0], &indices[0], n)
```

with:

```python
values = np.asarray([1.1, -0.5, -0.5, -0.18, -0.5], dtype=np.float32)
indices = np.arange(5)
```

Results:

```python
# tree sort result indices
array([4, 1, 2, 3, 0])
# simultaneous sort
array([2, 4, 1, 3, 0])
```

Both implementations are unstable.

I updated the original message with benchmarks using other tree-based models and observe similar performance improvements. Since a model trained on the same data may be different because of this PR, I added an entry in the "Changed models" section here: e9ba146

It would be nice to have it before this PR, but I do not see it as a strict requirement.
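The stability point can also be checked in plain NumPy (a sketch, independent of the Cython routines in the PR): a stable argsort keeps tied values in their original relative order, which neither of the two introsort-based routines guarantees.

```python
import numpy as np

values = np.asarray([1.1, -0.5, -0.5, -0.18, -0.5], dtype=np.float32)
indices = np.arange(5)

# The three ties (-0.5 at positions 1, 2, and 4) keep their original
# relative order under a stable sort, unlike in the outputs above.
stable = indices[np.argsort(values, kind="stable")]
print(stable)  # -> [1 2 4 3 0]
```

Comparing this stable reference ordering against the two outputs above is what shows that both library routines reorder ties.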
LGTM. Thank you, @thomasjpfan!

> It would be nice to have it before this PR, but I do not see it as a strict requirement.

Neither do I, now that you have provided experiments regarding the two algorithms' stability.
Should we wait for another maintainer to merge this PR, given the changes to models?
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Since both the old and the new algorithms are unstable, I think this is good to go.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This reverts commit 4cf932d.
This PR replaces the use of `sort` in the tree splitter with `simultaneous_sort`. Running the following benchmark script:

**Benchmark**

**Decision Tree Benchmarks**

I see the performance benefits for this PR compared to `main`:

**RandomForest Benchmarks**

**GradientBoosting Benchmarks**

(These use `n_features=20` and fewer samples, since the runtime is longer overall.)

CC @jjerphan