MNT Refactor tree splitter to use memoryviews #23273

thomasjpfan · 2022-05-04T03:13:29Z

This PR refactors the tree splitter to use memoryviews, which lets Python manage the memory.
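As a rough illustration of what "Python manages the memory" means, here is a minimal pure-Python sketch using the built-in memoryview and the stdlib array module. It is not the Cython code from this PR; the helper names make_feature_buffer and try_resize_while_viewed are hypothetical.

```python
from array import array


def make_feature_buffer(n_samples):
    """Return a view onto a zero-filled float32 buffer (hypothetical helper).

    The array is never freed by hand: the returned view holds a reference
    to it, so it stays alive for as long as the view does.
    """
    storage = array("f", [0.0]) * n_samples
    return memoryview(storage)


def try_resize_while_viewed(storage):
    """Return True if `storage` could be resized while a view exists."""
    view = memoryview(storage)
    try:
        # CPython refuses to resize an array that is exporting a buffer,
        # so a live view can never point at freed memory.
        storage.append(0.0)
    except BufferError:
        return False
    finally:
        view.release()
    return True


feature_values = make_feature_buffer(4)
feature_values[0] = 1.5  # writes go through to the owning array
```

With a raw pointer, the equivalent buffer would need explicit allocation and a matching free; here the buffer is reclaimed once the last view goes away.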

Benchmark

I ran a benchmark that compares the best/random splitters on NumPy/sparse input between this PR and main.

Overall, this PR has the same runtime performance as main for dense input. For sparse input, this PR does a little better.

sklearn/tree/_splitter.pyx

thomasjpfan · 2022-05-04T18:25:26Z

+            if start_positive < end:
+                simultaneous_sort(&Xf[start_positive], &samples[start_positive], end - start_positive)


This fixes a bug on main: when start_positive == end, samples[start_positive] can be out of bounds.

Would it be possible to write a non-regression test to trigger this case?

A few existing tests trigger this case, which caused the CI to fail earlier. Note that these tests only fail when compiled with SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES=1, which enables bounds checking on memoryviews.
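The edge case discussed in this thread can be sketched in plain Python, where indexing is always bounds-checked (similar in spirit to the debug build mentioned above). This is only an illustration under assumptions: sort_positive_tail is a hypothetical stand-in for the guarded simultaneous_sort call, and it uses sorted() rather than the sort routine in the tree module.

```python
def sort_positive_tail(feature_values, samples, start_positive, end):
    """Jointly sort feature_values[start_positive:end] and the matching samples.

    The guard mirrors `if start_positive < end:` from the diff: when the
    range is empty, no element at index start_positive is ever touched.
    """
    if start_positive < end:
        pairs = sorted(
            zip(feature_values[start_positive:end], samples[start_positive:end])
        )
        for offset, (value, sample) in enumerate(pairs):
            feature_values[start_positive + offset] = value
            samples[start_positive + offset] = sample


def unguarded_first_sample(samples, start_positive):
    """Read samples[start_positive] without checking that the range is non-empty."""
    return samples[start_positive]
```

Calling sort_positive_tail with start_positive == end == len(samples) is a no-op, whereas unguarded_first_sample raises IndexError for the same arguments.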

ogrisel

I think the bug fixes would deserve a changelog entry and ideally a non-regression test.

About the use of typed memoryviews, it looks good to me. I checked whether the new code leaks memory by running it in a for loop with psutil prints, and all is good (memory usage is stable after a few iterations, as in main).
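A leak check of this kind can be sketched as follows. Note the swap: this sketch uses the stdlib tracemalloc module instead of psutil, so it only sees allocations made through Python's allocators, and example_workload is a hypothetical placeholder for fitting a tree.

```python
import gc
import tracemalloc


def track_memory(workload, n_iterations=5):
    """Run `workload` repeatedly and record traced memory after each run."""
    readings = []
    tracemalloc.start()
    try:
        for _ in range(n_iterations):
            workload()
            gc.collect()  # drop garbage before taking a reading
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
    finally:
        tracemalloc.stop()
    return readings


def example_workload():
    """Hypothetical workload: allocate, sort, and discard a temporary list."""
    data = [float(i % 97) for i in range(10_000)]
    data.sort()


readings = track_memory(example_workload)
```

If the readings keep growing from one iteration to the next, something is holding on to memory; if they level off after the first few iterations, usage is stable.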

I am surprised that we seem to observe a small but significant speed-up with the MSE criterion. I am not sure what is causing this. Maybe the compiler can optimize things further with the C code generated by typed memory views (e.g. contiguity explicitly declared with the [::1] notation?).
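To make the contiguity point concrete: in Cython, a `[::1]` declaration only accepts C-contiguous buffers. The pure-Python sketch below mimics that check with the built-in memoryview; require_c_contiguous is a hypothetical helper, and it says nothing about what the C compiler actually does with the generated code.

```python
from array import array


def require_c_contiguous(view):
    """Reject strided views, mimicking the check a `[::1]` declaration implies."""
    if not view.c_contiguous:
        raise ValueError("buffer is not C-contiguous")
    return view


values = memoryview(array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]))
dense = require_c_contiguous(values)  # accepted: elements are adjacent in memory
strided = values[::2]                 # every other element: two items per step
```

Passing strided to require_c_contiguous raises ValueError, since its elements are not adjacent in memory.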

ogrisel · 2022-05-09T16:39:16Z

I re-ran the benchmark on my own laptop with the latest version of this PR and I get similar results.

I don't think there is any significant difference between this branch and main.

glemaitre · 2022-05-13T12:20:42Z

sklearn/tree/_splitter.pxd

    cdef SIZE_t n_features               # X.shape[1]
-    cdef DTYPE_t* feature_values         # temp. array holding feature values
+    cdef DTYPE_t[::1] feature_values     # temp. array holding feature values

    cdef SIZE_t start                    # Start position for the current node
    cdef SIZE_t end                      # End position for the current node


Do you plan to change sample_weight as well in a future PR?

What do you mean?

Sorry, I was referring to the sample_weight pointer that is 2 lines below.

I think @glemaitre means:

scikit-learn/sklearn/tree/_splitter.pxd

Line 61 in d247579

cdef DOUBLE_t* sample_weight

Yes I plan to do it in the future. sample_weight touches multiple files, so I wanted to do it in another PR.

Ah ok I see: the sample_weight attribute below is still defined as a pointer (cdef DOUBLE_t* sample_weight) and it could also be changed to a memory view.

+1. I am fine with doing this in a later PR and merging this one.

Works for me.

glemaitre

Just a small question before merging.

MNT Refactor splitter to use memoryviews

a8a182c

github-actions bot added cython module:tree labels May 4, 2022

thomasjpfan marked this pull request as draft May 4, 2022 12:48

FIX Fixes edge case

9080ccc

thomasjpfan marked this pull request as ready for review May 4, 2022 12:54

thomasjpfan marked this pull request as draft May 4, 2022 13:01

FIX Fixes another edge case

be576c3

thomasjpfan marked this pull request as ready for review May 4, 2022 15:28

thomasjpfan marked this pull request as draft May 4, 2022 17:33

thomasjpfan marked this pull request as ready for review May 4, 2022 17:57

thomasjpfan marked this pull request as draft May 4, 2022 18:08

thomasjpfan marked this pull request as ready for review May 4, 2022 18:22

thomasjpfan commented May 4, 2022

ogrisel reviewed May 9, 2022

thomasjpfan force-pushed the mv_splitter_v2 branch from a825b50 to 2be7035 Compare May 9, 2022 15:31

ENH Do not define zero_pos at all

fcf5675

thomasjpfan force-pushed the mv_splitter_v2 branch from 2be7035 to fcf5675 Compare May 9, 2022 15:46

Merge remote-tracking branch 'upstream/main' into mv_splitter_v2

b5b66f5

ogrisel approved these changes May 9, 2022

thomasjpfan and others added 2 commits May 9, 2022 13:40

DOC Adds whats new

fd4322a

Merge branch 'main' into mv_splitter_v2

fe0d68d

glemaitre self-requested a review May 13, 2022 12:15

glemaitre reviewed May 13, 2022

glemaitre approved these changes May 13, 2022

glemaitre merged commit 962c9b4 into scikit-learn:main May 13, 2022

RMeli added a commit to RMeli/scikit-learn that referenced this pull request May 14, 2022

fix merge conflicts with PR scikit-learn#23273

39563fd

RMeli mentioned this pull request May 14, 2022

MNT Use cimport numpy as cnp for sklearn/tree #23315

Merged

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022

MNT Refactor tree splitter to use memoryviews (scikit-learn#23273)

f044ff1

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre added a commit that referenced this pull request Aug 5, 2022

MNT Refactor tree splitter to use memoryviews (#23273)

ad5936f

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 17, 2022

Require scikit-learn 1.1.0 compatible release

33b6d6a

Fixes binary incompatibility due to scikit-learn/scikit-learn#23273. ==1.1 enforces scikit-learn 1.1.0, but ~=1.1.0 allows for any version up to 1.2.
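The difference between the two specifiers can be sketched with a simplified matcher. This is an assumption-laden illustration: it handles purely numeric versions such as "1.1.3" only, and matches_exact and matches_compatible are hypothetical helpers, not the API of pip or any packaging library.

```python
def _parse(version):
    """Split a purely numeric version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def _pad(release, length):
    """Right-pad a release tuple with zeros so tuples compare element-wise."""
    return release + (0,) * (length - len(release))


def matches_exact(version, spec):
    """Simplified `==spec`: equal after zero-padding (1.1 equals 1.1.0)."""
    v, s = _parse(version), _parse(spec)
    n = max(len(v), len(s))
    return _pad(v, n) == _pad(s, n)


def matches_compatible(version, spec):
    """Simplified `~=spec`: at least `spec`, with all but its last part fixed."""
    v, s = _parse(version), _parse(spec)
    n = max(len(v), len(s))
    prefix = s[:-1]
    padded = _pad(v, n)
    return padded >= _pad(s, n) and padded[: len(prefix)] == prefix
```

Under these rules, matches_exact("1.1.3", "1.1") is False while matches_compatible("1.1.3", "1.1.0") is True, and both reject "1.2.0".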
