MNT Refactor tree splitter to use memoryviews #23273

thomasjpfan · 2022-05-04T03:13:29Z

This PR refactors the tree splitter to use memoryviews and lets Python manage the memory.

Benchmark

I ran this benchmark, which compares the best/random splitters on NumPy/sparse input between this PR and main.

Overall, this PR has the same runtime performance as main for dense input. For sparse input, this PR does a little better.

sklearn/tree/_splitter.pyx

thomasjpfan · 2022-05-04T18:25:26Z

sklearn/tree/_splitter.pyx

+            if start_positive < end:
+                simultaneous_sort(&Xf[start_positive], &samples[start_positive], end - start_positive)


This fixes a bug on main where start_positive == end, which can lead to samples[start_positive] being out of bounds.

Would it be possible to write a non-regression test to trigger this case?

A few existing tests already trigger this case, which caused the CI to fail earlier. Note that these tests only fail when compiled with SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES=1, which enables bounds checking on memoryviews.
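For illustration, here is a minimal pure-Python sketch of the guard in the diff above. It is not the Cython implementation: `sort_positive_tail` is a hypothetical stand-in for the guarded `simultaneous_sort` call, and Python lists play the role of bounds-checked memoryviews, since indexing a list at `len(list)` raises `IndexError`.

```python
def sort_positive_tail(feature_values, samples, start_positive, end):
    """Sort feature_values[start_positive:end] in place and permute samples alike.

    Hypothetical stand-in for the guarded simultaneous_sort call in the diff.
    """
    if start_positive < end:
        # Element start_positive only exists when the range is non-empty.
        # This read mirrors taking &Xf[start_positive] in the Cython code.
        _ = feature_values[start_positive]
        pairs = sorted(
            zip(feature_values[start_positive:end], samples[start_positive:end])
        )
        for offset, (value, sample) in enumerate(pairs):
            feature_values[start_positive + offset] = value
            samples[start_positive + offset] = sample


values = [0.0, 0.0, 3.0, 1.0, 2.0]
sample_ids = [10, 11, 12, 13, 14]

# Non-empty range: positions 2..4 are sorted together with their sample ids.
sort_positive_tail(values, sample_ids, 2, 5)

# Empty range: start_positive == end == len(values), so the guard skips the
# element access that would otherwise be out of bounds.
sort_positive_tail(values, sample_ids, 5, 5)
```

Without the `if start_positive < end` check, the second call would index one past the end of the list, which is the situation the bounds-checked CI build reported.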

ogrisel

I think the bug fixes would deserve a changelog entry and ideally a non-regression test.

About the use of typed memoryviews, it looks good to me. I checked that the new code does not leak memory by using a for loop with psutil prints, and all is good (memory usage is stable after a few iterations, as on main).
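A rough sketch of that kind of loop-based stability check follows. It is not the script used in this review: it swaps psutil for the standard-library `tracemalloc` module so that it runs without extra dependencies, and the helper names and the tolerance are assumptions made for illustration. Note that `tracemalloc` only sees allocations made through Python's allocators, so a process-level reading such as psutil's RSS remains the better tool for compiled code.

```python
import gc
import tracemalloc


def traced_memory_per_iteration(func, n_iter=10):
    """Call func n_iter times and record traced memory (bytes) after each call."""
    tracemalloc.start()
    try:
        readings = []
        for _ in range(n_iter):
            func()
            gc.collect()  # drop garbage so only retained memory is counted
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
        return readings
    finally:
        tracemalloc.stop()


def is_stable(readings, warmup=3, tolerance=64 * 1024):
    """Return True if memory stays within tolerance bytes after the warm-up calls."""
    steady = readings[warmup:]
    return max(steady) - min(steady) <= tolerance


def no_leak():
    # Allocates a temporary list that is released when the function returns.
    data = [0] * 10_000
    return sum(data)


retained = []


def leak():
    # Keeps a reference to 1 MB per call, so memory grows on every iteration.
    retained.append(bytearray(1_000_000))


print(is_stable(traced_memory_per_iteration(no_leak)))
print(is_stable(traced_memory_per_iteration(leak)))
```

In a real check, `func` would fit a tree estimator on a fixed dataset, and the readings would be compared between this branch and main.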

I am surprised that we seem to observe a small but significant speed-up with the MSE criterion. I am not sure what is causing this. Maybe the compiler can optimize things further with the C code generated by typed memory views (e.g. contiguity explicitly declared with the [::1] notation?).
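To make the contiguity point concrete, here is a small sketch using Python's built-in `memoryview`. This is not a Cython typed memoryview, but both build on the buffer protocol. In Cython, `[::1]` declares that consecutive elements are one item apart, whereas a strided view carries no such guarantee. The variable names are illustrative only, and whether this explains the observed speed-up is not something the example can show.

```python
from array import array

# A freshly created array exposes one contiguous block of doubles.
feature_values = array("d", [0.5, 1.5, 2.5, 3.5])
view = memoryview(feature_values)

# Taking every second element yields a view whose stride is two items.
strided = view[::2]

print(view.c_contiguous, view.strides)
print(strided.c_contiguous, strided.strides)
print(strided.tolist())
```

A function that accepts only contiguous buffers can step through memory with a fixed increment, while one that accepts arbitrary strides has to read the stride at run time.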

ogrisel · 2022-05-09T16:39:16Z

I re-ran the benchmark on my own laptop with the latest version of this PR and I get similar results.

I don't think there is any significant difference between this branch and main.

glemaitre · 2022-05-13T12:20:42Z

sklearn/tree/_splitter.pxd

    cdef SIZE_t n_features               # X.shape[1]
-    cdef DTYPE_t* feature_values         # temp. array holding feature values
+    cdef DTYPE_t[::1] feature_values     # temp. array holding feature values

    cdef SIZE_t start                    # Start position for the current node
    cdef SIZE_t end                      # End position for the current node


Do you also plan to change sample_weight in a future PR?

What do you mean?

Sorry, I was referring to the sample_weight pointer that is 2 lines below.

I think @glemaitre means:

scikit-learn/sklearn/tree/_splitter.pxd

Line 61 in d247579

cdef DOUBLE_t* sample_weight

Yes, I plan to do it in the future. sample_weight touches multiple files, so I wanted to do it in another PR.

Ah ok I see: the sample_weight attribute below is still defined as a pointer (cdef DOUBLE_t* sample_weight) and it could also be changed to a memory view.

+1. I am fine with doing this in a later PR and merging this one.

Works for me.

glemaitre

Just a small question before merging.

MNT Refactor splitter to use memoryviews

a8a182c

github-actions bot added cython module:tree labels May 4, 2022

thomasjpfan marked this pull request as draft May 4, 2022 12:48

FIX Fixes edge case

9080ccc

thomasjpfan marked this pull request as ready for review May 4, 2022 12:54

thomasjpfan marked this pull request as draft May 4, 2022 13:01

FIX Fixes another edge case

be576c3

thomasjpfan marked this pull request as ready for review May 4, 2022 15:28

thomasjpfan marked this pull request as draft May 4, 2022 17:33

thomasjpfan marked this pull request as ready for review May 4, 2022 17:57

thomasjpfan marked this pull request as draft May 4, 2022 18:08

thomasjpfan marked this pull request as ready for review May 4, 2022 18:22

thomasjpfan commented May 4, 2022

ogrisel reviewed May 9, 2022

thomasjpfan force-pushed the mv_splitter_v2 branch from a825b50 to 2be7035 May 9, 2022 15:31

ENH Do not define zero_pos at all

fcf5675

thomasjpfan force-pushed the mv_splitter_v2 branch from 2be7035 to fcf5675 May 9, 2022 15:46

Merge remote-tracking branch 'upstream/main' into mv_splitter_v2

b5b66f5

ogrisel approved these changes May 9, 2022

thomasjpfan and others added 2 commits May 9, 2022 13:40

DOC Adds whats new

fd4322a

Merge branch 'main' into mv_splitter_v2

fe0d68d

glemaitre self-requested a review May 13, 2022 12:15

glemaitre reviewed May 13, 2022

glemaitre approved these changes May 13, 2022

glemaitre merged commit 962c9b4 into scikit-learn:main May 13, 2022

RMeli added a commit to RMeli/scikit-learn that referenced this pull request May 14, 2022

fix merge conflicts with PR scikit-learn#23273

39563fd

RMeli mentioned this pull request May 14, 2022

MNT Use cimport numpy as cnp for sklearn/tree #23315

Merged

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022

MNT Refactor tree splitter to use memoryviews (scikit-learn#23273)

f044ff1

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

glemaitre added a commit that referenced this pull request Aug 5, 2022

MNT Refactor tree splitter to use memoryviews (#23273)

ad5936f

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

sebp added a commit to sebp/scikit-survival that referenced this pull request Oct 17, 2022

Require scikit-learn 1.1.0 compatible release

33b6d6a

Fixes binary incompatibility due to scikit-learn/scikit-learn#23273. ==1.1 enforces scikit-learn 1.1.0, but ~=1.1.0 allows for any version up to 1.2.
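
The difference between the two specifiers in that commit message can be sketched as follows. This is a simplified illustration that handles only plain numeric versions such as `1.1.0`. The helper names are hypothetical, and real tooling should rely on the `packaging` library instead.

```python
def parse(version):
    """Turn a plain numeric version such as '1.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def pad(release, width):
    """Right-pad a release tuple with zeros so tuples compare at equal length."""
    return release + (0,) * (width - len(release))


def matches_exact(candidate, spec):
    """Simplified `==spec`: equal once the shorter release is zero-padded."""
    c, s = parse(candidate), parse(spec)
    width = max(len(c), len(s))
    return pad(c, width) == pad(s, width)


def matches_compatible(candidate, spec):
    """Simplified `~=spec`: at least spec, with all but spec's last component fixed."""
    c, s = parse(candidate), parse(spec)
    width = max(len(c), len(s))
    c, s = pad(c, width), pad(s, width)
    prefix = parse(spec)[:-1]
    return c >= s and c[: len(prefix)] == prefix


for candidate in ["1.0.9", "1.1.0", "1.1.3", "1.2.0"]:
    print(candidate, matches_exact(candidate, "1.1"), matches_compatible(candidate, "1.1.0"))
```

Under these rules `==1.1` accepts only 1.1.0, while `~=1.1.0` accepts every 1.1.x release and rejects 1.2.0.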
