FIX out of bound error in split_indices #21130

lorentzenchr · 2021-09-23T15:10:29Z

Reference Issues/PRs

This fixes an out of bound error in HGBT estimators that might lead to segfault.

How to produce the error?

In the current main branch, set boundscheck=True in splitting.pyx and run a tests like

pytest sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py

This gives:

E   IndexError: Out of bounds on buffer access (axis 0)

lorentzenchr · 2021-09-23T15:13:21Z

@ogrisel @thomasjpfan @NicolasHug @glemaitre @adrinjalali ping
I set the 1.0.1 milestone. I still hope that my analysis is flawed... 👀

adrinjalali · 2021-09-23T15:17:47Z

I agree this probably is a bug, but I'm not sure if we need to panic about it. We do have a large set of bugs already 😁

NicolasHug · 2021-09-23T15:24:16Z

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx

+                    # if n_threads >= 2 one has right_counts[-1] = 0 and
+                    # right_offset[-1] = len(sample_indices) leading to


I'm likely missing something but I don't understand these 2 lines

in the run-down example above, we have n_threads >= 2 and yet right_counts[-1] != 0 and right_offset[-1] != len(sample_indices)

Also it's not clear to me why this issue only ever happens for the right node?

Didn't you write that code? 😏

The "last" thread is responsible for the most "right" samples. The "right" part of the "last" thread sorts the last indices. In can happen that right_counts[-1]=0. Then the memcpy is a no-op. But it also leads to potentially evaluating the first argument of memcpy, which is &sample_indices[right_offset[thread_idx]] = &sample_indices[n_samples] which accesses one element after the last one, i.e. out of bound.

Again, this is a no-op, but it should not be executed for potentially undefined behavior.

I updated the comment in the hope to make it clearer.

NicolasHug

Thanks for clarifying the comment.

I think this is a legit bug, good find! IIUC, it should be fairly rare, as it can only ever happen if we're splitting the rightmost node in the tree. For that node, we would be accessing partition[n_samples] which is OOB. But for any other nodes, since sample_indices is a view on this bigger partition array, any access to sample_indices[n_samples_at_node] is indeed out of the node's boudary, but it's still within partition, so there wouldn't be any segfault.

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx

lorentzenchr · 2021-09-24T11:51:22Z

A follow-up thought, I have after discovering this bug: Do we want to setup a CICD pipeline that compiles with boundscheck=True and runs the test suite?

thomasjpfan

Small comment about comment, otherwise LGTM

thomasjpfan · 2021-09-27T20:07:21Z

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx

+                    # right_offset[-1] = len(sample_indices) leading to evaluating
+                    # &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
+                    # = &partition[n_samples_in_tree] which produces
+                    # IndexError: Out of bounds on buffer access (with boundscheck=True)


Do you think this is clearer (without going through partition)?

Suggested change

# right_offset[-1] = len(sample_indices) leading to evaluating

# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]

# = &partition[n_samples_in_tree] which produces

# IndexError: Out of bounds on buffer access (with boundscheck=True)

# right_offset[-1] = len(sample_indices) leading to evaluating

# &sample_indices[len(sample_indices)] which is out of bounds.

In general, sample_indices is a view so sample_indices[len(sample_indices)] can exist, but in the scope of this method, one should not access sample_indices beyond it's bounds?

I don't know if your suggestion is clearer. For the current comment, see #21130 (comment).
@NicolasHug might have an opinion?

And YES, even though sample_indices is a view on partition and the operation might be allowed with high probability, we should never ever access x[len(x)] - nowhere ever.

accessing sample_indices[len(sample_indices)] is out of bound w.r.t. the logic, but not out of the allowed memory space. This is the latter that causes a segfault, not the former. So I'd still prefer to be specific here.

we should never ever access x[len(x)] - nowhere ever

That's fair @lorentzenchr, but by that logic, we should also be protecting the left child calls.
There's never a segfault on the left child calls for the same reason there's never a segfault if the parent isn't the right-most node. Yet these OOB accesses can happen on any node and actually on any thread. There's just one single case where this is problematic.

I'm fine with either as long as we're consistent :)

If I understand the implementation correctly (I am new to this one), I guess right_counts[thread_idx] or left_counts[thread_idx] being null do not cause any problem in this case even if we touch an OOB element as nothing will be copied by memset.

Knowing this edge-case is useful (I think we can keep the comment), but do we want to add this conditional statement if it does not prevent any error but solely when boundscheck=True?

Should it clarifies that this also prevents segmentation faults?

Suggested change

# right_offset[-1] = len(sample_indices) leading to evaluating

# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]

# = &partition[n_samples_in_tree] which produces

# IndexError: Out of bounds on buffer access (with boundscheck=True)

# right_offset[-1] = len(sample_indices) leading to evaluating

# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]

# = &partition[n_samples_in_tree] which is out of bounds and can cause a

# segmentation fault.

#

# When boundscheck=True, this produces:

# IndexError: Out of bounds on buffer access

EDIT: see the new suggestion bellow.

I agree with the intent of the suggestion but the indentation of the suggested change in the comment makes it really hard to read.

Clarifying the comment while taking care of the readers' eyes. 👀

Suggested change

# right_offset[-1] = len(sample_indices) leading to evaluating

# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]

# = &partition[n_samples_in_tree] which produces

# IndexError: Out of bounds on buffer access (with boundscheck=True)

# right_offset[-1] = len(sample_indices)

#

# leading to evaluating

#

# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]

# = &partition[n_samples_in_tree]

#

# which is an out-of-bounds read access that can cause a segmentation fault.

# When boundscheck=True, removing this check produces this exception:

#

# IndexError: Out of bounds on buffer access

#

Does the text above better? I think we should move forward with this to focus on #21020.

I updated the PR with the @jjerphan suggestion.

jjerphan

LGTM.

I don't know how easy a CI with bounds checks enabled (as proposed in #21130 (comment)) is to set and maintain overtime. Shouldn't only the final packaged implementations be tested?

If we decide to set one, we can also enable other checks such as the one for memory-views initialization which are settable via initializedcheck.

jjerphan · 2021-09-29T11:52:47Z

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx

+                    # right_offset[-1] = len(sample_indices) leading to evaluating
+                    # &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
+                    # = &partition[n_samples_in_tree] which produces
+                    # IndexError: Out of bounds on buffer access (with boundscheck=True)


Should it clarifies that this also prevents segmentation faults?

Suggested change

# right_offset[-1] = len(sample_indices) leading to evaluating

# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]

# = &partition[n_samples_in_tree] which produces

# IndexError: Out of bounds on buffer access (with boundscheck=True)

# right_offset[-1] = len(sample_indices) leading to evaluating

# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]

# = &partition[n_samples_in_tree] which is out of bounds and can cause a

# segmentation fault.

#

# When boundscheck=True, this produces:

# IndexError: Out of bounds on buffer access

EDIT: see the new suggestion bellow.

lorentzenchr · 2021-09-29T17:37:03Z

FYI, I won't be able to continue for the next month. In case some reviewer want's to take over, feel free.

ogrisel · 2021-10-05T15:08:19Z

If we decide to set one, we can also enable other checks such as the one for memory-views initialization which are settable via initializedcheck.

Interesting idea but probably for another PR as we might go down a deep rabbit hole following this suggestion.

ogrisel

LGTM as well. Let's wait for the CI to be green after the last merge commit and let's merge.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

lorentzenchr · 2021-10-22T15:10:12Z

Thanks for finishing this one!

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

FIX out of bound error in split_indices

b73e260

lorentzenchr added Bug module:ensemble labels Sep 23, 2021

lorentzenchr added this to the 1.0.1 milestone Sep 23, 2021

github-actions bot added the cython label Sep 23, 2021

NicolasHug reviewed Sep 23, 2021

View reviewed changes

better comment

d9d1d7c

NicolasHug approved these changes Sep 24, 2021

View reviewed changes

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx Outdated Show resolved Hide resolved

improve comment according to review

a5e5d0f

lorentzenchr added 2 commits September 26, 2021 13:11

Merge branch 'main' into fix_out_of_bound_error

5e9289d

DOC add whatsnew

2ebef8e

lorentzenchr mentioned this pull request Sep 27, 2021

ENH FEA add interaction constraints to HGBT #21020

Merged

7 tasks

thomasjpfan approved these changes Sep 27, 2021

View reviewed changes

jjerphan approved these changes Sep 29, 2021

View reviewed changes

CLN Use suggestion with better spacing

f34cb04

jjerphan requested a review from ogrisel October 5, 2021 09:17

Merge branch 'main' into fix_out_of_bound_error

9eaf266

ogrisel approved these changes Oct 5, 2021

View reviewed changes

jjerphan requested a review from ogrisel October 6, 2021 09:08

ogrisel merged commit b6f1155 into scikit-learn:main Oct 6, 2021

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Oct 8, 2021

FIX out of bound error in split_indices (scikit-learn#21130)

cb8223b

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

NicolasHug mentioned this pull request Oct 8, 2021

Segfault in HistGradientBoostingClassifier #21283

Closed

lorentzenchr deleted the fix_out_of_bound_error branch October 22, 2021 15:18

glemaitre mentioned this pull request Oct 23, 2021

Release 1.0.1 #21404

Merged

10 tasks

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021

FIX out of bound error in split_indices (scikit-learn#21130)

009579e

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre pushed a commit that referenced this pull request Oct 25, 2021

FIX out of bound error in split_indices (#21130)

a34afbd

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

thomasjpfan mentioned this pull request Oct 31, 2021

Revisit gitattributes merge=union #21516

Closed

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

FIX out of bound error in split_indices (scikit-learn#21130)

c5e00f8

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021

FIX out of bound error in split_indices (scikit-learn#21130)

c654115

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre pushed a commit that referenced this pull request Dec 25, 2021

FIX out of bound error in split_indices (#21130)

1b97932

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

		# if n_threads >= 2 one has right_counts[-1] = 0 and
		# right_offset[-1] = len(sample_indices) leading to

Uh oh!

FIX out of bound error in split_indices #21130

FIX out of bound error in split_indices #21130

Uh oh!

Conversation

lorentzenchr commented Sep 23, 2021 • edited by adrinjalali Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

How to produce the error?

Uh oh!

lorentzenchr commented Sep 23, 2021

Uh oh!

adrinjalali commented Sep 23, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NicolasHug left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lorentzenchr commented Sep 24, 2021

Uh oh!

thomasjpfan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NicolasHug Sep 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jjerphan Sep 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jjerphan Sep 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jjerphan left a comment

Choose a reason for hiding this comment

Uh oh!

jjerphan Sep 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lorentzenchr commented Sep 29, 2021

Uh oh!

ogrisel commented Oct 5, 2021

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

lorentzenchr commented Oct 22, 2021

Uh oh!

Uh oh!

lorentzenchr commented Sep 23, 2021 •

edited by adrinjalali

Loading

NicolasHug left a comment •

edited

Loading

NicolasHug Sep 28, 2021 •

edited

Loading

jjerphan Sep 29, 2021 •

edited

Loading

jjerphan Sep 29, 2021 •

edited

Loading

jjerphan Sep 29, 2021 •

edited

Loading