Skip to content

FIX out of bound error in split_indices #21130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 6, 2021

Conversation

lorentzenchr
Copy link
Member

@lorentzenchr lorentzenchr commented Sep 23, 2021

Reference Issues/PRs

This fixes an out of bound error in HGBT estimators that might lead to segfault.

Discovered in #21020 (comment).

How to produce the error?

In the current main branch, set boundscheck=True in splitting.pyx and run a tests like

pytest sklearn/ensemble/_hist_gradient_boosting/tests/test_grower.py

This gives:

E   IndexError: Out of bounds on buffer access (axis 0)

@lorentzenchr
Copy link
Member Author

@ogrisel @thomasjpfan @NicolasHug @glemaitre @adrinjalali ping
I set the 1.0.1 milestone. I still hope that my analysis is flawed... 👀

@adrinjalali
Copy link
Member

I agree this probably is a bug, but I'm not sure if we need to panic about it. We do have a large set of bugs already 😁

Comment on lines 392 to 393
# if n_threads >= 2 one has right_counts[-1] = 0 and
# right_offset[-1] = len(sample_indices) leading to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm likely missing something but I don't understand these 2 lines

in the run-down example above, we have n_threads >= 2 and yet right_counts[-1] != 0 and right_offset[-1] != len(sample_indices)

Also it's not clear to me why this issue only ever happens for the right node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you write that code? 😏

The "last" thread is responsible for the most "right" samples. The "right" part of the "last" thread sorts the last indices. In can happen that right_counts[-1]=0. Then the memcpy is a no-op. But it also leads to potentially evaluating the first argument of memcpy, which is &sample_indices[right_offset[thread_idx]] = &sample_indices[n_samples] which accesses one element after the last one, i.e. out of bound.

Again, this is a no-op, but it should not be executed for potentially undefined behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment in the hope to make it clearer.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying the comment.

I think this is a legit bug, good find! IIUC, it should be fairly rare, as it can only ever happen if we're splitting the rightmost node in the tree. For that node, we would be accessing partition[n_samples] which is OOB. But for any other nodes, since sample_indices is a view on this bigger partition array, any access to sample_indices[n_samples_at_node] is indeed out of the node's boudary, but it's still within partition, so there wouldn't be any segfault.

@lorentzenchr
Copy link
Member Author

A follow-up thought, I have after discovering this bug: Do we want to setup a CICD pipeline that compiles with boundscheck=True and runs the test suite?

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment about comment, otherwise LGTM

Comment on lines 395 to 398
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which produces
# IndexError: Out of bounds on buffer access (with boundscheck=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is clearer (without going through partition)?

Suggested change
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which produces
# IndexError: Out of bounds on buffer access (with boundscheck=True)
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[len(sample_indices)] which is out of bounds.

In general, sample_indices is a view so sample_indices[len(sample_indices)] can exist, but in the scope of this method, one should not access sample_indices beyond it's bounds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if your suggestion is clearer. For the current comment, see #21130 (comment).
@NicolasHug might have an opinion?

And YES, even though sample_indices is a view on partition and the operation might be allowed with high probability, we should never ever access x[len(x)] - nowhere ever.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accessing sample_indices[len(sample_indices)] is out of bound w.r.t. the logic, but not out of the allowed memory space. This is the latter that causes a segfault, not the former. So I'd still prefer to be specific here.

Copy link
Member

@NicolasHug NicolasHug Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should never ever access x[len(x)] - nowhere ever

That's fair @lorentzenchr, but by that logic, we should also be protecting the left child calls.
There's never a segfault on the left child calls for the same reason there's never a segfault if the parent isn't the right-most node. Yet these OOB accesses can happen on any node and actually on any thread. There's just one single case where this is problematic.

I'm fine with either as long as we're consistent :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the implementation correctly (I am new to this one), I guess right_counts[thread_idx] or left_counts[thread_idx] being null do not cause any problem in this case even if we touch an OOB element as nothing will be copied by memset.

Knowing this edge-case is useful (I think we can keep the comment), but do we want to add this conditional statement if it does not prevent any error but solely when boundscheck=True?

Copy link
Member

@jjerphan jjerphan Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it clarifies that this also prevents segmentation faults?

Suggested change
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which produces
# IndexError: Out of bounds on buffer access (with boundscheck=True)
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which is out of bounds and can cause a
# segmentation fault.
#
# When boundscheck=True, this produces:
# IndexError: Out of bounds on buffer access

EDIT: see the new suggestion bellow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the intent of the suggestion but the indentation of the suggested change in the comment makes it really hard to read.

Copy link
Member

@jjerphan jjerphan Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarifying the comment while taking care of the readers' eyes. 👀

Suggested change
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which produces
# IndexError: Out of bounds on buffer access (with boundscheck=True)
# right_offset[-1] = len(sample_indices)
#
# leading to evaluating
#
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree]
#
# which is an out-of-bounds read access that can cause a segmentation fault.
# When boundscheck=True, removing this check produces this exception:
#
# IndexError: Out of bounds on buffer access
#

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the text above better? I think we should move forward with this to focus on #21020.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR with the @jjerphan suggestion.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I don't know how easy a CI with bounds checks enabled (as proposed in #21130 (comment)) is to set and maintain overtime. Shouldn't only the final packaged implementations be tested?

If we decide to set one, we can also enable other checks such as the one for memory-views initialization which are settable via initializedcheck.

Comment on lines 395 to 398
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which produces
# IndexError: Out of bounds on buffer access (with boundscheck=True)
Copy link
Member

@jjerphan jjerphan Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it clarifies that this also prevents segmentation faults?

Suggested change
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which produces
# IndexError: Out of bounds on buffer access (with boundscheck=True)
# right_offset[-1] = len(sample_indices) leading to evaluating
# &sample_indices[right_offset[-1]] = &samples_indices[n_samples_at_node]
# = &partition[n_samples_in_tree] which is out of bounds and can cause a
# segmentation fault.
#
# When boundscheck=True, this produces:
# IndexError: Out of bounds on buffer access

EDIT: see the new suggestion bellow.

@lorentzenchr
Copy link
Member Author

FYI, I won't be able to continue for the next month. In case some reviewer want's to take over, feel free.

@jjerphan jjerphan requested a review from ogrisel October 5, 2021 09:17
@ogrisel
Copy link
Member

ogrisel commented Oct 5, 2021

If we decide to set one, we can also enable other checks such as the one for memory-views initialization which are settable via initializedcheck.

Interesting idea but probably for another PR as we might go down a deep rabbit hole following this suggestion.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well. Let's wait for the CI to be green after the last merge commit and let's merge.

@jjerphan jjerphan requested a review from ogrisel October 6, 2021 09:08
@ogrisel ogrisel merged commit b6f1155 into scikit-learn:main Oct 6, 2021
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Oct 8, 2021

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@lorentzenchr
Copy link
Member Author

Thanks for finishing this one!

@lorentzenchr lorentzenchr deleted the fix_out_of_bound_error branch October 22, 2021 15:18
@glemaitre glemaitre mentioned this pull request Oct 23, 2021
10 tasks
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre pushed a commit that referenced this pull request Oct 25, 2021

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre pushed a commit that referenced this pull request Dec 25, 2021

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants