Skip to content

FIX out of bounds error on X_indptr in kmeans #21662

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

lorentzenchr
Copy link
Member

@lorentzenchr lorentzenchr commented Nov 14, 2021

Reference Issues/PRs

When compiled with boundscheck=True, Kmeans tests like test_kmeans_results throws

sklearn/cluster/tests/test_k_means.py:73: AssertionError
------------------------------------------------------- Captured stderr call -------------------------------------------------------
IndexError: Out of bounds on buffer access (axis 0)
IndexError: Out of bounds on buffer access (axis 0)

What does this implement/fix? Explain your changes.

A CSR matrix X with n rows, has X.indptr with n+1 elements as beautifully explained in https://stackoverflow.com/a/52299730/16761084.

The n+1th element might never be accessed such that this is not a user visible bug.

Any other comments?

Found by working on #21654.

@adrinjalali
Copy link
Member

I don't suppose we can actually test this @lorentzenchr 😁

@jeremiedbb
Copy link
Member

I think it makes sense but I don't understand how it never triggered a segfault

@lorentzenchr
Copy link
Member Author

I think it makes sense but I don't understand how it never triggered a segfault

IIUC, it does not really read/access the last element. Segfaults are just hard to pin down anyway.

@jeremiedbb
Copy link
Member

So I tested this snippet

import numpy as np

a = np.array([1., 2, 3, 4, 5])
cdef double[::1] a_view = a
cdef double[::1] b_view = a[:-1]  # slice of a_view of all but the last element
print(b_view[4])

with boundscheck=True, cython throws an error because we try to access out of bound
with boundscheck=False, it prints 5, i.e. the last element of a. I guess it's because with boundscheck=False it just directly deref the pointer of the base array which is properly allocated and not out of bound.

In summary, the current code works but in a non-intuitive way

@adrinjalali adrinjalali merged commit f7f0a2f into scikit-learn:main Nov 16, 2021
@lorentzenchr lorentzenchr deleted the fix_out_of_bounds_sparse_kmeans branch November 16, 2021 20:49
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 22, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants