[WIP] FIX index overflow error in sparse matrix polynomial expansion (bis) #19676
Conversation
I think the overarching issue with `_csr_polynomial_expansion` is that the output indices can still be encoded as int32 if the expansion is not too big. If the expansion is too big, even an int32 input would need int64 indices to encode the expansion. The solution in this PR is to convert everything into int64, which would lead to more memory usage for int32 indices.
I think a solution could be:

- `_csr_polynomial_expansion` can accept 32-bit and 64-bit indices.
- Try to build a 32-bit expansion.
- If the 32-bit expansion does not have enough capacity, build a 64-bit expansion.
Something like this:

```python
try:
    Xp_next = _csr_polynomial_expansion(..., output_format="int32")
except ValueError:
    Xp_next = _csr_polynomial_expansion(..., output_format="int64")
```
(The API is up for discussion.) We can also wrap the above logic in `_csr_polynomial_expansion`, try to reuse the (incomplete) result of the int32 expansion, and use that to inform the int64 expansion.
sklearn/preprocessing/_data.py (Outdated)

```python
# use np.int64 for index datatype to prevent overflow
# in case X has a large dimension
Xp_next = _csr_polynomial_expansion(X.data,
                                    X.indices.astype(np.int64),
```
I think casting here would lead to more memory usage for sparse matrices encoded with int32 indices.
@thomasjpfan Thanks for your help! The problem I have is that in this example, no error is thrown by the Cython code; wrong negative indices are created silently instead:

```python
print(d)
expanded_dimensionality = int((d**2 + d) / 2 - interaction_only*d)
print(expanded_dimensionality)
```
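The silent wrap-around described above is easy to reproduce with NumPy int32 array arithmetic (the dimension 60000 below is only an illustration; NumPy does not raise on integer overflow in array operations):

```python
import numpy as np

# With int32 index arithmetic, a column index in the expanded space can
# silently wrap around to a negative value instead of raising an error.
d = np.array([60000], dtype=np.int32)
expanded = d * d  # true value 3_600_000_000 exceeds 2**31 - 1
print(expanded[0])  # a negative wrapped value, not an error
```

This is the same failure mode as in the report: the expansion produces negative indices rather than an exception.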
It might not be a good idea to use a …

We would need to update the Cython to properly error when it notices that …

I agree. If there is a fast way to compute the dimension, we can quickly decide the output format of the indices from there. At a glance, I think it can be done by using …
…onality and in total_nnz
I agree, I'll try to compute the largest possible index as you suggested @thomasjpfan.

A simpler idea (and which wouldn't need to go through all the elements in `indices`) would be to just look at the `expanded_dimensionality` and see if it's larger than 2147483647. A more efficient solution, instead of going through all indices to find the max, would be to solve on …

After that, I think we also need to check whether the … But first, I already pushed some casting to `int64`.
I also see that all current tests pass but the example from @jianlingzhong is not passing, so I'll add it as a test.
I'm going for the simplest solution described above ("A simpler idea (and which wouldn't need to go through all the elements in `indices`) would be to just look at the `expanded_dimensionality` and see if it's larger than 2147483647"), which I guess is enough to perform well in most cases?

Regarding the possibility of multiple values pointing to the same place in the sparse matrix, there is no such problem because …

Also, I'm wondering if there can be problems with the data inside the array (for instance if it's …). I'm also thinking there could be a silent problem if the input array indices are …

Also, to test … because of some bug in the …

I didn't have time to advance yet but I'll get back to it soon.
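The boundary check discussed above can be sketched as follows, using the degree-2 `expanded_dimensionality` formula quoted earlier in the thread (the function name is illustrative, not part of the PR):

```python
import numpy as np

MAX_INT32 = np.iinfo(np.int32).max  # 2147483647

def needs_int64_indices(d, interaction_only=False):
    # Degree-2 expanded dimensionality of a d-dimensional sparse input,
    # using the formula from the discussion above; if it exceeds the
    # int32 range, the output indices must be int64.
    expanded_dimensionality = (d**2 + d) // 2 - int(interaction_only) * d
    return expanded_dimensionality > MAX_INT32
```

With this rule the switch to int64 happens at d = 65536 for the full degree-2 expansion, since 65536 * 65537 / 2 is just above 2**31 - 1.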
fyi, #19734 was just merged so you should be good to add that test.
I'm still in progress, but I'm having trouble explicitly casting a variable with a fused type in, say, an …
Left some tips :)
```python
# expanded_indptr to int64 too.
# TODO: check that the casting is indeed done
if expanded_dimensionality > MAX_INT32:
    indices = np.array(indices, dtype=np.int64)
```
If `indices` is of dtype `np.int32`, it would be strictly typed to … at this line. This means that we can not have it point to an int64 array.

I think we need to factor out the code that does the expansion into its own fused function. The original `_csr_polynomial_expansion` will decide if it should cast `indices` and friends.
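The suggested split can be sketched in plain Python (all names here are hypothetical stand-ins for the Cython code): a thin dispatcher decides the index dtype up front, so the type-specialized worker only ever sees consistently typed arrays.

```python
import numpy as np

MAX_INT32 = np.iinfo(np.int32).max

def _expand_worker(indices, indptr, d):
    # Stand-in for the fused-type Cython worker: the dtype of `indices`
    # is fixed at call time, as it would be for a fused function.
    # Here we just report which specialization was selected.
    return indices.dtype

def csr_polynomial_expansion(indices, indptr, d, interaction_only=False):
    # Hypothetical dispatcher: compute the degree-2 expanded
    # dimensionality and upcast the index arrays before calling the
    # worker, so the worker never mixes int32 and int64.
    expanded_dimensionality = (d**2 + d) // 2 - int(interaction_only) * d
    if expanded_dimensionality > MAX_INT32:
        indices = indices.astype(np.int64)
        indptr = indptr.astype(np.int64)
    return _expand_worker(indices, indptr, d)
```

This mirrors the proposal: the casting decision lives in the outer function, and the fused worker stays strictly typed.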
```python
if total_nnz > MAX_INT32:
    expanded_indptr = np.array(expanded_indptr, dtype=np.int64)
```
This could lead to a CSR matrix with `indices` and `indptr` not having the same dtype. I do not know if scipy supports that.
```python
# in the expanded space, so we cast to int64
# before the expansion computation to
# avoid overflow:
nnz = np.int64(indptr[row_i + 1] - indptr[row_i])
```
`nnz` would need to be typed as `int64`.
sklearn/preprocessing/_polynomial.py (Outdated)

```diff
 Xp_next = _csr_polynomial_expansion(X.data, X.indices,
-                                    X.indptr, X.shape[1],
+                                    X.indptr,
+                                    np.int64(X.shape[1]),
```
If `X.indptr.dtype` is int32, this would be downcasted anyways. We can change the signature of `_csr_polynomial_expansion` to always do the upcasting:

```python
def _csr_polynomial_expansion(...
                              np.int64_t d, INDEX_T interaction_only,
                              INDEX_T degree):
```
@thomasjpfan thanks a lot for the help! It's much clearer now how casting works in Cython.
In 90e12b8, similarly to @thomasjpfan's suggestion, I separated: …

(@thomasjpfan in #19676 (comment) you suggested to do the expansion outside of …)

I went for a simple rule for casting: if the expanded dimensionality needs to be an …

Let me know what you think! Tests are green now, so it's encouraging. I'll just need to add a what's new entry, and I was thinking of adding a few more cases to the test to be sure the code works in every case.
> …to know that we would need to actually compute and sum how many nonzero elements there are in each expanded row, so we might find halfway that we should have started with int64 from the beginning, which is not ideal...

My initial thought was to start the computation and, when it discovers that it should be int64 to begin with, create an int64 container, copy over the results, and continue from there.

Given that, I see that the heuristic we have here is simple. My concern is that there would be a memory regression for input that did not actually need the int64 conversion.
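The copy-on-upgrade idea can be sketched with a small helper (entirely hypothetical, not from the PR) that mimics switching containers mid-computation the first time a value no longer fits in int32:

```python
import numpy as np

MAX_INT32 = np.iinfo(np.int32).max

def store_index(arr, pos, value):
    # Hypothetical helper: write `value` into arr[pos], upgrading the
    # container from int32 to int64 (copying partial results over) the
    # first time a value exceeds the int32 range.
    if arr.dtype == np.int32 and value > MAX_INT32:
        arr = arr.astype(np.int64)  # copy of the partial results
    arr[pos] = value
    return arr
```

The cost of this strategy is one extra copy only for inputs that genuinely need int64, which is what motivates it over upcasting everything eagerly.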
```python
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
```
There are two copies being created here: one from `X.copy` and one here when changing the dtype of the indices. I do not see a good alternative. I would expect the following to work:

```python
X = csr_matrix(
    (X.data, X.indices.astype(np.int64), X.indptr.astype(np.int64)),
    shape=X.shape, copy=True,
)
```

but internally `csr_matrix` checks the indices and will cast back down to `np.int32`.
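The downcast described above can be observed directly: constructing a `csr_matrix` from int64 index arrays whose contents fit in int32 yields int32 indices, at least on the scipy versions current at the time of this discussion (toy data below is purely illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1], dtype=np.int64)
indptr = np.array([0, 2, 3], dtype=np.int64)

# scipy inspects the index contents and, since every value fits in
# int32, silently casts both index arrays back down to int32.
X = csr_matrix((data, indices, indptr), shape=(2, 3))
print(X.indices.dtype, X.indptr.dtype)
```

This is why passing pre-upcast int64 arrays through the `csr_matrix` constructor does not preserve the dtype.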
That's right, your solution sounds good.

I'm not completely sure I understand the problem with it: is it because, when creating a `csr_matrix` with indices that could fit in `int32`, scipy automatically chooses `int32` indices?

In this case maybe I could avoid copying `X`, but deal with every part of it (`data`, `indices`, `indptr`) separately, like:

- remove the `X = X.copy()` line
- do `X_indices = X.indices.astype(np.int64, copy=False)` (to do a copy only if the type is changed)
- similarly `X_indptr = X.indptr.astype(np.int64, copy=False)`
- and then return `X.data`, `X_indices` and `X_indptr`, which I would use later on in `_csr_polynomial_expansion`

What do you think?
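The `copy=False` behavior relied on in the list above can be checked with plain NumPy (array values are illustrative): `astype` returns the array itself when the dtype already matches, and only copies when an actual conversion is needed.

```python
import numpy as np

indices = np.array([0, 3, 1], dtype=np.int32)

# Same dtype: no copy, the very same array object comes back.
same = indices.astype(np.int32, copy=False)

# Different dtype: a converted copy is made.
upcast = indices.astype(np.int64, copy=False)
```

So the int32 fast path costs nothing extra, and only inputs that need upcasting pay for a copy.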
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Thanks a lot for the review @thomasjpfan! And sorry for the delay in my answer.

Regarding the strategy of starting the computation, I agree it would be best, but I couldn't find any way to do that efficiently: the problem I have is that I don't know how it could find on the fly that we should have used … So I guess the only way to know if, say, … But I agree that it could lead to memory regressions for input that did not actually need the … Maybe we could do the following, by mixing the two strategies, what do you think? …

Let me know how that sounds; if it's a good idea I'll start implementing it.

(Btw, I have limited time these days, so if you or anyone feels like taking over the PR, go ahead, it could go faster; otherwise I'll try to continue from time to time when I have some time. Sorry for the delay.)
@wdevazelhes I have labeled as 'help wanted' if this is ok with you. If you find some time to continue the PR you can always... help yourself... :)
@cmarmo thanks a lot, that works :)
Hello, scikit-learn community. I have some time and great interest in contributing to the project. I investigated the history of this issue and read the current PR, and I now have a working solution for this problem. Could I take this issue to work on and create a new PR? I slightly rewrote the code of `_csr_polynomial_expansion.pyx` to move the logic for array allocation outside of the C code.
Closing as superseded by #23731. Thanks @wdevazelhes for your work!
This PR is an attempt to take over #16831, I'll start by a simple merge with master and then investigate the issue @jianlingzhong described here #16831 (comment)