
[WIP] FIX index overflow error in sparse matrix polynomial expansion (bis) #19676


Closed
wants to merge 11 commits

Conversation

wdevazelhes
Contributor

This PR is an attempt to take over #16831. I'll start with a simple merge with master and then investigate the issue @jianlingzhong described in #16831 (comment).

Member

@thomasjpfan left a comment


I think the overarching issue with _csr_polynomial_expansion is that the output indices can still be encoded as int32 if the expansion is not too big. If the expansion is too big, even an int32 input would need int64 indices to encode the expansion.

The solution in this PR is to convert everything to int64, which would lead to more memory usage for int32 indices.

I think a solution could be:

  1. _csr_polynomial_expansion can accept 32bit and 64bit indices.
  2. try to build a 32bit expansion
  3. If the 32bit expansion does not have enough capacity, build a 64bit expansion.

Something like this:

try:
    Xp_next = _csr_polynomial_expansion(..., output_format="int32")
except ValueError:
    Xp_next = _csr_polynomial_expansion(..., output_format="int64")

(The API is up for discussion). We can also wrap the above logic in _csr_polynomial_expansion and try to reuse the (incomplete) result of the int32 expansion and use that to inform the int64 expansion.

# use np.int64 for index datatype to prevent overflow
# in case X has a large dimension
Xp_next = _csr_polynomial_expansion(X.data,
                                    X.indices.astype(np.int64),
Member


I think casting here would lead to more memory usage for sparse matrices encoded with int32 indices.

@wdevazelhes
Contributor Author

wdevazelhes commented Mar 15, 2021

@thomasjpfan Thanks for your help!
Indeed, I tried removing the casting of inputs to int64 from this PR and ran @jianlingzhong's example (here: #16803 (comment)), and the problem happens when forming the expansion, because it has very high dimensionality (so it needs int64).

The problem I have is that in this example, no error is thrown by the Cython code; instead, wrong negative indices are created silently (a ValueError: column index exceeds matrix dimensions is raised after the call to _csr_polynomial_expansion, at concatenation).


Also, there is something I didn't get when tracing what happens in the Cython code: I added two print lines here:

print(d)    
expanded_dimensionality = int((d**2 + d) / 2 - interaction_only*d)
print(expanded_dimensionality)

d is 120006, but expanded_dimensionality is 758209071 when it should be 7200660015... I thought it was due to the int32 overflow, so I tried to cast 7200660015 to int32, and it returns -2147483648, not 758209071, so I had trouble understanding what was going on here...
=> In fact, what happens is that 758209071 = int((np.int32(120006**2) + np.int32(120006))/2 - np.int32(120006)), with the overflow coming from squaring d.
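
Here's a minimal numpy reproduction of that wrap-around (not the actual Cython code, just the same arithmetic done in 32-bit vs 64-bit):

    import numpy as np

    # int32 array arithmetic wraps around silently, like the C-level arithmetic
    # in the Cython code: the intermediate d**2 overflows.
    d32 = np.array([120006], dtype=np.int32)
    d64 = d32.astype(np.int64)
    interaction_only = 1

    print((d32 * d32 + d32) // 2 - interaction_only * d32)  # [758209071]  (wrapped)
    print((d64 * d64 + d64) // 2 - interaction_only * d64)  # [7200660015] (correct)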

@jianlingzhong

It might not be a good idea to use a try ... except ... approach, as the expansion would take a long time when the number of dimensions is high. It may be better to calculate the resulting dimensionality and check whether int32 is sufficient.

@thomasjpfan
Member

The problem I have is that in this example, no error is thrown by the Cython code; instead, wrong negative indices are created silently (a ValueError: column index exceeds matrix dimensions is raised after the call to _csr_polynomial_expansion, at concatenation)

We would need to update the Cython to properly error when it notices that int64 is not good enough.

It might not be a good idea to use a try ... except ... approach, as the expansion would take a long time when the number of dimensions is high. It may be better to calculate the resulting dimensionality and check whether int32 is sufficient.

I agree. If there is a fast way to compute the dimension, we can quickly decide the output format of the indices from there. At a glance, I think it can be done by using _deg2_column or _deg3_column to compute the largest possible indices. (We would need to explicitly use the int64 versions of these functions)
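
For the degree-2 case, the check could look roughly like this (just a sketch: the helper name is made up, for interaction_only it uses the pair (col_max - 1, col_max) as a conservative upper bound, and the formulas are the degree-2 column mappings d*i - (i**2 + i)//2 + j and d*i - (i**2 + 3*i)//2 - 1 + j):

    import numpy as np

    MAX_INT32 = np.iinfo(np.int32).max

    def max_deg2_index(col_max, n_features, interaction_only):
        # Evaluate the degree-2 column mapping in int64 so the check itself
        # cannot overflow.
        d = np.int64(n_features)
        if interaction_only:
            i, j = np.int64(col_max) - 1, np.int64(col_max)
            return d * i - (i ** 2 + 3 * i) // 2 - 1 + j
        i = j = np.int64(col_max)
        return d * i - (i ** 2 + i) // 2 + j

    # decide the index dtype of the output before running the expansion
    needs_int64 = max_deg2_index(col_max=120005, n_features=120006,
                                 interaction_only=False) > MAX_INT32
    print(needs_int64)  # True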

@wdevazelhes
Contributor Author

wdevazelhes commented Mar 17, 2021

It might not be a good idea to use a try ... except ... approach, as the expansion would take a long time when the number of dimensions is high. It may be better to calculate the resulting dimensionality and check whether int32 is sufficient.

I agree. If there is a fast way to compute the dimension, we can quickly decide the output format of the indices from there. At a glance, I think it can be done by using _deg2_column or _deg3_column to compute the largest possible indices. (We would need to explicitly use the int64 versions of these functions)

I agree, I'll try to compute the largest possible index as you suggested, @thomasjpfan:

  • For the "polynomial expansion" case, I think we can just find the maximum column index col_max in the indices, and indeed just compute _deg2_column(col_max, col_max) or _deg3_column(col_max, col_max, col_max), with the int64 version of those functions: if it's bigger than 2147483647 (np.iinfo(np.int32)), then we need to convert indices to int64 before doing the expansion
  • For the "interaction expansion" case, I think we would need to find the maximum column index col_max as above, then among all rows that contain a column col_max, find the max column number that they contain except col_max, col_max_2 (and then similarly find the subsequent col_max_3), and then decide based on the int64 computation of _deg2_column(col_max, col_max_2) (or `_deg3_column(col_max, col_max_2, col_max_3)``. (I need to double check this)

(A simpler idea, which wouldn't need to go through all the elements of indices, would be to just look at the expanded_dimensionality and see if it's larger than 2147483647. But it may sometimes be worse than computing the largest possible index as described above, in some odd cases, for instance if many of the last columns of the expanded matrix are empty... I don't know how often those cases arise, though, or whether it's worth going through all elements of indices just to be optimal in those rare cases.)

(A more efficient solution, instead of going through all indices to find the max, would be to solve for i the equation _deg2_column(i, i) = 2147483647 and check whether indices contains any index bigger than that. But I don't know if it's worth it either.)

After that, I think we also need to check whether indptr should be cast to int64 too: in theory the biggest number in indptr is its last element, which should be equal to total_nnz (the number of nonzero elements of the expansion), but I need to double-check in case I'm missing some edge cases (for example, I'm not sure yet how the code deals with sparse matrices that have duplicate entries).
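
(For reference, on a canonical CSR matrix the last entry of indptr is the number of stored elements, so it is the largest value indptr holds; a tiny scipy example, unrelated to the PR code:)

    import scipy.sparse as sp

    X = sp.random(5, 4, density=0.5, format="csr", random_state=0)
    # indptr is non-decreasing and its last entry equals the number of
    # stored elements, so that is the value to compare with the int32 limit.
    assert X.indptr[-1] == X.nnz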

But first, I already pushed some casts to int64 needed to avoid overflow in the computation of the expanded dimensionality and the number of nonzero elements (I'm not 100% sure of the syntax for casting in Cython; this is how I saw it done online and it compiles, but I may be mistaken).

@wdevazelhes
Contributor Author

I also see that all current tests pass, but the example from @jianlingzhong does not, so I'll add it as a test.

@wdevazelhes
Contributor Author

wdevazelhes commented Mar 22, 2021

I'm going for the simplest solution described above ("A simpler idea, which wouldn't need to go through all the elements of indices, would be to just look at the expanded_dimensionality and see if it's larger than 2147483647"), which I guess is enough to perform well in most cases?
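
A minimal sketch of that check (the function name is mine; the degree-2 formula is the one printed above, and the degree-3 one is the analogous combination count):

    import numpy as np

    MAX_INT32 = np.iinfo(np.int32).max

    def expansion_needs_int64(n_features, degree, interaction_only):
        # Compute the expanded dimensionality in int64 and compare it with
        # the largest value an int32 index can hold.
        d = np.int64(n_features)
        if degree == 2:
            expanded = (d ** 2 + d) // 2 - interaction_only * d
        else:  # degree == 3
            expanded = (d ** 3 + 3 * d ** 2 + 2 * d) // 6 - interaction_only * d ** 2
        return expanded > MAX_INT32

    print(expansion_needs_int64(120006, degree=2, interaction_only=True))  # True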

Regarding the possibility of multiple values pointing to the same place in the sparse matrix, there is no such problem because _validate_data takes care of it (if there are such duplicate values, it sums them, as is standard for sparse matrices).

Also, I'm wondering if there can be problems with the data inside the array (for instance if it's float32 and the product of two float32 values is too big to fit in a float32). Since overflow fails silently in Cython code, this could be bad for ML algorithms: it would overflow without telling the user... I'll look at this and raise an issue if needed.

I'm also thinking there could be a silent problem if the input array indices are int64 but the expanded dimensionality is bigger than even the int64 max.

Also, about the test: because of a bug in the fit function of PolynomialFeatures, fit_transform would take very long on @jianlingzhong's example, but #19734 should fix that, so I'll add the test after it's merged.

I haven't had time to make progress yet, but I'll get back to it soon.

@frrad
Contributor

frrad commented Apr 2, 2021

fyi, #19734 was just merged so you should be good to add that test.

@wdevazelhes
Contributor Author

fyi, #19734 was just merged so you should be good to add that test.

Thanks for the heads up @frrad ! I'll get started on adding the test

@wdevazelhes
Contributor Author

I'm still working on it, but I'm having trouble explicitly casting a variable with a fused type to, say, an int64 in Cython: it seems the syntax var = <np.int64_t>var doesn't work. I'll try to find out how to do it, but it would help if anyone knows.

Member

@thomasjpfan left a comment


Left some tips :)

# expanded_indptr to int64 too.
# TODO: check that the casting is indeed done
if expanded_dimensionality > MAX_INT32:
    indices = np.array(indices, dtype=np.int64)
Member


If indices is of dtype np.int32, it would be strictly typed as int32 at this line. This means that we cannot have it point to an int64 array.

I think we need to factor out the code that does the expansion into its own fused function. The original _csr_polynomial_expansion will decide if it should cast indices and friends.

Comment on lines 138 to 139
if total_nnz > MAX_INT32:
    expanded_indptr = np.array(expanded_indptr, dtype=np.int64)
Member


This could lead to a CSR matrix with indices and indptr not having the same dtype. I do not know if scipy supports that.

# in the expanded space, so we cast to int64
# before the expansion computation to
# avoid overflow:
nnz = np.int64(indptr[row_i + 1] - indptr[row_i])
Member


nnz would need to be typed int64

  Xp_next = _csr_polynomial_expansion(X.data, X.indices,
-                                     X.indptr, X.shape[1],
+                                     X.indptr,
+                                     np.int64(X.shape[1]),
Member


If X.indptr.dtype is int32, this would be downcast anyway. We can change the signature of _csr_polynomial_expansion to always do the upcasting.

def _csr_polynomial_expansion(...
                              np.int64_t d, INDEX_T interaction_only,
                              INDEX_T degree):

@wdevazelhes
Contributor Author

@thomasjpfan thanks a lot for the help! It's much clearer now how casting works in Cython.
Indeed, factoring out the code into its own fused function sounds good; I'll try to do that this week.

@wdevazelhes
Contributor Author

In 90e12b8, similarly to @thomasjpfan's suggestion, I separated:

  • the part which decides whether we need casting or not, in the Python code (in _cast_to_int64_if_needed)
  • the part that does the actual expansion, in the fused _csr_polynomial_expansion function in the Cython code.

(@thomasjpfan, in #19676 (comment) you suggested doing the expansion outside of _csr_polynomial_expansion, but I started by extracting the casting check out of it, and it then felt natural to keep _csr_polynomial_expansion as it is; let me know what you think.)

I went for a simple rule for casting: if the expanded dimensionality needs int64, then I convert all indices and indptr to int64. This may not be optimal, since the actual expanded matrix might be so sparse that the expanded indptr (related to the number of nonzero elements in each row) could stay an int32 array; however, there is no way to know this in advance: to know it, we would need to actually compute and sum how many nonzero elements there are in each expanded row, so we might find halfway through that we should have started with int64 from the beginning, which is not ideal... I think this method is a good heuristic that prevents overflow errors while still letting people use int32 in most cases where they can.
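
Roughly, the casting step looks like this (only a sketch: here the decision needs_int64 is passed in, and it would come from the expanded-dimensionality check above):

    import numpy as np

    def _cast_to_int64_if_needed(X, needs_int64):
        # astype(..., copy=False) only copies when the dtype actually changes,
        # so int32 inputs that do not need the upcast keep their original arrays.
        if needs_int64:
            indices = X.indices.astype(np.int64, copy=False)
            indptr = X.indptr.astype(np.int64, copy=False)
        else:
            indices, indptr = X.indices, X.indptr
        return X.data, indices, indptr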

Let me know what you think !

Tests are green now, which is encouraging. I still need to add a what's new entry, and I was thinking of adding a few more cases to the test to make sure the code works in every case.

Member

@thomasjpfan left a comment


to know it, we would need to actually compute and sum how many nonzero elements there are in each expanded row, so we might find halfway through that we should have started with int64 from the beginning, which is not ideal...

My initial thought was to start the computation and, when it discovers that it should be int64 to begin with, create an int64 container, copy over the results, and continue from there.

Given that, I see that the heuristic we have here is simple. My concern is that there would be a memory regression for input that did not actually need the int64 conversion.

Comment on lines +44 to +45
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
Member


There are two copies being created here: one from X.copy() and one here when changing the dtype of the indices. I do not see a good alternative. I would expect the following to work:

        X = csr_matrix(
            (X.data, X.indices.astype(np.int64), X.indptr.astype(np.int64)),
            shape=X.shape, copy=True,
        )

but internally csr_matrix checks the indices and will cast back down to np.int32.
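
For example (with a small random CSR matrix instead of the PR's data):

    import numpy as np
    import scipy.sparse as sp

    X = sp.random(5, 4, density=0.5, format="csr", random_state=0)
    X64 = sp.csr_matrix(
        (X.data, X.indices.astype(np.int64), X.indptr.astype(np.int64)),
        shape=X.shape, copy=True,
    )
    # scipy picks the smallest sufficient index dtype, so the int64 arrays
    # are cast back down:
    print(X64.indices.dtype)  # int32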

Contributor Author


That's right, your solution sounds good.
I'm not completely sure I understand the problem with it: is it because, when creating a csr_matrix with indices that could fit in int32, scipy automatically chooses int32 indices?

In that case, maybe I could avoid copying X and deal with every part of it (data, indices, indptr) separately, like this:

  • remove the X = X.copy() line
  • do X_indices = X.indices.astype(np.int64, copy=False) (to make a copy only if the dtype changes)
  • similarly, X_indptr = X.indptr.astype(np.int64, copy=False)

And then return X.data, X_indices, and X_indptr, which I would use later on in _csr_polynomial_expansion.
What do you think?

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@wdevazelhes
Contributor Author

My initial thought was to start the computation and, when it discovers that it should be int64 to begin with, create an int64 container, copy over the results, and continue from there.

Given that, I see that the heuristic we have here is simple. My concern is that there would be a memory regression for input that did not actually need the int64 conversion.

Thanks a lot for the review @thomasjpfan ! And sorry for the delay in my answer

Regarding the strategy of starting the computation: I agree it would be best, but I couldn't find a way to do that efficiently. The problem is that I don't see how the code could find out on the fly that we should have used int64 to begin with: Cython code doesn't raise any warning/error when an operation overflows (see this notebook for an example: https://colab.research.google.com/drive/18KW2Zk1eq5d-WkTBf_l23UF2zNs0USpk?usp=sharing), contrary to numpy for instance, so we can't catch such an error to know we should have done the int64 conversion.

So I guess the only way to know whether, say, c = a + b will overflow (with a and b both np.int32) is to convert them to np.int64 and compare c with the largest possible int32 value, no? So this would slow down the computation in some cases, whereas with the heuristic above, in many cases the computation can be done without any conversions?
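
i.e. something like this (a sketch, not PR code):

    import numpy as np

    MAX_INT32 = np.iinfo(np.int32).max

    def add_would_overflow_int32(a, b):
        # Do the operation in int64 and check whether the result still fits
        # in int32.
        c = np.int64(a) + np.int64(b)
        return c > MAX_INT32

    print(add_would_overflow_int32(np.int32(2_000_000_000), np.int32(500_000_000)))  # True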

But I agree that it could lead to memory regressions for input that did not actually need the int64 conversion.

Maybe we could do the following, mixing the two strategies (what do you think?):

  • if the heuristic above finds that we can keep the data as int32, then we do so; this way the computation is fast (no conversions and comparisons with the int32 max at each operation) and memory stays low
  • if the heuristic above finds that the expanded dimensionality is too big, then we do checks on the fly as you said, i.e. each operation is done with int64-converted indices and we check whether the result fits in int32; if yes we keep int32, if not we copy the previous results into int64 and keep going with int64 computations for the subsequent results, as you said?

Let me know how that sounds; if it's a good idea, I'll start implementing it.

@wdevazelhes
Contributor Author

wdevazelhes commented May 11, 2021

(Btw, I have limited time these days, so if you or anyone else feels like taking over the PR, go ahead, it could go faster; otherwise I'll try to continue from time to time when I have some time. Sorry for the delay.)

@cmarmo
Contributor

cmarmo commented May 12, 2021

@wdevazelhes I have labeled as 'help wanted' if this is ok with you. If you find some time to continue the PR you can always... help yourself... :)

@wdevazelhes
Contributor Author

@wdevazelhes I have labeled as 'help wanted' if this is ok with you. If you find some time to continue the PR you can always... help yourself... :)

@cmarmo thanks a lot, that works :)

@niuk-a
Contributor

niuk-a commented Jul 8, 2021

Hello, scikit-learn community. I have some time and a great interest in contributing to the project. I investigated the history of this issue and read the current PR, and I now have a working solution for this problem. Could I take this issue to work on and create a new PR?

I slightly rewrote the code of _csr_polynomial_expansion.pyx to move the logic for array allocation outside of the C code.

@cmarmo
Contributor

cmarmo commented Jul 9, 2022

Closing as superseded by #23731. Thanks @wdevazelhes for your work!

@cmarmo cmarmo closed this Jul 9, 2022
Labels: cython, module:preprocessing, Superseded (PR has been replaced by a newer PR)