
[WIP] FIX index overflow error in sparse matrix polynomial expansion (bis) #19676


Closed
wants to merge 11 commits

Conversation

wdevazelhes
Contributor

This PR is an attempt to take over #16831. I'll start with a simple merge with master and then investigate the issue @jianlingzhong described in #16831 (comment).

Member

@thomasjpfan left a comment


I think the overarching issue with _csr_polynomial_expansion is that the output indices can still be encoded as int32 if the expansion is not too big. If the expansion is too big, even an int32 input would need int64 indices to encode the expansion.

The solution in this PR is to convert everything to int64, which would lead to more memory usage for int32 indices.

I think a solution could be:

  1. _csr_polynomial_expansion can accept 32bit and 64bit indices.
  2. try to build a 32bit expansion
  3. If the 32bit expansion does not have enough capacity, build a 64bit expansion.

Something like this:

try:
    Xp_next = _csr_polynomial_expansion(..., output_format="int32")
except ValueError:
    Xp_next = _csr_polynomial_expansion(..., output_format="int64")

(The API is up for discussion). We can also wrap the above logic in _csr_polynomial_expansion and try to reuse the (incomplete) result of the int32 expansion and use that to inform the int64 expansion.

# use np.int64 for index datatype to prevent overflow
# in case X has a large dimension
Xp_next = _csr_polynomial_expansion(X.data,
                                    X.indices.astype(np.int64),
Member


I think casting here would lead to more memory usage for sparse matrices encoded with int32 indices.

@wdevazelhes
Contributor Author

wdevazelhes commented Mar 15, 2021

@thomasjpfan Thanks for your help!
Indeed, I tried removing the casting of inputs to int64 from this PR and ran @jianlingzhong's example (here: #16803 (comment)), and the problem happens when forming the expansion, because it has very high dimensionality (so it needs int64).

The problem I have is that in this example, no error is thrown by the Cython code; instead, wrong negative indices are created silently (a ValueError: column index exceeds matrix dimensions is raised after the call to _csr_polynomial_expansion, at concatenation).


Also, there is something I didn't get when tracing what happens in the Cython code: I added two print lines here:

print(d)    
expanded_dimensionality = int((d**2 + d) / 2 - interaction_only*d)
print(expanded_dimensionality)

d is 120006, but expanded_dimensionality is 758209071 when it should be 7200660015... I thought it was due to the int32 overflow, so I tried to cast 7200660015 to int32, and it returns -2147483648, not 758209071, so I had trouble understanding what was going on here...
=> In fact, what happens is that 758209071 = int((np.int32(120006**2) + np.int32(120006))/2 - np.int32(120006)), with the overflow coming from squaring d.
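
Here's a minimal numpy reproduction of that wrap-around (not the actual Cython code, just the same arithmetic done in 32-bit vs 64-bit):

    import numpy as np

    # int32 array arithmetic wraps around silently, like the C-level arithmetic
    # in the Cython code: the intermediate d**2 overflows.
    d32 = np.array([120006], dtype=np.int32)
    d64 = d32.astype(np.int64)
    interaction_only = 1

    print((d32 * d32 + d32) // 2 - interaction_only * d32)  # [758209071]  (wrapped)
    print((d64 * d64 + d64) // 2 - interaction_only * d64)  # [7200660015] (correct)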

@jianlingzhong

It might not be a good idea to use a try ... except ... approach, as the expansion would take a long time when the number of dimensions is high. It may be better to calculate the resulting dimensionality and check whether int32 is sufficient.

@thomasjpfan
Member

The problem I have is that in this example, no error is thrown by the Cython code; instead, wrong negative indices are created silently (a ValueError: column index exceeds matrix dimensions is raised after the call to _csr_polynomial_expansion, at concatenation)

We would need to update the Cython to properly error when it notices that int64 is not good enough.

It might not be a good idea to use a try ... except ... approach, as the expansion would take a long time when the number of dimensions is high. It may be better to calculate the resulting dimensionality and check whether int32 is sufficient.

I agree. If there is a fast way to compute the dimension, we can quickly decide the output format of the indices from there. At a glance, I think it can be done by using _deg2_column or _deg3_column to compute the largest possible indices. (We would need to explicitly use the int64 versions of these functions)
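
For the degree-2 case, the check could look roughly like this (just a sketch: the helper name is made up, for interaction_only it uses the pair (col_max - 1, col_max) as a conservative upper bound, and the formulas are the degree-2 column mappings d*i - (i**2 + i)//2 + j and d*i - (i**2 + 3*i)//2 - 1 + j):

    import numpy as np

    MAX_INT32 = np.iinfo(np.int32).max

    def max_deg2_index(col_max, n_features, interaction_only):
        # Evaluate the degree-2 column mapping in int64 so the check itself
        # cannot overflow.
        d = np.int64(n_features)
        if interaction_only:
            i, j = np.int64(col_max) - 1, np.int64(col_max)
            return d * i - (i ** 2 + 3 * i) // 2 - 1 + j
        i = j = np.int64(col_max)
        return d * i - (i ** 2 + i) // 2 + j

    # decide the index dtype of the output before running the expansion
    needs_int64 = max_deg2_index(col_max=120005, n_features=120006,
                                 interaction_only=False) > MAX_INT32
    print(needs_int64)  # True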

@wdevazelhes
Contributor Author

wdevazelhes commented Mar 17, 2021

It might not be a good idea to use a try ... except ... approach, as the expansion would take a long time when the number of dimensions is high. It may be better to calculate the resulting dimensionality and check whether int32 is sufficient.

I agree. If there is a fast way to compute the dimension, we can quickly decide the output format of the indices from there. At a glance, I think it can be done by using _deg2_column or _deg3_column to compute the largest possible indices. (We would need to explicitly use the int64 versions of these functions)

I agree, I'll try to compute the largest possible index as you suggested, @thomasjpfan:

  • For the "polynomial expansion" case, I think we can just find the maximum column index col_max in the indices, and indeed just compute _deg2_column(col_max, col_max) or _deg3_column(col_max, col_max, col_max), with the int64 version of those functions: if it's bigger than 2147483647 (np.iinfo(np.int32)), then we need to convert indices to int64 before doing the expansion
  • For the "interaction expansion" case, I think we would need to find the maximum column index col_max as above, then among all rows that contain a column col_max, find the max column number that they contain except col_max, col_max_2 (and then similarly find the subsequent col_max_3), and then decide based on the int64 computation of _deg2_column(col_max, col_max_2) (or `_deg3_column(col_max, col_max_2, col_max_3)``. (I need to double check this)

(A simpler idea, which wouldn't need to go through all the elements of indices, would be to just look at the expanded_dimensionality and see if it's larger than 2147483647. But it may sometimes be worse than computing the largest possible index as described above, in some odd cases, for instance if many of the last columns of the expanded matrix are empty... I don't know how often those cases arise, though, or whether it's worth going through all elements of indices just to be optimal in those rare cases.)

(A more efficient solution, instead of going through all indices to find the max, would be to solve for i the equation _deg2_column(i, i) = 2147483647 and check whether indices contains any index bigger than that. But I don't know if it's worth it either.)

After that, I think we also need to check whether indptr should be cast to int64 too: in theory the biggest number in indptr is its last element, which should be equal to total_nnz (the number of nonzero elements of the expansion), but I need to double-check in case I'm missing some edge cases (for example, I'm not sure yet how the code deals with sparse matrices that have duplicate entries).
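
(For reference, on a canonical CSR matrix the last entry of indptr is the number of stored elements, so it is the largest value indptr holds; a tiny scipy example, unrelated to the PR code:)

    import scipy.sparse as sp

    X = sp.random(5, 4, density=0.5, format="csr", random_state=0)
    # indptr is non-decreasing and its last entry equals the number of
    # stored elements, so that is the value to compare with the int32 limit.
    assert X.indptr[-1] == X.nnz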

But first, I already pushed some casts to int64 needed to avoid overflow in the computation of the expanded dimensionality and the number of nonzero elements (I'm not 100% sure of the syntax for casting in Cython; this is how I saw it done online and it compiles, but I may be mistaken).

@wdevazelhes
Contributor Author

I also see that all current tests pass, but the example from @jianlingzhong does not, so I'll add it as a test.

@wdevazelhes
Contributor Author

wdevazelhes commented Mar 22, 2021

I'm going for the simplest solution described above ("A simpler idea, which wouldn't need to go through all the elements of indices, would be to just look at the expanded_dimensionality and see if it's larger than 2147483647"), which I guess is enough to perform well in most cases?
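
A minimal sketch of that check (the function name is mine; the degree-2 formula is the one printed above, and the degree-3 one is the analogous combination count):

    import numpy as np

    MAX_INT32 = np.iinfo(np.int32).max

    def expansion_needs_int64(n_features, degree, interaction_only):
        # Compute the expanded dimensionality in int64 and compare it with
        # the largest value an int32 index can hold.
        d = np.int64(n_features)
        if degree == 2:
            expanded = (d ** 2 + d) // 2 - interaction_only * d
        else:  # degree == 3
            expanded = (d ** 3 + 3 * d ** 2 + 2 * d) // 6 - interaction_only * d ** 2
        return expanded > MAX_INT32

    print(expansion_needs_int64(120006, degree=2, interaction_only=True))  # True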

Regarding the possibility of multiple values pointing to the same place in the sparse matrix, there is no such problem because _validate_data takes care of it (if there are such duplicate values, it sums them, as is standard for sparse matrices).

Also, I'm wondering if there can be problems with the data inside the array (for instance if it's float32 and the product of two float32 values is too big to fit in a float32). Since overflow fails silently in Cython code, this could be bad for ML algorithms: it would overflow without telling the user... I'll look at this and raise an issue if needed.

I'm also thinking there could be a silent problem if the input array indices are int64 but the expanded dimensionality is bigger than even the int64 max.

Also, about the test: because of a bug in the fit function of PolynomialFeatures, fit_transform would take very long on @jianlingzhong's example, but #19734 should fix that, so I'll add the test after it's merged.

I haven't had time to make progress yet, but I'll get back to it soon.

@frrad
Contributor

frrad commented Apr 2, 2021

fyi, #19734 was just merged so you should be good to add that test.

@wdevazelhes
Contributor Author

fyi, #19734 was just merged so you should be good to add that test.

Thanks for the heads up @frrad ! I'll get started on adding the test

@wdevazelhes
Contributor Author

I'm still working on it, but I'm having trouble explicitly casting a variable with a fused type to, say, an int64 in Cython: it seems the syntax var = <np.int64_t>var doesn't work. I'll try to find out how to do it, but it would help if anyone knows.

Member

@thomasjpfan left a comment


Left some tips :)

# expanded_indptr to int64 too.
# TODO: check that the casting is indeed done
if expanded_dimensionality > MAX_INT32:
    indices = np.array(indices, dtype=np.int64)
Member


If indices is of dtype np.int32, it would be strictly typed as int32 at this line. This means that we cannot have it point to an int64 array.

I think we need to factor out the code that does the expansion into its own fused function. The original _csr_polynomial_expansion will decide if it should cast indices and friends.

Comment on lines 138 to 139
if total_nnz > MAX_INT32:
    expanded_indptr = np.array(expanded_indptr, dtype=np.int64)
Member


This could lead to a CSR matrix with indices and indptr not having the same dtype. I do not know if scipy supports that.

# in the expanded space, so we cast to int64
# before the expansion computation to
# avoid overflow:
nnz = np.int64(indptr[row_i + 1] - indptr[row_i])
Member


nnz would need to be typed int64

  Xp_next = _csr_polynomial_expansion(X.data, X.indices,
-                                     X.indptr, X.shape[1],
+                                     X.indptr,
+                                     np.int64(X.shape[1]),
Member


If X.indptr.dtype is int32, this would be downcast anyway. We can change the signature of _csr_polynomial_expansion to always do the upcasting.

def _csr_polynomial_expansion(...
                              np.int64_t d, INDEX_T interaction_only,
                              INDEX_T degree):

@wdevazelhes
Contributor Author

@thomasjpfan thanks a lot for the help! It's much clearer now how casting works in Cython.
Indeed, factoring out the code into its own fused function sounds good; I'll try to do that this week.

@wdevazelhes
Contributor Author

In 90e12b8, similarly to @thomasjpfan's suggestion, I separated:

  • the part which decides whether we need casting or not, in the Python code (in _cast_to_int64_if_needed)
  • the part that does the actual expansion, in the fused _csr_polynomial_expansion function in the Cython code.

(@thomasjpfan, in #19676 (comment) you suggested doing the expansion outside of _csr_polynomial_expansion, but I started by extracting the casting check out of it, and it then felt natural to keep _csr_polynomial_expansion as it is; let me know what you think.)

I went for a simple rule for casting: if the expanded dimensionality needs int64, then I convert all indices and indptr to int64. This may not be optimal, since the actual expanded matrix might be so sparse that the expanded indptr (related to the number of nonzero elements in each row) could stay an int32 array; however, there is no way to know this in advance: to know it, we would need to actually compute and sum how many nonzero elements there are in each expanded row, so we might find halfway through that we should have started with int64 from the beginning, which is not ideal... I think this method is a good heuristic that prevents overflow errors while still letting people use int32 in most cases where they can.
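
Roughly, the casting step looks like this (only a sketch: here the decision needs_int64 is passed in, and it would come from the expanded-dimensionality check above):

    import numpy as np

    def _cast_to_int64_if_needed(X, needs_int64):
        # astype(..., copy=False) only copies when the dtype actually changes,
        # so int32 inputs that do not need the upcast keep their original arrays.
        if needs_int64:
            indices = X.indices.astype(np.int64, copy=False)
            indptr = X.indptr.astype(np.int64, copy=False)
        else:
            indices, indptr = X.indices, X.indptr
        return X.data, indices, indptr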

Let me know what you think !

Tests are green now, which is encouraging. I still need to add a what's new entry, and I was thinking of adding a few more cases to the test to make sure the code works in every case.

Member

@thomasjpfan left a comment


to know it, we would need to actually compute and sum how many nonzero elements there are in each expanded row, so we might find halfway through that we should have started with int64 from the beginning, which is not ideal...

My initial thought was to start the computation and, when it discovers that it should be int64 to begin with, create an int64 container, copy over the results, and continue from there.

Given that, I see that the heuristic we have here is simple. My concern is that there would be a memory regression for input that did not actually need the int64 conversion.

Comment on lines +44 to +45
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
Member


There are two copies being created here: one from X.copy() and one here when changing the dtype of the indices. I do not see a good alternative. I would expect the following to work:

        X = csr_matrix(
            (X.data, X.indices.astype(np.int64), X.indptr.astype(np.int64)),
            shape=X.shape, copy=True,
        )

but internally csr_matrix checks the indices and will cast back down to np.int32.
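
For example (with a small random CSR matrix instead of the PR's data):

    import numpy as np
    import scipy.sparse as sp

    X = sp.random(5, 4, density=0.5, format="csr", random_state=0)
    X64 = sp.csr_matrix(
        (X.data, X.indices.astype(np.int64), X.indptr.astype(np.int64)),
        shape=X.shape, copy=True,
    )
    # scipy picks the smallest sufficient index dtype, so the int64 arrays
    # are cast back down:
    print(X64.indices.dtype)  # int32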

Contributor Author


That's right, your solution sounds good.
I'm not completely sure I understand the problem with it: is it because, when creating a csr_matrix with indices that could fit in int32, scipy automatically chooses int32 indices?

In that case, maybe I could avoid copying X and deal with every part of it (data, indices, indptr) separately, like this:

  • remove the X = X.copy() line
  • do X_indices = X.indices.astype(np.int64, copy=False) (to make a copy only if the dtype changes)
  • similarly, X_indptr = X.indptr.astype(np.int64, copy=False)

And then return X.data, X_indices, and X_indptr, which I would use later on in _csr_polynomial_expansion.
What do you think?

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@wdevazelhes
Contributor Author

My initial thought was to start the computation and, when it discovers that it should be int64 to begin with, create an int64 container, copy over the results, and continue from there.

Given that, I see that the heuristic we have here is simple. My concern is that there would be a memory regression for input that did not actually need the int64 conversion.

Thanks a lot for the review @thomasjpfan ! And sorry for the delay in my answer

Regarding the strategy of starting the computation: I agree it would be best, but I couldn't find a way to do that efficiently. The problem is that I don't see how the code could find out on the fly that we should have used int64 to begin with: Cython code doesn't raise any warning/error when an operation overflows (see this notebook for an example: https://colab.research.google.com/drive/18KW2Zk1eq5d-WkTBf_l23UF2zNs0USpk?usp=sharing), contrary to numpy for instance, so we can't catch such an error to know we should have done the int64 conversion.

So I guess the only way to know whether, say, c = a + b will overflow (with a and b both np.int32) is to convert them to np.int64 and compare c with the largest possible int32 value, no? So this would slow down the computation in some cases, whereas with the heuristic above, in many cases the computation can be done without any conversions?
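
i.e. something like this (a sketch, not PR code):

    import numpy as np

    MAX_INT32 = np.iinfo(np.int32).max

    def add_would_overflow_int32(a, b):
        # Do the operation in int64 and check whether the result still fits
        # in int32.
        c = np.int64(a) + np.int64(b)
        return c > MAX_INT32

    print(add_would_overflow_int32(np.int32(2_000_000_000), np.int32(500_000_000)))  # True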

But I agree that it could lead to memory regressions for input that did not actually need the int64 conversion.

Maybe we could do the following, mixing the two strategies (what do you think?):

  • if the heuristic above finds that we can keep the data as int32, then we do so; this way the computation is fast (no conversions and comparisons with the int32 max at each operation) and memory stays low
  • if the heuristic above finds that the expanded dimensionality is too big, then we do checks on the fly as you said, i.e. each operation is done with int64-converted indices and we check whether the result fits in int32; if yes we keep int32, if not we copy the previous results into int64 and keep going with int64 computations for the subsequent results, as you said?

Let me know how that sounds; if it's a good idea, I'll start implementing it.

@wdevazelhes
Contributor Author

wdevazelhes commented May 11, 2021

(Btw, I have limited time these days, so if you or anyone else feels like taking over the PR, go ahead, it could go faster; otherwise I'll try to continue from time to time when I have some time. Sorry for the delay.)

@cmarmo
Contributor

cmarmo commented May 12, 2021

@wdevazelhes I have labeled as 'help wanted' if this is ok with you. If you find some time to continue the PR you can always... help yourself... :)

@wdevazelhes
Contributor Author

@wdevazelhes I have labeled as 'help wanted' if this is ok with you. If you find some time to continue the PR you can always... help yourself... :)

@cmarmo thanks a lot, that works :)

@niuk-a
Contributor

niuk-a commented Jul 8, 2021

Hello, scikit-learn community. I have some time and a great interest in contributing to the project. I investigated the history of this issue and read the current PR, and I now have a working solution for this problem. Could I take this issue to work on and create a new PR?

I slightly rewrote the code of _csr_polynomial_expansion.pyx to move the logic for array allocation outside of the C code.

@cmarmo
Contributor

cmarmo commented Jul 9, 2022

Closing as superseded by #23731. Thanks @wdevazelhes for your work!

@cmarmo cmarmo closed this Jul 9, 2022
Labels: cython, module:preprocessing, Superseded (PR has been replaced by a newer PR)