Description
Describe the workflow you want to enable
The code in preprocessing/_csr_polynomial_expansion.pyx builds the components of a CSR matrix with column indices stored as int32. For inputs with a sufficiently large number of columns, those indices overflow, which produces negative column indices in the resulting CSR matrix.
Here's an instance of a user running into this problem:
https://stackoverflow.com/questions/60920877/scipy-sparse-matrix-negative-column-index-found
The dimensionality of a polynomial expansion grows quickly with the input dimensionality. With int32 column indices, a second-degree expansion without bias already overflows at an input dimensionality of 65,535: the expansion has 2,147,516,415 columns, which exceeds the int32 maximum of 2,147,483,647.
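To make the threshold concrete, here is a minimal sketch of the arithmetic. It assumes the degree-2 expansion keeps the linear terms plus all squares and pairwise products (no bias column), and it conservatively requires the column count itself to fit in int32. The helper names are hypothetical and are not part of scikit-learn.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2_147_483_647


def degree2_output_columns(n_features: int) -> int:
    """Column count of a degree-2 expansion without bias.

    Assumption: n linear terms plus n * (n + 1) / 2 squares and
    pairwise products. Python ints do not overflow, so the result
    is exact.
    """
    return n_features + n_features * (n_features + 1) // 2


def fits_in_int32(n_features: int) -> bool:
    """True if the expanded column count still fits in an int32."""
    return degree2_output_columns(n_features) <= INT32_MAX


def wrapped_last_column(n_features: int) -> int:
    """Largest column index after a wrapping cast from int64 to int32."""
    last = np.array([degree2_output_columns(n_features) - 1], dtype=np.int64)
    return int(last.astype(np.int32)[0])
```

Under these assumptions, `fits_in_int32(65534)` holds and `fits_in_int32(65535)` does not, and `wrapped_last_column(65535)` is negative, which is consistent with the negative indices reported in the linked question.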
Describe your proposed solution
Use int64 as the index type in the Cython file. The type is currently specified as int32 in a typedef, so this should be a simple fix.
Describe alternatives you've considered, if relevant
The type necessary to store the output dimensionality could easily be determined. INDEX_T could be made a fused type. The call to _csr_polynomial_expansion could support two specializations, one for int32 and another for int64. A wrapper around them could decide which to call based on the input dimensionality.
This approach seems like overkill.
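For reference, the selection step of that alternative could look roughly like the sketch below. It is an illustration only: `pick_index_dtype` and `degree2_output_columns` are hypothetical names, the column count assumes a degree-2 expansion without bias that includes squares and pairwise products, and the actual dispatch to int32 and int64 specializations of _csr_polynomial_expansion is not shown.

```python
import numpy as np


def degree2_output_columns(n_features: int) -> int:
    # Assumption: linear terms plus all squares and pairwise products.
    return n_features + n_features * (n_features + 1) // 2


def pick_index_dtype(n_output_columns: int):
    """Return the narrowest index dtype that can hold the column count.

    Conservative choice: the count itself, not just the largest index,
    must fit in int32 before int32 is selected.
    """
    if n_output_columns <= np.iinfo(np.int32).max:
        return np.int32
    return np.int64


# A wrapper would compute the output width first, then call the
# matching specialization (hypothetical; not shown here).
small = pick_index_dtype(degree2_output_columns(1_000))
large = pick_index_dtype(degree2_output_columns(65_535))
```

Here `small` comes out as `np.int32` and `large` as `np.int64`, so narrow inputs would keep the smaller index arrays while wide inputs avoid the overflow.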