Cython code for PolynomialFeatures should use int64s for indices. #17554

@AWNystrom

Describe the workflow you want to enable

The code in preprocessing/_csr_polynomial_expansion.pyx produces the components of a CSR matrix with column indices represented as int32s. For inputs with a sufficiently large number of columns, these indices can overflow, which leads to negative column indices in the resulting CSR matrix.

Here's an instance of a user running into this problem:
https://stackoverflow.com/questions/60920877/scipy-sparse-matrix-negative-column-index-found

The dimensionality of a polynomial expansion grows quickly with respect to the input dimensionality. With int32 column indices, a second-degree expansion without bias already overflows at an input dimensionality of 65,535.
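As a rough check of that threshold, here is a minimal sketch in plain Python. It assumes the degree-2 expansion without bias yields d linear terms plus d(d + 1)/2 second-degree terms; the helper names `expanded_dim_degree2` and `wrap_int32` are made up for illustration and are not part of scikit-learn.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def expanded_dim_degree2(d: int) -> int:
    """Output columns of a degree-2 expansion without bias.

    Assumption: d linear terms plus d * (d + 1) / 2 second-degree terms.
    """
    return d + d * (d + 1) // 2


def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


for d in (65_534, 65_535):
    last_col = expanded_dim_degree2(d) - 1  # largest column index needed
    fits = last_col <= INT32_MAX
    print(f"d={d}: last column index={last_col}, fits in int32={fits}, "
          f"as int32={wrap_int32(last_col)}")
```

Under that assumption, 65,534 input columns still fit, while 65,535 push the largest column index past the int32 maximum, so it wraps to a negative value.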

Describe your proposed solution

Use int64 as the index type in the Cython file. The type is currently specified as int32 in a typedef, so this should be a simple fix.

Describe alternatives you've considered, if relevant

The index type needed to hold the output dimensionality could easily be determined up front. INDEX_T could be made a fused type, and _csr_polynomial_expansion could support two specializations, one for int32 and another for int64. A wrapper around them could decide which to call based on the input dimensionality.
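The wrapper's decision could look roughly like the sketch below. It is plain Python, not the actual Cython code. `choose_index_dtype` is a hypothetical helper, the dtype is returned as a string only to keep the example dependency-free, and the degree-2 column count rests on the same assumption as before (d linear terms plus d(d + 1)/2 second-degree terms).

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def expanded_dim_degree2(d: int) -> int:
    """Assumed output width of a degree-2 expansion without bias."""
    return d + d * (d + 1) // 2


def choose_index_dtype(n_output_columns: int) -> str:
    """Pick the narrowest index type that can hold every column index.

    Hypothetical helper: real code would dispatch to the int32 or int64
    specialization instead of returning a name.
    """
    largest_index = n_output_columns - 1
    return "int32" if largest_index <= INT32_MAX else "int64"


for d in (1_000, 65_534, 65_535):
    print(d, choose_index_dtype(expanded_dim_degree2(d)))
```

Computing the output width with Python's arbitrary-precision integers avoids overflowing inside the check itself.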

This approach seems like overkill.
