Update sparse spmatrix syntax to support both sparray and spmatrix #30858

dschult · 2025-02-19T01:51:27Z

This PR updates scipy.sparse syntax within scikit-learn so that both sparse array (sparray) and sparse matrix (spmatrix) are fully supported.

It completes "pass 1" of the 2-pass process described in the SciPy migration from spmatrix to sparray document. This entails:

changing a couple of matrix multiply operations still using * as the operator.
updating function/method names for those replaced for sparray (like getnnz, getformat, getcol, etc).
other changes that you have already implemented so nothing was found in this process (like changing from isspmatrix).
It looks like you have already done most of this work -- so what is here is just the stragglers that weren't caught by previous updates.

This not only ensures that sparray inputs will work with scikit-learn, but it prepares for "pass 2" of the migration process. Pass 2 involves shifting internal usage of spmatrix to sparray while maintaining support for spmatrix. The hardest part of that are functions that return sparse. Should they return sparray or spmatrix? Often you can tell from the type of an input which type should be expected. Otherwise you can create two versions of the function (like SciPy did with eye_array) or you can use a kwarg (like SciPy did with loadmat(..., spmatrix=True). Those decisions are for the next PR.

This PR should be a set of small changes that don't affect the output, but update to the spmatrix syntax that works with sparray. Let me know how I can help with the review of this PR.

github-actions · 2025-02-19T01:52:38Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: f4b4d15. Link to the linter CI: here}

dschult · 2025-02-20T04:44:39Z

To convert the one remaining A.getnnz I had to adjust the sparsefuncs.py function count_nonzero to support CSC format. I adjusted the test for that as well.

lesteve · 2025-02-20T10:37:03Z

Thanks for the PR 🙏!

Looks like we were still using sparse matrix specific constructs like * instead of @ and that no tests were complaining about it. It kind of makes me think that we have a lack of coverage about testing algorithms with both sparse matrices and sparse arrays, i.e. what #27090 was trying to achieve.

I guess tackling this lack of coverage in this PR is a bit too much to ask, not 100% sure 🤔.

dschult · 2025-02-20T17:48:36Z

I think #27090 did a very good job of:

making sure that code worked for inputs that are either sparse matrix or sparse array
checking that tests which used sparse constructors checked that either constructor would work

The approach used here to identify needed changes is to (locally on my installation) monkey_patch the scipy.sparse matrix classes so that calling __mul__ with a non-scalar other argument raises an error. And calling getformat raises, etc. This does not check that sparse array inputs work. That's already done. This flags any internal computations involving sparse matrices that cannot work with sparse arrays due to old-style-syntax. I believe these are a few places where, internally, sparse matrices are being created, manipulated and results are returned without the sparse matrix being exposed publicly.

So the goal here is somewhat different from that of the previous issue. The library is already tested to work for sparse matrix and sparse array. This effort is preparing the library to work with all internal sparse creation using sparse arrays. Pass 1 makes sure that internally used syntax will support either. Pass 2 shifts internal creation to sparse arrays. Ideally, the library should work with either type of input throughout this process. Decisions during pass 2 will need to be made as part of that process. The very long term goal is that after pass 2, and after almost all users have shifted to sparse arrays, we could allow users to install a version of SciPy without support for sparse matrices.

I will look into adding tests that flag these cases when run with sparse arrays. But these leftovers are the places where creation of the sparse entity is hidden behind multiple layers of object creation and transforming. For example, the first line of the diff fixes handling a sparse matrix that arises from 1) creating a ColumnTransformer, 2) calling that transformer's .fit_transform method on a numpy array with Python Object dtype that holds strings, integers and True/False. It's not intuitive why that should return a sparse matrix. But the result is a sparse matrix -- and the rabbit hole of creation parameters and conditions is deep(!) involving many private helper functions! So I don't promise anything, but I will look into it.

I also discovered that I did not run the doctests with my monkey-patched tool. So I will add any additionally discovered fixes (and tests of them where appropriate) too.

dschult · 2025-02-22T20:33:02Z

Further tests revealed two more matmul operations that needed to be switched from * to @.

All the fixes in the PR are orthogonal to the goals from #27090 in that these new fixes are not dependent on sparse inputs. This PR changes the handling of sparse matrices created internally to sklearn. And the changes here should not change the results for any sklearn code -- only the syntax used to get those results.

lesteve

Thanks for the PR and explanation about the context. The strategy pass 1 (make our internal code compatible with sparse arrays) + pass 2 (replace internal use of sparse matrices by sparse arrays) looks good to me!

A few comments below.

sklearn/utils/tests/test_sparsefuncs.py

sklearn/utils/sparsefuncs.py

dschult · 2025-02-28T12:34:58Z

I've got the count_nonzero function working with CSC format and the added sample_weight argument. The tricky part was making CSC work with row-based sample_weights. But with it now working for CSR and CSC, it would be fairly straightforwad to allow sample weights across columns (probably not a normal usecase though). It was a good dive into that feature and could help scipy add a sample_weight feature -- though no promises there... :)

Looks like this version of mypy doesn't allow match/case syntax. I guess that requires mypy under Python 3.10.
I rewrote it with if/else.

sklearn/impute/_base.py

dschult · 2025-03-03T18:12:53Z

OK.. I think this change does what we want. But let's make sure:

Original code: n_missing = imputer_mask.getnnz(axis=0). Need to remove getnnz method.
Option 1: Use sum of mask instead: n_missing = imputer_mask.sum(axis=0)
Option 2: Use index arrays to find nnz by row: code here depends on format, but avoids sum

if imputer_mask.format == "csr":
    n_missing = np.bincount(imputer_mask.indices, minlength=imputer_mask.shape[1])
elif imputer_mask.format == "csc":
    n_missing = np.diff(imputer_mask.indptr)

I've got Option 2 in the PR now. One commit prior has Option 1. They both are passing tests.

thomasjpfan · 2025-03-03T23:05:38Z

If both options have the similar performance, I would go with imputer_mask.sum(axis=0) because it is simpler.

dschult · 2025-03-05T15:21:59Z

The times are pretty similar -- sum is slower for small arrays, but faster for large arrays. So sum is simpler and faster for the big problems. I've put Option 1 into the code for this PR.

lorentzenchr · 2025-03-15T17:10:11Z

I think this PR deserves a changelog entry.

dschult · 2025-03-16T03:16:57Z

I added an entry for the changelog -- but someone should check for style and detail level. e.g. I did not state that this should not affect any results produced by the library. Indeed users should not notice anything.

dschult · 2025-03-16T15:42:09Z

I put the changelog type as "other" type (couldn't find "maint") and in the many_modules part of upcoming changes. But these can be changed...

lesteve · 2025-03-17T07:49:59Z

@lorentzenchr can you explain more why you think a changelog is needed?

My understanding it that this PR is making sure that our internal code is compatible with both sparse matrices and sparse arrays. This should not affect users.

lorentzenchr · 2025-03-17T16:11:13Z

It is a visible improvement preparing us for the removal of sparse matrices. I just think it deserves visibility.

lesteve · 2025-03-18T13:30:04Z

I have tweaked the changelog to follow our guidelines and enabled auto-merge, thanks @dschult!

dschult force-pushed the sparse_spmatrix_migration_pass1 branch 2 times, most recently from ab07ed2 to 5351d43 Compare February 20, 2025 02:43

dschult force-pushed the sparse_spmatrix_migration_pass1 branch from 23bd790 to 601fe56 Compare February 22, 2025 21:20

lesteve added the No Changelog Needed label Feb 24, 2025

lesteve reviewed Feb 24, 2025

View reviewed changes

sklearn/utils/tests/test_sparsefuncs.py Outdated Show resolved Hide resolved

sklearn/utils/sparsefuncs.py Outdated Show resolved Hide resolved

dschult force-pushed the sparse_spmatrix_migration_pass1 branch from 601fe56 to c230cd2 Compare February 28, 2025 11:45

thomasjpfan reviewed Mar 2, 2025

View reviewed changes

sklearn/impute/_base.py Outdated Show resolved Hide resolved

dschult force-pushed the sparse_spmatrix_migration_pass1 branch 2 times, most recently from 208ccde to c1f3528 Compare March 3, 2025 17:41

dschult force-pushed the sparse_spmatrix_migration_pass1 branch 2 times, most recently from 23e1525 to cba6d4c Compare March 8, 2025 22:03

thomasjpfan approved these changes Mar 9, 2025

View reviewed changes

thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Mar 9, 2025

lorentzenchr removed the No Changelog Needed label Mar 15, 2025

dschult added 6 commits March 15, 2025 23:14

Update spmatrix syntax to support sparray and spmatrix

1c5a696

use sparsefuncs count_nonzero over same-named method

f43e207

adjust sparsefuncs count_nonzero and test to use CSC

adbd821

more tests show two other matmuls to fix

3ec1239

count_nonzero support for CSC with sample_weight

0ef9773

replace match logic with ifs

1d0038c

dschult added 4 commits March 15, 2025 23:14

revert count_nonzero expansion

9f7f5cc

use direct numpy manipulation of indexes to count nnz

84c1e8c

revert to imputer_mask.sum(axis=0)

bb110ab

Add changelog entry for PR 30858

ff9eb82

dschult force-pushed the sparse_spmatrix_migration_pass1 branch from 46dac0f to ff9eb82 Compare March 16, 2025 03:14

name file with valid type "other"

ad34249

dschult mentioned this pull request Mar 16, 2025

RFC Supporting scipy.sparse.sparray #26418

Open

Tweak changelog

7025abb

lesteve enabled auto-merge (squash) March 18, 2025 13:28

Merge branch 'main' into sparse_spmatrix_migration_pass1

f4b4d15

lesteve merged commit 239112a into scikit-learn:main Mar 18, 2025
31 checks passed

Uh oh!

Update sparse spmatrix syntax to support both sparray and spmatrix #30858

Update sparse spmatrix syntax to support both sparray and spmatrix #30858

Uh oh!

Conversation

dschult commented Feb 19, 2025

Uh oh!

github-actions bot commented Feb 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

dschult commented Feb 20, 2025

Uh oh!

lesteve commented Feb 20, 2025

Uh oh!

dschult commented Feb 20, 2025

Uh oh!

dschult commented Feb 22, 2025

Uh oh!

lesteve left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dschult commented Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

dschult commented Mar 3, 2025

Uh oh!

thomasjpfan commented Mar 3, 2025

Uh oh!

dschult commented Mar 5, 2025

Uh oh!

lorentzenchr commented Mar 15, 2025

Uh oh!

dschult commented Mar 16, 2025

Uh oh!

dschult commented Mar 16, 2025

Uh oh!

lesteve commented Mar 17, 2025

Uh oh!

lorentzenchr commented Mar 17, 2025

Uh oh!

lesteve commented Mar 18, 2025

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Feb 19, 2025 •

edited

Loading

dschult commented Feb 28, 2025 •

edited

Loading