Skip to content

Update sparse spmatrix syntax to support both sparray and spmatrix #30858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Mar 18, 2025

Conversation

dschult
Copy link
Contributor

@dschult dschult commented Feb 19, 2025

This PR updates scipy.sparse syntax within scikit-learn so that both sparse array (sparray) and sparse matrix (spmatrix) are fully supported.

It completes "pass 1" of the 2-pass process described in the SciPy migration from spmatrix to sparray document. This entails:

  • changing a couple of matrix multiply operations still using * as the operator.
  • updating function/method names for those replaced for sparray (like getnnz, getformat, getcol, etc).
  • other changes that you have already implemented so nothing was found in this process (like changing from isspmatrix).
    It looks like you have already done most of this work -- so what is here is just the stragglers that weren't caught by previous updates.

This not only ensures that sparray inputs will work with scikit-learn, but it prepares for "pass 2" of the migration process. Pass 2 involves shifting internal usage of spmatrix to sparray while maintaining support for spmatrix. The hardest part of that are functions that return sparse. Should they return sparray or spmatrix? Often you can tell from the type of an input which type should be expected. Otherwise you can create two versions of the function (like SciPy did with eye_array) or you can use a kwarg (like SciPy did with loadmat(..., spmatrix=True). Those decisions are for the next PR.

This PR should be a set of small changes that don't affect the output, but update to the spmatrix syntax that works with sparray. Let me know how I can help with the review of this PR.

Copy link

github-actions bot commented Feb 19, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f4b4d15. Link to the linter CI: here

@dschult dschult force-pushed the sparse_spmatrix_migration_pass1 branch 2 times, most recently from ab07ed2 to 5351d43 Compare February 20, 2025 02:43
@dschult
Copy link
Contributor Author

dschult commented Feb 20, 2025

To convert the one remaining A.getnnz I had to adjust the sparsefuncs.py function count_nonzero to support CSC format. I adjusted the test for that as well.

@lesteve
Copy link
Member

lesteve commented Feb 20, 2025

Thanks for the PR 🙏!

Looks like we were still using sparse matrix specific constructs like * instead of @ and that no tests were complaining about it. It kind of makes me think that we have a lack of coverage about testing algorithms with both sparse matrices and sparse arrays, i.e. what #27090 was trying to achieve.

I guess tackling this lack of coverage in this PR is a bit too much to ask, not 100% sure 🤔.

@dschult
Copy link
Contributor Author

dschult commented Feb 20, 2025

I think #27090 did a very good job of:

  • making sure that code worked for inputs that are either sparse matrix or sparse array
  • checking that tests which used sparse constructors checked that either constructor would work

The approach used here to identify needed changes is to (locally on my installation) monkey_patch the scipy.sparse matrix classes so that calling __mul__ with a non-scalar other argument raises an error. And calling getformat raises, etc. This does not check that sparse array inputs work. That's already done. This flags any internal computations involving sparse matrices that cannot work with sparse arrays due to old-style-syntax. I believe these are a few places where, internally, sparse matrices are being created, manipulated and results are returned without the sparse matrix being exposed publicly.

So the goal here is somewhat different from that of the previous issue. The library is already tested to work for sparse matrix and sparse array. This effort is preparing the library to work with all internal sparse creation using sparse arrays. Pass 1 makes sure that internally used syntax will support either. Pass 2 shifts internal creation to sparse arrays. Ideally, the library should work with either type of input throughout this process. Decisions during pass 2 will need to be made as part of that process. The very long term goal is that after pass 2, and after almost all users have shifted to sparse arrays, we could allow users to install a version of SciPy without support for sparse matrices.

I will look into adding tests that flag these cases when run with sparse arrays. But these leftovers are the places where creation of the sparse entity is hidden behind multiple layers of object creation and transforming. For example, the first line of the diff fixes handling a sparse matrix that arises from 1) creating a ColumnTransformer, 2) calling that transformer's .fit_transform method on a numpy array with Python Object dtype that holds strings, integers and True/False. It's not intuitive why that should return a sparse matrix. But the result is a sparse matrix -- and the rabbit hole of creation parameters and conditions is deep(!) involving many private helper functions! So I don't promise anything, but I will look into it.

I also discovered that I did not run the doctests with my monkey-patched tool. So I will add any additionally discovered fixes (and tests of them where appropriate) too.

@dschult
Copy link
Contributor Author

dschult commented Feb 22, 2025

Further tests revealed two more matmul operations that needed to be switched from * to @.

All the fixes in the PR are orthogonal to the goals from #27090 in that these new fixes are not dependent on sparse inputs. This PR changes the handling of sparse matrices created internally to sklearn. And the changes here should not change the results for any sklearn code -- only the syntax used to get those results.

@dschult dschult force-pushed the sparse_spmatrix_migration_pass1 branch from 23bd790 to 601fe56 Compare February 22, 2025 21:20
Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and explanation about the context. The strategy pass 1 (make our internal code compatible with sparse arrays) + pass 2 (replace internal use of sparse matrices by sparse arrays) looks good to me!

A few comments below.

@dschult dschult force-pushed the sparse_spmatrix_migration_pass1 branch from 601fe56 to c230cd2 Compare February 28, 2025 11:45
@dschult
Copy link
Contributor Author

dschult commented Feb 28, 2025

I've got the count_nonzero function working with CSC format and the added sample_weight argument. The tricky part was making CSC work with row-based sample_weights. But with it now working for CSR and CSC, it would be fairly straightforwad to allow sample weights across columns (probably not a normal usecase though). It was a good dive into that feature and could help scipy add a sample_weight feature -- though no promises there... :)

Looks like this version of mypy doesn't allow match/case syntax. I guess that requires mypy under Python 3.10.
I rewrote it with if/else.

@dschult dschult force-pushed the sparse_spmatrix_migration_pass1 branch 2 times, most recently from 208ccde to c1f3528 Compare March 3, 2025 17:41
@dschult
Copy link
Contributor Author

dschult commented Mar 3, 2025

OK.. I think this change does what we want. But let's make sure:

Original code: n_missing = imputer_mask.getnnz(axis=0). Need to remove getnnz method.
Option 1: Use sum of mask instead: n_missing = imputer_mask.sum(axis=0)
Option 2: Use index arrays to find nnz by row: code here depends on format, but avoids sum

if imputer_mask.format == "csr":
    n_missing = np.bincount(imputer_mask.indices, minlength=imputer_mask.shape[1])
elif imputer_mask.format == "csc":
    n_missing = np.diff(imputer_mask.indptr)

I've got Option 2 in the PR now. One commit prior has Option 1. They both are passing tests.

@thomasjpfan
Copy link
Member

If both options have the similar performance, I would go with imputer_mask.sum(axis=0) because it is simpler.

@dschult
Copy link
Contributor Author

dschult commented Mar 5, 2025

The times are pretty similar -- sum is slower for small arrays, but faster for large arrays. So sum is simpler and faster for the big problems. I've put Option 1 into the code for this PR.

@dschult dschult force-pushed the sparse_spmatrix_migration_pass1 branch 2 times, most recently from 23e1525 to cba6d4c Compare March 8, 2025 22:03
@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Mar 9, 2025
@lorentzenchr
Copy link
Member

I think this PR deserves a changelog entry.

@dschult dschult force-pushed the sparse_spmatrix_migration_pass1 branch from 46dac0f to ff9eb82 Compare March 16, 2025 03:14
@dschult
Copy link
Contributor Author

dschult commented Mar 16, 2025

I added an entry for the changelog -- but someone should check for style and detail level. e.g. I did not state that this should not affect any results produced by the library. Indeed users should not notice anything.

@dschult
Copy link
Contributor Author

dschult commented Mar 16, 2025

I put the changelog type as "other" type (couldn't find "maint") and in the many_modules part of upcoming changes. But these can be changed...

@lesteve
Copy link
Member

lesteve commented Mar 17, 2025

@lorentzenchr can you explain more why you think a changelog is needed?

My understanding it that this PR is making sure that our internal code is compatible with both sparse matrices and sparse arrays. This should not affect users.

@lorentzenchr
Copy link
Member

It is a visible improvement preparing us for the removal of sparse matrices. I just think it deserves visibility.

@lesteve lesteve enabled auto-merge (squash) March 18, 2025 13:28
@lesteve
Copy link
Member

lesteve commented Mar 18, 2025

I have tweaked the changelog to follow our guidelines and enabled auto-merge, thanks @dschult!

@lesteve lesteve merged commit 239112a into scikit-learn:main Mar 18, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants