-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Update sparse spmatrix syntax to support both sparray and spmatrix #30858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update sparse spmatrix syntax to support both sparray and spmatrix #30858
Conversation
ab07ed2
to
5351d43
Compare
To convert the one remaining |
Thanks for the PR 🙏! Looks like we were still using sparse matrix specific constructs like I guess tackling this lack of coverage in this PR is a bit too much to ask, not 100% sure 🤔. |
I think #27090 did a very good job of:
The approach used here to identify needed changes is to (locally on my installation) monkey_patch the scipy.sparse matrix classes so that calling So the goal here is somewhat different from that of the previous issue. The library is already tested to work for sparse matrix and sparse array. This effort is preparing the library to work with all internal sparse creation using sparse arrays. Pass 1 makes sure that internally used syntax will support either. Pass 2 shifts internal creation to sparse arrays. Ideally, the library should work with either type of input throughout this process. Decisions during pass 2 will need to be made as part of that process. The very long term goal is that after pass 2, and after almost all users have shifted to sparse arrays, we could allow users to install a version of SciPy without support for sparse matrices. I will look into adding tests that flag these cases when run with sparse arrays. But these leftovers are the places where creation of the sparse entity is hidden behind multiple layers of object creation and transforming. For example, the first line of the diff fixes handling a sparse matrix that arises from 1) creating a ColumnTransformer, 2) calling that transformer's I also discovered that I did not run the doctests with my monkey-patched tool. So I will add any additionally discovered fixes (and tests of them where appropriate) too. |
Further tests revealed two more matmul operations that needed to be switched from All the fixes in the PR are orthogonal to the goals from #27090 in that these new fixes are not dependent on sparse inputs. This PR changes the handling of sparse matrices created internally to sklearn. And the changes here should not change the results for any sklearn code -- only the syntax used to get those results. |
23bd790
to
601fe56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR and explanation about the context. The strategy pass 1 (make our internal code compatible with sparse arrays) + pass 2 (replace internal use of sparse matrices by sparse arrays) looks good to me!
A few comments below.
601fe56
to
c230cd2
Compare
I've got the Looks like this version of mypy doesn't allow |
208ccde
to
c1f3528
Compare
OK.. I think this change does what we want. But let's make sure: Original code: if imputer_mask.format == "csr":
n_missing = np.bincount(imputer_mask.indices, minlength=imputer_mask.shape[1])
elif imputer_mask.format == "csc":
n_missing = np.diff(imputer_mask.indptr) I've got Option 2 in the PR now. One commit prior has Option 1. They both are passing tests. |
If both options have the similar performance, I would go with |
The times are pretty similar -- |
23e1525
to
cba6d4c
Compare
I think this PR deserves a changelog entry. |
46dac0f
to
ff9eb82
Compare
I added an entry for the changelog -- but someone should check for style and detail level. e.g. I did not state that this should not affect any results produced by the library. Indeed users should not notice anything. |
I put the changelog type as "other" type (couldn't find "maint") and in the |
@lorentzenchr can you explain more why you think a changelog is needed? My understanding it that this PR is making sure that our internal code is compatible with both sparse matrices and sparse arrays. This should not affect users. |
It is a visible improvement preparing us for the removal of sparse matrices. I just think it deserves visibility. |
I have tweaked the changelog to follow our guidelines and enabled auto-merge, thanks @dschult! |
This PR updates
scipy.sparse
syntax within scikit-learn so that both sparse array (sparray
) and sparse matrix (spmatrix
) are fully supported.It completes "pass 1" of the 2-pass process described in the SciPy migration from spmatrix to sparray document. This entails:
*
as the operator.isspmatrix
).It looks like you have already done most of this work -- so what is here is just the stragglers that weren't caught by previous updates.
This not only ensures that
sparray
inputs will work with scikit-learn, but it prepares for "pass 2" of the migration process. Pass 2 involves shifting internal usage ofspmatrix
tosparray
while maintaining support forspmatrix
. The hardest part of that are functions that return sparse. Should they return sparray or spmatrix? Often you can tell from the type of an input which type should be expected. Otherwise you can create two versions of the function (like SciPy did witheye_array
) or you can use a kwarg (like SciPy did withloadmat(..., spmatrix=True)
. Those decisions are for the next PR.This PR should be a set of small changes that don't affect the output, but update to the spmatrix syntax that works with sparray. Let me know how I can help with the review of this PR.