-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
RFC Supporting scipy.sparse.sparray
#26418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
In general, I'm +1 with supporting |
scipy.sparse.sparray
transitionscipy.sparse.sparray
transition
#26420 has been opened as a small preliminary. |
scipy.sparse.sparray
transitionscipy.sparse.sparray
Also +1 for supporting
I‘m leaning towards 2 or 1. |
SciPy is going to deprecate matrices and will ultimately remove them. |
#27090 has been opened to update tests accordingly. |
What should we do about things that return sparse classes? After brief discussion with @jjerphan and @ogrisel it sounds like the desired behavior is to pass through the array classes as much as possible. E.g. we should try and have |
I also think we should preserve the type of inputs. I think that functions returning sparse matrices which do not have sparse matrices or array are arguments can for now continue to return sparse matrices until the version of SciPy who will officially deprecate sparse matrices is installed (in which case, sparse arrays would be returned instead). |
I would be in favor of trying to do this as much as possible, but there might be some cases where the two have different behaviors and an explicit testing for the type might be necessary. I wonder what would be a concise way to do this to keep backward compat with scipy 1.8 though... Maybe something like: import scipy.sparse as sp
if sp.issparse(X):
if hasattr(sp, "isspmatrix") and not sp.isspmatrix(X):
# special handling for `scipy.sparse.sparray`
...
else:
# backward compat for `scipy.sparse.spmatrix`
... |
Once scipy starts issuing deprecation warnings, we might need a way to allow the users to silence them though. Maybe an estimator with Note that for |
There has been discussion of removing some of the
👍 this is the approach I've been taking with a number of deprecations. It's a little annoying due to the number of warnings and downstream authors have to change code twice, but I haven't been able to come up with something as safe with less action required. One concern here would be if you'd like to return other kinds of sparse array, like sparse |
Ideally, if those implement the Array API (which is not the case at the moment), this might be possible for estimators that accept Array API inputs. For estimators that work with Cython code and directly access the underlying CSR/CSC memory buffers via pointers then there is no solution unless we provide an arguably less efficient, pure Array API code path for such inputs.
Let's do this in the first PR where we actually need it. |
I found this issue and want to either inject some new questions into these comments, or I will open a new issue if people prefer that. I will follow this big-arc strategy post with a specific post asking about return values for features that return sparse. SciPy current state and plans for sparse Conversion steps: (tracker)
|
This comment is about benchmarks which use sparse. When we switch the benchmarks from using spmatrix to sparray, the timing difference should take that into consideration when viewing the benchmark results. I suggest we do it all at once. But you may have other ideas. We can do this now -- independently of the conversion of internal use to sparray. We currently test for both sparray and spmatrix. Presumably sparse input of either type will work for any benchmark. Is there any reason to wait before switching the benchmarks to sparray? |
Similarly, docs and examples that use spmatrix should all "just work" with sparray inputs. So we could change the docs and examples. Is there any reason to wait for switching the docs and examples to sparray? |
Questions about how to handle migration from spmatrix to sparray when returning sparse:
Questions about specific functions:
I've looked through the other functions and think we can make the others work either with existing keyword args, or they just don't return sparse. So there might be ~30 functions affected between the above comments. I propose that:
Then I'll need feedback on the specific functions with Questions above. Thanks! |
@dschult Thanks for your effort. My personal take at the moment is roughly:
My guess is that for most cases where scikit-learn create sparse data, it doesn’t matter to users which one (array or matrix). |
Users will need to adapt their code for Given the sparse surface area is quite big, I'm considering having a global: sklearn.set_config(sparse_format="spmatrix")
sklearn.set_config(sparse_format="sparray") where Implementation question: Is the conversion between a |
Yes, the conversion between spmatrix and sparray is zero copy (unless you use the kwarg So, I think you are suggesting that we could use sparray or spmatrix internally and convert to That could mean if they input spmatrix while config is set to sparray, they might input spmatrix and get back sparray. (which seems fine so long as the config setting is widely known and understood). Do I understand correctly? In terms of implementation, we could make a PR that implements the config option and an idiom for performing the conversion just before storing/returning any sparse result. Should an idiom like this get put into a helper function, e.g. I would then want to make a PR to convert the internal creation of sparse to use the sparray pattern. This would set up sklearn to require minimal changes as the rest of the migration to sparray occurs. hopefully only changes to the idiom, and in the very distant future, removal of the call to the helper. Here's a proposed idiom, just to make sure I am understanding how the config idea would work: if sparse.issparse(X):
sparse_format = get_config()["sparse_format"]
if not isinstance(X, sparse_format):
klass = X.format + '_' + sparse_format[2:]
X = sparse.getattr(klass)(X) |
Yes, that is correct. We have the same semantics for
I'll say a new As for implementation, I'm thinking of having a single function that changes the format based on the global config: def _as_sparse(X_sparse):
# check that X_sparse is sparse (if not raise an error)
...
# assume there are only two sparse formats
global_is_sparray = get_config()["sparse_format"] == "sparray"
X_is_sparray = _is_sparray(X_sparse)
if global_is_sparray:
if X_is_sparray:
return X_sparse
return _convert_from_spmatrix_to_sparray(X_sparse)
else: # global is spmatrix
if not X_is_sparray:
return X_sparse
return _convert_from_sparray_to_spmatrix(X_sparse) And then call it everywhere we expose a sparse object in the public API. |
@thomasjpfan I would very much like to avoid to introduce a global config option for sparse array/matrix. I would rather consider moving parallel to deprecation in scipy and make the behaviour dependent on the scipy version, e.g. output sparse array as soon as scipy deprecates it. Could we list entries in the public API that are affected? |
I would also like to avoid introducing a global config for that. Can't we just add an entry in the changelog of some future release that all sparse outputs will be sparse arrays 2 releases later (or longer if we think it's needed) ? |
Let's say we deprecate sparse matrix now and in 3 versions we switch to sparse arrays everywhere. What can a user do now to prepare for the transition while scikit-learn is still returning sparse matrices? I would want a way to configure scikit-learn to output sparse arrays and make sure my code works. |
Fair point. My concern is that
can represent a ton of places to adapt and test. Maybe this is a motivation for scikit-learn 2 ? |
I don't think our behavior should be dependent on user's installed scipy version. It would be really hard to pinpoint the problem if user has not changed scikit-learn version but found that the behavior changed. I also think that global config is not the most desirable option here. For me the most intuitive way is that However there are indeed functions/methods that do not take sparse input but produces sparse output. If there are not too many of them, probably its feasible to add a temporary per-function/method parameter that controls the behavior and follow the normal deprecation cycle. Otherwise probably we would have to have a global config (but IMO it should not affect functions/methods that does take sparse input). |
There are 2 different use cases to consider:
|
It's probably good to keep in mind that users who avoid I'm seeing two approaches proposed in these comments and they both have concrete example implementations (for only part of the library) in existing PRs. Take a look if you think a concrete example is helpful for your decision making.
The places in the code that this affects are laid out in an above comment. To summarize:
I hope this helps focus the discussion on whether to use a library wide parameter or a function-by-function approach. Thanks! |
In the dev meeting 28 April 2025, we decided to go with a global config. Which options it has and further details must be worked out. |
Context
SciPy is now favoring sparse arrays (i.e.$n$ -dimensional data-structures since SciPy 1.8 (see scipy/scipy#14822).
scipy.sparse.sparray
and its subclasses) over sparse matrices (i.e.scipy.sparse.spmatrix
and its subclasses) to enlarge the scope of matrices toSparse matrices now subclass their associated sparse arrays (see scipy/scipy#18440).
scikit-learn has been supporting sparse matrices but now also needs to support SciPy sparse arrays.
Proposed solutions
Ordered by preference:
scipy.sparse.issparse
and theformat
attribute everywhere and not useisinstance
at allisinstance
on private compatibility class defined to bescipy.sparse.spmatrix
orscipy.sparse.sparray
conditionally on SciPy's version (i.e. usescipy.sparse.sparray
if available)cc @ivirshup
The text was updated successfully, but these errors were encountered: