RFC Supporting `scipy.sparse.sparray` #26418

jjerphan · 2023-05-22T23:06:08Z

Context

SciPy is now favoring sparse arrays (i.e. scipy.sparse.sparray and its subclasses) over sparse matrices (i.e. scipy.sparse.spmatrix and its subclasses) to enlarge the scope of matrices to $n$-dimensional data-structures since SciPy 1.8 (see scipy/scipy#14822).

Sparse matrices now subclass their associated sparse arrays (see scipy/scipy#18440).

scikit-learn has been supporting sparse matrices but now also needs to support SciPy sparse arrays.

Proposed solutions

Ordered by preference:

Use scipy.sparse.issparse and the format attribute everywhere and not use isinstance at all
Use isinstance on private compatibility class defined to be scipy.sparse.spmatrix or scipy.sparse.sparray conditionally on SciPy's version (i.e. use scipy.sparse.sparray if available)
Rely on duck-typing or the class name to check for sparse arrays

cc @ivirshup

The text was updated successfully, but these errors were encountered:

thomasjpfan · 2023-05-23T00:25:37Z

In general, I'm +1 with supporting scipy.sparse.sparray. Although, we likely need to support both spmatrix and sparray for a while. To support both, I suspect we need to be careful about some of the behavior differences between spmatrix and sparray.

jjerphan · 2023-05-23T23:43:57Z

#26420 has been opened as a small preliminary.

lorentzenchr · 2023-07-08T17:36:32Z

Also +1 for supporting scipy.sparse.sparray.
The real question for me is what to do with the old sparse matrices:

Follow scipy. (What‘s their plan? Deprecation, removal, continued support?)
Deprecate, maybe longer depr cycle than usual.
Continued support

I‘m leaning towards 2 or 1.

jjerphan · 2023-07-08T20:26:05Z

SciPy is going to deprecate matrices and will ultimately remove them.

jjerphan · 2023-08-18T10:16:47Z

#27090 has been opened to update tests accordingly.

ivirshup · 2023-08-18T12:27:30Z

What should we do about things that return sparse classes?

After brief discussion with @jjerphan and @ogrisel it sounds like the desired behavior is to pass through the array classes as much as possible. E.g. we should try and have f(X: *_array, ...) -> *_array and f(X: *_matrix, ...) -> *_matrix where applicable.

jjerphan · 2023-08-18T12:33:01Z

I also think we should preserve the type of inputs.

I think that functions returning sparse matrices which do not have sparse matrices or array are arguments can for now continue to return sparse matrices until the version of SciPy who will officially deprecate sparse matrices is installed (in which case, sparse arrays would be returned instead).

ogrisel · 2023-08-22T08:23:13Z

Use scipy.sparse.issparse and the format attribute everywhere and not use isinstance at all

I would be in favor of trying to do this as much as possible, but there might be some cases where the two have different behaviors and an explicit testing for the type might be necessary. I wonder what would be a concise way to do this to keep backward compat with scipy 1.8 though...

Maybe something like:

import scipy.sparse as sp


if sp.issparse(X):
    if hasattr(sp, "isspmatrix") and not sp.isspmatrix(X):
        # special handling for `scipy.sparse.sparray`
        ...
    else:
        # backward compat for `scipy.sparse.spmatrix`
        ...

ogrisel · 2023-08-22T08:42:34Z

I think that functions returning sparse matrices which do not have sparse matrices or array are arguments can for now continue to return sparse matrices until the version of SciPy who will officially deprecate sparse matrices is installed (in which case, sparse arrays would be returned instead).

Once scipy starts issuing deprecation warnings, we might need a way to allow the users to silence them though.

Maybe an estimator with sparse_output=True could start to issue a FutureWarning to state that in two versions, scikit-learn will return scipy.sparse.sparray instances instead of scipy.sparse.spmatrix instances. To silence the warning we could introduce a new temporary constructor parameter sparse_output_container="sparray" so has to make it possible to handle the transition.

Note that for ColumnTransformer, there is not sparse_output boolean parameter but instead it uses sparse_threshold=0.3. Introducing a new parameter sparse_output_container="sparray" is probably the only to respect our rolling deprecation strategy for backward compat.

ivirshup · 2023-08-22T09:39:35Z

Maybe something like:

There has been discussion of removing some of the isspmatrix_*/ isspmatrix functions and replacing them with instance checks. E.g.: isinstance(x, sp.spmatrix). However, there will be some releases of scipy where issubclass(sp.csr_array, spmatrix) is True... I think this could be worth a helper function which may need to take into account the version of scipy.

To silence the warning we could introduce a new temporary constructor parameter sparse_output_container="sparray" so has to make it possible to handle the transition.

👍 this is the approach I've been taking with a number of deprecations. It's a little annoying due to the number of warnings and downstream authors have to change code twice, but I haven't been able to come up with something as safe with less action required.

One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?

ogrisel · 2023-08-24T08:13:12Z

One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?

Ideally, if those implement the Array API (which is not the case at the moment), this might be possible for estimators that accept Array API inputs. For estimators that work with Cython code and directly access the underlying CSR/CSC memory buffers via pointers then there is no solution unless we provide an arguably less efficient, pure Array API code path for such inputs.

I think this could be worth a helper function which may need to take into account the version of scipy.

Let's do this in the first PR where we actually need it.

dschult · 2025-03-16T19:45:35Z

I found this issue and want to either inject some new questions into these comments, or I will open a new issue if people prefer that. I will follow this big-arc strategy post with a specific post asking about return values for features that return sparse.

SciPy current state and plans for sparse
SciPy now has a document SciPy migration from spmatrix to sparray and I am working to help implement that process for scikit-learn. The conversion to sparray within SciPy has been completed. The documentation has been warning new code creators to use sparray for a few releases now. We are now in a phase of waiting a release or two before officially deprecating spmatrix (by that I mean making deprecation warnings appear). The final timing has not been universally agreed on and part of that depends on adoption rates among heavy users like sklearn. I am dedicating time toward this goal. Hopefully it is helpful for sklearn too. As of v1.15 SciPy sparse has very rudimentary nD sparray support, and I expect it to grow quickly. Broadcasting is another feature that sparray will be developing soon. These are good features for which conversion to sparray will be necessary.

Conversion steps: (tracker)

(PR TST Extend tests for scipy.sparse.*array #27090) test both sparray and spmatrix. This requires accepting sparray inputs and spmatrix inputs equally
(PR Update sparse spmatrix syntax to support both sparray and spmatrix #30858) convert all internal code so it works with either sparray or spmatrix This differs from 1. in that spmatrices created inside sklearn do not get tested in 1. This changes the syntax of manipulations of internally created spmatrix to work for sparray or spmatrix. It does not change outputs.
(PR coming) determine whether to return spmatrix to sparray We all agree that if there is a sparse input the sparse output should use the same type as the sparse input. But functions that don't have sparse input yet return sparse output will need to have a way to choose the type of the output. (discussed somewhat above).
- Some libraries (like pyamg) are making a hard cut at a specific version, (advantage: one change for users). Users only need to wrap the output e.g. sp.csr_matrix(...) if they want spmatrix.
- Some libraries are adding kwargs to indicate the return type (two user changes -- first when kwarg is introduced or default value is changed -- second when kwarg is deprecated eventually).
- Some libraries create two function names: one for sparrays and one for spmatrices (one change if the sparray name is the long term better name, two changes if the sparray func name will be changed to a canonical name later).
(PR coming) ensure index arrays are 32bit for codepaths that use 32bit libraries sparrays set index dtype by type of input instead of values of input (spmatrix use values). This is a tricky part of the conversion so I thought it might help to separate it from larger more straight forward changes.
(PR coming) change internal usage from spmatrix to sparray (this should be a long straightfoward PR with only the name change and no other logic/code.
Feedback on this process is welcome. If this is better in a new Issue please let me know.

dschult · 2025-03-16T19:52:28Z

This comment is about benchmarks which use sparse. When we switch the benchmarks from using spmatrix to sparray, the timing difference should take that into consideration when viewing the benchmark results. I suggest we do it all at once. But you may have other ideas.

We can do this now -- independently of the conversion of internal use to sparray. We currently test for both sparray and spmatrix. Presumably sparse input of either type will work for any benchmark.

Is there any reason to wait before switching the benchmarks to sparray?

dschult · 2025-03-16T19:57:15Z

Similarly, docs and examples that use spmatrix should all "just work" with sparray inputs. So we could change the docs and examples.

Is there any reason to wait for switching the docs and examples to sparray?

dschult · 2025-03-17T00:57:23Z

Questions about how to handle migration from spmatrix to sparray when returning sparse:

Many "feature" modules have classes that can hold either sparse matrix or sparse array. Often in transform or fit a sparse is created to be held or returned. If there is a sparse input for X and maybe even y we can match that sparse type. But what should we do when a dense is converted to sparse as part of fit() or transform()? I can get examples, or you can look in text.py _hash.py, image.py, etc. in the feature_extraction subpackage.
About 12 functions/classes return sparse but have kwarg values like sparse, sparse_input sparse_output. Can we use those to determine what form of output? They are now bool. How to indicate spmatrix vs sparray?
linear_model.coordinate_descent (and others) uses sparse_coef property or attribute. How to decide spmatrix/sparray? Make two properties?

Questions about specific functions:

datasets.load_svmlight_files -> output is sparse. Does it matter which type?
dataset.fetch_rcv1 converts to csr_array for lexicographic order. Would a user need to be spmatrix?
barycenter_kneighbors_graph -> returns newly created sparse object. how to handle?
classification._check_targets -> converts ndarray to sparse. Existing comment lines question why we convert an ndarray to sparse just before returning. I agree. It is already using the memory. Should we switch to ndarray? How to handle spmatrix/sparray in the meantime?

I've looked through the other functions and think we can make the others work either with existing keyword args, or they just don't return sparse. So there might be ~30 functions affected between the above comments.

I propose that:

for fit() we should allow ourselves to keep whichever sparse format we like. User shouldn't care.
for transform we look at type of X. If X is dense and converted to sparse, can we just shift to return sparray? Or is a kwarg needed?
Any functions with existing bool inputs regarding "sparse" should use that kwarg to indicate sparray or spmatrix too. Maybe bool or string? Ideas?
sparse_coef property can either always now be sparray. Or we make two properties like sparse_array_coef?

Then I'll need feedback on the specific functions with Questions above.
If you'd rather see this in PR form before committing one way or the other let me know and I'll put that together.

Thanks!

lorentzenchr · 2025-04-04T10:58:01Z

@dschult Thanks for your effort.
@scikit-learn/core-devs This needs awareness, opinions and decisions.

My personal take at the moment is roughly:

Decide on a hard cut as of which version we only ever create sparse arrays.
Keep support of sparse matrix input as long as the oldest supported scipy still has it.

My guess is that for most cases where scikit-learn create sparse data, it doesn’t matter to users which one (array or matrix).

thomasjpfan · 2025-04-05T01:14:35Z

it doesn’t matter to users which one (array or matrix).

Users will need to adapt their code for sparray output because of the different semantics of the * operator. (sparray.__mul__ is element-wise multiplication while spmatrix.__mul__ is matrix multiplication)

Given the sparse surface area is quite big, I'm considering having a global:

sklearn.set_config(sparse_format="spmatrix")

sklearn.set_config(sparse_format="sparray")

where sparray becomes the default when our oldest supported SciPy drops spmatrix. When sparse_format="sparray" everything that scikit-learn produces that is sparse will be a sparray. i.e function returns, estimator attributes, estimator properties, etc.

Implementation question: Is the conversion between a spmatrix and 2D sparray zero copy?

dschult · 2025-04-05T11:51:44Z

Yes, the conversion between spmatrix and sparray is zero copy (unless you use the kwarg copy=True).

So, I think you are suggesting that we could use sparray or spmatrix internally and convert to get_config()["sparse_format"] at the end of any function that returns sparse. And similarly, estimator attributes that are sparse can be constructed internally using either sparry or spmatrix and converted when assigned to the estimator.

That could mean if they input spmatrix while config is set to sparray, they might input spmatrix and get back sparray. (which seems fine so long as the config setting is widely known and understood). Do I understand correctly?

In terms of implementation, we could make a PR that implements the config option and an idiom for performing the conversion just before storing/returning any sparse result. Should an idiom like this get put into a helper function, e.g. X = convert_sparse_to_sparse_format(X) with a better name)? Where would that helper function live?

I would then want to make a PR to convert the internal creation of sparse to use the sparray pattern. This would set up sklearn to require minimal changes as the rest of the migration to sparray occurs. hopefully only changes to the idiom, and in the very distant future, removal of the call to the helper.

Here's a proposed idiom, just to make sure I am understanding how the config idea would work:

    if sparse.issparse(X):
        sparse_format = get_config()["sparse_format"]
        if not isinstance(X, sparse_format):
            klass = X.format + '_' + sparse_format[2:]
            X = sparse.getattr(klass)(X)

thomasjpfan · 2025-04-05T13:23:01Z

That could mean if they input spmatrix while config is set to sparray, they might input spmatrix and get back sparray. (which seems fine so long as the config setting is widely known and understood). Do I understand correctly?

Yes, that is correct. We have the same semantics for set_config(transform_output="pandas"), where the output type is independent of the input type.

Where would that helper function live?

I'll say a new sklearn.utils._sparse module. We may need to make it a public module for third party estimator developers to have the same semantics as scikit-learn.

As for implementation, I'm thinking of having a single function that changes the format based on the global config:

def _as_sparse(X_sparse):
    # check that X_sparse is sparse (if not raise an error)
    ...

    # assume there are only two sparse formats
    global_is_sparray =  get_config()["sparse_format"] == "sparray"
    X_is_sparray = _is_sparray(X_sparse)

    if global_is_sparray:
        if X_is_sparray:
            return X_sparse
        return _convert_from_spmatrix_to_sparray(X_sparse)
    else:  # global is spmatrix
        if not X_is_sparray:
            return X_sparse
        return _convert_from_sparray_to_spmatrix(X_sparse)

And then call it everywhere we expose a sparse object in the public API.

lorentzenchr · 2025-04-11T13:24:57Z

@thomasjpfan I would very much like to avoid to introduce a global config option for sparse array/matrix. I would rather consider moving parallel to deprecation in scipy and make the behaviour dependent on the scipy version, e.g. output sparse array as soon as scipy deprecates it.

Could we list entries in the public API that are affected?

jeremiedbb · 2025-04-11T15:48:17Z

I would also like to avoid introducing a global config for that. Can't we just add an entry in the changelog of some future release that all sparse outputs will be sparse arrays 2 releases later (or longer if we think it's needed) ?
I think that it can even be done in 1.7 since our scipy min version supports sparse arrays.

thomasjpfan · 2025-04-11T16:53:28Z

Let's say we deprecate sparse matrix now and in 3 versions we switch to sparse arrays everywhere. What can a user do now to prepare for the transition while scikit-learn is still returning sparse matrices? I would want a way to configure scikit-learn to output sparse arrays and make sure my code works.

jeremiedbb · 2025-04-11T17:33:30Z

Fair point. My concern is that

When sparse_format="sparray" everything that scikit-learn produces that is sparse will be a sparray. i.e function returns, estimator attributes, estimator properties, etc.

can represent a ton of places to adapt and test.

Maybe this is a motivation for scikit-learn 2 ?

Charlie-XIAO · 2025-04-11T17:34:21Z

I don't think our behavior should be dependent on user's installed scipy version. It would be really hard to pinpoint the problem if user has not changed scikit-learn version but found that the behavior changed.

I also think that global config is not the most desirable option here. For me the most intuitive way is that f(X: *_array, ...) -> *_array and f(X: *_matrix, ...) -> *_matrix as mentioned in #26418 (comment) so that scikit-learn automatically reacts to user's migration at fine granularity. Suppose that there is a global configuration, it would be inconvenient if user only wants to migrate part of the code to use sparray (unless with some trick), and again it seems a bit counter-intuitive to me if I input spmatrix but get back sparray or vice versa, even if I'm aware of that global config.

However there are indeed functions/methods that do not take sparse input but produces sparse output. If there are not too many of them, probably its feasible to add a temporary per-function/method parameter that controls the behavior and follow the normal deprecation cycle. Otherwise probably we would have to have a global config (but IMO it should not affect functions/methods that does take sparse input).

lorentzenchr · 2025-04-11T22:21:28Z

There are 2 different use cases to consider:

transform(X) and predict(X) return a sparse object, sometimes triggered by the estimator parameter sparse_output.
Attributes of estimators like lasso_model.sparse_coef_.

dschult · 2025-04-14T04:50:54Z

It's probably good to keep in mind that users who avoid * and instead use @ and .multiply() will likely not care which sparse interface they get.

I'm seeing two approaches proposed in these comments and they both have concrete example implementations (for only part of the library) in existing PRs. Take a look if you think a concrete example is helpful for your decision making.

a global config sparse_interface parameter to use library wide:
Enable config setting sparse_interface to control sparray and spmatrix creation #31177
a per-function-call approach using an input to indicate return type:
First steps toward sparray migration pass 2 #31072

The places in the code that this affects are laid out in an above comment. To summarize:

there are common places like .fit() and .transform() methods
some functions can return sparse based on a bool keyword that could be made bool or string.
there is the sparse_coef property
there are four other "misfit functions", listed in that linked comment above, with no existing way to indicate the sparse interface.

I hope this helps focus the discussion on whether to use a library wide parameter or a function-by-function approach. Thanks!

lorentzenchr · 2025-04-28T14:59:17Z

In the dev meeting 28 April 2025, we decided to go with a global config. Which options it has and further details must be worked out.

github-actions bot added the Needs Triage Issue requires triage label May 22, 2023

thomasjpfan added New Feature RFC Needs Decision - API and removed Needs Triage Issue requires triage labels May 23, 2023

jjerphan changed the title ~~MAINT Supporting scipy.sparse.sparray transition~~ RFC Supporting scipy.sparse.sparray transition May 23, 2023

jjerphan mentioned this issue May 23, 2023

MAINT Use scipy.sparse.isspmatrix_* #26420

Merged

jjerphan changed the title ~~RFC Supporting scipy.sparse.sparray transition~~ RFC Supporting scipy.sparse.sparray May 27, 2023

jjerphan mentioned this issue Jun 23, 2023

⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev ⚠️ #26154

Closed

jjerphan mentioned this issue Jul 2, 2023

MAINT Remove soon-to-be deprecated scipy.sparse functions #26751

Merged

jjerphan mentioned this issue Aug 18, 2023

TST Extend tests for scipy.sparse.*array #27090

Closed

This was referenced Aug 28, 2023

TST Extend tests for scipy.sparse.*array in sklearn/cluster/tests/test_spectral.py #27161

Merged

ENH Adds polars output support to ColumnTransformer #26683

Merged

lesteve mentioned this issue Nov 10, 2023

FIX improve warning message in _ensure_sparse_format #27757

Merged

lorentzenchr added API Needs Decision Requires decision and removed Needs Decision - API labels Mar 14, 2024

dschult mentioned this issue Mar 26, 2025

First steps toward sparray migration pass 2 #31072

Open

dschult mentioned this issue Apr 11, 2025

Enable config setting sparse_interface to control sparray and spmatrix creation #31177

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC Supporting `scipy.sparse.sparray` #26418

RFC Supporting `scipy.sparse.sparray` #26418

jjerphan commented May 22, 2023 •

edited

Loading

thomasjpfan commented May 23, 2023

jjerphan commented May 23, 2023

lorentzenchr commented Jul 8, 2023

jjerphan commented Jul 8, 2023

jjerphan commented Aug 18, 2023

ivirshup commented Aug 18, 2023

jjerphan commented Aug 18, 2023

ogrisel commented Aug 22, 2023 •

edited

Loading

ogrisel commented Aug 22, 2023

ivirshup commented Aug 22, 2023

ogrisel commented Aug 24, 2023 •

edited

Loading

dschult commented Mar 16, 2025 •

edited

Loading

dschult commented Mar 16, 2025

dschult commented Mar 16, 2025

dschult commented Mar 17, 2025 •

edited

Loading

lorentzenchr commented Apr 4, 2025

thomasjpfan commented Apr 5, 2025

dschult commented Apr 5, 2025

thomasjpfan commented Apr 5, 2025

lorentzenchr commented Apr 11, 2025

jeremiedbb commented Apr 11, 2025

thomasjpfan commented Apr 11, 2025 •

edited

Loading

jeremiedbb commented Apr 11, 2025

Charlie-XIAO commented Apr 11, 2025

lorentzenchr commented Apr 11, 2025 •

edited

Loading

dschult commented Apr 14, 2025

lorentzenchr commented Apr 28, 2025

RFC Supporting scipy.sparse.sparray #26418

RFC Supporting scipy.sparse.sparray #26418

Comments

jjerphan commented May 22, 2023 • edited Loading

Context

Proposed solutions

thomasjpfan commented May 23, 2023

jjerphan commented May 23, 2023

lorentzenchr commented Jul 8, 2023

jjerphan commented Jul 8, 2023

jjerphan commented Aug 18, 2023

ivirshup commented Aug 18, 2023

jjerphan commented Aug 18, 2023

ogrisel commented Aug 22, 2023 • edited Loading

ogrisel commented Aug 22, 2023

ivirshup commented Aug 22, 2023

ogrisel commented Aug 24, 2023 • edited Loading

dschult commented Mar 16, 2025 • edited Loading

dschult commented Mar 16, 2025

dschult commented Mar 16, 2025

dschult commented Mar 17, 2025 • edited Loading

lorentzenchr commented Apr 4, 2025

thomasjpfan commented Apr 5, 2025

dschult commented Apr 5, 2025

thomasjpfan commented Apr 5, 2025

lorentzenchr commented Apr 11, 2025

jeremiedbb commented Apr 11, 2025

thomasjpfan commented Apr 11, 2025 • edited Loading

jeremiedbb commented Apr 11, 2025

Charlie-XIAO commented Apr 11, 2025

lorentzenchr commented Apr 11, 2025 • edited Loading

dschult commented Apr 14, 2025

lorentzenchr commented Apr 28, 2025

RFC Supporting `scipy.sparse.sparray` #26418

RFC Supporting `scipy.sparse.sparray` #26418

jjerphan commented May 22, 2023 •

edited

Loading

ogrisel commented Aug 22, 2023 •

edited

Loading

ogrisel commented Aug 24, 2023 •

edited

Loading

dschult commented Mar 16, 2025 •

edited

Loading

dschult commented Mar 17, 2025 •

edited

Loading

thomasjpfan commented Apr 11, 2025 •

edited

Loading

lorentzenchr commented Apr 11, 2025 •

edited

Loading