RFC Supporting scipy.sparse.sparray #26418


Open
jjerphan opened this issue May 22, 2023 · 27 comments

@jjerphan
Member

jjerphan commented May 22, 2023

Context

Since SciPy 1.8, SciPy has been favoring sparse arrays (i.e. scipy.sparse.sparray and its subclasses) over sparse matrices (i.e. scipy.sparse.spmatrix and its subclasses), enlarging the scope from matrices to $n$-dimensional data structures (see scipy/scipy#14822).

Sparse matrices now subclass their associated sparse arrays (see scipy/scipy#18440).

scikit-learn has been supporting sparse matrices but now also needs to support SciPy sparse arrays.

Proposed solutions

Ordered by preference:

  • Use scipy.sparse.issparse and the format attribute everywhere, and do not use isinstance at all
  • Use isinstance on a private compatibility class defined as scipy.sparse.spmatrix or scipy.sparse.sparray depending on the installed SciPy version (i.e. use scipy.sparse.sparray if available)
  • Rely on duck-typing or the class name to check for sparse arrays
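A minimal sketch of the first option, assuming a SciPy recent enough to ship both container families; the function name is illustrative:

```python
import numpy as np
import scipy.sparse as sp

def rows_of(X):
    # Dispatch on sp.issparse and the `format` attribute only:
    # spmatrix and sparray inputs take exactly the same path,
    # with no isinstance check anywhere.
    if sp.issparse(X):
        if X.format != "csr":
            X = X.tocsr()
        return X.shape[0]
    return np.asarray(X).shape[0]
```

Because both spmatrix and sparray pass sp.issparse and expose a format attribute, this style needs no version gating.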

cc @ivirshup

@github-actions github-actions bot added the Needs Triage Issue requires triage label May 22, 2023
@thomasjpfan
Member

In general, I'm +1 on supporting scipy.sparse.sparray, although we likely need to support both spmatrix and sparray for a while. To support both, I suspect we need to be careful about some of the behavior differences between spmatrix and sparray.
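One such difference, as a quick sketch (assuming a SciPy that ships both containers): reductions on a spmatrix return np.matrix and keep two dimensions, while the same reductions on a sparray return a plain 1-D ndarray.

```python
import numpy as np
import scipy.sparse as sp

M = sp.csr_matrix(np.eye(2))
A = sp.csr_array(np.eye(2))

# spmatrix reductions return np.matrix and keep two dimensions ...
assert isinstance(M.sum(axis=0), np.matrix)
assert M.sum(axis=0).shape == (1, 2)

# ... while sparray reductions return a plain 1-D ndarray.
assert not isinstance(A.sum(axis=0), np.matrix)
assert A.sum(axis=0).shape == (2,)
```

Code that relies on the (1, n) shape of a spmatrix reduction will need adapting.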

@jjerphan jjerphan changed the title MAINT Supporting scipy.sparse.sparray transition RFC Supporting scipy.sparse.sparray transition May 23, 2023
@jjerphan
Member Author

#26420 has been opened as a small preliminary.

@jjerphan jjerphan changed the title RFC Supporting scipy.sparse.sparray transition RFC Supporting scipy.sparse.sparray May 27, 2023
@lorentzenchr
Member

Also +1 for supporting scipy.sparse.sparray.
The real question for me is what to do with the old sparse matrices:

  1. Follow scipy. (What‘s their plan? Deprecation, removal, continued support?)
  2. Deprecate, maybe longer depr cycle than usual.
  3. Continued support

I‘m leaning towards 2 or 1.

@jjerphan
Member Author

jjerphan commented Jul 8, 2023

SciPy is going to deprecate matrices and will ultimately remove them.

@jjerphan
Member Author

#27090 has been opened to update tests accordingly.

@ivirshup
Contributor

What should we do about things that return sparse classes?

After brief discussion with @jjerphan and @ogrisel it sounds like the desired behavior is to pass through the array classes as much as possible. E.g. we should try and have f(X: *_array, ...) -> *_array and f(X: *_matrix, ...) -> *_matrix where applicable.

@jjerphan
Member Author

I also think we should preserve the type of inputs.

I think that functions which return sparse matrices but do not take sparse matrices or arrays as arguments can, for now, continue to return sparse matrices until the installed version of SciPy officially deprecates sparse matrices (in which case, sparse arrays would be returned instead).

@ogrisel
Member

ogrisel commented Aug 22, 2023

Use scipy.sparse.issparse and the format attribute everywhere and not use isinstance at all

I would be in favor of trying to do this as much as possible, but there might be some cases where the two have different behaviors and an explicit check of the type might be necessary. I wonder what would be a concise way to do this while keeping backward compat with scipy 1.8, though...

Maybe something like:

import scipy.sparse as sp


if sp.issparse(X):
    if hasattr(sp, "isspmatrix") and not sp.isspmatrix(X):
        # special handling for `scipy.sparse.sparray`
        ...
    else:
        # backward compat for `scipy.sparse.spmatrix`
        ...

@ogrisel
Member

ogrisel commented Aug 22, 2023

I think that functions which return sparse matrices but do not take sparse matrices or arrays as arguments can, for now, continue to return sparse matrices until the installed version of SciPy officially deprecates sparse matrices (in which case, sparse arrays would be returned instead).

Once scipy starts issuing deprecation warnings, we might need a way to allow the users to silence them though.

Maybe an estimator with sparse_output=True could start to issue a FutureWarning to state that in two versions, scikit-learn will return scipy.sparse.sparray instances instead of scipy.sparse.spmatrix instances. To silence the warning, we could introduce a new temporary constructor parameter sparse_output_container="sparray" so as to make it possible to handle the transition.

Note that for ColumnTransformer, there is no sparse_output boolean parameter; instead it uses sparse_threshold=0.3. Introducing a new parameter sparse_output_container="sparray" is probably the only way to respect our rolling deprecation strategy for backward compat.

@ivirshup
Contributor

Maybe something like:

There has been discussion of removing some of the isspmatrix_*/isspmatrix functions and replacing them with instance checks, e.g. isinstance(x, sp.spmatrix). However, there will be some releases of scipy where issubclass(sp.csr_array, sp.spmatrix) is True... I think this could be worth a helper function, which may need to take into account the version of scipy.

To silence the warning we could introduce a new temporary constructor parameter sparse_output_container="sparray" so has to make it possible to handle the transition.

👍 this is the approach I've been taking with a number of deprecations. It's a little annoying due to the number of warnings, and downstream authors have to change code twice, but I haven't been able to come up with something as safe that requires less action.

One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?

@ogrisel
Member

ogrisel commented Aug 24, 2023

One concern here would be if you'd like to return other kinds of sparse array, like sparse pytorch.Tensor, jax.experimental.sparse, or cupyx.scipy.sparse matrices. Do you think this is something scikit-learn may want to support, and is the string enum argument enough to support this?

Ideally, if those implement the Array API (which is not the case at the moment), this might be possible for estimators that accept Array API inputs. For estimators that work with Cython code and directly access the underlying CSR/CSC memory buffers via pointers then there is no solution unless we provide an arguably less efficient, pure Array API code path for such inputs.

I think this could be worth a helper function which may need to take into account the version of scipy.

Let's do this in the first PR where we actually need it.

@dschult
Contributor

dschult commented Mar 16, 2025

I found this issue and want to either inject some new questions into these comments, or I will open a new issue if people prefer that. I will follow this big-arc strategy post with a specific post asking about return values for features that return sparse.

SciPy current state and plans for sparse
SciPy now has a document, SciPy migration from spmatrix to sparray, and I am working to help implement that process for scikit-learn. The conversion to sparray within SciPy has been completed, and the documentation has been warning new code creators to use sparray for a few releases now.

We are now in a phase of waiting a release or two before officially deprecating spmatrix (by that I mean making deprecation warnings appear). The final timing has not been universally agreed on, and part of that depends on adoption rates among heavy users like sklearn. I am dedicating time toward this goal; hopefully it is helpful for sklearn too.

As of v1.15, SciPy sparse has very rudimentary nD sparray support, and I expect it to grow quickly. Broadcasting is another feature that sparray will be developing soon. These are good features for which conversion to sparray will be necessary.

Conversion steps: (tracker)

  1. (PR TST Extend tests for scipy.sparse.*array #27090) Test both sparray and spmatrix. This requires accepting sparray inputs and spmatrix inputs equally.
  2. (PR Update sparse spmatrix syntax to support both sparray and spmatrix #30858) Convert all internal code so it works with either sparray or spmatrix. This differs from 1. in that spmatrices created inside sklearn do not get tested in 1. It changes the syntax of manipulations of internally created spmatrix to work for sparray or spmatrix. It does not change outputs.
  3. (PR coming) Determine when to switch returns from spmatrix to sparray. We all agree that if there is a sparse input, the sparse output should use the same type as the sparse input. But functions that don't have sparse input yet return sparse output will need a way to choose the type of the output (discussed somewhat above).
    - Some libraries (like pyamg) are making a hard cut at a specific version (advantage: one change for users). Users only need to wrap the output, e.g. sp.csr_matrix(...), if they want spmatrix.
    - Some libraries are adding kwargs to indicate the return type (two user changes -- first when the kwarg is introduced or its default value is changed, second when the kwarg is eventually deprecated).
    - Some libraries create two function names: one for sparrays and one for spmatrices (one change if the sparray name is the long-term better name, two changes if the sparray func name will be changed to a canonical name later).
  4. (PR coming) Ensure index arrays are 32-bit for codepaths that use 32-bit libraries. sparrays set index dtype by the type of the input instead of the values of the input (spmatrix uses values). This is a tricky part of the conversion, so I thought it might help to separate it from larger, more straightforward changes.
  5. (PR coming) Change internal usage from spmatrix to sparray (this should be a long, straightforward PR with only the name change and no other logic/code changes).
    Feedback on this process is welcome. If this is better in a new Issue please let me know.

@dschult
Contributor

dschult commented Mar 16, 2025

This comment is about benchmarks that use sparse. When we switch the benchmarks from spmatrix to sparray, readers of the benchmark results should take the timing difference into account. I suggest we do it all at once, but you may have other ideas.

We can do this now -- independently of the conversion of internal use to sparray. We currently test for both sparray and spmatrix. Presumably sparse input of either type will work for any benchmark.

Is there any reason to wait before switching the benchmarks to sparray?

@dschult
Contributor

dschult commented Mar 16, 2025

Similarly, docs and examples that use spmatrix should all "just work" with sparray inputs. So we could change the docs and examples.

Is there any reason to wait for switching the docs and examples to sparray?

@dschult
Contributor

dschult commented Mar 17, 2025

Questions about how to handle migration from spmatrix to sparray when returning sparse:

  • Many "feature" modules have classes that can hold either sparse matrix or sparse array. Often in transform or fit a sparse is created to be held or returned. If there is a sparse input for X and maybe even y we can match that sparse type. But what should we do when a dense is converted to sparse as part of fit() or transform()? I can get examples, or you can look in text.py _hash.py, image.py, etc. in the feature_extraction subpackage.
  • About 12 functions/classes return sparse but have kwarg values like sparse, sparse_input sparse_output. Can we use those to determine what form of output? They are now bool. How to indicate spmatrix vs sparray?
  • linear_model.coordinate_descent (and others) uses sparse_coef property or attribute. How to decide spmatrix/sparray? Make two properties?

Questions about specific functions:

  • datasets.load_svmlight_files -> output is sparse. Does it matter which type?
  • datasets.fetch_rcv1 converts to csr_array for lexicographic order. Would a user need it to be spmatrix?
  • barycenter_kneighbors_graph -> returns a newly created sparse object. How should we handle it?
  • classification._check_targets -> converts an ndarray to sparse. Existing comment lines question why we convert an ndarray to sparse just before returning; I agree, since the memory is already in use. Should we switch to returning an ndarray? How do we handle spmatrix/sparray in the meantime?

I've looked through the other functions and think we can make the others work either with existing keyword args, or they just don't return sparse. So there might be ~30 functions affected between the above comments.

I propose that:

  • for fit() we should allow ourselves to keep whichever sparse format we like. The user shouldn't care.
  • for transform we look at the type of X. If X is dense and converted to sparse, can we just shift to returning sparray? Or is a kwarg needed?
  • Any functions with existing bool inputs regarding "sparse" should use that kwarg to indicate sparray or spmatrix too. Maybe bool or string? Ideas?
  • The sparse_coef property could either always be sparray from now on, or we could make two properties like sparse_array_coef.

Then I'll need feedback on the specific functions with Questions above.
If you'd rather see this in PR form before committing one way or the other let me know and I'll put that together.

Thanks!

@lorentzenchr
Member

@dschult Thanks for your effort.
@scikit-learn/core-devs This needs awareness, opinions and decisions.

My personal take at the moment is roughly:

  • Decide on a hard cut as of which version we only ever create sparse arrays.
  • Keep support of sparse matrix input as long as the oldest supported scipy still has it.

My guess is that for most cases where scikit-learn creates sparse data, it doesn't matter to users which one they get (array or matrix).

@thomasjpfan
Member

it doesn’t matter to users which one (array or matrix).

Users will need to adapt their code for sparray output because of the different semantics of the * operator. (sparray.__mul__ is element-wise multiplication while spmatrix.__mul__ is matrix multiplication)
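A quick illustration of that semantic difference, assuming a SciPy that ships both containers:

```python
import numpy as np
import scipy.sparse as sp

D = np.array([[1.0, 2.0], [3.0, 4.0]])
A = sp.csr_array(D)
M = sp.csr_matrix(D)

# sparray: * is elementwise (like ndarray); use @ for matmul.
assert np.allclose((A * A).toarray(), [[1.0, 4.0], [9.0, 16.0]])
assert np.allclose((A @ A).toarray(), [[7.0, 10.0], [15.0, 22.0]])

# spmatrix: * is matrix multiplication.
assert np.allclose((M * M).toarray(), [[7.0, 10.0], [15.0, 22.0]])
```

The same expression `X * X` silently changes meaning depending on the container, which is why the output type matters to users.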


Given the sparse surface area is quite big, I'm considering having a global:

sklearn.set_config(sparse_format="spmatrix")

sklearn.set_config(sparse_format="sparray")

where sparray becomes the default when our oldest supported SciPy drops spmatrix. When sparse_format="sparray" everything that scikit-learn produces that is sparse will be a sparray. i.e function returns, estimator attributes, estimator properties, etc.
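A stdlib-only sketch of what such a global switch could look like; the option name sparse_format and the "spmatrix" default are hypothetical stand-ins for whatever the real sklearn.set_config/get_config would use:

```python
from contextlib import contextmanager

# Hypothetical default during the transition.
_config = {"sparse_format": "spmatrix"}

def set_config(**params):
    _config.update(params)

def get_config():
    return dict(_config)

@contextmanager
def config_context(**params):
    # Temporary override, mirroring how sklearn.config_context
    # behaves for existing options.
    old = get_config()
    set_config(**params)
    try:
        yield
    finally:
        _config.clear()
        _config.update(old)
```

Estimators and functions would consult get_config()["sparse_format"] at every point where they expose a sparse object.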


Implementation question: Is the conversion between a spmatrix and 2D sparray zero copy?

@dschult
Contributor

dschult commented Apr 5, 2025

Yes, the conversion between spmatrix and sparray is zero copy (unless you use the kwarg copy=True).

So, I think you are suggesting that we could use sparray or spmatrix internally and convert to get_config()["sparse_format"] at the end of any function that returns sparse. And similarly, estimator attributes that are sparse can be constructed internally using either sparray or spmatrix and converted when assigned to the estimator.

That could mean that if the config is set to sparray, a user might input spmatrix and get back sparray (which seems fine so long as the config setting is widely known and understood). Do I understand correctly?

In terms of implementation, we could make a PR that implements the config option and an idiom for performing the conversion just before storing/returning any sparse result. Should an idiom like this be put into a helper function, e.g. X = convert_sparse_to_sparse_format(X) (with a better name)? Where would that helper function live?

I would then want to make a PR to convert the internal creation of sparse objects to use the sparray pattern. This would set up sklearn to require minimal changes as the rest of the migration to sparray occurs: hopefully only changes to the idiom, and in the very distant future, removal of the call to the helper.

Here's a proposed idiom, just to make sure I am understanding how the config idea would work:

    if sparse.issparse(X):
        sparse_format = get_config()["sparse_format"]  # "sparray" or "spmatrix"
        X_is_array = isinstance(X, sparse.sparray)
        if (sparse_format == "sparray") != X_is_array:
            # e.g. "csr" + "_" + "array" names sparse.csr_array
            klass = X.format + "_" + sparse_format[2:]
            X = getattr(sparse, klass)(X)

@thomasjpfan
Member

That could mean that if the config is set to sparray, a user might input spmatrix and get back sparray (which seems fine so long as the config setting is widely known and understood). Do I understand correctly?

Yes, that is correct. We have the same semantics for set_config(transform_output="pandas"), where the output type is independent of the input type.

Where would that helper function live?

I'll say a new sklearn.utils._sparse module. We may need to make it a public module for third party estimator developers to have the same semantics as scikit-learn.

As for implementation, I'm thinking of having a single function that changes the format based on the global config:

def _as_sparse(X_sparse):
    # check that X_sparse is sparse (if not raise an error)
    ...

    # assume there are only two sparse formats
    global_is_sparray = get_config()["sparse_format"] == "sparray"
    X_is_sparray = _is_sparray(X_sparse)

    if global_is_sparray:
        if X_is_sparray:
            return X_sparse
        return _convert_from_spmatrix_to_sparray(X_sparse)
    else:  # global is spmatrix
        if not X_is_sparray:
            return X_sparse
        return _convert_from_sparray_to_spmatrix(X_sparse)

And then call it everywhere we expose a sparse object in the public API.

@lorentzenchr
Member

@thomasjpfan I would very much like to avoid introducing a global config option for sparse array/matrix. I would rather consider moving in parallel with the deprecation in scipy and make the behaviour dependent on the scipy version, e.g. output sparse arrays as soon as scipy deprecates matrices.

Could we list entries in the public API that are affected?

@jeremiedbb
Member

I would also like to avoid introducing a global config for that. Can't we just add an entry in the changelog of some future release saying that all sparse outputs will be sparse arrays 2 releases later (or longer if we think it's needed)?
I think it can even be done in 1.7, since our scipy min version supports sparse arrays.

@thomasjpfan
Member

thomasjpfan commented Apr 11, 2025

Let's say we deprecate sparse matrix now and in 3 versions we switch to sparse arrays everywhere. What can a user do now to prepare for the transition while scikit-learn is still returning sparse matrices? I would want a way to configure scikit-learn to output sparse arrays and make sure my code works.

@jeremiedbb
Member

Fair point. My concern is that

When sparse_format="sparray" everything that scikit-learn produces that is sparse will be a sparray. i.e function returns, estimator attributes, estimator properties, etc.

can represent a ton of places to adapt and test.

Maybe this is a motivation for scikit-learn 2 ?

@Charlie-XIAO
Contributor

I don't think our behavior should depend on the user's installed scipy version. It would be really hard to pinpoint the problem if a user has not changed their scikit-learn version but finds that the behavior changed.

I also think that a global config is not the most desirable option here. For me the most intuitive way is that f(X: *_array, ...) -> *_array and f(X: *_matrix, ...) -> *_matrix, as mentioned in #26418 (comment), so that scikit-learn automatically reacts to the user's migration at fine granularity. Supposing there were a global configuration, it would be inconvenient if a user only wants to migrate part of their code to use sparray (unless with some trick), and again it seems a bit counter-intuitive to me if I input spmatrix but get back sparray or vice versa, even if I'm aware of that global config.

However, there are indeed functions/methods that do not take sparse input but produce sparse output. If there are not too many of them, it's probably feasible to add a temporary per-function/method parameter that controls the behavior and follow the normal deprecation cycle. Otherwise we would probably have to have a global config (but IMO it should not affect functions/methods that do take sparse input).

@lorentzenchr
Member

lorentzenchr commented Apr 11, 2025

There are 2 different use cases to consider:

  1. transform(X) and predict(X) return a sparse object, sometimes triggered by the estimator parameter sparse_output.
  2. Attributes of estimators like lasso_model.sparse_coef_.

@dschult
Contributor

dschult commented Apr 14, 2025

It's probably good to keep in mind that users who avoid * and instead use @ and .multiply() will likely not care which sparse interface they get.

I'm seeing two approaches proposed in these comments and they both have concrete example implementations (for only part of the library) in existing PRs. Take a look if you think a concrete example is helpful for your decision making.

The places in the code that this affects are laid out in an above comment. To summarize:

  • there are common places like .fit() and .transform() methods
  • some functions can return sparse based on a bool keyword that could be made bool or string.
  • there is the sparse_coef property
  • there are four other "misfit functions", listed in that linked comment above, with no existing way to indicate the sparse interface.

I hope this helps focus the discussion on whether to use a library wide parameter or a function-by-function approach. Thanks!

@lorentzenchr
Member

In the dev meeting on 28 April 2025, we decided to go with a global config. Its exact options and further details must still be worked out.
