PyData/Sparse support #17364
Comments
Hey! Thanks for reaching out! Sorry I didn't respond to your issue, I got a bit side-tracked. What would be best, of course, would be to totally replace scipy sparse matrices. For full adoption we would also need to think about whether we want to support both formats or transition entirely to PyData/Sparse. Either way, we basically need a Cython API and BLAS. We could try to make a full list at some point, but we definitely need eig, eigh, svd, qr, lsqr and pinv (in addition to probably all the matrix-matrix and matrix-vector multiplications).
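For context, a rough sketch of what some of those routines look like today on scipy.sparse inputs: scipy.sparse.linalg provides eigsh, svds and lsqr, while qr and pinv have no sparse counterparts there. A pydata/sparse transition would need equivalents for all of these (X here is just an illustrative random input):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh, svds, lsqr

# Illustrative random CSR input.
X = sp.random(100, 20, density=0.1, format="csr", random_state=0)

# Truncated symmetric eigendecomposition (eigh-like) of the Gram matrix.
vals, vecs = eigsh((X.T @ X).tocsc(), k=5)

# Truncated SVD, the kind of call TruncatedSVD(algorithm="arpack") relies on.
u, s, vt = svds(X, k=5)

# Iterative least-squares solve, as used by e.g. Ridge(solver="lsqr").
b = np.ones(100)
coef = lsqr(X, b)[0]
```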
@amueller curious about why a Cython API is needed?
Yes, that's what we are doing, and those Cython routines mostly take CSR or CSC as input. Thanks for opening this discussion. It would be interesting to patch functions in
Realistically I don't see us transitioning to pydata/sparse entirely without very extensive testing, so that would likely mean we would need to support both for some time, which is additional maintenance effort. Adding numba as a required dependency is also not a light decision. In particular, that would mean we would no longer be able to use scikit-learn with PyPy, if I understand correctly?
Agreed that it'll be a big effort to switch, and stability is currently a long-term goal, although we try to stick with the
That was one of our goals. 😄
I had a brief glance at the function names in that file; it all seems to be code of the kind we'd want to upstream.
PyData/Sparse could be a soft, rather than a hard, dependency until Numba gains PyPy support. I've opened a discussion over at numba/numba#5766 to that end.
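A minimal sketch of what a soft dependency could look like, assuming a hypothetical helper (as_pydata_sparse is invented here for illustration) that falls back to scipy.sparse when pydata/sparse, and hence Numba, is unavailable:

```python
try:
    import sparse  # PyData/Sparse; pulls in Numba

    HAS_PYDATA_SPARSE = True
except ImportError:  # e.g. on PyPy, where Numba is unsupported
    sparse = None
    HAS_PYDATA_SPARSE = False


def as_pydata_sparse(X):
    """Hypothetical helper: convert a scipy.sparse matrix to pydata/sparse
    when available, otherwise keep the scipy.sparse input as-is."""
    if HAS_PYDATA_SPARSE:
        return sparse.COO.from_scipy_sparse(X)
    return X
```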
I agree. There's also performance to worry about: pydata/sparse is not yet on par with scipy.sparse.
I think there's much more to it than PyPy support. Numba isn't mature/stable enough to be a hard dependency yet. I don't think much has fundamentally changed there since the discussions on making Numba a dependency of SciPy in 2018 and of Pandas in 2019. What Pandas is doing right now (accepting user-defined kernels written with Numba as input) seems fine. Numba is great for easily writing individual functions that are fast. But for a hard dependency, you'd expect things like non-super-scary exceptions, portability, stability, etc.
Sorry, my statement about Cython was misleading. We are extracting the NumPy arrays for indptr etc. and operating on those. It would be great to upstream as much of this as possible, but I assume we have some code using sparse data that shouldn't be upstreamed, in k-means, SGDClassifier and things like that.
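A rough illustration of that pattern, in plain Python rather than Cython: the kernels see only the three underlying NumPy buffers of a CSR matrix, not the matrix object itself.

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(5, 4, density=0.5, format="csr", random_state=0)

# The three raw arrays that the Cython routines actually operate on.
data, indices, indptr = X.data, X.indices, X.indptr

# Example: row sums computed directly from the raw buffers.
row_sums = np.zeros(X.shape[0])
for i in range(X.shape[0]):
    for k in range(indptr[i], indptr[i + 1]):
        row_sums[i] += data[k]  # indices[k] would give the column index
```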
Assuming pydata/sparse#326 is completed, all of those will be trivial. In any case, for
@hameerabbasi can you explain how this relates to pydata/sparse#326? I don't think I follow.
Sure, pydata/sparse#326 refers to a set of research papers that define a method to generate efficient kernels for a broad range of storage formats. They can do anything composed of element-wise operations (with broadcasting) and reductions, but not things like eigendecompositions (which we intend to do with SciPy wrappers for LAPACK et al.). Since variances/means are composed of such operations, those can be computed easily.
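A small sketch of that class of operations with pydata/sparse's COO, which already exposes NumPy-style element-wise ops and reductions (exact method availability may depend on the installed version):

```python
import sparse

s = sparse.random((1000, 50), density=0.01, random_state=0)

col_means = s.mean(axis=0)  # a reduction along rows
col_vars = s.var(axis=0)    # element-wise squaring plus a reduction
total = (s * 2.0).sum()     # element-wise op with a scalar, then a reduction
```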
That's fair, but I think we require actual indexing etc. in sklearn at the Cython level. Right now that's done through the buffer interface of the underlying arrays for CSR and CSC. I'm quite certain we need something comparable.
That won't be a problem either, as the underlying structures will all be NumPy arrays corresponding to what one would expect for CSR/CSC.
Ah ok, that will probably work then, though I assume we'll have to write some conversion tools.
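Some conversion tooling already exists; a sketch of the round trip via COO (a compressed format with CSR/CSC-style data/indices/indptr buffers, GCXS, has also been under development on the pydata/sparse side and would be the closer analogue):

```python
import scipy.sparse as sp
import sparse

X_scipy = sp.random(10, 5, density=0.3, format="csr", random_state=0)

X_pydata = sparse.COO.from_scipy_sparse(X_scipy)  # scipy -> pydata
X_back = X_pydata.tocsr()                         # pydata -> scipy CSR
```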
Describe the workflow you want to enable
Replacing scipy.sparse arrays with PyData/Sparse in scikit-learn.

Describe your proposed solution
Hello. I'm the maintainer of PyData/Sparse. So far, I feel the conversation between us and the scikit-learn developers has been mostly second-hand.
I'm opening this issue in the hope of opening up a channel, and creating a meta-issue over in the PyData/Sparse repo for any requirements scikit-learn may have to bring this to fruition. Over time, my hope is that we can resolve issues on both sides to enable this.