PyData/Sparse support #17364


Open
hameerabbasi opened this issue May 27, 2020 · 12 comments


@hameerabbasi

Describe the workflow you want to enable

Replacing scipy.sparse arrays with PyData/Sparse in scikit-learn.

Describe your proposed solution

Hello. I'm the maintainer of PyData/Sparse. So far, I feel the conversation between us and the scikit-learn developers has been mostly second-hand.

I'm opening this issue in the hope of opening up a direct channel, and I'm creating a meta-issue over in the PyData/Sparse repo for any requirements that scikit-learn may have, so we can bring this to fruition. Over time, my hope is that we can resolve issues on both sides to enable this.

@amueller
Member

amueller commented May 28, 2020

Hey! Thanks for reaching out! Sorry I didn't respond to your issue, I got a bit side-tracked.
There are basically two different paths. One is to use PyData/Sparse as a wrapper format while internally keeping scipy sparse. That's because we (or some of us) like feature names, and we can get them via xarray. Benchmark and discussion here:
#16772 (comment)

What would be better, of course, would be to replace scipy sparse matrices entirely.
For that we would need pretty substantial BLAS support to start with, and then we would need to think about the Cython code.
Is there a C / Cython API?

For full adoption we would also need to think about whether we want to support both formats or transition entirely to PyData/Sparse. Either way, basically we need a Cython API and BLAS.

We could try to make a full list at some point, but we definitely need eig, eigh, svd, qr, lsqr and pinv (in addition to probably all the matrix-matrix and matrix-vector multiplications).
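
For context, here's a rough sketch (illustrative only, not actual scikit-learn code) of the kind of sparse linear algebra we currently route through scipy.sparse and scipy.sparse.linalg:

```python
# Illustrative only: today's sparse linear-algebra surface goes through
# scipy.sparse and scipy.sparse.linalg.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds, eigsh, lsqr

X = sp.random(100, 20, density=0.05, format="csr", random_state=0)
y = np.random.default_rng(0).normal(size=100)

U, s, Vt = svds(X, k=5)           # truncated SVD
w, V = eigsh(X.T @ X, k=5)        # symmetric eigendecomposition of the Gram matrix
coef = lsqr(X, y)[0]              # sparse least-squares solve
G = X.T @ X                       # matrix-matrix product
Xv = X @ np.ones(X.shape[1])      # matrix-vector product
```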

@rgommers
Contributor

@amueller curious about why a Cython API? scipy.sparse doesn't have one, so I assume you're going through Python right now when using sparse matrices in Cython code.

@rth
Member

rth commented May 28, 2020

> I assume you're going through Python right now when using sparse matrices in Cython code.

Yes, that's what we are doing and those Cython routines mostly take CSR or CSC as input.

Thanks for opening this discussion. It would be interesting to patch the functions in sklearn.datasets to return pydata/sparse for the sparse cases and run the test suite to see what would be necessary to make it work. I suspect we would need to:

  • add support to utils.check_array
  • rework all cases where we currently have separate handling of sparse vs. non-sparse input (e.g. here). A number of these are simply workarounds for the fact that scipy.sparse is not an ndarray, so hopefully pydata/sparse would just work with the code in the dense branch.
  • convert to the expected format when we use our Cython sparse utils (see e.g. utils/sparsefuncs.py). Ideally some of those could be upstreamed.
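
A rough sketch of that experiment might look like the following; as_pydata_sparse is a hypothetical helper, not something that exists in either codebase:

```python
# Rough sketch: convert a dataset loader's scipy.sparse output to pydata/sparse
# and see which code paths assume scipy.sparse specifically.
import scipy.sparse as sp
import sparse  # pydata/sparse
from sklearn.datasets import fetch_20newsgroups_vectorized

def as_pydata_sparse(X):
    """Convert scipy.sparse input to a pydata/sparse COO array, pass through otherwise."""
    return sparse.COO.from_scipy_sparse(X) if sp.issparse(X) else X

bunch = fetch_20newsgroups_vectorized(subset="train")
X = as_pydata_sparse(bunch.data)  # sparse.COO instead of scipy CSR

# Feeding X through utils.check_array and the estimators would reveal
# what is needed to make things work.
```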

Realistically I don't see us transitioning to pydata/sparse entirely without very extensive testing, so that would likely mean we would need to support both for some time, which is additional maintenance effort.

Adding numba as a required dependency is also not a light decision. In particular, that would mean we would no longer be able to use scikit-learn with PyPy, if I understand correctly?

@hameerabbasi
Author

> Realistically I don't see us transitioning to pydata/sparse entirely without very extensive testing, so that would likely mean we would need to support both for some time, which is additional maintenance effort.

Agreed that it'll be a big effort to switch, and stability is currently a long-term goal for us, although we try to stick to the ndarray interface as closely as possible.

> A number of these are simply workarounds for the fact that scipy.sparse is not an ndarray, so hopefully pydata/sparse would just work with the code in the dense branch.

That was one of our goals. 😄
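
As a small illustration (not code from either project), ndarray-style code tends to work unchanged for dense arrays and sparse.COO, while scipy.sparse needs its own branch:

```python
# Illustration: the same NumPy-style code works for ndarrays and sparse.COO,
# but scipy.sparse matrices follow np.matrix semantics instead.
import numpy as np
import scipy.sparse as sp
import sparse

def column_means(X):
    return X.mean(axis=0)

column_means(np.eye(3))                # ndarray, shape (3,)
column_means(sparse.eye(3))            # sparse.COO, same ndarray semantics
column_means(sp.eye(3, format="csr"))  # np.matrix of shape (1, 3) -- hence the separate branches
```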

> Convert to the expected format when we use our Cython sparse utils (see e.g. utils/sparsefuncs.py). Ideally some of those could be upstreamed.

I had a brief glance at the function names in that file; it all seems to be code of the kind we'd want to upstream.

> Adding numba as a required dependency is also not a light decision. In particular, that would mean we would no longer be able to use scikit-learn with PyPy, if I understand correctly?

PyData/Sparse could be a soft, rather than a hard, dependency until Numba gains PyPy support. I've opened a discussion over at numba/numba#5766 to that end.
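
A soft dependency could follow the usual optional-import pattern; this is just a sketch, not a proposal for concrete names:

```python
# Sketch of a soft-dependency check: only take pydata/sparse code paths when
# the library (and hence Numba) is importable.
try:
    import sparse  # pydata/sparse, which currently requires Numba
    HAS_PYDATA_SPARSE = True
except ImportError:
    HAS_PYDATA_SPARSE = False

def is_pydata_sparse(X):
    return HAS_PYDATA_SPARSE and isinstance(X, sparse.SparseArray)
```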

@rgommers
Contributor

> Realistically I don't see us transitioning to pydata/sparse entirely without very extensive testing,

I agree. There's also performance to worry about - pydata/sparse is not yet on par with scipy.sparse. And the sparse linalg support is still going through scipy.sparse.linalg anyway.

> PyData/Sparse could be a soft, rather than a hard, dependency until Numba gains PyPy support.

I think there's much more to it than PyPy support. Numba isn't mature/stable enough to be a hard dependency yet. I don't think much has fundamentally changed there since the discussions on making Numba a dependency of SciPy in 2018 and of Pandas in 2019. What Pandas is doing right now - accepting user-defined kernels written with Numba as input - seems fine. Numba is great for easily writing individual functions that are fast. But for a hard dependency, you'd expect things like non-super-scary exceptions, portability, stability, etc.

@amueller
Member

amueller commented Jun 2, 2020

Sorry, my statement about Cython was misleading. We are extracting the NumPy arrays for indptr etc. and are operating on those:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs_fast.pyx
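
For context, a minimal illustration of that pattern (csr_row_sums below is a made-up name): the CSR building blocks are plain NumPy arrays that Cython can view as typed buffers.

```python
# The CSR attributes are plain NumPy arrays, which Cython routines take
# directly as buffers/memoryviews.
import scipy.sparse as sp

X = sp.random(5, 4, density=0.3, format="csr", random_state=0)
data, indices, indptr = X.data, X.indices, X.indptr  # each an np.ndarray

# A Cython signature then operates on these arrays, e.g.
#   def csr_row_sums(floating[:] data, int[:] indices, int[:] indptr, int n_rows): ...
```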

It would be great to upstream as much as possible of this, but I assume we have some code using sparse data that shouldn't be upstreamed, in k-means and SGDClassifier and things like that.

@hameerabbasi
Author

Assuming pydata/sparse#326 is completed, all of those will be trivial. In any case, for COO, we do have variances, means, and many of the things in that file.

@amueller
Member

amueller commented Jun 3, 2020

@hameerabbasi can you explain how this relates to pydata/sparse#326? I don't think I follow.

@hameerabbasi
Author

Sure. pydata/sparse#326 refers to a set of research papers that define a method to generate efficient kernels for a broad range of storage formats. These can handle anything composed of element-wise operations (with broadcasting) and reductions, but they can't do things like, for example, eigendecompositions (which we intend to do with SciPy wrappers for LAPACK et al.).

Since variances/means are all composed of such operations, those can be easily computed.
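
For example, per-feature means and variances reduce to element-wise products plus axis reductions, which sparse.COO already supports today (a rough sketch, not library code):

```python
# Per-feature mean and variance built only from element-wise operations
# and reductions on a pydata/sparse COO array.
import sparse

X = sparse.random((1000, 20), density=0.01)
n = X.shape[0]

mean = X.sum(axis=0) / n
var = (X * X).sum(axis=0) / n - mean * mean  # E[x^2] - E[x]^2
```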

@amueller
Member

amueller commented Jun 3, 2020

That's fair, but I think we require actual indexing etc. in sklearn at the Cython level. Right now that's done by using the buffer interface of the underlying arrays for CSR and CSC. I'm quite certain we need something comparable.

@hameerabbasi
Author

That won't be a problem either, as the underlying structures will all be NumPy arrays, all corresponding to what one would expect for CSR/CSC.

@amueller
Member

amueller commented Jun 3, 2020

Ah ok, that will probably work then, though I assume we'll have to write some conversion tools.
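
For what it's worth, such conversion helpers could be thin wrappers around the existing pydata/sparse conversion APIs; the names below are hypothetical and assume 2-d COO/GCXS arrays on the pydata/sparse side:

```python
# Sketch of possible conversion helpers at the Cython boundary.
import scipy.sparse as sp
import sparse

def to_scipy_csr(X):
    """pydata/sparse array -> scipy.sparse CSR matrix for the Cython routines."""
    return sp.csr_matrix(X.to_scipy_sparse())

def from_scipy(X):
    """scipy.sparse matrix -> pydata/sparse COO array."""
    return sparse.COO.from_scipy_sparse(X)
```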
