PyData/Sparse support #17364
Comments
Hey! Thanks for reaching out! Sorry I didn't respond to your issue, I got a bit side-tracked. What would be best, of course, would be to totally replace scipy sparse matrices. For full adoption we would also need to think about whether we want to support both formats or transition entirely to PyData/Sparse. Either way, we basically need a Cython API and BLAS. We could try to make a full list at some point, but we definitely need eig, eigh, svd, qr, lsqr and pinv (in addition to probably all the matrix-matrix and matrix-vector multiplications).
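For context, a rough sketch of what some of those routines look like today on scipy.sparse inputs: scipy.sparse.linalg provides eigsh, svds and lsqr, while qr and pinv have no sparse counterparts there. A pydata/sparse transition would need equivalents for all of these (X here is just an illustrative random input):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh, svds, lsqr

# Illustrative random CSR input.
X = sp.random(100, 20, density=0.1, format="csr", random_state=0)

# Truncated symmetric eigendecomposition (eigh-like) of the Gram matrix.
vals, vecs = eigsh((X.T @ X).tocsc(), k=5)

# Truncated SVD, the kind of call TruncatedSVD(algorithm="arpack") relies on.
u, s, vt = svds(X, k=5)

# Iterative least-squares solve, as used by e.g. Ridge(solver="lsqr").
b = np.ones(100)
coef = lsqr(X, b)[0]
```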
@amueller curious about why a Cython API is needed?
Yes, that's what we are doing, and those Cython routines mostly take CSR or CSC as input. Thanks for opening this discussion. It would be interesting to patch functions in
Realistically I don't see us transitioning to pydata/sparse entirely without very extensive testing, so that would likely mean we would need to support both for some time, which is additional maintenance effort. Adding numba as a required dependency is also not a light decision. In particular, that would mean we would no longer be able to use scikit-learn with PyPy, if I understand correctly?
Agreed that it'll be a big effort to switch, and stability is currently a long-term goal, although we try to stick with the
That was one of our goals. 😄
I had a brief glance at the function names in that file; it all seems to be code of the kind we'd want to upstream.
PyData/Sparse could be a soft, rather than a hard, dependency until Numba gains PyPy support. I've opened a discussion over at numba/numba#5766 to that end.
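A minimal sketch of what a soft dependency could look like, assuming a hypothetical helper (as_pydata_sparse is invented here for illustration) that falls back to scipy.sparse when pydata/sparse, and hence Numba, is unavailable:

```python
try:
    import sparse  # PyData/Sparse; pulls in Numba

    HAS_PYDATA_SPARSE = True
except ImportError:  # e.g. on PyPy, where Numba is unsupported
    sparse = None
    HAS_PYDATA_SPARSE = False


def as_pydata_sparse(X):
    """Hypothetical helper: convert a scipy.sparse matrix to pydata/sparse
    when available, otherwise keep the scipy.sparse input as-is."""
    if HAS_PYDATA_SPARSE:
        return sparse.COO.from_scipy_sparse(X)
    return X
```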
I agree. There's also performance to worry about: pydata/sparse is not yet on par with scipy.sparse.
I think there's much more to it than PyPy support. Numba isn't mature/stable enough to be a hard dependency yet. I don't think much has fundamentally changed there since the discussions on making Numba a dependency of SciPy in 2018 and of Pandas in 2019. What Pandas is doing right now (accepting user-defined kernels written with Numba as input) seems fine. Numba is great for easily writing individual functions that are fast. But for a hard dependency, you'd expect things like non-super-scary exceptions, portability, stability, etc.
Sorry, my statement about Cython was misleading. We are extracting the NumPy arrays for indptr etc. and operating on those. It would be great to upstream as much of this as possible, but I assume we have some code using sparse data that shouldn't be upstreamed, in k-means, SGDClassifier and things like that.
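A rough illustration of that pattern, in plain Python rather than Cython: the kernels see only the three underlying NumPy buffers of a CSR matrix, not the matrix object itself.

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(5, 4, density=0.5, format="csr", random_state=0)

# The three raw arrays that the Cython routines actually operate on.
data, indices, indptr = X.data, X.indices, X.indptr

# Example: row sums computed directly from the raw buffers.
row_sums = np.zeros(X.shape[0])
for i in range(X.shape[0]):
    for k in range(indptr[i], indptr[i + 1]):
        row_sums[i] += data[k]  # indices[k] would give the column index
```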
Assuming pydata/sparse#326 is completed, all of those will be trivial. In any case, for
@hameerabbasi can you explain how this relates to pydata/sparse#326? I don't think I follow.
Sure, pydata/sparse#326 refers to a set of research papers that define a method to generate efficient kernels for a broad range of storage formats. They can do anything composed of element-wise operations (with broadcasting) and reductions, but not things like eigendecompositions (which we intend to do with SciPy wrappers for LAPACK et al.). Since variances/means are composed of such operations, those can be computed easily.
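A small sketch of that class of operations with pydata/sparse's COO, which already exposes NumPy-style element-wise ops and reductions (exact method availability may depend on the installed version):

```python
import sparse

s = sparse.random((1000, 50), density=0.01, random_state=0)

col_means = s.mean(axis=0)  # a reduction along rows
col_vars = s.var(axis=0)    # element-wise squaring plus a reduction
total = (s * 2.0).sum()     # element-wise op with a scalar, then a reduction
```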
That's fair, but I think we require actual indexing etc. in sklearn at the Cython level. Right now that's done through the buffer interface of the underlying arrays for CSR and CSC. I'm quite certain we need something comparable.
That won't be a problem either, as the underlying structures will all be NumPy arrays corresponding to what one would expect for CSR/CSC.
Ah ok, that will probably work then, though I assume we'll have to write some conversion tools.
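Some conversion tooling already exists; a sketch of the round trip via COO (a compressed format with CSR/CSC-style data/indices/indptr buffers, GCXS, has also been under development on the pydata/sparse side and would be the closer analogue):

```python
import scipy.sparse as sp
import sparse

X_scipy = sp.random(10, 5, density=0.3, format="csr", random_state=0)

X_pydata = sparse.COO.from_scipy_sparse(X_scipy)  # scipy -> pydata
X_back = X_pydata.tocsr()                         # pydata -> scipy CSR
```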
Describe the workflow you want to enable
Replacing scipy.sparse arrays with PyData/Sparse in scikit-learn.

Describe your proposed solution
Hello. I'm the maintainer of PyData/Sparse. So far, I feel the conversation between us and the scikit-learn developers has been mostly second-hand.
I'm opening this issue in the hope of opening up a channel, and creating a meta-issue over in the PyData/Sparse repo for any requirements scikit-learn may have to bring this to fruition. Over time, my hope is that we can resolve issues on both sides to enable this.