Enable config setting `sparse_interface` to control sparray and spmatrix creation #31177
Conversation
Another note: We can work around it for now with e.g. ... The recent features for sparse, by version, are: ... This info might help us decide when to support which versions. I think the construction functions are all that is currently needed. If/when we start using indexing code for both sparse and dense, we will likely want 1.15. If/when we want nD sparse we will need v1.16, and broadcasting binary operations in v1.17. But for now, 1.8 leaves out only construction functions from current code.
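For illustration, a minimal sketch of version gates along these lines. The flag names `SCIPY_VERSION_BELOW_1_12` / `SCIPY_VERSION_BELOW_1_15` come from later commits in this PR; the actual implementation may differ:

```python
import scipy
import scipy.sparse
from sklearn.utils.fixes import parse_version

sp_version = parse_version(scipy.__version__)
SCIPY_VERSION_BELOW_1_12 = sp_version < parse_version("1.12")
SCIPY_VERSION_BELOW_1_15 = sp_version < parse_version("1.15")

# Example: prefer the sparray construction functions added in SciPy 1.12,
# falling back to the spmatrix constructors on older versions.
if SCIPY_VERSION_BELOW_1_12:
    eye = scipy.sparse.eye(3, format="csr")        # returns csr_matrix
else:
    eye = scipy.sparse.eye_array(3, format="csr")  # returns csr_array
```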
It sounds like the community has decided to convert to SciPy sparray using the config parameter. This PR is a start toward that. I've removed the "Draft status" from the PR. I am ready to implement this approach in other parts of the library. I have two questions:
Yup, this is basically what I had in mind.
```python
if _as_sparse(X_csr) is X_csr:
    assert X_transform is X_csr
else:
    assert X_transform is not X_csr
```
Although the object identity is not the same, do the underlying data/indices/indptr arrays point to the same data? If so, we can check the data?
Yes -- the underlying data is the same. Good idea. I've added a check for `indptr`.
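For illustration, a minimal check of the kind of buffer sharing described here (a sketch assuming the conversion does not copy; not code from this PR):

```python
import numpy as np
import scipy.sparse as sp

X_csr = sp.csr_matrix(np.eye(3))
X_arr = sp.csr_array(X_csr)  # change interface, keep the CSR format

assert X_arr is not X_csr                              # different wrapper objects
assert np.shares_memory(X_arr.data, X_csr.data)        # same data buffer
assert np.shares_memory(X_arr.indices, X_csr.indices)  # same indices buffer
assert np.shares_memory(X_arr.indptr, X_csr.indptr)    # same indptr buffer
```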
sklearn/utils/_sparse.py
```python
from .._config import get_config


def _as_sparse(X_sparse):
```
Do you think it'll be simpler to have `_as_sparse` be `_select_interface_if_sparse`?
I guess the difference between the functions is whether we already know that the input is sparse. We usually know whether it is sparse, but there are some cases where the input could be dense or sparse.

Looking at this again now, the cost of a simpler one-function approach is small. We always check `issparse`, so we could forgo the exception and just pass through anything that is not sparse.

Do you have a suggestion for the name of the single function? `_as_sparse` suggests it converts everything to sparse, but `_select_interface_if_sparse` is a long name. :) Maybe `_align_api_if_sparse`? Or `_align_sparse_api`?
I'm okay with `_align_api_if_sparse`.
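A minimal sketch of what the single pass-through function could look like, assuming SciPy >= 1.11 (for `sparse.sparray`) and the `sparse_interface` config key proposed in this PR; the actual implementation may differ:

```python
from scipy import sparse
from sklearn import get_config


def _align_api_if_sparse(X):
    """Convert sparse input to the configured interface; pass dense through."""
    if not sparse.issparse(X):
        return X  # dense (or anything non-sparse): no action
    # "sparse_interface" is the config key proposed in this PR
    interface = get_config().get("sparse_interface", "spmatrix")
    if interface == "sparray" and not isinstance(X, sparse.sparray):
        return getattr(sparse, X.format + "_array")(X)   # e.g. csr_matrix -> csr_array
    if interface == "spmatrix" and isinstance(X, sparse.sparray):
        return getattr(sparse, X.format + "_matrix")(X)  # e.g. csr_array -> csr_matrix
    return X
```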
I've updated the name and example code -- and I added a commit that implements ... These changes are largely orthogonal to the return-sparse-interface issue we've been focusing on here, but it is a needed step that would be easier to get feedback on here before this gets any bigger. If you'd prefer this in a different PR, let me know. And if it'd be good to put further changes in a separate PR, let me know. The further changes are mostly switching ... Do my choices for function names that bridge the old versions of SciPy look ok?
```python
normalizer = np.array(normalizer)  # convert np.matrix to np.array
if normalizer.ndim == 2:
    # old spmatrix treatment. RHS is a scalar (b/c normalizer is 2D row)
    affinity_matrix.data /= np.diag(normalizer)
else:
    # We could use the (questionable) spmatrix treatment using:
    # affinity_matrix.data /= np.diag(np.array(normalizer[np.newaxis, :]))
    # Instead: use numpy treatment dividing each row by its normalizer.
    affinity_matrix.data /= normalizer[affinity_matrix.indices]
```
This change deserves a comment -- and maybe even a separate PR.

`normalizer` is the `axis=0` sum of `affinity_matrix`. When `affinity_matrix` is numpy or sparray, this is a 1D array. But when `affinity_matrix` is spmatrix, it is 2D. So sparse matrices currently (on main) get special treatment:

`affinity_matrix.data /= np.diag(np.array(normalizer))`

Here `normalizer` is an `np.matrix` (2D with 1 row). `np.array(normalizer)` is an ndarray (still 2D with 1 row). So `np.diag` of that is just the first entry in the first row (a scalar).

But that divides the entire affinity_matrix by the sum of the first row! So, instead of normalizing each row by its sum, we get the whole matrix normalized by the sum of the first row. I think this is an error.

When `affinity_matrix` is a numpy array, we do divide each row by its sum. So, currently, sparse and dense compute different normalizations here.
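For illustration, a small demonstration of the spmatrix vs. sparray behavior described above (not code from the PR):

```python
import numpy as np
from scipy import sparse

A = np.array([[1.0, 2.0], [3.0, 4.0]])

normalizer = sparse.csr_matrix(A).sum(axis=0)  # np.matrix of shape (1, 2): [[4., 6.]]
print(np.diag(np.array(normalizer)))           # [4.] -- collapses to a single entry

print(sparse.csr_array(A).sum(axis=0))         # [4. 6.] -- 1D array, one entry per column
```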
I have added sparse array handling here to divide each row by its sum. But I have left the sparse matrix code there in case backward compatibility is an issue.

Should I open a separate PR for changing the spmatrix behavior so it doesn't divide by a scalar? Or just fix it in this PR? Or am I fixing it incorrectly in some way?
Probably easier to review as part of a separate PR.
Done in #31924. I reverted the changes here as they are covered in that PR.
Okay, I think this is finally ready for review. I've implemented a config to determine the type of sparse container to return (sparray or spmatrix). Both kinds can be used for sparse inputs. New helper functions are in `sklearn/utils/_sparse.py`.

New SciPy-version bool flags (`SCIPY_VERSION_BELOW_1_12`, `SCIPY_VERSION_BELOW_1_15`) ease indexing code that depends on the SciPy version (mostly indexing that makes 1D arrays).

All internal constructors are switched to the sparray-style construction functions. If you want me to separate this into smaller PRs, let me know. CC: @thomasjpfan, @lorentzenchr, @jjerphan, #26418
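For context, a sketch of how the config might be used from user code, assuming the `sparse_interface` parameter and the "sparray"/"spmatrix" values proposed in this PR (the final public behavior may differ):

```python
import scipy.sparse as sp
import sklearn
from sklearn.preprocessing import OneHotEncoder

X = [["a"], ["b"], ["a"]]

with sklearn.config_context(sparse_interface="sparray"):
    Xt = OneHotEncoder().fit_transform(X)
    print(isinstance(Xt, sp.sparray))   # True: transformer returns a sparse array

with sklearn.config_context(sparse_interface="spmatrix"):
    Xt = OneHotEncoder().fit_transform(X)
    print(isinstance(Xt, sp.spmatrix))  # True: transformer returns a sparse matrix
```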
@scikit-learn/core-devs ping for visibility
Seems pretty straightforward to me.
- `"sparray"`: Return sparse as SciPy sparse array | ||
- `"spmatrix"`: Return sparse as SciPy sparse matrix |
Are we changing this default at some point? Should we introduce a deprecation cycle at the same time?
Commit messages:

- set up function `_align_api_if_sparse()` with tests; also functions `_ensure_sparse_index_int32()` and `safely_cast_index_arrays()` and `_sparse_eye`, `_sparse_diags`, `_sparse_random` to span SciPy <1.12 changes
- introduce `SCIPY_VERSION_BELOW_1_12` and `SCIPY_VERSION_BELOW_1_15` boolean flags
- fix `rng` keyword arg for old SciPy versions
- ensure 2d sparse
- convert benchmarks, doc, modules
- convert `csr_matrix` to `csr_array` and CSR, COO, DIA, etc.
- make doctests pass
- pass tests on older SciPy versions, i.e. ensure int32 indices where needed
- make it work for SciPy 1.8
- improve test coverage
This PR sets up a config parameter `sparse_interface` to indicate "sparray" or "spmatrix" outputs, as suggested in #26418. The first commit sets everything up and implements the system for a few modules. Please take a look and provide feedback on whether this is the way to proceed. The next commit(s) will implement this same style of changes throughout the library. If you would prefer they be in separate PRs, let me know. (I'll keep Draft status until there is feedback and the full library is covered.)
More specifically, this PR does the following:

- Adds `sparse_interface` to the config parameters. (I think this name is better than `sparse_format` because "format" means csr/coo/lil, etc. in the sparse world.) The values it can hold are "sparray" or "spmatrix". Config tests are updated accordingly.
- Adds `utils._sparse.py` with (private) helper functions; the difference between them is how much checking is done. Tests added too.
  - `_as_sparse(x_sparse)`: raises unless the input is sparse; converts to the interface chosen by config.
  - `_select_interface_if_sparse(x)`: allows dense input with no action; sparse input uses `_as_sparse`.
  - `_convert_from_spmatrix_to_sparray(x)` and `_convert_from_sparray_to_spmatrix(x)` (see the sketch below).
- Converts `sklearn/feature_selection/text.py` and adapts tests.
- Converts `sklearn/linear_model/_coordinate_descent.py`; no tests change needed.
- Converts `sklearn/manifold/_locally_linear.py`; no tests change needed.

@thomasjpfan can you see if this does what you had in mind? I tried to pick modules that cover returning sparse, setting estimators to hold sparse, and transforming to sparse, so you can see how this would work.
Let me know if you think `_as_sparse` should be a public function, and if my approach aligns with how you want it. The next step for this PR is to repeat this type of change throughout the library.
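For reference, a rough sketch of the two conversion helpers listed above (illustrative only; the actual implementations in `utils/_sparse.py` may differ):

```python
from scipy import sparse


def _convert_from_spmatrix_to_sparray(x):
    # e.g. csr_matrix -> csr_array, coo_matrix -> coo_array, ...
    return getattr(sparse, x.format + "_array")(x)


def _convert_from_sparray_to_spmatrix(x):
    # e.g. csr_array -> csr_matrix, coo_array -> coo_matrix, ...
    return getattr(sparse, x.format + "_matrix")(x)
```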