TST separate checks for sparse array and sparse matrix input in estimator_checks
#27576
Conversation
I need to have a closer look at those, but I would not be surprised if we have something similar to what we are trying to solve in #27372: the Cython code expects 32-bit indices, and the COO array will not try to downcast to 32 bits but keep the indices in 64 bits, and then it will fail. But I have to check.
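To illustrate the concern, indices of a COO container can be downcast to 32-bit explicitly, independently of whether scipy does it automatically. This is a minimal sketch, not scikit-learn code; the variable names are made up:

```python
import numpy as np
import scipy.sparse as sp

# Build a small CSR matrix and convert it to COO.
X_csr = sp.random(10, 10, density=0.3, format="csr", random_state=0)
X_coo = X_csr.tocoo()

# Explicitly downcast the COO index arrays to 32 bits, so that Cython
# code compiled for 32-bit indices can consume them.
X_coo.row = X_coo.row.astype(np.int32)
X_coo.col = X_coo.col.astype(np.int32)
```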
sklearn/utils/estimator_checks.py (outdated)

    # TODO re-insert "dia" when PR #27372 is merged
    for sparse_format in ["dok", "lil", "bsr", "csc", "coo"]:
I would wait for #27372 to be merged before merging this PR. Just adding a message to not forget about it.
Yes. 👍 I'm not sure where to put such a message so it doesn't get buried under newer messages after a while. If you're okay with this, I would change the comment in the code; that is most visible, I believe.
sklearn/utils/estimator_checks.py (outdated)

        In format ['dok', 'lil', 'dia', 'bsr', 'csr', 'csc', 'coo',
                   'coo_64', 'csc_64', 'csr_64']
        """

        assert X_csr.format == "csr"
        yield "csr", X_csr.copy()
    -   for sparse_format in ["dok", "lil", "dia", "bsr", "csc", "coo"]:
    +   # TODO re-insert "dia" when PR #27372 is merged
    +   for sparse_format in ["dok", "lil", "bsr", "csc", "coo"]:
            yield sparse_format, X_csr.asformat(sparse_format)
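As a rough sketch of what a generator like the one in the diff could look like, including the 64-bit index variants listed in the docstring (this is simplified and hypothetical, not the actual scikit-learn helper; function and variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def generate_sparse_formats(X_csr):
    """Yield (format_name, container) pairs derived from a CSR input."""
    assert X_csr.format == "csr"
    yield "csr", X_csr.copy()
    for sparse_format in ["dok", "lil", "dia", "bsr", "csc", "coo"]:
        yield sparse_format, X_csr.asformat(sparse_format)
    # 64-bit index variants, used to exercise large-sparse handling.
    X_coo = X_csr.asformat("coo")
    X_coo.row = X_coo.row.astype(np.int64)
    X_coo.col = X_coo.col.astype(np.int64)
    yield "coo_64", X_coo
    X_csr64 = X_csr.copy()
    X_csr64.indices = X_csr64.indices.astype(np.int64)
    X_csr64.indptr = X_csr64.indptr.astype(np.int64)
    yield "csr_64", X_csr64

X = sp.random(8, 5, density=0.4, format="csr", random_state=0)
names = [name for name, _ in generate_sparse_formats(X)]
```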
Here, for the "coo" case, we need to make sure that the indices are always 32-bit. Matrices will automatically downcast, but this is not the case for arrays. This is probably the failure that we see in Lasso and ElasticNet:
Hm, if I understand correctly, large sparse matrices only downcast automatically if the data has few enough non-zero values that the indices fit into 32 bits; otherwise 64 bits are used?

The scikit-learn estimators that currently set accept_large_sparse=True won't fail if we keep it as is here. Except for Lasso and ElasticNet, which in fact don't support large data, not even for sparse matrices. I would change that in the check_array call of those estimators, not here.
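For context, the rejection that check_array performs for 64-bit indices when accept_large_sparse=False can be sketched roughly as follows. This is a simplified, hypothetical version covering only the COO/CSR/CSC cases; the helper name is made up and the real scikit-learn code differs:

```python
import numpy as np
import scipy.sparse as sp

def check_large_sparse(X, accept_large_sparse=False):
    """Sketch of a 64-bit index check for COO/CSR/CSC containers only."""
    if accept_large_sparse:
        return X
    keys = ("row", "col") if X.format == "coo" else ("indices", "indptr")
    for key in keys:
        if getattr(X, key).dtype == np.int64:
            raise ValueError(
                "Only sparse matrices with 32-bit integer indices are accepted."
            )
    return X

# Craft a CSR matrix with 64-bit indices to trigger the rejection.
X = sp.random(5, 5, density=0.5, format="csr", random_state=0)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)

try:
    check_large_sparse(X)
    rejected = False
except ValueError:
    rejected = True
```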
I'm also not sure why we would cast here, @glemaitre. If anything, I feel like the fact that .asformat downcasts automatically for spmatrix is making us miss the cases where estimators don't support large sparse input while we think we're testing them.
I think that there are 2 things here:

- In the first loop, for sparse_format in [...], we intend to test the 32-bit behaviour, and we need to cast the indices because #27372 (MAINT downcast indices dtype when converting sparse arrays) intends to impose the same behaviour between matrix and array (this is already the case in scipy-dev since we reported the regression).
- The case under the for loop is especially crafted for 64-bit indices. This is the case that tests whether we support large indices.

On top of that, I think we need to expand the 64-bit case to the CSR/CSC formats in addition to COO, because it will be more common for a COO matrix to be converted to CSR or CSC. Thus, we can use these different formats to see what happens when the matrix is not going to be converted.
So with #27372 merged in main, we should be able to go forward. I think that you can change this part of the code and remove the temporary comment. Then we can do another round of review.
Thank you, @glemaitre.
I have re-inserted the "dia" sparse format into the data generator. Am I understanding correctly that with the merge of #27372 nothing else needs to be changed here?
There are still two issues from my side:
The CI had failed because of a DeprecationWarning from SciPy, which I have handled with ignore_warnings (though I'm not sure when to use warnings.filterwarnings instead).
Also, regarding the code coverage CI failure, I am not sure why it is complaining, because it seems to me that this PR doesn't cause fewer lines to be tested. Should I do something here?
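On the ignore_warnings vs warnings.filterwarnings question: both scope the suppression. A minimal stand-in for what such a helper does around a single call could look like this, using only the standard library (a hypothetical helper, not the scikit-learn utility itself):

```python
import warnings

def call_ignoring_warnings(func, *args, category=DeprecationWarning, **kwargs):
    """Call func with the given warning category suppressed, then restore filters."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=category)
        return func(*args, **kwargs)

def noisy():
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# The DeprecationWarning is suppressed only inside the call.
result = call_ignoring_warnings(noisy)
```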
I will make a review and check those.
@glemaitre Yes, this is the case, as we found out yesterday with @adrinjalali. In scipy, PR #18509 made 64-bit indices the standard for the row and column arrays, so that they are no longer downcast automatically depending on their actual sizes (as was previously the case for coo matrices), no matter whether sparse arrays or matrices are used. I believe the solution for this PR is to set […]
Indeed, because the Cython code is only written for 32-bit indices.
The CI failure: now I see why you recommended me to write […]. So I will re-write this, but I'm still not sure how I could check for […].

Edit: Solved that together with Adrin by adding something to check for in […].
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
…cikit-learn into sparse_estimator_checks
CI fails.
Ok, so the CI failure likely stems from scipy 1.12. Our test on RegressorChain crashes when we run […]. As a minimal reproducer, this does not work and I could not find a way to fix it:

    import scipy.sparse as sp
    import numpy as np

    X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
    X_sparse = sp.dok_array(X_dense)
    Y_sparse = sp.lil_array((3, 1))
    sp.hstack((X_sparse, Y_sparse))

There were some other fixes lately related to this one (#28047, #28185, and maybe, a bit older, scipy#19220), but maybe there is nothing we can do here except changing types or raising an error for the users?
This is not advice on how to make CI happy, but this seems like a scipy 1.12 bug: you cannot stack two dok arrays:

    import scipy.sparse as sp
    import numpy as np

    X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
    X_sparse = sp.dok_array(X_dense)
    sp.hstack((X_sparse, X_sparse))

I am going to open an issue about this, and they may have some recommendation for a work-around.

Edit: opened scipy/scipy#20060
From the Scipy tracker, using […]. Should we do that?
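If the work-around is to convert away from the dok format before stacking, the minimal reproducer above succeeds once the dok_array is converted to a coo_array first. A sketch of that work-around, assuming a scipy version where the coo path works:

```python
import numpy as np
import scipy.sparse as sp

X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
X_sparse = sp.dok_array(X_dense)
Y_sparse = sp.lil_array((3, 1))

# Converting the dok_array to a coo_array avoids the hstack failure.
stacked = sp.hstack((sp.coo_array(X_sparse), Y_sparse))
```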
Maybe one possibility for this PR is to skip the test for […].

I am going to guess that this kind of edge case (using […]) […].

Another PR could add a […].
I like that suggestion, @lesteve. With this bug not popping up in the issues (nor on the scipy tracker, as far as I could search), nor in any other test, I think it's reasonable to skip that test here and add a temporary fix in utils. If another maintainer agrees, I would like to go ahead with this.
Since scipy/scipy#20060 (comment) argues that having the input converted to […]
@adrinjalali

    elif sp.issparse(X):
        # to prevent scipy.sparse.hstack from breaking, we convert the sparse
        # dok_array to a coo_array format, it's also faster; see scipy issue
        # https://github.com/scipy/scipy/issues/20060#issuecomment-1937007039
        if isinstance(X, sp._dok.dok_array):
            X = sp.coo_array(X)
        Y_pred_chain = sp.lil_matrix((X.shape[0], Y.shape[1]))
        X_aug = sp.hstack((X, Y_pred_chain), format="lil")

And it would stay like that, no matter whether scipy fixes their bug? A permanent fix for hstacking sparse arrays in dok format is not needed, in case somebody plans to use it in another test before the next scipy version is released? What do you think, @lesteve?

Edit: CodeCov is only referring to code that I didn't touch. I think this is a false positive.
That sounds okay, although I'm not really sure why we're using […].

Also, this […]
I've changed it to a public path.

I couldn't find any explanation for that (I looked into the PR that introduced it). For this PR, I'd be happy if we could merge it soon, if there is nothing else to take care of, @adrinjalali.
We need to find the right solution before merging. Better to merge the right thing than to merge fast.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali
Codecov is still complaining, but it has nothing to do with the new code I wrote.
Finally 🥳, thanks a lot for your help here, @adrinjalali!
Reference Issues/PRs
Towards #27522
What does this implement/fix? Explain your changes.
This PR aims to separate the check for sparse X input into two different checks, differentiating scipy sparse array and scipy sparse matrix.
LargeSparseNotSupportedClassifier has been modified to make it possible to fail depending on the input type.

I haven't changed any tags, because currently all scikit-learn estimators accept dense as well as sparse input of any type.
The thing from _testing is something I've stumbled over; I hope it's okay.

Many thanks, @adrinjalali, for your help!