TST separate checks for sparse array and sparse matrix input in estimator_checks #27576


Merged
merged 32 commits into scikit-learn:main from StefanieSenger:sparse_estimator_checks
Feb 23, 2024

Conversation


@StefanieSenger StefanieSenger commented Oct 12, 2023

Reference Issues/PRs

Towards #27522

What does this implement/fix? Explain your changes.

This PR aims to separate the check for sparse X input into two different checks, differentiating scipy sparse array and scipy sparse matrix.

LargeSparseNotSupportedClassifier has been modified to make it possible to fail depending on the input type.

I haven't changed any tags, because currently all scikit-learn estimators accept dense as well as sparse input of any type.

The change in _testing is something I stumbled over along the way; I hope it's okay.

Many thanks, @adrinjalali, for your help!

@github-actions

github-actions bot commented Oct 12, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7b0df06. Link to the linter CI: here

@StefanieSenger StefanieSenger changed the title TST separate checks for sparse array and sparse matrix input TST separate checks for sparse array and sparse matrix input in estimator_checks Oct 12, 2023
@StefanieSenger StefanieSenger marked this pull request as draft October 12, 2023 20:53
@StefanieSenger StefanieSenger marked this pull request as ready for review October 12, 2023 23:21
@StefanieSenger
Contributor Author

StefanieSenger commented Oct 12, 2023

  1. The previous CI failures were related to PR MAINT downcast indices dtype when converting sparse arrays #27372, where check_array wasn't able to correctly convert from the "dia" sparse array format into another sparse array format. I have excluded "dia" from the data generation for the two checks that use it, to make the corresponding tests pass.

    Side note: It's a bit odd that test_check_estimator only expects this one error message

    msg = (
        "Estimator LargeSparseNotSupportedClassifier doesn't seem to "
        r"support \S{3}_64 matrix, and is not failing gracefully.*"
    )
    

    which is only raised if the check runs through all the generated types until the end. If it breaks earlier (as with 'dia'), the test still only compares end results and fails for an unrelated reason. I believe this could be more intuitive.

  2. There is another pytest failure around ElasticNet and Lasso, and I don't know what the issue could be. The error that turns up says they can't handle large sparse arrays (though they do handle large sparse matrices).

@glemaitre
Member

There is another pytest failure around ElasticNet and Lasso, and I don't know what the issue could be.

I need to have a closer look at those, but I would not be surprised if we have something similar to what we try to solve in #27372: the Cython code expects 32-bit indices, while the COO array will not try to downcast to 32 bits but keep its indices in 64 bits, and then it will fail.

But I have to check.
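A minimal illustration of the behaviour difference in question, as a sketch; the dtype comments assume a scipy version affected by the conversion behaviour discussed in this thread:

import numpy as np
import scipy.sparse as sp

X = np.eye(3)
# matrix containers downcast indices that fit into 32 bits on conversion
print(sp.csr_matrix(X).asformat("coo").row.dtype)  # int32
# array containers may keep 64-bit indices on the scipy versions discussed here
print(sp.csr_array(X).asformat("coo").row.dtype)   # int64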

Comment on lines 842 to 843
# TODO re-insert "dia" when PR #27372 is merged
for sparse_format in ["dok", "lil", "bsr", "csc", "coo"]:
Member

I would wait for #27372 to be merged before merging this PR. Just adding a message so we don't forget about it.

Contributor Author

@StefanieSenger StefanieSenger Oct 18, 2023

Yes. 👍 I'm not sure where to put such a message so it doesn't get buried under newer messages after a while. If you're okay with this, I would change the comment in the code; that is most visible, I believe.

  In format['dok', 'lil', 'dia', 'bsr', 'csr', 'csc', 'coo',
  'coo_64', 'csc_64', 'csr_64']
  """

  assert X_csr.format == "csr"
  yield "csr", X_csr.copy()
- for sparse_format in ["dok", "lil", "dia", "bsr", "csc", "coo"]:
+ # TODO re-insert "dia" when PR #27372 is merged
+ for sparse_format in ["dok", "lil", "bsr", "csc", "coo"]:
  yield sparse_format, X_csr.asformat(sparse_format)
Member

Here, for the "coo" case, we need to make sure that the indices are always 32-bit. Matrices will automatically downcast, but this is not the case for arrays. This is probably the failure that we see for Lasso and ElasticNet.
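A minimal sketch of such a downcast; the helper name is hypothetical, and it assumes a scipy version where the row/col attributes of COO containers are assignable:

import numpy as np
import scipy.sparse as sp

def _as_coo_32bit(X):
    # hypothetical helper: convert to COO and force 32-bit indices,
    # mimicking the automatic downcast that sparse matrices perform
    X_coo = sp.coo_array(X)
    X_coo.row = X_coo.row.astype(np.int32)
    X_coo.col = X_coo.col.astype(np.int32)
    return X_coo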

Contributor Author

Hm, if I understand correctly, large sparse matrices only downcast automatically if the data has few enough non-zero values that the indices fit into 32 bits; otherwise they use 64 bits?

The scikit-learn estimators that currently set accept_large_sparse=True won't fail if we keep it as is here.

Except for Lasso and ElasticNet, which in fact don't support large data, not even as sparse matrices. I would change that in the check_array call of those estimators, not here.

Member

I'm also not sure why we would cast here, @glemaitre. If anything, I feel like the fact that .asformat downcasts automatically for spmatrix makes us miss the cases where estimators don't support large sparse input while we think we're testing them.

Member

I think that there are 2 things here:

  • in the first loop for sparse_format in [...], we intend to test the 32-bit behaviour, and we need to cast the indices because MAINT downcast indices dtype when converting sparse arrays #27372 intends to impose the same behaviour between matrix and array (and this is already the case in scipy-dev since we reported the regression).
  • the case below the for loop is specifically crafted for 64-bit indices. This case is the one testing whether we support large indices.

On top of that, I think we need to expand this 64-bit case to the CSR/CSC formats in addition to COO, because it will be more common for a COO matrix to be converted to CSR or CSC. Thus, we can test these different formats to see what happens when the matrix is not going to be converted.
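A hedged sketch of what the expanded 64-bit cases could look like (the helper name is hypothetical; CSR/CSC containers keep their indices in indices/indptr, COO containers in row/col):

import numpy as np

def _as_64bit_indices(X):
    # hypothetical helper: force 64-bit indices on a sparse container
    X = X.copy()
    if X.format == "coo":
        X.row = X.row.astype(np.int64)
        X.col = X.col.astype(np.int64)
    elif X.format in ("csr", "csc"):
        X.indices = X.indices.astype(np.int64)
        X.indptr = X.indptr.astype(np.int64)
    return X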

Member

So with #27372 merged into main, we should be able to go forward. I think you can change this part of the code and remove the temporary comment. Then we can do another round of review.

Contributor Author

Thank you, @glemaitre.

I have re-inserted the "dia" sparse data format into the data generator. Am I understanding correctly that with the merge of #27372 nothing else needs to be changed here?

There are still two issues from my side:

The CI had failed because of a DeprecationWarning from SciPy, which I have handled with ignore_warnings (though I'm not sure when to use warnings.filterwarnings instead).

Also, regarding the code coverage CI failure, I am not sure why it is complaining, because it seems to me that this PR doesn't cause fewer lines to be tested. Should I do something here?

Member

I will make a review and check those.

@StefanieSenger
Contributor Author

StefanieSenger commented Oct 18, 2023

I need to have a closer look at those, but I would not be surprised if we have something similar to what we try to solve in #27372: the Cython code expects 32-bit indices, while the COO array will not try to downcast to 32 bits but keep its indices in 64 bits, and then it will fail.

@glemaitre Yes, this is the case, as we found out yesterday with @adrinjalali. scipy PR #18509 made 64-bit indices the standard for the row and column arrays, so they are no longer downcast automatically depending on their actual sizes (as was previously the case for coo matrices). sklearn/linear_model/_cd_fast.pyx expects const int[::1] X_indices, and then the code breaks.

But whether for sparse arrays or matrices, it seems that ElasticNet and Lasso were never able to support large sparse data (and it was never tested whether they could).

I believe the solution for this PR is to set accept_large_sparse=False in check_array.
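For illustration, the expected effect, as a sketch (assuming a CSC container whose indices were forced to 64 bits):

import numpy as np
import scipy.sparse as sp
from sklearn.utils import check_array

X = sp.random(10, 10, density=0.1, format="csc")
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)

# with accept_large_sparse=False, check_array raises a ValueError about the
# 64-bit indices instead of letting the 32-bit-only Cython code crash later
check_array(X, accept_sparse="csc", accept_large_sparse=False)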

@glemaitre
Member

I believe the solution for this PR is to set accept_large_sparse=False in check_array.

Indeed, because the Cython code is only written for 32-bit indices.
Your solution is the right one, and we need a new entry in the changelog to acknowledge this bug.

@StefanieSenger
Contributor Author

StefanieSenger commented Oct 18, 2023

The CI failure AttributeError: module 'scipy.sparse' has no attribute 'sparray' reminds us that we cannot import sparse arrays and everything related from older scipy versions.

Now I see why you recommended writing _check_estimator_sparse_container using containers instead of variables, @glemaitre. The CI test would fail on this function as well, if it got that far, wouldn't it?

So I will re-write this, but I'm still not sure how else I could check for isinstance(X, sp.sparray) and isinstance(X, sp.spmatrix)?

Edit: Solved that together with Adrin by adding something to check for in utils.fixes.
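A minimal sketch of such a shim, with hypothetical names (older scipy has no sp.sparray at all, so the isinstance check needs a guard):

import scipy.sparse as sp

SPARRAY_PRESENT = hasattr(sp, "sparray")  # False on older scipy versions

def is_sparse_array(X):
    # hypothetical helper: True only for the sparse *array* API
    return SPARRAY_PRESENT and isinstance(X, sp.sparray)

def is_sparse_matrix(X):
    return sp.issparse(X) and isinstance(X, sp.spmatrix)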

@adrinjalali
Member

CI fails.

@StefanieSenger
Contributor Author

Ok, so the CI failure likely stems from scipy 1.12. Our test on RegressorChain crashes when we run check_estimator_sparse_array with input type scipy.sparse.dok_array (all the other sparse array types pass):

As a minimal reproducer, this does not work and I could not find a way to fix it:

import scipy.sparse as sp
import numpy as np

X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
X_sparse = sp.dok_array(X_dense)
Y_sparse = sp.lil_array((3, 1))

sp.hstack((X_sparse, Y_sparse))

There have been some other related fixes lately (#28047, #28185, and, a bit older, scipy#19220), but maybe there is nothing we can do here except changing types or raising an error for the users?

@lesteve
Member

lesteve commented Feb 9, 2024

This is not advice on how to make the CI happy, but this seems like a scipy 1.12 bug: you cannot stack two dok_arrays (it works fine in scipy 1.11):

import scipy.sparse as sp
import numpy as np

X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
X_sparse = sp.dok_array(X_dense)

sp.hstack((X_sparse, X_sparse))

I am going to open an issue about this, and they may have a recommendation for a workaround.

Edit: opened scipy/scipy#20060

@StefanieSenger
Contributor Author

StefanieSenger commented Feb 12, 2024

From the Scipy tracker, using sparse.coo_array is recommended as a workaround (see scipy/scipy#20060 (comment)).

Should we do that?
If so, I think it should be a separate PR focusing on fixing RegressorChain, and we should also state in the RegressorChain docstrings that sparse.dok_array is not supported as an input type for X. (Maybe it's also not supported as a type for Y; I would check that as well.) Since this would only be temporary, I'd also add comments to remind us to remove that constraint once scipy has fixed this problem.

@lesteve
Member

lesteve commented Feb 13, 2024

Maybe one possibility for this PR is to skip the test for RegressorChain and dok_array?

I am going to guess that this kind of edge case (using RegressorChain with dok_array) will not come up much in practice, but maybe I am wrong ...

Another PR could add a utils.fixes wrapper for scipy.sparse.hstack to be robust to the scipy 1.12 bug? I am wondering if this is worth it, though. The fact that nobody reported the bug before us leads me to believe that, through our tests, we are one of the main users of scipy sparse arrays.
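Such a wrapper could look roughly like this hypothetical sketch (the conversion to coo mirrors the workaround recommended in the scipy issue):

import scipy.sparse as sp

def hstack_compat(blocks, **kwargs):
    # hypothetical fixes-style wrapper: convert dok containers to coo first,
    # since scipy 1.12's hstack breaks on dok_array inputs
    blocks = [
        sp.coo_array(b) if sp.issparse(b) and b.format == "dok" else b
        for b in blocks
    ]
    return sp.hstack(blocks, **kwargs)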

@StefanieSenger
Contributor Author

I like that suggestion, @lesteve. With this bug not popping up in the issues (nor on scipy's side, as far as I could search), nor in any other test, I think it's reasonable to skip that test here and add a temporary fix in utils.

If another maintainer agrees, I would like to go ahead with this.

@adrinjalali
Member

Since scipy/scipy#20060 (comment) argues that having the input converted to coo in case of dok and lil is faster anyway, I think we should do that in the estimator, which also avoids hitting the bug we reported.

@StefanieSenger
Contributor Author

StefanieSenger commented Feb 16, 2024

@adrinjalali
You mean like this?

        elif sp.issparse(X):
            # to prevent scipy.sparse.hstack from breaking, we convert the sparse
            # dok_array to coo_array format; it's also faster. See scipy issue
            # https://github.com/scipy/scipy/issues/20060#issuecomment-1937007039
            if isinstance(X, sp._dok.dok_array):
                X = sp.coo_array(X)
            Y_pred_chain = sp.lil_matrix((X.shape[0], Y.shape[1]))
            X_aug = sp.hstack((X, Y_pred_chain), format="lil")

And would it stay like that, whether or not scipy fixes their bug?
If I understand correctly now, we don't even need to change the documentation, since sp.hstack converted its result to coo format anyway.

Isn't a permanent fix for hstacking sparse arrays of dok format needed, in case somebody wants to use it in another test before the next scipy version is released? What do you think, @lesteve?

Edit: Codecov is only referring to code that I didn't touch. I think this is a false positive, then.

@adrinjalali
Member

That sounds okay, although I'm not really sure why we're using lil here for y_pred in the first place instead of coo. We could simply use coo and do the lil conversion at the end, which might be faster.

Also, this if isinstance(X, sp._dok.dok_array): uses a private path of the array class; we should always use the public, documented way of checking whether it's a dok_array, which I think is X.format.

@StefanieSenger
Contributor Author

Also, this if isinstance(X, sp._dok.dok_array): uses a private path of the array class; we should always use the public, documented way of checking whether it's a dok_array, which I think is X.format.

I've changed it to a public path.

That sounds okay, although I'm not sure really why we're using lil here for y_pred in the first place instead of coo. We could simply do coo and at the end do the lil conversion, which might be faster.

I couldn't find any explanation for that (I looked into the PR that introduced it).
Am I right in assuming that performance is the only aspect to consider when deciding which sparse format to use? If so, I would like to run a benchmark and, if converting to coo is any faster, deal with it in a separate PR.

For this PR, I'd be happy if we could merge soon, if there is nothing else to take care of, @adrinjalali.

Member

@adrinjalali adrinjalali left a comment

We need to find the right solution before merging. Better to merge the right thing than to merge fast.

StefanieSenger and others added 2 commits February 20, 2024 16:53
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@StefanieSenger
Contributor Author

@adrinjalali
I have now made a suggestion for how to create Y_pred_chain based on the type of X. Please have a look.

@StefanieSenger
Contributor Author

Codecov is still complaining, but it has nothing to do with the new code I wrote.

@adrinjalali adrinjalali merged commit 9814884 into scikit-learn:main Feb 23, 2024
@StefanieSenger StefanieSenger deleted the sparse_estimator_checks branch February 23, 2024 13:57
@StefanieSenger
Contributor Author

Finally 🥳 , thanks a lot for your help here, @adrinjalali!
