TST separate checks for sparse array and sparse matrix input in estimator_checks
#27576
Conversation
I need to have a closer look at those, but I would not be surprised if we have something similar to what we are trying to solve in #27372: the Cython code expects 32-bit indices, and the COO array will not try to downcast to 32 bits but keep the indices in 64 bits, and then it will fail. But I have to check.
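To illustrate the concern, indices of a COO container can be downcast to 32-bit explicitly, independently of whether scipy does it automatically. This is a minimal sketch, not scikit-learn code; the variable names are made up:

```python
import numpy as np
import scipy.sparse as sp

# Build a small CSR matrix and convert it to COO.
X_csr = sp.random(10, 10, density=0.3, format="csr", random_state=0)
X_coo = X_csr.tocoo()

# Explicitly downcast the COO index arrays to 32 bits, so that Cython
# code compiled for 32-bit indices can consume them.
X_coo.row = X_coo.row.astype(np.int32)
X_coo.col = X_coo.col.astype(np.int32)
```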
sklearn/utils/estimator_checks.py (outdated)

    # TODO re-insert "dia" when PR #27372 is merged
    for sparse_format in ["dok", "lil", "bsr", "csc", "coo"]:
I would wait for #27372 to be merged before merging this PR. Just adding a message to not forget about it.
Yes. 👍 I'm not sure where to put such a message so it doesn't get buried under newer messages after a while. If you're okay with this, I would change the comment in the code; that is most visible, I believe.
sklearn/utils/estimator_checks.py (outdated)

        In format ['dok', 'lil', 'dia', 'bsr', 'csr', 'csc', 'coo',
                   'coo_64', 'csc_64', 'csr_64']
        """

        assert X_csr.format == "csr"
        yield "csr", X_csr.copy()
    -   for sparse_format in ["dok", "lil", "dia", "bsr", "csc", "coo"]:
    +   # TODO re-insert "dia" when PR #27372 is merged
    +   for sparse_format in ["dok", "lil", "bsr", "csc", "coo"]:
            yield sparse_format, X_csr.asformat(sparse_format)
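As a rough sketch of what a generator like the one in the diff could look like, including the 64-bit index variants listed in the docstring (this is simplified and hypothetical, not the actual scikit-learn helper; function and variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def generate_sparse_formats(X_csr):
    """Yield (format_name, container) pairs derived from a CSR input."""
    assert X_csr.format == "csr"
    yield "csr", X_csr.copy()
    for sparse_format in ["dok", "lil", "dia", "bsr", "csc", "coo"]:
        yield sparse_format, X_csr.asformat(sparse_format)
    # 64-bit index variants, used to exercise large-sparse handling.
    X_coo = X_csr.asformat("coo")
    X_coo.row = X_coo.row.astype(np.int64)
    X_coo.col = X_coo.col.astype(np.int64)
    yield "coo_64", X_coo
    X_csr64 = X_csr.copy()
    X_csr64.indices = X_csr64.indices.astype(np.int64)
    X_csr64.indptr = X_csr64.indptr.astype(np.int64)
    yield "csr_64", X_csr64

X = sp.random(8, 5, density=0.4, format="csr", random_state=0)
names = [name for name, _ in generate_sparse_formats(X)]
```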
Here, for the "coo" case, we need to make sure that the indices are always 32-bit. Matrices will automatically downcast, but this is not the case for arrays. This is probably the failure that we see in Lasso and ElasticNet:
Hm, if I understand correctly, large sparse matrices only downcast automatically if the data has few enough non-zero values that the indices fit into 32 bits; otherwise 64 bits are used?

The scikit-learn estimators that currently set accept_large_sparse=True won't fail if we keep it as is here. Except for Lasso and ElasticNet, which in fact don't support large data, not even for sparse matrices. I would change that in the check_array call of those estimators, not here.
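For context, the rejection that check_array performs for 64-bit indices when accept_large_sparse=False can be sketched roughly as follows. This is a simplified, hypothetical version covering only the COO/CSR/CSC cases; the helper name is made up and the real scikit-learn code differs:

```python
import numpy as np
import scipy.sparse as sp

def check_large_sparse(X, accept_large_sparse=False):
    """Sketch of a 64-bit index check for COO/CSR/CSC containers only."""
    if accept_large_sparse:
        return X
    keys = ("row", "col") if X.format == "coo" else ("indices", "indptr")
    for key in keys:
        if getattr(X, key).dtype == np.int64:
            raise ValueError(
                "Only sparse matrices with 32-bit integer indices are accepted."
            )
    return X

# Craft a CSR matrix with 64-bit indices to trigger the rejection.
X = sp.random(5, 5, density=0.5, format="csr", random_state=0)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)

try:
    check_large_sparse(X)
    rejected = False
except ValueError:
    rejected = True
```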
I'm also not sure why we would cast here, @glemaitre. If anything, I feel like the fact that .asformat downcasts automatically for spmatrix is making us miss the cases where estimators don't support large sparse input while we think we're testing them.
I think that there are 2 things here:

- In the first loop, for sparse_format in [...], we intend to test the 32-bit behaviour, and we need to cast the indices because #27372 (MAINT downcast indices dtype when converting sparse arrays) intends to impose the same behaviour between matrix and array (this is already the case in scipy-dev since we reported the regression).
- The case under the for loop is especially crafted for 64-bit indices. This is the case that tests whether we support large indices.

On top of that, I think we need to expand the 64-bit case to the CSR/CSC formats in addition to COO, because it will be more common for a COO matrix to be converted to CSR or CSC. Thus, we can use these different formats to see what happens when the matrix is not going to be converted.
So with #27372 merged in main, we should be able to go forward. I think that you can change this part of the code and remove the temporary comment. Then we can do another round of review.
Thank you, @glemaitre.
I have re-inserted the "dia" sparse format into the data generator. Am I understanding correctly that with the merge of #27372 nothing else needs to be changed here?
There are still two issues from my side:
The CI had failed because of a DeprecationWarning from SciPy, which I have handled with ignore_warnings (though I'm not sure when to use warnings.filterwarnings instead).
Also, regarding the code coverage CI failure, I am not sure why it is complaining, because it seems to me that this PR doesn't cause fewer lines to be tested. Should I do something here?
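On the ignore_warnings vs warnings.filterwarnings question: both scope the suppression. A minimal stand-in for what such a helper does around a single call could look like this, using only the standard library (a hypothetical helper, not the scikit-learn utility itself):

```python
import warnings

def call_ignoring_warnings(func, *args, category=DeprecationWarning, **kwargs):
    """Call func with the given warning category suppressed, then restore filters."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=category)
        return func(*args, **kwargs)

def noisy():
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# The DeprecationWarning is suppressed only inside the call.
result = call_ignoring_warnings(noisy)
```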
I will make a review and check those.
@glemaitre Yes, this is the case, as we found out yesterday with @adrinjalali. In scipy, PR #18509 made 64-bit indices the standard for the row and column arrays, so that they are no longer downcast automatically depending on their actual sizes (as was previously the case for coo matrices), no matter whether sparse arrays or matrices are used. I believe the solution for this PR is to set […]
Indeed, because the Cython code is only written for 32-bit indices.
The CI failure: now I see why you recommended me to write […]. So I will re-write this, but I'm still not sure how I could check for […].

Edit: Solved that together with Adrin by adding something to check for in […].
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
…cikit-learn into sparse_estimator_checks
CI fails.
Ok, so the CI failure likely stems from scipy 1.12. Our test on RegressorChain crashes when we run […]. As a minimal reproducer, this does not work and I could not find a way to fix it:

    import scipy.sparse as sp
    import numpy as np

    X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
    X_sparse = sp.dok_array(X_dense)
    Y_sparse = sp.lil_array((3, 1))
    sp.hstack((X_sparse, Y_sparse))

There were some other fixes lately related to this one (#28047, #28185, and maybe, a bit older, scipy#19220), but maybe there is nothing we can do here except changing types or raising an error for the users?
This is not advice on how to make CI happy, but this seems like a scipy 1.12 bug: you cannot stack two dok arrays:

    import scipy.sparse as sp
    import numpy as np

    X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
    X_sparse = sp.dok_array(X_dense)
    sp.hstack((X_sparse, X_sparse))

I am going to open an issue about this, and they may have some recommendation for a work-around.

Edit: opened scipy/scipy#20060
From the Scipy tracker, using […]. Should we do that?
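If the work-around is to convert away from the dok format before stacking, the minimal reproducer above succeeds once the dok_array is converted to a coo_array first. A sketch of that work-around, assuming a scipy version where the coo path works:

```python
import numpy as np
import scipy.sparse as sp

X_dense = np.array([[1, 0, 0, 2], [0, 4, 1, 0], [0, 0, 5, 0]])
X_sparse = sp.dok_array(X_dense)
Y_sparse = sp.lil_array((3, 1))

# Converting the dok_array to a coo_array avoids the hstack failure.
stacked = sp.hstack((sp.coo_array(X_sparse), Y_sparse))
```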
Maybe one possibility for this PR is to skip the test for […].

I am going to guess that this kind of edge case (using […]) […].

Another PR could add a […].
I like that suggestion, @lesteve. With this bug not popping up in the issues (nor on the scipy tracker, as far as I could search), nor in any other test, I think it's reasonable to skip that test here and add a temporary fix in utils. If another maintainer agrees, I would like to go ahead with this.
Since scipy/scipy#20060 (comment) argues that having the input converted to […]
@adrinjalali

    elif sp.issparse(X):
        # to prevent scipy.sparse.hstack from breaking, we convert the sparse
        # dok_array to a coo_array format, it's also faster; see scipy issue
        # https://github.com/scipy/scipy/issues/20060#issuecomment-1937007039
        if isinstance(X, sp._dok.dok_array):
            X = sp.coo_array(X)
        Y_pred_chain = sp.lil_matrix((X.shape[0], Y.shape[1]))
        X_aug = sp.hstack((X, Y_pred_chain), format="lil")

And it would stay like that, no matter whether scipy fixes their bug? A permanent fix for hstacking sparse arrays in dok format is not needed, in case somebody plans to use it in another test before the next scipy version is released? What do you think, @lesteve?

Edit: CodeCov is only referring to code that I didn't touch. I think this is a false positive.
That sounds okay, although I'm not really sure why we're using […].

Also, this […]
I've changed it to a public path.

I couldn't find any explanation for that (I looked into the PR that introduced it). For this PR, I'd be happy if we could merge it soon, if there is nothing else to take care of, @adrinjalali.
We need to find the right solution before merging. Better to merge the right thing than to merge fast.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali
Codecov is still complaining, but it has nothing to do with the new code I wrote.
Finally 🥳, thanks a lot for your help here, @adrinjalali!
Reference Issues/PRs
Towards #27522
What does this implement/fix? Explain your changes.
This PR aims to separate the check for sparse X input into two different checks, differentiating scipy sparse array and scipy sparse matrix.
LargeSparseNotSupportedClassifier has been modified to make it possible to fail depending on the input type.

I haven't changed any tags, because currently all scikit-learn estimators accept dense as well as sparse input of any type.
The thing from _testing is something I've stumbled over; I hope it's okay.

Many thanks, @adrinjalali, for your help!