-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
TST Extend tests for scipy.sparse.*array
in sklearn/cluster/tests/test_optics.py
#27104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
scipy.sparse.*array
in sklearn/cluster/tests/test_optics.py
@marenwestermann do you need help with this? Are the above information enough for you to investigate? Otherwise I can try to dig further to give more specific guidance. |
One thing that I could see is that our |
Thanks @ogrisel for asking. I actually do need help. I applied the suggested change, i.e.
and this resulted in the following error:
I then tried the following:
and equally
and this fixed the index errors but resulted in the following error:
I noticed that the shape of So this is where I currently don't know how to proceed. I had a look at @glemaitre 's last comment but from my understanding that bug doesn't cause the problem described here. I pulled the recent changes from upstream main but the problem described here persists. |
PS: I finally have time again to be active on scikit-learn and I'm excited to be back. 🎉 |
Looking closer to the code, we need to change the following: dists = X[point_index, unproc]
if issparse(dists):
dists.sort_indices()
dists = dists.data to dists = X[[point_index], unproc]
if isinstance(dists, np.matrix):
dists = dists.A1 So:
I tried locally and the tests are passing. |
I see, that makes sense. Thanks a lot for the detailed explanation @glemaitre. I applied the suggested changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments to optimize the number of tests to the minimum.
Good catch, I wouldn't have seen this myself. I had a careful look through the changes and tested them and everything looks good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @marenwestermann LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR fixes the code of the scikit-learn library (not just the test suite) and therefore needs a dedicated entry in sklearn/doc/whats_new/v1.4.rst
to document the fact that sklearn.cluster.OPTICS
now accepts scipy sparse arrays as inputs.
Take example on the log entry for NMF
as a reference to write this entry for the changelog.
Other than that here is a suggestion for further consistency in the code (even if it does not change the behavior seen by the tests):
I don't think the CI failures are due to my changes (unless it's a problem that this PR which is mentioned in the changelog isn't merged yet). |
Uhm. Could you solve the conflict of the changelog. |
Ah, I didn't see that it was a merge conflict, I fixed it. Please note that somewhere in the restructuring process of the changelog the changelog entry for #26602 seems to have gotten lost. |
Looks like the test |
sklearn/cluster/_optics.py
Outdated
dists = np.asarray(dists) | ||
elif hasattr(dists, "toarray"): | ||
# X is a scipy sparse array. | ||
dists = dists.toarray() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marenwestermann do you understand why this branch of the condition is not already covered by test_precomputed_dists
?
I am a bit confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel The issue is that previously dists
was a sparse matrix for the test case "is_sparse"=True
. Now, when we want to test for the case csr_matrix
, dists
becomes a numpy matrix (which is currently covered in the code) and in the case where we want to test for csr_array
, dists
becomes a numpy array. So we never test for the case sparse array.
Now the question is should dists
be sparse when X is sparse (as it was preciously the case) or is the current implementation ok? I don't have a deep enough understanding of the OPTICS estimator to make a judgement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minimal example to understand what happens here:
In [5]: from scipy import sparse
In [6]: X = sparse.random(10, 10, format="csr")
In [7]: X[0, :]
Out[7]:
<1x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
In [8]: X[0, [1, 2]]
Out[8]:
<1x2 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>
In [9]: X[[0], [1, 2]]
Out[9]: matrix([[0., 0.]])
Since we index both dimension, we get this np.matrix
that is the difference with the previous code.
For me, we don't need to cover anymore the case where we would eventually have a sparse matrix because we will get already a np.matrix
or a np.array
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marenwestermann do you think that you can reduce to only having np.matrix
or np.array
(not anymore sparse matrices) for the condition. Once done, I think that we good for merging.
This one was waiting for merge. Thanks @marenwestermann |
Reference Issues/PRs
#27090
What does this implement/fix? Explain your changes.
Makes use of the CSR_CONTAINERS fix to test sparse arrays.
Any other comments?
I included a
NotImplementedError
exception because I got the following test failure:Not sure if this is the best solution.