Skip to content

First steps toward sparray migration pass 2 #31072

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

dschult
Copy link
Contributor

@dschult dschult commented Mar 26, 2025

This PR converts sklearn/metrics/cluster/_supervised.py [Edit: and sklearn/manifold/_locally_linear.py] to use sparray. The reason for doing a small one-file two-file PR here is to get feedback on kwargs that direct spmatrix vs sparray. [Edit: the second module includes a function that has no current sparse keyword -- so maybe needs a new one. see the next comment in this PR.]

  1. For the function contingency_matrix this PR extends the sparse keyword to allow "boolean or str" input with valid string inputs being "sparray" or "spmatrix". The value False continues to signify an ndarray return value while True signifies spmatrix for now with a note that this will change behavior to sparray alongside the deprecation of spmatrix. Note that in this module all calls to contingency_matrix lead to non-sparse output so have been switched from sparse=True to sparse="sparray" in alignment with Pass 2 of migration to sparray making all internal (non-user facing) sparse usage be sparray.

    • All calls to contingency_matrix repo-wide that use sparse have been updated (all are in this module).
    • Tests are updated to check the new keyword args.
    • Is this the right approach? the right wording? Thanks!
  2. There is a doc reference to sklearn.sparse.csr_matrix which I assume never existed. So I reworded that doc.

  3. This file has a function fowles_mallow_score which has a sparse keyword but it is never used. There is a PR FIX ‘sparse’ kwarg was not used by fowlkes_mallows_score #28981 which points out this flaw and attempts to make the keyword work, but the function's code/algorithm uses sparse features of the contingency_matrix -- so the code would not work with dense arrays -- the keyword cannot be meaningful here without new and less efficient code. I believe that the keyword should simply be removed. It is broken and shouldn't be enabled.

    • So, this PR removes the sparse keyword from fowlkes_mallow_score. Let me know if this should be separated from the rest of this PR.

This PR relates to #26418 (RFC supporting scipy.sparse.sparray)

Copy link

github-actions bot commented Mar 26, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e88b716. Link to the linter CI: here

@dschult dschult force-pushed the sparray_migration_pass2_trialrun branch from d35c29c to 438b5fe Compare March 26, 2025 00:51
@dschult
Copy link
Contributor Author

dschult commented Mar 27, 2025

I have committed a second module migration to this PR: sklearn/manifold/_locally_linear.py.
This one shows one possible way to handle a function (barycenter_kneighbors_graph from ) that would need a new kwarg to indicate spmatrix or sparray. But I now see this might be a not-so-great example because it is a utility function that maybe could just switch to sparray as an implementation detail. If so, just say the word and I'll implement that instead. The tests of barycenter... get adjusted to test both sparray and spmatrix output types. And the doc_string explains what is changing.

This two-change approach for the users sets the default spmatrix=True as the default, so they don't have to change anything right away, but should change to spmatrix=False when their code works for sparray. Then we later change the default to spmatrix=False, allowing them to remove the keyword (two-change). And finally we remove the kwarg.

Alternatives include making the initial default be spmatrix=False. Users who don't care will never have to change anything. Users who depend on the result being spmatrix (should not be many -- they just use it to input to null_space) will need to change their code to use-or-not-use the keyword arg depending on version of sklearn. Note: even with the two-change approach some libraries may have to make version dependent code if the time between adding the kwarg and changing its default is not super long.

The other change in this module (sklearn/manifold/_locally_linear.py) is a typical Pass 2 change (from the migration to sparray document) of the constructor lil_matrix to lil_array, and the sparse construction function eye to eye_array.

This should still be short enough to ease review of the strategies shown here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant