
Enable config setting sparse_interface to control sparray and spmatrix creation #31177


Open: wants to merge 4 commits into base: main

Conversation

dschult
Contributor

@dschult dschult commented Apr 11, 2025

This PR sets up a config parameter sparse_interface to indicate "sparray" or "spmatrix" outputs, as suggested in #26418.

The first commit sets everything up and implements the system for a few modules. Please take a look and provide feedback on whether this is the way to proceed. The next commit(s) will implement the same style of changes throughout the library. If you would prefer they be in separate PRs, let me know. (I'll keep Draft status until there is feedback and the full library is covered.)

More specifically, this PR does the following:

  • adds sparse_interface to the config parameters. (I think this name is better than sparse_format because "format" means csr/coo/lil, etc. in the sparse world.) The values it can hold are "sparray" or "spmatrix". Config tests are updated accordingly.
  • adds utils._sparse.py with (private) helper functions; they differ in how much checking is done. Tests are added too.
    • _as_sparse(x_sparse): raises unless the input is sparse; converts to the interface chosen by the config.
    • _select_interface_if_sparse(x): allows dense input with no action; sparse input goes through _as_sparse.
    • one-line convenience functions: _convert_from_spmatrix_to_sparray(x) and _convert_from_sparray_to_spmatrix(x)
  • updates the following modules to use these helpers when returning or storing newly created sparse objects:
    • sklearn/feature_selection/text.py, and adapts tests.
    • sklearn/linear_model/_coordinate_descent.py; no test changes needed.
    • sklearn/manifold/_locally_linear.py; no test changes needed.

@thomasjpfan can you see if this does what you had in mind? I tried to pick modules that cover returning sparse, setting estimators to hold sparse, and transforming to sparse, so you can see how this would work.

Let me know if you think _as_sparse should be a public function, and whether my approach aligns with what you want. The next step for this PR is to repeat this type of change throughout the library.


github-actions bot commented Apr 11, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 412fb75. Link to the linter CI: here

@dschult
Contributor Author

dschult commented Apr 11, 2025

Another note:
The new sparse construction function eye_array(n), which is the sparray version of eye(n), was released in SciPy v1.12 along with other construction functions like diags_array. So these will not work while the oldest supported version is v1.8.

We can work around it for now with e.g. _as_sparse(eye(n)), but it will need to be updated later (before spmatrix is removed).

The recent features for sparse by version are:

  • v1.12 added construction functions, e.g. eye_array, diags_array, etc.
  • v1.14 added 1D sparray support
  • v1.15 added indexing for sparray that returns 1D objects (like numpy.array does), e.g. A[3, :] -> 1D array
  • goals: v1.16 nD support, v1.17 broadcasting of binary operations

This info might help us decide when to support which versions. I think the construction functions are all that is currently needed. If/when we start sharing indexing code between sparse and dense, we will likely want v1.15. If/when we want nD sparse we will need v1.16, and broadcasting of binary operations needs v1.17. But for now, v1.8 leaves out only the construction functions from the current code.

@dschult
Contributor Author

dschult commented May 4, 2025

It sounds like the community has decided to convert to SciPy sparray using the config parameter. This PR is a start toward that. I've removed the "Draft status" from the PR. I am ready to implement this approach in other parts of the library.

I have two questions:

  • Should I put the changes for other parts of the library in a different PR?
  • Is there a timeline for bumping the minimum supported SciPy version? I will likely want to add a few functions to fixes.py, and knowing their expected lifetime would give context to that effort. If it isn't known, that's a fine answer too.

@dschult dschult marked this pull request as ready for review May 4, 2025 03:35
Member

@thomasjpfan thomasjpfan left a comment

Yup, this is basically what I had in mind.

Comment on lines 1617 to 1629
if _as_sparse(X_csr) is X_csr:
assert X_transform is X_csr
else:
assert X_transform is not X_csr
Member

Although the object identity is not the same, are the underlying data/indices/indptr arrays pointing to the same memory? If so, can we check the data?

Contributor Author

Yes -- the underlying data is the same. Good idea. I've added a check for indptr.
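A small standalone demonstration of why checking the underlying buffers works (generic SciPy code, not the PR's test code):

```python
import numpy as np
import scipy.sparse as sp

X_csr = sp.csr_matrix([[1.0, 0.0], [0.0, 2.0]])
X_arr = sp.csr_array(X_csr)  # the interface switch, with copy=False semantics

# Different container objects, but the same underlying buffers: the
# conversion rewraps data/indices/indptr without copying them.
assert X_arr is not X_csr
assert np.shares_memory(X_arr.data, X_csr.data)
assert np.shares_memory(X_arr.indptr, X_csr.indptr)
```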

from .._config import get_config


def _as_sparse(X_sparse):
Member

Do you think it'll be simpler to have _as_sparse be _select_interface_if_sparse?

Contributor Author

@dschult dschult May 6, 2025


I guess the difference between the functions is whether we already know the input is sparse. We usually do, but there are some cases where the input could be dense or sparse.

Looking at this again now, the cost of a simpler one-function approach is small. We always check issparse, so we could forgo the exception and just pass through anything that is not sparse.

Do you have a suggestion for the name of the single function? _as_sparse suggests it converts everything to sparse, but _select_interface_if_sparse is a long name. :) Maybe _align_api_if_sparse? Or _align_sparse_api?

Member

I'm okay with _align_api_if_sparse.

@dschult dschult force-pushed the impl_as_sparse_function branch 3 times, most recently from 5e3d3cd to 511de76 Compare May 8, 2025 03:29
@dschult
Contributor Author

dschult commented May 8, 2025

I've updated the name and example code -- and I added a commit that implements (with a utils shim function for SciPy versions older than 1.12) the 4 sparray construction functions. I've named them _sparse_eye, _sparse_diags, _sparse_random and _sparse_block. They call eye_array, diags_array, random_array and block_array on recent versions of SciPy.

These changes are largely orthogonal to the sparse return interface issue we've been focusing on here, but they are a needed step that is easier to get feedback on before this gets any bigger. If you'd prefer this in a different PR, let me know, and likewise if further changes should go in a separate PR. The further changes are mostly switching csr_matrix calls to csr_array, with _align_api_if_sparse at the end of the function if the result is returned or stored somewhere.

Do my choices for function names that bridge the old versions of SciPy look ok?
Thanks!

@dschult dschult force-pushed the impl_as_sparse_function branch 3 times, most recently from 8b58511 to a802f12 Compare July 19, 2025 17:04
@dschult dschult force-pushed the impl_as_sparse_function branch 3 times, most recently from 05da23c to 6fbaae6 Compare July 25, 2025 17:50
Comment on lines 466 to 474
normalizer = np.array(normalizer) # convert np.matrix to np.array
if normalizer.ndim == 2:
# old spmatrix treatment. RHS is a scalar (b/c normalizer is 2D row)
affinity_matrix.data /= np.diag(normalizer)
else:
# We could use the (questionable) spmatrix treatment using:
# affinity_matrix.data /= np.diag(np.array(normalizer[np.newaxis, :]))
# Instead: use numpy treatment dividing each row by its normalizer.
affinity_matrix.data /= normalizer[affinity_matrix.indices]
Contributor Author

This change deserves a comment -- and maybe even a separate PR.
normalizer is the axis=0 sum of affinity_matrix. When affinity_matrix is numpy or sparray, this is a 1D array.
But when affinity_matrix is spmatrix, it is 2D. So sparse matrices currently (on main) get special treatment:
affinity_matrix.data /= np.diag(np.array(normalizer))
Here normalizer is an np.matrix (2D with 1 row), np.array(normalizer) is an ndarray (still 2D with 1 row), and np.diag of that is just the first entry in the first row (a scalar).

But that divides the entire affinity_matrix by the sum of the first row! So, instead of normalizing each row by its sum, we get the whole matrix normalized by the sum of the first row. I think this is an error.

When affinity_matrix is a numpy array, we do divide each row by its sum. So, currently, sparse and dense compute different normalizations here.

I have added sparse array handling here to divide each row by its sum. But I have left the sparse matrix code there in case backward compatibility is an issue.

Should I open a separate PR for changing the spmatrix behavior so it doesn't divide by a scalar? Or just fix it in this PR? Or am I fixing it incorrectly in some way?
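The spmatrix/sparray discrepancy described above can be reproduced in isolation (generic SciPy code, not the PR's):

```python
import numpy as np
import scipy.sparse as sp

M = [[4.0, 1.0], [1.0, 6.0]]

# spmatrix: an axis-0 sum stays 2D (an np.matrix with one row)
norm_mat = sp.csr_matrix(M).sum(axis=0)
assert norm_mat.ndim == 2

# sparray: an axis-0 sum is a plain 1D ndarray, matching numpy behavior
norm_arr = sp.csr_array(M).sum(axis=0)
assert norm_arr.ndim == 1

# np.diag of the 2D one-row result has a single entry, so dividing by it
# scales the whole matrix by one value -- the suspected bug.
assert np.diag(np.asarray(norm_mat)).shape == (1,)
```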

Member

probably easier to review as a part of a separate PR.

Contributor Author

Done in #31924
I reverted the changes here as they are covered in that PR.

@dschult dschult force-pushed the impl_as_sparse_function branch 2 times, most recently from 8994094 to 220e7f1 Compare July 26, 2025 17:32
@dschult dschult force-pushed the impl_as_sparse_function branch 2 times, most recently from 5d190f2 to 2d71b2c Compare August 3, 2025 16:37
@dschult
Contributor Author

dschult commented Aug 4, 2025

Okay, I think this is finally ready for review. I've implemented a config setting to determine the type of sparse container to return (sparray or spmatrix). Both kinds can be used as sparse inputs.

New helper functions in utils._sparse.py:

  • _align_api_if_sparse(X): changes the sparse type if needed
  • _sparse_random, _sparse_eye_array, _sparse_diags: sparse construction functions that work across SciPy versions (1.8 -> 1.12 -> 1.15+)
  • _ensure_sparse_index_int32(X): sets the index dtype of sparse X to int32 if it is safe to do so, and raises if not. Useful when using libraries that only support int32 index arrays.
  • safely_cast_index_arrays: a local version of scipy.sparse._sputils.safely_cast_index_arrays that works across SciPy versions

New SciPy version boolean flags to ease indexing code that depends on the SciPy version (mostly indexing that yields 1D arrays):

  • SCIPY_VERSION_BELOW_1_12
  • SCIPY_VERSION_BELOW_1_15
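One plausible way to define such flags (the PR may derive them differently, e.g. via sklearn's own parse_version helper):

```python
import scipy
import scipy.sparse as sp
from numpy.lib import NumpyVersion

# Module-level booleans, evaluated once at import time.
SCIPY_VERSION_BELOW_1_12 = NumpyVersion(scipy.__version__) < NumpyVersion("1.12.0")
SCIPY_VERSION_BELOW_1_15 = NumpyVersion(scipy.__version__) < NumpyVersion("1.15.0")
```

Call sites can then branch on these flags, e.g. gating the 1D-producing sparray indexing that only exists from SciPy 1.15 on.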

All internal constructors are switched to sparray, e.g. csr_array, csc_array, etc.
Return values are converted to the interface specified in the config setting 'sparse_interface'.
Code, docs, rst files, docstrings, and benchmarks are all updated.

If you want me to separate this into smaller PRs, let me know.

CC: @thomasjpfan , @lorentzenchr, @jjerphan, #26418

@lorentzenchr lorentzenchr added this to the 1.8 milestone Aug 6, 2025
@lorentzenchr
Member

@scikit-learn/core-devs ping for visibility

Member

@adrinjalali adrinjalali left a comment

Seems pretty straightforward to me.

Comment on lines +203 to +204
- `"sparray"`: Return sparse as SciPy sparse array
- `"spmatrix"`: Return sparse as SciPy sparse matrix
Member

Are we changing this default at some point? Should we introduce a deprecation cycle at the same time?


set up function _align_api_if_sparse() with tests
Also functions _ensure_sparse_index_int32() and safely_cast_index_arrays()
and _sparse_eye, _sparse_diags, _sparse_random to span the SciPy <1.12 changes
Introduce SCIPY_VERSION_BELOW_1_12 and SCIPY_VERSION_BELOW_1_15 boolean flags
fix rng keyword arg for old SciPy versions
ensure 2d sparse
convert benchmarks
doc modules
convert csr_matrix to csr_array and CSR, COO, DIA, etc.
make doctests pass
pass tests on older scipy versions. i.e. ensure int32 indices where needed.
make it work for SciPy 1.8
improve test coverage
@dschult dschult force-pushed the impl_as_sparse_function branch from 2d71b2c to 1d724ec Compare August 11, 2025 13:18