Fix mixed dense/sparse array API namespace inspection #29466

ogrisel · 2024-07-11T10:07:55Z

I think this is the most natural way to handle mixed dense and sparse array inputs: rely on the caller code to handle sparse data in specific code branch and let get_namespace ignore those, unless the input data is all-sparse, in which case we return the numpy namespace to avoid errors.

The alternative would be to:

accept all-sparse and mixed scipy sparse / numpy inputs;
reject mixed scipy sparse / non-numpy inputs.

However the latter introduces some asymetry and I am not sure it's justified and the code would be more complex. So unless we have a good reason to chose the second option, I find the option implemented in this PR leaner.

github-actions · 2024-07-11T10:09:14Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: ca8f257. Link to the linter CI: here}

ogrisel · 2024-07-11T10:18:01Z

@betatim @OmarManzoor @adrinjalali please let me know what you think about this proposal.

ogrisel · 2024-07-11T11:58:02Z

BTW, I checked that the original reproducer of #29452 pass with this PR.

adrinjalali

I'm not very comfortable with this, for these concerns:

as a developer, if I call get_namespace and returns something, I would assume I can just use that to apply operations on my data, but here if they're all sparse, then I end up doing normal numpy stuff and it'd fail with many operations.
If for some reason the data is a mix of jax arrays and numpy sparse arrays, we return a very odd namespace (jax) (sorry if my vocabulary is not accurate here)
I would expect the namespace to fail / return None if they're not all on the same space I think.

WDYT?

betatim · 2024-07-11T12:24:34Z

After reading the issue I was thinking that calling xp, array_api_compliant = get_namespace((something_sparse, numpy_array)) should lead to array_api_compliant == False so that we basically keep doing what we were doing before adding get_namespace. Same if all the input arrays are sparse.

Something I don't know: how interchangeable is a Numpy array and a sparse array? For code that we have today, do we have to guard all the code that handles (potentially) sparse arrays with if sp.issparse()? If yes, then maybe get_namespace should just ignore sparse arrays when it determines the namespace and only consider dense arrays?

adrinjalali · 2024-07-11T12:34:41Z

After reading the issue I was thinking that calling xp, array_api_compliant = get_namespace((something_sparse, numpy_array)) should lead to array_api_compliant == False so that we basically keep doing what we were doing before adding get_namespace. Same if all the input arrays are sparse.

100%. You explained better than me my feeling here.

Something I don't know: how interchangeable is a Numpy array and a sparse array?

I think of them as not at all, since you can't do a sparse_array + 1 since it makes it dense.

betatim · 2024-07-11T12:39:30Z

(showing off my lack of knowledge here)

If in a normal Numpy/pre-array API scenario dense and sparse arrays can not directly interact, and instead the author of the code has to take care of using a special namespace/module to act on a sparse array or when combining a sparse and dense array, then I think it is also fair to assume/expect/require that this continues. This would suggest that get_namespace ignores sparse arrays (like it ignores None and strings) when determining the namespace.

A question is: what happens if you pass a scipy sparse array and a jax/torch/cupy array to a function that today accepts scipy sparse and dense Numpy arrays? I don't know if what ever code that exists in such a function that handles interactions between the two will also work when the dense array isn't a Numpy array.

ogrisel · 2024-07-11T12:40:34Z

If for some reason the data is a mix of jax arrays and numpy sparse arrays, we return a very odd namespace (jax) (sorry if my vocabulary is not accurate here)

numpy sparse arrays do not exist. Scipy sparse arrays do and are based on numpy. But our code always has specific branches to handle scipy.

I would expect the namespace to fail / return None if they're not all on the same space I think.

If you call get_namespace(X_numpy, Y_torch, Z_jax) then you get an exception.

There is no array API compatible namespace for scipy sparse arrays. But in practice, code that manipulates sparse matrices often requires access to some symbols from the numpy namespace in practice.

I think it's reasonable to have get_namespace accepts scipy sparse inputs and have specific handling for scipy sparse inputs because we often have code patterns like:

def _some_private_function_that_accepts_dense_and_sparse(
    A_always_dense,
    B_sparse_or_dense,
    other_params,
    xp=None
):
    # A_always_dense and B_sparse_or_dense are validated by the public caller.
    # xp has sometimes already been inspected, other times not, because this
    # function can be called both by public callers that have been updated to support
    # array API but and by other public callers that haven't been updated yet.
    xp, is_array_api_compliant = get_namespace(A_always_dense, B_sparse_or_dense, xp=xp)

    if sp.issparse(B_sparse_or_dense):
        # Code that is scipy sparse specific goes here. We might typically need
        # the `np` module in that branch.
        ...
    else:
        # Do numpy or array API stuff here using xp.
        ...

and it's cleaner / simpler to have a single call to get_namespace at the beginning of such functions.

However, I agree that while this code should work when A is a numpy array and B is a scipy sparse array, there, it is debatable about what to do in the A is a torch tensor and B is a scipy sparse array case.

I will open a new sister PR that implements the alternative I proposed in the description.

ogrisel · 2024-07-11T12:42:44Z

If yes, then maybe get_namespace should just ignore sparse arrays when it determines the namespace and only consider dense arrays?

If all inputs are sparse arrays, we should still output xp is np, no? Or shall we let the user caller use xp in that case and assume that the caller will only use np in branches of code that interact with scipy sparse arrays?

betatim · 2024-07-11T12:55:39Z

Building on your example code with a few edits that I think were typos(?) and making it have different arguments.

def _some_private_function_that_accepts_dense_and_sparse(
    A_sparse_or_dense,
    B_sparse_or_dense,
    other_params
):
    xp, is_array_api_compliant = get_namespace(A_sparse_or_dense, B_sparse_or_dense)

    if sp.issparse(A_sparse_or_dense) or sp.issparse(B_sparse_or_dense):
        # Code that is scipy sparse specific goes here. We might typically need
        # the `np` module in that branch and use it explicitly. Not via `xp` but
        # by referencing `np`.
        # We assume that if one of A or B is dense then it is a Numpy array
        # because it doesn't make sense to pass a torch array and a scipy sparse array(?)
        # but maybe we need a check for this?
        ...
    else:
        # Do numpy or array API stuff here using xp.
        ...

In the case of one or more sparse inputs we completely ignore xp. So I am not sure what it should be. Maybe np?

ogrisel · 2024-07-11T13:05:56Z

Passing xp to get_namespace was intentional in my original snippet (as explained in the inline comment).

I fixed the typo.

betatim · 2024-07-11T13:08:10Z

Nods. Sorry, I removed xp= because for my point it was not needed. Not because I thought it was a typo.

ogrisel · 2024-07-11T13:18:40Z

Nods. Sorry, I removed xp= because for my point it was not needed. Not because I thought it was a typo.

It's a bit needed because it helps explain why we prefer to do a single call to get_namspace at the beginning of such private functions rather than in nested code paths below if not sp.sparse(data) guards.

betatim · 2024-07-11T13:45:56Z

But should we use xp inside the sparse branch or use numpy explicitly? I'm thinking that we could/should use numpy explicitly and maybe raise an error if someone tries to pass a dense jax array and a sparse scipy array.

ogrisel · 2024-07-11T13:55:05Z

But should we use xp inside the sparse branch or use numpy explicitly? I'm thinking that we could/should use numpy explicitly and maybe raise an error if someone tries to pass a dense jax array and a sparse scipy array.

I can try to do that but I am afraid this might be quite disruptive. We still need to solve the problem reported in #29452 for functions that have not yet been (entirely) updated to support array API inputs but receive numpy and scipy arrays under an config context where array API dispatched is enabled. In the case of #29452, it calls something like check_array(sparse_data, accept_sparse=True) which in turns calls get_namespace(sparse_data). In this particular case we can probably just avoid calling get_namespace(sparse_data) when sp.issparse(sparse_data). That will make check_array even more complex though: it really requires some refactoring.

adrinjalali · 2024-07-11T14:12:34Z

Looking at the codes above, I think this to me makes sense:

def _some_private_function_that_accepts_dense_and_sparse(
    A_always_dense,
    B_sparse_or_dense,
    other_params,
    xp=None
):
    if sp.issparse(B_sparse_or_dense):
        _private_operation_sparse(...)
    else:
        _private_operation_dense(...)


def _private_operation_sparse(...):
    ...

def _private_operation_dense(...):
    xp, is_array_api_compliant = get_namespace(A_always_dense, B_sparse_or_dense, xp=xp)
    ...

And I'd say get_namespace either raises on sparse or it returns None, and check_array deals with it?

ogrisel · 2024-07-11T14:38:38Z

And I'd say get_namespace either raises on sparse or it returns None, and check_array deals with it?

I made a minimal change in check_array in #29469.

I will also try to see what happens if I make get_namespace(sparse_data) raise even when array API dispatch it not enabled to have an idea of how invasive would be this kind of change if it had to be replicated throughout our code base.

ogrisel · 2024-07-11T15:04:35Z

I opened the draft PR #29470 to identify all occurrences of this problem.

I already found type_of_target that will be much more challenging to update.

While #29466 (comment) might be the cleanest approach, I am afraid implementing this approach will be very invasive and will require a full refactor of a significant part of our code base to be able to add array API support consistently while not breaking scipy sparse support.

OmarManzoor · 2024-07-11T15:06:12Z

I think the code that @adrinjalali and @ogrisel shared for the case where we have only one array which can be sparse , i.e. B, looks nice, but in the branch that uses the sparse functionality if A is an array api array would that require converting that to numpy and then to a sparse array considering there is functionality in which A and B need to interact. Or we should raise an error?

And similarly if both can be sparse or dense, should we accept either both sparse or dense and raise an error otherwise, or should we do conversions to handle the functionality?

ogrisel · 2024-07-11T15:25:25Z

but in the branch that uses the sparse functionality if A is an array api array would that require converting that to numpy and then to a sparse array considering there is functionality in which A and B need to interact. Or we should raise an error?

Not sure in general. I think we should make such decisions on a case by case basis. I don't think what you describe is a very common pattern in practice.

OmarManzoor · 2024-07-11T15:29:07Z

I don't think what you describe is a very common pattern in practice.

Then when we are considering a mixture of dense and sparse arrays, generally the sparse arrays would be used in their own dedicated computation?

ogrisel · 2024-07-11T15:34:20Z

Then when we are considering a mixture of dense and sparse arrays, generally the sparse arrays would be used in their own dedicated computation?

That could help make the code more maintainable but this is definitely not consistently the case in our code-base. I think we have many functions that have a large common chunk of code that is sparsity agnostic and just a few locally branching code paths that depend on sp.issparse guards. Refactoring such functions might be a lot of work and might also introduce a lot of redundant code.

ogrisel · 2024-07-11T15:46:09Z

The investigative CI run triggered by #29470 has found 1733 test failing because of this problem:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=68639&view=logs&j=dde5042c-7464-5d47-9507-31bdd2ee0a3a&t=4bd2dad8-62b3-5bf9-08a5-a9880c530c94

I suspect that many of those are redundant however, it reveals that the approach propose by @adrinjalali would be quite tedious to implement and would introduce a lot of redundant code:

All in all, I think I am in favor of sticking with the solution proposed in this PR (in #29466) and have an explicit contributor guideline to always use np instead of xp under sp.isparse branches.

adrinjalali · 2024-07-11T15:55:21Z

I agree #29466 (comment) might make things unnecessarily tedious. So happy to figure out a way that doesn't need get_namespace to raise I guess ever?

But still not happy with the semantics of the output we get in this case. I like @betatim 's intuition when it comes with calling the function.

ogrisel · 2024-07-11T15:57:13Z

But still not happy with the semantics of the output we get in this case. I like @betatim 's intuition when it comes with calling the function.

What do you suggest instead? I do not see any alternative.

ogrisel · 2024-07-11T15:57:39Z

I could improve the docstring and the array API documentation page though.

adrinjalali · 2024-07-11T17:21:08Z

I'll have a more detailed look tomorrow and see if I can come up with a better proposal

ogrisel · 2024-08-08T09:40:22Z

Let's consolidate the finalization of this fix in #29476 which I believe is closer to reaching a consensus in its review.

Fix mixed dense/sparse array API namespace inspection

c4a1cbc

ogrisel added the Array API label Jul 11, 2024

github-actions bot added the module:utils label Jul 11, 2024

DOC document the fix in the changelog

ca8f257

adrinjalali reviewed Jul 11, 2024

View reviewed changes

ogrisel mentioned this pull request Jul 11, 2024

FIX make check_array accept sparse inputs when array api dispatch is enabled #29469

Closed

ogrisel mentioned this pull request Jul 11, 2024

Investigate impact of rejecting sparse inputs in get_namespace #29470

Closed

lesteve added CUDA CI and removed CUDA CI labels Jul 12, 2024

lesteve mentioned this pull request Jul 12, 2024

CI Move CUDA CI to pull_request trigger #29376

Merged

5 tasks

adrinjalali mentioned this pull request Jul 12, 2024

FIX sparse arrays not included in array API dispatch #29476

Merged

ogrisel closed this Aug 8, 2024

Uh oh!

Fix mixed dense/sparse array API namespace inspection #29466

Fix mixed dense/sparse array API namespace inspection #29466

Uh oh!

Conversation

ogrisel commented Jul 11, 2024

Uh oh!

github-actions bot commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

ogrisel commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024

Uh oh!

adrinjalali left a comment

Choose a reason for hiding this comment

Uh oh!

betatim commented Jul 11, 2024

Uh oh!

adrinjalali commented Jul 11, 2024

Uh oh!

betatim commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

betatim commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

betatim commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

betatim commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adrinjalali commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

OmarManzoor commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Jul 11, 2024

Uh oh!

OmarManzoor commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adrinjalali commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024

Uh oh!

ogrisel commented Jul 11, 2024

Uh oh!

adrinjalali commented Jul 11, 2024

Uh oh!

ogrisel commented Aug 8, 2024

Uh oh!

Uh oh!

github-actions bot commented Jul 11, 2024 •

edited

Loading

ogrisel commented Jul 11, 2024 •

edited

Loading

ogrisel commented Jul 11, 2024 •

edited

Loading

betatim commented Jul 11, 2024 •

edited

Loading

ogrisel commented Jul 11, 2024 •

edited

Loading

ogrisel commented Jul 11, 2024 •

edited

Loading

ogrisel commented Jul 11, 2024 •

edited

Loading

ogrisel commented Jul 11, 2024 •

edited

Loading

OmarManzoor commented Jul 11, 2024 •

edited

Loading

ogrisel commented Jul 11, 2024 •

edited

Loading