Skip to content

Fix mixed dense/sparse array API namespace inspection #29466

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

ogrisel
Copy link
Member

@ogrisel ogrisel commented Jul 11, 2024

Fixes #29452.

I think this is the most natural way to handle mixed dense and sparse array inputs: rely on the caller code to handle sparse data in specific code branch and let get_namespace ignore those, unless the input data is all-sparse, in which case we return the numpy namespace to avoid errors.

The alternative would be to:

  • accept all-sparse and mixed scipy sparse / numpy inputs;
  • reject mixed scipy sparse / non-numpy inputs.

However the latter introduces some asymetry and I am not sure it's justified and the code would be more complex. So unless we have a good reason to chose the second option, I find the option implemented in this PR leaner.

Copy link

github-actions bot commented Jul 11, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ca8f257. Link to the linter CI: here

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

@betatim @OmarManzoor @adrinjalali please let me know what you think about this proposal.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

BTW, I checked that the original reproducer of #29452 pass with this PR.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very comfortable with this, for these concerns:

  • as a developer, if I call get_namespace and returns something, I would assume I can just use that to apply operations on my data, but here if they're all sparse, then I end up doing normal numpy stuff and it'd fail with many operations.
  • If for some reason the data is a mix of jax arrays and numpy sparse arrays, we return a very odd namespace (jax) (sorry if my vocabulary is not accurate here)
  • I would expect the namespace to fail / return None if they're not all on the same space I think.

WDYT?

@betatim
Copy link
Member

betatim commented Jul 11, 2024

After reading the issue I was thinking that calling xp, array_api_compliant = get_namespace((something_sparse, numpy_array)) should lead to array_api_compliant == False so that we basically keep doing what we were doing before adding get_namespace. Same if all the input arrays are sparse.

Something I don't know: how interchangeable is a Numpy array and a sparse array? For code that we have today, do we have to guard all the code that handles (potentially) sparse arrays with if sp.issparse()? If yes, then maybe get_namespace should just ignore sparse arrays when it determines the namespace and only consider dense arrays?

@adrinjalali
Copy link
Member

After reading the issue I was thinking that calling xp, array_api_compliant = get_namespace((something_sparse, numpy_array)) should lead to array_api_compliant == False so that we basically keep doing what we were doing before adding get_namespace. Same if all the input arrays are sparse.

100%. You explained better than me my feeling here.

Something I don't know: how interchangeable is a Numpy array and a sparse array?

I think of them as not at all, since you can't do a sparse_array + 1 since it makes it dense.

@betatim
Copy link
Member

betatim commented Jul 11, 2024

(showing off my lack of knowledge here)

If in a normal Numpy/pre-array API scenario dense and sparse arrays can not directly interact, and instead the author of the code has to take care of using a special namespace/module to act on a sparse array or when combining a sparse and dense array, then I think it is also fair to assume/expect/require that this continues. This would suggest that get_namespace ignores sparse arrays (like it ignores None and strings) when determining the namespace.

A question is: what happens if you pass a scipy sparse array and a jax/torch/cupy array to a function that today accepts scipy sparse and dense Numpy arrays? I don't know if what ever code that exists in such a function that handles interactions between the two will also work when the dense array isn't a Numpy array.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

If for some reason the data is a mix of jax arrays and numpy sparse arrays, we return a very odd namespace (jax) (sorry if my vocabulary is not accurate here)

numpy sparse arrays do not exist. Scipy sparse arrays do and are based on numpy. But our code always has specific branches to handle scipy.

I would expect the namespace to fail / return None if they're not all on the same space I think.

If you call get_namespace(X_numpy, Y_torch, Z_jax) then you get an exception.

There is no array API compatible namespace for scipy sparse arrays. But in practice, code that manipulates sparse matrices often requires access to some symbols from the numpy namespace in practice.

I think it's reasonable to have get_namespace accepts scipy sparse inputs and have specific handling for scipy sparse inputs because we often have code patterns like:

def _some_private_function_that_accepts_dense_and_sparse(
    A_always_dense,
    B_sparse_or_dense,
    other_params,
    xp=None
):
    # A_always_dense and B_sparse_or_dense are validated by the public caller.
    # xp has sometimes already been inspected, other times not, because this
    # function can be called both by public callers that have been updated to support
    # array API but and by other public callers that haven't been updated yet.
    xp, is_array_api_compliant = get_namespace(A_always_dense, B_sparse_or_dense, xp=xp)

    if sp.issparse(B_sparse_or_dense):
        # Code that is scipy sparse specific goes here. We might typically need
        # the `np` module in that branch.
        ...
    else:
        # Do numpy or array API stuff here using xp.
        ...

and it's cleaner / simpler to have a single call to get_namespace at the beginning of such functions.

However, I agree that while this code should work when A is a numpy array and B is a scipy sparse array, there, it is debatable about what to do in the A is a torch tensor and B is a scipy sparse array case.

I will open a new sister PR that implements the alternative I proposed in the description.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

If yes, then maybe get_namespace should just ignore sparse arrays when it determines the namespace and only consider dense arrays?

If all inputs are sparse arrays, we should still output xp is np, no? Or shall we let the user caller use xp in that case and assume that the caller will only use np in branches of code that interact with scipy sparse arrays?

@betatim
Copy link
Member

betatim commented Jul 11, 2024

Building on your example code with a few edits that I think were typos(?) and making it have different arguments.

def _some_private_function_that_accepts_dense_and_sparse(
    A_sparse_or_dense,
    B_sparse_or_dense,
    other_params
):
    xp, is_array_api_compliant = get_namespace(A_sparse_or_dense, B_sparse_or_dense)

    if sp.issparse(A_sparse_or_dense) or sp.issparse(B_sparse_or_dense):
        # Code that is scipy sparse specific goes here. We might typically need
        # the `np` module in that branch and use it explicitly. Not via `xp` but
        # by referencing `np`.
        # We assume that if one of A or B is dense then it is a Numpy array
        # because it doesn't make sense to pass a torch array and a scipy sparse array(?)
        # but maybe we need a check for this?
        ...
    else:
        # Do numpy or array API stuff here using xp.
        ...

In the case of one or more sparse inputs we completely ignore xp. So I am not sure what it should be. Maybe np?

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

Passing xp to get_namespace was intentional in my original snippet (as explained in the inline comment).

I fixed the typo.

@betatim
Copy link
Member

betatim commented Jul 11, 2024

Nods. Sorry, I removed xp= because for my point it was not needed. Not because I thought it was a typo.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

Nods. Sorry, I removed xp= because for my point it was not needed. Not because I thought it was a typo.

It's a bit needed because it helps explain why we prefer to do a single call to get_namspace at the beginning of such private functions rather than in nested code paths below if not sp.sparse(data) guards.

@betatim
Copy link
Member

betatim commented Jul 11, 2024

But should we use xp inside the sparse branch or use numpy explicitly? I'm thinking that we could/should use numpy explicitly and maybe raise an error if someone tries to pass a dense jax array and a sparse scipy array.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

But should we use xp inside the sparse branch or use numpy explicitly? I'm thinking that we could/should use numpy explicitly and maybe raise an error if someone tries to pass a dense jax array and a sparse scipy array.

I can try to do that but I am afraid this might be quite disruptive. We still need to solve the problem reported in #29452 for functions that have not yet been (entirely) updated to support array API inputs but receive numpy and scipy arrays under an config context where array API dispatched is enabled. In the case of #29452, it calls something like check_array(sparse_data, accept_sparse=True) which in turns calls get_namespace(sparse_data). In this particular case we can probably just avoid calling get_namespace(sparse_data) when sp.issparse(sparse_data). That will make check_array even more complex though: it really requires some refactoring.

@adrinjalali
Copy link
Member

Looking at the codes above, I think this to me makes sense:

def _some_private_function_that_accepts_dense_and_sparse(
    A_always_dense,
    B_sparse_or_dense,
    other_params,
    xp=None
):
    if sp.issparse(B_sparse_or_dense):
        _private_operation_sparse(...)
    else:
        _private_operation_dense(...)


def _private_operation_sparse(...):
    ...

def _private_operation_dense(...):
    xp, is_array_api_compliant = get_namespace(A_always_dense, B_sparse_or_dense, xp=xp)
    ...

And I'd say get_namespace either raises on sparse or it returns None, and check_array deals with it?

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

And I'd say get_namespace either raises on sparse or it returns None, and check_array deals with it?

I made a minimal change in check_array in #29469.

I will also try to see what happens if I make get_namespace(sparse_data) raise even when array API dispatch it not enabled to have an idea of how invasive would be this kind of change if it had to be replicated throughout our code base.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

I opened the draft PR #29470 to identify all occurrences of this problem.

I already found type_of_target that will be much more challenging to update.

While #29466 (comment) might be the cleanest approach, I am afraid implementing this approach will be very invasive and will require a full refactor of a significant part of our code base to be able to add array API support consistently while not breaking scipy sparse support.

@OmarManzoor
Copy link
Contributor

OmarManzoor commented Jul 11, 2024

I think the code that @adrinjalali and @ogrisel shared for the case where we have only one array which can be sparse , i.e. B, looks nice, but in the branch that uses the sparse functionality if A is an array api array would that require converting that to numpy and then to a sparse array considering there is functionality in which A and B need to interact. Or we should raise an error?

And similarly if both can be sparse or dense, should we accept either both sparse or dense and raise an error otherwise, or should we do conversions to handle the functionality?

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

but in the branch that uses the sparse functionality if A is an array api array would that require converting that to numpy and then to a sparse array considering there is functionality in which A and B need to interact. Or we should raise an error?

Not sure in general. I think we should make such decisions on a case by case basis. I don't think what you describe is a very common pattern in practice.

@OmarManzoor
Copy link
Contributor

I don't think what you describe is a very common pattern in practice.

Then when we are considering a mixture of dense and sparse arrays, generally the sparse arrays would be used in their own dedicated computation?

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

Then when we are considering a mixture of dense and sparse arrays, generally the sparse arrays would be used in their own dedicated computation?

That could help make the code more maintainable but this is definitely not consistently the case in our code-base. I think we have many functions that have a large common chunk of code that is sparsity agnostic and just a few locally branching code paths that depend on sp.issparse guards. Refactoring such functions might be a lot of work and might also introduce a lot of redundant code.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

The investigative CI run triggered by #29470 has found 1733 test failing because of this problem:

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=68639&view=logs&j=dde5042c-7464-5d47-9507-31bdd2ee0a3a&t=4bd2dad8-62b3-5bf9-08a5-a9880c530c94

I suspect that many of those are redundant however, it reveals that the approach propose by @adrinjalali would be quite tedious to implement and would introduce a lot of redundant code:

All in all, I think I am in favor of sticking with the solution proposed in this PR (in #29466) and have an explicit contributor guideline to always use np instead of xp under sp.isparse branches.

@adrinjalali
Copy link
Member

I agree #29466 (comment) might make things unnecessarily tedious. So happy to figure out a way that doesn't need get_namespace to raise I guess ever?

But still not happy with the semantics of the output we get in this case. I like @betatim 's intuition when it comes with calling the function.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

But still not happy with the semantics of the output we get in this case. I like @betatim 's intuition when it comes with calling the function.

What do you suggest instead? I do not see any alternative.

@ogrisel
Copy link
Member Author

ogrisel commented Jul 11, 2024

I could improve the docstring and the array API documentation page though.

@adrinjalali
Copy link
Member

I'll have a more detailed look tomorrow and see if I can come up with a better proposal

@ogrisel
Copy link
Member Author

ogrisel commented Aug 8, 2024

Let's consolidate the finalization of this fix in #29476 which I believe is closer to reaching a consensus in its review.

@ogrisel ogrisel closed this Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Array API regression in homogeneity_completeness_v_measure?
5 participants