Skip to content

BUG: Fix NEP50 weak-type scalar logic for binary operations on masked arrays #28567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

rhugonnet
Copy link

@rhugonnet rhugonnet commented Mar 21, 2025

This PR adds NEP50 logic (https://numpy.org/neps/nep-0050-scalar-promotion.html) to masked arrays, which was overlooked in all binary operations.

For in-place operations specifically, this resulted in errors, see #27679. But in general, wrong dtypes were returned for all binary operations.

Tagging @mattip @seberg as related to the above issue and NEP50 that you are familiar with.

Resolves #27679

Short example of the bug

For an ndarray, this passes:

import numpy as np
res = np.array([1], dtype=np.uint8) + 2
assert res.dtype == np.uint8

The masked array equivalent fails:

res = np.ma.masked_array([1], dtype=np.uint8) + 2
assert res.dtype == np.uint8

Logic of the fix

Binary ops not in-place

We modify getdata() with a return_pyscalar argument to optionally return the scalar itself (instead of always converting to an array), and update the getdata() calls in _MaskedBinaryOperation and _DomainedBinaryOperation to use the scalar directly in their call to the corresponding np.umath function.

This fixes the issue for all not in-place binary ops.

Binary ops in-place

The problem:
For in-place operations, the problem is more delicate. As masked arrays vow to preserve the .data under the mask during an operation (see first lines here: https://numpy.org/doc/stable/reference/maskedarray.generic.html#operations-on-masked-arrays), they need a specific in-place logic.
In case of binary ops, the resulting .data is preserved under the final mask with respect to the left term, so self.data for in-place. To perform this without adding more memory usage, the existing logic in the ma replaces data under the future mask in other_data with the invariant value for the operation (0 for add/subtract, 1 for power/division), before passing it to the corresponding np.umath.
Therefore, we cannot rely on the np.umath function to take care of the NEP50 scalar logic, as the scalar other_data has to be converted to an array to substitute some cells to these invariant values before calling the np.umath.

Proposed fix:
We still use the return_pyscalar argument of getdata(), as for not in-place above, to get other_data as a scalar. Then, in the case where it is a scalar, we run an equivalent of result_dtype() on the scalar+array for this operation. However, result_dtype itself is not sufficient here. We need a function that can take as input (scalar, array_type, operator) and can both return the proper result_dtype for NEP50, and flag if we need to raise an OverflowError.

Does such a function exist internally somewhere in NumPy? (I couldn't find any)

For now, as a placeholder operation simulating the same behaviour, I've used a dubious call to a single value numpy array + the scalar for this operation.
For example for __iadd__:

if isinstance(other_data, (bool, int, float)):
    placehold_arr = np.array([0], dtype=self._data.dtype)  # Placeholder data
    placehold_arr.__iadd__(other_data)   # Raises appropriate errors
    other_dtype = placehold_arr.dtype   # Returns appropriate dtype to create a other_data as array

This fixes the issue for all in-place binary ops.

New tests

First, I've mirrored all the relevant tests in https://github.com/numpy/numpy/blob/main/numpy/_core/tests/test_nep50_promotions.py for masked arrays.

However, as those don't cover in-place operations, I've also added a test_nep50_inplace_ma that checks the behaviour for all combinations of in-place ops, scalars and dtypes.

@tylerjereddy tylerjereddy added the component: numpy.ma masked arrays label Mar 21, 2025
@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Mar 24, 2025
@rhugonnet
Copy link
Author

@mattip @seberg @tylerjereddy @ngoldbaum Any feedback on this?

Just in case you didn't dive in: It's relatively minor changes (almost all line changes are additional tests).

If we can easily define a function _result_dtype_raise_overflow(scalar, array_dtype, operator), it'll just be about calling it within all in-place binary ops, instead of the current placeholder method that replicates that behaviour.
And for not in-place ops, it's just those couple lines passing the scalar to the op directly.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into it, sorry masked arrays tend to not get much attention. I won't be able to look at this for a while, but the rough approach seems good, that "hack" looks a bit awkward, though.

numpy/ma/core.py Outdated
other_data = getdata(other, return_scalar=True)
if isinstance(other_data, (bool, int, float)):
# Trick to raise appropriate error and get output dtype
placehold_arr = np.array([0], dtype=self._data.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trick seems a bit strange to me. Where does it actually fail if you don't do this here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is where we'd need a _result_dtype_raise_overflow(scalar, array_dtype, operator) function to give us the dtype or raise the OverflowError, and cleanly replace this hack.

Without this bit, the code would fail here (even with other_data now being returned as a scalar):

 other_data = np.where(self._mask, other_data.type(0), other_data)
 self._data.__isub__(other_data)

It fails because the python scalar has to be converted into an array before the actual operation.
This conversion is (unfortunately) required in order replace masked values by the neutral value for the op (0 for add/sub, 1 for pow/div), as masked array vow not to touch masked values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, you only need the promotion result? The other error will happen naturally when the mask is created?
This is an possibility, I think:

dtype = np.true_divide.resolve_dtypes((np.dtype(np.int8), int, np.dtype(np.int8)), casting="unsafe")[1]

but I would prefer to avoid that if we can think of a way. We could use where= on the ufunc call (as part of the __itruediv__), but I suppose that will make things slower.

Copy link
Author

@rhugonnet rhugonnet Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that'd be ideal, but I couldn't think of any option. I was afraid that using where for a in-place operation would unnecessarily increase memory usage (I think we'd need to conserve self._data, and also have the result of op(self._data, other_data) in order to run where?)
Depending on how __iadd__ for ndarray works in the background, maybe there's a way, but I wouldn't know.

Following your advice, I tried the option of changing the hack everywhere into this:

if isinstance(type(other_data), (int, float, complex)):
    other_dtype = np.ufunc.resolve_dtypes(
        (self._data.dtype, type(other_data), None), casting="unsafe")[1]

with ufunc the corresponding operation to the __ioperation__.

It doesn't seem to give exactly the same other_dtype output as the previous hack that worked for all.
Now, combinations of np.uint8 and Python int fail for iadd, isub, imul, ifloordiv and ipow (so everything except truediv and div) during the __ioperation__:

E       numpy._core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'floor_divide' output from dtype('int64') to dtype('uint32') with casting rule 'same_kind'

Any ideas?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And unfortunately, the OverflowError doesn't happen naturally with the above option.
For instance, this doesn't raise one while it does with np.array():

np.ma.masked_array([1], dtype=np.uint8) + 300

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I forgot the part about using in instead of isinstance!

All passing with the following instead of the hack (including OverflowError!) 😁 :

if type(other_data) in (int, float, complex):
  other_dtype = np.ufunc.resolve_dtypes(
      (self._data.dtype, type(other_data), None), casting="unsafe")[1]

numpy/ma/core.py Outdated
other_data = getdata(other)
other_data = np.where(self._mask, other_data.dtype.type(1), other_data)
other_data = getdata(other, return_scalar=True)
if isinstance(other_data, (bool, int, float)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bool, int, float) is wrong, it has to be (int, float, complex). Also make sure to use type(other_data) in ... for an exact type check.
(Yes, I realize that there is some discussion about whether it should be an exact check, but it is what we use currently, and float64 is a float subclass so isinstance may at times be confusing.)

Clearly this applies to all occurances. Also rename return_scalar to return_pyscalar or return_pyliterals even to make it sure that this is not about preserving scalars in general.

@rhugonnet
Copy link
Author

Pushed the changes, all works nicely with np.ufunc.resolve_dtypes for in-place! The last test function I added checks this quite thoroughly.

If there is no where solution that comes to mind without computing/memory trouble, I'm happy with this! 🙂

@ngoldbaum
Copy link
Member

Ping @seberg - mind giving this another once-over?

@rhugonnet it looks like there's a merge conflict - mind clearing that?

Sorry this got dropped. We just looked at this at a triage meeting where Sebastian wasn't there, so I hoped that pinging Sebastian is sufficient to say this has been triaged.

@ngoldbaum ngoldbaum added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Aug 20, 2025
@@ -710,7 +710,7 @@ def get_masked_subclass(*arrays):
return rcls


def getdata(a, subok=True):
def getdata(a, subok=True, return_pyscalar=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing as it's being used as keyword-only, so maybe we should just enforce it?

Suggested change
def getdata(a, subok=True, return_pyscalar=False):
def getdata(a, subok=True, *, return_pyscalar=False):

@rhugonnet
Copy link
Author

Thanks for picking this up again @ngoldbaum!

Fixed the two merge conflicts: I kept the idiv removal of #28927, and the linting change of #28755.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug component: numpy.ma masked arrays triaged Issue/PR that was discussed in a triage meeting
Projects
Status: Awaiting a code review
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent casting error for masked-arrays in-place operations
5 participants