-
-
Notifications
You must be signed in to change notification settings - Fork 11.2k
BUG: Fix NEP50 weak-type scalar logic for binary operations on masked arrays #28567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@mattip @seberg @tylerjereddy @ngoldbaum Any feedback on this? Just in case you didn't dive in: It's relatively minor changes (almost all line changes are additional tests). If we can easily define a function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into it, sorry masked arrays tend to not get much attention. I won't be able to look at this for a while, but the rough approach seems good, that "hack" looks a bit awkward, though.
numpy/ma/core.py
Outdated
other_data = getdata(other, return_scalar=True) | ||
if isinstance(other_data, (bool, int, float)): | ||
# Trick to raise appropriate error and get output dtype | ||
placehold_arr = np.array([0], dtype=self._data.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This trick seems a bit strange to me. Where does it actually fail if you don't do this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is where we'd need a _result_dtype_raise_overflow(scalar, array_dtype, operator)
function to give us the dtype
or raise the OverflowError
, and cleanly replace this hack.
Without this bit, the code would fail here (even with other_data
now being returned as a scalar):
other_data = np.where(self._mask, other_data.type(0), other_data)
self._data.__isub__(other_data)
It fails because the python scalar has to be converted into an array before the actual operation.
This conversion is (unfortunately) required in order replace masked values by the neutral value for the op (0 for add/sub, 1 for pow/div), as masked array vow not to touch masked values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, you only need the promotion result? The other error will happen naturally when the mask is created?
This is an possibility, I think:
dtype = np.true_divide.resolve_dtypes((np.dtype(np.int8), int, np.dtype(np.int8)), casting="unsafe")[1]
but I would prefer to avoid that if we can think of a way. We could use where=
on the ufunc call (as part of the __itruediv__
), but I suppose that will make things slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that'd be ideal, but I couldn't think of any option. I was afraid that using where
for a in-place operation would unnecessarily increase memory usage (I think we'd need to conserve self._data
, and also have the result of op(self._data, other_data)
in order to run where
?)
Depending on how __iadd__
for ndarray
works in the background, maybe there's a way, but I wouldn't know.
Following your advice, I tried the option of changing the hack everywhere into this:
if isinstance(type(other_data), (int, float, complex)):
other_dtype = np.ufunc.resolve_dtypes(
(self._data.dtype, type(other_data), None), casting="unsafe")[1]
with ufunc
the corresponding operation to the __ioperation__
.
It doesn't seem to give exactly the same other_dtype
output as the previous hack that worked for all.
Now, combinations of np.uint8
and Python int
fail for iadd
, isub
, imul
, ifloordiv
and ipow
(so everything except truediv
and div
) during the __ioperation__
:
E numpy._core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'floor_divide' output from dtype('int64') to dtype('uint32') with casting rule 'same_kind'
Any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And unfortunately, the OverflowError
doesn't happen naturally with the above option.
For instance, this doesn't raise one while it does with np.array()
:
np.ma.masked_array([1], dtype=np.uint8) + 300
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I forgot the part about using in
instead of isinstance
!
All passing with the following instead of the hack (including OverflowError
!) 😁 :
if type(other_data) in (int, float, complex):
other_dtype = np.ufunc.resolve_dtypes(
(self._data.dtype, type(other_data), None), casting="unsafe")[1]
numpy/ma/core.py
Outdated
other_data = getdata(other) | ||
other_data = np.where(self._mask, other_data.dtype.type(1), other_data) | ||
other_data = getdata(other, return_scalar=True) | ||
if isinstance(other_data, (bool, int, float)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(bool, int, float)
is wrong, it has to be (int, float, complex)
. Also make sure to use type(other_data) in ...
for an exact type check.
(Yes, I realize that there is some discussion about whether it should be an exact check, but it is what we use currently, and float64
is a float
subclass so isinstance
may at times be confusing.)
Clearly this applies to all occurances. Also rename return_scalar
to return_pyscalar
or return_pyliterals
even to make it sure that this is not about preserving scalars in general.
Pushed the changes, all works nicely with If there is no |
Ping @seberg - mind giving this another once-over? @rhugonnet it looks like there's a merge conflict - mind clearing that? Sorry this got dropped. We just looked at this at a triage meeting where Sebastian wasn't there, so I hoped that pinging Sebastian is sufficient to say this has been triaged. |
@@ -710,7 +710,7 @@ def get_masked_subclass(*arrays): | |||
return rcls | |||
|
|||
|
|||
def getdata(a, subok=True): | |||
def getdata(a, subok=True, return_pyscalar=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing as it's being used as keyword-only, so maybe we should just enforce it?
def getdata(a, subok=True, return_pyscalar=False): | |
def getdata(a, subok=True, *, return_pyscalar=False): |
Thanks for picking this up again @ngoldbaum! Fixed the two merge conflicts: I kept the |
This PR adds NEP50 logic (https://numpy.org/neps/nep-0050-scalar-promotion.html) to masked arrays, which was overlooked in all binary operations.
For in-place operations specifically, this resulted in errors, see #27679. But in general, wrong dtypes were returned for all binary operations.
Tagging @mattip @seberg as related to the above issue and NEP50 that you are familiar with.
Resolves #27679
Short example of the bug
For an ndarray, this passes:
The masked array equivalent fails:
Logic of the fix
Binary ops not in-place
We modify
getdata()
with areturn_pyscalar
argument to optionally return the scalar itself (instead of always converting to an array), and update thegetdata()
calls in_MaskedBinaryOperation
and_DomainedBinaryOperation
to use the scalar directly in their call to the correspondingnp.umath
function.This fixes the issue for all not in-place binary ops.
Binary ops in-place
The problem:
For in-place operations, the problem is more delicate. As masked arrays vow to preserve the
.data
under themask
during an operation (see first lines here: https://numpy.org/doc/stable/reference/maskedarray.generic.html#operations-on-masked-arrays), they need a specific in-place logic.In case of binary ops, the resulting
.data
is preserved under the final mask with respect to the left term, soself.data
for in-place. To perform this without adding more memory usage, the existing logic in thema
replaces data under the future mask inother_data
with the invariant value for the operation (0 for add/subtract, 1 for power/division), before passing it to the correspondingnp.umath
.Therefore, we cannot rely on the
np.umath
function to take care of the NEP50 scalar logic, as the scalarother_data
has to be converted to an array to substitute some cells to these invariant values before calling thenp.umath
.Proposed fix:
We still use the
return_pyscalar
argument ofgetdata()
, as for not in-place above, to getother_data
as a scalar. Then, in the case where it is a scalar, we run an equivalent ofresult_dtype()
on the scalar+array for this operation. However,result_dtype
itself is not sufficient here. We need a function that can take as input(scalar, array_type, operator)
and can both return the properresult_dtype
for NEP50, and flag if we need to raise anOverflowError
.Does such a function exist internally somewhere in NumPy? (I couldn't find any)
For now, as a placeholder operation simulating the same behaviour, I've used a dubious call to a single value numpy array + the scalar for this operation.
For example for
__iadd__
:This fixes the issue for all in-place binary ops.
New tests
First, I've mirrored all the relevant tests in https://github.com/numpy/numpy/blob/main/numpy/_core/tests/test_nep50_promotions.py for masked arrays.
However, as those don't cover in-place operations, I've also added a
test_nep50_inplace_ma
that checks the behaviour for all combinations of in-place ops, scalars and dtypes.