MAINT: fix unique for masked arrays #18968

ImenRajhi · 2021-05-09T17:50:32Z

Co-authored-by: alinanesen alina.nesen@gmail.com / GitHub username: @alinanesen

Modified np.ma.unique to call np.unique on the array with the masked values removed.

Co-authored-by: alinanesen <alina.nesen@gmail.com>

rgommers · 2021-05-09T20:40:54Z

The tests are failing because the input to np.ma.unique can be either a regular numpy array or a masked array:

>       output = np.unique(ar1.data[~ar1.mask],
                           return_index=return_index,
                           return_inverse=return_inverse)
E       AttributeError: 'numpy.ndarray' object has no attribute 'mask'

So you'll want to use the mask only if the input is actually a masked array. You can special-case this with an isinstance(ar1, MaskedArray) check, for example.
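As an illustration of the special-casing suggested above, here is a minimal sketch. The helper name drop_masked is hypothetical and is not part of numpy or of this PR; it uses MaskedArray.compressed() rather than indexing with ~ar1.mask, on the assumption that dropping masked entries is all that is needed.

```python
import numpy as np
from numpy.ma import MaskedArray


def drop_masked(ar1):
    """Hypothetical helper: return a 1-D ndarray without masked entries."""
    if isinstance(ar1, MaskedArray):
        # compressed() returns the non-masked data as a plain 1-D ndarray
        return ar1.compressed()
    # Plain ndarrays (and other array-likes) have no .mask attribute
    return np.asarray(ar1).ravel()


plain = np.array([3, 1, 3])
masked = np.ma.array([3, 1, 3, 7], mask=[False, False, False, True])

unique_plain = np.unique(drop_masked(plain))
unique_masked = np.unique(drop_masked(masked))
print(unique_plain, unique_masked)
```

Both inputs go through the same code path afterwards, so the AttributeError from the traceback above cannot occur for a plain ndarray.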

rgommers · 2021-05-09T20:42:45Z

Try to run the tests locally to see this failure, then it's much easier to iterate than if you have to wait on CI. For example:

python runtests.py -s ma

to run all the numpy.ma tests.

ImenRajhi · 2021-05-11T13:40:10Z

Thanks for pointing that out @rgommers.

After a second look I realized we have misdiagnosed the problem. Here are my findings:

The problem behind this behavior: the relative ordering of masked values and the dtype's extreme values is not defined when sorting.
How exactly this causes the behavior: unique calls sort, and sort does not know how to order masked values relative to the dtype extremes. For example, with np.uint8, sort([255, --, 255, --]) outputs [255, --, 255, --]. unique then compares neighbouring elements pairwise and keeps everything, hence the weird output [255, --, 255, --]. The same input rearranged as [255, 255, --, --] gives the correct result [255, --].
The solution: define the ordering of masked values relative to the dtype extremes.

ImenRajhi · 2021-05-11T22:25:47Z

More details:
np.ma.argsort fills the masked array with fill_value. The fill value always ends up being the dtype maximum, because fill_value is passed as None and endwith defaults to True. The input [255, --, 255, --] therefore becomes the filled array [255, 255, 255, 255]. That filled array is then sorted, and this is where the strange sorting behavior comes from.
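The filling step described above can be reproduced by hand. This is only a sketch of the mechanism: it calls filled() directly with the dtype maximum instead of going through np.ma.argsort, and the underlying data values chosen for the masked slots (0 here) are an assumption for illustration.

```python
import numpy as np

# Masked uint8 array corresponding to [255, --, 255, --]
x = np.ma.array([255, 0, 255, 0],
                mask=[False, True, False, True],
                dtype=np.uint8)

# With endwith=True and no explicit fill_value, masked slots are
# replaced by the dtype maximum before sorting
fill = np.iinfo(x.dtype).max
filled = x.filled(fill)
print(filled)

# After filling, masked and unmasked entries cannot be told apart
indistinguishable = bool(np.all(filled == fill))
```

Once every element equals 255, a sort on the filled data has no information left to place the masked entries after the real 255 values.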

rgommers · 2021-05-12T11:40:10Z

Nice analysis @ImenRajhi. Indeed, sort is a little odd:

>>> x = np.ma.array([4,3,2,1], mask=[True, False, True, False])
>>> np.ma.sort(x)
masked_array(data=[1, 3, --, --],
             mask=[False, False,  True,  True],
       fill_value=999999)
>>> y = np.ma.array([4,255,2,1], dtype=np.uint8, mask=[True, False, True, False])
>>> np.ma.sort(y)
masked_array(data=[1, --, 255, --],
             mask=[False,  True, False,  True],
       fill_value=999999,
            dtype=uint8)

The current behavior was thought about and documented as being undefined in gh-8678. That said, I don't see a real discussion about it on that PR, and I also don't see why it would not be possible to sort masked values to the end. Correctness is more important than performance here. argsort and sort behavior should match, and argsort does the actual work, so MaskedArray.argsort is the method to fix.
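One possible way to sort masked values to the end is to use the mask itself as the primary sort key. The following is a rough sketch under that assumption, not the fix proposed in this PR; argsort_masked_last is a hypothetical name, and the sketch only handles the flattened 1-D case.

```python
import numpy as np


def argsort_masked_last(a):
    """Hypothetical sketch: argsort with masked entries always placed last."""
    a = np.ma.asarray(a)
    data = np.ma.getdata(a).ravel()
    mask = np.ma.getmaskarray(a).ravel()
    # np.lexsort treats the LAST key as the primary one:
    # first order by mask (False < True), then by data within each group
    return np.lexsort((data, mask))


y = np.ma.array([4, 255, 2, 1], dtype=np.uint8,
                mask=[True, False, True, False])
order = argsort_masked_last(y)
sorted_y = y[order]
print(order, sorted_y)
```

Because the mask is a separate key, the result no longer depends on whether an unmasked value happens to equal the dtype maximum. The extra key does add work compared with sorting the filled data alone.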

rgommers · 2021-05-12T11:41:23Z

"The current behavior was thought about and documented as being undefined in gh-8678"

@eric-wieser it's 4 years ago, but do you recollect why you decided to document this as undefined behavior?

eric-wieser · 2021-05-12T21:58:59Z

I think I decided that it was a hard problem to fix without adding substantial runtime overhead, and didn't have any evidence to suggest anyone really cared anyway; so I figured I'd just document what was already happening.

eric-wieser · 2021-05-12T22:00:36Z

numpy/ma/extras.py

@@ -1075,7 +1075,7 @@ def unique(ar1, return_index=False, return_inverse=False):
    numpy.unique : Equivalent function for ndarrays.

    """
-    output = np.unique(ar1,
+    output = np.unique(ar1.data[~ar1.mask],
                       return_index=return_index,
                       return_inverse=return_inverse)


I don't think this will do the right thing when asked to return indices
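To make the concern about indices concrete: indices returned by np.unique on the compressed data refer to positions in the compressed array, not in the original one. The sketch below shows the mismatch and one way to map the indices back. The function name unique_unmasked_with_index is hypothetical, and the sketch ignores return_inverse and does not report the masked value itself.

```python
import numpy as np


def unique_unmasked_with_index(a):
    """Hypothetical sketch: unique unmasked values with ORIGINAL positions."""
    a = np.ma.asarray(a).ravel()
    mask = np.ma.getmaskarray(a)
    # Original positions of the entries that survive the mask
    positions = np.flatnonzero(~mask)
    values, idx = np.unique(np.ma.getdata(a)[~mask], return_index=True)
    # idx indexes the compressed array; translate back to the input
    return values, positions[idx]


a = np.ma.array([9, 5, 7, 5], mask=[True, False, False, False])

# Naive approach from the diff: indices are relative to the compressed data
naive_index = np.unique(a.data[~a.mask], return_index=True)[1]

values, index = unique_unmasked_with_index(a)
print(naive_index, index)
```

Here the naive indices point at position 0, which is the masked entry in the original array, while the remapped indices point at the first occurrences of 5 and 7.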

ImenRajhi · 2021-05-14

Thanks for the review @eric-wieser. Indeed, the problem turned out to be way deeper.

eric-wieser · 2021-05-12T22:02:35Z

That's not to say I'm opposed to it being fixed; I just didn't have the motivation to do it myself, and left that docstring to prevent people being surprised.

rgommers · 2021-05-14T20:19:48Z

Hi @ImenRajhi, did you close this on purpose? We can continue to add to this PR, I think the fix is going in the right direction.

ImenRajhi · 2021-05-15T11:11:35Z

Sorry @rgommers, small misunderstanding on my part. I reopened it.

charris · 2023-02-19T20:23:49Z

I'm going to close this as "Good Idea", but inactive. @ImenRajhi Feel free to continue this work in a new PR.

fix unique for masked arrays

28d7f32

Co-authored-by: alinanesen <alina.nesen@gmail.com>

rgommers added the "03 - Maintenance" and "component: numpy.ma" (masked arrays) labels May 9, 2021

rgommers changed the title ~~fix unique for masked arrays~~ MAINT: fix unique for masked arrays May 9, 2021

eric-wieser reviewed May 12, 2021

ImenRajhi closed this May 14, 2021

ImenRajhi reopened this May 15, 2021

charris added the "64 - Good Idea" label (inactive PR with a good start or idea; consider studying it if you are working on a related issue) Feb 19, 2023

charris closed this Feb 19, 2023
