Skip to content

MAINT: fix unique for masked arrays #18968

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ImenRajhi
Copy link
Contributor

@ImenRajhi ImenRajhi commented May 9, 2021

Co-authored-by: alinanesen alina.nesen@gmail.com / GitHub username: @alinanesen

Modified np.ma.unique to call the np.unique with the array without the masked values.

Closes gh-14804

Co-authored-by: alinanesen <alina.nesen@gmail.com>
@rgommers rgommers changed the title fix unique for masked arrays MAINT: fix unique for masked arrays May 9, 2021
@rgommers
Copy link
Member

rgommers commented May 9, 2021

Thansk @ImenRajhi and @alinanesen!

The tests are failing because the input to np.ma.unique can be either a regular numpy array or a masked array:

>       output = np.unique(ar1.data[~ar1.mask],
                           return_index=return_index,
                           return_inverse=return_inverse)
E       AttributeError: 'numpy.ndarray' object has no attribute 'mask'

So you'll want to only use the mask if it's actually a masked array. You can special-case this with a if isinstance(ar1, MaskedArray) for example.

@rgommers
Copy link
Member

rgommers commented May 9, 2021

Try to run the tests locally to see this failure, then it's much easier to iterate than if you have to wait on CI. For example:

python runtests.py -s ma

to run all the numpy.ma tests.

@ImenRajhi
Copy link
Contributor Author

Thanks for pointing that out @rgommers.

After a second look I realized we have misdiagnosed the problem. Here are my findings:

  1. The problem behind this behavior: the ordering of dtype extremities and the masked values are not defined while sorting.
  2. How exactly is that causing this behavior: Unique calls for sort and sort does not know how to order masked and extremes and therefore e.g. of np.uint8: sort([255, --, 255, --]) -> outputs [255, --, 255, --]. Afterwards unique will compare pairwise and will keep everything thus the weird behavior (outputs [255, --, 255, --]). The same example but with the input rearranged [255, 255, --, --] will output the correct results of [255, --].
  3. The solution: define the ordering of dtype extremities and the masked values.

@ImenRajhi
Copy link
Contributor Author

More details:
The np.ma.argsort fills in the masked array with the fill_value. The fill value always ends up being the dtype maximum as fill_value is passed null and endwith is default True. Therefore input [255, --, 255, --] becomes a filled array: [255, 255, 255, 255]. Then filled array is sorted and right there the sorting strange behavior happens.

@rgommers
Copy link
Member

Nice analysis @ImenRajhi. Indeed, sort is a little odd:

>>> x = np.ma.array([4,3,2,1], mask=[True, False, True, False])
>>> np.ma.sort(x)
masked_array(data=[1, 3, --, --],
             mask=[False, False,  True,  True],
       fill_value=999999)
>>> y = np.ma.array([4,255,2,1], dtype=np.uint8, mask=[True, False, True, False])
>>> np.ma.sort(y)
masked_array(data=[1, --, 255, --],
             mask=[False,  True, False,  True],
       fill_value=999999,
            dtype=uint8)

The current behavior was thought about and document as being undefined in gh-8678. That said, I don't see a real discussion about it on that PR, and I also don't see why it would not be possible to sort masked values to the end. Correctness is more important than performance here. argsort and sort behavior should match, and argsort does the actual work - so MaskedArray.argsort is the method to fix.

@rgommers
Copy link
Member

The current behavior was thought about and document as being undefined in gh-8678

@eric-wieser it's 4 years ago, but do you recollect why you decided to document this as undefined behavior?

@eric-wieser
Copy link
Member

I think I decided that it was a hard problem to fix without adding substantial runtime overhead, and didn't have any evidence to suggest anyone really cared anyway; so I figured I'd just document what was already happening.

@@ -1075,7 +1075,7 @@ def unique(ar1, return_index=False, return_inverse=False):
numpy.unique : Equivalent function for ndarrays.

"""
output = np.unique(ar1,
output = np.unique(ar1.data[~ar1.mask],
return_index=return_index,
return_inverse=return_inverse)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will do the right thing when asked to return indices

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @eric-wieser. Indeed, the problem turned out to be way deeper.

@eric-wieser
Copy link
Member

That's not to say I'm opposed to it being fixed; I just didn't have the motivation to do it myself, and left that docstring to prevent people being surprised.

@ImenRajhi ImenRajhi closed this May 14, 2021
@rgommers
Copy link
Member

Hi @ImenRajhi, did you close this on purpose? We can continue to add to this PR, I think the fix is going in the right direction.

@ImenRajhi
Copy link
Contributor Author

ImenRajhi commented May 15, 2021

Sorry @rgommers, small misunderstanding on my part. I reopened it.

@ImenRajhi ImenRajhi reopened this May 15, 2021
@charris charris added the 64 - Good Idea Inactive PR with a good start or idea. Consider studying it if you are working on a related issue. label Feb 19, 2023
@charris
Copy link
Member

charris commented Feb 19, 2023

I'm going to close this as "Good Idea", but inactive. @ImenRajhi Feel free to continue this work in a new PR.

@charris charris closed this Feb 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
03 - Maintenance 64 - Good Idea Inactive PR with a good start or idea. Consider studying it if you are working on a related issue. component: numpy.ma masked arrays
Projects
Development

Successfully merging this pull request may close these issues.

Using numpy.unique with a masked array get some strange issue
4 participants