-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
MAINT: fix unique for masked arrays #18968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: alinanesen <alina.nesen@gmail.com>
Thansk @ImenRajhi and @alinanesen! The tests are failing because the input to
So you'll want to only use the mask if it's actually a masked array. You can special-case this with a |
Try to run the tests locally to see this failure, then it's much easier to iterate than if you have to wait on CI. For example:
to run all the |
Thanks for pointing that out @rgommers. After a second look I realized we have misdiagnosed the problem. Here are my findings:
|
More details: |
Nice analysis @ImenRajhi. Indeed, >>> x = np.ma.array([4,3,2,1], mask=[True, False, True, False])
>>> np.ma.sort(x)
masked_array(data=[1, 3, --, --],
mask=[False, False, True, True],
fill_value=999999)
>>> y = np.ma.array([4,255,2,1], dtype=np.uint8, mask=[True, False, True, False])
>>> np.ma.sort(y)
masked_array(data=[1, --, 255, --],
mask=[False, True, False, True],
fill_value=999999,
dtype=uint8) The current behavior was thought about and document as being undefined in gh-8678. That said, I don't see a real discussion about it on that PR, and I also don't see why it would not be possible to sort masked values to the end. Correctness is more important than performance here. |
@eric-wieser it's 4 years ago, but do you recollect why you decided to document this as undefined behavior? |
I think I decided that it was a hard problem to fix without adding substantial runtime overhead, and didn't have any evidence to suggest anyone really cared anyway; so I figured I'd just document what was already happening. |
@@ -1075,7 +1075,7 @@ def unique(ar1, return_index=False, return_inverse=False): | |||
numpy.unique : Equivalent function for ndarrays. | |||
|
|||
""" | |||
output = np.unique(ar1, | |||
output = np.unique(ar1.data[~ar1.mask], | |||
return_index=return_index, | |||
return_inverse=return_inverse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will do the right thing when asked to return indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @eric-wieser. Indeed, the problem turned out to be way deeper.
That's not to say I'm opposed to it being fixed; I just didn't have the motivation to do it myself, and left that docstring to prevent people being surprised. |
Hi @ImenRajhi, did you close this on purpose? We can continue to add to this PR, I think the fix is going in the right direction. |
Sorry @rgommers, small misunderstanding on my part. I reopened it. |
I'm going to close this as "Good Idea", but inactive. @ImenRajhi Feel free to continue this work in a new PR. |
Co-authored-by: alinanesen alina.nesen@gmail.com / GitHub username: @alinanesen
Modified np.ma.unique to call the np.unique with the array without the masked values.
Closes gh-14804