-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
ENH: Better handling of nan values in np.unique for multidimensional arrays #27345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ENH: Better handling of nan values in np.unique for multidimensional arrays #27345
Conversation
I see https://github.com/numpy/numpy/pull/23288/files is a prior attempt at a fix. Can you explain a little how this approach differs from that and why this avoids the issues @seberg saw in that PR? |
Hello. Thank you for bringing the PR #23288 to the discussion. I hadn't noticed it before. I couldn't find any specific complaint about that PR other than the code being too ad-hoc. In particular that code has a specific separate behavior when the array has 2 dimensions. This is not the case with the code I am providing. Also, checking the diff, my PR changes just 38 lines compared to 181 lines from that one. However, in any case, I think that this is the right moment to add support for masked arrays as well. So I'm planning to change this PR today to support both equal_nan and masked arrays using np.lexsort as suggested by @MatteoRaso in the discussion of that PR. Do you think it's better if I make the new version in a new PR or directly on top this one? |
Hello, as promised in the previous comment, I implemented support for masked arrays. This new code also fixes issue gh-23281, which is already closed but for which people would appreciate an actual solution. This PR brings several changes. I hope that doesn't make it difficult to approve it. For instance:
If this PR is accepted, several parts of the documentation about unsupported features would need to change, and a proper explanation of how the algorithm works for new supported data would be needed.
Should I proceed to change the docs? I would like some feedback or guidance before to find the correct balance between being easy to understand and being very precise. |
Hello, I created more test cases locally and my solution is not passing them. |
@caph1993 you can convert this issue to be in "draft" mode while you are working on it. I'll do it for you right now, but you can find that option in the github web interface as well. Once you are done, feel free to mark it as ready so the maintainers review it again. Cheers! |
23eee27
to
abfd4c9
Compare
Dear community, now I'm very confident of this implementation. All numpy tests passed, and 50k messy locally generated tests passed as well. The changes introduced by this PR to the function
The implementation changed significantly with respect to numpy's current implementation. The general ideas of moving the axis to the first position, then reshaping as 2D array (a table), then sorting lexicographically accross the first axis and finally comparing consecutive rows are still the same, of course. But the exact method for sorting and comparing changed, and I devoted the rest of this comment to explain why. The previous implementation was collapsing the 2D array (table) of rows into a single 1D array with a special "multi-columnar dtype", then using numpy's But even if we apply the patch, there is a fundamental problem with this design: sorting has a specified way of handling nan that can not be controlled with the parameter So, part of the solution is to use For these reason I opted for reimplementing The logic for handling nan in The implementation for |
hi there, nice work. I'd love to see this merged. I can think of two minor issues, but not sure how we can get this on the roadmap for merging afterwards:
|
I added the triage review label because it looks like this PR was overlooked. Thanks for the ping here. |
…ith_equal_nan_for_multidimensional_arrays
I'm unable to fix the build process. I'm getting the error reported in issue #27516, shown next.
|
…ith_equal_nan_for_multidimensional_arrays
Just a guess, but have you updated your git submodules in a while? Those macros are included in pythoncapi-compat, which we vendor as a git submodule. |
You shouldn’t need to update the submodule in this PR. You probably need to merge with |
…imensional_arrays' into unique_with_equal_nan_for_multidimensional_arrays
It looks like the submodules are still messed up. This PR shouldn’t change the state of them. |
Yes, indeed. I'm trying to undo those changes while being able to test locally... I'm new to submodules, multiple branches, rebasing, etc. |
Untested, but I think you'd do:
And then commit the changes to the submodules:
|
Thanks for the help @ngoldbaum. Only one test is failing now. Can you guide me on how can I debug that? |
I can take a look at that on my Windows box sometime in the next few days |
Sorry for the trouble, I gave you incorrect instructions the other day, the vendored meson is out of date in this PR. I just pushed a commit that should fix this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All issues were resolved
Enhancement of
equal_nan
option innp.unique
for multi-dimensional datafixes gh-23286.
fixes gh-20873.
This PR affects the
np.unique
function whenequal_nan
is set to true and when the input is either a multi-dimensional array with given axis (not None) or a structured 1-dimensional array.Before this PR, the results of the following two expressions are the same:
Their result is
[[0, np.nan], [0, np.nan], [np.nan, 1]]
, which is the result one would expect only whenequal_nan
isFalse
. In other words,equal_nan
is basically ignored and assumed asFalse
(but not explicitly in the code) for multi-dimensional arrays and 1-dimensional structured arrays. The same behavior is observed if the arrays are 1-dimensional with structured dtype instead of 2-dimensional.After this PR, the first expression returns
[[0, np.nan], [np.nan, 1]]
while the second returns the old result, which is more meaningful and resolves some exisiting github issues. This behavior matches the semantics of "nans being equal", in the sense that the function is essentially treating the comparison(1, np.nan) == (1, np.nan)
as if it was true. The same good behavior is observed after this PR if the arrays are 1-dimensional with structured dtype instead of 2-dimensional.