
ENH: Better handling of nan values in np.unique for multidimensional arrays #27345


Open
wants to merge 32 commits into base: main

Conversation


@caph1993 caph1993 commented Sep 4, 2024

Enhancement of equal_nan option in np.unique for multi-dimensional data

fixes gh-23286.
fixes gh-20873.

This PR affects the np.unique function when equal_nan is set to True and the input is either a multi-dimensional array with a given axis (not None) or a 1-dimensional structured array.

Before this PR, the following two expressions return the same result:

np.unique([[0, np.nan], [0, np.nan], [np.nan, 1]], equal_nan=True, axis=0) # Expression 1
np.unique([[0, np.nan], [0, np.nan], [np.nan, 1]], equal_nan=False, axis=0) # Expression 2

Their shared result is [[0, np.nan], [0, np.nan], [np.nan, 1]], which is what one would expect only when equal_nan is False. In other words, equal_nan is effectively ignored and treated as False (though not explicitly in the code) for multi-dimensional arrays and for 1-dimensional structured arrays.

After this PR, the first expression returns [[0, np.nan], [np.nan, 1]] while the second still returns the old result, which is more meaningful and resolves some existing GitHub issues. This behavior matches the semantics of "nans being equal", in the sense that the function essentially treats the comparison (1, np.nan) == (1, np.nan) as true. The same improved behavior applies to 1-dimensional arrays with structured dtype.
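The intended comparison semantics can be sketched in plain Python: under equal_nan=True, two rows are equal when their non-nan entries match and their nans sit in the same positions. The helper below is purely illustrative (it is not the PR's implementation):

```python
import math

def rows_equal(r1, r2, equal_nan=True):
    # Illustrative helper, not the PR's code: compare two rows element-wise,
    # optionally treating nan == nan as true when both nans are in the same
    # position.
    if len(r1) != len(r2):
        return False
    for x, y in zip(r1, r2):
        if isinstance(x, float) and math.isnan(x):
            # x is nan: equal only if equal_nan is on and y is nan too
            if not (equal_nan and isinstance(y, float) and math.isnan(y)):
                return False
        elif x != y:
            return False
    return True
```

For example, `rows_equal([0, float('nan')], [0, float('nan')], equal_nan=True)` is true, while the same call with `equal_nan=False` is false, matching the before/after behavior described above.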

@ngoldbaum
Member

I see https://github.com/numpy/numpy/pull/23288/files is a prior attempt at a fix. Can you explain a little how this approach differs from that and why this avoids the issues @seberg saw in that PR?

@caph1993
Author

caph1993 commented Sep 9, 2024

Hello. Thank you for bringing PR #23288 into the discussion. I hadn't noticed it before.

I couldn't find any specific complaint about that PR other than the code being too ad hoc. In particular, that code has separate special-case behavior when the array has 2 dimensions, which is not the case with the code I am providing. Also, checking the diff, my PR changes just 38 lines compared to 181 in that one.

However, in any case, I think that this is the right moment to add support for masked arrays as well. So I'm planning to change this PR today to support both equal_nan and masked arrays using np.lexsort as suggested by @MatteoRaso in the discussion of that PR.

Do you think it's better if I make the new version in a new PR or directly on top of this one?

@caph1993
Author

caph1993 commented Sep 11, 2024

Hello, as promised in the previous comment, I implemented support for masked arrays. This new code also fixes issue gh-23281, which is already closed but for which people would appreciate an actual solution.

This PR brings several changes; I hope that doesn't make it difficult to approve. For instance:

  1. The documentation for np.unique says that dtype=object is unsupported. Now it is supported.
  2. The same doc says that structured arrays with some fields having dtype=object are unsupported. Now they are supported.
  3. The documentation of np.ma.unique does not cover axis, return_counts, or equal_nan. Now they are supported.

If this PR is accepted, several parts of the documentation about unsupported features would need to change, and a proper explanation of how the algorithm works for new supported data would be needed.

  1. For masked arrays with NaN values, it's worth mentioning that the order of precedence is: small values first, then large values, infinity, nan, and finally invalid (mask=True).
  2. Perhaps some examples should be added for axis != None: two structured tuples containing nans are considered different if equal_nan is False; otherwise, they are equal if all their non-nan coordinates coincide and all nans are in the same positions. This is perhaps not the easiest way of explaining it. Maybe we could simplify it by saying that equal_nan makes (3, nan, 4.5) == (3, nan, 4.5) true, as well as (3, nan+1j) == (3, nan*1j) for complex nans.
  3. The same kind of explanation is also needed for sub-arrays (homogeneous or structured), with the additional comment that if a mask exists, then equal_nan means, for example, that [[3,nan],[--,nan]] == [[3,nan],[--,nan]] and [3,nan,--] != [3,--,nan] are both true. If equal_nan is False, both comparisons report the elements as different. Also, regardless of equal_nan, the comparison [3,--,--] == [3,--,--] is true.

Should I proceed to change the docs? I would like some feedback or guidance first, to find the correct balance between being easy to understand and being very precise.

@caph1993
Author

Hello, I created more test cases locally and my solution is not passing them.
Please don't accept this PR until I add all the tests and make sure they pass.

@melissawm
Member

@caph1993 you can convert this pull request to "draft" mode while you are working on it. I'll do it for you right now, but you can find that option in the GitHub web interface as well. Once you are done, feel free to mark it as ready so the maintainers review it again. Cheers!

@melissawm melissawm marked this pull request as draft September 13, 2024 17:28
@caph1993 caph1993 marked this pull request as ready for review September 17, 2024 15:06
@caph1993 caph1993 force-pushed the unique_with_equal_nan_for_multidimensional_arrays branch from 23eee27 to abfd4c9 Compare September 20, 2024 13:48
@caph1993 caph1993 marked this pull request as draft September 21, 2024 17:18
@caph1993 caph1993 marked this pull request as ready for review September 22, 2024 11:09
@caph1993
Author

caph1993 commented Sep 22, 2024

Dear community, I'm now very confident in this implementation. All numpy tests pass, and 50k messy locally generated tests pass as well.

The changes introduced by this PR to the function np.unique can be summarized as these two:

  1. Added support for structured arrays (i.e. arrays with fields), masked arrays, and arrays with dtype=object, with the parameter objects controlling whether to treat all objects as equal, as different, or to compare them with < and == (__lt__ and __eq__).
  2. Clearly defined behavior for equal_nan=True and equal_nan=False (explained below).

The implementation changed significantly with respect to numpy's current one. The general ideas remain the same, of course: move the axis to the first position, reshape to a 2D array (a table), sort lexicographically across the first axis, and finally compare consecutive rows. But the exact method for sorting and comparing changed, and I devote the rest of this comment to explaining why.

The previous implementation collapsed the 2D array (table) of rows into a single 1D array with a special "multi-columnar" dtype, then used numpy's np.argsort, and finally tested consecutive elements for equality with ar[:-1]==ar[1:], plus some ad-hoc treatment of nans depending on equal_nan. The current equality check that uses equal_nan is wrong: for instance, for ex1 = [[np.nan, 42], [np.nan, -1], [np.nan, 42]], calling np.unique(ex1, axis=0, equal_nan=True) yields array([[nan, -1.], [nan, 42.], [nan, 42.]]) instead of array([[nan, -1.], [nan, 42.]]). A fix for this would be to undo the view so that the 1D array becomes 2D again and then use np.all((ar2D[1:]==ar2D[:-1]) | (np.isnan(ar2D[1:]) & np.isnan(ar2D[:-1])), axis=1) for consecutive equality.
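For real-valued 2-D input, that patched consecutive-equality check can be sketched end to end (this is a hypothetical helper for illustration, not the PR's code):

```python
import numpy as np

def unique_rows_equal_nan(a):
    # Sketch of the nan-aware fix described above, for real-valued 2-D input
    # only. Sort rows lexicographically (nans sort last within each key),
    # then keep a row only if it differs from its predecessor, treating
    # nan == nan as true.
    a = np.asarray(a, dtype=float)
    order = np.lexsort(a.T[::-1])  # primary key: first column
    s = a[order]
    eq = (s[1:] == s[:-1]) | (np.isnan(s[1:]) & np.isnan(s[:-1]))
    keep = np.concatenate(([True], ~eq.all(axis=1)))
    return s[keep]
```

With the ex1 example above, this returns the two rows [[nan, -1.], [nan, 42.]]. Note that it still fails for complex nans, for the reason explained in the next paragraph.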

But even with that patch, there is a fundamental problem with this design: sorting has a fixed way of handling nan that cannot be controlled with the parameter equal_nan. For example, for ex2 = [[complex(3, np.nan), 42], [complex(1, np.nan), 42], [complex(2, np.nan), -1]], calling np.unique(ex2, axis=0, equal_nan=True) yields array([[ 1.+nanj, 42. +0.j], [ 2.+nanj, -1. +0.j], [ 3.+nanj, 42. +0.j]]). In this case, no matter what patch we use for the comparison of consecutive elements, we won't be able to detect that [ 1.+nanj, 42. +0.j] is the same as [ 3.+nanj, 42. +0.j] (given equal_nan=True) because they are not consecutive. Again, we could add a second patch that replaces all kinds of complex nans with plain nan before sorting when equal_nan=True, but this implies creating a copy of the array and post-processing the result to report the original nan value. That would work, but this succession of patches feels like a house of cards, and it doesn't seem simple to add support for non-homogeneous arrays, for which both calling isnan and collapsing to 1D are problematic.
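The "second patch" idea could look like the following illustrative sketch (not proposed code): map every complex value with a nan component to one canonical nan before sorting, so that equal-nan rows become adjacent.

```python
import numpy as np

def canonicalize_nans(a):
    # Illustrative sketch only: replace every complex number that has a nan
    # in its real or imaginary part with the single canonical value nan+0j,
    # so that sorting places all "nan rows" next to each other.
    out = np.array(a, dtype=complex)  # note: this copies the input
    has_nan = np.isnan(out.real) | np.isnan(out.imag)
    out[has_nan] = complex(np.nan, 0)
    return out
```

The original nan values would still have to be recovered afterwards, which is exactly the copy-and-post-process cost mentioned above.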

So, part of the solution is to use lexsort instead of argsort, as suggested by ... The reason is that lexsort is defined for both homogeneous and non-homogeneous cases, which simplifies the code and makes everything more general. However, lexsort suffers the same problem as argsort: it handles nan in its own way. Moreover, lexsort sorts all the columns, which may be inconvenient with dtype=object if the user expects the algorithm to compare only when necessary, or if the objects' comparison algorithm is expensive: e.g. for [(2, obj1), (2, obj2), (3, obj3)], clearly the first two objects (obj1 and obj2) need to be compared, but not the third (obj3), because 3 > 2, so it comes last.

For these reasons I opted to reimplement lexsort with support for the equal_nan parameter, and I took the opportunity to add support for masked arrays, structured arrays (possibly nested, like dtype=((int, float), (float, int))), multi-dimensional arrays, arrays with dtype=object, and combinations of them.

The logic for handling nan in _lexargsort and therefore also in np.unique is the following.
When equal_nan = False, the order of nan elements follows the documentation of np.sort, e.g. 100+1j < 1+nanj < 3-nanj < nan-3j < nan+nanj. In case of ties, the first occurrence is the leader. Since all nans are treated as different elements, they are all marked as unique.
When equal_nan = True, all nan values are considered equal, e.g. 1+nanj = nan-3j, and their order is given by the order in which they appear in the input. Since all nan values are equal, only the first occurrence is marked as unique.

The implementation of _lexargsort simply applies argsort column by column, from left to right, within blocks of rows that are equal so far. It is very concise for the job it does and, in my opinion, easy to read, but it has a Python loop over the blocks, so it might be slower than the existing implementation, which uses wrappers around C or C++.
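As I understand the description above, the block-wise strategy could be sketched like this (hypothetical code, not the PR's _lexargsort; it ignores nans and masks and assumes a plain 2-D array with sortable entries):

```python
import numpy as np

def lex_argsort_blocks(table):
    # Hypothetical sketch of a column-by-column block sort: sort rows by the
    # first column, then sort each block of tied rows by the next column, so
    # a later column is only examined inside blocks that are still tied.
    n, m = table.shape
    order = np.arange(n)
    blocks = [(0, n)]  # half-open row ranges whose rows are equal so far
    for col in range(m):
        next_blocks = []
        for lo, hi in blocks:
            idx = order[lo:hi]
            sub = np.argsort(table[idx, col], kind="stable")
            order[lo:hi] = idx[sub]
            keys = table[order[lo:hi], col]
            # collect runs of equal keys; only runs longer than 1 stay tied
            start = lo
            for i in range(lo + 1, hi + 1):
                if i == hi or keys[i - lo] != keys[i - 1 - lo]:
                    if i - start > 1:
                        next_blocks.append((start, i))
                    start = i
        blocks = next_blocks
        if not blocks:  # no ties remain, later columns are never compared
            break
    return order
```

For [(2, obj1), (2, obj2), (3, obj3)] this compares second-column entries only within the tied block of the two leading 2s, matching the "compare only when necessary" behavior described above.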

@Thijss

Thijss commented Jan 9, 2025

hi there, nice work. I'd love to see this merged.
What would be the next steps?

I can think of two minor issues, but I'm not sure how we can get this on the roadmap for merging afterwards:

@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Jan 9, 2025
@ngoldbaum
Member

I added the triage review label because it looks like this PR was overlooked. Thanks for the ping here.

@caph1993
Author

I'm unable to fix the build process. I'm getting the error reported in issue #27516, shown below.
Does someone know how to fix it?

(python311-env) ➜  numpy git:(unique_with_equal_nan_for_multidimensional_arrays) spin test -v -t numpy/lib/tests/test_arraysetops.py

Invoking `build` prior to running tests:
$ /home/carlos/applications/python311-env/bin/python3 vendored-meson/meson/meson.py compile -C build
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /home/carlos/applications/python311-env/bin/ninja -C /home/carlos/numpy/build
ninja: Entering directory `/home/carlos/numpy/build'
[1/98] Generating numpy/generate-version with a custom command
Saving version to numpy/version.py
[3/98] Compiling C object numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o
FAILED: numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o 
cc -Inumpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/usr/include/openblas -I/usr/include/python3.11 -I/home/carlos/numpy/build/meson_cpu -fvisibility=hidden -fdiagnostics-color=always -Wall -Winvalid-pch -std=c11 -O2 -g -fno-strict-aliasing -msse -msse2 -msse3 -DNPY_HAVE_SSE2 -DNPY_HAVE_SSE -DNPY_HAVE_SSE3 -fPIC -DHAVE_CBLAS -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -MD -MQ numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o -MF numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o.d -o numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o -c ../numpy/_core/src/multiarray/textreading/rows.c
../numpy/_core/src/multiarray/textreading/rows.c: In function ‘create_conv_funcs’:
../numpy/_core/src/multiarray/textreading/rows.c:63:5: error: implicit declaration of function ‘Py_BEGIN_CRITICAL_SECTION’ [-Wimplicit-function-declaration]
   63 |     Py_BEGIN_CRITICAL_SECTION(converters);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~
../numpy/_core/src/multiarray/textreading/rows.c:116:5: error: implicit declaration of function ‘Py_END_CRITICAL_SECTION’ [-Wimplicit-function-declaration]
  116 |     Py_END_CRITICAL_SECTION();
      |     ^~~~~~~~~~~~~~~~~~~~~~~
[12/98] Compiling C++ object numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_clip.cpp.o
ninja: build stopped: subcommand failed.

@ngoldbaum
Member

Just a guess, but have you updated your git submodules in a while? Those macros are included in pythoncapi-compat, which we vendor as a git submodule.

@ngoldbaum
Member

ngoldbaum commented Jan 21, 2025

You shouldn’t need to update the submodule in this PR. You probably need to merge with main or rebase instead. Sorry if that was confusing, I was telling you to update the submodule in your local dev environment.

@ngoldbaum
Member

It looks like the submodules are still messed up. This PR shouldn’t change the state of them.

@caph1993
Author

Yes, indeed. I'm trying to undo those changes while still being able to test locally... I'm new to submodules, multiple branches, rebasing, etc.
Just give me a few hours and I'll figure it out, or give me more detailed commands and I'll do that faster.

@ngoldbaum
Member

ngoldbaum commented Jan 21, 2025

Untested, but I think you'd do:

$ cd numpy/_core/src/common/pythoncapi-compat/
$ git checkout 0f1d42a
$ cd ../../../../
$ cd numpy/_core/src/highway/
$ git checkout f2209b91

And then commit the changes to the submodules:

$ cd ../../../../
$ git commit -am "MAINT: fix submodules"

@caph1993
Author

Thanks for the help @ngoldbaum. Only one test is failing now. Can you guide me on how I can debug it?
It seems to be specific to windows and the details don't say much to me.

@ngoldbaum
Member

I can take a look at that on my Windows box sometime in the next few days

@ngoldbaum
Member

Sorry for the trouble, I gave you incorrect instructions the other day, the vendored meson is out of date in this PR.

I just pushed a commit that should fix this.

Author

@caph1993 caph1993 left a comment


All issues were resolved

@caph1993 caph1993 requested a review from Thijss January 27, 2025 14:58
Labels
01 - Enhancement triage review Issue/PR to be discussed at the next triage meeting
Projects
Status: Awaiting a code review
4 participants