
ENH: Better handling of nan values in np.unique for multidimensional arrays #27345


Open
wants to merge 32 commits into base: main

Conversation


@caph1993 caph1993 commented Sep 4, 2024

Enhancement of equal_nan option in np.unique for multi-dimensional data

fixes gh-23286.
fixes gh-20873.

This PR affects the np.unique function when equal_nan is set to True and the input is either a multi-dimensional array with a given axis (not None) or a 1-dimensional structured array.

Before this PR, the following two expressions return the same result:

np.unique([[0, np.nan], [0, np.nan], [np.nan, 1]], equal_nan=True, axis=0) # Expression 1
np.unique([[0, np.nan], [0, np.nan], [np.nan, 1]], equal_nan=False, axis=0) # Expression 2

Their shared result is [[0, np.nan], [0, np.nan], [np.nan, 1]], which is what one would expect only when equal_nan is False. In other words, equal_nan is effectively ignored and treated as False (though not explicitly in the code) for multi-dimensional arrays and for 1-dimensional structured arrays.

After this PR, the first expression returns [[0, np.nan], [np.nan, 1]] while the second still returns the old result, which is more meaningful and resolves some existing GitHub issues. This behavior matches the semantics of "nans being equal", in the sense that the function essentially treats the comparison (1, np.nan) == (1, np.nan) as true. The same improved behavior applies to 1-dimensional arrays with structured dtype.
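The intended comparison semantics can be sketched in plain Python: under equal_nan=True, two rows are equal when their non-nan entries match and their nans sit in the same positions. The helper below is purely illustrative (it is not the PR's implementation):

```python
import math

def rows_equal(r1, r2, equal_nan=True):
    # Illustrative helper, not the PR's code: compare two rows element-wise,
    # optionally treating nan == nan as true when both nans are in the same
    # position.
    if len(r1) != len(r2):
        return False
    for x, y in zip(r1, r2):
        if isinstance(x, float) and math.isnan(x):
            # x is nan: equal only if equal_nan is on and y is nan too
            if not (equal_nan and isinstance(y, float) and math.isnan(y)):
                return False
        elif x != y:
            return False
    return True
```

For example, `rows_equal([0, float('nan')], [0, float('nan')], equal_nan=True)` is true, while the same call with `equal_nan=False` is false, matching the before/after behavior described above.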

@ngoldbaum
Member

I see https://github.com/numpy/numpy/pull/23288/files is a prior attempt at a fix. Can you explain a little how this approach differs from that and why this avoids the issues @seberg saw in that PR?

@caph1993
Author

caph1993 commented Sep 9, 2024

Hello. Thank you for bringing PR #23288 into the discussion. I hadn't noticed it before.

I couldn't find any specific complaint about that PR other than the code being too ad hoc. In particular, that code has separate special-case behavior when the array has 2 dimensions, which is not the case with the code I am providing. Also, checking the diff, my PR changes just 38 lines compared to 181 in that one.

However, in any case, I think that this is the right moment to add support for masked arrays as well. So I'm planning to change this PR today to support both equal_nan and masked arrays using np.lexsort as suggested by @MatteoRaso in the discussion of that PR.

Do you think it's better if I make the new version in a new PR or directly on top of this one?

@caph1993
Author

caph1993 commented Sep 11, 2024

Hello, as promised in the previous comment, I implemented support for masked arrays. This new code also fixes issue gh-23281, which is already closed but for which people would appreciate an actual solution.

This PR brings several changes; I hope that doesn't make it difficult to approve. For instance:

  1. The documentation for np.unique says that dtype=object is unsupported. Now it is supported.
  2. The same doc says that structured arrays with some fields having dtype=object are unsupported. Now they are supported.
  3. The documentation of np.ma.unique does not cover axis, return_counts, or equal_nan. Now they are supported.

If this PR is accepted, several parts of the documentation about unsupported features would need to change, and a proper explanation of how the algorithm works for new supported data would be needed.

  1. For masked arrays with NaN values, it's worth mentioning that the order of precedence is: small values first, then large values, infinity, nan, and finally invalid (mask=True).
  2. Perhaps some examples should be added for axis != None: two structured tuples containing nans are considered different if equal_nan is False; otherwise, they are equal if all their non-nan coordinates coincide and all nans are in the same positions. This is perhaps not the easiest way of explaining it. Maybe we could simplify it by saying that equal_nan makes (3, nan, 4.5) == (3, nan, 4.5) true, as well as (3, nan+1j) == (3, nan*1j) for complex nans.
  3. The same kind of explanation is also needed for sub-arrays (homogeneous or structured), with the additional comment that if a mask exists, then equal_nan means, for example, that [[3,nan],[--,nan]] == [[3,nan],[--,nan]] and [3,nan,--] != [3,--,nan] are both true. If equal_nan is False, both comparisons report the elements as different. Also, regardless of equal_nan, the comparison [3,--,--] == [3,--,--] is true.

Should I proceed to change the docs? I would like some feedback or guidance first, to find the correct balance between being easy to understand and being very precise.

@caph1993
Author

Hello, I created more test cases locally and my solution is not passing them.
Please don't accept this PR until I add all the tests and make sure they pass.

@melissawm
Member

@caph1993 you can convert this pull request to "draft" mode while you are working on it. I'll do it for you right now, but you can find that option in the GitHub web interface as well. Once you are done, feel free to mark it as ready so the maintainers review it again. Cheers!

@melissawm melissawm marked this pull request as draft September 13, 2024 17:28
@caph1993 caph1993 marked this pull request as ready for review September 17, 2024 15:06
@caph1993 caph1993 force-pushed the unique_with_equal_nan_for_multidimensional_arrays branch from 23eee27 to abfd4c9 Compare September 20, 2024 13:48
@caph1993 caph1993 marked this pull request as draft September 21, 2024 17:18
@caph1993 caph1993 marked this pull request as ready for review September 22, 2024 11:09
@caph1993
Author

caph1993 commented Sep 22, 2024

Dear community, I'm now very confident in this implementation. All numpy tests pass, and 50k messy locally generated tests pass as well.

The changes introduced by this PR to the function np.unique can be summarized as these two:

  1. Added support for structured arrays (i.e. arrays with fields), masked arrays, and arrays with dtype=object, with the parameter objects controlling whether to treat all objects as equal, as different, or to compare them with < and == (__lt__ and __eq__).
  2. Clearly defined behavior for equal_nan=True and equal_nan=False (explained below).

The implementation changed significantly with respect to numpy's current one. The general ideas remain the same, of course: move the axis to the first position, reshape to a 2D array (a table), sort lexicographically across the first axis, and finally compare consecutive rows. But the exact method for sorting and comparing changed, and I devote the rest of this comment to explaining why.

The previous implementation collapsed the 2D array (table) of rows into a single 1D array with a special "multi-columnar" dtype, then used numpy's np.argsort, and finally tested consecutive elements for equality with ar[:-1]==ar[1:], plus some ad-hoc treatment of nans depending on equal_nan. The current equality check that uses equal_nan is wrong: for instance, for ex1 = [[np.nan, 42], [np.nan, -1], [np.nan, 42]], calling np.unique(ex1, axis=0, equal_nan=True) yields array([[nan, -1.], [nan, 42.], [nan, 42.]]) instead of array([[nan, -1.], [nan, 42.]]). A fix for this would be to undo the view so that the 1D array becomes 2D again and then use np.all((ar2D[1:]==ar2D[:-1]) | (np.isnan(ar2D[1:]) & np.isnan(ar2D[:-1])), axis=1) for consecutive equality.
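For real-valued 2-D input, that patched consecutive-equality check can be sketched end to end (this is a hypothetical helper for illustration, not the PR's code):

```python
import numpy as np

def unique_rows_equal_nan(a):
    # Sketch of the nan-aware fix described above, for real-valued 2-D input
    # only. Sort rows lexicographically (nans sort last within each key),
    # then keep a row only if it differs from its predecessor, treating
    # nan == nan as true.
    a = np.asarray(a, dtype=float)
    order = np.lexsort(a.T[::-1])  # primary key: first column
    s = a[order]
    eq = (s[1:] == s[:-1]) | (np.isnan(s[1:]) & np.isnan(s[:-1]))
    keep = np.concatenate(([True], ~eq.all(axis=1)))
    return s[keep]
```

With the ex1 example above, this returns the two rows [[nan, -1.], [nan, 42.]]. Note that it still fails for complex nans, for the reason explained in the next paragraph.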

But even with that patch, there is a fundamental problem with this design: sorting has a fixed way of handling nan that cannot be controlled with the parameter equal_nan. For example, for ex2 = [[complex(3, np.nan), 42], [complex(1, np.nan), 42], [complex(2, np.nan), -1]], calling np.unique(ex2, axis=0, equal_nan=True) yields array([[ 1.+nanj, 42. +0.j], [ 2.+nanj, -1. +0.j], [ 3.+nanj, 42. +0.j]]). In this case, no matter what patch we use for the comparison of consecutive elements, we won't be able to detect that [ 1.+nanj, 42. +0.j] is the same as [ 3.+nanj, 42. +0.j] (given equal_nan=True) because they are not consecutive. Again, we could add a second patch that replaces all kinds of complex nans with plain nan before sorting when equal_nan=True, but this implies creating a copy of the array and post-processing the result to report the original nan value. That would work, but this succession of patches feels like a house of cards, and it doesn't seem simple to add support for non-homogeneous arrays, for which both calling isnan and collapsing to 1D are problematic.
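The "second patch" idea could look like the following illustrative sketch (not proposed code): map every complex value with a nan component to one canonical nan before sorting, so that equal-nan rows become adjacent.

```python
import numpy as np

def canonicalize_nans(a):
    # Illustrative sketch only: replace every complex number that has a nan
    # in its real or imaginary part with the single canonical value nan+0j,
    # so that sorting places all "nan rows" next to each other.
    out = np.array(a, dtype=complex)  # note: this copies the input
    has_nan = np.isnan(out.real) | np.isnan(out.imag)
    out[has_nan] = complex(np.nan, 0)
    return out
```

The original nan values would still have to be recovered afterwards, which is exactly the copy-and-post-process cost mentioned above.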

So, part of the solution is to use lexsort instead of argsort, as suggested by ... The reason is that lexsort is defined for both homogeneous and non-homogeneous cases, which simplifies the code and makes everything more general. However, lexsort suffers the same problem as argsort: it handles nan in its own way. Moreover, lexsort sorts all the columns, which may be inconvenient with dtype=object if the user expects the algorithm to compare only when necessary, or if the objects' comparison algorithm is expensive: e.g. for [(2, obj1), (2, obj2), (3, obj3)], clearly the first two objects (obj1 and obj2) need to be compared, but not the third (obj3), because 3 > 2, so it comes last.

For these reasons I opted to reimplement lexsort with support for the equal_nan parameter, and I took the opportunity to add support for masked arrays, structured arrays (possibly nested, like dtype=((int, float), (float, int))), multi-dimensional arrays, arrays with dtype=object, and combinations of them.

The logic for handling nan in _lexargsort and therefore also in np.unique is the following.
When equal_nan = False, the order of nan elements follows the documentation of np.sort, e.g. 100+1j < 1+nanj < 3-nanj < nan-3j < nan+nanj. In case of ties, the first occurrence is the leader. Since all nans are treated as different elements, they are all marked as unique.
When equal_nan = True, all nan values are considered equal, e.g. 1+nanj = nan-3j, and their order is given by the order in which they appear in the input. Since all nan values are equal, only the first occurrence is marked as unique.

The implementation of _lexargsort simply applies argsort column by column, from left to right, within blocks of rows that are equal so far. It is very concise for the job it does and, in my opinion, easy to read, but it has a Python loop over the blocks, so it might be slower than the existing implementation, which uses wrappers around C or C++.
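As I understand the description above, the block-wise strategy could be sketched like this (hypothetical code, not the PR's _lexargsort; it ignores nans and masks and assumes a plain 2-D array with sortable entries):

```python
import numpy as np

def lex_argsort_blocks(table):
    # Hypothetical sketch of a column-by-column block sort: sort rows by the
    # first column, then sort each block of tied rows by the next column, so
    # a later column is only examined inside blocks that are still tied.
    n, m = table.shape
    order = np.arange(n)
    blocks = [(0, n)]  # half-open row ranges whose rows are equal so far
    for col in range(m):
        next_blocks = []
        for lo, hi in blocks:
            idx = order[lo:hi]
            sub = np.argsort(table[idx, col], kind="stable")
            order[lo:hi] = idx[sub]
            keys = table[order[lo:hi], col]
            # collect runs of equal keys; only runs longer than 1 stay tied
            start = lo
            for i in range(lo + 1, hi + 1):
                if i == hi or keys[i - lo] != keys[i - 1 - lo]:
                    if i - start > 1:
                        next_blocks.append((start, i))
                    start = i
        blocks = next_blocks
        if not blocks:  # no ties remain, later columns are never compared
            break
    return order
```

For [(2, obj1), (2, obj2), (3, obj3)] this compares second-column entries only within the tied block of the two leading 2s, matching the "compare only when necessary" behavior described above.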

@Thijss

Thijss commented Jan 9, 2025

hi there, nice work. I'd love to see this merged.
What would be the next steps?

I can think of two minor issues, but I'm not sure how we can get this on the roadmap for merging afterwards:

@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Jan 9, 2025
@ngoldbaum
Member

I added the triage review label because it looks like this PR was overlooked. Thanks for the ping here.

@caph1993
Author

I'm unable to fix the build process. I'm getting the error reported in issue #27516, shown below.
Does someone know how to fix it?

(python311-env) ➜  numpy git:(unique_with_equal_nan_for_multidimensional_arrays) spin test -v -t numpy/lib/tests/test_arraysetops.py

Invoking `build` prior to running tests:
$ /home/carlos/applications/python311-env/bin/python3 vendored-meson/meson/meson.py compile -C build
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /home/carlos/applications/python311-env/bin/ninja -C /home/carlos/numpy/build
ninja: Entering directory `/home/carlos/numpy/build'
[1/98] Generating numpy/generate-version with a custom command
Saving version to numpy/version.py
[3/98] Compiling C object numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o
FAILED: numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o 
cc -Inumpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/usr/include/openblas -I/usr/include/python3.11 -I/home/carlos/numpy/build/meson_cpu -fvisibility=hidden -fdiagnostics-color=always -Wall -Winvalid-pch -std=c11 -O2 -g -fno-strict-aliasing -msse -msse2 -msse3 -DNPY_HAVE_SSE2 -DNPY_HAVE_SSE -DNPY_HAVE_SSE3 -fPIC -DHAVE_CBLAS -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -MD -MQ numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o -MF numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o.d -o numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o -c ../numpy/_core/src/multiarray/textreading/rows.c
../numpy/_core/src/multiarray/textreading/rows.c: In function ‘create_conv_funcs’:
../numpy/_core/src/multiarray/textreading/rows.c:63:5: error: implicit declaration of function ‘Py_BEGIN_CRITICAL_SECTION’ [-Wimplicit-function-declaration]
   63 |     Py_BEGIN_CRITICAL_SECTION(converters);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~
../numpy/_core/src/multiarray/textreading/rows.c:116:5: error: implicit declaration of function ‘Py_END_CRITICAL_SECTION’ [-Wimplicit-function-declaration]
  116 |     Py_END_CRITICAL_SECTION();
      |     ^~~~~~~~~~~~~~~~~~~~~~~
[12/98] Compiling C++ object numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_clip.cpp.o
ninja: build stopped: subcommand failed.

@ngoldbaum
Member

Just a guess, but have you updated your git submodules in a while? Those macros are included in pythoncapi-compat, which we vendor as a git submodule.

@ngoldbaum
Member

ngoldbaum commented Jan 21, 2025

You shouldn’t need to update the submodule in this PR. You probably need to merge with main or rebase instead. Sorry if that was confusing, I was telling you to update the submodule in your local dev environment.

@ngoldbaum
Member

It looks like the submodules are still messed up. This PR shouldn’t change the state of them.

@caph1993
Author

Yes, indeed. I'm trying to undo those changes while still being able to test locally... I'm new to submodules, multiple branches, rebasing, etc.
Just give me a few hours and I'll figure it out, or give me more detailed commands and I'll do that faster.

@ngoldbaum
Member

ngoldbaum commented Jan 21, 2025

Untested, but I think you'd do:

$ cd numpy/_core/src/common/pythoncapi-compat/
$ git checkout 0f1d42a
$ cd ../../../../
$ cd numpy/_core/src/highway/
$ git checkout f2209b91

And then commit the changes to the submodules:

$ cd ../../../../
$ git commit -am "MAINT: fix submodules"

@caph1993
Author

Thanks for the help @ngoldbaum. Only one test is failing now. Can you guide me on how I can debug it?
It seems to be specific to windows and the details don't say much to me.

@ngoldbaum
Member

I can take a look at that on my Windows box sometime in the next few days

@ngoldbaum
Member

Sorry for the trouble, I gave you incorrect instructions the other day, the vendored meson is out of date in this PR.

I just pushed a commit that should fix this.

Author

@caph1993 caph1993 left a comment


All issues were resolved

@caph1993 caph1993 requested a review from Thijss January 27, 2025 14:58
Labels
01 - Enhancement triage review Issue/PR to be discussed at the next triage meeting
Projects
Status: Awaiting a code review
4 participants