ENH, BUG: add equals_nans parameter to np.unique and fixed functionality with equal_nans along an axis #20896

rjeb · 2022-01-26T07:58:28Z

Addresses #20326 and #20873

np.unique previously had it's functionality changed so NaN values would be treated as non-unique.
This PR puts the functionality into the parameter equal_nans(default: True).

In addition, it fixes the inability to use equal_nans and axis functionality simultaneously.

Ex.

arr = np.array([
[np.nan, 5.],
[np.nan, 5.], 
[0, 0], 
[0, 0],
[1, 1], 
[2, np.nan],
[3, 3], 
[4, 4],
[5, np.nan], 
[5, np.nan]
])

np.unique(arr, axis = 0)

Result:

[[ 0.  0.]
 [ 1.  1.]
 [ 2. nan]
 [ 3.  3.]
 [ 4.  4.]
 [ 5. nan]
 [nan  5.]]

np.unique(arr, axis = 0, equal_nans = False)

Result:

[[ 0.  0.]
 [ 1.  1.]
 [ 2. nan]
 [ 3.  3.]
 [ 4.  4.]
 [ 5. nan]
 [ 5. nan]
 [nan  5.]
 [nan  5.]]

…is arg

rjeb · 2022-02-03T03:00:39Z

I am a little weary of the performance cost for checking for NaNs along an axis.
Currently this implementation is consistently 4 - 5 times slower when comparing to the release build (checked from n = 200 -> 2,000,000).

a = np.random.randint(10,90,(2000000,10))
a = np.append(a, [[np.nan, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]], axis = 0)
a = np.append(a, [[np.nan, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]], axis = 0)

np.unique(a, axis=0)

release_build = 2.240867 seconds (no equal_nan result)
PR_build = 10.631021 seconds (with equal_nan result)
PR_build = 2.202401 (equal_nans = False uses old implementation)

With 1-D arrays we check all the duplicate NaNs at the end of a sorted array, which scales with # of NaN, rather than the actual size of the array.
With 2-D arrays NaNs can appear in any index of any subarray, so I currently check each subarray and it's nan_equality with the subarray after it.
There is no built-in equality function that I have found ('==', 'array_equals', 'assert_array_equals') that can resolve equal_nans equality for a structured array so these comparisons I'm doing are also relatively expensive.

Any ideas for optimization would be much appreciated.

…gument simultaneously

seberg · 2022-06-02T19:24:08Z

The first part is merged now (so needs a rebase). The second part is still interesting, but on first glance needs a bit more thought I believe. After a brief look, I do not think we can go via strings.
I do think we may be able to go via comparisons maybe using < or >. The basic idea has to be to do some trick that allows NaNs to behave correctly. In general, that is replacing:

prev != next

with:

not prev == next

Which is the same, except that it also returns True for not np.nan == np.nan unlike !=. Now, for the "normal" branch, that is unfortunately maybe not optimal (because we have to invert the boolean array). However:

It might just be OK performance wise, we should try.
We could keep the branching, and only use this trick when the correct behavior is unclear.

The best part. That should effectively also fix object arrays containing NaNs.

rjeb added 3 commits January 24, 2022 14:31

Add equal-nans argument to np.unique

6309d34

BUG: fixed equal_nans functionality for np.unique when called with ax…

921b38d

…is arg

DOC, TST: Added test for np.unique equals_nan functionality

e9a32d1

github-actions bot added the 01 - Enhancement label Jan 26, 2022

DOC: Fix releasenote filename from 20891 to 20896

fbb3daf

rjeb force-pushed the feature-unique-nan-kwarg branch 3 times, most recently from 61bfb7b to 60ba6ff Compare February 1, 2022 08:18

MAINT: Fixed Lint warnings for added np.unique equal_nans code

d3eca24

rjeb force-pushed the feature-unique-nan-kwarg branch from 60ba6ff to d3eca24 Compare February 1, 2022 18:32

MAINT: Simplify logic for np.unique when using equal_nans and axis ar…

14fc087

…gument simultaneously

rjeb force-pushed the feature-unique-nan-kwarg branch from b23133b to 14fc087 Compare February 10, 2022 21:07

seberg added the 55 - Needs work label Jun 2, 2022

jiwaszki mentioned this pull request Mar 15, 2023

ENH/BUG: optimisations and fixes for np.unique with nan values #23288

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH, BUG: add equals_nans parameter to np.unique and fixed functionality with equal_nans along an axis #20896

ENH, BUG: add equals_nans parameter to np.unique and fixed functionality with equal_nans along an axis #20896

rjeb commented Jan 26, 2022

rjeb commented Feb 3, 2022 •

edited

Loading

seberg commented Jun 2, 2022

ENH, BUG: add equals_nans parameter to np.unique and fixed functionality with equal_nans along an axis #20896

Are you sure you want to change the base?

ENH, BUG: add equals_nans parameter to np.unique and fixed functionality with equal_nans along an axis #20896

Conversation

rjeb commented Jan 26, 2022

rjeb commented Feb 3, 2022 • edited Loading

seberg commented Jun 2, 2022

rjeb commented Feb 3, 2022 •

edited

Loading