Skip to content

ENH, BUG: add equals_nans parameter to np.unique and fixed functionality with equal_nans along an axis #20896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

rjeb
Copy link
Contributor

@rjeb rjeb commented Jan 26, 2022

Addresses #20326 and #20873

np.unique previously had it's functionality changed so NaN values would be treated as non-unique.
This PR puts the functionality into the parameter equal_nans(default: True).

In addition, it fixes the inability to use equal_nans and axis functionality simultaneously.

Ex.

arr = np.array([
[np.nan, 5.],
[np.nan, 5.], 
[0, 0], 
[0, 0],
[1, 1], 
[2, np.nan],
[3, 3], 
[4, 4],
[5, np.nan], 
[5, np.nan]
])

np.unique(arr, axis = 0)

Result:

[[ 0.  0.]
 [ 1.  1.]
 [ 2. nan]
 [ 3.  3.]
 [ 4.  4.]
 [ 5. nan]
 [nan  5.]]

np.unique(arr, axis = 0, equal_nans = False)

Result:

[[ 0.  0.]
 [ 1.  1.]
 [ 2. nan]
 [ 3.  3.]
 [ 4.  4.]
 [ 5. nan]
 [ 5. nan]
 [nan  5.]
 [nan  5.]]

@rjeb rjeb force-pushed the feature-unique-nan-kwarg branch 3 times, most recently from 61bfb7b to 60ba6ff Compare February 1, 2022 08:18
@rjeb rjeb force-pushed the feature-unique-nan-kwarg branch from 60ba6ff to d3eca24 Compare February 1, 2022 18:32
@rjeb
Copy link
Contributor Author

rjeb commented Feb 3, 2022

I am a little weary of the performance cost for checking for NaNs along an axis.
Currently this implementation is consistently 4 - 5 times slower when comparing to the release build (checked from n = 200 -> 2,000,000).

a = np.random.randint(10,90,(2000000,10))
a = np.append(a, [[np.nan, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]], axis = 0)
a = np.append(a, [[np.nan, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]], axis = 0)

np.unique(a, axis=0)

  1. release_build = 2.240867 seconds (no equal_nan result)
  2. PR_build = 10.631021 seconds (with equal_nan result)
  3. PR_build = 2.202401 (equal_nans = False uses old implementation)

With 1-D arrays we check all the duplicate NaNs at the end of a sorted array, which scales with # of NaN, rather than the actual size of the array.
With 2-D arrays NaNs can appear in any index of any subarray, so I currently check each subarray and it's nan_equality with the subarray after it.
There is no built-in equality function that I have found ('==', 'array_equals', 'assert_array_equals') that can resolve equal_nans equality for a structured array so these comparisons I'm doing are also relatively expensive.

Any ideas for optimization would be much appreciated.

@rjeb rjeb force-pushed the feature-unique-nan-kwarg branch from b23133b to 14fc087 Compare February 10, 2022 21:07
@seberg
Copy link
Member

seberg commented Jun 2, 2022

The first part is merged now (so needs a rebase). The second part is still interesting, but on first glance needs a bit more thought I believe. After a brief look, I do not think we can go via strings.
I do think we may be able to go via comparisons maybe using < or >. The basic idea has to be to do some trick that allows NaNs to behave correctly. In general, that is replacing:

prev != next

with:

not prev == next

Which is the same, except that it also returns True for not np.nan == np.nan unlike !=. Now, for the "normal" branch, that is unfortunately maybe not optimal (because we have to invert the boolean array). However:

  1. It might just be OK performance wise, we should try.
  2. We could keep the branching, and only use this trick when the correct behavior is unclear.

The best part. That should effectively also fix object arrays containing NaNs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants