-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH/BUG: optimisations and fixes for np.unique with nan values #23288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Can you add a test to see if this also works for masked arrays? If possible, it'd be nice if this PR can close #23281 as well. |
Sure thing! But first, regarding the problems with >>> import numpy as np
>>> import numpy.ma as ma
>>> a = ma.array([1,2,3,4,2,3,1,4], mask=[0,0,1,1,0,0,0,1])
>>>
>>> out = np.unique(a, return_index=True)
>>> print(type(out), out)
<class 'tuple'> (masked_array(data=[1, 2, 3, --],
mask=[False, False, False, True],
fill_value=999999), array([0, 1, 5, 2]))
>>>
>>> out = np.unique(a, axis=0, return_index=True)
>>> print(type(out), out)
<class 'tuple'> (array([1, 2, 3, 4]), array([0, 1, 2, 3])) It will take me some extra time to figure out how to fix it and keep performance "as-is". Thanks for the heads up! |
I have my doubts that this works. The NaN can be only in one place (i.e. the first), in which case it isn't even sorted to the end ( It is indeed interesting whether this helps the masked array code paths, but likely the masked array fix lies not here but must be in |
@seberg you are right! I am back with the another proposition. Some highlights:
I was scratching my head a lot, this "nan issue" seemed to be unsolvable in any (efficient) way. I have found this interesting solution by @rjeb #20896 , it was also reviewed by you. It was a game-changer for me. Fixing the bug:Generally, I applied the similar idea. For structured types it goes down recursively to check every element correctly. The basic building block "checking of nans" is done like this: a != b and not(np.isnan(a) and np.isnan(b)) @rjeb solution was not perfect in "nested" scenario. Now it works flawlessly in all existing test cases. For structured types it might not be the most performant implementation... but hey! It works, looks clean and solves "nan issue":) Optimising solution:While working on this case I have found two really great optimisations to apply. I put a lot of comments for future devs. I think this is a quite advanced implementation and it will be better to look at the code at the same time. Based on some properties (like dtype, array shape and axis) there are cases which allow to simplify required computations. "Consolidating" of the array is always making it 2-D structured-like ndarray. This is not good for the performance, thus if given function input:
_nans = np.isnan(aux)
mask[1:] = np.any((aux[1:] != aux[:-1]) & ~(_nans[1:] & _nans[:-1]), axis=1) Let numbers speak for themselves:
Thanks in advance for another review! |
@seberg ping |
@seberg could you review it again? Thanks! |
d9c2a7b
to
67f12f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite complicated, to make it easier for me to get started: do I get it right that the approach is to use the structured dtype trick for sorting only?
If we undo that view immediately after sorting, we then apply the same NaN sanitizing logic (just with the right .any()
/.all()
calls to account for the extra axis sprinkled in?
In that case, I suspect you can simplify the branching a bit, just view the old dtype (if it was already structured, no harm done).
(I almost wonder if np.lexsort
might not be as well for that sorting trick.)
# Then this optimization might be used to speed up unique computation along | ||
# axis equal to either 0 or 1. | ||
if (orig_dim == 2 and len(orig_dtype) < 2 and | ||
all(i > 0 for i in orig_shape)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why orig_ndim matters. Isn't the array always flattened along all but one dim?
BUG: fix for np.unique with nan values (See #23286)
Fix np.unique which returns multiple NaNs despite setting
equal_nan=True
whenaxis!=None
.According to on-going discussion in #23286 - there should be a way to fix this behaviour using new function/ufunc. Let's treat this PR as a hot-fix.
Brief of changes:
(np.isnan(ar)).any()
to determine if there are any NaN values. If not, redirect function to "fallback" in if-else body.axis=None
and sorting properties. Leave comments for future devs.V
(np.void
) to be processed in_unique1d()
.None
/0
.Usage:
python runtests.py --bench-compare main bench_lib.Unique
axis=0+
(which is currently a price for working "NaN-detection").Main branch:
This PR:
TODO:
There is another one "early return" case which I am considering. If the input array is already 1D,
axis
value is not affecting the output shape as it can be only set to eitherNone
or0
, axis is swaping with itself: