-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: hash based np.unique #25596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: hash based np.unique #25596
Conversation
Thanks @adrinjalali. This is a lot of code ... let me try and give my first impressions.
Build time tracing result, using: $ vendored-meson/meson/meson.py setup build
$ ninja -C build
$ python ../scipy/tools/ninjatracing.py build/.ninja_log > trace.json and then loading the trace file with https://ui.perfetto.dev: ![]()
It seems to me that, while this is only a single algorithm, the performance aspect of
Is the signed integer case the one you care about most for scikit-learn? I assume It looks to me like figuring out the potential gains (is it ~2x like in your quick benchmark here, or can it be 10x or more?) first and for what dtypes that matters would be good. If the code bloat can be trimmed down via fewer dtypes, it may be closer to viable perhaps.
Don't worry about that I'd say - the location you chose is fine.
That'd be more of a meson-python tutorial I think. This level of complexity and amount of code added is very rare, and really difficult to pre-emptively document in NumPy's dev docs. Most of this is also specific to Cython, which has a quite bad way of determining file extension modules; if it worked more like C/C++ then things like that |
I agree with Ralf. Additionally, while I very much like the idea of having hashing, it would be good to have the basic chunk of it in C/C++ (maybe as chunked loops, maybe even as a ufunc since you can reach into a ufuncs core). But not sure if it is viable speed wise. It isn't perfect, but I don't care too much about starting limiting to numeric types, although I would like a plan on how to generalize it for user dtypes also. Although maybe the public API (also for user DTypes) could be at the oeration level (like (I am not sure how the hashing works, if it tries to replicate Python hash results or not. If it does, it might make sense to use a simpler hashing approach by knowing that we only compare hashes of the same dtype) |
All of that is fair enough. I was hoping porting pandas' code would be easy, since it's used and tested there. But I understand this is a lot of code and reviewing it from scratch wouldn't be easy. I'll try to see what I can come up with less code and more of a C++ solution. |
Closing this PR to keep the PR tracker less noisy. My new PR would need to be a separate one anyway 😉 |
Thanks @adrinjalali. Let me hit the close button then:) |
Fixes #11136
This an effort to speed up
np.unique
.The new implementation is not always faster. For instance:
The code is taken from
pandas
as is, except removing pandas' references andpd.NA
handling on missing values.The new implementation is only used if the user specifies
np.unique(..., sorted=False)
, which means at least there is no risk in slowing down people's existing code.This PR still needs tests, and a LOT of benchmarks. We might decide not to support hash based unique for certain dtypes maybe? Not sure.
Questions:
numpy/_core
now, but obviously it can be moved.pinging @seberg since you had commented on the original issue.
@rgommers this is where I was stuck with writing the
meson.build
file. Your pointers solved my issues, but I still think it would be nice to have them documented somewhere.