Skip to content

ENH: hash based np.unique #25596

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from
Closed

ENH: hash based np.unique #25596

wants to merge 9 commits into from

Conversation

adrinjalali
Copy link
Contributor

Fixes #11136

This an effort to speed up np.unique.

The new implementation is not always faster. For instance:

In [11]: a = np.random.rand(10_000_000)

In [12]: b = np.random.randint(size=10_000_000, low=0, high=1000)

In [13]: %timeit np.unique(a, sorted=True)
375 ms ± 6.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit np.unique(a, sorted=False)
2.39 s ± 38.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit np.unique(b, sorted=True)
250 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit np.unique(b, sorted=False)
117 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The code is taken from pandas as is, except removing pandas' references and pd.NA handling on missing values.

The new implementation is only used if the user specifies np.unique(..., sorted=False), which means at least there is no risk in slowing down people's existing code.

This PR still needs tests, and a LOT of benchmarks. We might decide not to support hash based unique for certain dtypes maybe? Not sure.

Questions:

  • Should I continue this path?
  • Where should I put the code? I just put it under numpy/_core now, but obviously it can be moved.
  • Are we okay taking the pandas' implementation, or should we use something else?

pinging @seberg since you had commented on the original issue.
@rgommers this is where I was stuck with writing the meson.build file. Your pointers solved my issues, but I still think it would be nice to have them documented somewhere.

@adrinjalali adrinjalali changed the title Add Pandas code hash based np.unique Jan 16, 2024
@charris charris changed the title hash based np.unique MAINT: hash based np.unique Jan 16, 2024
@rgommers
Copy link
Member

Thanks @adrinjalali. This is a lot of code ... let me try and give my first impressions.

  • I'm surprised that the float64 array timeit result is that much slower. Does that hold for pandas' implementation too, or is something wrong there?
  • The amount of code is going to be a maintenance burden. The templated Cython is also quite hard to read, and contains a bunch of malloc/free and low-level bit twiddling. It'd probably be more readable as C++ code than it is now.
  • Installed size is very large; hashtable.so is 10 MB in a dev build, which is about a fifth the size of all of _multiarray_umath.so (which contains most of NumPy's functionality implemented in compiled code),
  • The build time is massive, that alone looks like a clear showstopper (see profiling below),
  • You picked up several git submodule changes by accident, it'd be good to remove those

Build time tracing result, using:

$ vendored-meson/meson/meson.py setup build
$ ninja -C build
$ python ../scipy/tools/ninjatracing.py build/.ninja_log > trace.json

and then loading the trace file with https://ui.perfetto.dev:

image

Should I continue this path?
Are we okay taking the pandas' implementation, or should we use something else?

It seems to me that, while this is only a single algorithm, the performance aspect of np.unique is quite important. So pursuing this seems worthwhile in principle. However, it looks like the Pandas code is very bloated - to the point where it's a fair bit beyond what is acceptable. I don't have a good suggestion for another implementation though that could be used. Perhaps this is a case for tapping into the wisdom of a crowd, and asking on the mailing list.

We might decide not to support hash based unique for certain dtypes maybe?

Is the signed integer case the one you care about most for scikit-learn? I assume unique for floats is less interesting because floats are typically all unique if they're nonzero, and unsigned integers are fairly rarely used.

It looks to me like figuring out the potential gains (is it ~2x like in your quick benchmark here, or can it be 10x or more?) first and for what dtypes that matters would be good. If the code bloat can be trimmed down via fewer dtypes, it may be closer to viable perhaps.

Where should I put the code? I just put it under numpy/_core now, but obviously it can be moved.

Don't worry about that I'd say - the location you chose is fine.

@rgommers this is where I was stuck with writing the meson.build file. Your pointers solved my issues, but I still think it would be nice to have them documented somewhere.

That'd be more of a meson-python tutorial I think. This level of complexity and amount of code added is very rare, and really difficult to pre-emptively document in NumPy's dev docs. Most of this is also specific to Cython, which has a quite bad way of determining file extension modules; if it worked more like C/C++ then things like that fs.copyfile usage would not be needed.

@rgommers rgommers changed the title MAINT: hash based np.unique ENH: hash based np.unique Jan 16, 2024
@seberg
Copy link
Member

seberg commented Jan 17, 2024

I agree with Ralf. Additionally, while I very much like the idea of having hashing, it would be good to have the basic chunk of it in C/C++ (maybe as chunked loops, maybe even as a ufunc since you can reach into a ufuncs core). But not sure if it is viable speed wise.

It isn't perfect, but I don't care too much about starting limiting to numeric types, although I would like a plan on how to generalize it for user dtypes also. Although maybe the public API (also for user DTypes) could be at the oeration level (like unique).

(I am not sure how the hashing works, if it tries to replicate Python hash results or not. If it does, it might make sense to use a simpler hashing approach by knowing that we only compare hashes of the same dtype)

@adrinjalali
Copy link
Contributor Author

All of that is fair enough. I was hoping porting pandas' code would be easy, since it's used and tested there. But I understand this is a lot of code and reviewing it from scratch wouldn't be easy.

I'll try to see what I can come up with less code and more of a C++ solution.

@adrinjalali
Copy link
Contributor Author

Closing this PR to keep the PR tracker less noisy. My new PR would need to be a separate one anyway 😉

@rgommers
Copy link
Member

Thanks @adrinjalali. Let me hit the close button then:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unique() needlessly slow
3 participants