ENH: hash based np.unique #25596

adrinjalali · 2024-01-16T14:45:58Z

This an effort to speed up np.unique.

The new implementation is not always faster. For instance:

In [11]: a = np.random.rand(10_000_000)

In [12]: b = np.random.randint(size=10_000_000, low=0, high=1000)

In [13]: %timeit np.unique(a, sorted=True)
375 ms ± 6.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit np.unique(a, sorted=False)
2.39 s ± 38.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit np.unique(b, sorted=True)
250 ms ± 7.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit np.unique(b, sorted=False)
117 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The code is taken from pandas as is, except removing pandas' references and pd.NA handling on missing values.

The new implementation is only used if the user specifies np.unique(..., sorted=False), which means at least there is no risk in slowing down people's existing code.

This PR still needs tests, and a LOT of benchmarks. We might decide not to support hash based unique for certain dtypes maybe? Not sure.

Questions:

Should I continue this path?
Where should I put the code? I just put it under numpy/_core now, but obviously it can be moved.
Are we okay taking the pandas' implementation, or should we use something else?

pinging @seberg since you had commented on the original issue.
@rgommers this is where I was stuck with writing the meson.build file. Your pointers solved my issues, but I still think it would be nice to have them documented somewhere.

rgommers · 2024-01-16T22:41:14Z

Thanks @adrinjalali. This is a lot of code ... let me try and give my first impressions.

I'm surprised that the float64 array timeit result is that much slower. Does that hold for pandas' implementation too, or is something wrong there?
The amount of code is going to be a maintenance burden. The templated Cython is also quite hard to read, and contains a bunch of malloc/free and low-level bit twiddling. It'd probably be more readable as C++ code than it is now.
Installed size is very large; hashtable.so is 10 MB in a dev build, which is about a fifth the size of all of _multiarray_umath.so (which contains most of NumPy's functionality implemented in compiled code),
The build time is massive, that alone looks like a clear showstopper (see profiling below),
You picked up several git submodule changes by accident, it'd be good to remove those

Build time tracing result, using:

$ vendored-meson/meson/meson.py setup build
$ ninja -C build
$ python ../scipy/tools/ninjatracing.py build/.ninja_log > trace.json

and then loading the trace file with https://ui.perfetto.dev:

Should I continue this path?
Are we okay taking the pandas' implementation, or should we use something else?

It seems to me that, while this is only a single algorithm, the performance aspect of np.unique is quite important. So pursuing this seems worthwhile in principle. However, it looks like the Pandas code is very bloated - to the point where it's a fair bit beyond what is acceptable. I don't have a good suggestion for another implementation though that could be used. Perhaps this is a case for tapping into the wisdom of a crowd, and asking on the mailing list.

We might decide not to support hash based unique for certain dtypes maybe?

Is the signed integer case the one you care about most for scikit-learn? I assume unique for floats is less interesting because floats are typically all unique if they're nonzero, and unsigned integers are fairly rarely used.

It looks to me like figuring out the potential gains (is it ~2x like in your quick benchmark here, or can it be 10x or more?) first and for what dtypes that matters would be good. If the code bloat can be trimmed down via fewer dtypes, it may be closer to viable perhaps.

Where should I put the code? I just put it under numpy/_core now, but obviously it can be moved.

Don't worry about that I'd say - the location you chose is fine.

@rgommers this is where I was stuck with writing the meson.build file. Your pointers solved my issues, but I still think it would be nice to have them documented somewhere.

That'd be more of a meson-python tutorial I think. This level of complexity and amount of code added is very rare, and really difficult to pre-emptively document in NumPy's dev docs. Most of this is also specific to Cython, which has a quite bad way of determining file extension modules; if it worked more like C/C++ then things like that fs.copyfile usage would not be needed.

seberg · 2024-01-17T07:45:46Z

I agree with Ralf. Additionally, while I very much like the idea of having hashing, it would be good to have the basic chunk of it in C/C++ (maybe as chunked loops, maybe even as a ufunc since you can reach into a ufuncs core). But not sure if it is viable speed wise.

It isn't perfect, but I don't care too much about starting limiting to numeric types, although I would like a plan on how to generalize it for user dtypes also. Although maybe the public API (also for user DTypes) could be at the oeration level (like unique).

(I am not sure how the hashing works, if it tries to replicate Python hash results or not. If it does, it might make sense to use a simpler hashing approach by knowing that we only compare hashes of the same dtype)

adrinjalali · 2024-01-22T09:46:16Z

All of that is fair enough. I was hoping porting pandas' code would be easy, since it's used and tested there. But I understand this is a lot of code and reviewing it from scratch wouldn't be easy.

I'll try to see what I can come up with less code and more of a C++ solution.

adrinjalali · 2024-01-31T18:21:58Z

Closing this PR to keep the PR tracker less noisy. My new PR would need to be a separate one anyway 😉

rgommers · 2024-01-31T19:02:55Z

Thanks @adrinjalali. Let me hit the close button then:)

adrinjalali added 7 commits December 14, 2023 16:16

Add Pandas code

8157c34

first commit

470b510

add python script

83b5354

...

58bb41d

it compiles

dfb8ba8

fix dispatch signature

b207bff

delete unneeded file

35238cd

adrinjalali changed the title ~~Add Pandas code~~ hash based np.unique Jan 16, 2024

adrinjalali added 2 commits January 16, 2024 15:49

Merge remote-tracking branch 'upstream/main' into unique

10bfe37

fix linter

2c2b22b

charris changed the title ~~hash based np.unique~~ MAINT: hash based np.unique Jan 16, 2024

rgommers changed the title ~~MAINT: hash based np.unique~~ ENH: hash based np.unique Jan 16, 2024

rgommers closed this Jan 31, 2024

adrinjalali mentioned this pull request May 13, 2024

ENH add hash based unique #26018

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: hash based np.unique #25596

ENH: hash based np.unique #25596

adrinjalali commented Jan 16, 2024

rgommers commented Jan 16, 2024

seberg commented Jan 17, 2024

adrinjalali commented Jan 22, 2024

adrinjalali commented Jan 31, 2024

rgommers commented Jan 31, 2024

ENH: hash based np.unique #25596

ENH: hash based np.unique #25596

Conversation

adrinjalali commented Jan 16, 2024

rgommers commented Jan 16, 2024

seberg commented Jan 17, 2024

adrinjalali commented Jan 22, 2024

adrinjalali commented Jan 31, 2024

rgommers commented Jan 31, 2024