ENH add hash based unique #26018
Conversation
I marked this as draft in the GitHub UI, hope you don't mind.
Thanks, I had intended to have it as draft, but forgot.
Still got work to do: figuring out the dtypes / type lengths supported on all platforms, and finding a way to filter the dtypes that should go through the hash-based unique. But for now, this is giving me roughly a 10x speedup on 100,000,000 to 1,000,000,000 data points. EDIT: the speedup is more like 2-3x at these sizes with compiler optimizations enabled.
Example times on 1,000,000,000 samples:
Now that the code is pretty small, is this something we'd like to have in numpy? cc @seberg, @rgommers since you've been involved in this. I still need to fix the compiler issues for all the different compilers, and maybe figure out the macro defs to add 96- and 128-bit data, but for now I think we can discuss whether this is the right thing to do. Also, the script I use to benchmark:

```python
import time

import numpy as np
from numpy._core.unique import unique_hash
import polars as pl

arr = np.random.randint(0, 1000, 1000_000_000)

time_start = time.perf_counter()
print("unique count (hashmap): ", len(unique_hash(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (numpy main): ", len(np.unique(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)
```

Haven't included pandas since I guess they still need to do a release compatible with numpy 2 first.
This seems like a reasonable amount of code to maintain; it's more or less a wrapper around the C++ standard library.
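For a sense of what that wrapper amounts to, here is a minimal sketch of the hash-based approach in isolation, assuming a contiguous int64 buffer; the name and signature are illustrative, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Collect the distinct values of a contiguous, typed buffer into a hash
// set and copy them back out; the PR templates this idea over the
// supported dtypes.
std::vector<std::int64_t> unique_hash_sketch(const std::int64_t *data,
                                             std::size_t n) {
    std::unordered_set<std::int64_t> seen;
    seen.reserve(n);  // upper bound; avoids rehashing during insertion
    for (std::size_t i = 0; i < n; ++i) {
        seen.insert(data[i]);
    }
    // Hash-table iteration order, i.e. the result is NOT sorted.
    return {seen.begin(), seen.end()};
}
```

The unsorted output is also why the later discussion turns to `sorted=False` and documentation updates.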
Also totally a style thing, but to keep things consistent it probably makes more sense to add the
Nice work. I agree with what @ngoldbaum said above. This seems like a very clean and concise implementation.
You could install the nightly wheels from the scientific-python channel if you want (I don't think there is any compiled library that already has a 2.0-compatible release, given there is not yet a stable ABI?). I tried your benchmark case locally with pandas. I didn't fetch this branch to compare with, but also ran with released numpy as a point of comparison:
Related to
I tried and failed. One is C++, the other C, and they also have different macros related to compilation. So I'm not sure how to merge them.
@jorisvandenbossche
Should work, I think, but I guess you could defer that for now. You will have to export an
In C, you do cleanup with a
In this case, I am not sure what is better; it might be 1 (have a simple C entry point and just call into some C++ code that maybe doesn't even need objects at all) or 3 if you want to work with objects in C++.
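A hypothetical sketch of option 1, with illustrative names: a plain C-callable entry point that immediately forwards into C++, so no C++ objects or exceptions cross the boundary:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// All C++ machinery stays behind this internal helper.
static std::size_t count_unique_impl(const std::int64_t *data,
                                     std::size_t n) {
    std::unordered_set<std::int64_t> seen(data, data + n);
    return seen.size();
}

// The C entry point: plain C ABI, no C++ types in the signature.
extern "C" std::size_t count_unique(const std::int64_t *data,
                                    std::size_t n) {
    try {
        return count_unique_impl(data, n);
    } catch (...) {
        return SIZE_MAX;  // never let a C++ exception cross the C boundary
    }
}
```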
I doubt it matters much; there is nothing cryptographic here. The issue would be denial of service by creating a dataset with a huge number of unique values which all hash to the same thing. Note that e.g. Python only randomizes the hash for strings (unless dicts randomize them for non-strings?).
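To make that concern concrete: on common standard-library implementations (libstdc++, for instance), `std::hash` for integers is the identity function, so colliding keys are trivial to construct. A small illustration, not taken from the PR:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<std::int64_t> s;
    s.reserve(1000);  // fix the bucket count so these inserts never rehash
    const auto b = static_cast<std::int64_t>(s.bucket_count());
    // With identity hashing, every multiple of the bucket count lands in
    // bucket 0, degrading each insert/lookup toward O(n).
    for (std::int64_t i = 0; i < 1000; ++i) {
        s.insert(i * b);
    }
    std::cout << s.bucket_size(0) << " of " << s.size()
              << " values share one bucket\n";
}
```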
Doesn't seem like the failure is related to this PR:

```
=================================== FAILURES ===================================
___________ TestStringDiscovery.test_nested_arrays_stringlength[1.2] ___________

self = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
obj = 1.2

    @pytest.mark.parametrize("obj",
            [object(), 1.2, 10**43, None, "string"],
            ids=["object", "1.2", "10**43", "None", "string"])
    def test_nested_arrays_stringlength(self, obj):
        length = len(str(obj))
        expected = np.dtype(f"S{length}")
        arr = np.array(obj, dtype="O")
>       assert np.array([arr, arr], dtype="S").dtype == expected
E       RuntimeWarning: invalid value encountered in cast

arr        = array(1.2, dtype=object)
expected   = dtype('S3')
length     = 3
obj        = 1.2
self       = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
```
The fix for that was just merged.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
I added a small commit to:

- Use `sorted=False` for `unique_values` (and a release note)
- Some smaller maintenance, partially just small preferences, though.
- I removed the 0-size special case. I think it is just nice to not need it (and I don't consider 0-size to be important to optimize).
This LGTM, would be nice if someone could have a quick look over my changes.
```c
NPY_ALLOW_C_API;
PyArray_Descr *descr = PyArray_DESCR(self);
Py_INCREF(descr);
PyObject *res_obj = PyArray_NewFromDescr(
```
If there was a C++ exception after this, we would leak it... But, I don't think that is possible, so let's not worry about it.
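For context, a hypothetical RAII guard along these lines would plug that kind of leak; this is a sketch, not code from the PR:

```cpp
#include <Python.h>

// Holds a strong reference and drops it automatically, even if a C++
// exception unwinds the stack before the reference is handed off.
struct PyRefGuard {
    PyObject *obj;
    explicit PyRefGuard(PyObject *o) : obj(o) { Py_XINCREF(obj); }
    ~PyRefGuard() { Py_XDECREF(obj); }
    // Give up ownership, e.g. when passing to a reference-stealing API
    // such as PyArray_NewFromDescr.
    PyObject *release() { PyObject *o = obj; obj = nullptr; return o; }
};
```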
There's a segfault, and the docs need updating since the output is not sorted now @seberg
Whoops, need to not try to iterate (I think the setup is OK though). The output being sorted is not documented for
EDIT: 🤦 sorry, the examples do need updating of course, oops.
(let's pretend it may not even be stable!)
Just to note the observation: the s390x build segfaulted once here (i.e. not our test, but the compilation).
Let's give this a shot, thanks @adrinjalali; we discussed it briefly yesterday and I don't think it is helpful to try to iterate here more. There are a lot of smaller and larger follow-ups (larger e.g. thinking about using another hash map, maybe even one with some randomization, etc.).
Thank y'all for all the patience, all the reviews, and all the help. This was a steep learning curve for me, doing C++ after about 12 years, and never having done CPython bindings using its bare-bones API 😅 I loved every bit of this experience, and look forward to more contributions in this little corner ❤️
This calculates unique values with an unordered hash set in C++ for certain dtypes.
Towards #11136