ENH: np.unique: support hash based unique for string dtype #28767
Conversation
numpy/lib/tests/test_arraysetops.py
def test_unique_vstring_hash_based(self):
    # test for unicode and nullable string arrays
    a = np.array(['straße', None, 'strasse', 'straße', None, 'niño', 'nino', 'élève', 'eleve', 'niño', 'élève'], dtype=StringDType(na_object=None))
    unq_sorted_wo_none = ['eleve', 'nino', 'niño', 'strasse', 'straße', 'élève']
These strings should probably include items that exercise more of the functionality in StringDType internals. See NEP 55 and the StringDType tests for examples: you only include "short" strings here, but you probably also want to include "medium" and "long" strings, which are allocated on the heap outside of the ndarray buffer.
I actually encountered memory errors when adding medium and long strings to the tests, so I’ve implemented fixes addressing those issues as well. Thanks for pointing that out!
// function to calculate the hash of a string
template <typename T>
size_t str_hash(const T *str, npy_intp num_chars) {
    // https://www.boost.org/doc/libs/1_88_0/libs/container_hash/doc/html/hash.html#notes_hash_combine
This link says boost hasn't used this algorithm since Boost 1.56 in 2014 and currently uses an algorithm based on murmurhash3.
Rather than trying to combine the hashes of all the individual bytes, why not just calculate the hash directly using the char * or Py_UCS4 * buffers and a byte hashing algorithm?
Did you try FNV hash or another non-cryptographic hash? FNV is nice because it's very easy to implement as a standalone bit of C code.
I've updated the hash algorithm from the original implementation to FNV-1a.
Although the overall execution speed has slightly decreased due to other changes introduced simultaneously, the hash algorithm change itself resulted in a 94.4% reduction in hashing time.
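For reference, here is a minimal sketch of FNV-1a over a byte buffer; the 64-bit offset basis and prime are the standard published constants, while the function name and signature are illustrative rather than the PR's exact code:

```cpp
#include <cstddef>
#include <cstdint>

// FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
inline size_t fnv1a_hash(const unsigned char *buf, size_t len) {
    uint64_t state = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; i++) {
        state ^= buf[i];
        state *= 1099511628211ULL;             // FNV-1a 64-bit prime
    }
    return static_cast<size_t>(state);
}
```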
Left a few comments inline.
npy_string_allocator *in_allocator = NpyString_acquire_allocator((PyArray_StringDTypeObject *)descr);
auto in_allocator_dealloc = finally([&]() {
    NpyString_release_allocator(in_allocator);
});
nice, might be worth adding C++ wrappers for the NpyString API somewhere in NumPy so this pattern can be used elsewhere.
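One possible shape for such a wrapper, sketched here as a hypothetical guard class (not an existing NumPy API):

```cpp
// Hypothetical RAII guard: acquires the string allocator lock on
// construction and releases it exactly once on scope exit.
class StringAllocatorGuard {
  public:
    explicit StringAllocatorGuard(const PyArray_StringDTypeObject *descr)
        : allocator_(NpyString_acquire_allocator(descr)) {}
    ~StringAllocatorGuard() { NpyString_release_allocator(allocator_); }
    // Non-copyable, since the lock must be released exactly once.
    StringAllocatorGuard(const StringAllocatorGuard &) = delete;
    StringAllocatorGuard &operator=(const StringAllocatorGuard &) = delete;
    npy_string_allocator *get() const { return allocator_; }
  private:
    npy_string_allocator *allocator_;
};
```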
assert_equal(count_none, 1)

a1_wo_none = sorted(x for x in a1 if x is not None)
assert_array_equal(a1_wo_none, unq_sorted_wo_none)
Nice, I guess this handles missing strings fine.
if ((*it)->buf == NULL) {
    NpyString_pack_null(out_allocator, packed_string);
} else {
    NpyString_pack(out_allocator, packed_string, (*it)->buf, (*it)->size);
Both of these functions can fail and you have to check for and handle the error case.
I’ve updated it so that if packing fails, it returns NULL.
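Roughly, the error-checked version looks like this (assuming the documented convention that both pack functions return -1 on failure):

```cpp
if ((*it)->buf == NULL) {
    // pack a missing string; -1 signals failure
    if (NpyString_pack_null(out_allocator, packed_string) < 0) {
        return NULL;
    }
}
else {
    // copy the string contents into the output allocator; -1 signals failure
    if (NpyString_pack(out_allocator, packed_string,
                       (*it)->buf, (*it)->size) < 0) {
        return NULL;
    }
}
```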
npy_intp isize = PyArray_SIZE(self);

// Reserve hashset capacity in advance to minimize reallocations.
std::unordered_set<T> hashset(isize * 2);
Why isize * 2? Shouldn't isize (all distinct values) be enough?
Actually, I set the hashset capacity to 2 * isize not just to avoid reallocations, but also to reduce hash collisions as much as possible. I've added a comment in the code to clarify this reasoning as well.
https://www.cse.cuhk.edu.hk/irwin.king/_media/teaching/csci2100b/csci2100b-2013-05-hash.pdf
Relatively random comment, but I don't think we should reserve "enough" from the start (the * 2 also looks a bit like it includes the hashmap load factor?). This scheme is potentially larger than the original array, even though often the array may have very few unique elements.
If reserving is worth it, maybe it would make sense to reserve a relatively large default with min(isize, not_too_large), where not_too_large is maybe a few kiB?
Since we’re now only storing pointers in the unordered_set, its size shouldn’t be larger than the original array—especially for strings, where the set just holds pointers rather than the string data itself. I agree that for integers, your suggested approach would be more appropriate.
If strings are short that argument doesn't quite hold. So for simplicity, I would suggest to just use the same scheme for everything.
(Also, I am not sure that the *2 is right at all, because I expect C++ will already add that factor based on the set load factor.)
Got it. I’ll use min(isize, not_too_large) instead.
For not_too_large, I’m thinking of setting it to 1024 (since each element is a pointer, so that’s about 4 KiB).
I’ve updated the code to use min(isize, 1024) as the initial bucket size for the hash set. I’ve also added a comment in the code to explain this choice.
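Sketched out, the agreed-on scheme looks roughly like this (T, hash, and equal stand in for the PR's actual element type and functors; <algorithm> is assumed for std::min):

```cpp
// Cap the initial bucket count: tiny inputs don't over-allocate, and a
// large, mostly-duplicate input doesn't reserve memory proportional to isize.
const npy_intp initial_buckets = std::min(isize, (npy_intp)1024);
std::unordered_set<T, decltype(hash), decltype(equal)> hashset(
        initial_buckets, hash, equal);
```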
npy_intp *strideptr = NpyIter_GetInnerStrideArray(iter);
npy_intp *innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);
NPY_DISABLE_C_API;
PyThreadState *_save2 = PyEval_SaveThread();
you could use an RAII wrapper for this too.
I’ve added an RAII wrapper for this now.
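A minimal sketch of such a wrapper, with a hypothetical class name (the PR's actual helper may differ):

```cpp
// Releases the GIL on construction, reacquires it when the scope exits.
class ReleaseGIL {
  public:
    ReleaseGIL() : save_(PyEval_SaveThread()) {}
    ~ReleaseGIL() { PyEval_RestoreThread(save_); }
    ReleaseGIL(const ReleaseGIL &) = delete;
    ReleaseGIL &operator=(const ReleaseGIL &) = delete;
  private:
    PyThreadState *save_;
};
```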
});
auto hash = [](const npy_static_string *value) -> size_t {
    if (value->buf == NULL) {
        return 0;
Is it a problem that missing strings all hash the same? Probably not?
I actually think they should have the same hash value—otherwise, it wouldn’t be possible to implement behavior like equal_nan=True.
It would be a problem if they are not considered equal. In that case, this should hash based on the address.
In that case, I’ll update the unique_vstring function to accept an equal_nan option and handle the behavior accordingly.
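A sketch of what the equal_nan-aware functors could look like (assuming the fnv1a_hash helper sketched earlier, std::memcmp from <cstring>, and an equal_nan flag in scope; names are illustrative):

```cpp
auto hash = [](const npy_static_string *value) -> size_t {
    if (value->buf == NULL) {
        return 0;  // all missing strings hash alike so they *can* compare equal
    }
    return fnv1a_hash((const unsigned char *)value->buf, value->size);
};
auto equal = [equal_nan](const npy_static_string *lhs,
                         const npy_static_string *rhs) -> bool {
    if (lhs->buf == NULL || rhs->buf == NULL) {
        // missing == missing only when equal_nan is requested
        return equal_nan && lhs->buf == NULL && rhs->buf == NULL;
    }
    return lhs->size == rhs->size &&
           std::memcmp(lhs->buf, rhs->buf, lhs->size) == 0;
};
```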
@@ -4572,7 +4572,7 @@ static struct PyMethodDef array_module_methods[] = {
     {"from_dlpack", (PyCFunction)from_dlpack,
         METH_FASTCALL | METH_KEYWORDS, NULL},
     {"_unique_hash", (PyCFunction)array__unique_hash,
-        METH_O, "Collect unique values via a hash map."},
+        METH_VARARGS | METH_KEYWORDS, "Collect unique values via a hash map."},
Please use METH_FASTCALL | METH_KEYWORDS, you'll have to change the parsing, but you should find an example easily (and the call overhead will be much smaller).
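A rough sketch of a METH_FASTCALL | METH_KEYWORDS handler using NumPy's internal npy_parse_arguments helper; the equal_nan keyword mirrors the discussion above, and the exact argument names, defaults, and converter are assumptions rather than the PR's final code:

```c
static PyObject *
array__unique_hash(PyObject *NPY_UNUSED(mod),
                   PyObject *const *args, Py_ssize_t len_args,
                   PyObject *kwnames)
{
    PyObject *arr_obj = NULL;
    npy_bool equal_nan = NPY_TRUE;  /* assumed default */

    NPY_PREPARE_ARGPARSER;
    if (npy_parse_arguments("_unique_hash", args, len_args, kwnames,
            "arr", NULL, &arr_obj,
            "|equal_nan", &PyArray_BoolConverter, &equal_nan,
            NULL, NULL, NULL) < 0) {
        return NULL;
    }
    /* ... dispatch to the hash-based unique implementation ... */
}
```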
// NumPy API calls and Python object manipulations require holding the GIL.
Py_INCREF(descr);

// variables for the vstring; this operation requires holding the GIL.
Why did you make this change? I don’t think this is true.
From this comment and implementation:
numpy/numpy/_core/src/multiarray/stringdtype/static_string.c
Lines 286 to 306 in 4905619
/*NUMPY_API
 * Acquire the mutex locking the allocator attached to *descr*.
 *
 * NpyString_release_allocator must be called on the allocator returned
 * by this function exactly once.
 *
 * Note that functions requiring the GIL should not be called while the
 * allocator mutex is held, as doing so may cause deadlocks.
 */
NPY_NO_EXPORT npy_string_allocator *
NpyString_acquire_allocator(const PyArray_StringDTypeObject *descr)
{
#if PY_VERSION_HEX < 0x30d00b3
    if (!PyThread_acquire_lock(descr->allocator->allocator_lock, NOWAIT_LOCK)) {
        PyThread_acquire_lock(descr->allocator->allocator_lock, WAIT_LOCK);
    }
#else
    PyMutex_Lock(&descr->allocator->allocator_lock);
#endif
    return descr->allocator;
}
The docstring for that function doesn’t say you need to hold the GIL to call it. It says you shouldn’t call into functions requiring the GIL while holding the allocator lock, as that might deadlock.
Does that make sense? Is there a better way to phrase the docstring to avoid confusion?
I understand now. I don't think the docstring is misleading.
d9333fa to a6dc86a (compare)
@seberg @ngoldbaum
Description
This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.
Benchmark Results
The following benchmark demonstrates a significant performance improvement from the new implementation.
The test scenario (a 1-billion-string array) follows the experimental setup described in #26018 (comment).
Result
Closes #28364