
ENH: np.unique: support hash based unique for string dtype #28767


Open
wants to merge 76 commits into main

Conversation

@math-hiyoko commented Apr 18, 2025

Description

This PR introduces hash-based unique extraction for the NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer dtypes, has been generalized to accommodate additional dtypes, including the string-related ones. Minor refactoring was also performed to improve maintainability and readability.
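At its core, the approach inserts every element into a hash set and keeps only first occurrences. A minimal standalone C++ sketch of the idea (illustration only, operating on std::vector rather than the PR's NumPy buffers):

#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Sketch: hash-based unique over a vector of strings. string_view keys
// avoid copying the string bytes into the set.
std::vector<std::string> unique_hash(const std::vector<std::string> &in)
{
    std::unordered_set<std::string_view> seen;
    std::vector<std::string> out;
    for (const std::string &s : in) {
        if (seen.insert(s).second) {  // .second is true on first occurrence
            out.push_back(s);
        }
    }
    return out;
}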

Benchmark Results

The following benchmark demonstrates the performance improvement from the new implementation.
The test scenario (an array of one billion strings) follows the experimental setup described in #26018 (comment).

import random
import string
import time

import numpy as np
import polars as pl

chars = string.ascii_letters + string.digits
arr = np.array(
    [
        ''.join(random.choices(chars, k=random.randint(5, 10)))
        for _ in range(1_000)
    ] * 1_000_000,
    dtype='T',
)
np.random.shuffle(arr)

time_start = time.perf_counter()
print("unique count (hash based): ", len(np.unique(arr)))
time_elapsed = (time.perf_counter() - time_start)
print ("%5.3f secs" % (time_elapsed))

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = (time.perf_counter() - time_start)
print ("%5.3f secs" % (time_elapsed))

Result (the numpy main timing below comes from running the same measurement on NumPy's main branch):

unique count (hash based):  1000
35.127 secs
unique count (numpy main):  1000
498.011 secs
unique count (polars):  1000
74.023 secs

Closes #28364.

@math-hiyoko marked this pull request as draft on April 18, 2025 at 11:14.
def test_unique_vstring_hash_based(self):
# test for unicode and nullable string arrays
a = np.array(['straße', None, 'strasse', 'straße', None, 'niño', 'nino', 'élève', 'eleve', 'niño', 'élève'], dtype=StringDType(na_object=None))
unq_sorted_wo_none = ['eleve', 'nino', 'niño', 'strasse', 'straße', 'élève']
Member:
These strings should probably include items that exercise more of the functionality in StringDType internals. See NEP 55 and the StringDType tests for examples, but basically you only include "short" strings here, but probably also want to include both "medium" and "long" strings that are allocated on the heap outside of the ndarray buffer.

Author:
I actually encountered memory errors when adding medium and long strings to the tests, so I’ve implemented fixes addressing those issues as well. Thanks for pointing that out!

// function to calculate the hash of a string
template <typename T>
size_t str_hash(const T *str, npy_intp num_chars) {
// https://www.boost.org/doc/libs/1_88_0/libs/container_hash/doc/html/hash.html#notes_hash_combine
@ngoldbaum (Member) commented Apr 28, 2025:
This link says boost hasn't used this algorithm since Boost 1.56 in 2014 and currently uses an algorithm based on murmurhash3.

Rather than trying to combine the hashes of all the individual bytes, why not just calculate the hash directly using the char * or Py_UCS4 * buffers and a byte hashing algorithm?

Did you try FNV hash or another non-cryptographic hash? FNV is nice because it's very easy to implement as a standalone bit of C code.

http://www.isthe.com/chongo/tech/comp/fnv/index.html
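For reference, a 64-bit FNV-1a over a raw byte buffer is only a few lines of C++ (a sketch using the published offset basis and prime; fnv1a_64 is a hypothetical name):

#include <cstddef>
#include <cstdint>

// Sketch: FNV-1a, 64-bit variant, over a raw byte buffer.
static inline uint64_t fnv1a_64(const unsigned char *buf, size_t len)
{
    uint64_t hash = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; i++) {
        hash ^= (uint64_t)buf[i];
        hash *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
    return hash;
}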

Author:
I've updated the hash algorithm from the original implementation to FNV-1a.
Although the overall execution speed has slightly decreased due to other changes introduced simultaneously, the hash algorithm change itself resulted in a 94.4% reduction in hashing time.

@ngoldbaum (Member) left a comment:
Left a few comments inline.

npy_string_allocator *in_allocator = NpyString_acquire_allocator((PyArray_StringDTypeObject *)descr);
auto in_allocator_dealloc = finally([&]() {
    NpyString_release_allocator(in_allocator);
});
Member:
Nice, it might be worth adding C++ wrappers for the NpyString API somewhere in NumPy so this pattern can be used elsewhere.
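Such a wrapper could be as small as this (a sketch; AllocatorGuard is a hypothetical name built on the existing acquire/release pair):

// Sketch: RAII guard that releases the NpyString allocator exactly once.
class AllocatorGuard {
    npy_string_allocator *allocator_;
public:
    explicit AllocatorGuard(const PyArray_StringDTypeObject *descr)
        : allocator_(NpyString_acquire_allocator(descr)) {}
    ~AllocatorGuard() { NpyString_release_allocator(allocator_); }
    AllocatorGuard(const AllocatorGuard &) = delete;             // non-copyable:
    AllocatorGuard &operator=(const AllocatorGuard &) = delete;  // release must happen once
    npy_string_allocator *get() const { return allocator_; }
};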

assert_equal(count_none, 1)

a1_wo_none = sorted(x for x in a1 if x is not None)
assert_array_equal(a1_wo_none, unq_sorted_wo_none)
Member:
Nice, I guess this handles missing strings fine.

if ((*it)->buf == NULL) {
    NpyString_pack_null(out_allocator, packed_string);
}
else {
    NpyString_pack(out_allocator, packed_string, (*it)->buf, (*it)->size);
Member:
Both of these functions can fail and you have to check for and handle the error case.

Author:
I’ve updated it so that if packing fails, it returns NULL.
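For illustration, the checked calls might look roughly like this (a sketch; both functions report failure through a negative return value, and the surrounding error handling in the PR may differ):

// Sketch: check the packing calls, which return -1 on failure.
if ((*it)->buf == NULL) {
    if (NpyString_pack_null(out_allocator, packed_string) < 0) {
        return NULL;  // error indicator handling elided in this sketch
    }
}
else if (NpyString_pack(out_allocator, packed_string,
                        (*it)->buf, (*it)->size) < 0) {
    return NULL;
}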

npy_intp isize = PyArray_SIZE(self);

// Reserve hashset capacity in advance to minimize reallocations.
std::unordered_set<T> hashset(isize * 2);
Member:
Why isize * 2? Shouldn't isize (all distinct values) be enough?

Author:
Actually, I set the hashset capacity to 2 * isize not just to avoid reallocations, but also to reduce hash collisions as much as possible. I've added a comment in the code to clarify this reasoning as well.
https://www.cse.cuhk.edu.hk/irwin.king/_media/teaching/csci2100b/csci2100b-2013-05-hash.pdf

Member:
Relatively random comment, but I don't think we should reserve "enough" from the start (the * 2 also looks a bit like it includes the hashmap load factor?). This scheme is potentially larger than the original array, even though often the array may have very few unique elements.

If reserving is worth it, maybe it would make sense to reserve a relatively large default with min(isize, not_too_large), where not_too_large is maybe a few KiB?

Author:
Since we’re now only storing pointers in the unordered_set, its size shouldn’t be larger than the original array—especially for strings, where the set just holds pointers rather than the string data itself. I agree that for integers, your suggested approach would be more appropriate.

@seberg (Member) commented Apr 29, 2025:
If strings are short that argument doesn't quite hold. So for simplicity, I would suggest to just use the same scheme for everything.

(Also, I am not sure that the *2 is right at all, because I expect C++ will already add that factor based on the set load factor.)

Author:
Got it. I’ll use min(isize, not_too_large) instead.
For not_too_large, I’m thinking of setting it to 1024 (since each element is a pointer, so that’s about 4 KiB).

Author:
I’ve updated the code to use min(isize, 1024) as the initial bucket size for the hash set. I’ve also added a comment in the code to explain this choice.
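Concretely, the scheme settled on above reduces to something like (sketch):

#include <algorithm>  // std::min

// Sketch: cap the initial bucket count; 1024 pointer-sized entries keep
// the up-front allocation to a few KiB even for very large inputs.
std::unordered_set<T> hashset(std::min(isize, (npy_intp)1024));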

npy_intp *strideptr = NpyIter_GetInnerStrideArray(iter);
npy_intp *innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);
NPY_DISABLE_C_API;
PyThreadState *_save2 = PyEval_SaveThread();
Member:
You could use an RAII wrapper for this too.

Author:
I’ve added an RAII wrapper for this now.
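Such a guard could look like this (a sketch; ThreadUnlock is a hypothetical name pairing PyEval_SaveThread with PyEval_RestoreThread):

// Sketch: RAII wrapper that releases the GIL for the guard's lifetime.
class ThreadUnlock {
    PyThreadState *save_;
public:
    ThreadUnlock() : save_(PyEval_SaveThread()) {}
    ~ThreadUnlock() { PyEval_RestoreThread(save_); }
    ThreadUnlock(const ThreadUnlock &) = delete;
    ThreadUnlock &operator=(const ThreadUnlock &) = delete;
};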

});
auto hash = [](const npy_static_string *value) -> size_t {
    if (value->buf == NULL) {
        return 0;
Member:
Is it a problem that missing strings all hash the same? Probably not?

Author:
I actually think they should have the same hash value—otherwise, it wouldn’t be possible to implement behavior like equal_nan=True.

Member:
It would be a problem if they are not considered equal. In that case, this should hash based on the address.

Author:
In that case, I’ll update the unique_vstring function to accept an equal_nan option and handle the behavior accordingly.
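A sketch of how the functor pair could thread that flag through (hypothetical shape; fnv1a_64 refers to the FNV sketch above, and equal_nan is assumed to be a bool in scope):

#include <cstring>  // memcmp

auto hash = [](const npy_static_string *value) -> size_t {
    if (value->buf == NULL) {
        return 0;  // all missing strings land in one bucket
    }
    return (size_t)fnv1a_64((const unsigned char *)value->buf, value->size);
};
auto equal = [equal_nan](const npy_static_string *lhs,
                         const npy_static_string *rhs) -> bool {
    if (lhs->buf == NULL || rhs->buf == NULL) {
        // Two missing strings compare equal only when equal_nan is set.
        return equal_nan && lhs->buf == NULL && rhs->buf == NULL;
    }
    return lhs->size == rhs->size &&
           memcmp(lhs->buf, rhs->buf, lhs->size) == 0;
};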

@@ -4572,7 +4572,7 @@ static struct PyMethodDef array_module_methods[] = {
     {"from_dlpack", (PyCFunction)from_dlpack,
         METH_FASTCALL | METH_KEYWORDS, NULL},
     {"_unique_hash", (PyCFunction)array__unique_hash,
-        METH_O, "Collect unique values via a hash map."},
+        METH_VARARGS | METH_KEYWORDS, "Collect unique values via a hash map."},
Member:
Please use METH_FASTCALL | METH_KEYWORDS, you'll have to change the parsing, but you should find an example easily (and the call overhead will be much smaller).
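For reference, a fastcall entry point receives an argument vector plus kwnames instead of a tuple/dict; a sketch using NumPy's internal argument parser (the "equal_nan" keyword here is illustrative):

// Sketch of a METH_FASTCALL | METH_KEYWORDS entry point.
static PyObject *
array__unique_hash(PyObject *NPY_UNUSED(dummy),
                   PyObject *const *args, Py_ssize_t len_args,
                   PyObject *kwnames)
{
    PyObject *arr = NULL;
    npy_bool equal_nan = NPY_TRUE;
    NPY_PREPARE_ARGPARSER;
    if (npy_parse_arguments("_unique_hash", args, len_args, kwnames,
            "arr", NULL, &arr,
            "|equal_nan", &PyArray_BoolConverter, &equal_nan,
            NULL, NULL, NULL) < 0) {
        return NULL;
    }
    /* ... actual unique extraction ... */
}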

// NumPy API calls and Python object manipulations require holding the GIL.
Py_INCREF(descr);

// variables for the vstring; this operation requires holding the GIL.
Member:
Why did you make this change? I don’t think this is true.

Author:
From this comment and implementation:

/*NUMPY_API
* Acquire the mutex locking the allocator attached to *descr*.
*
* NpyString_release_allocator must be called on the allocator returned
* by this function exactly once.
*
* Note that functions requiring the GIL should not be called while the
* allocator mutex is held, as doing so may cause deadlocks.
*/
NPY_NO_EXPORT npy_string_allocator *
NpyString_acquire_allocator(const PyArray_StringDTypeObject *descr)
{
#if PY_VERSION_HEX < 0x30d00b3
    if (!PyThread_acquire_lock(descr->allocator->allocator_lock, NOWAIT_LOCK)) {
        PyThread_acquire_lock(descr->allocator->allocator_lock, WAIT_LOCK);
    }
#else
    PyMutex_Lock(&descr->allocator->allocator_lock);
#endif
    return descr->allocator;
}

@ngoldbaum (Member) commented May 2, 2025:
The docstring for that function doesn’t say you need to hold the GIL to call it. It says you shouldn’t call into functions requiring the GIL while holding the allocator lock, as that might deadlock.

Does that make sense? Is there a better way to phrase the docstring to avoid confusion?

Author:
I understand now. I don't think the docstring is misleading.

@math-hiyoko (Author):
@seberg @ngoldbaum
I've addressed all comments received so far.


Successfully merging this pull request may close these issues.

np.unique: support string dtypes
3 participants