ENH: np.unique: support hash based unique for string dtype #28767
Conversation
Left a few comments inline.
```cpp
npy_intp isize = PyArray_SIZE(self);

// Reserve hashset capacity in advance to minimize reallocations.
std::unordered_set<T> hashset(isize * 2);
```
Why `isize * 2`? Shouldn't `isize` (all distinct values) be enough?
Actually, I set the hashset capacity to 2 * isize not just to avoid reallocations, but also to reduce hash collisions as much as possible. I've added a comment in the code to clarify this reasoning as well.
https://www.cse.cuhk.edu.hk/irwin.king/_media/teaching/csci2100b/csci2100b-2013-05-hash.pdf
Relatively random comment, but I don't think we should reserve "enough" from the start (the `* 2` also looks a bit like it includes the hashmap load factor?). This scheme is potentially larger than the original array, even though often the array may have very few unique elements.
If reserving is worth it, maybe it would make sense to reserve a relatively large default with `min(isize, not_too_large)`, where `not_too_large` is maybe a few kiB?
Since we’re now only storing pointers in the unordered_set, its size shouldn’t be larger than the original array, especially for strings, where the set just holds pointers rather than the string data itself. I agree that for integers, your suggested approach would be more appropriate.
If strings are short, that argument doesn't quite hold. So for simplicity, I would suggest just using the same scheme for everything.
(Also, I am not sure that the `*2` is right at all, because I expect C++ will already add that factor based on the set load factor.)
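For reference, a small standalone check of that load-factor point (illustrative only, not part of the PR): `std::unordered_set` rehashes on its own once `size() / bucket_count()` would exceed `max_load_factor()`, which defaults to 1.0, so a manual `* 2` on the bucket count duplicates headroom the container already manages.

```cpp
#include <iostream>
#include <unordered_set>

int main() {
    // Ask for at least 16 buckets up front.
    std::unordered_set<int> s(16);
    // Defaults to 1.0: the set rehashes by itself once the element count
    // would exceed bucket_count() * max_load_factor().
    std::cout << s.max_load_factor() << "\n";  // typically 1
    std::cout << s.bucket_count() << "\n";     // implementation-defined, >= 16
}
```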
Got it. I’ll use `min(isize, not_too_large)` instead.
For `not_too_large`, I’m thinking of setting it to 1024 (since each element is a pointer, that’s a few KiB).
I’ve updated the code to use `min(isize, 1024)` as the initial bucket size for the hash set. I’ve also added a comment in the code to explain this choice.
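A minimal sketch of the reservation scheme settled on above; `make_hashset` and the `npy_intp` alias are illustrative stand-ins, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>

// Stand-in for NumPy's pointer-sized signed integer type.
using npy_intp = std::ptrdiff_t;

template <typename T>
std::unordered_set<T> make_hashset(npy_intp isize) {
    // Cap the initial bucket count: an input that is mostly duplicates
    // should not pay for a bucket array proportional to its full length.
    // With pointer-sized buckets, 1024 is on the order of a few KiB.
    const npy_intp initial_buckets = std::min<npy_intp>(isize, 1024);
    return std::unordered_set<T>(static_cast<std::size_t>(initial_buckets));
}
```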
```cpp
});
auto hash = [](const npy_static_string *value) -> size_t {
    if (value->buf == NULL) {
        return 0;
```
Is it a problem that missing strings all hash the same? Probably not?
I actually think they should have the same hash value; otherwise, it wouldn’t be possible to implement behavior like `equal_nan=True`.
It would be a problem if they are not considered equal. In that case, this should hash based on the address.
In that case, I’ll update the unique_vstring function to accept an equal_nan option and handle the behavior accordingly.
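A hedged sketch of how such an `equal_nan` option could be threaded through the set's functors; the struct below mirrors the size/buf layout of `npy_static_string` (where `buf == NULL` marks a missing string), and `StringHash`/`StringEqual` are hypothetical names, not the PR's implementation:

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <string_view>
#include <unordered_set>

// Mirrors npy_static_string's layout for illustration; buf == NULL is NA.
struct npy_static_string {
    size_t size;
    const char *buf;
};

struct StringHash {
    bool equal_nan;
    size_t operator()(const npy_static_string *v) const {
        if (v->buf == NULL) {
            // equal_nan=true: all missing strings share one hash so they can
            // compare equal; otherwise hash by address to keep each missing
            // entry distinct.
            return equal_nan ? 0 : std::hash<const void *>{}(v);
        }
        return std::hash<std::string_view>{}(std::string_view(v->buf, v->size));
    }
};

struct StringEqual {
    bool equal_nan;
    bool operator()(const npy_static_string *a,
                    const npy_static_string *b) const {
        if (a->buf == NULL || b->buf == NULL) {
            return equal_nan ? (a->buf == NULL && b->buf == NULL) : (a == b);
        }
        return a->size == b->size && std::memcmp(a->buf, b->buf, a->size) == 0;
    }
};

using StringSet =
    std::unordered_set<const npy_static_string *, StringHash, StringEqual>;

// Usage: StringSet set(1024, StringHash{equal_nan}, StringEqual{equal_nan});
```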
Force-pushed from d9333fa to a6dc86a.
@seberg @ngoldbaum
Description
This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.
Benchmark Results
The following benchmark demonstrates a significant performance improvement from the new implementation.
The test scenario (an array of 1 billion strings) follows the experimental setup described in #26018 (comment).
Result
Closes #28364.