ENH add hash based unique #26018
Conversation
I marked this as draft in the GitHub UI, hope you don't mind.
Thanks, I had intended to have it as draft, but forgot.
Still got work to do: figuring out the dtypes / type lengths supported on all platforms, and finding a way to filter the dtypes that should go through the hash-based unique. But for now, this is giving me roughly a 10x speedup on 100,000,000 to 1,000,000,000 data points. EDIT: the speedup is more like 2-3x at these sizes with compiler optimizations enabled.
Example times on 1,000,000,000 samples:
Now that the code is pretty small, is this something we'd like to have in numpy? cc @seberg, @rgommers since you've been involved in this. I still need to fix the compiler issues for all the different compilers, and maybe figure out the macro defs to add 96- and 128-bit data, but for now I think we can discuss whether this is the right thing to do. Also, the script I use to benchmark:

```python
import time

import numpy as np
from numpy._core.unique import unique_hash
import polars as pl

arr = np.random.randint(0, 1000, 1000_000_000)

time_start = time.perf_counter()
print("unique count (hashmap): ", len(unique_hash(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (numpy main): ", len(np.unique(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)
```

Haven't included pandas since I guess they still need to do a release compatible with numpy 2 first.
This seems like a reasonable amount of code to maintain; it's more or less a wrapper around the C++ standard library.
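For a sense of what that wrapper amounts to, here is a minimal sketch of the hash-based approach in isolation, assuming a contiguous int64 buffer; the name and signature are illustrative, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Collect the distinct values of a contiguous, typed buffer into a hash
// set and copy them back out; the PR templates this idea over the
// supported dtypes.
std::vector<std::int64_t> unique_hash_sketch(const std::int64_t *data,
                                             std::size_t n) {
    std::unordered_set<std::int64_t> seen;
    seen.reserve(n);  // upper bound; avoids rehashing during insertion
    for (std::size_t i = 0; i < n; ++i) {
        seen.insert(data[i]);
    }
    // Hash-table iteration order, i.e. the result is NOT sorted.
    return {seen.begin(), seen.end()};
}
```

The unsorted output is also why the later discussion turns to `sorted=False` and documentation updates.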
Also totally a style thing, but to keep things consistent it probably makes more sense to add the
Nice work. I agree with what @ngoldbaum said above. This seems like a very clean and concise implementation.
You could install the nightly wheels from the scientific-python channel if you want (I don't think there is any compiled library that already has a 2.0-compatible release, given there is not yet a stable ABI?). I tried your benchmark case locally with pandas. I didn't fetch this branch to compare with, but also ran with released numpy as a point of comparison:
Related to
I tried and failed. One is C++, the other C, and they also have different macros related to compilation. So I'm not sure how to merge them.
@jorisvandenbossche
Should work, I think, but I guess you could defer that for now. You will have to export an
In C, you do cleanup with a
In this case, I am not sure what is better; it might be 1 (have a simple C entry point and just call into some C++ code that maybe doesn't even need objects at all) or 3 if you want to work with objects in C++.
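A hypothetical sketch of option 1, with illustrative names: a plain C-callable entry point that immediately forwards into C++, so no C++ objects or exceptions cross the boundary:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// All C++ machinery stays behind this internal helper.
static std::size_t count_unique_impl(const std::int64_t *data,
                                     std::size_t n) {
    std::unordered_set<std::int64_t> seen(data, data + n);
    return seen.size();
}

// The C entry point: plain C ABI, no C++ types in the signature.
extern "C" std::size_t count_unique(const std::int64_t *data,
                                    std::size_t n) {
    try {
        return count_unique_impl(data, n);
    } catch (...) {
        return SIZE_MAX;  // never let a C++ exception cross the C boundary
    }
}
```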
I doubt it matters much; there is nothing cryptographic here. The issue would be denial of service by creating a dataset with a huge number of unique values which all hash to the same thing. Note that e.g. Python only randomizes the hash for strings (unless dicts randomize them for non-strings?).
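To make that concern concrete: on common standard-library implementations (libstdc++, for instance), `std::hash` for integers is the identity function, so colliding keys are trivial to construct. A small illustration, not taken from the PR:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<std::int64_t> s;
    s.reserve(1000);  // fix the bucket count so these inserts never rehash
    const auto b = static_cast<std::int64_t>(s.bucket_count());
    // With identity hashing, every multiple of the bucket count lands in
    // bucket 0, degrading each insert/lookup toward O(n).
    for (std::int64_t i = 0; i < 1000; ++i) {
        s.insert(i * b);
    }
    std::cout << s.bucket_size(0) << " of " << s.size()
              << " values share one bucket\n";
}
```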
Doesn't seem like the failure is related to this PR:

```
=================================== FAILURES ===================================
___________ TestStringDiscovery.test_nested_arrays_stringlength[1.2] ___________

self = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
obj = 1.2

    @pytest.mark.parametrize("obj",
            [object(), 1.2, 10**43, None, "string"],
            ids=["object", "1.2", "10**43", "None", "string"])
    def test_nested_arrays_stringlength(self, obj):
        length = len(str(obj))
        expected = np.dtype(f"S{length}")
        arr = np.array(obj, dtype="O")
>       assert np.array([arr, arr], dtype="S").dtype == expected
E       RuntimeWarning: invalid value encountered in cast

arr        = array(1.2, dtype=object)
expected   = dtype('S3')
length     = 3
obj        = 1.2
self       = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
```
The fix for that was just merged.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
I added a small commit to:

- Use `sorted=False` for `unique_values` (and a release note)
- Some smaller maintenance, partially just small preferences, though.
- I removed the 0-size special case. I think it is just nice to not need it (and I don't consider 0-size to be important to optimize).
This LGTM, would be nice if someone could have a quick look over my changes.
```c
NPY_ALLOW_C_API;
PyArray_Descr *descr = PyArray_DESCR(self);
Py_INCREF(descr);
PyObject *res_obj = PyArray_NewFromDescr(
```
If there was a C++ exception after this, we would leak it... But, I don't think that is possible, so let's not worry about it.
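For context, a hypothetical RAII guard along these lines would plug that kind of leak; this is a sketch, not code from the PR:

```cpp
#include <Python.h>

// Holds a strong reference and drops it automatically, even if a C++
// exception unwinds the stack before the reference is handed off.
struct PyRefGuard {
    PyObject *obj;
    explicit PyRefGuard(PyObject *o) : obj(o) { Py_XINCREF(obj); }
    ~PyRefGuard() { Py_XDECREF(obj); }
    // Give up ownership, e.g. when passing to a reference-stealing API
    // such as PyArray_NewFromDescr.
    PyObject *release() { PyObject *o = obj; obj = nullptr; return o; }
};
```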
There's a segfault, and the docs need updating since the output is not sorted now @seberg
Whoops, need to not try to iterate (I think the setup is OK though). The output being sorted is not documented for
EDIT: 🤦 sorry, the examples do need updating of course, oops.
(let's pretend it may not even be stable!)
Just to note the observation: the s390x build segfaulted once here (i.e. not our test, but the compilation).
Let's give this a shot, thanks @adrinjalali; we discussed it briefly yesterday and I don't think it is helpful to try to iterate here more. There are a lot of smaller and larger follow-ups (larger e.g. thinking about using another hash map, maybe even one with some randomization, etc.).
Thank y'all for all the patience, all the reviews, and all the help. This was a steep learning curve for me, doing C++ after about 12 years, and never having done CPython bindings using its bare-bones API 😅 I loved every bit of this experience, and look forward to more contributions in this little corner ❤️
This calculates unique values with an unordered hash set in C++ for certain dtypes.
Towards #11136