
ENH add hash based unique #26018


Merged: 53 commits merged into numpy:main, Feb 27, 2025

Conversation

adrinjalali
Contributor

@adrinjalali adrinjalali commented Mar 14, 2024

This calculates unique values with an unordered hashset in C++ for certain dtypes.

Towards #11136

@ngoldbaum ngoldbaum marked this pull request as draft March 15, 2024 14:58
@ngoldbaum
Member

I marked this as a draft in the GitHub UI, hope you don't mind.

@adrinjalali
Contributor Author

Thanks, I had intended to mark it as a draft but forgot.

@adrinjalali
Contributor Author

adrinjalali commented Mar 19, 2024

Still got work to do on figuring out which dtypes / type lengths are supported on all platforms, and on a way to filter the dtypes that should go through the hash-based unique, but for now this is giving me roughly a 10x speedup on 100_000_000 to 1_000_000_000 data points.

EDIT: the speedup is more like 2-3x at these sizes with compiler optimizations enabled.

@adrinjalali
Contributor Author

Example times on 1,000,000,000 samples:

unique count (hashmap):  1000
7.815 secs
unique count (numpy main):  1000
21.436 secs
unique count (polars):  1000
16.768 secs

Now that the code is pretty small, is this something we'd like to have in numpy?

cc @seberg, @rgommers since you've been involved in this.

I still need to fix the compilation issues across the different compilers, and maybe figure out the macro definitions to add 96- and 128-bit data, but for now I think we can discuss whether this is the right thing to do.

Also, the script I use to benchmark:

import time

import numpy as np
import polars as pl

from numpy._core.unique import unique_hash  # from this PR's branch

# 1e9 integers drawn from 1000 distinct values.
arr = np.random.randint(0, 1000, 1_000_000_000)

# Hash-based unique added in this PR.
time_start = time.perf_counter()
print("unique count (hashmap): ", len(unique_hash(arr)))
print("%5.3f secs" % (time.perf_counter() - time_start))

# Sort-based np.unique on numpy main.
time_start = time.perf_counter()
print("unique count (numpy main): ", len(np.unique(arr)))
print("%5.3f secs" % (time.perf_counter() - time_start))

# polars' unique, for comparison.
time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
print("%5.3f secs" % (time.perf_counter() - time_start))

I haven't included pandas since I guess they still need to do a release compatible with NumPy 2 first.

@adrinjalali adrinjalali changed the title [DRAFT] [IGNORE] ENH add hash based unique [DRAFT] ENH add hash based unique Mar 21, 2024
@ngoldbaum
Member

This seems like a reasonable amount of code to maintain; it's more or less a wrapper around the C++ standard library unordered_map.
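
For context, the core idea is small enough to sketch. A minimal, hypothetical version for one fixed-width type (not the PR's actual code, which has to handle multiple dtypes):

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: collect the distinct values of a contiguous
// int64_t buffer in a hash set; output order is unspecified.
std::vector<int64_t> unique_hash_sketch(const int64_t *data, size_t n)
{
    std::unordered_set<int64_t> seen;
    seen.reserve(n);  // avoid rehashing; trades memory for speed
    for (size_t i = 0; i < n; i++) {
        seen.insert(data[i]);
    }
    return std::vector<int64_t>(seen.begin(), seen.end());
}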

@ngoldbaum
Member

Also totally a style thing, but to keep things consistent it probably makes more sense to add the unique_hash function to the main numpy._core._multiarray_umath module, where most other numpy core functionality lives, than to define a new np._core.unique module.

@rgommers
Member

Now that the code is pretty small, is this something we'd like to have in numpy?

Nice work. I agree with what @ngoldbaum said above. This seems like a very clean and concise implementation.

@jorisvandenbossche
Contributor

jorisvandenbossche commented Mar 22, 2024

I haven't included pandas since I guess they still need to do a release compatible with NumPy 2 first.

You could install the nightly wheels from the scientific-python channel if you want (I don't think there is any compiled library that already has a 2.0-compatible release, given there is not yet a stable ABI?).

I tried your benchmark case locally with pandas. I didn't fetch this branch to compare with, but also ran with released numpy as a point of comparison:

%time len(pd.unique(arr))
CPU times: user 2.74 s, sys: 0 ns, total: 2.74 s
Wall time: 2.74 s

%time len(np.unique(arr))
CPU times: user 39.7 s, sys: 1.15 s, total: 40.8 s
Wall time: 40.9 s

Related to std::unordered_map: you can find quite a lot of content "on the internet" complaining about how slow it is compared to other implementations (I remember looking around a while back for pandas, wondering if we should consider using a different implementation; pandas uses khash). But a lot of those posts are also quite a few years old, and C++ standard library implementations may have improved since. It's definitely convenient to use.

@adrinjalali
Contributor Author

@ngoldbaum

Also totally a style thing, but to keep things consistent it probably makes more sense to add the unique_hash function to the main numpy._core._multiarray_umath module, where most other numpy core functionality lives, than to define a new np._core.unique module.

I tried and failed. One is C++, the other C, and they also have different compilation-related macros, so I'm not sure how to merge them.

@seberg

  • bools and floats: good points. I'll make this PR "less smart" and add dtype-specific code for those types. Although I think those could also be a separate PR, to keep this one small.
  • refactoring: I'll try
  • Could you elaborate on OwnedRef, please? I'll check out the pybind thing to see how exceptions should be handled in the meantime.
  • randomized hash: they're compiler specific, and AFAIK at least for gcc they might be anything but randomized. Pandas uses https://github.com/veorq/SipHash, which is cryptographically secure. But there are other randomized hash functions which are faster, though not as secure. We could also consider adding that / switching to it in a separate PR maybe? Or would you say having a terrible hash function like the one in this PR is too bad to be merged into numpy?

@jorisvandenbossche
Yes, pandas is really fast, and last I compared, for unique, it was faster than polars. But it's also a huge piece of code on the pandas side. Porting that in here was the first thing I tried (#25596).

@seberg
Member

seberg commented May 13, 2024

I tried and failed. One is C++, the other C

Should work, I think, but I guess you could defer that for now. You would have to export an extern "C" function to be called from the multiarray module.
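
Roughly like this, as a sketch with hypothetical names (assuming the C++ side never lets an exception escape across the C boundary):

#include <Python.h>

// Internal C++ helper; free to use the STL, exceptions, etc.,
// as long as nothing propagates past the C entry point.
static PyObject *
unique_hash_impl(PyObject *arr)
{
    // ... the actual hash-set work would go here; placeholder only.
    Py_INCREF(arr);
    return arr;
}

// C linkage: no C++ name mangling, so the C multiarray module can
// simply declare this symbol and call it.
extern "C" PyObject *
array_unique_hash(PyObject *self, PyObject *arr)
{
    (void)self;  // module reference, unused here
    return unique_hash_impl(arr);
}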

Could you elaborate on OwnedRef, please?

In C, you do cleanup with a goto error: /* cleanup code */. In C++ you don't use goto error: but rely on exceptions; when an exception gets triggered (which could happen in many places, e.g. when out of memory), you still need to run that cleanup code.
In particular, if you hold a PyObject * you must decref it. There are three solutions to that:

  1. You use C++ only in a core function that uses only borrowed references.
  2. You wrap everything in a try/catch...
  3. You introduce a new OwnedRef which does nothing but store a PyObject *, but defines a destructor (sketched below). That way C++ does the right thing.

In this case, I am not sure what is better; it might be 1 (have a simple C entry-point and just call into some C++ code that maybe doesn't even need objects at all) or 3 if you want to work with objects in C++.
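
A minimal sketch of what option 3 could look like (hypothetical, not NumPy's eventual implementation):

#include <Python.h>

// RAII wrapper owning one Python reference: if a C++ exception
// unwinds the stack, the destructor still runs and the reference
// is released, so no `goto error:` cleanup is needed.
struct OwnedRef {
    PyObject *obj;

    explicit OwnedRef(PyObject *o = nullptr) : obj(o) {}
    ~OwnedRef() { Py_XDECREF(obj); }

    // Non-copyable: a copy would double-decref on destruction.
    OwnedRef(const OwnedRef &) = delete;
    OwnedRef &operator=(const OwnedRef &) = delete;

    // Hand ownership back, e.g. to return the object to the caller.
    PyObject *release() {
        PyObject *tmp = obj;
        obj = nullptr;
        return tmp;
    }
};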

having a terrible hash function like the one in this PR is too bad to be merged into numpy?

I doubt it matters much; there is nothing cryptographic here. The issue would be denial of service: creating a dataset with a huge number of unique values which all hash to the same thing. Note that e.g. Python only randomizes the hash for strings (unless dicts randomize them for non-strings?).
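
For illustration, the kind of seed-randomized hash alluded to above could look like this (hypothetical, not part of this PR):

#include <cstddef>
#include <cstdint>
#include <random>

// One random seed per process, mixed into each value with the
// splitmix64 finalizer, so colliding inputs cannot be precomputed.
struct SeededHash {
    static uint64_t seed() {
        static const uint64_t s =
            (static_cast<uint64_t>(std::random_device{}()) << 32) ^
            std::random_device{}();
        return s;
    }
    size_t operator()(int64_t v) const {
        uint64_t x = static_cast<uint64_t>(v) ^ seed();
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return static_cast<size_t>(x);
    }
};

A std::unordered_set<int64_t, SeededHash> would then pick up the randomized behavior without other code changes.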

@adrinjalali
Contributor Author

Doesn't seem like the failure is related to this PR:

=================================== FAILURES ===================================
___________ TestStringDiscovery.test_nested_arrays_stringlength[1.2] ___________

self = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>
obj = 1.2

    @pytest.mark.parametrize("obj",
            [object(), 1.2, 10**43, None, "string"],
            ids=["object", "1.2", "10**43", "None", "string"])
    def test_nested_arrays_stringlength(self, obj):
        length = len(str(obj))
        expected = np.dtype(f"S{length}")
        arr = np.array(obj, dtype="O")
>       assert np.array([arr, arr], dtype="S").dtype == expected
E       RuntimeWarning: invalid value encountered in cast

arr        = array(1.2, dtype=object)
expected   = dtype('S3')
length     = 3
obj        = 1.2
self       = <test_array_coercion.TestStringDiscovery object at 0x3cd6fb17d010>

@ngoldbaum
Member

The fix for that was just merged.

seberg and others added 2 commits February 25, 2025 12:51
@seberg seberg removed the "56 - Needs Release Note" label (needs an entry in doc/release/upcoming_changes) Feb 25, 2025
@seberg
Member

I added a small commit to:

  • Use sorted=False for unique_values (and a release note)
  • Some smaller maintenance changes, partly just small preferences.
  • I removed the 0-size special case. I think it is just nice not to need it (and I don't consider 0-size important to optimize).

This LGTM, would be nice if someone could have a quick look over my changes.

NPY_ALLOW_C_API;
PyArray_Descr *descr = PyArray_DESCR(self);
Py_INCREF(descr);
PyObject *res_obj = PyArray_NewFromDescr(
Member

If there was a C++ exception after this, we would leak it... But, I don't think that is possible, so let's not worry about it.

@adrinjalali
Contributor Author

There's a segfault, and the docs need updating since the output is not sorted now @seberg

@seberg
Member

seberg commented Feb 25, 2025

There's a segfault, and the docs need updating since the output is not sorted now

Whoops, need to not try to iterate (I think the setup is OK though).

The output being sorted is not documented for unique_values, which is the one I changed.

EDIT: 🤦 sorry, the examples do need updating of course, oops.

(let's pretend it may not even be stable!)
@seberg
Member

seberg commented Feb 25, 2025

Just to note the observation: The s390x build segfaulted once here. (i.e. not our test, but the compilation)

@seberg
Member

seberg commented Feb 27, 2025

Let's give this a shot; thanks @adrinjalali. We discussed it briefly yesterday and I don't think it is helpful to try to iterate here more.

There are a lot of smaller and larger follow-ups (larger e.g. thinking about using another hash-map, maybe even one with some randomization, etc.).

@seberg seberg merged commit 9e557eb into numpy:main Feb 27, 2025
68 checks passed
@adrinjalali
Contributor Author

Thank y'all for all the patience, all the reviews, and all the help. This was a steep learning curve for me, doing C++ after about 12 years, and never having done CPython bindings using its bare-bones API 😅

I loved every bit of this experience, and look forward to more contributions on this little corner ❤️
