ENH: np.unique: support hash based unique for string dtype #28767

Open
wants to merge 76 commits into
base: main
f620f3b
Support NPY_STRING, NPY_UNICODE
math-hiyoko Apr 15, 2025
20ccefe
unique for NPY_STRING and NPY_UNICODE
math-hiyoko Apr 16, 2025
38626b9
fix construct array
math-hiyoko Apr 16, 2025
56bd858
remove unneccessary include
math-hiyoko Apr 16, 2025
f79736a
refactor
math-hiyoko Apr 16, 2025
c4e5438
refactoring
math-hiyoko Apr 17, 2025
7c51049
comment
math-hiyoko Apr 17, 2025
bd70552
feature: unique for NPY_VSTRING
math-hiyoko Apr 18, 2025
cc8ece6
refactoring
math-hiyoko Apr 18, 2025
f7b20a0
remove unneccessary include
math-hiyoko Apr 18, 2025
d0170ed
add test
math-hiyoko Apr 18, 2025
dbb140f
add error message
math-hiyoko Apr 18, 2025
49ed502
linter
math-hiyoko Apr 18, 2025
0238cee
linter
math-hiyoko Apr 18, 2025
6905978
reserve bucket
math-hiyoko Apr 18, 2025
2fc1378
remove emoji from testcase
math-hiyoko Apr 18, 2025
1ad6d6c
fix testcase
math-hiyoko Apr 18, 2025
b478e15
remove error
math-hiyoko Apr 18, 2025
95bc405
fix testcase
math-hiyoko Apr 18, 2025
3f1811b
fix testcase name
math-hiyoko Apr 18, 2025
99e3662
use basic_string
math-hiyoko Apr 18, 2025
b99542a
fix testcase
math-hiyoko Apr 18, 2025
2589dd7
add ValueError
math-hiyoko Apr 18, 2025
3f40cdc
fix testcase
math-hiyoko Apr 18, 2025
68d5a7b
fix memory error
math-hiyoko Apr 18, 2025
d38c3e3
remove multibyte char
math-hiyoko Apr 18, 2025
8cf2c63
refactoring
math-hiyoko Apr 18, 2025
0165d6a
add multibyte char
math-hiyoko Apr 18, 2025
243be6b
refactoring
math-hiyoko Apr 18, 2025
a6e5d3c
fix memory error
math-hiyoko Apr 18, 2025
78b9dc6
fix GIL
math-hiyoko Apr 18, 2025
0464617
fix strlen
math-hiyoko Apr 18, 2025
908f495
remove PyArray_GETPTR1
math-hiyoko Apr 19, 2025
30d1d1a
refactoring
math-hiyoko Apr 19, 2025
36c167c
refactoring
math-hiyoko Apr 19, 2025
79d31e4
use optional
math-hiyoko Apr 19, 2025
00143f9
refactoring
math-hiyoko Apr 19, 2025
1cc09f3
refactoring
math-hiyoko Apr 19, 2025
b29981d
refactoring
math-hiyoko Apr 19, 2025
91c5d42
refactoring
math-hiyoko Apr 19, 2025
e9c3aac
fix comment
math-hiyoko Apr 19, 2025
8191f5f
linter
math-hiyoko Apr 19, 2025
4faf36a
add doc
math-hiyoko Apr 19, 2025
c6aaf39
DOC: fix
math-hiyoko Apr 19, 2025
1053bcb
DOC: fix format
math-hiyoko Apr 20, 2025
1afefbe
MNT: refactoring
math-hiyoko Apr 20, 2025
b5610b1
MNT: refactoring
math-hiyoko Apr 20, 2025
c28a7ce
ENH: Store pointers to strings in the set instead of the strings them…
math-hiyoko Apr 24, 2025
b17011e
FIX: length in memcmp
math-hiyoko Apr 24, 2025
c2d5868
ENH: refactoring
math-hiyoko Apr 24, 2025
7d4afe0
DOC: 49sec -> 34sec
math-hiyoko Apr 24, 2025
ad843b0
Update numpy/lib/_arraysetops_impl.py
math-hiyoko Apr 25, 2025
45ec2b3
DOC: Mention that hash-based np.unique returns unsorted strings
math-hiyoko Apr 25, 2025
52a982d
Merge branch 'feature/#28364' of github.com:math-hiyoko/numpy into fe…
math-hiyoko Apr 25, 2025
fff254e
ENH: support medium and long vstrings
math-hiyoko Apr 26, 2025
370bd8f
FIX: comment
math-hiyoko Apr 29, 2025
49dfcb4
ENH: use RAII wrapper
math-hiyoko Apr 29, 2025
c5745bf
FIX: error handling of string packing
math-hiyoko Apr 29, 2025
3ba9788
FIX: error handling of string packing
math-hiyoko Apr 29, 2025
376ad09
FIX: change default bucket size
math-hiyoko Apr 29, 2025
aa0db48
FIX: include
math-hiyoko Apr 30, 2025
7a2892f
FIX: cast
math-hiyoko Apr 30, 2025
896bcba
ENH: support equal_nan=False
math-hiyoko May 1, 2025
f1c1947
FIX: function equal
math-hiyoko May 1, 2025
f35123a
FIX: check the case if pack_status douesn't return NULL
math-hiyoko May 1, 2025
e6ea015
FIX: check the case if pack_status douesn't return NULL
math-hiyoko May 1, 2025
ddff98f
FIX: stderr
math-hiyoko May 1, 2025
2758e27
ENH: METH_VARARGS -> METH_FASTCALL
math-hiyoko May 2, 2025
a6dc86a
FIX: log
math-hiyoko May 2, 2025
9a936eb
FIX: release allocator
math-hiyoko May 3, 2025
1e967ee
FIX: comment
math-hiyoko May 3, 2025
52c2326
FIX: delete log
math-hiyoko May 3, 2025
6f18a43
ENH: implemented FNV-1a as hash function
math-hiyoko May 3, 2025
2a1bd41
bool -> npy_bool
math-hiyoko May 3, 2025
8b632f2
FIX: cast
math-hiyoko May 3, 2025
a7bfc08
34sec -> 35.1sec
math-hiyoko May 4, 2025
10 changes: 10 additions & 0 deletions doc/release/upcoming_changes/28767.change.rst
@@ -0,0 +1,10 @@
``unique_values`` for string dtypes may return unsorted data
------------------------------------------------------------
``np.unique`` now supports hash-based duplicate removal for string dtypes.
This enhancement extends the hash-table algorithm to byte strings (``'S'``),
Unicode strings (``'U'``), and the experimental string dtype (``'T'``, ``StringDType``).
As a result, calling ``np.unique()`` on an array of strings uses
the faster hash-based method to obtain unique values.
Note that the hash-based method does not guarantee that the returned unique values are sorted.
It also works for ``StringDType`` arrays containing ``None`` (missing values)
when ``equal_nan=True`` is used, which treats missing values as equal.
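Because the note above says the hash-based path does not guarantee sorted output, callers that depend on ordering may want to sort explicitly. The following is a minimal sketch, not part of the PR; the array contents and variable names are illustrative assumptions, and whether a given NumPy build takes the hash-based path is not checked here.

```python
import numpy as np

# Illustrative input (an assumption, not taken from the PR's tests).
words = np.array(["pear", "apple", "pear", "fig", "apple"], dtype="U")

# The order of `uniques` may depend on the NumPy version and code path
# (sort-based or hash-based), so it is not relied upon here.
uniques = np.unique(words)

# Sorting explicitly gives a deterministic order on either code path.
sorted_uniques = np.sort(uniques)
print(sorted_uniques.tolist())
```

Sorting a small array of unique values is usually cheap compared with deduplicating the full input, so this keeps most of the benefit of the faster path.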
10 changes: 10 additions & 0 deletions doc/release/upcoming_changes/28767.performance.rst
@@ -0,0 +1,10 @@
Performance improvements to ``np.unique`` for string dtypes
-----------------------------------------------------------
The hash-based algorithm for unique extraction provides
an order-of-magnitude speedup on large string arrays.
In an internal benchmark with about 1 billion string elements,
the hash-based ``np.unique`` completed in roughly 35.1 seconds,
compared to 498 seconds with the sort-based method,
which is about 14× faster for unsorted unique operations on strings.
This improvement greatly reduces the time to find unique values
in very large string datasets.
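To illustrate why a hash-based approach can outperform a sort-based one, the sketch below contrasts the two strategies in plain Python. It is a conceptual illustration only and is not NumPy's C++ implementation; the function names `unique_sorted` and `unique_hashed` are hypothetical, and the hash-based variant here happens to preserve first-occurrence order, which the NumPy implementation does not promise.

```python
from itertools import groupby


def unique_sorted(values):
    """Sort-based deduplication: O(n log n) comparisons, sorted output."""
    # After sorting, equal items are adjacent, so groupby collapses them.
    return [key for key, _ in groupby(sorted(values))]


def unique_hashed(values):
    """Hash-based deduplication: expected O(n) time, no sorting step."""
    seen = set()
    out = []
    for v in values:
        if v not in seen:  # expected O(1) hash lookup
            seen.add(v)
            out.append(v)
    return out


data = ["b", "a", "b", "c", "a"]
print(unique_sorted(data))  # ['a', 'b', 'c']
print(unique_hashed(data))  # ['b', 'a', 'c']
```

Both functions return the same set of values; only the order differs, which mirrors the behavior change described in the release note.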
2 changes: 1 addition & 1 deletion numpy/_core/src/multiarray/multiarraymodule.c
@@ -4572,7 +4572,7 @@ static struct PyMethodDef array_module_methods[] = {
     {"from_dlpack", (PyCFunction)from_dlpack,
         METH_FASTCALL | METH_KEYWORDS, NULL},
     {"_unique_hash", (PyCFunction)array__unique_hash,
-        METH_O, "Collect unique values via a hash map."},
+        METH_FASTCALL | METH_KEYWORDS, "Collect unique values via a hash map."},
     {NULL, NULL, 0, NULL}                /* sentinel */
 };
