ENH: np.unique and sorting is slow for StringDType #26510

ngoldbaum · 2024-05-23T16:01:30Z

I also did some basic performance tests in v2.0.0rc2 for the experimental StringDType but it seems to be much slower than "O" or "U" dtypes in certain cases, are you aware of these issues:

import numpy
import random
import timeit

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype="T", count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))  #  0.078

Profiling (will include in a separate comment below) indicates that the overhead is from acquiring and releasing the allocator lock (it does it for every entry in the array because sorting is implemented using getitem).

We could improve performance for sorting specifically by adding a fast getitem path that assumes the allocator lock is already acquired and using it in the sorting implementation after acquiring the lock once.

Optimizing getitem and setitem would also help (maybe by adding a stringdtype scalar that interns the packed string).

Originally posted by @MarkusSintonen in #25693 (comment)

The text was updated successfully, but these errors were encountered:

ngoldbaum · 2024-05-23T16:08:35Z

Running this script under py-spy with --native:

import numpy as np
import random

options = ["a", "bb", "cc", "ddd"]
lst = random.choices(options, k=1000)
arr_s = np.array(lst, dtype="T")

for _ in range(10000):
    arr_s.sort()

Produces this flame graph:

ngoldbaum · 2024-08-16T20:23:16Z

I looked at this again.

One thing that immediately helps is switching to a more performant mutex implementation, and indeed running on Python 3.13rc1 that uses PyMutex I found that the stringdtype version beats object array, but unicode is still substantially faster:

± spin python ../numpy-experiments/test.py
Invoking `build` prior to invoking Python:
$ /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/python3.13 vendored-meson/meson/meson.py compile -C build
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/ninja -C /Users/goldbaum/Documents/numpy/build
ninja: Entering directory `/Users/goldbaum/Documents/numpy/build'
[1/1] Generating numpy/generate-version with a custom command
Saving version to numpy/version.py
$ /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/python3.13 vendored-meson/meson/meson.py install --only-changed -C build --destdir ../build-install
$ export PYTHONPATH="/Users/goldbaum/Documents/numpy/build-install/usr/lib/python3.13/site-packages"
🐍 Launching Python with PYTHONPATH="/Users/goldbaum/Documents/numpy/build-install/usr/lib/python3.13/site-packages"
$ /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/python3.13 -P ../numpy-experiments/test.py
2.2.0.dev0+git20240816.26cba75
0.9557965830899775
1.3458311669528484
0.2885478753596544
0.06409212481230497

So I think it's probably still worth it to special-case the sorting. Or at least do something so the lock only needs to be acquired once. Looking at actually adding special sorting routines would mean re-implementing the algorithms in quicksort.cpp, timsort.cpp, etc etc which seems not worth doing.

Maybe better to simply special-case stringdtype sorting in _new_sortlike and _new_argsortlike so that the lock is acquired at the beginning of the sort and released at the end.

ngoldbaum · 2024-08-16T21:45:17Z

Ah, I see why unicode sorting is fast, there's a special case for string sorting in quicksort.cpp.

charris · 2024-08-16T22:13:29Z

How does argsorting compare? Do you actually move strings around, or just change the offsets?

ngoldbaum mentioned this issue May 23, 2024

TSK: Follow-up things for stringdtype #25693

Open

14 tasks

rgommers added the 01 - Enhancement label May 26, 2024

otsaloma mentioned this issue Dec 15, 2024

Sort and unique slow for strings otsaloma/dataiter#28

Open

MaanasArora linked a pull request Mar 14, 2025 that will close this issue

ENH, API: New sorting mechanism for DType API #28516

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ENH: np.unique and sorting is slow for StringDType #26510

ENH: np.unique and sorting is slow for StringDType #26510

ngoldbaum commented May 23, 2024

ngoldbaum commented May 23, 2024 •

edited

Loading

Uh oh!

ngoldbaum commented Aug 16, 2024 •

edited

Loading

Uh oh!

ngoldbaum commented Aug 16, 2024

Uh oh!

charris commented Aug 16, 2024

Uh oh!

Uh oh!

ENH: np.unique and sorting is slow for StringDType #26510

ENH: np.unique and sorting is slow for StringDType #26510

Comments

ngoldbaum commented May 23, 2024

ngoldbaum commented May 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngoldbaum commented Aug 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngoldbaum commented Aug 16, 2024

Uh oh!

charris commented Aug 16, 2024

Uh oh!

ngoldbaum commented May 23, 2024 •

edited

Loading

ngoldbaum commented Aug 16, 2024 •

edited

Loading