Skip to content

ENH: np.unique and sorting is slow for StringDType #26510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ngoldbaum opened this issue May 23, 2024 · 4 comments · May be fixed by #28516
Open

ENH: np.unique and sorting is slow for StringDType #26510

ngoldbaum opened this issue May 23, 2024 · 4 comments · May be fixed by #28516

Comments

@ngoldbaum
Copy link
Member

I also did some basic performance tests in v2.0.0rc2 for the experimental StringDType but it seems to be much slower than "O" or "U" dtypes in certain cases, are you aware of these issues:

import numpy
import random
import timeit

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype="T", count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))  #  0.078

Profiling (will include in a separate comment below) indicates that the overhead is from acquiring and releasing the allocator lock (it does it for every entry in the array because sorting is implemented using getitem).

We could improve performance for sorting specifically by adding a fast getitem path that assumes the allocator lock is already acquired and using it in the sorting implementation after acquiring the lock once.

Optimizing getitem and setitem would also help (maybe by adding a stringdtype scalar that interns the packed string).

Originally posted by @MarkusSintonen in #25693 (comment)

@ngoldbaum
Copy link
Member Author

ngoldbaum commented May 23, 2024

Running this script under py-spy with --native:

import numpy as np
import random

options = ["a", "bb", "cc", "ddd"]
lst = random.choices(options, k=1000)
arr_s = np.array(lst, dtype="T")

for _ in range(10000):
    arr_s.sort()

Produces this flame graph:

python-2024-05-23T09:53:41-06:00

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Aug 16, 2024

I looked at this again.

One thing that immediately helps is switching to a more performant mutex implementation, and indeed running on Python 3.13rc1 that uses PyMutex I found that the stringdtype version beats object array, but unicode is still substantially faster:

± spin python ../numpy-experiments/test.py
Invoking `build` prior to invoking Python:
$ /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/python3.13 vendored-meson/meson/meson.py compile -C build
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/ninja -C /Users/goldbaum/Documents/numpy/build
ninja: Entering directory `/Users/goldbaum/Documents/numpy/build'
[1/1] Generating numpy/generate-version with a custom command
Saving version to numpy/version.py
$ /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/python3.13 vendored-meson/meson/meson.py install --only-changed -C build --destdir ../build-install
$ export PYTHONPATH="/Users/goldbaum/Documents/numpy/build-install/usr/lib/python3.13/site-packages"
🐍 Launching Python with PYTHONPATH="/Users/goldbaum/Documents/numpy/build-install/usr/lib/python3.13/site-packages"
$ /Users/goldbaum/.pyenv/versions/3.13.0rc1/bin/python3.13 -P ../numpy-experiments/test.py
2.2.0.dev0+git20240816.26cba75
0.9557965830899775
1.3458311669528484
0.2885478753596544
0.06409212481230497

So I think it's probably still worth it to special-case the sorting. Or at least do something so the lock only needs to be acquired once. Looking at actually adding special sorting routines would mean re-implementing the algorithms in quicksort.cpp, timsort.cpp, etc etc which seems not worth doing.

Maybe better to simply special-case stringdtype sorting in _new_sortlike and _new_argsortlike so that the lock is acquired at the beginning of the sort and released at the end.

@ngoldbaum
Copy link
Member Author

Ah, I see why unicode sorting is fast, there's a special case for string sorting in quicksort.cpp.

@charris
Copy link
Member

charris commented Aug 16, 2024

How does argsorting compare? Do you actually move strings around, or just change the offsets?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants