Skip to content

Use wmemchr in stringlib if sizeof(STRINGLIB_CHAR) == sizeof(wchar_t) #93033

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
goldsteinn opened this issue May 20, 2022 · 2 comments
Closed
Assignees
Labels
performance Performance or resource usage topic-unicode type-feature A feature request or enhancement

Comments

@goldsteinn
Copy link
Contributor

Feature or enhancement

(A clear and concise description of your proposal.)

memchr is currently used if sizeof(STRINGLIB_CHAR) == sizeof(char) because it faster
than the standard C loops alternatives.

wmemchr has often roughly the same performance as memchr and can be used
in the same way as memchr to speedup some of the string functions STRINGLIB(find_char)
and STRINGLIB(replace_1char_inplace) are the two easiest candidates.

Pitch

(Explain why this feature or enhancement should be implemented and how it would be used.
Add examples, if applicable.)

It would, in some instances, make wide_str.find(wide_str_of_len_one) faster.

Previous discussion

The idea was discussed a bit in #69009 but wasn't the main topic of the issue
or put into any patches.

@goldsteinn goldsteinn added the type-feature A feature request or enhancement label May 20, 2022
@corona10
Copy link
Member

Looks like @methane san will have interests in this issue :)

@goldsteinn
Copy link
Contributor Author

@corona10 have a PR for it at 93034

goldsteinn added a commit to goldsteinn/cpython that referenced this issue May 21, 2022
This was brought up a bit in python#69009 but the larger issue is mostly
different.

Generally comparable perf for the "good" case where memchr doesn't
return any collisions (false matches on lower byte) but clearly faster
with collisions.

Some notes on correctness:

wchar_t being signed/unsigned shouldn't matter here BUT wmemchr (along
with just about all the other wide-char string functions) can and
often does (x86_64 for example) assume that the input is aligned
relative to the sizeof(wchar_t). If this is not the case for
Py_UCS{2|4} then this patch is broken.

Also I think the way I implemented `#define STRINGLIB_FAST_MEMCHR` for
ucs{2|4}lib break strict-aliasing. If this is an issue but otherwise
the patch is fine, any suggestions for how to fix it?

Test results:
```
$> ./python -m test -j4
...
== Tests result: SUCCESS ==

406 tests OK.

30 tests skipped:
    test_bz2 test_curses test_dbm_gnu test_dbm_ndbm test_devpoll
    test_idle test_ioctl test_kqueue test_launcher test_msilib
    test_nis test_ossaudiodev test_readline test_smtpnet
    test_socketserver test_sqlite3 test_startfile test_tcl test_tix
    test_tk test_ttk_guionly test_ttk_textonly test_turtle
    test_urllib2net test_urllibnet test_winconsoleio test_winreg
    test_winsound test_xmlrpc_net test_zipfile64
```

Benchmarked on:
model name	: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

sizeof(wchar_t) == 4

GLIBC 2.35

```
./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018210")' ## Long, No match, No collision
No wmemchr  : 1000 loops, best of 100: 127 nsec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018200"' -- 's.find("\U00018208")' ## Long, No match, High collision
No wmemchr  : 1000 loops, best of 100: 1.29 usec per loop
With wmemchr: 1000 loops, best of 100: 123 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018210"' -- 's.find("\U00018210")' ## Long, match, No collision
No wmemchr  : 1000 loops, best of 100: 136 nsec per loop
With wmemchr: 1000 loops, best of 100: 130 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 200 + "\U00018208"' -- 's.find("\U00018208")' ## Long, match, High collision
No wmemchr  : 1000 loops, best of 100: 1.35 usec per loop
With wmemchr: 1000 loops, best of 100: 131 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018210")' ## Short, No match, No collision
No wmemchr  : 1000 loops, best of 100: 50.2 nsec per loop
With wmemchr: 1000 loops, best of 100: 52.9 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018200"' -- 's.find("\U00018208")' ## Short, No match, High collision
No wmemchr  : 1000 loops, best of 100: 69.1 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.7 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018210"' -- 's.find("\U00018210")' ## Short, match, No collision
No wmemchr  : 1000 loops, best of 100: 53.6 nsec per loop
With wmemchr: 1000 loops, best of 100: 53.6 nsec per loop

./python -m timeit -s 's = "\U00010200\U00010201\U00010202\U00010203\U00010204\U00010205\U00010206\U00010207\U00010208\U00010209\U0001020a\U0001020b\U0001020c\U0001020d\U0001020e\U0001020f" * 3 + "\U00018208"' -- 's.find("\U00018208")' ## Short, match, High collision
No wmemchr  : 1000 loops, best of 100: 69 nsec per loop
With wmemchr: 1000 loops, best of 100: 50.9 nsec per loop
```
@serhiy-storchaka serhiy-storchaka added topic-unicode performance Performance or resource usage labels May 22, 2022
methane pushed a commit that referenced this issue May 24, 2022
Generally comparable perf for the "good" case where memchr doesn't
return any collisions (false matches on lower byte) but clearly faster
with collisions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage topic-unicode type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants