Skip to content

gh-128942: make arraymodule.c free-thread safe (lock-free) #130771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 70 commits into
base: main
Choose a base branch
from

Conversation

tom-pytel
Copy link
Contributor

@tom-pytel tom-pytel commented Mar 2, 2025

I added lock-free single element reads and writes by mostly copying the list object's homework. TL;DR: pyperformance scimark seems to be back to about what it was without the free-thread safe stuff (pending confirmation of course). Tried a few other things but the list strategy seems good enough (except for the negative index thing I mentioned in #130744, if that is an issue).

Timings, the relevant ones are "OLD" - non free-thread safe arraymodule, "SLOW" - the previous slower PR and the last two "LFREERW".

### scimark_fft ###
WITHGIL: Mean +- std dev: 200 ms +- 6 ms
OLD:     Mean +- std dev: 238 ms +- 7 ms
SLOW:    Mean +- std dev: 274 ms +- 4 ms
ITEMFIX: Mean +- std dev: 272 ms +- 14 ms
NOLOCK:  Mean +- std dev: 240 ms +- 4 ms
TEST1:   Mean +- std dev: 267 ms +- 4 ms
TEST2:   Mean +- std dev: 255 ms +- 6 ms
LFREERD: Mean +- std dev: 256 ms +- 7 ms   # left in standard Py_SIZE() in getarrayitem
LFREERD: Mean +- std dev: 262 ms +- 7 ms   # atomic read of size in getarrayitem
LFREERD: Mean +- std dev: 270 ms +- 7 ms   # build and run again
LFREERD: Mean +- std dev: 255 ms +- 10 ms  # orthodox
LFREERD: Mean +- std dev: 259 ms +- 8 ms   # build and run again
LFREERW: Mean +- std dev: 239 ms +- 7 ms   # lockfree write single element
LFREERW: Mean +- std dev: 242 ms +- 6 ms   # build and run again
STATICL: Mean +- std dev: 223 ms +- 6 ms   # statically linked
STATICL: Mean +- std dev: 223 ms +- 5 ms   # build and run again
CLEANUP: Mean +- std dev: 224 ms +- 5 ms   # including atomic get/set
AMEMCPY: Mean +- std dev: 226 ms +- 6 ms   # atomic item copy on resize (instead of memcpy)
AMEMCPY: Mean +- std dev: 222 ms +- 5 ms   # build and run again
LOCKWR:  Mean +- std dev: 226 ms +- 4 ms   # locked writes again, but much better with static linking
LOCKWR:  Mean +- std dev: 226 ms +- 5 ms   # build and run again
AMCAGG:  Mean +- std dev: 222 ms +- 7 ms   # aggregate memcpy (just using best build)

### scimark_lu ###
WITHGIL: Mean +- std dev: 64.6 ms +- 1.9 ms
OLD:     Mean +- std dev: 89.0 ms +- 2.1 ms
SLOW:    Mean +- std dev: 95.4 ms +- 3.1 ms
ITEMFIX: Mean +- std dev: 92.0 ms +- 2.1 ms
NOLOCK:  Mean +- std dev: 88.5 ms +- 2.5 ms
TEST1:   Mean +- std dev: 89.7 ms +- 2.1 ms
TEST2:   Mean +- std dev: 90.5 ms +- 2.4 ms
LFREERD: Mean +- std dev: 89.9 ms +- 2.7 ms
LFREERD: Mean +- std dev: 90.9 ms +- 2.6 ms
LFREERD: Mean +- std dev: 93.4 ms +- 4.0 ms
LFREERD: Mean +- std dev: 91.9 ms +- 3.7 ms
LFREERD: Mean +- std dev: 91.0 ms +- 5.0 ms
LFREERW: Mean +- std dev: 89.2 ms +- 2.8 ms
LFREERW: Mean +- std dev: 88.8 ms +- 2.5 ms
STATICL: Mean +- std dev: 86.9 ms +- 1.9 ms
STATICL: Mean +- std dev: 87.1 ms +- 2.2 ms
CLEANUP: Mean +- std dev: 88.9 ms +- 2.5 ms
AMEMCPY: Mean +- std dev: 87.9 ms +- 1.9 ms
AMEMCPY: Mean +- std dev: 88.7 ms +- 4.8 ms
LOCKWR:  Mean +- std dev: 88.0 ms +- 2.4 ms
LOCKWR:  Mean +- std dev: 88.5 ms +- 3.5 ms
AMCAGG:  Mean +- std dev: 88.5 ms +- 3.5 ms

### scimark_monte_carlo ###
WITHGIL: Mean +- std dev: 40.1 ms +- 2.3 ms
OLD:     Mean +- std dev: 53.4 ms +- 1.0 ms
SLOW:    Mean +- std dev: 56.8 ms +- 1.4 ms
ITEMFIX: Mean +- std dev: 55.9 ms +- 0.7 ms
NOLOCK:  Mean +- std dev: 54.4 ms +- 1.3 ms
TEST1:   Mean +- std dev: 54.1 ms +- 0.6 ms
TEST2:   Mean +- std dev: 54.1 ms +- 0.6 ms
LFREERD: Mean +- std dev: 54.3 ms +- 0.9 ms
LFREERD: Mean +- std dev: 55.9 ms +- 0.9 ms
LFREERD: Mean +- std dev: 55.4 ms +- 1.1 ms
LFREERD: Mean +- std dev: 55.4 ms +- 1.1 ms
LFREERD: Mean +- std dev: 55.2 ms +- 1.5 ms
LFREERW: Mean +- std dev: 53.2 ms +- 1.1 ms
LFREERW: Mean +- std dev: 53.9 ms +- 1.5 ms
STATICL: Mean +- std dev: 55.0 ms +- 1.9 ms
STATICL: Mean +- std dev: 53.3 ms +- 1.0 ms
CLEANUP: Mean +- std dev: 52.6 ms +- 0.5 ms
AMEMCPY: Mean +- std dev: 53.3 ms +- 0.6 ms
AMEMCPY: Mean +- std dev: 53.4 ms +- 1.2 ms
LOCKWR:  Mean +- std dev: 55.4 ms +- 4.0 ms
LOCKWR:  Mean +- std dev: 54.8 ms +- 1.8 ms
AMCAGG:  Mean +- std dev: 53.1 ms +- 1.0 ms

### scimark_sor ###
WITHGIL: Mean +- std dev: 70.3 ms +- 3.3 ms
OLD:     Mean +- std dev: 98.1 ms +- 1.8 ms
SLOW:    Mean +- std dev: 99.9 ms +- 1.1 ms
ITEMFIX: Mean +- std dev:  100 ms +- 1 ms
NOLOCK:  Mean +- std dev: 97.3 ms +- 1.4 ms
TEST1:   Mean +- std dev: 97.5 ms +- 0.7 ms
TEST2:   Mean +- std dev: 98.5 ms +- 1.5 ms
LFREERD: Mean +- std dev: 97.0 ms +- 2.0 ms
LFREERD: Mean +- std dev: 98.2 ms +- 1.4 ms
LFREERD: Mean +- std dev: 98.4 ms +- 2.7 ms
LFREERD: Mean +- std dev: 97.6 ms +- 1.5 ms
LFREERD: Mean +- std dev: 97.9 ms +- 1.8 ms
LFREERW: Mean +- std dev: 97.2 ms +- 1.6 ms
LFREERW: Mean +- std dev: 96.9 ms +- 2.0 ms
STATICL: Mean +- std dev: 97.8 ms +- 1.9 ms
STATICL: Mean +- std dev: 97.2 ms +- 1.7 ms
CLEANUP: Mean +- std dev: 97.7 ms +- 1.2 ms
AMEMCPY: Mean +- std dev: 96.0 ms +- 2.0 ms
AMEMCPY: Mean +- std dev: 96.4 ms +- 1.5 ms
LOCKWR:  Mean +- std dev: 97.7 ms +- 1.5 ms
LOCKWR:  Mean +- std dev: 96.8 ms +- 2.5 ms
AMCAGG:  Mean +- std dev: 95.6 ms +- 0.5 ms

### scimark_sparse_mat_mult ###
WITHGIL: Mean +- std dev: 2.89 ms +- 0.11 ms
OLD:     Mean +- std dev: 3.94 ms +- 0.15 ms
SLOW:    Mean +- std dev: 4.68 ms +- 0.07 ms
ITEMFIX: Mean +- std dev: 4.34 ms +- 0.07 ms
NOLOCK:  Mean +- std dev: 3.94 ms +- 0.58 ms
TEST1:   Mean +- std dev: 3.99 ms +- 0.12 ms
TEST2:   Mean +- std dev: 3.80 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.84 ms +- 0.17 ms
LFREERD: Mean +- std dev: 3.83 ms +- 0.15 ms
LFREERD: Mean +- std dev: 3.94 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.74 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.83 ms +- 0.19 ms
LFREERW: Mean +- std dev: 3.84 ms +- 0.19 ms
LFREERW: Mean +- std dev: 3.87 ms +- 0.17 ms
STATICL: Mean +- std dev: 3.37 ms +- 0.08 ms
STATICL: Mean +- std dev: 3.49 ms +- 0.12 ms
CLEANUP: Mean +- std dev: 3.45 ms +- 0.14 ms
AMEMCPY: Mean +- std dev: 3.44 ms +- 0.13 ms
AMEMCPY: Mean +- std dev: 3.46 ms +- 0.14 ms
LOCKWR:  Mean +- std dev: 3.52 ms +- 0.47 ms
LOCKWR:  Mean +- std dev: 3.44 ms +- 0.10 ms
AMCAGG:  Mean +- std dev: 3.38 ms +- 0.09 ms

@tom-pytel
Copy link
Contributor Author

ping @colesbury

Copy link
Member

@ZeroIntensity ZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Disclaimer: I'm not an expert on the FT list implementation, so take some of my comments with a grain of salt.

Seeing good single-threaded performance is nice, but what about multi-threaded scaling? The number of locks that are still here scare me a little--it would be nice if this scaled well for concurrent use as well, especially for operations that don't require concurrent writes (e.g., comparisons and copies).

@tom-pytel
Copy link
Contributor Author

Note, this is not ready to go, there is the memory issue which needs resolving.

@tom-pytel
Copy link
Contributor Author

@ZeroIntensity you can remove the do-not-merge, its not an arraymodule issue, its a QSBR issue, see #130794.

@tom-pytel
Copy link
Contributor Author

The main thing here for acceptance is a benchmark run which I am not able to start (I only did local pyperformance check against main), so someone with access will have to initiate that to compare with main.

@colesbury colesbury self-requested a review March 4, 2025 16:22
Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gotten a chance to look through arraymodule.c yet. I'll review that later this week.

Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall approach here seems good. A few comments below.

@tom-pytel
Copy link
Contributor Author

tom-pytel commented Mar 5, 2025

The actual arraymodule cleanup is fine, just one thing you should pay attention to when looking at this:

data->items is not a deref but pointer arithmetic so is safe to use if data is NULL in memcopies and _PyBytes_Repeat when associated size is 0. But there are places where this might not be safe which are accomodated, specifically:

  • array_array_buffer_info_impl
  • array_array_tobytes_impl
  • array_array_fromunicode_impl
  • array_array_tounicode_impl
  • array_buffer_getbuf_lock_held

Are there any other places where this needs to take place?

Its the test and trying to run it with parallel-threads with tsan that's problematic.

  • warnings doesn't play nice with --parallel-threads, found a least evil solution but its not wonderful. This still gets "env changed" but at least can run. Is there a way to check for parallel-threads other than this?
    def setUp(self):
        if (not support.Py_GIL_DISABLED or
                any('"parallel_threads": null' in a for a in sys.argv) or
                all('parallel_threads' not in a for a in sys.argv)):
            self.enterContext(warnings.catch_warnings())
  • --parallel-threads doesn't play well with the threaded FreeThreadingTest, get dangling threads sometimes.

  • ./Include/cpython/pyatomic_gcc.h:615:3: warning: ‘atomic_thread_fence’ is not supported with ‘-fsanitize=thread’ [-Wtsan]
    Do we care?

  • And worst of all, cleaned up data races in arraymodule (as far as I can see). I used to be able to run test stuff setting supperssions without problems but with test_array I get:

$ ./python -m test --parallel-threads=4 -j4 test_array
Using random seed: 2354862211
0:00:00 load avg: 0.55 Run 1 test in parallel using 1 worker process
0:00:30 load avg: 0.80 running (1): test_array (30.1 sec)
0:00:41 load avg: 0.75 [1/1/1] test_array worker non-zero exit code (Exit code 66)
==================

WARNING: ThreadSanitizer: data race (pid=1075807)
  Read of size 8 at 0x7f58e22c0098 by thread T4:
    #0 _Py_TYPE Include/object.h:268 (python+0x22a21a)
    #1 Py_IS_TYPE Include/object.h:291 (python+0x22a21a)
    #2 compare_unicode_unicode_threadsafe Objects/dictobject.c:1413 (python+0x22a21a)
    #3 do_lookup Objects/dictobject.c:1010 (python+0x22bb38)
    #4 unicodekeys_lookup_unicode_threadsafe Objects/dictobject.c:1438 (python+0x22bd9b)
    #5 _Py_dict_lookup_threadsafe Objects/dictobject.c:1493 (python+0x232e50)
    #6 _PyDict_GetItemRef_KnownHash Objects/dictobject.c:2342 (python+0x233f39)
    #7 PyDict_GetItemRef Objects/dictobject.c:2378 (python+0x234070)
    #8 cache_struct_converter Modules/_struct.c:2524 (_struct.cpython-314td-x86_64-linux-gnu.so+0xc916)
    #9 calcsize Modules/clinic/_struct.c.h:249 (_struct.cpython-314td-x86_64-linux-gnu.so+0xceb0)
    #10 cfunction_vectorcall_O Objects/methodobject.c:523 (python+0x25961c)
    #11 _PyObject_VectorcallTstate Include/internal/pycore_call.h:167 (python+0x1896b3)
    #12 PyObject_Vectorcall Objects/call.c:327 (python+0x189810)
    #13 _PyEval_EvalFrameDefault Python/generated_cases.c.h:1371 (python+0x42310d)
    ...

  Previous write of size 8 at 0x7f58e22c0098 by thread T1:
    #0 memset ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:799 (libtsan.so.0+0x614cb)
    #1 memset ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:797 (libtsan.so.0+0x614cb)
    #2 memset /usr/include/x86_64-linux-gnu/bits/string_fortified.h:59 (python+0x274169)
    #3 fill_mem_debug Objects/obmalloc.c:2739 (python+0x274169)
    #4 _PyMem_DebugRawAlloc Objects/obmalloc.c:2823 (python+0x27429d)
    #5 _PyMem_DebugRawMalloc Objects/obmalloc.c:2839 (python+0x2742d6)
    #6 _PyMem_DebugMalloc Objects/obmalloc.c:3004 (python+0x274331)
    #7 PyObject_Malloc Objects/obmalloc.c:1412 (python+0x296e92)
    #8 PyUnicode_New Objects/unicodeobject.c:1407 (python+0x31fd7d)
    #9 PyUnicode_Concat Objects/unicodeobject.c:11647 (python+0x335e53)
    #10 PyNumber_Add Objects/abstract.c:1135 (python+0x14e11c)
    #11 _PyEval_EvalFrameDefault Python/generated_cases.c.h:62 (python+0x41b60b)
    ...

WARNING: ThreadSanitizer: data race (pid=1079529)
  Write of size 8 at 0x7f0050cbd0f0 by thread T17:
    #0 descr_get_qualname Objects/descrobject.c:624 (python+0x1ac687)
    #1 getset_get Objects/descrobject.c:193 (python+0x1aaf41)
    #2 _PyObject_GenericGetAttrWithDict Objects/object.c:1694 (python+0x268a4e)
    ...

  Previous write of size 8 at 0x7f0050cbd0f0 by thread T20:
    #0 descr_get_qualname Objects/descrobject.c:624 (python+0x1ac687)
    #1 getset_get Objects/descrobject.c:193 (python+0x1aaf41)
    #2 _PyObject_GenericGetAttrWithDict Objects/object.c:1694 (python+0x268a4e)
    ...

Which is not arraymodule stuff and not suppressed. So as of now its not runnable clean under tsan (at least on my system how I am running it).

Left the bad setUp() in test_array for now so u can see but it will be getting removed (or fixed).

@colesbury
Copy link
Contributor

colesbury commented Mar 5, 2025

I'd like test_array added to TSAN_TESTS, not TSAN_PARALLEL_TESTS.

warnings doesn't play nice with --parallel-threads

Yes, warnings is not therad-safe (even with the GIL). Neil is work on making it thread-safe, but for now we shouldn't add the test to TSAN_PARALLEL_TESTS for now. It's not worth trying to work around the warnings stuff.

Copy link
Member

@ZeroIntensity ZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there!

Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
tom-pytel and others added 2 commits July 21, 2025 16:04
@tom-pytel
Copy link
Contributor Author

Its been a while so I forgot that the array module shouldn't be in Setup.stdlib.in but rather in Setup.bootstrap.in so it is compiled as built-in as per: #130771 (comment), sent up fix.

}
}
else { // typecode == 'w'
Py_ssize_t ustr_length = PyUnicode_GetLength(ustr);
Py_ssize_t old_size = Py_SIZE(self);
Py_ssize_t new_size = old_size + ustr_length;

if (new_size < 0 || (size_t)new_size > PY_SSIZE_T_MAX / sizeof(Py_UCS4)) {
if (new_size < 0 || !arraydata_size_valid(new_size, sizeof(Py_UCS4))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_size < 0 checks for overflow here. Maybe it is worth to do clarify it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the < 0 check into arraydata_size_valid() instead. I can technically remove some of the other checks like (Py_SIZE(self) > PY_SSIZE_T_MAX - Py_SIZE(b) with this, but want opinion first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not big expert in C, but AFAIK an integer overflow is undefined behavior in C. So, it should be avoided as I can tell.

Copy link
Contributor Author

@tom-pytel tom-pytel Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if you really want to enforce that then this code was non-compliant to begin with and continues to be so, with or without the new_size < 0. But I'm guessing the "correct" behavior would be overkill as it is not used elsewhere here either.

Or if you want I could make them all compliant, your call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two ways:

  1. size_t new_size = old_size + ustr_length; if (new_size > Py_SSIZE_T_MAX || ...
  2. if ((size_t)old_size + ustr_length > Py_SSIZE_T_MAX) { PyErr_NoMemory(); } Py_ssize_t new_size = old_size + ustr_lengt; ...

It is up to you what is a best here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what I meant mostly. But if you want to dot all the i's, the size_t new_size = old_size + ustr_length is UB from what I understand without the explicit cast size_t new_size = (size_t)old_size + ustr_length.

Also note the if ((size_t)old_size + ustr_length > Py_SSIZE_T_MAX) is handled implicitly by the check in arraydata_size_valid(). Also the array_resize(self, new_size) using a size_t new_size which is guaranteed to be <= Py_SSIZE_T_MAX is defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size_t new_size = (size_t)old_size + ustr_length. - yeah, good catch!

if ((size_t)old_size + ustr_length > Py_SSIZE_T_MAX) - I mean that you should check overflow before declaring a Py_ssize_t new_size variable. If you use size_t new_size it is not necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we are in agreement.

# array module is not free-thread safe.

def setUp(self):
self.enterContext(warnings.catch_warnings())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What warnings are emitted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't remember. It might be from when the thing was non-racy and running TSAN tests, but useless now so will remove.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need these changes anymore as we aren't using _Alignas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice _Alignof in a few places, so will leave as a keyword but remove the _Alignas rules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants