BUG: Race adding legacy casts to custom dtype under free threading #28048

Closed
hawkinsp opened this issue Dec 20, 2024 · 17 comments · Fixed by #28290
Labels
00 - Bug, 39 - free-threading (PRs and issues related to support for free-threading CPython, a.k.a. no-GIL, PEP 703)

Comments

@hawkinsp
Contributor

hawkinsp commented Dec 20, 2024

Describe the issue:

If the following code is run under Python 3.13.1t, it fails nondeterministically with "A cast was already added for <class 'numpy.dtype[rational]'> -> <class 'numpy.dtypes.Int8DType'>. (method: legacy_cast)".

Reproduce the code example:

import concurrent.futures
import functools
import threading
import numpy as np
import numpy._core._rational_tests as _rational_tests

num_threads = 1000

def closure(b):
    b.wait()
    for _ in range(100):
        np.full((10, 10), 1, _rational_tests.rational)


with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    b = threading.Barrier(num_threads)
    futures = [executor.submit(functools.partial(closure, b)) for _ in range(num_threads)]
    [f.result() for f in futures]

Error message:

Traceback (most recent call last):
  File "/Users/goldbaum/.pyenv/versions/3.13.1t/lib/python3.13t/site-packages/numpy/_core/numeric.py", line 353, in full
    multiarray.copyto(a, fill_value, casting='unsafe')
RuntimeError: A cast was already added for <class 'numpy.dtype[rational]'> -> <class 'numpy.dtypes.Int8DType'>. (method: legacy_cast)

Python and NumPy Versions:

2.3.0.dev0+git20241219.35b2c4a
3.13.1 experimental free-threading build (tags/v3.13.1:06714517797, Dec 15 2024, 15:38:01) [Clang 18.1.8 (11)]

Runtime Environment:

[{'numpy_version': '2.3.0.dev0+git20241219.35b2c4a',
'python': '3.13.1 experimental free-threading build '
'(tags/v3.13.1:06714517797, Dec 15 2024, 15:38:01) [Clang 18.1.8 '
'(11)]',
'uname': uname_result(system='Linux', node='', release='', version='#1 SMP PREEMPT_DYNAMIC Debian 6.redacted (2024-10-16)', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_KNL',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Zen',
'filepath': '/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.27.so',
'internal_api': 'openblas',
'num_threads': 128,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.27'}]

Context for the issue:

Found when working on free-threading support in JAX.

@charris charris added the 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703) label Dec 21, 2024
@seberg
Member

seberg commented Dec 30, 2024

Hmmmm, I wonder if we can/should do this at registration time already to avoid having to add another locking point (registration should happen at import, I think. And if not, it should be the caller's problem to lock).

@ngoldbaum
Member

Here's the C traceback:

* thread #10, stop reason = breakpoint 2.1
  * frame #0: 0x0000000100d60340 _multiarray_umath.cpython-313t-darwin.so`PyArray_AddCastingImplementation(meth=0x0000038d680100f0) at convert_datatype.c:2011:23 [opt]
    frame #1: 0x0000000100d60460 _multiarray_umath.cpython-313t-darwin.so`PyArray_AddCastingImplementation_FromSpec(spec=0x0000000178e66248, private=1) at convert_datatype.c:2036:15 [opt]
    frame #2: 0x0000000100dd4230 _multiarray_umath.cpython-313t-darwin.so`PyArray_AddLegacyWrapping_CastingImpl(from=0x0000038d564d6400, to=0x0000038d564d0800, casting=NPY_UNSAFE_CASTING) at usertypes.c:0 [opt]
    frame #3: 0x0000000100d5d15c _multiarray_umath.cpython-313t-darwin.so`PyArray_GetCastingImpl(from=0x0000038d564d6400, to=<unavailable>) at convert_datatype.c:144:13 [opt]
    frame #4: 0x0000000100d5de34 _multiarray_umath.cpython-313t-darwin.so`PyArray_CanCastSafely(fromtype=<unavailable>, totype=<unavailable>) at convert_datatype.c:540:29 [opt]
    frame #5: 0x0000000100dd3e3c _multiarray_umath.cpython-313t-darwin.so`legacy_userdtype_common_dtype_function(cls=0x0000038d564d6400, other=0x0000038d564d0800) at usertypes.c:522:9 [opt]
    frame #6: 0x0000000100d45f90 _multiarray_umath.cpython-313t-darwin.so`int_common_dtype(__NPY_UNUSED_TAGGEDcls=<unavailable>, other=0x0000038d564d6400) at abstractdtypes.c:179:34 [opt]
    frame #7: 0x0000000100d5b820 _multiarray_umath.cpython-313t-darwin.so`PyArray_CommonDType(dtype1=0x0000000100f8b5f0, dtype2=0x0000038d564d6400) at common_dtype.c:58:20 [opt]
    frame #8: 0x0000000100d465e4 _multiarray_umath.cpython-313t-darwin.so`npy_find_descr_for_scalar(scalar=0, original_descr=0x0000000100f81278, in_DT=<unavailable>, op_DT=<unavailable>) at abstractdtypes.c:460:33 [opt]
    frame #9: 0x0000000100dafd58 _multiarray_umath.cpython-313t-darwin.so`array_copyto(__NPY_UNUSED_TAGGEDignored=<unavailable>, args=<unavailable>, len_args=<unavailable>, kwnames=<unavailable>) at multiarraymodule.c:1966:32 [opt]

NumPy is lazily generating casts, but we can probably initialize casts between user dtypes and the built-in DTypes using this machinery at DType registration time.

Failing that, we can keep the current implementation and lock the DType when we need to fill the cache, possibly behind a shared_mutex like the dispatch cache.

@ngoldbaum
Member

ngoldbaum commented Jan 14, 2025

I edited the reproducer to use _rational_tests.rational instead of ml_dtypes. Thankfully it still triggers with rational, which should make it much easier to add a test along with a fix.

@ngoldbaum
Member

ngoldbaum commented Jan 16, 2025

@seberg I need your opinion about what the best way forward here is.

This is set up this way because user dtypes first create a DType instance and only afterwards register a bunch of casts. There's no function that says "this DType is fully initialized, do finalization steps."

The current approach of wrapping the legacy user dtype casts lazily avoids the need to monkey with the internals of the legacy dtype implementation (including the cancastto pointers, beyond reading them once they're already set up), because by the time a cast actually happens the legacy dtype has finished setting up, including registering all of its casts.

I see three ways to fix this and none are great.

  1. Add a new public function to finalize legacy dtypes, which must be called after all casts are registered. The new function could then call PyArray_AddLegacyWrapping_CastingImpl for all built-in dtypes to initialize the casts at DType registration time. ml_dtypes could adopt this, but I doubt other legacy user dtypes are well enough maintained to adopt a new C API function just to fix a bug you're only likely to hit under free threading.
  2. Initialize a basic set of casts in the DTypeMeta wrapping the legacy user DType right after calling dtypemeta_wrap_legacy_descriptor, using PyArray_AddLegacyWrapping_CastingImpl. But then, inside PyArray_RegisterCanCast, I'd need to update the casting safety of the wrapping ArrayMethod, which is an annoying inversion of abstractions because PyArray_RegisterCanCast is deep in the guts of the legacy user dtype implementation.
  3. As alluded to above, lock the castingimpls slot of PyArray_DTypeMeta instances, probably behind a shared_mutex. This would probably require converting a bunch of code to C++ and might also not scale well. It also means adding locking to all DTypes to handle an issue that only affects legacy user DTypes.

@seberg
Member

seberg commented Jan 17, 2025

Ah, so I had a very good reason to put it exactly where it is :(. My thoughts:

  1. My first gut feeling was not to do this, but it doesn't seem so bad. I agree only ml_dtypes would use this, but maybe that is OK? Unmaintained extension dtypes may not be "advertising" free-threading support anyway.
    We could probably hack in a DeprecationWarning or a straight error that enforces the new API when free-threading is enabled (not back-portable, but...).
  2. This seems too awkward and hard to follow to me...
  3. Converting to C++ shouldn't stop us; it is churn, but likely churn that is good in the long run anyway.
    But maybe there are even easier options? castingimpls is just a dict, IIRC, so I think we can call GetItem() even while another thread may be adding a new cast? If that is the case, we should be able to get away with just a critical section (and another check whether it showed up) in the function adding the casts?

Since castingimpls being a dict should make locking easy, locking seems like a good approach to me now. But if it is not, going the new API route isn't as bad as it may seem at first.

@ngoldbaum
Member

castingimpls is just a dict, IIRC, so I think we can call GetItem() even while another thread may be adding a new cast? If that is the case, we should be able to get away with just a critical section (and another check whether it showed up) in the function adding the casts?

I'll have to look closer at this, but I think you'd need to apply a critical section when reading castingimpls, which introduces contention in the common case. It would be safe, but if someone called the same cast operation simultaneously from many threads it would scale poorly, just like the ufunc scaling issue we saw with critical sections.
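
For concreteness, here is a minimal sketch of the concern (not NumPy's actual code; castingimpls and to are stand-ins for the real dict and key), using CPython 3.13's critical-section API to guard every lookup:

#include <Python.h>

/*
 * Sketch only.  In the free-threaded build Py_BEGIN_CRITICAL_SECTION takes
 * the dict's per-object lock, so concurrent readers serialize here even
 * though the cast is almost always already cached; in the default build the
 * macros are no-ops.
 */
static PyObject *
lookup_cast_locked(PyObject *castingimpls, PyObject *to)
{
    PyObject *res = NULL;
    Py_BEGIN_CRITICAL_SECTION(castingimpls);
    /* Borrowed reference is fine: cached casts live as long as the DType. */
    res = PyDict_GetItemWithError(castingimpls, to);
    Py_XINCREF(res);
    Py_END_CRITICAL_SECTION();
    return res;  /* new reference, or NULL on a miss/error */
}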

@ngoldbaum
Member

Another perhaps less satisfying option is to let the race happen and just ignore this error case. The cast created by both threads should be identical.

@ngoldbaum
Member

castingimpls isn't actually used in all that many places, so maybe requiring C++ to access it isn't so bad:

_core/src/umath/_scaled_float_dtype.c
881:    NPY_DT_SLOTS(&PyArray_SFloatDType)->castingimpls = PyDict_New();
882:    if (NPY_DT_SLOTS(&PyArray_SFloatDType)->castingimpls == NULL) {

_core/src/multiarray/usertypes.c
348:            NPY_DT_SLOTS(NPY_DTYPE(descr))->castingimpls, (PyObject *)to_DType);

_core/src/multiarray/dtypemeta.h
80:    PyObject *castingimpls;

_core/src/multiarray/_multiarray_tests.c.src
886:        while (PyDict_Next(NPY_DT_SLOTS(from_dtype)->castingimpls,

_core/src/multiarray/dtypemeta.c
43:    Py_XDECREF(NPY_DT_SLOTS(self)->castingimpls);
323:    NPY_DT_SLOTS(DType)->castingimpls = PyDict_New();
324:    if (NPY_DT_SLOTS(DType)->castingimpls == NULL) {
1153:    dt_slots->castingimpls = PyDict_New();
1154:    if (dt_slots->castingimpls == NULL) {

_core/src/multiarray/convert_datatype.c
83:        res = PyDict_GetItemWithError(NPY_DT_SLOTS(from)->castingimpls, (PyObject *)to);
130:                if (PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls,
159:    if (PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls,
2007:    if (PyDict_Contains(NPY_DT_SLOTS(meth->dtypes[0])->castingimpls,
2014:    if (PyDict_SetItem(NPY_DT_SLOTS(meth->dtypes[0])->castingimpls,

I'll look at converting the bits of NumPy needed to put castingimpls behind a shared_mutex.

@seberg
Member

seberg commented Jan 23, 2025

Yeah, in fact, I think the convert_datatype.c function is probably the only place that really matters (not sure if dict iteration is safe, but if not, it is just a little test helper that doesn't matter).
All other occurrences happen at dtype creation time, or should (e.g. the usertypes.c one is a sanity check that tells users a cast was registered after already being used), and they just use GetItem.

If that isn't possible (and not just not the nicest way to do it!), I would still like to understand why it isn't OK to call GetItemWithError and only lock (which should be very unlikely) if the result is NULL. (Although then checking whether it was added in the meantime is of course necessary.)

@ngoldbaum
Member

ngoldbaum commented Jan 30, 2025

I managed to get the numpy tests passing after converting convert_datatype.c and dtypemeta.c to C++, along with adding some extern "C" blocks in several headers. See #28253.

I think the most straightforward way to add the mutex is to extend the NPY_DType_Slots struct to have a pointer to a mutex at the end:

diff --git a/numpy/_core/src/multiarray/dtypemeta.h b/numpy/_core/src/multiarray/dtypemeta.h
index a8b78e3f75..888ec09fba 100644
--- a/numpy/_core/src/multiarray/dtypemeta.h
+++ b/numpy/_core/src/multiarray/dtypemeta.h
@@ -85,6 +85,12 @@ typedef struct {
      * dtype instance for backward compatibility.  (Keep this at end)
      */
     PyArray_ArrFuncs f;
+
+    /*
+     *  Mutex to protect against race conditions modifying castingimpls at
+     *  runtime
+     */
+    void *mutex;
 } NPY_DType_Slots;

I can then heap-allocate the mutex during DType creation, which avoids polluting the internal C API with C++ constructs and follows the pattern of the existing shared_mutex use inside the dispatch cache.
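
A rough sketch of that plan (simplified stand-in names such as dtype_slots_sketch, get_cast, and add_cast; not the actual patch): the void * slot holds a heap-allocated std::shared_mutex, the lookup path takes a shared lock, and inserting a lazily created cast takes an exclusive lock.

#include <Python.h>
#include <mutex>
#include <shared_mutex>

// Stand-in for NPY_DType_Slots, with only the fields relevant here.
struct dtype_slots_sketch {
    PyObject *castingimpls;  // dict: target DType -> cast implementation
    void *mutex;             // opaque pointer keeps C++ types out of the C header
};

static std::shared_mutex *
cast_mutex(dtype_slots_sketch *slots)
{
    return static_cast<std::shared_mutex *>(slots->mutex);
}

// At DType creation time, next to the existing PyDict_New() call.
static int
init_cast_cache(dtype_slots_sketch *slots)
{
    slots->castingimpls = PyDict_New();
    if (slots->castingimpls == nullptr) {
        return -1;
    }
    slots->mutex = new std::shared_mutex();  // deleted in the DType's dealloc
    return 0;
}

// Read path: many threads can look up an existing cast concurrently.
static PyObject *
get_cast(dtype_slots_sketch *slots, PyObject *to)
{
    std::shared_lock<std::shared_mutex> lock(*cast_mutex(slots));
    // Borrowed reference is fine: casts live as long as the DType.
    return PyDict_GetItemWithError(slots->castingimpls, to);
}

// Write path: adding a lazily created legacy cast excludes all readers.
static int
add_cast(dtype_slots_sketch *slots, PyObject *to, PyObject *impl)
{
    std::unique_lock<std::shared_mutex> lock(*cast_mutex(slots));
    int has = PyDict_Contains(slots->castingimpls, to);
    if (has != 0) {
        return has < 0 ? -1 : 0;  // error, or another thread already added it
    }
    return PyDict_SetItem(slots->castingimpls, to, impl);
}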

If anyone spots any issues with that let me know but I think I'm going to try to implement this next.

@seberg
Member

seberg commented Jan 30, 2025

I mean, I still really don't understand why we shouldn't just try once (even with this mutex!); then a critical section on the castingimpls itself seems just as good (it is very, very unlikely to be hit in practice).
But presumably converting to C++ is straightforward and adding a mutex is fine.

@ngoldbaum
Member

Isn't that thread-unsafe? All accesses need to be done under the critical section, otherwise you could be reading from the dict while another thread writes to it. That's why I want an RW lock.

@seberg
Member

seberg commented Jan 30, 2025

Built-in types like dict, list, and set use internal locks to protect against concurrent modifications in ways that behave similarly to the GIL

I mean, the whole point of the GetRef() API was to make it thread-safe, no?

(Which doesn't mean we shouldn't just do the C++ conversion anyway; it is maybe time...)

@seberg
Member

seberg commented Jan 30, 2025

(I somewhat remember it as: for ufuncs, we said that using a dict rather than a custom hash table would be a solution where we don't need a custom lock.)

@ngoldbaum
Member

ngoldbaum commented Jan 30, 2025

What's happening right now is technically thread-safe in the sense that there are no data races, but there is a race condition. We could use PyDict_GetItemRef instead, but it doesn't matter that we're getting a borrowed reference, because the casts live as long as the DType and we hold a strong reference to the DType inside PyArray_GetCastingImpl. PyDict_GetItemRef only helps you if you don't know that a borrowed reference is safe.

What I want to do is make the race condition impossible by blocking all threads from reading castingimpls while any thread is writing to it, which should only happen the first time someone does a cast with a legacy user DType for a given pair of DTypes.

Does that make more sense?

@seberg
Member

seberg commented Jan 31, 2025

@ngoldbaum but the GetItemRef is thread-safe, in the sense that the return is either the correct cast impl or NULL, no?
If it is the correct cast impl, we are all good!

If the return is NULL, we need to lock down castingimpls! Then there are two possibilities:

  • Another thread was faster, by the time we get the castingimpls lock, the cast is already added.
  • We were faster, and we should add the cast.

So, if the return was NULL (and only then, which happens only once for each cast!), we have to lock castingimpls and then check again before we create it to avoid adding it twice (if another thread was faster).

Since the castingimpls lookup will effectively always return non-NULL (we add a None to indicate "undefined") and we don't need any additional mutex for that first GetItemRef, it seems nice to just avoid the mutex for the first step in any case?
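
As a minimal sketch of that pattern (hypothetical names like get_or_add_cast and make_legacy_cast; not the actual NumPy code): look up lock-free with PyDict_GetItemRef first, and only take the dict's critical section on a miss, re-checking before inserting.

#include <Python.h>

// Placeholder for the code that builds the wrapping ArrayMethod (or returns
// None to mean "no cast defined"); the real logic lives in convert_datatype.c.
static PyObject *
make_legacy_cast(PyObject *from, PyObject *to)
{
    (void)from;
    (void)to;
    return Py_NewRef(Py_None);
}

static PyObject *
get_or_add_cast(PyObject *castingimpls, PyObject *from, PyObject *to)
{
    PyObject *res = NULL;
    // Fast path: PyDict_GetItemRef is thread-safe on its own.  After the
    // first call for a given DType pair this always succeeds.
    if (PyDict_GetItemRef(castingimpls, to, &res) < 0) {
        return NULL;  // lookup raised
    }
    if (res != NULL) {
        return res;  // new reference to the cached cast (or None)
    }
    // Slow path, hit at most once per cast: lock the dict, check again in
    // case another thread added the cast in the meantime, then insert.
    Py_BEGIN_CRITICAL_SECTION(castingimpls);
    if (PyDict_GetItemRef(castingimpls, to, &res) == 0 && res == NULL) {
        res = make_legacy_cast(from, to);
        if (res != NULL && PyDict_SetItem(castingimpls, to, res) < 0) {
            Py_CLEAR(res);
        }
    }
    Py_END_CRITICAL_SECTION();
    return res;
}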

@ngoldbaum
Member

I opened #28290
