-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: fix race initializing legacy dtype casts #28290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is decently complicated but I added the backport candidate label because I think this is only possibly risky for the free-threaded build and it fixes an annoying Jax runtime crash on that build that they can't really do anything about. |
if (PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls, | ||
(PyObject *) to, Py_None) < 0) { | ||
return NULL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
gets inserted into to castingimpls later on before exiting the critical section so there's no need to do it here
* Using this function outside of module initialization without holding a | ||
* critical section on the castingimpls dict may lead to a race to fill the | ||
* dict. Use PyArray_GetGastingImpl to lazily register casts at runtime | ||
* safely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I manually verified that this function is only used during module initialization or inside the critical section I added in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking. I guess we are fine as long as we enforce that casts from/to a new DType must be registered at the DType creation. (which is the enforced right now, since this is not public API)
This comment was marked as outdated.
This comment was marked as outdated.
Pushed a fix that should make ASAN and TSAN CI output more helpful. Also, yay, TSAN CI found a race condition! |
8120070
to
8f53a12
Compare
Not sure why the 32 bit runners are unhappy, will have to come back to this. |
9b5289f
to
137ca3c
Compare
It turns out the hangs and crashes on 32-bit runners were because of memory exhaustion, I think the latest push should pass all the CI. |
Not quite right. At some arbitrary thread count that depends on the precise details of the memory allocation patterns of threads, Python will raise a runtime error saying that it can't spawn any more threads when you ask it to spawn a thread. Normally the exception would bubble up properly. This time, the exception coming from The fix is to put the code working with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I can review this properly, but it looks good.
return NULL; | ||
} | ||
Py_RETURN_NONE; | ||
PyErr_Clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation now funny.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Natahn! LGTM. The only behavior change, might be that we init the within castingimpl immediately now. That seems OK.
If this is annoying someone in practice, I am OK with backporting this.
I left two comments about code cleanup, but I don't feel strongly about them.
@@ -111,6 +113,8 @@ jobs: | |||
pip install -r requirements/build_requirements.txt | |||
pip install -r requirements/ci_requirements.txt | |||
pip install -r requirements/test_requirements.txt | |||
# xdist captures stdout/stderr, but we want the TSAN output | |||
pip uninstall -y pytest-xdist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't know xdist kicks in without -n
, but seems good.
if (!return_error && | ||
PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls, | ||
(PyObject *)to, res) < 0) { | ||
return_error = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be slightly nicer to keep this inside the res == NULL
branch (if it isn't we don't need to set it again).
The first if could go into an else if (res == NULL)
(that includes this).
If you are than OK with using Py_CLEAR(res)
on this line, you can delete return_error
(with the funny thing that the first if
does nothing).
To me that would look nicer, I think, but happy to keep this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it is better to use else if
if you can't early return. I'd rather keep the return_error
since it makes things a little more explicit in the weird coding style you have to use inside critical sections.
* Using this function outside of module initialization without holding a | ||
* critical section on the castingimpls dict may lead to a race to fill the | ||
* dict. Use PyArray_GetGastingImpl to lazily register casts at runtime | ||
* safely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking. I guess we are fine as long as we enforce that casts from/to a new DType must be registered at the DType creation. (which is the enforced right now, since this is not public API)
numpy/testing/_private/utils.py
Outdated
finally: | ||
if n_submitted < max_workers and pass_barrier: | ||
barrier.abort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be clearer to write as:
except: # or BaseExecption if you prefer
if pass_barrier:
barrier.abort()
raise
and avoid counting the n_submitted
above.
(but maybe I missed a subtlety?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(but maybe I missed a subtlety?)
bare except:
and except BaseException
is the sort of thing that linters mark as a code smell or a naive reader in the future would replace with except Exception
(which is wrong). I liketry/finally
because it matches the intention and uses a language feature that linters won't flag.
I can get rid of n_submitted
and just check len(futures)
though.
Looks like all the CI failures are unrelated network issues. |
The loongarch64 failure was legit, but unrelated. #28320 should fix it. |
Thanks Nathan. |
Fixes #28048
Locks
castingimpls
with a critical section when thecastingimpls
cache is empty before writing to the dict.Because the critical section macros have braces, I refactored
PyArray_GetCastingImpl
into three functions that call each other. Only the "middle" function needs the critical section, so that reduces the pain of not being able to early return in a critical section a little bit.I also added some comments because this took me quite a while to fully understand so future readers will hopefully be less confused.
Edit: there are also some unrelated fixes for the test infrastructure to avoid CI crashes and generate better debug output from CI jobs. Happy to do those separately if anyone requests.