Skip to content

BUG: fix race initializing legacy dtype casts #28290

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 11, 2025

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Feb 6, 2025

Fixes #28048

Locks castingimpls with a critical section when the castingimpls cache is empty before writing to the dict.

Because the critical section macros have braces, I refactored PyArray_GetCastingImpl into three functions that call each other. Only the "middle" function needs the critical section, so that reduces the pain of not being able to early return in a critical section a little bit.

I also added some comments because this took me quite a while to fully understand so future readers will hopefully be less confused.

Edit: there are also some unrelated fixes for the test infrastructure to avoid CI crashes and generate better debug output from CI jobs. Happy to do those separately if anyone requests.

@ngoldbaum ngoldbaum added 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703) 09 - Backport-Candidate PRs tagged should be backported labels Feb 6, 2025
@ngoldbaum
Copy link
Member Author

ngoldbaum commented Feb 6, 2025

This is decently complicated but I added the backport candidate label because I think this is only possibly risky for the free-threaded build and it fixes an annoying Jax runtime crash on that build that they can't really do anything about.

if (PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls,
(PyObject *) to, Py_None) < 0) {
return NULL;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None gets inserted into to castingimpls later on before exiting the critical section so there's no need to do it here

* Using this function outside of module initialization without holding a
* critical section on the castingimpls dict may lead to a race to fill the
* dict. Use PyArray_GetGastingImpl to lazily register casts at runtime
* safely.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually verified that this function is only used during module initialization or inside the critical section I added in this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking. I guess we are fine as long as we enforce that casts from/to a new DType must be registered at the DType creation. (which is the enforced right now, since this is not public API)

@ngoldbaum

This comment was marked as outdated.

@ngoldbaum ngoldbaum marked this pull request as draft February 6, 2025 19:33
@ngoldbaum
Copy link
Member Author

Pushed a fix that should make ASAN and TSAN CI output more helpful. Also, yay, TSAN CI found a race condition!

@ngoldbaum ngoldbaum force-pushed the castingimpls-critical-section branch from 8120070 to 8f53a12 Compare February 6, 2025 21:46
@ngoldbaum
Copy link
Member Author

Not sure why the 32 bit runners are unhappy, will have to come back to this.

@ngoldbaum ngoldbaum force-pushed the castingimpls-critical-section branch from 9b5289f to 137ca3c Compare February 7, 2025 18:25
@ngoldbaum
Copy link
Member Author

It turns out the hangs and crashes on 32-bit runners were because of memory exhaustion, I think the latest push should pass all the CI.

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Feb 7, 2025

were because of memory exhaustion

Not quite right.

At some arbitrary thread count that depends on the precise details of the memory allocation patterns of threads, Python will raise a runtime error saying that it can't spawn any more threads when you ask it to spawn a thread. Normally the exception would bubble up properly. This time, the exception coming from ThreadPoolExecutor.submit triggered a latent deadlock bug in the implementation of run_threaded. Since the exception case didn't cancel the threading.Barrier(), the threads that did manage to get started hit the barrier and then hang.

The fix is to put the code working with the ThreadPoolExecutor.submit inside a try/finally block to guarantee a call to barrier.abort() if an error happens before all the threads were successfully submitted. This makes sure the barrier is canceled and avoids the deadlock.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I can review this properly, but it looks good.

return NULL;
}
Py_RETURN_NONE;
PyErr_Clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation now funny.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Natahn! LGTM. The only behavior change, might be that we init the within castingimpl immediately now. That seems OK.

If this is annoying someone in practice, I am OK with backporting this.

I left two comments about code cleanup, but I don't feel strongly about them.

@@ -111,6 +113,8 @@ jobs:
pip install -r requirements/build_requirements.txt
pip install -r requirements/ci_requirements.txt
pip install -r requirements/test_requirements.txt
# xdist captures stdout/stderr, but we want the TSAN output
pip uninstall -y pytest-xdist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know xdist kicks in without -n, but seems good.

if (!return_error &&
PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls,
(PyObject *)to, res) < 0) {
return_error = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be slightly nicer to keep this inside the res == NULL branch (if it isn't we don't need to set it again).

The first if could go into an else if (res == NULL) (that includes this).
If you are than OK with using Py_CLEAR(res) on this line, you can delete return_error (with the funny thing that the first if does nothing).

To me that would look nicer, I think, but happy to keep this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it is better to use else if if you can't early return. I'd rather keep the return_error since it makes things a little more explicit in the weird coding style you have to use inside critical sections.

* Using this function outside of module initialization without holding a
* critical section on the castingimpls dict may lead to a race to fill the
* dict. Use PyArray_GetGastingImpl to lazily register casts at runtime
* safely.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking. I guess we are fine as long as we enforce that casts from/to a new DType must be registered at the DType creation. (which is the enforced right now, since this is not public API)

Comment on lines 2746 to 2748
finally:
if n_submitted < max_workers and pass_barrier:
barrier.abort()
Copy link
Member

@seberg seberg Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be clearer to write as:

except:  # or BaseExecption if you prefer
    if pass_barrier:
        barrier.abort()
    raise

and avoid counting the n_submitted above.

(but maybe I missed a subtlety?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but maybe I missed a subtlety?)

bare except: and except BaseException is the sort of thing that linters mark as a code smell or a naive reader in the future would replace with except Exception (which is wrong). I liketry/finally because it matches the intention and uses a language feature that linters won't flag.

I can get rid of n_submitted and just check len(futures) though.

@ngoldbaum
Copy link
Member Author

Looks like all the CI failures are unrelated network issues.

@charris
Copy link
Member

charris commented Feb 11, 2025

The loongarch64 failure was legit, but unrelated. #28320 should fix it.

@charris charris merged commit 1893f03 into numpy:main Feb 11, 2025
64 of 68 checks passed
@charris
Copy link
Member

charris commented Feb 11, 2025

Thanks Nathan.

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Race adding legacy casts to custom dtype under free threading
4 participants