Skip to content

gh-128679: fix race condition in tracemalloc #128695

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix race condition in tracemalloc causing abort.
30 changes: 24 additions & 6 deletions Python/tracemalloc.c
Original file line number Diff line number Diff line change
Expand Up @@ -538,11 +538,20 @@ tracemalloc_alloc(int use_calloc, void *ctx, size_t nelem, size_t elsize)
return NULL;

TABLES_LOCK();
if (ADD_TRACE(ptr, nelem * elsize) < 0) {
/* Failed to allocate a trace for the new memory block */
TABLES_UNLOCK();
alloc->free(alloc->ctx, ptr);
return NULL;
/* This operation can be executed outside of GIL protection due to
allocations needed for an initial PyGILState_Ensure(). Because of this,
tracing may be turned off concurrently in another thread which currently
holds the GIL between the initial check of `tracing` variable and here.
This is a sanity check to account for this possibility and avoid a
potential abort where it can be handled gracefully instead.
See gh-128679. */
if (tracemalloc_config.tracing) {
if (ADD_TRACE(ptr, nelem * elsize) < 0) {
/* Failed to allocate a trace for the new memory block */
TABLES_UNLOCK();
alloc->free(alloc->ctx, ptr);
return NULL;
}
}
TABLES_UNLOCK();
return ptr;
Expand Down Expand Up @@ -963,8 +972,11 @@ _PyTraceMalloc_Stop(void)
if (!tracemalloc_config.tracing)
return;

/* stop tracing Python memory allocations */
/* stop tracing Python memory allocations,
but not while something might be in the middle of an operation */
TABLES_LOCK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to lock writes, we need to lock reads as well, but see my other comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading without lock like the first check in PyTraceMalloc_Track() can be done for an early out without problems in that particular case (though this absolutely necessitates the second check after GIL acquire). Atomic operations should be used in these cases though where they might be outside GIL protection, that should be safe then if relying on GIL. For free-threading safety more changes would be needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't really mix and match atomics. If something relies on a lock for thread-safety, then you must hold that lock in order to read or write to it. It's not safe to rely on the GIL for some reads, and then rely on a different lock for others. It might be OK in some cases with the GIL because of implicit lock-ordering, but that can break very easily.

In this specific case, this would race with PyTraceMalloc_Track if it was called without a thread state (i.e., the GIL), because that doesn't hold the tables lock when reading, so there's no synchronization between the two operations. It's only slightly OK because the GIL will serialize it for you in almost every case--there are very few people that are using PyTraceMalloc_Track while their thread state is detached. But that's besides the point, we should go for a more robust fix here rather than relying on lock-ordering with the GIL. (We'll also kill two birds with one stone--it will fix it for both the default build and for free-threading.)

I would either go with atomic reads and writes, or a compare-exchange if there ends up being some problems with inconsistent state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that adding TABLES_LOCK/TABLES_UNLOCK while setting tracing avoids a race condition. At least, I don't understand how. What are your trying to protect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that adding TABLES_LOCK/TABLES_UNLOCK while setting tracing avoids a race condition.

I'm not sure either, I was practicing defensive programming. Dunno if setting this to 0 here would cause a problem for an operation simultaneously taking place in another thread in a tables_lock critical section but the possible outcomes are:

  1. Doesn't cause a problem and protect - good outcome.
  2. Causes a problem and protect - good outcome.
  3. Doesn't cause a problem and no protect - good outcome.
  4. Causes a problem and no protect - bad outcome.

I am assuming @vstinner you are familiar with the codebase so if you say this would not cause a problem then I can certainly remove it.

I would either go with atomic reads and writes, or a compare-exchange if there ends up being some problems with inconsistent state.

Neither of those would solve the OP problem by themselves and just changing the tracing reads and writes to atomic or cmpxchg wouldn't be enough for free-threading either as you would need to explicitly replicate the GIL protection mechanism. On the other hand they won't hurt anything so if you want I can change the tracing reads/writes to atomic load/store? (they don't actually do anything special on x86 anyway)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, switching it to completely atomic won't fix it, but it's a good start because again, this approach is error-prone. It might pass on the test case, but that doesn't necessarily mean it's the correct approach.

Imagine a case where:

  • Thread 1 calls tracemalloc_alloc with the GIL held.
  • Thread 2 calls PyGILState_Ensure and is now blocked until it's released.
  • Thread 1 acquires the tables lock.
  • Thread 1 triggers a garbage collection via ADD_TRACE
  • A finalizer releases the GIL.
  • Thread 2 takes the GIL, and begins waiting on the tables lock.
  • Deadlock ensues, as thread 1 is waiting on the GIL, and thread 2 is waiting on the tables lock. (This wouldn't be an issue if PyThread_acquire_lock released the GIL, but I'm not certain that it does, because that's not desirable for all locks.)

If you think that atomic operations will be too difficult to use here, then you can use a dedicated PyMutex for the tracing field, but then that lock should be held for absolutely all reads and writes to tracing.

tracemalloc_config.tracing = 0;
TABLES_UNLOCK();

/* unregister the hook on memory allocators */
#ifdef TRACE_RAW_MALLOC
Expand Down Expand Up @@ -1317,6 +1329,12 @@ PyTraceMalloc_Track(unsigned int domain, uintptr_t ptr,

gil_state = PyGILState_Ensure();

if (!tracemalloc_config.tracing) {
/* tracing may have been turned off as we were acquiring the GIL */
PyGILState_Release(gil_state);
return -2;
}

TABLES_LOCK();
res = tracemalloc_add_trace(domain, ptr, size);
TABLES_UNLOCK();
Expand Down
Loading