Skip to content

[3.13] gh-128679: Fix tracemalloc.stop() race conditions #128897

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jan 18, 2025

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Jan 15, 2025

tracemalloc_alloc(), tracemalloc_realloc(), PyTraceMalloc_Track(), PyTraceMalloc_Untrack() and _PyTraceMalloc_TraceRef() now check tracemalloc_config.tracing after calling TABLES_LOCK().

_PyTraceMalloc_Stop() now protects more code with TABLES_LOCK(), especially setting tracemalloc_config.tracing to 1.

Add a test using PyTraceMalloc_Track() to test tracemalloc.stop() race condition.

tracemalloc_alloc(), tracemalloc_realloc(), PyTraceMalloc_Track(),
PyTraceMalloc_Untrack() and _PyTraceMalloc_TraceRef() now check
tracemalloc_config.tracing after calling TABLES_LOCK().

_PyTraceMalloc_Stop() now protects more code with TABLES_LOCK(),
especially setting tracemalloc_config.tracing to 1.

Add a test using PyTraceMalloc_Track() to test tracemalloc.stop()
race condition.
@vstinner vstinner marked this pull request as ready for review January 16, 2025 12:53
@vstinner
Copy link
Member Author

@ZeroIntensity: Here is a simplified change for the 3.12 and 3.13 branch. Would you mind to review it?

Copy link
Member

@ZeroIntensity ZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple other (unmodified) functions where I'm seeing lockless reads:

  • tracemalloc_get_traceback
  • _PyMem_DumpTraceback
  • _PyTraceMalloc_GetTraces
  • _PyTraceMalloc_GetTracedMemory
  • _PyTraceMalloc_ResetPeak

@@ -1376,16 +1395,21 @@ _PyTraceMalloc_TraceRef(PyObject *op, PyRefTracerEvent event, void* Py_UNUSED(ig
int res = -1;

TABLES_LOCK();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also read tracing without a lock before this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking tracing without the lock is a workaround for the Python finalization. TABLES_LOCK() can no longer be called after _PyTraceMalloc_Fini() (the lock is destroyed), whereas _PyTraceMalloc_TraceRef() is still called after _PyTraceMalloc_Fini().

tracing is checked again once we acquired the lock.

The issue was fixed properly in the main branch by calling (void)PyRefTracer_SetTracer(NULL, NULL); in _PyTraceMalloc_Stop(). Since this change is designed to be backported to Python 3.12, I didn't include this fix.

Once this change is merged in 3.13 and 3.12, we can remove the unprotected tracing read in 3.13 in _PyTraceMalloc_TraceRef() by adding (void)PyRefTracer_SetTracer(NULL, NULL);.

@vstinner
Copy link
Member Author

Ok, I completed my PR to cover most (almost all?) functions, instead of focusing on alloc/realloc/free and Track/Untrack functions.

@ZeroIntensity: Please review the updated PR.

@vstinner
Copy link
Member Author

vstinner commented Jan 16, 2025

_PyTraceMalloc_GetTraces

Fixing properly _PyTraceMalloc_GetTraces() requires to redesign tracemalloc! For example, currently, the PyMem_Malloc() hook uses TABLES_LOCK() on a reentrant call. In the main branch, the whole _PyTraceMalloc_GetTraces() function is run with TABLES_LOCK(). If we do the same in 3.13, the PyMem_Malloc() hook must be modified to no longer use TABLES_LOCK() on a reentrant call. It would change the behavior (for correctness!), I'm not sure that it's a good thing in a stable branch.

In the main branch, I redesigned the code to be able to hold the lock for all tables operations. For that, I had to modify the behavior:

tracemalloc_realloc() and tracemalloc_free() no longer remove the trace on reentrant call.

I'm open to consider fixing _PyTraceMalloc_GetTraces() by redesigning tracemalloc in 3.13+3.12, but I'm not sure if it's worth it. Calling get_traces() is a deliberate action from the user, it's different from the tracemalloc hooks which called implicitly by running regular Python code.

@ZeroIntensity
Copy link
Member

It looks like tests somehow were broken.

So TABLES_LOCK() can be called in _PyMem_DumpTraceback() even if
tracemalloc is not tracing.
@vstinner
Copy link
Member Author

It looks like tests somehow were broken.

Fixed.

Well... Fixing more functions require to backport more changes. I backported the _PyTraceMalloc_Init() change to fix test_capi and test_gc.

Copy link
Member

@ZeroIntensity ZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. With the exception of a few special cases that you've mentioned, everything is properly locked! Due to the changes to the lifecycle and the size of the change, I think we should run buildbots before merging.

@ZeroIntensity ZeroIntensity added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 16, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @ZeroIntensity for commit 54c8f9e 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 16, 2025
@vstinner
Copy link
Member Author

buildbot/AMD64 FreeBSD Refleaks PR failed

test_tracemalloc_track_race (test.test_tracemalloc.TestCAPI.test_tracemalloc_track_race) ...

Debug memory block at address p=0x82647c840: API 'r'
    16 bytes originally requested
    The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0x82647c850 are FORBIDDENBYTE, as expected.
    Data at p: 10 e3 a6 42 08 00 00 00 60 c9 47 26 08 00 00 00

Enable tracemalloc to get the memory block allocation traceback

Fatal Python error: _PyMem_DebugRawFree: bad ID: Allocated using API 'r', verified using API 'P'
Python runtime state: initialized

Thread 0x0000000831196500 (most recent call first):
  <no Python frame>

(...)

@vstinner vstinner marked this pull request as draft January 16, 2025 17:00
@vstinner
Copy link
Member Author

I mark this PR as a draft because of the FreeBSD crash.

Oh, I can also reproduce the test_tracemalloc_track_race() crash on the main branch on FreeBSD.

@vstinner
Copy link
Member Author

So far, I cannot reproduce the crash on Linux.

@vstinner
Copy link
Member Author

I can trigger a crash if I insert a delay in PyMem_SetAllocator():

diff --git a/Objects/obmalloc.c b/Objects/obmalloc.c
index b103deb01ca..c20eb47e8a8 100644
--- a/Objects/obmalloc.c
+++ b/Objects/obmalloc.c
@@ -859,7 +859,14 @@ set_allocator_unlocked(PyMemAllocatorDomain domain, PyMemAllocatorEx *allocator)
 {
     switch(domain)
     {
-    case PYMEM_DOMAIN_RAW: _PyMem_Raw = *allocator; break;
+    case PYMEM_DOMAIN_RAW:
+        _PyMem_Raw.ctx = allocator->ctx;
+        (void)sched_yield();
+        _PyMem_Raw.malloc = allocator->malloc;
+        _PyMem_Raw.calloc = allocator->calloc;
+        _PyMem_Raw.realloc = allocator->realloc;
+        _PyMem_Raw.free = allocator->free;
+        break;
     case PYMEM_DOMAIN_MEM: _PyMem = *allocator; break;
     case PYMEM_DOMAIN_OBJ: _PyObject = *allocator; break;
     /* ignore unknown domain */

The write is protected by a lock (write), but PyMem_RawMalloc() and PyMem_RawFree() don't use the lock (read).

@vstinner
Copy link
Member Author

vstinner commented Jan 18, 2025

Only debug build are affected, not release build. I modified the test in the main branch to skip it on debug build: #128988

…ython#128988)

There is a race condition between PyMem_SetAllocator() and
PyMem_RawMalloc()/PyMem_RawFree(). While PyMem_SetAllocator() write
is protected by a lock, PyMem_RawMalloc()/PyMem_RawFree() reads are
not protected by a lock. PyMem_RawMalloc()/PyMem_RawFree() can be
called with an old context and the new function pointer.

On a release build, it's not an issue since the context is not used.
On a debug build, the debug hooks use the context and so can crash.

(cherry picked from commit 9bc1964)
@vstinner vstinner marked this pull request as ready for review January 18, 2025 23:19
@vstinner vstinner enabled auto-merge (squash) January 18, 2025 23:20
@vstinner vstinner merged commit 6b47499 into python:3.13 Jan 18, 2025
39 checks passed
@vstinner vstinner deleted the tracemalloc_stop13 branch January 18, 2025 23:39
@miss-islington-app
Copy link

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

@miss-islington-app
Copy link

Sorry, @vstinner, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6b47499510e47c0401d1f6cca2660fc12c496e39 3.12

@bedevere-app
Copy link

bedevere-app bot commented Jan 19, 2025

GH-129022 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Jan 19, 2025
vstinner added a commit that referenced this pull request Jan 19, 2025
…129022)

[3.13] gh-128679: Fix tracemalloc.stop() race conditions (#128897)

tracemalloc_alloc(), tracemalloc_realloc(), PyTraceMalloc_Track(),
PyTraceMalloc_Untrack() and _PyTraceMalloc_TraceRef() now check
tracemalloc_config.tracing after calling TABLES_LOCK().

_PyTraceMalloc_Stop() now protects more code with TABLES_LOCK(),
especially setting tracemalloc_config.tracing to 1.

Add a test using PyTraceMalloc_Track() to test tracemalloc.stop()
race condition.

Call _PyTraceMalloc_Init() at Python startup.

(cherry picked from commit 6b47499)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants