gh-113055: Fix memory leak of obmalloc state #113218

Closed
wants to merge 2 commits

Conversation

nascheme
Member

@nascheme nascheme commented Dec 16, 2023

Free the obmalloc arenas, radix tree nodes and allarenas array when the interpreter state is freed.

}
// free the array containing pointers to all arenas
PyMem_RawFree(allarenas);
#if WITH_PYMALLOC_RADIX_TREE
Contributor

@neonene neonene Dec 17, 2023

_freeze_module defaults to being a 32-bit application on Windows, so the USE_INTERIOR_NODES flag is not defined for that build.

Make it work for --without-pymalloc and for WITH_PYMALLOC_RADIX_TREE=0.
@neonene
Contributor

neonene commented Dec 17, 2023

With 4471ac7, my debug build crashes:

>_testembed_d test_repeated_init_exec "import _testcapi"  // or _ctypes
--- Loop #1 ---
--- Loop #2 ---  // Crash

@nascheme
Member Author

With 4471ac7, my debug build crashes: ...

I suspect this is a use-after-free bug. If I print out how many obmalloc blocks are still in use when free_arenas() is called, there are a lot, like thousands. I can't reproduce the crash right now, but it may have something to do with the dict keys freelist.

@nascheme
Member Author

I don't know if this is the same bug, but I can make it crash by running the following:

./Programs/_testembed test_repeated_init_exec "$(cat crash.py)"

Where the crash.py file contains:

import unittest

unittest.main('test.test_dict', verbosity=1, exit=False)
unittest.main('test.test_enum', verbosity=1, exit=False)
unittest.main('test.test_set', verbosity=1, exit=False)

Part of the stack trace:

#0  PyDict_SetItem (op=op@entry=0x7ffff6dbab40, key=0x7ffff6e4d2b0, value=value@entry=0x7ffff6e447e0) at ./Include/object.h:1180
#1  0x000055555565fd43 in PyDict_SetItemString (v=v@entry=0x7ffff6dbab40, key=key@entry=0x7ffff72c34e6 "resolution", 
    item=item@entry=0x7ffff6e447e0) at Objects/dictobject.c:4053
#2  0x00007ffff72c1dc5 in _datetime_exec (module=module@entry=0x7ffff6e39670) at ./Modules/_datetimemodule.c:6850
#3  0x00007ffff72c26a0 in PyInit__datetime () at ./Modules/_datetimemodule.c:6966
#4  0x0000555555742fa8 in _PyImport_LoadDynamicModuleWithSpec (spec=spec@entry=0x7ffff6e447a0, fp=fp@entry=0x0) at ./Python/importdl.c:170
#5  0x0000555555740c81 in _imp_create_dynamic_impl (module=module@entry=0x7ffff7cc1440, spec=0x7ffff6e447a0, file=<optimized out>)
    at Python/import.c:3749
#6  0x0000555555740cf3 in _imp_create_dynamic (module=0x7ffff7cc1440, args=0x7ffff6e44748, nargs=1) at Python/clinic/import.c.h:485
#7  0x00005555556697ae in cfunction_vectorcall_FASTCALL (func=0x7ffff7cc3880, args=0x7ffff6e44748, nargsf=<optimized out>, 
    kwnames=<optimized out>) at Objects/methodobject.c:425
#8  0x000055555562faec in _PyVectorcall_Call (tstate=tstate@entry=0x555555a84e28 <_PyRuntime+493384>, 
    func=0x555555669752 <cfunction_vectorcall_FASTCALL>, callable=callable@entry=0x7ffff7cc3880, tuple=tuple@entry=0x7ffff6e44730, 
    kwargs=kwargs@entry=0x7ffff6e4d180) at Objects/call.c:273
#9  0x000055555562fc78 in _PyObject_Call (tstate=0x555555a84e28 <_PyRuntime+493384>, callable=callable@entry=0x7ffff7cc3880, 
    args=args@entry=0x7ffff6e44730, kwargs=kwargs@entry=0x7ffff6e4d180) at Objects/call.c:348
#10 0x000055555562fcb3 in PyObject_Call (callable=callable@entry=0x7ffff7cc3880, args=args@entry=0x7ffff6e44730, 
    kwargs=kwargs@entry=0x7ffff6e4d180) at Objects/call.c:373
#11 0x00005555557035d7 in _PyEval_EvalFrameDefault (tstate=0x555555a84e28 <_PyRuntime+493384>, frame=0x7ffff7fb8b60, throwflag=<optimized out>)
    at Python/generated_cases.c.h:1225
#12 0x00005555557126fe in _PyEval_EvalFrame (tstate=tstate@entry=0x555555a84e28 <_PyRuntime+493384>, frame=<optimized out>, 
    throwflag=throwflag@entry=0) at ./Include/internal/pycore_ceval.h:115
#13 0x00005555557127e0 in _PyEval_Vector (tstate=0x555555a84e28 <_PyRuntime+493384>, func=<optimized out>, locals=locals@entry=0x0, 
    args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:1785
#14 0x000055555562e6f1 in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at Objects/call.c:413
#15 0x000055555562e8bd in _PyObject_VectorcallTstate (tstate=tstate@entry=0x555555a84e28 <_PyRuntime+493384>, 
    callable=callable@entry=0x7ffff75002c0, args=args@entry=0x7fffffffc470, nargsf=nargsf@entry=2, kwnames=kwnames@entry=0x0)
    at ./Include/internal/pycore_call.h:168

The op argument to PyDict_SetItem has an invalid ob_type pointer.

@neonene
Contributor

neonene commented Dec 18, 2023

I also hit a crash when importing the datetime module twice. The globals-to-fix.tsv file shows extension modules that seem related:

Modules/_ctypes/stgdict.c - PyCStgDict_Type -
Modules/_cursesmodule.c - PyCursesWindow_Type -
Modules/_datetimemodule.c - PyDateTime_DateTimeType -

Maybe a related attempt and crash: #44470 (comment)

@ericsnowcurrently
Member

I also hit a crash when importing the datetime module twice.

I'd expect any single-phase init module to potentially have problems with multiple init/fini cycles, since such legacy modules have no direct opportunity to finalize their global state. Regarding datetime specifically, see gh-71587 (and related issues).

@ericsnowcurrently
Member

Likewise, _testcapi (from #113218 (comment)) is single-phase init. That may be the problem.

Member

@ericsnowcurrently ericsnowcurrently left a comment

LGTM (with a caveat for my lack of familiarity with the radix tree stuff)

The only thing keeping me from marking "Approve" is remaining uncertainty about new crashes for embedded CPython across init/fini cycles.

Comment on lines +2675 to +2682
for (uint i = 0; i < maxarenas; ++i) {
    // free each obmalloc memory arena
    struct arena_object *ao = &allarenas[i];
    _PyObject_Arena.free(_PyObject_Arena.ctx,
                         (void *)ao->address, ARENA_SIZE);
}
// free the array containing pointers to all arenas
PyMem_RawFree(allarenas);
Member

I've double-checked that this is correct. I didn't see any other allocations for arenas (outside the radix tree stuff).

Comment on lines +2685 to +2686
// Free the middle and bottom nodes of the radix tree. These are allocated
// by arena_map_mark_used() but not freed when arenas are freed.
Member

Why aren't they freed when arenas are freed?

Member Author

Better performance and simpler code. It might not be too expensive to free them, though: we would have to track how many children each node has and then free the node when the count reaches zero.
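
For illustration, the bookkeeping this would require might look roughly like the sketch below. The n_children counters and the helper itself are hypothetical additions; only the node types, the index macros, and PyMem_RawFree come from obmalloc.

/* Hypothetical sketch, not part of obmalloc: free interior radix tree nodes
 * as soon as their last child goes away.  Assumes an n_children counter
 * added to arena_map_mid_t and arena_map_bot_t, maintained wherever
 * arena_map_mark_used() creates or clears entries. */
static void
arena_map_prune(void *arena_base)
{
    int i1 = MAP_TOP_INDEX(arena_base);
    int i2 = MAP_MID_INDEX(arena_base);
    arena_map_mid_t *mid = arena_map_root.ptrs[i1];
    arena_map_bot_t *bot = mid->ptrs[i2];
    if (--bot->n_children == 0) {       /* no arenas left under this bottom node */
        PyMem_RawFree(bot);
        mid->ptrs[i2] = NULL;
        if (--mid->n_children == 0) {   /* no bottom nodes left under this mid node */
            PyMem_RawFree(mid);
            arena_map_root.ptrs[i1] = NULL;
        }
    }
}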

Comment on lines +2683 to +2702
#if WITH_PYMALLOC_RADIX_TREE
#ifdef USE_INTERIOR_NODES
// Free the middle and bottom nodes of the radix tree. These are allocated
// by arena_map_mark_used() but not freed when arenas are freed.
for (int i1 = 0; i1 < MAP_TOP_LENGTH; i1++) {
    arena_map_mid_t *mid = arena_map_root.ptrs[i1];
    if (mid == NULL) {
        continue;
    }
    for (int i2 = 0; i2 < MAP_MID_LENGTH; i2++) {
        arena_map_bot_t *bot = arena_map_root.ptrs[i1]->ptrs[i2];
        if (bot == NULL) {
            continue;
        }
        PyMem_RawFree(bot);
    }
    PyMem_RawFree(mid);
}
#endif
#endif
Member

I'm not as familiar with the radix tree stuff. I do see the corresponding allocations in arena_map_get() (via arena_map_mark_used()) though. The logic you have here makes sense relative to that.

@ericsnowcurrently
Member

The only thing keeping me from marking "Approve" is remaining uncertainty about new crashes for embedded CPython across init/fini cycles.

We certainly don't want regressions there. However, I'd expect the parts causing crashes to already be problematic across init/fini cycles.

@nascheme
Member Author

Likewise, _testcapi (from #113218 (comment)) is single-phase init. That may be the problem.

It seems to me that we are not ready to be moving the obmalloc internal state to be per-interpreter. With so many modules allocating memory from obmalloc that outlives the interpreter, we are stuck in a no-win situation: either we leak the obmalloc arenas on init/fini interpreter cycles, or we free them and risk use-after-free crashes.

I think my free_arenas() function is only safe to call if the obmalloc used block count is zero. We are very far from that, and any real Python program likely has no hope of getting there without a massive effort to fix all of the extensions it uses. So, I don't see what the point of moving the obmalloc state actually was. I think we should move it back so we don't have this leak and then work on fixing modules with global state.
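
For illustration, the guard that safety condition implies might look like the sketch below. free_arenas() is the function added by this PR (its exact signature here is assumed), and _PyInterpreterState_GetAllocatedBlocks() is the existing per-interpreter block counter.

// Hypothetical helper, not part of this PR: only tear down the interpreter's
// obmalloc arenas when obmalloc reports zero live blocks; otherwise leak them
// rather than risk a use-after-free.
static void
maybe_free_arenas(PyInterpreterState *interp)
{
    Py_ssize_t live_blocks = _PyInterpreterState_GetAllocatedBlocks(interp);
    if (live_blocks == 0) {
        free_arenas(interp);    // signature assumed; the PR frees the arenas,
                                // radix tree nodes and the allarenas array here
    }
    // else: some extension still owns obmalloc memory; leaking is the
    // safer choice until those extensions are fixed
}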

Is it possible that mimalloc and the nogil changes will help solve this? Maybe we should leave obmalloc alone (i.e. revert to 3.11 behaviour) and then phase it out when mimalloc is fully working.

@nascheme
Member Author

An idea for an improvement: we could check allocated blocks per arena and free only the arenas that have zero used blocks. That should be safe and would free at least some memory. Then, if you want to avoid leaks on init/fini cycles, you just have to fix all the extensions so they don't leak.
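
A rough sketch of that idea, reusing fields that already exist on struct arena_object (nfreepools/ntotalpools); the helper name and where it would be called from are assumptions, not part of the PR.

// Hypothetical sketch: free only the arenas in which every pool is free,
// i.e. arenas with zero blocks still in use.  A real version would also
// have to unlink the arena from obmalloc's usable/unused arena lists and
// unmark it in the radix tree.
static void
free_unused_arenas(void)
{
    for (uint i = 0; i < maxarenas; ++i) {
        struct arena_object *ao = &allarenas[i];
        if (ao->address == 0) {
            continue;                   // slot is not backed by a live arena
        }
        if (ao->nfreepools == ao->ntotalpools) {
            // every pool in this arena is free, so no live block can point
            // into it; returning it to the system allocator is safe
            _PyObject_Arena.free(_PyObject_Arena.ctx,
                                 (void *)ao->address, ARENA_SIZE);
            ao->address = 0;            // mark the arena slot as unused
        }
    }
}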

@tim-one
Member

tim-one commented Dec 18, 2023

It seems to me that we are not ready to be moving the obmalloc internal state to be per-interpreter.

Indeed, even having that as "a goal" makes scant sense to me, no more than, e.g., that different interpreters should need to use their own system malloc(), or ...

If a sub-interpreter wants to end its life, it should be responsible for freeing the memory it allocated.

Is it possible that mimalloc and the nogil changes will help solve this? Maybe we should leave obmalloc alone (i.e. revert to 3.11 behaviour) and then phase it out when mimalloc is fully working.

mimalloc will help with nogil, but I doubt it will help at all in out-guessing sub-interpreters that won't/can't free the memory they allocated.

@nascheme
Member Author

mimalloc will help with nogil, but I doubt it will help at all in out-guessing sub-interpreters that won't/can't free the memory they allocated.

My understanding is that mimalloc is thread safe, so we could have one instance of mimalloc per process, shared between threads and sub-interpreters. However, to avoid leaking, you would need some kind of cross-interpreter or cross-thread reference counting system, so memory gets freed when it's no longer used. E.g. like what nogil's biased reference counting does.

@ericsnowcurrently
Member

I made the obmalloc state per-interpreter to avoid a global lock under a per-interpreter GIL.

@neonene
Contributor

neonene commented Dec 19, 2023

For fixing memory leaks from sub-interpreters (issue: #110411), the current free_arenas() looks safe, assuming it is correct that no legacy extension is allowed to be loaded by a sub-interpreter (see PEP 554 and the C API docs).

@ericsnowcurrently
Member

For fixing memory leaks from sub-interpreters (issue: #110411), the current free_arenas() looks safe, assuming it is correct that no legacy extension is allowed to be loaded by a sub-interpreter (see PEP 554 and the C API docs).

That's correct for isolated subinterpreters (those that have their own GIL). It's already a non-issue for legacy subinterpreters, which share the main interpreter's GIL as well as the main interpreter's obmalloc state.

The caveat for isolated subinterpreters is that the only way to identify a single-phase init module is to call its "init" function and check whether it returned a module object (rather than a PyModuleDef). For an isolated subinterpreter we immediately clear/free the module object and error out. However, there is never an opportunity to clear out the extension's global state, if any.
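
For illustration, that check amounts to something like the following, simplified from the import machinery; the helper name, the typedef, and the error handling are illustrative, not CPython's actual code.

// Rough sketch of the single-phase vs multi-phase check described above.
typedef PyObject *(*modinitfunc)(void);   // signature of a PyInit_<name> function

static int
uses_single_phase_init(modinitfunc initfunc)
{
    PyObject *result = initfunc();
    if (result == NULL) {
        return -1;                           // the init function raised an error
    }
    if (PyObject_TypeCheck(result, &PyModuleDef_Type)) {
        // multi-phase init: the function only returned its PyModuleDef,
        // no module (and no global state) has been created yet
        return 0;
    }
    // single-phase init: a fully initialized module object already exists,
    // and any global state the extension set up is already live
    Py_DECREF(result);
    return 1;
}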

As to the crashes with this PR, relative to global init/fini, that's still an issue. Here are some possible fixes:

  • dlclose() all extension module files during runtime fini (or interp fini in some cases)
  • use something like dlmopen() to load extensions in isolated symbol namespaces
  • track extension modules that store objects in global variables and don't clean them up, and do not clear obmalloc state for the corresponding interpreter
  • go back to 3.11 obmalloc, but use a PyMutex lock around obmalloc for thread-safety
  • switch to mimalloc sooner rather than later
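
For the PyMutex option above, a minimal sketch of what "a lock around obmalloc" could mean, assuming the PyMutex API added for the free-threading work; the wrapper, the mutex name, and where it would be installed are assumptions, not a worked-out design.

// Hypothetical wrapper, not CPython code: serialize entry into a single
// runtime-global obmalloc (3.11-style state) with one process-wide PyMutex.
// _PyObject_Malloc is obmalloc's allocation entry point; the other entry
// points (free/realloc/calloc) would be wrapped the same way.
static PyMutex obmalloc_mutex;   // zero-initialized, i.e. unlocked

static void *
locked_object_malloc(void *ctx, size_t nbytes)
{
    PyMutex_Lock(&obmalloc_mutex);
    void *ptr = _PyObject_Malloc(ctx, nbytes);
    PyMutex_Unlock(&obmalloc_mutex);
    return ptr;
}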

@neonene
Contributor

neonene commented Dec 20, 2023

the only way to identify a single-phase init module is to call its "init" function and check if it returned a module object

Sorry for not knowing much about the internals, but is there any chance isolated subinterpreters could be excluded from the suggested approaches, by receiving the module identification from the main interpreter and by using the current free_arenas() function?

@neonene
Contributor

neonene commented Dec 22, 2023

  • track extension modules that store objects in global variables and don't clean them up, and do not clear obmalloc state for the corresponding interpreter

I would prefer this, leaving aside the issue of isolated subinterpreters.

@nascheme
Member Author

nascheme commented Dec 22, 2023

I tried a slightly different approach in gh-113412. There is a slight overhead for get_state() vs 3.11, but I don't think that matters too much. If the Py_RTFLAGS_USE_MAIN_OBMALLOC flag is set (which I think is the default), the behaviour is like 3.11 in that there is one obmalloc state for all interpreters.

If you want to use isolated sub-interpreters, which I think requires that the Py_RTFLAGS_USE_MAIN_OBMALLOC flag is cleared, then you will need to make sure the sub-interpreters use obmalloc correctly. E.g. don't share memory allocated from obmalloc between interpreters, and don't hang on to memory after the interpreter has been freed.
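
Roughly, the shape of that dispatch is sketched below (simplified, not the exact code from gh-113412, and assuming the 3.12-style obmalloc field on the interpreter state).

// Simplified sketch of the get_state() dispatch described above: interpreters
// created with Py_RTFLAGS_USE_MAIN_OBMALLOC share the main interpreter's
// obmalloc state, as in 3.11; only isolated interpreters that clear the flag
// get (and must clean up) their own state.
static inline OMState *
get_state(void)
{
    PyInterpreterState *interp = _PyInterpreterState_GET();
    if (interp->feature_flags & Py_RTFLAGS_USE_MAIN_OBMALLOC) {
        // shared, 3.11-style behaviour: one obmalloc state for the process
        interp = _PyInterpreterState_Main();
    }
    return &interp->obmalloc;
}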

@nascheme
Member Author

For Python 3.13, GH-113412 was merged as an alternative to this. This fix could possibly be applied to 3.12, but it needs more polish and testing first.

@nascheme nascheme added the 3.12 only security fixes label Oct 15, 2024
@nascheme
Member Author

Closing, this is not going to get applied.

@nascheme nascheme closed this Apr 28, 2025