Skip to content

Refleaks on free-threaded builds #134557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
emmatyping opened this issue May 23, 2025 · 19 comments
Closed

Refleaks on free-threaded builds #134557

emmatyping opened this issue May 23, 2025 · 19 comments
Labels
tests Tests in the Lib/test dir topic-subinterpreters type-bug An unexpected behavior, bug, or error

Comments

@emmatyping
Copy link
Member

emmatyping commented May 23, 2025

Bug report

Bug description:

The nogil refleak buildbots have started failing for main and 3.14 with leaks in test_sys, test_capi and a few others.

See e.g. https://buildbot.python.org/#/builders/1714/builds/90

I bisected this to 09e72cf, cc @ericsnowcurrently

CPython versions tested on:

CPython main branch, 3.14

Operating systems tested on:

Linux

Linked PRs

@emmatyping emmatyping added type-bug An unexpected behavior, bug, or error tests Tests in the Lib/test dir labels May 23, 2025
@emmatyping emmatyping changed the title refleaks on free-threaded builds Refleaks on free-threaded builds May 23, 2025
@ericsnowcurrently
Copy link
Member

Yeah, I'll be home soon and will take a look.

@ericsnowcurrently
Copy link
Member

Any ideas on why there would be leaks only on free-threaded builds?

Also, FWIW, the leaks may actually originate in gh-133482.

@ZeroIntensity
Copy link
Member

Hm, I don't see any obvious leaks on the PRs. I wouldn't be surprised if it only revealed an existing leak on free-threading.

@ZeroIntensity
Copy link
Member

Ah, @ericsnowcurrently, this looks like a leak:

PyObject *res = PyMemoryView_FromObject(obj);

obj needs to decref'd on the successful path. I'm not sure if that's the cause of the buildbots failures, though. No idea why it would only fail under free-threading (maybe it's because the object structure is bigger?)

@emmatyping
Copy link
Member Author

I tried adding a decref after that but the tests still show a refleak.

@ericsnowcurrently
Copy link
Member

FWIW, the results have been consistent for me across runs, so it doesn't seem like a race is involved.

@ericsnowcurrently
Copy link
Member

It may be due to interned strings. The only refleak-failing test in test_sys is test_subinterp_intern_dynamically_allocated.

(Thus this is more likely to be an existing bug than one introduced yesterday.)

@picnixz
Copy link
Member

picnixz commented May 23, 2025

While reviewing the new code, I saw that

PyObject *exc = _PyErr_GetRaisedException(tstate);
PyObject *cause = PyException_GetCause(exc);
if (cause != NULL) {
Py_DECREF(exc);
exc = cause;
}
else {
assert(PyException_GetContext(exc) == NULL);
}

should use Py_SETREF instead. I don't think it's the cause of the leak though

@ericsnowcurrently
Copy link
Member

ericsnowcurrently commented May 23, 2025

In test_capi (test_capi.test_misc), the leak is manifesting exclusively in test_isolated_subinterpreter (specifically the first two subtests).

@ericsnowcurrently
Copy link
Member

I minimized the leaky test in test_capi. The following leaks:

    def test_subinterp_leak(self):
        interpid = _interpreters.create()
        self.addCleanup(lambda: _interpreters.destroy(interpid))
        timeout = time.time() + 30
        _interpreters.run_string(interpid, f"""if True:
            import time
            time.time() < {timeout}
            """)

but the following does not:

    def test_subinterp_leak(self):
        interpid = _interpreters.create()
        self.addCleanup(lambda: _interpreters.destroy(interpid))
        timeout = 30000000000
        _interpreters.run_string(interpid, f"""if True:
            import time
            time.time() < {timeout}
            """)

nor does:

    def test_subinterp_leak(self):
        interpid = _interpreters.create()
        self.addCleanup(lambda: _interpreters.destroy(interpid))
        _interpreters.run_string(interpid, f"""if True:
            import time
            time.time() < 30000000000
            """)

@ericsnowcurrently
Copy link
Member

I'm reverting for now, but I suspect there's still an existing leak to be fixed.

@emmatyping
Copy link
Member Author

FWIW the valgrind report for your minimal reproducer is:

==210093== 249,175 bytes in 1 blocks are definitely lost in loss record 8 of 8
==210093==    at 0x484D953: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==210093==    by 0x2DB669: _PyMem_DebugRawAlloc (Objects/obmalloc.c:2785)
==210093==    by 0x2DB669: _PyMem_DebugRawCalloc (???:2830)
==210093==    by 0x4BB06F: alloc_interpreter (Python/pystate.c:468)
==210093==    by 0x4BB06F: _PyInterpreterState_New (???:656)
==210093==    by 0x4BB53D: PyInterpreterState_New (Python/pystate.c:702)
==210093==    by 0x49DD60: new_interpreter (Python/pylifecycle.c:2294)
==210093==    by 0x49DCB2: Py_NewInterpreterFromConfig (Python/pylifecycle.c:2397)
==210093==    by 0x4442EF: _PyXI_NewInterpreter (Python/crossinterp.c:2736)
==210093==    by 0x51A7A90: interp_create (_interpretersmodule.c:632)
==210093==    by 0x2B9FE0: cfunction_call (Objects/methodobject.c:565)
==210093==    by 0x230CAA: _PyObject_MakeTpCall (Objects/call.c:242)
==210093==    by 0x3E230F: _PyEval_EvalFrameDefault (generated_cases.c.h:1619)
==210093==    by 0x3DB20C: _PyEval_EvalFrame (pycore_ceval.h:119)
==210093==    by 0x3DB20C: _PyEval_Vector (???:1961)

So perhaps an interpreter is not getting freed?

@ericsnowcurrently
Copy link
Member

It should be freed during test cleanup.

ericsnowcurrently added a commit that referenced this issue May 23, 2025
This reverts commit 09e72cf, AKA gh-134511.

We are reverting due to refleaks on free-threaded builds.
ericsnowcurrently added a commit that referenced this issue May 23, 2025
…ta()" (gh-134600)

This reverts commit bbf8048, AKA gh-134515.

We are reverting due to refleaks on free-threaded builds.
@ericsnowcurrently
Copy link
Member

At least for test_sys and test_capi, only scripts built using the BUILD_STRING opcode appear to be affected.

@ZeroIntensity
Copy link
Member

Ugh, why is it that subinterpreter changes seem to cause undebuggable buildbot failures. I can't reproduce a leak when running that test manually with -Xshowrefcount:

import _interpreters
import time

interpid = _interpreters.create()
timeout = time.time() + 30
_interpreters.run_string(interpid, f"""if True:
    import time
    time.time() < {timeout}
    """)
_interpreters.destroy(interpid)

@ericsnowcurrently
Copy link
Member

Are you using a free-threading build?

@ZeroIntensity
Copy link
Member

Yeah, I'm getting [0 refs, 0 blocks] on both FT or GIL-ful builds.

@neonene
Copy link
Contributor

neonene commented May 24, 2025

  • _PyCode_Init() -> intern_one_constant(), which interns the float timeout:

    cpython/Objects/codeobject.c

    Lines 3088 to 3095 in 7b1010a

    #ifdef Py_REF_DEBUG
    Py_ssize_t refcnt = Py_REFCNT(op);
    if (refcnt != 1) {
    // Adjust the reftotal to account for the fact that we only
    // restore a single reference in _PyCode_Fini.
    _Py_AddRefTotal(_PyThreadState_GET(), -(refcnt - 1));
    }
    #endif

  • _PyCode_Fini():

    cpython/Objects/codeobject.c

    Lines 3245 to 3252 in 7b1010a

    _PyCode_Fini(PyInterpreterState *interp)
    {
    #ifdef Py_GIL_DISABLED
    // Free interned constants
    struct _py_code_state *state = &interp->code_state;
    if (state->constants) {
    _Py_hashtable_foreach(state->constants, &clear_containers, NULL);
    _Py_hashtable_destroy(state->constants);

  • _interpreters.run_string() -> _PyXI_Exit() -> _exit_session():

    cpython/Python/crossinterp.c

    Lines 2345 to 2351 in 7b1010a

    if (session->prev_tstate != session->init_tstate) {
    assert(session->own_init_tstate);
    session->own_init_tstate = 0;
    PyThreadState_Clear(tstate);
    PyThreadState_Swap(session->prev_tstate);
    PyThreadState_Delete(tstate);
    }

_PyCode_Fini() seems to follow _PyXI_Exit(), which would affect the stat?
(Unlike main, 09e72cf runs _PyCode_Init() and _PyCode_Fini() on different threads.)

@ericsnowcurrently
Copy link
Member

Thanks for tracking down the leak, @neonene! That's some impressive sleuthing.

ericsnowcurrently added a commit to ericsnowcurrently/cpython that referenced this issue May 26, 2025
ericsnowcurrently pushed a commit that referenced this issue May 26, 2025
…free-threading (gh-134686)

Disable immortalization around Py_CompileString*().

The same approach as 332356b that fixed the refleaks in compile() and eval().

E: 09e72cf can pass test_capi, test_sys and test__interpchannels with this patch for me.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 26, 2025
…under free-threading (pythongh-134686)

Disable immortalization around Py_CompileString*().

The same approach as 332356b that fixed the refleaks in compile() and eval().
(cherry picked from commit c60f39a)

Co-authored-by: neonene <53406459+neonene@users.noreply.github.com>
E: 09e72cf can pass test_capi, test_sys and test__interpchannels with this patch for me.
ericsnowcurrently pushed a commit that referenced this issue May 26, 2025
… under free-threading (gh134738)

Disable immortalization around Py_CompileString*().

The same approach as 332356b that fixed the refleaks in compile() and eval().

E: 09e72cf can pass test_capi, test_sys and test__interpchannels with this patch for me.

(cherry picked from commit c60f39a, AKA gh-134686)

Co-authored-by: neonene <53406459+neonene@users.noreply.github.com>
@github-project-automation github-project-automation bot moved this from Todo to Done in Subinterpreters May 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Tests in the Lib/test dir topic-subinterpreters type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

No branches or pull requests

5 participants