Bug report
Bug description:
import traceback
import gc


class Obj:
    def __init__(self, name: str):
        self.name = name

    def __repr__(self):
        return f"Obj({self.name!r})"

    def __del__(self):
        print("del", self)


def deep(i: int):
    a = Obj(f"a, i={i}")
    if i == 2:
        raise Exception(f"exception at i={i}")
    print(a)


def func():
    for i in range(5):
        gc.collect()
        print("** i:", i)
        try:
            deep(i)
        except Exception as exc:
            print("caught", exc)
            print_tb(exc.__traceback__)
            # traceback.clear_frames(exc.__traceback__)
            clear_tb(exc.__traceback__)
            continue  # continue with next i
        print("deep", i, "done")


def print_tb(tb):
    print("Call stack:")
    while tb:
        frame_i = tb.tb_frame.f_locals.get("i")
        print(f" {tb.tb_frame.f_code.co_name}: i={frame_i}")
        tb = tb.tb_next


def clear_tb(tb):
    print("Clearing stack:")
    while tb:
        print(tb.tb_frame)
        try:
            tb.tb_frame.clear()
        except RuntimeError:
            print(" cannot clear?")
        else:
            print(" cleared")
            # Using this code triggers that the ref actually goes out of scope, otherwise it does not!
            # print(" now:", tb.tb_frame.f_locals)
        tb = tb.tb_next


if __name__ == '__main__':
    func()
    print("exit")
Running this code gives the following output:
** i: 0
Obj('a, i=0')
del Obj('a, i=0')
deep 0 done
** i: 1
Obj('a, i=1')
del Obj('a, i=1')
deep 1 done
** i: 2
caught exception at i=2
Call stack:
 func: i=2
 deep: i=2
Clearing stack:
<frame at 0x7f9ee1cc72a0, file '/u/zeyer/code/playground/py-oom-out-of-scope.py', line 34, code func>
 cannot clear?
<frame at 0x7f9ee1c168c0, file '/u/zeyer/code/playground/py-oom-out-of-scope.py', line 20, code deep>
 cleared
** i: 3
Obj('a, i=3')
del Obj('a, i=3')
deep 3 done
** i: 4
Obj('a, i=4')
del Obj('a, i=4')
deep 4 done
exit
del Obj('a, i=2')
As you can see, Obj('a, i=2') is only deleted at exit.
This only happens when print_tb is called first, since it accesses f_locals of each frame.
traceback.clear_frames should have cleared the locals, but as the output shows, it does not.
clear_tb is basically a copy of traceback.clear_frames.
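The dependence on the earlier f_locals access can be checked without relying on __del__ output. The following is a minimal sketch, not part of the original reproducer: Payload, fail and payload_survives_clear are hypothetical names, and a weak reference stands in for the __del__ print. The result for the f_locals case depends on the CPython version, so no output is claimed for it here.

```python
import gc
import traceback
import weakref


class Payload:
    """Hypothetical stand-in for a large local object (e.g. a tensor)."""


def fail(refs):
    payload = Payload()
    refs.append(weakref.ref(payload))  # lets the caller observe the lifetime
    raise RuntimeError("boom")


def payload_survives_clear(touch_f_locals: bool) -> bool:
    """Return True if the local of the failed frame is still alive
    after traceback.clear_frames() was called."""
    refs = []
    try:
        fail(refs)
    except RuntimeError as exc:
        if touch_f_locals:
            tb = exc.__traceback__
            while tb:
                tb.tb_frame.f_locals  # same kind of access as print_tb() above
                tb = tb.tb_next
        traceback.clear_frames(exc.__traceback__)
        gc.collect()
        return refs[0]() is not None
    return False


if __name__ == "__main__":
    print("without f_locals access:", payload_survives_clear(False))
    print("with f_locals access:   ", payload_survives_clear(True))
```

Without the f_locals access, the payload is released by clear_frames. With it, the report above implies that the payload stays alive on the affected versions.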
The problem goes away if you access tb.tb_frame.f_locals after the frame was cleared (i.e. after tb.tb_frame.clear() was called).
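Based on that observation, a user-level workaround could look like the sketch below. It is only an illustration under the assumption that re-reading f_locals re-syncs the cached dict, as described in this report; clear_frames_and_refresh_locals is a hypothetical helper name, not a stdlib function.

```python
import weakref


def clear_frames_and_refresh_locals(tb) -> None:
    """Like traceback.clear_frames(), but re-read f_locals after clearing,
    so that a previously created locals dict drops its stale references."""
    while tb is not None:
        frame = tb.tb_frame
        try:
            frame.clear()
        except RuntimeError:
            pass  # the frame is still executing and cannot be cleared
        else:
            frame.f_locals  # re-sync the cached dict with the cleared locals
        tb = tb.tb_next


class Payload:
    """Hypothetical stand-in for a large local object."""


def fail(refs):
    payload = Payload()
    refs.append(weakref.ref(payload))
    raise RuntimeError("boom")


def demo() -> bool:
    """Return True if the payload was released inside the except block."""
    refs = []
    try:
        fail(refs)
    except RuntimeError as exc:
        tb = exc.__traceback__
        while tb is not None:
            tb.tb_frame.f_locals  # simulate extended exception reporting
            tb = tb.tb_next
        clear_frames_and_refresh_locals(exc.__traceback__)
        return refs[0]() is None
    return False


if __name__ == "__main__":
    print("payload released:", demo())
```

This only works around the symptom from user code. It does not replace the fix in frame_tp_clear suggested in this report.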
Looking at the C code, this is what tb_frame.clear() will do:
https://github.com/python/cpython/blob/3.12/Objects/frameobject.c#L933-L946
static int
frame_tp_clear(PyFrameObject *f)
{
    Py_CLEAR(f->f_trace);

    /* locals and stack */
    PyObject **locals = _PyFrame_GetLocalsArray(f->f_frame);
    assert(f->f_frame->stacktop >= 0);
    for (int i = 0; i < f->f_frame->stacktop; i++) {
        Py_CLEAR(locals[i]);
    }
    f->f_frame->stacktop = 0;
    return 0;
}
However, if you accessed tb_frame.f_locals before, it will have created a dictionary in frame->f_locals here: https://github.com/python/cpython/blob/5c238225f60c33cf1931b1a8c9a3310192c716ae/Objects/frameobject.c#L1218C18-L1218C33
That frame->f_locals dict will also hold references to all the local vars, and that f_locals dict is not cleared in tb_frame.clear().
However, when you then access tb_frame.f_locals again, it will update the existing frame->f_locals dict and delete all the local vars in it, because they are not available anymore. Here:
https://github.com/python/cpython/blob/3.12/Objects/frameobject.c#L1256C13-L1256C55
I think it's a bug (or at least very unexpected) that tb_frame.clear() does not clear frame->f_locals.
So my suggestion would be to add Py_CLEAR(f->f_frame->f_locals) in frame_tp_clear.
There is another related issue: when the except block is left, the exception goes out of scope, so all the locals should then be freed (even when frame.clear() was not called). However, this is not the case either.
After inspecting this further: once frame.f_locals has been accessed on the current frame where the exception is handled, that frame.f_locals dict still holds a reference to the exception, and thus to the frames, even though the DELETE_FAST for the exception removed it from the fast locals. See the comments below for more on this.
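The handler-frame case can be sketched in the same way. This is an illustration under the assumptions of this report, with hypothetical names (Payload, fail, handle). sys._getframe().f_locals is used to take the snapshot of the handling frame, and the value returned for refresh=False depends on the CPython version.

```python
import sys
import weakref


class Payload:
    """Hypothetical stand-in for a large local object."""


def fail(refs):
    payload = Payload()
    refs.append(weakref.ref(payload))
    raise RuntimeError("boom")


def handle(refresh: bool) -> bool:
    """Return True if the payload is still alive after the except block."""
    refs = []
    try:
        fail(refs)
    except RuntimeError as exc:
        # Snapshot of the handler frame's locals, which includes `exc`.
        sys._getframe().f_locals
    # At this point `exc` has been deleted from the fast locals.
    if refresh:
        # Re-reading f_locals drops the stale `exc` entry from the snapshot.
        sys._getframe().f_locals
    return refs[0]() is not None


if __name__ == "__main__":
    print("alive without refresh:", handle(refresh=False))
    print("alive with refresh:   ", handle(refresh=True))
```

With the refresh, nothing references the exception anymore, so the traceback, the failed frame and its locals are released by reference counting alone.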
Note that for PyTorch and similar libraries, if you first do extended exception reporting that accesses f_locals in any way, fixing this would resolve two problems that then arise. Related:
- Inconsistent recovery from CUDA OOMs pytorch/pytorch#18853
- Free Memory after CUDA out of memory error pytorch/pytorch#27600
E.g., this came up for us because we have this extended exception reporting, which accesses f_locals:
# Extend exception message by module call stack.
module_names_by_id = {}  # id -> name
for name, mod in model.named_modules():
    if id(mod) not in module_names_by_id:
        module_names_by_id[id(mod)] = name or "(root)"
exc_ext = []
for frame in iter_traceback(exc.__traceback__):
    if frame.f_code.co_nlocals == 0:
        continue
    frame_self = frame.f_locals.get("self")
    if isinstance(frame_self, (torch.nn.Module, rf.Module)):
        func = get_func_from_code_object(frame.f_code, frame=frame)
        if func and func.__name__ and func.__name__.startswith("_") and not func.__name__.startswith("__"):
            continue
        func_name = (func and func.__qualname__) or type(frame_self).__name__
        exc_ext.append(f"({func_name}) {module_names_by_id.get(id(frame_self), '(unknown)')}")
if not exc_ext:
    exc_ext.append("(No module call frames.)")
if len(exc.args) == 1 and isinstance(exc.args[0], str) and not always_direct_print:
    exc.args = ("\n".join([exc.args[0], "", "Module call stack:"] + exc_ext),)
else:
    print("Module call stack:", file=log.v3)
    for msg in exc_ext:
        print(msg, file=log.v3)
The normal traceback.clear_frames does not help here.
CPython versions tested on:
3.11, 3.12, 3.13
Operating systems tested on:
Linux