Skip to content

Conversation

AJMansfield
Copy link
Contributor

Submitting as a PR for the sake of CI and easy reference to the results --- this SHOULD NOT be merged.
Part of research into some layout-sensitive inconsistent GC finaliser behavior I discovered in #18005.
This PR adds instrumentation to this branch to detect and report any other cases in the test suite that trigger the particular circumstances that trigger this bug.

@AJMansfield
Copy link
Contributor Author

As can be seen from this branch's test failures, this instrumentation gets triggered only in the VFS tests --- I suspect the commonality here is that this is the only other use of mp_obj_malloc_var_with_finaliser in the MicroPython codebase, and that this actually represents a bug in which the VFS finalisers are not being called.

@AJMansfield
Copy link
Contributor Author

The fact that this branch isn't triggered in any portion of the codebase except VFS, also supports the removal of this condition --- it's a pure optimisation that attempts to avoid a potentially-expensive object attribute lookup, and it's never one that actually fails in any cases other than the ones it's causing bugs in.

AJMansfield added a commit to AJMansfield/micropython that referenced this pull request Aug 30, 2025
This check is a remnant of the original finaliser implementation from
4f7e9f5, all the way from pre-1.0.
This was a valid optimisation back then, but after its ordering with
other finaliser code changed in 8a2ff2c
the invariant it exploits is no longer valid, and it breaks down
completely in the case of variable-length allocations made with
`mp_obj_malloc_var_with_finaliser`.

This resulted in a latent bug in the VFS system, where depending on the
precise memory layout, VFS finaliser routines could sometimes be
inadvertently missed. This bug was discovered when encountering an
identical issue with an attempt to implement __del__ for user classes.

As can be seen from the test runs with the instrumentation in micropython#18013,
this branch is also never actually hit by normal code outside of VFS.
Therefore, whatever the original intent and whether or not it was ever,
it's not actually an optimisation now; removing this condition is a
strict improvement to garbage collector code size and performance.

Signed-off-by: Anson Mansfield <amansfield@mantaro.com>
AJMansfield added a commit to AJMansfield/micropython that referenced this pull request Aug 30, 2025
This check is a remnant of the original finaliser implementation from
4f7e9f5, all the way from pre-1.0.
This was a valid optimisation back then, but after its ordering with
other finaliser code changed in 8a2ff2c
the invariant it exploits is no longer valid, and it breaks down
completely in the case of variable-length allocations made with
`mp_obj_malloc_var_with_finaliser`.

This resulted in a latent bug in the VFS system, where depending on the
precise memory layout, VFS finaliser routines could sometimes be
inadvertently missed. This bug was discovered when encountering an
identical issue with an attempt to implement `__del__` for user classes
in micropython#18005.

As can be seen from the test runs with the instrumentation in micropython#18013,
this branch is also never actually hit by normal code outside of VFS.
Therefore, whatever the original intent and whether or not it was ever,
it's not actually an optimisation now; removing this condition is a
strict improvement to garbage collector code size and performance.

Signed-off-by: Anson Mansfield <amansfield@mantaro.com>
Signed-off-by: Anson Mansfield <amansfield@mantaro.com>
Copy link

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:  +177 +0.094% [incl +8(data)]
   unix x64:  +112 +0.013% standard[incl +32(bss)]
      stm32:   +68 +0.017% PYBV10[incl +4(bss)]
     mimxrt:  +112 +0.030% TEENSY40
        rp2:   +52 +0.006% RPI_PICO_W[incl +4(bss)]
       samd:   +96 +0.035% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   +74 +0.016% VIRT_RV32[incl +4(bss)]

Signed-off-by: Anson Mansfield <amansfield@mantaro.com>
Signed-off-by: Anson Mansfield <amansfield@mantaro.com>
@AJMansfield AJMansfield changed the title py/gc: Instrument finaliser code to detect instances of finaliser AT_MARK bug. py/gc: Instrument finaliser code for instances of finaliser encountering AT_MARK vs collecting. Aug 31, 2025
@AJMansfield AJMansfield changed the title py/gc: Instrument finaliser code for instances of finaliser encountering AT_MARK vs collecting. py/gc: Instrument finaliser code for instances of finaliser encountering AT_MARK. Aug 31, 2025
@dpgeorge
Copy link
Member

dpgeorge commented Sep 1, 2025

Doing a quick test, FTB_SET() is only called once and when it's called it points to a block that is a HEAD block. This means something is going wrong where FTB being set does not correspond to a HEAD block.

I suggest making a function gc_consistency() which checks that for each FTB is set, it corresponds to a HEAD block. Then run this function in many places (eg in the VM at opcode dispatch) and see when it starts to fail.

@AJMansfield
Copy link
Contributor Author

Doing a quick test, FTB_SET() is only called once and when it's called it points to a block that is a HEAD block. This means something is going wrong where FTB being set does not correspond to a HEAD block.

The garbage collector is changing it to AT_MARK in gc_mark_subtree, the same way it would if the object were still properly reachable. The real question at this point, is why the garbage collector still thinks it's reachable.

I suggest making a function gc_consistency() which checks that for each FTB is set, it corresponds to a HEAD block. Then run this function in many places (eg in the VM at opcode dispatch) and see when it starts to fail.

I've got similar thoughts --- what I really need is better tooling here. I'm currently working on a GDB plugin that should hopefully make it a bit easier to see what's going on by generating graphviz data-structure graphs of MicroPython's heap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants