[Do not merge] Experiment: Pluggable GC to detect missing write barriers #13557

Draft: wants to merge 43 commits into master

jhawthorn
Member

@jhawthorn jhawthorn commented Jun 8, 2025

Important

I will not be merging this, at least not any time soon. Not looking for code review, just sharing an experiment early in its development.

I've been working on a tool to help detect missed write barriers (WBs) more reliably. This is one of the most common kinds of bug, yet a WB bug only surfaces under very specific conditions, so it's hard to tell when one exists or whether one has actually been fixed.

The basic algorithm is:

  • When each new object is allocated, also allocate a structure to track its references
  • At the soonest time after GC work is allowed to happen (next allocation?), run the object's mark function and record a snapshot of its children. This is the object's initial state: a.references = reachable_objects_from(a)
  • Whenever a write barrier occurs, add the new reference to the list. For WB a -> b: a.references << b
  • At some later time, check that (reachable_objects_from(a) - a.references).empty? holds for every object a; otherwise we've missed a write barrier
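
The steps above can be sketched in plain Ruby. This is only a toy model, not the code in this PR: ToyObject, ToyWBCheck, track, write_barrier and verify are hypothetical names, and the fields array stands in for whatever an object's mark function would visit.

```ruby
require "set"

# Toy stand-in for a heap object. `fields` plays the role of the slots
# that the object's mark function would visit.
class ToyObject
  attr_reader :name, :fields

  def initialize(name)
    @name = name
    @fields = []
  end
end

# Hypothetical checker that follows the four steps listed above.
class ToyWBCheck
  def initialize
    # parent => Set of children we know about. Both are keyed by identity,
    # the way a real checker would key by object address.
    @references = {}.compare_by_identity
  end

  # Stand-in for running the object's mark function.
  def reachable_objects_from(obj)
    obj.fields
  end

  # Steps 1 and 2: allocate the tracking structure and snapshot the children.
  def track(obj)
    @references[obj] = Set.new.compare_by_identity.merge(reachable_objects_from(obj))
  end

  # Step 3: a write barrier parent -> child records the new reference.
  def write_barrier(parent, child)
    @references[parent] << child
  end

  # Step 4: anything reachable now but never recorded is a missed barrier.
  # Returns { parent_name => [missing_child_names] }.
  def verify
    @references.each_with_object({}) do |(parent, known), missed|
      missing = reachable_objects_from(parent).reject { |child| known.include?(child) }
      missed[parent.name] = missing.map(&:name) unless missing.empty?
    end
  end
end

checker = ToyWBCheck.new
a = ToyObject.new("a")
b = ToyObject.new("b")
c = ToyObject.new("c")
[a, b, c].each { |obj| checker.track(obj) }

a.fields << b                 # correct write: store the reference...
checker.write_barrier(a, b)   # ...and fire the barrier

a.fields << c                 # buggy write: store without a barrier

checker.verify.each do |parent, children|
  puts "missed WB: #{parent} -> #{children.join(', ')}"
end
# prints "missed WB: a -> c"
```

Because the check is a set difference per parent, it does not depend on object age, which is why it can be stricter than a generational GC needs.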

I've implemented this by writing a new pluggable GC just for testing and debugging, which applies these checks to every object. These rules might be stricter than the default GC requires (for example, they cover WBs that could only happen from young to old objects); however, I think it will reproduce issues more reliably. For example, this has found a few cases where initialize/initialize_copy was missing write barriers. That is unlikely to cause an issue in practice, but one could Object.allocate an object, let it get old, and then call obj.send(:initialize, young_reference) to cause a crash. I think we should follow the stricter rules to ensure we don't miss any write barriers and to avoid assumptions about the GC implementation.
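
As a rough illustration of that initialize scenario, the call pattern looks like the sketch below. Holder is a hypothetical pure-Ruby class used only to show the shape of the calls. Instance variable writes from Ruby code already go through a write barrier, so this snippet is safe. The cases found were in initialize methods that were missing a barrier, which are not named here. The number of GC runs needed to promote an object is also an assumption.

```ruby
# Hypothetical class, for illustration only.
class Holder
  def initialize(ref)
    @ref = ref
  end
end

obj = Holder.allocate          # allocated, but initialize has not run yet
4.times { GC.start }           # assumption: a few full GCs are enough for obj to become old
young = String.new("young")    # freshly allocated, so still young
obj.send(:initialize, young)   # an old -> young write that happens inside initialize
```

If the write inside initialize skipped the barrier, the default GC could miss the young object during a minor GC and free it while obj still points to it.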

Example usage (finding a real bug in Set! See #13558):

$ RUBY_GC_LIBRARY=wbcheck ./miniruby -e 'Set["a","b","c"].collect!(&:upcase)'
WBCHECK ERROR: Missed write barrier detected!
  Parent object: 0x50700001cdb0 (wb_protected: true)
    rb_obj_info_dump: (Set)set
  Reference counts - stored: 4, current: 4, missed: 3
  Missing reference to: 0x50400007f2d0
    rb_obj_info_dump: (String) len: 1, capa: 15 "A"
  Missing reference to: 0x50400007f350
    rb_obj_info_dump: (String) len: 1, capa: 15 "B"
  Missing reference to: 0x50400005f990
    rb_obj_info_dump: (String) len: 1, capa: 15 "C"

WBCHECK SUMMARY: Found 1 objects with missed write barriers (3 total violations)

Limitations:

  • This never actually collects garbage: all objects live forever and are only verified at the very end of the program
  • We only verify objects once, so this can't detect write barriers that happen at the wrong time (too late/early)
  • This only detects missing write barriers where the object reference is marked
  • It's very slow (this is intentional)
  • There are some bugs. I think I'm not calling finalizers correctly, or I'm failing to call dfree functions on T_DATA, so there are some hacks around that
  • make btest has about 4 bugs. make test-all shows a lot of problems; some are likely false positives, but I've found a few seemingly legitimate issues.
  • I used Cursor and claude-4-sonnet, so some of the comments are AI slop.
