Skip to content

[Bug #19922] Ensure Objspace.dump_all is a consistnt heap snapshot #9353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

KJTsanaktsidis
Copy link
Contributor

This PR contains two things:

  • A new building kind of lock in vm_sync.h, RB_VM_LOCK_ENTER_ISOLATE. This is designed to ensure that the locked code is the only code that can run anywhere in the VM during its scope, even if the code performs something that would normally be a blocking operation and trigger a thread switch. It does this by doing four things:
    • It stops all other ractors from executing (using the existing barrier interrupt mechanism)
    • It stops the current thread from descheduling and switching to another thread (by adding a new dont_switch flag to the thread, and consulting that before marking the thread as performing a "blocking operation")
    • It stops the current thread from switching to a new fiber with the fiber scheduler (by using the thread->blocking counter)
    • It stops signal trap interrupts from executing (by changing the interrupt mask).
  • It makes Objspace.dump_all use this mechanism to ensure that no code can run while we're dumping the heap, resulting in a consistent heap snapshot, even if the heap is being dumped to an IO device.

Three things I'm not sure about

  • Should we also stop GC from happening within RB_VM_LOCK_ENTER_ISOLATE? Presumably you don't want GC to trigger during heap dumping?
  • Should we also stop postponed jobs from running? This will mostly be profiler interrupts e.g. stackprof. If we allow them, then the profiler might allocate memory inside the heap dump; if we don't allow them, then Objspace.dump_all becomes invisible to profilers.
  • Is this extra complexity justified to fix the problem?

https://bugs.ruby-lang.org/issues/19922

@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_19922 branch 2 times, most recently from f78496e to 7701e10 Compare December 25, 2023 10:54
@ko1
Copy link
Contributor

ko1 commented Dec 28, 2023

Is this extra complexity justified to fix the problem?

I believe "NO".

I understand:

It stops all other ractors from executing (using the existing barrier interrupt mechanism)

But I couldn't understand why they are needed:

It stops the current thread from descheduling and switching to another thread (by adding a new dont_switch flag to the thread, and consulting that before marking the thread as performing a "blocking operation")
It stops the current thread from switching to a new fiber with the fiber scheduler (by using the thread->blocking counter)
It stops signal trap interrupts from executing (by changing the interrupt mask).

dump_all is C method and no switching, not?
If IO is used, I believe we should store data without Ruby's IO but use system IO (and sendfile to the Ruby's IO).

@KJTsanaktsidis
Copy link
Contributor Author

If IO is used, I believe we should store data without Ruby's IO but use system IO (and sendfile to the Ruby's IO).

Are there any portability problems with doing this, do you know? Can we extract the FD from the IO object, set it to blocking, and then just call write into it directly?

@KJTsanaktsidis
Copy link
Contributor Author

@ko1 I did some more research into this, and I think it's less complex to add this RB_VM_LOCK_ENTER_ISOLATE abstraction than the alternative.

The problem is, if we're given an IO to dump to, we don't know if that IO's file descriptor is set as blocking or nonblocking at the OS level. That means we have two options:

  1. Check if the FD is blocking, and if so, switch it to blocking and then back to nonblocking when we're done.
  2. Call write(2) on the FD, and if we get an EEAGAIN, call into poll(2) to wait for it to be writable again.

The problem with option 1, is that although we have a rb_io_set_nonblock method in io.h to put a FD into nonblocking mode, Ruby doesn't provide a method to set it back to blocking. There's the io/nonblock extension (which reimplements the fcntl calls itself), but that's now a (bundled? default?) gem. If we wanted to do this properly, we need to add two new APIs to io.h; one to set the FD to blocking mode, and a way to get the current mode (so we know whether to set it back). There will also need to be changes in win32.c to support this.

I guess the signature of what we want to add is...

void rb_io_set_nonblock2(VALUE io, bool nonblocking, bool *old_value)

or something along those lines.

The problem with option 2 is that we would need to replicate the machinery about whether to use select or poll etc which is wrapped up in rb_thread_wait_for_single_fd.

The reason I think my suggested approach in this PR is better is that it doesn't leak platform-specific stuff like FD O_NONBLOCK state, select vs poll, etc into objspace. Instead, we tell Ruby what we want to achieve ("I want this block to be the only possible code that can run in the VM until it finishes"), and then keep using all the ordinary I/O machinery just like we always would.

@ko1
Copy link
Contributor

ko1 commented Jan 1, 2024

@ko1 I did some more research into this, and I think it's less complex to add this RB_VM_LOCK_ENTER_ISOLATE abstraction than the alternative.

Do not introduce general Thread control mechanism for the specific purpose like this. It is too much only for this purpose.

@ko1
Copy link
Contributor

ko1 commented Jan 1, 2024

Can we extract the FD from the IO object, set it to blocking, and then just call write into it directly?

For example, we can dump all output to a temp file with C manner (no-Ruby related). After that we can put Ruby's IO from that file. Of course we can store all dumps to the memory but I'm not sure it is feasible.

@KJTsanaktsidis
Copy link
Contributor Author

we can dump all output to a temp file with C manner (no-Ruby related)

Oh that's not a bad idea actually. Let me try that. Thanks for the suggestion.

@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_19922 branch 4 times, most recently from 578cc36 to 12fdfa9 Compare January 2, 2024 07:38
@KJTsanaktsidis
Copy link
Contributor Author

Well this almost works, except I need to exclude other Ractors still, and RB_VM_LOCK_ENTER() won't work from objspace because it's an extension module and the symbols needed by that macro aren't exported.

@ko1 would you mind if I added an exported version of that macro to vm_sync.c so it can be called from objspace_dump?

(alternatively: I wonder if objspace dump should be in an extension module at all - maybe it should just be linked with directly? it requires intimate knowledge about how objects in cruby work anyway)

@ko1
Copy link
Contributor

ko1 commented Jan 2, 2024

@ko1 would you mind if I added an exported version of that macro to vm_sync.c so it can be called from objspace_dump?

  • In general normal C extensions should not touch this kind of feature so I'm against to expose this feature.
  • When Ractor local GC (heap) will be introduced I'm not sure what should be done for it.
  • Exporting current macro (sync begin/end respectively) is bad idea.

objspace extension is tightly coupled with MRI internal so I suggest:

  • make some sophisticated APIs
  • put it on src/internal/xxx.h (racotr.h?)
  • use it from src/ext/objspace

@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_19922 branch from 12fdfa9 to 247e3e2 Compare January 3, 2024 00:38
This needs to give a consistent snapshot of the heap, but if we're
writing the dump to a file object, the IO could actually cause a thread
switch to occur. Currently, this can cause a crash if dump_all
invocations are run concurrently. However, even if it didn't, the heap
snapshot still wouldn't be consistent.

We achieve this by:

* Exclude other Ractors by using a wrapper around RB_VM_LOCK_ENTER
* Exclude other threads by perfoming all file IO using native C blocking
  IO (to a tempfile), and only copying it back to Ruby once the dump is
  done.

[Bug #19922]
rb_objspace_reachable_objects_from has it too, so I figure it's most
likely required for _from_root as well.
@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_19922 branch from 247e3e2 to c3a76c3 Compare January 3, 2024 02:26
@KJTsanaktsidis
Copy link
Contributor Author

OK, I added

RUBY_SYMBOL_EXPORT_BEGIN
void rb_vm_lock_enter_ext(unsigned int *lev, const char *file, int line);
void rb_vm_lock_leave_ext(unsigned int *lev, const char *file, int line);
RUBY_SYMBOL_EXPORT_END

to vm_sync.h, which is an internal header, so this hasn't added to the actual Ruby API at all.

Is this what you had in mind? Or did you more have in mind something like

rb_ractor_isolate_begin(unsigned int *lev);
rb_ractor_isolate_end(unsigned int *lev);

to be added to ractor.h instead? The end result will be pretty much the same either way.

@KJTsanaktsidis
Copy link
Contributor Author

@ko1 Would appreciate your opinion on ☝️ when you have time 🙏 thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants