Fix busy-loop when waiting for file descriptors to close #7865
Conversation
On Win32, `rb_nativethread_cond_t` is currently an incomplete type because it's a typedef for `struct rb_thread_cond_struct`. That means you can't actually allocate a `rb_nativethread_cond_t` unless you also include THREAD_IMPL_H, since the struct is defined in thread_win32.h (alternatively, including vm_core.h also works). Move the definition of `rb_thread_cond_struct` into thread_native.h to alleviate this.
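A minimal sketch of the problem, assuming the field layout from thread_win32.h (treat the details as illustrative rather than a quote of the patch):

```c
/* Illustrative sketch. With only the forward declaration visible,
 * the type is incomplete and cannot be allocated by value. */
typedef struct rb_thread_cond_struct rb_nativethread_cond_t;

rb_nativethread_cond_t cond;      /* compile error: incomplete type */
rb_nativethread_cond_t *cond_p;   /* fine: pointers to incomplete types work */

/* Moving the full definition into thread_native.h makes the type
 * complete for every includer (layout assumed from thread_win32.h). */
struct rb_thread_cond_struct {
    struct cond_event_entry *next;
    struct cond_event_entry *prev;
};
```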
When one thread is closing a file descriptor whilst another thread is concurrently reading it, we need to wait for the reading thread to be done with it to prevent a potential EBADF (or, worse, file descriptor reuse).

At the moment, that is done by keeping a list of threads still using the file descriptor in io_close_fptr. It then continually calls rb_thread_schedule() in fptr_finalize_flush until said list is empty. That busy-looping seems to behave rather poorly on some OSes, particularly FreeBSD. It can cause the TestIO#test_race_gets_and_close test to fail (even with its very long 200 second timeout) because the closing thread starves out the using thread.

To fix that, I introduce the concept of `struct rb_io_close_wait_list`: a list of threads still using a file descriptor that we want to close. We call `rb_notify_fd_close` to let the thread scheduler know we're closing a FD, which fills the list with threads. Then we call `rb_notify_fd_close_wait`, which will block the thread until all of the still-using threads are done. This is implemented with a condition variable sleep, so no busy-looping is required.
This test should be fixed and fast now because the closing thread sleeps appropriately waiting for the file descriptor to be unused.
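A hedged sketch of how the two calls fit together on the closing side; the struct fields and exact signatures below are assumptions for illustration, not copied from the patch:

```c
/* Hypothetical sketch of the wait-for-close flow described above. */
struct rb_io_close_wait_list {
    struct ccan_list_head list;   /* threads still blocked on the fd */
    rb_nativethread_lock_t mu;    /* protects the list */
    rb_nativethread_cond_t cv;    /* signalled as each thread finishes */
};

static void
close_wait_for_fd_users(int fd)
{
    struct rb_io_close_wait_list busy;

    /* Record every thread currently blocked on fd and interrupt them. */
    rb_notify_fd_close(fd, &busy);

    /* Sleep on the condvar until the last user signals it is done,
     * instead of spinning on rb_thread_schedule(). */
    rb_notify_fd_close_wait(&busy);
}
```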
I would have liked to discuss this a little more, as I have a counter-proposal to remove a large amount of this logic: #5384
I hadn't seen https://bugs.ruby-lang.org/issues/18455 before. I think it makes sense that the waiting-on-close should be done per-IO, rather than per-FD. However, we'll still need this fix or something like it, right, to make the closing thread actually sleep while waiting for the other threads to finish being interrupted? I guess the list of "threads currently blocked on this IO instance" could then be stored on the IO itself.

And it looks like "store a list of threads currently blocked on this IO instance" is exactly what you did - e.g. https://github.com/ruby/ruby/pull/5384/files#diff-161b2a279f4c67a1ab075a7890ecf6f3f1d483d910910fa52b3715e25cfdcbd7R2530

However - you removed the part that blocks the closing thread from actually closing until the other threads have fully been woken up: https://github.com/ruby/ruby/pull/5384/files#diff-92194f057884b3287a3a6bf84e6e3b2bf433a556b68562799252a091744e7854L5233 (unless I've drastically misunderstood things), so it seems possible you might get spurious EBADFs?

tl;dr - I think we should do both this PR and your PR?

EDIT: I see you were planning to handle the actual close asynchronously afterwards, it seems - #5384 (comment). Anyway - I'm more than happy for this fix to get blown away by your proposal to do this per-IO.
I like that you did work on this, so thanks, and I appreciate your detailed reply. It's a complex problem. Fundamentally, I think the goal of coordinating close with read operations is to avoid using EBADF as a signal that the IO has closed. However, it's been a while since I looked at this. I'll try to revisit this today and rebase my PR.
This test seems to be explicitly checking that property (see line 3828 at commit cc698c6).
So yeah, if we can keep that property whilst avoiding the global list of IO ops (and it sounds like we can), I'm 💯 for that. Thanks for your feedback, and let me know if I can help out with what you're doing in any way!
One initial issue that came out of the PR was the desire/need to hide at least some (but ideally all) of the IO internals. I don't know the best way to do this that is both safe (compatible where possible) and good (achieving our goal of removing it completely). If you have any thoughts on this... it could be good to explore that space. The reason it was important is that I wanted to introduce more complex data structures there.
Since this change, Solaris 10 CI has been failing. @KJTsanaktsidis @ioquatix @nobu Can you please take a look at it? http://rubyci.s3.amazonaws.com/solaris10-gcc/ruby-master/log/20230526T070003Z.fail.html.gz
@mame are you able to point me to some details on how these solaris10 tests are configured? The chkbuild link (http://rubyci.s3.amazonaws.com/solaris10-gcc/) just gives me an S3 error page... I rented a SPARC Solaris 10 VM and got Ruby compiled with GCC 5.5, and the test in question passes for me. Do I need a machine with more or fewer CPUs, perhaps?
Ah, I spoke too soon - I have it failing here when all of the Ruby spec suite is run together.
OK - I have a rough idea of what's going on here. When the deadlock happens, there are two threads. One thread is calling IO#close and is blocked in the condvar wait, waiting to be notified that the reading thread is done.
The second thread is using the IO in question, blocked trying to read from it.
The issue is that when the closing thread yields the GVL to begin waiting on the condvar, that begins a round of "spam all the threads that have pending interrupts with SIGVTALRM". I demonstrated that this is the issue by applying a patch which forces a timer thread to be created during the close wait; with a timer thread alive, the interrupts keep being re-sent and the deadlock disappears. I think the actual solution is one of two things.
I can work on trying one or both of these tomorrow or the day after, I think.
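For illustration only, the problematic shape of the wait was roughly the following; the names are simplified, not the literal patch:

```c
/* Simplified sketch: the closing thread parks in a bare native condvar
 * wait.  While parked here it is invisible to the VM's interrupt
 * machinery, so if no other thread can act as the timer/sigwait thread,
 * nobody re-sends the interrupt to a reader that hasn't reached
 * select(2) yet -- and this wait never ends. */
rb_native_mutex_lock(&busy->mu);
while (!ccan_list_empty(&busy->list)) {
    rb_native_cond_wait(&busy->cv, &busy->mu);   /* the deadlock window */
}
rb_native_mutex_unlock(&busy->mu);
```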
Because a thread calling IO#close now blocks in a native condvar wait, it's possible for there to be _no_ threads left to actually handle incoming signals/ubf calls/etc. This manifested as failing tests on Solaris 10 (SPARC), because:

* One thread called IO#close, which sent a SIGVTALRM to the other thread to interrupt it, and then waited on the condvar to be notified that the reading thread was done.
* One thread was calling IO#read, but it hadn't yet reached the actual call to select(2) when the SIGVTALRM arrived, so it never unblocked itself.

This results in a deadlock. The fix is to use a real Ruby mutex for the close lock; that way, the closing thread goes into sigwait-sleep and can keep trying to interrupt the select(2) thread. See the discussion in: #7865
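A hedged sketch of the fix's shape; the field names and the use of rb_mutex_sleep are assumptions drawn from the commit message, not quoted from the actual change:

```c
/* Hypothetical sketch: waiting on a Ruby-level mutex instead of a bare
 * native condvar.  rb_mutex_sleep() parks the thread through the VM's
 * own blocking machinery, so the closing thread can still drop into
 * sigwait-sleep and keep delivering interrupts to the reader. */
struct rb_io_close_wait_list {
    struct ccan_list_head pending_fd_users;  /* threads still on the fd */
    VALUE wakeup_mutex;                      /* a real Ruby Mutex object */
};

static void
wait_for_fd_users(struct rb_io_close_wait_list *busy)
{
    rb_mutex_lock(busy->wakeup_mutex);
    while (!ccan_list_empty(&busy->pending_fd_users)) {
        /* Qnil timeout == sleep until signalled by the last fd user. */
        rb_mutex_sleep(busy->wakeup_mutex, Qnil);
    }
    rb_mutex_unlock(busy->wakeup_mutex);
}
```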
@KJTsanaktsidis I have confirmed that Solaris 10 CI is now green. It's amazing that you were able to solve the problem just by reading the failure symptoms and the code!
It wasn’t quite so magical 😂 - I actually found a company selling SPARC hosting and spun up a Solaris 10 server. The deadlock happened pretty reliably when running the tests, so I could just attach gdb to it.
Aha. Thanks for spending your time and resources! It looks like #7884 fixed some other people's deadlocks as well. Thanks again!
After applying this fix, the timeout on `TestIO#test_race_gets_and_close` can be reduced, and it passes reliably on FreeBSD.