-
Notifications
You must be signed in to change notification settings - Fork 5.4k
YJIT: GC and recompile all code pages #6406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I think in general I would avoid having two different implementation paths when we don't have data to back up the idea that having these two separate behaviors yields perf improvements. It's best to minimize complexity and avoid unforeseen strange bugs.
I agree with that 👍 It might also mean that there's an advantage in having smaller code pages wrt memory usage. We will need to benchmark it.
This is unfortunately not quite true. We set the alignment to 64 bytes in I think your idea of "painting" pages incrementally when generating code and when freeing ISEQs is fairly clever. What I was originally thinking is that we would have a bitmap that is essentially one bit per page. We would zero out all the bits and then iterate through all the live ISEQs to mark the live ISEQs. We have access to all of the YJIT blocks for all the ISEQs so we can map from ISEQ->blocks->address ranges->pages. Maybe we could start with that? Unless you can think of an incremental marking scheme that can work with arbitrary address ranges for ISEQs.
We may want to think about that fairly early? At least convince ourselves that it will work with our design. We'll need to split each code page into an inline and outlined half so that the outlined code is always close to the inlined code. I think that in terms of marking, we basically don't care about outlined code. We can assume that the code in the outlined half belongs to the same ISEQs that are in the inlined half. Therefore, if it's safe to discard the inlined code, it should be safe to discard the outlined code. |
Makes sense. I prefer just using
Got it. It was a too optimistic assumption 🙂 If we currently pack different ISEQs in such calls, adding alignment there may have an unwanted impact. Let's avoid designs that need this assumption.
That sounds like the simplest solution we could have now. I think we need to newly introduce a list of ISEQs that have been JITed in the past, but given that we're targeting only a few thousands of ISEQs, it's probably fine in terms of both time and space complexity.
I do think of an incremental marking scheme that works without the alignment assumption though. If we have For this to work, whenever we write new code for an ISEQ, we should be able to know the ISEQ's current set of address ranges and its new one in order to increment only elements for newly modified pages. Being able to know all the code addresses of an ISEQ is needed for a non-incremental design as well, so virtually we have no blocker for taking this design. Anyway, I'm planning on implementing your non-incremental design first because it's simpler and I think it's better to have a working first version as early as possible and incrementally improve it.
It's kind of related to the "only free code pages when an entire page is marked unused" idea. No matter what layout we choose for each page or over multiple pages, we'll need to be able to identify whether a particular page(s) is in use or not. We can make the page usage tracking either fully isolated or tightly integrated with something to manage a code page, but even for the latter case, the core logic of the memory page management would stay the same when the layout is changed. I'm not saying that we should do that early though. I'm rather suggesting that we could work on them at the same time especially when we have multiple people working on this project. I just prioritized solving the primary goal anyway because I have the above belief. |
IIRC SFR is on the order of 10K to 20K ISEQs. Does CRuby have a straightforward way of iterating through all the ISEQs? If so, we could use that. If not, then I guess we have to add a list/set of JITted ISEQs as you suggest. For a given ISEQ, we can check if there exist block versions in YJIT: https://github.com/ruby/ruby/blob/master/yjit/src/core.rs#L686 And this function gets called when marking an ISEQ:
IIUC what you are suggesting here is using one integer per page representing the number of ISEQs that overlap on that page? If so then I guess this would be reference counting. Presumably that would work. We might still need to reconstruct the refcount from time to time if we ever decide to throw away all the code because we've reached the limit though.
Alrighty 👍 :)
One challenging point is that we don't really track address ranges for outlined code blocks right now. If we split each page between inline and outlined halves, we can guarantee that all of the outlined code for an ISEQ will live in the same page as its inline code. Without that design, then we need to track the outlined address ranges. Maybe not the hardest thing, but something to think about. We do also want to solve the problem of making inline and outlined code closer together (within the range of conditional branches on arm).
I apologize because I do realize that it's hard to coordinate on this project async and there are a lot of different elements to keep track of. |
📝
I currently don't know of such a thing. Iterating through all objects might work if every ISeq has a Ruby object, but I'm not sure if it's efficient enough. At least MJIT has been doing that by itself for years and nobody has pointed out an alternative.
Exactly.
For that to be useful, we'd need to assume only memory pages used by inline code are used by outlined code, i.e. outlined code shouldn't grow faster than inline code, or otherwise we'll need extra tracking. It would probably hold if you make each page a half-and-half layout, but once you start thinking about optimizing it like 80% inline and 20% outlined, again, you'll need something. However, it makes sense to keep thinking about the relationship between those projects to potentially simplify the whole design.
Thank you for bearing with me being in a different timezone. Let me know if there's anything I can help with to improve our async communication. |
|
Ah right, I've completely forgotten that we need to do that thing for making every insn a |
1997370
to
08c83fb
Compare
I pushed my latest code at 08c83fb. Before marking this ready, it worked for a simple synthetic workload locally. I plan to work on including YJIT entry code in inline code addresses and using While working on this, I realized it's probably useful to change the code layout first because:
So, yeah, I'll work on that first 😅 |
Well, I was thinking, we split the pages half and half. There is a current page that we're generating code into, and when we fill either the inline or outlined half, we switch to the next page, to maintain the invariant that we're always putting the outlined code corresponding with some inline code in the same page. It has the potential to be somewhat wasteful but it should work?
Nice. I will read through the code that you wrote so far. I'm going to be conservative in merging code GC because I feel it's very important that we get this piece right. We should test this on railsbench and other larger benchmarks in
In my mind it's also, if we're going to modify the design of our code block system, we might as well tackle this problem. Ever since we started talking about arm support months ago, I've been wanting us to improve the way we do code layout. |
I wrote a few comments but I want to say I think you did a good job and your current implementation makes a lot of sense. I think the main missing piece is the inline/outlined page split, and it would be interesting to have some numbers in terms of how much memory can be freed on railsbench, erubi_rails after initialization but before the first benchmark iteration (and what percentage of the total allocated that is). |
What I meant is that, if we currently don't know any addresses of outlined code, when we fill the outlined half and only outlined code of an ISEQ goes to the next page, there will be no way to detect the ISEQ is using the next page. So, for that case, I thought tracking the outlined address ranges (or outlined code pages) is needed while you seem to assume it's not needed with your half-and-half idea. Managing the relationship between an ISEQ and code pages hopefully doesn't require a lot of space, so it's probably not a big deal though.
Agreed 👍 |
Yeah I think we either need to go with half and half, or we need to track where the outlined code for an iseq lives. Either way though, we want the outlined code to be close to the inline code it's associated with to avoid jumping too far. Assuming we don't go with the half and half strategy: If we assume that we have a current inline page that we generate inline code into and a current outlined page that we generate outlined code into... When we fill the current inline or outlined page, we need to move to the next page. We need some kind of mechanism to allocate the next page. In order to keep the in/out code close to each other, we would ideally want the next pages to be allocated to be close to each other. The problem here is IMO that we're going to run into fragmentation issues once we start to GC code. I guess this may or may not be a problem in practice. If we keep a list of free pages that is sorted by increasing address, it might work out OK because ARM CPUs can jump +/-1 1MiB. That's actually pretty far, so as long as the next inline/outlined pages are within 1MiB (256 x 4KiB pages), we can still do a one-instruction jump.
It's a bit annoying to do but it is doable. Would have to keep track of all the CodePtr targets of instructions that get passed to the assembler I think. That or record every call to |
I've been experimenting with a few designs for the inline/outlined page split. While still WIP, I filed #6460 for sharing the current design. |
620714f
to
a92e6e7
Compare
2629505
to
2ccc754
Compare
06d894f
to
4486eeb
Compare
fe2e930
to
976314f
Compare
when it fails to allocate a new page. Co-authored-by: Alan Wu <alansi.xingwu@shopify.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work! 👏
so long and thanks for all the fish! :) |
when it fails to allocate a new page. Co-authored-by: Alan Wu <alansi.xingwu@shopify.com>
It's worth noting that golang also moved away from |
I have created https://bugs.ruby-lang.org/issues/19122 to mirror that Go issue and propose that Ruby also stops using MADV_FREE. |
Yeah, I mean, we decided to use only MADV_DONTNEED in this PR, so we're good. |
Sorry, I should have clarified: that issue is for the remaining usage of MADV_FREE in |
This PR implements code GC for YJIT. The design is as follows:
--yjit-exec-mem-size
is too small, but the benchmark numbers were not bad for such cases.--yjit-exec-mem-size
, it starts marking on-stack ISEQs, and then frees all pages that have no on-stack ISEQs.--yjit-exec-mem-size
, it could leave unneeded code forever. To address that, we could incrementally mark an unused ISEQ when the ISEQ is GCed, and then free pages when all of their ISEQs are GCed. However, we've found that very few ISEQs are GCed on railsbench YJIT: Avoid creating payloads for non-JITed ISEQs #6549, so there's little benefit in doing so. Moreover, by not doing it incrementally, you can completely avoid the overhead of code GC when it doesn't reach the max size. Alternatively, we could also add API to manually trigger code GC, and call it after an application boot.Benchmark
I measured code GC's performance on railsbench with various
--yjit-exec-mem-size
s.--yjit-exec-mem-size=6
doesn't trigger GC while 1~5 do, so only 1~6 are tested.On railsbench, without code GC, YJIT uses 335 code pages (8~16Kib * 335 = 2.61~5.23MiB). With enough
--yjit-exec-mem-size
and code GC, YJIT leaves only 146~180 code pages (8~16KiB * 146~180 = 1.14~2.81MiB).Logs:
--enable-yjit
for benchmarking the speed--yjit-exec-mem-size=1
--yjit-exec-mem-size=2
--yjit-exec-mem-size=3
--yjit-exec-mem-size=4
--yjit-exec-mem-size=5
--yjit-exec-mem-size=6
Logs:
--enable-yjit=dev
for stats--yjit-exec-mem-size=1
--yjit-exec-mem-size=2
--yjit-exec-mem-size=3
--yjit-exec-mem-size=4
--yjit-exec-mem-size=5
--yjit-exec-mem-size=6