Skip to content

YJIT: Allow inlining ISEQ calls with a block #9622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 23, 2024

Conversation

k0kubun
Copy link
Member

@k0kubun k0kubun commented Jan 20, 2024

This PR introduces an annotation Primitive.attr! :inline_block to allow monomorphizing yield in methods like Array#each #9533.

yjit-bench

This PR or Array#each in Ruby (#9533) doesn't have a significant impact on benchmarks by itself.

However, combining this PR and Array#each in Ruby with the annotation (branch), it gives the following speedups.

  • hexapdf is 7% faster
  • chunky-png and ruby-lsp are 2% faster
  • activerecord, liquid-c, and railsbench are 1% faster
before: ruby 3.4.0dev (2024-01-20T01:12:07Z master 99d6e2f1ee) +YJIT [x86_64-linux]
after: ruby 3.4.0dev (2024-01-20T04:37:44Z yjit-inline-block-.. ae09d8f7a5) +YJIT [x86_64-linux]

--------------  -----------  ----------  ----------  ----------  -------------  ------------
bench           before (ms)  stddev (%)  after (ms)  stddev (%)  after 1st itr  before/after
activerecord    54.1         0.6         53.8        0.7         1.01           1.01
chunky-png      558.9        0.2         550.4       0.6         1.02           1.02
erubi-rails     1610.9       0.9         1611.4      0.9         1.01           1.00
hexapdf         2035.2       1.0         1895.8      2.0         0.99           1.07
liquid-c        71.3         1.3         70.7        0.9         1.00           1.01
liquid-compile  72.2         0.3         72.2        0.4         0.99           1.00
liquid-render   96.2         0.4         96.4        1.1         1.00           1.00
lobsters        945.4        1.1         941.4       1.0         0.99           1.00
mail            145.3        0.4         145.4       1.6         1.00           1.00
psych-load      1912.1       0.2         1921.0      0.2         0.99           1.00
railsbench      2066.7       0.2         2043.3      0.1         0.99           1.01
ruby-lsp        57.5         26.2        56.5        27.2        1.00           1.02
sequel          86.3         0.6         86.1        1.3         1.00           1.00
--------------  -----------  ----------  ----------  ----------  -------------  ------------

See also: the number of Ruby block calls from Array#each Shopify/yjit-bench#168

Memory usage

Running 25 itrs on lobsters with this PR and Array#each using the annotation (branch), code_region_size was increased by 1.2%, and yjit_alloc_size was increased by 4.5%.

For optimizing the size of Context, I'm thinking of having a global HashMap like { iseq1 => 1, iseq2 => 2, ... } and store an u16 integer instead of an u64 ISEQ pointer as a next step, which would likely decrease yjit_alloc_size.

Before

code_region_size:         15,024,128
yjit_alloc_size:          27,201,208

After

code_region_size:         15,204,352
yjit_alloc_size:          28,414,571
max_inline_versions:             121

microbenchmark

This PR makes the synthetic benchmark of megamorphic yield in Integer#times 4.5x faster.

$ benchmark-driver benchmark/loop_times_megamorphic.yml --chruby 'before::before --yjit-call-threshold=1;after::after --yjit-call-threshold=1'
Warming up --------------------------------------
loop_times_megamorphic    18.810k i/s -     20.004k times in 1.063473s (53.16μs/i)
Calculating -------------------------------------
                           before       after
loop_times_megamorphic    17.820k     79.926k i/s -     56.430k times in 3.166588s 0.706025s

Comparison:
             loop_times_megamorphic
                 after:     79926.3 i/s
                before:     17820.4 i/s - 4.49x  slower

@k0kubun k0kubun force-pushed the yjit-inline-block branch 2 times, most recently from 09b9d49 to 92acf33 Compare January 20, 2024 01:55
@k0kubun k0kubun marked this pull request as ready for review January 20, 2024 05:16
@matzbot matzbot requested a review from a team January 20, 2024 05:17
@maximecb
Copy link
Contributor

I don't quite understand how this PR works. Please bear with me. The code for send is so complex that I find it hard to follow sometimes.

You say it's duplicating the ISEQ for each block. Conceptually, I understand what that means. However, I don't understand why this has an impact on the maximum number of versions in the JIT.

Are multiple entry blocks being generated for the ISEQs based on the block given in the call?

I guess I'm wondering: would it be possible to somehow duplicate the ISEQ directly at the interpreter level? That might be slightly less complex on the JIT side, and it would avoid having to add extra state to the context?

@k0kubun
Copy link
Member Author

k0kubun commented Jan 23, 2024

You say it's duplicating the ISEQ for each block. Conceptually, I understand what that means. However, I don't understand why this has an impact on the maximum number of versions in the JIT.

Whenever it duplicates each basic block of Array#each, it adds a version to that basic block that is called from a specific callsite and is going to call a particular block ISEQ.

Say, we have:

module Kernel
  def tap
    yield(self) # basic block 1: putself + invokeblock (version a, version b)
    self        # basic block 2: putself + leave (version a, version b)
  end
end

1.tap { :a }
2.tap { :b }

1.tap directly jumps to version a of basic block 1, which should only jump to { :a }, then jumps to version a of basic block 2, and returns to the caller.

They would be located linearly in the generated code. It's basically inlining except the frame push/pop for Array#each is still there.

Are multiple entry blocks being generated for the ISEQs based on the block given in the call?

If "entry blocks" means jit_entry, no. If it means the first block of the ISEQ, yes.

This PR duplicates blocks of Array#each for each blockiseq given to send. One blockiseq in the above example is { :a }, and another one is { :b }. An ISEQ literal is usually unique to each callsite. It could be extracted from &block and then shared by multiple callsites, but that doesn't make yield megamorphic as long as it's the same ISEQ. This is why we want to duplicate Array#each for each blockiseq.

I guess I'm wondering: would it be possible to somehow duplicate the ISEQ directly at the interpreter level? That might be slightly less complex on the JIT side, and it would avoid having to add extra state to the context?

Technically, it's possible if you rewrite how the interpreter does method lookup. The lookup is currently based on klass and mid. The lookup should include blockiseq, or it needs an extra check for annotated ISEQs after the main lookup. However, the former would be hard to implement, and the latter would make every non-cached ISEQ lookup slower.

It would also mean to add extra memory overhead to allocate more ISEQs and add more GC pressure because of them.

So, yes, it's possible, but harder to implement, uses more memory, and makes things slower in unrelated places.

@maximecb maximecb requested a review from XrXr January 23, 2024 18:40
// If the callee has :inline_block annotation and the callsite has a block ISEQ,
// duplicate a callee block for each block ISEQ to make its `yield` monomorphic.
if let (Some(BlockHandler::BlockISeq(iseq)), true) = (block, builtin_attrs & BUILTIN_ATTR_INLINE_BLOCK != 0) {
callee_ctx.set_inline_block(iseq);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is the magic sauce. Because the callee_ctx is probably doesn't match anything that currently exists, it's going to generate a new block directly following the call.

@k0kubun k0kubun enabled auto-merge (squash) January 23, 2024 19:21
@k0kubun k0kubun merged commit 27c1dd8 into ruby:master Jan 23, 2024
@k0kubun k0kubun deleted the yjit-inline-block branch January 23, 2024 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants