YJIT: Allow inlining ISEQ calls with a block #9622

k0kubun · 2024-01-20T01:52:44Z

This PR introduces an annotation Primitive.attr! :inline_block to allow monomorphizing yield in methods like Array#each #9533.

yjit-bench

This PR or Array#each in Ruby (#9533) doesn't have a significant impact on benchmarks by itself.

However, combining this PR and Array#each in Ruby with the annotation (branch), it gives the following speedups.

hexapdf is 7% faster
chunky-png and ruby-lsp are 2% faster
activerecord, liquid-c, and railsbench are 1% faster

before: ruby 3.4.0dev (2024-01-20T01:12:07Z master 99d6e2f1ee) +YJIT [x86_64-linux]
after: ruby 3.4.0dev (2024-01-20T04:37:44Z yjit-inline-block-.. ae09d8f7a5) +YJIT [x86_64-linux]

--------------  -----------  ----------  ----------  ----------  -------------  ------------
bench           before (ms)  stddev (%)  after (ms)  stddev (%)  after 1st itr  before/after
activerecord    54.1         0.6         53.8        0.7         1.01           1.01
chunky-png      558.9        0.2         550.4       0.6         1.02           1.02
erubi-rails     1610.9       0.9         1611.4      0.9         1.01           1.00
hexapdf         2035.2       1.0         1895.8      2.0         0.99           1.07
liquid-c        71.3         1.3         70.7        0.9         1.00           1.01
liquid-compile  72.2         0.3         72.2        0.4         0.99           1.00
liquid-render   96.2         0.4         96.4        1.1         1.00           1.00
lobsters        945.4        1.1         941.4       1.0         0.99           1.00
mail            145.3        0.4         145.4       1.6         1.00           1.00
psych-load      1912.1       0.2         1921.0      0.2         0.99           1.00
railsbench      2066.7       0.2         2043.3      0.1         0.99           1.01
ruby-lsp        57.5         26.2        56.5        27.2        1.00           1.02
sequel          86.3         0.6         86.1        1.3         1.00           1.00
--------------  -----------  ----------  ----------  ----------  -------------  ------------

See also: the number of Ruby block calls from Array#each Shopify/yjit-bench#168

Memory usage

Running 25 itrs on lobsters with this PR and Array#each using the annotation (branch), code_region_size was increased by 1.2%, and yjit_alloc_size was increased by 4.5%.

For optimizing the size of Context, I'm thinking of having a global HashMap like { iseq1 => 1, iseq2 => 2, ... } and store an u16 integer instead of an u64 ISEQ pointer as a next step, which would likely decrease yjit_alloc_size.

Before

code_region_size:         15,024,128
yjit_alloc_size:          27,201,208

After

code_region_size:         15,204,352
yjit_alloc_size:          28,414,571
max_inline_versions:             121

microbenchmark

This PR makes the synthetic benchmark of megamorphic yield in Integer#times 4.5x faster.

$ benchmark-driver benchmark/loop_times_megamorphic.yml --chruby 'before::before --yjit-call-threshold=1;after::after --yjit-call-threshold=1'
Warming up --------------------------------------
loop_times_megamorphic    18.810k i/s -     20.004k times in 1.063473s (53.16μs/i)
Calculating -------------------------------------
                           before       after
loop_times_megamorphic    17.820k     79.926k i/s -     56.430k times in 3.166588s 0.706025s

Comparison:
             loop_times_megamorphic
                 after:     79926.3 i/s
                before:     17820.4 i/s - 4.49x  slower

maximecb · 2024-01-22T23:28:36Z

I don't quite understand how this PR works. Please bear with me. The code for send is so complex that I find it hard to follow sometimes.

You say it's duplicating the ISEQ for each block. Conceptually, I understand what that means. However, I don't understand why this has an impact on the maximum number of versions in the JIT.

Are multiple entry blocks being generated for the ISEQs based on the block given in the call?

I guess I'm wondering: would it be possible to somehow duplicate the ISEQ directly at the interpreter level? That might be slightly less complex on the JIT side, and it would avoid having to add extra state to the context?

k0kubun · 2024-01-23T00:07:41Z

You say it's duplicating the ISEQ for each block. Conceptually, I understand what that means. However, I don't understand why this has an impact on the maximum number of versions in the JIT.

Whenever it duplicates each basic block of Array#each, it adds a version to that basic block that is called from a specific callsite and is going to call a particular block ISEQ.

Say, we have:

module Kernel
  def tap
    yield(self) # basic block 1: putself + invokeblock (version a, version b)
    self        # basic block 2: putself + leave (version a, version b)
  end
end

1.tap { :a }
2.tap { :b }

1.tap directly jumps to version a of basic block 1, which should only jump to { :a }, then jumps to version a of basic block 2, and returns to the caller.

They would be located linearly in the generated code. It's basically inlining except the frame push/pop for Array#each is still there.

Are multiple entry blocks being generated for the ISEQs based on the block given in the call?

If "entry blocks" means jit_entry, no. If it means the first block of the ISEQ, yes.

This PR duplicates blocks of Array#each for each blockiseq given to send. One blockiseq in the above example is { :a }, and another one is { :b }. An ISEQ literal is usually unique to each callsite. It could be extracted from &block and then shared by multiple callsites, but that doesn't make yield megamorphic as long as it's the same ISEQ. This is why we want to duplicate Array#each for each blockiseq.

I guess I'm wondering: would it be possible to somehow duplicate the ISEQ directly at the interpreter level? That might be slightly less complex on the JIT side, and it would avoid having to add extra state to the context?

Technically, it's possible if you rewrite how the interpreter does method lookup. The lookup is currently based on klass and mid. The lookup should include blockiseq, or it needs an extra check for annotated ISEQs after the main lookup. However, the former would be hard to implement, and the latter would make every non-cached ISEQ lookup slower.

It would also mean to add extra memory overhead to allocate more ISEQs and add more GC pressure because of them.

So, yes, it's possible, but harder to implement, uses more memory, and makes things slower in unrelated places.

XrXr · 2024-01-23T18:57:27Z

yjit/src/codegen.rs

+    // If the callee has :inline_block annotation and the callsite has a block ISEQ,
+    // duplicate a callee block for each block ISEQ to make its `yield` monomorphic.
+    if let (Some(BlockHandler::BlockISeq(iseq)), true) = (block, builtin_attrs & BUILTIN_ATTR_INLINE_BLOCK != 0) {
+        callee_ctx.set_inline_block(iseq);


I guess this is the magic sauce. Because the callee_ctx is probably doesn't match anything that currently exists, it's going to generate a new block directly following the call.

k0kubun force-pushed the yjit-inline-block branch 2 times, most recently from 09b9d49 to 92acf33 Compare January 20, 2024 01:55

YJIT: Allow inlining ISEQ calls with a block

5357897

k0kubun force-pushed the yjit-inline-block branch from 92acf33 to 5357897 Compare January 20, 2024 04:37

k0kubun marked this pull request as ready for review January 20, 2024 05:16

matzbot requested a review from a team January 20, 2024 05:17

maximecb requested a review from XrXr January 23, 2024 18:40

maximecb approved these changes Jan 23, 2024

View reviewed changes

XrXr approved these changes Jan 23, 2024

View reviewed changes

Leave a TODO comment about u16 inline_block

15f6a02

k0kubun enabled auto-merge (squash) January 23, 2024 19:21

k0kubun merged commit 27c1dd8 into ruby:master Jan 23, 2024

k0kubun deleted the yjit-inline-block branch January 23, 2024 19:36

k0kubun mentioned this pull request Jan 23, 2024

Rewrite Array#each in Ruby using Primitive #9533

Merged

k0kubun mentioned this pull request Feb 27, 2024

[DOC] Stop discouraging the use of Array#each #10119

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

YJIT: Allow inlining ISEQ calls with a block #9622

YJIT: Allow inlining ISEQ calls with a block #9622

Uh oh!

k0kubun commented Jan 20, 2024 •

edited

Loading

Uh oh!

maximecb commented Jan 22, 2024

Uh oh!

k0kubun commented Jan 23, 2024 •

edited

Loading

Uh oh!

XrXr Jan 23, 2024

Uh oh!

Uh oh!

YJIT: Allow inlining ISEQ calls with a block #9622

YJIT: Allow inlining ISEQ calls with a block #9622

Uh oh!

Conversation

k0kubun commented Jan 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

yjit-bench

Memory usage

Before

After

microbenchmark

Uh oh!

maximecb commented Jan 22, 2024

Uh oh!

k0kubun commented Jan 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

XrXr Jan 23, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

k0kubun commented Jan 20, 2024 •

edited

Loading

k0kubun commented Jan 23, 2024 •

edited

Loading