-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Introduce BOP_CMP for optimized comparisons #6851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Could you benchmark it with either the original motivation, Mastodon, or yjit-bench https://github.com/shopify/yjit-bench? edit: It looks like it would just consume one extra |
4eef079
to
40f0be0
Compare
Thank you! I wasn't familiar with https://github.com/shopify/yjit-bench, so I'm glad to have learned about it. Here's the output I got:
That looks fairly same-ish to me. Do you agree? I also reran a couple individual benchmarks to see how much the results would vary:
|
Yeah, at least it doesn't look bad 👍 |
It looks like some tests are failing with |
It was a fault of old master. Retrying them might fix it, but I've seen GitHub Actions tests a PR against old master even after retries, so it might be the best to rebase your branch against latest master. |
This commit moves ruby_basic_operators and the unredefined macros out of vm_core.h and into basic_operators.h so that we can use them more broadly in places where we currently use a method look up via `rb_method_basic_definition_p` (e.g. object.c, numeric.c, complex.c, enum.c, but also in internal/compar.h after introducing BOP_CMP and elsewhere if we introduce more BOPs) The most controversial part of this change is probably moving redefined_flag out of rb_vm_t. [vm_opt_method_def_table and vm_opt_mid_table](https://github.com/ruby/ruby/blob/9da2a5204f32a4f2ce135fddde2abb6e07d647e9/vm.c) are not part of rb_vm_t either, and I think this fits well with those. But more significantly it seems to result in one fewer instruction. For example: Before: ``` (lldb) disassemble -n vm_opt_str_freeze miniruby`vm_exec_core: miniruby[0x10028233e] <+14558>: movq 0x11a86b(%rip), %rax ; ruby_current_vm_ptr miniruby[0x100282345] <+14565>: testb $0x4, 0x242c(%rax) ``` After: ``` (lldb) disassemble -n vm_opt_str_freeze ruby`vm_exec_core: ruby[0x100280ebe] <+14510>: testb $0x4, 0x120147(%rip) ; ruby_vm_redefined_flag + 43 ``` Co-authored-by: John Hawthorn <jhawthorn@github.com>
Prior to this commit the `OPTIMIZED_CMP` macro relied on a method lookup to determine whether `<=>` was overridden. The result of the lookup was cached, but only for the duration of the specific method that initialized the cmp_opt_data cache structure. With this method lookup, `[x,y].max` is slower than doing `x > y ? x : y` even though there's an optimized instruction for "new array max". (John noticed somebody a proposed micro-optimization based on this fact in mastodon/mastodon#19903.) ```rb a, b = 1, 2 Benchmark.ips do |bm| bm.report('conditional') { a > b ? a : b } bm.report('method') { [a, b].max } bm.compare! end ``` Before: ``` Comparison: conditional: 22603733.2 i/s method: 19820412.7 i/s - 1.14x (± 0.00) slower ``` This commit replaces the method lookup with a new CMP basic op, which gives the examples above equivalent performance. After: ``` Comparison: method: 24022466.5 i/s conditional: 23851094.2 i/s - same-ish: difference falls within error ``` Relevant benchmarks show an improvement to Array#max and Array#min when not using the optimized newarray_max instruction as well. They are noticeably faster for small arrays with the relevant types, and the same or maybe a touch faster on larger arrays. ``` $ make benchmark COMPARE_RUBY=<master@5958c30> ITEM=array_min $ make benchmark COMPARE_RUBY=<master@5958c30> ITEM=array_max ``` The benchmarks added in this commit also look generally improved. Co-authored-by: John Hawthorn <jhawthorn@github.com>
40f0be0
to
ca330c5
Compare
ca330c5
to
9ce084c
Compare
@composerinteralia @jhawthorn The first commit of this PR c43951e seems to have slowed down MJIT on optcarrot and nbody. Here's the comparison of c43951e and its parent 9d4483f:
Do you have any idea to make it as fast as before? This made MJIT slower than YJIT on Optcarrot. (mission accomplished?) |
Out of curiosity, I benchmarked that commit with the interpreter as well. The difference is small, but I can reliably reproduce the interpreter slowdown on nbody:
Logs
In the Logs, you can see it used to be consistently 67ms (with a few 68ms itrs) whereas the first commit of this PR makes it 68ms (with a few 69ms itrs). I can kind of see such a slightly worse trend on the blue line of nbody on rubybench from that revision too. |
@k0kubun thanks for spotting. Unfortunately I don't seem to be able to reproduce. On my machine (a couple years old zen2) I'm seeing nbody slightly faster on the interpreter with the change (seems consistent, though there is definitely a difference in how much between executions), and no difference to MJIT.
What machine are you on? Your benchmark does look like it's running run a fair bit faster than mine 😓. What arguments are you giving to yjit-bench for MJIT? |
Just --mjit. My local machine uses Ryzen 7 5800X, and the benckmark server is https://rubybench.github.io/hardware.html. Both on Ubuntu 22.04, GCC. I could give you ssh access to the benchmark server if you want to try it. |
This PR introduces a new basic operator, BOP_CMP, so we can quickly check whether
<=>
has been redefined when doing optimized comparisons.The PR is in two commits (plus one more to update the deps), which could be reviewed separtely:
Move BOP macros to separate file
This first commit moves
ruby_basic_operators
and theUNREDEFINED
macros out ofvm_core.h and into basic_operators.h so that we can use them in internal/compar.h, as well as more broadly in places where we currently do a method look up via
rb_method_basic_definition_p
(e.g. object.c, numeric.c, complex.c, enum.c, and a number of other places if we introduce more BOPs).The most controversial part of this first commit is probably moving
redefined_flag
out ofrb_vm_t
. Neither vm_opt_method_def_table or vm_opt_mid_table are part of rb_vm_t either, and I do thinkredefined_flag
fits well with those. But more significantly it seems to result in one fewer instruction. For example:Before:
After:
Introduce BOP_CMP for optimized comparison
Prior to this commit the
OPTIMIZED_CMP
macro relied on a method lookup to determine whether<=>
was overridden. The result of the lookup was cached, but only for the duration of the specific method that initialized the cmp_opt_data cache structure.With this method lookup,
[x,y].max
is slower than doingx > y ? x : y
even though there's an optimized instruction for "new array max". (John noticed somebody a proposed micro-optimization based on this fact in mastodon/mastodon#19903.)Before:
This commit replaces the method lookup with a new CMP basic op, which gives the examples above equivalent performance.
After:
Relevant benchmarks show an improvement to Array#max and Array#min when not using the optimized newarray_max instruction as well. They are noticeably faster for small arrays with the relevant types, and the same or maybe a touch faster on larger arrays.
Benchmark output comparing master@5958c305 against these changes
array_max_int
array_max_float
array_max_str
array_min
array_sort_int
range_min
enum_minmax
enum_sort