ZJIT: Use Vec instead of HashMap for profiling #13809

Merged: 3 commits into ruby:master from mb-benchmark-compile on Jul 11, 2025

Conversation

@tekknolagi (Contributor) commented Jul 7, 2025

Using a Vec instead of a HashMap for profiling is notably faster: there is no need to hash instruction indices on every profile event.
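A minimal sketch of the shape of the change (illustrative types and names, not the actual ZJIT code): with a HashMap, every profile event hashes the instruction index; with a Vec, the index is just an offset into dense storage.

```rust
use std::collections::HashMap;

/// Stand-in for whatever per-operand data ZJIT records while profiling.
#[derive(Clone, Default)]
struct OpndProfile;

/// Before (sketch): keyed by instruction index, so every lookup hashes.
#[allow(dead_code)]
struct ProfileByMap {
    opnd_types: HashMap<usize, Vec<OpndProfile>>,
}

/// After (sketch): instruction indices are small dense integers, so a Vec
/// indexed by insn_idx makes each lookup a bounds check plus an offset.
struct ProfileByVec {
    opnd_types: Vec<Vec<OpndProfile>>,
}

impl ProfileByVec {
    fn operands_mut(&mut self, insn_idx: usize) -> &mut Vec<OpndProfile> {
        &mut self.opnd_types[insn_idx] // no hashing on the hot path
    }
}
```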

Before:

plum% samply record ~/.rubies/ruby-zjit/bin/ruby --zjit benchmarks/getivar.rb
ruby 3.5.0dev (2025-07-10T14:40:49Z master 51252ef8d7) +ZJIT dev +PRISM [arm64-darwin24]
itr:   time
 #1: 5311ms
 #2:   49ms
 #3:   49ms
 #4:   48ms

After:

plum% samply record ~/.rubies/ruby-zjit/bin/ruby --zjit benchmarks/getivar.rb
ruby 3.5.0dev (2025-07-10T15:09:06Z mb-benchmark-compile 42ffd3c1ee) +ZJIT dev +PRISM [arm64-darwin24]
itr:   time
 #1: 1332ms
 #2:   49ms
 #3:   48ms
 #4:   48ms

@tekknolagi tekknolagi force-pushed the mb-benchmark-compile branch from 005a81b to 42ffd3c Compare July 10, 2025 15:09
@tekknolagi tekknolagi marked this pull request as ready for review July 10, 2025 15:32
@tekknolagi tekknolagi requested a review from a team July 10, 2025 15:32
@tekknolagi (Contributor, Author):

Now I just have to remove the opcode parameter from the C function the interpreter is calling

impl IseqProfile {
    pub fn new(iseq: IseqPtr) -> Self {
        // Pre-size all the operand slots in the opnd_types table so profiling is as fast as possible
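For context, a hedged sketch of what this pre-sizing could look like (hypothetical helper names, not the exact diff): each slot starts as an empty vec![] and is grown once, up front, so the profiling hot path never has to resize.

```rust
/// Sketch only: pre-size one inner Vec per instruction at ISEQ registration
/// time. `num_insns` and `operand_count` are hypothetical stand-ins for
/// whatever ZJIT derives from the ISEQ.
fn presize_opnd_types(num_insns: usize, operand_count: impl Fn(usize) -> usize) -> Vec<Vec<u8>> {
    let mut opnd_types = vec![vec![]; num_insns]; // vec![] initial values
    for (insn_idx, types) in opnd_types.iter_mut().enumerate() {
        types.resize(operand_count(insn_idx), 0); // one up-front resize each
    }
    opnd_types
}
```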
@k0kubun (Member):

I didn't quite understand how the pre-sizing is helping here. You're still giving vec![] as the initial value to each vector and then resizing each of them.

If we instead resized a vector only the first time we insert at its index, it probably wouldn't be any slower, would it? That way we could skip resizing the vectors for instructions that are never profiled, so not doing the pre-sizing could be faster and save memory, IIUC.

@tekknolagi (Contributor, Author):

This pre-sizing happens once per ISEQ per VM.

The previous code checked whether the vec was the right size on the hot path: once per profile, per opcode, per ISEQ.

It was worth an additional 100-200ms of savings, I think.

@k0kubun (Member) commented Jul 10, 2025:

I get that the current version is faster, but I'm only suggesting the same speedup could be achieved without pre-sizing.

It seems like the slowness in the previous version came not from the "checking" but from the fact that the vector was optional, so we ended up calling clone() every time to upsert Some(). But now that it's a non-optional vector and you can always take &mut to it, you should be able to check whether the vector's length is 0 and resize it on the first profile, without paying any allocation overhead on future profiles of the same opcode. Checking a vector's length should be negligible overhead compared to cloning (or resizing) one.

So I didn't get why it was changed to pre-sizing, but since it's already an improvement, I'm fine with merging this as is for now.
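A hedged sketch of the alternative described here (hypothetical names): leave every slot empty and size it the first time that instruction is profiled, so the allocation happens at most once per instruction and the steady-state cost is a length check.

```rust
/// Sketch only: lazily size the inner Vec on the first profile event.
fn operands_mut(opnd_types: &mut Vec<Vec<u8>>, insn_idx: usize, num_opnds: usize) -> &mut Vec<u8> {
    let types = &mut opnd_types[insn_idx];
    if types.is_empty() {
        types.resize(num_opnds, 0); // allocates only on this insn's first profile
    }
    types // later calls pay just the is_empty() check, no allocation
}
```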

@tekknolagi (Contributor, Author):

I'll remove the second commit for now

@k0kubun (Member):

Hmm, reverting it seems to slow it down by 5% or so. I guess doing multiple allocations at once helps? Since it does give a speedup in benchmarks, I'm actually fine with not reverting it.

Running benchmark "getivar" (1/1)
+ setarch x86_64 -R taskset -c 10 /opt/rubies/before/bin/ruby --zjit -I harness /home/k0kubun/src/github.com/Shopify/yjit-bench/benchmarks/getivar.rb
ruby 3.5.0dev (2025-07-11T15:51:28Z mb-benchmark-compile c00566de23) +ZJIT +PRISM [x86_64-linux]
itr:   time
 #1:  651ms
RSS: 14.7MiB
MAXRSS: 15.9MiB
Running benchmark "getivar" (1/1)
+ setarch x86_64 -R taskset -c 10 /opt/rubies/after/bin/ruby --zjit -I harness /home/k0kubun/src/github.com/Shopify/yjit-bench/benchmarks/getivar.rb
ruby 3.5.0dev (2025-07-11T15:51:28Z mb-benchmark-compile 993d5aa2a4) +ZJIT +PRISM [x86_64-linux]
itr:   time
 #1:  687ms
RSS: 14.6MiB
MAXRSS: 15.9MiB
Total time spent benchmarking: 1s

before: ruby 3.5.0dev (2025-07-11T15:51:28Z mb-benchmark-compile c00566de23) +ZJIT +PRISM [x86_64-linux]
after: ruby 3.5.0dev (2025-07-11T15:51:28Z mb-benchmark-compile 993d5aa2a4) +ZJIT +PRISM [x86_64-linux]

-------  -----------  ----------  ----------  ----------  -------------  ------------
bench    before (ms)  stddev (%)  after (ms)  stddev (%)  after 1st itr  before/after
getivar  651.6        0.0         687.2       0.0         0.948          0.948
-------  -----------  ----------  ----------  ----------  -------------  ------------
Legend:
- after 1st itr: ratio of before/after time for the first benchmarking iteration.
- before/after: ratio of before/after time. Higher is better for after. Above 1 represents a speedup.

@tekknolagi (Contributor, Author):

I'll probably change profiling to not use Type, and instead do a ClassDistribution-style thing of (VALUE class, shape_id_t shape) tuples. Is there a clean way of representing:

  • Ruby class
  • Fixnum & other immediates
  • Shape

in one small data structure?

@k0kubun (Member):

There's no such thing in CRuby itself. YJIT built its own variable-length encoding of profiling data in Context::encode()/Context::decode(). We probably need something different from YJIT's, so I guess we'll build one for ZJIT too.

Btw, shapes are effectively used only for ivars today, so it's probably inefficient to use a tuple that contains a shape for every instruction.
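Purely as an illustration of the question, one way such a compact representation could look (none of these names exist in CRuby or ZJIT, and the field widths are assumptions); per the point above, the shape case would presumably only be carried at ivar sites:

```rust
/// Hypothetical sketch: one small tagged value covering all three cases.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ProfiledValue {
    /// Ordinary object: the VALUE of its class (assumed to fit in 64 bits).
    Class(u64),
    /// Fixnum and other immediates: the tagged VALUE itself identifies them.
    Immediate(u64),
    /// Ivar sites only: the object's shape_id_t (assumed 32-bit).
    Shape(u32),
}
```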

@tekknolagi tekknolagi force-pushed the mb-benchmark-compile branch from 19dda26 to 993d5aa Compare July 11, 2025 15:51
@tekknolagi tekknolagi enabled auto-merge (squash) July 11, 2025 15:51
@tekknolagi tekknolagi merged commit b0b1712 into ruby:master Jul 11, 2025
84 checks passed
@tekknolagi tekknolagi deleted the mb-benchmark-compile branch July 11, 2025 16:55