
Embed const caches that aren't cref dependent #12900

Open

byroot wants to merge 21 commits into master from inline-const-cache-yjit

Conversation

byroot
Member

@byroot byroot commented Mar 11, 2025

Opening this as a draft to get some CI coverage. I already know RJIT is broken.

Currently, every iseq_inline_constant_cache keeps its extra storage inside an IMEMO/constcache object.
That extension object stores one flag (so one bit), the reference to the cached value (8B), and the cref pointer (8B).

However, the vast majority of const caches don't have a cref, which means we're allocating 40B, and using another 8B to reference it, just to store 8B plus 1 bit.

Given that iseq_inline_constant_cache.segments is a pointer, we can use the lower 3 bits to store tags.

The idea here is to merge iseq_inline_constant_cache_entry.flags into the lower bits of iseq_inline_constant_cache.segments, and also store an extra flag that records whether this cache is cref dependent.

When it isn't (the vast majority of cases), we can store the cached value directly inside the inline cache and skip the _entry object entirely.

Expectations:

  • The vast majority of IMEMO/constcache instances are eliminated.
  • We save one pointer chase, in exchange for a bitmask check and some pointer untagging.
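
Roughly, the tagging scheme could look like the sketch below. This is illustrative only: the helper and flag names are assumptions (CONST_CACHE_FLAGS_MASK does appear later in this thread), and the real layout may differ.

/* Sketch of the segments-pointer tagging described above. Since the
 * pointer is at least 8-byte aligned, its lower 3 bits are free.
 * Flag and helper names are illustrative, not the final ones. */
#define CONST_CACHE_FLAGS_MASK ((uintptr_t)0x7)
#define CONST_CACHE_SHAREABLE  ((uintptr_t)0x1)
#define CONST_CACHE_HAS_EXT    ((uintptr_t)0x2) /* cache is cref dependent */

static inline const ID *
vm_icc_segments(const struct iseq_inline_constant_cache *ic)
{
    /* Strip the tag bits to recover the real segments pointer. */
    return (const ID *)((uintptr_t)ic->segments & ~CONST_CACHE_FLAGS_MASK);
}

static inline bool
vm_icc_flag_p(const struct iseq_inline_constant_cache *ic, uintptr_t flag)
{
    return ((uintptr_t)ic->segments & flag) != 0;
}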

cc @etiennebarrie

@byroot
Member Author

byroot commented Mar 11, 2025

I already know RJIT is broken.

Actually, it might not be, because it doesn't seem to use const caches that don't have a cref.

@byroot byroot force-pushed the inline-const-cache-yjit branch from 224da92 to 0a68a85 on March 11, 2025 09:02
@byroot byroot marked this pull request as ready for review March 11, 2025 09:03
@matzbot matzbot requested a review from a team March 11, 2025 09:03

launchable-app bot commented Mar 11, 2025

All Tests passed!

✖️ no tests failed ✔️ 61905 tests passed (3 flakes)

@byroot byroot force-pushed the inline-const-cache-yjit branch from c85dd5e to 0a0b09a on March 11, 2025 11:03
@byroot
Member Author

byroot commented Mar 11, 2025

@ruby/yjit would someone have a bit of time to help me update YJIT to handle the new constcache layout?

I suspect I need to check for the HAS_EXT flag in the generated YJIT code, but I'm not quite certain it's the best solution. It might be better to just invalidate the code so we don't have to generate an extra branch?

@byroot byroot force-pushed the inline-const-cache-yjit branch from 3dad53e to eea34d8 on March 11, 2025 13:16
vm_insnhelper.c Outdated
Comment on lines 6411 to 6428
const rb_cref_t *cref = vm_get_const_key_cref(reg_ep);
if (cref) {
    vm_icc_set_flag(ic, CONST_CACHE_HAS_EXT);
    struct iseq_inline_constant_cache_entry *ice = IMEMO_NEW(struct iseq_inline_constant_cache_entry, imemo_constcache, 0);
    RB_OBJ_WRITE(iseq, &ic->c.ext, ice);
    RB_OBJ_WRITE(ic->c.ext, &ice->value, val);
    ice->ic_cref = cref;
}
Member

@XrXr XrXr Mar 11, 2025

So @ko1 made it always allocate an imemo for ractor data-race reasons; that way, cache fills are always an atomic pointer swap: e7fc353

The multiple pointer sets here don't look thread-safe.

@byroot (Member Author)

Oh, that's good info, thank you. I'll see if I can preserve the thread safety; I think with a bit of care it's possible.

@byroot (Member Author)

Hum, actually, I'm not so sure. The problem is really that both the RACTOR_SHAREABLE flag and the VALUE have to be set atomically, so short of having access to 128-bit atomics, I don't see how it could be achieved 😢

@byroot byroot closed this Mar 11, 2025
@byroot byroot reopened this Mar 13, 2025
@matzbot matzbot requested a review from a team March 13, 2025 19:05
@byroot
Member Author

byroot commented Mar 13, 2025

Reopening, because this might actually be possible using CAS.

vm_core.h Outdated
@@ -269,22 +273,91 @@ STATIC_ASSERT(sizeof_iseq_inline_constant_cache_entry,
               sizeof(const rb_cref_t *)) <= RVALUE_SIZE);

struct iseq_inline_constant_cache {
    struct iseq_inline_constant_cache_entry *entry;
    union {
        struct iseq_inline_constant_cache_entry *ext;
Member

@k0kubun k0kubun Mar 13, 2025

What does "ext" stand for? "Ext"ended? I find this name confusing.

Could we just call this extended and rename the flag to CONST_CACHE_HAS_EXTENDED (if that's what it means)? Or maybe heap and CONST_CACHE_IN_HEAP (or CONST_CACHE_EMBEDDED).

@byroot (Member Author)

Yes, that's what it stands for. If I manage to make this work, I'll rename it as you suggested.

@byroot
Member Author

byroot commented Mar 14, 2025

@XrXr I think I pushed a new version that works with ractors, but I'm not super accustomed to all the constraints, so I'd really appreciate it if you could review this.

How it works now:

  • IC->value is initialized to Qundef, and the segments pointer is not tagged by default.
  • When a thread hits a miss, it does the constant resolution like before and assigns the cached value using CAS.
  • IC->value is either the constant value or an IMEMO; the caller has to check for that.
  • The CONST_CACHE_SHAREABLE flag isn't set atomically, but when it is missing we can call rb_ractor_shareable_p, since we'd raise an exception anyway; the overhead is negligible, and this should only happen when there is a race.

Here I somewhat assume vm_icc_reset is only called with the ractor lock held, but if I'm wrong, I think we can get away with removing the flag before setting IC->value = Qundef?

How does that sound?
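
A rough sketch of that fill path, assuming RUBY_ATOMIC_VALUE_CAS(var, old, new) returns the previously stored value; the function name, the write-barrier call, and the ic->value field are illustrative, not necessarily what the patch ends up with:

/* Sketch of the CAS-based fill described above (not the actual patch).
 * Only the thread whose CAS succeeds publishes the value and the flag. */
static void
vm_icc_fill(const rb_iseq_t *iseq, struct iseq_inline_constant_cache *ic, VALUE val)
{
    if (RUBY_ATOMIC_VALUE_CAS(ic->value, Qundef, val) == Qundef) {
        RB_OBJ_WRITTEN(iseq, Qundef, val); /* write barrier for the iseq */
        if (rb_ractor_shareable_p(val)) {
            /* Not atomic with the value: a reader that misses this flag
             * re-checks with rb_ractor_shareable_p before raising. */
            vm_icc_set_flag(ic, CONST_CACHE_SHAREABLE);
        }
    }
    /* If the CAS lost, another thread already filled the cache. */
}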

vm_insnhelper.c Outdated
Comment on lines 6421 to 6428
const rb_cref_t *cref = vm_get_const_key_cref(reg_ep);
if (RB_UNLIKELY(cref)) {
    struct iseq_inline_constant_cache_entry *ice = IMEMO_NEW(struct iseq_inline_constant_cache_entry, imemo_constcache, 0);
    RB_OBJ_WRITE(ice, &ice->value, val);
    ice->ic_cref = cref;
    cache_val = (VALUE)ice;
}
Contributor

Should add a short comment to explain what this bit does. What does it mean for cref to be null or not null?

Member

@XrXr XrXr left a comment

I don't think letting CONST_CACHE_SHAREABLE race this way is correct. Since it's not set atomically with the cached value, it's possible to observe a stale set flag from another thread, for a value that's actually not sharable, allowing for smuggling of values between ractors without moving.

@byroot
Member Author

byroot commented Mar 14, 2025

😞, do you have any easy-to-understand literature you'd recommend on this topic? I'd like to better understand the constraints here.

While I get that the writes are not atomic, I don't understand whether they're ordered, what the granularity is, etc.

Alternatively, I looked at 128-bit atomics, but from what I could find, support seems rather erratic.

@XrXr
Member

XrXr commented Mar 14, 2025

Sorry, I misspoke. This is actually about atomicity, not ordering. The possible schedule is this:

Thread 1       Thread 2           Thread 3

set value
             set unsharable value
set flag                        
                                  read value
                                  read flag
             clear flag

Generally, if two states are not set together atomically, all possible combinations of the two are observable. So even if you put the clear before setting the value, a smuggling schedule is still possible.

Sorry, I know you two put effort into this and it's not fun to learn about the undocumented intentions, especially since the current code is not 100% correct either. Because it's not using atomics, the writes during a cache fill can be observed in any order, including one where the other thread sees the imemo pointer but not the initialized contents of the pointed-to imemo. (This one is actually about ordering.)

Memory ordering is a complex topic and it takes a while to build up intuition. This is one of the blog posts I read: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ but I learned from a bunch of other sources. Another phrase to search around for is "Acquire-Release semantics".
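
For what it's worth, the release/acquire pairing that addresses the ordering half of the problem looks like this in plain C11. This is a generic illustration, not Ruby code: it orders the publish of a pointer after the writes it guards, but it still does not make "value + flag" a single atomic unit, which is the smuggling issue above.

#include <stdatomic.h>

struct payload { int data; };

static _Atomic(struct payload *) slot;

void
writer(struct payload *p)
{
    p->data = 42;                                          /* plain write */
    atomic_store_explicit(&slot, p, memory_order_release); /* publish */
}

int
reader(void)
{
    struct payload *p = atomic_load_explicit(&slot, memory_order_acquire);
    /* If the acquire load sees p, it also sees p->data == 42. */
    return p ? p->data : -1;
}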

@byroot
Member Author

byroot commented Mar 14, 2025

Ok, I think I understand what you mean, and thank you for the link.

If I understood the link correctly, given that the cache is only ever cleared from the main ractor (these are const caches), and AFAIK with the global lock held, we could clear the flag before clearing the value:

static void
vm_icc_reset(struct iseq_inline_constant_cache *ic)
{
    ic->tagged_segments &= ~CONST_CACHE_FLAGS_MASK;
    // Insert a StoreStore fence here (not sure how to do that)
    ic->value = Qundef;
}

This way, it's still possible to see a shareable value without the corresponding flag, but never an unshareable value with the flag, no?

If I'm still incorrect, just say so, it's OK; I don't want to take up any more of your time.

@XrXr
Member

XrXr commented Mar 14, 2025

No, smuggling is still possible since the race happens after the fence, when threads compete to fill the cache. So basically the same schedule is possible:

Thread 1       Thread 2           Thread 3

set value
             set unsharable value
set flag                        
                                  read value
                                  read flag

Actually "clear cache" in the older schedule is a bit misleading since it's not really relevant. It's just that there is a window between setting the value and the flag.

@byroot
Member Author

byroot commented Mar 14, 2025

Ok, thank you. So short of having 128-bit atomics, or of packing that bit in the VALUE, there isn't a proper way to re-inline these caches 😢

@XrXr
Member

XrXr commented Mar 14, 2025

You could use a spin lock to synchronize fills and reads, I guess, but that sacrifices single-threaded speed.
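
For reference, the spin-lock variant would look something like this with C11 atomics; it's a sketch only, and the cost comes from every read and every fill having to take the lock.

#include <stdatomic.h>

static atomic_flag icc_lock = ATOMIC_FLAG_INIT;

static void
icc_lock_acquire(void)
{
    /* Busy-wait until the current holder releases the lock. */
    while (atomic_flag_test_and_set_explicit(&icc_lock, memory_order_acquire)) {
        /* spin */
    }
}

static void
icc_lock_release(void)
{
    atomic_flag_clear_explicit(&icc_lock, memory_order_release);
}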

@byroot
Member Author

byroot commented Mar 14, 2025

Or I could inline them only if no other ractor has been spawned yet?

@XrXr
Member

XrXr commented Mar 14, 2025

That could work. Since there's no way to transition from embedded to not embedded without 128-bit atomics, you can treat all embedded caches as busts in multi-ractor mode and fill and clear the flag. And make it so only share-ability claims in imemos are valid.

That sounds like it irons out all the inconsistent states. It's hard to reason about, though, and I might be missing something.

@byroot
Member Author

byroot commented Mar 14, 2025

you can treat all embedded caches as busts in multi-ractor mode and fill and clear the flag.

I don't think you need to, though. You can keep the embedded caches even after a second ractor has been spawned.

@XrXr
Member

XrXr commented Mar 14, 2025

Ah right. No one writes embedded caches anymore after a ractor spawns.
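
In other words, the fill path only embeds while the process is still single-ractor. A sketch of that guard (rb_multi_ractor_p and the surrounding names are assumptions based on this discussion, not the actual patch):

/* Embed only while single-ractor; otherwise always go through an
 * imemo_constcache entry, so embedded caches are never written after
 * a second ractor has been spawned. */
if (!cref && !rb_multi_ractor_p()) {
    /* single-ractor: store val directly in ic->value (embedded cache) */
}
else {
    /* cref-dependent or multi-ractor: allocate an imemo entry as before */
}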

etiennebarrie and others added 15 commits April 8, 2025 14:13
Now `ic->value` is set with `RUBY_ATOMIC_VALUE_CAS`.
If we fail to write the value, we don't set the
`CONST_CACHE_SHAREABLE` flag either.

This leaves a small race condition window during which
`ic->value` is set, but the `CONST_CACHE_SHAREABLE` flag hasn't
been assigned yet.

However, given that in such a case we'd throw an error, it's OK
performance-wise to call `rb_ractor_shareable_p` to double-check
before raising the error.
@byroot byroot force-pushed the inline-const-cache-yjit branch from 0d02e6a to b87dc3f on April 8, 2025 13:49
Labels: None yet

5 participants