-
Notifications
You must be signed in to change notification settings - Fork 5.4k
get_next_shape_internal: Skip VM lock for single child case #13191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
✅ All Tests passed!✖️no tests failed ✔️61968 tests passed(3 flakes) |
I did just that and it seems my assumption holds pretty well. From a big monolith:
Overall 72.8% of shapes have a single child. |
shape.c
Outdated
@@ -499,13 +499,23 @@ get_next_shape_internal(rb_shape_t * shape, ID id, enum shape_type shape_type, b | |||
|
|||
*variation_created = false; | |||
|
|||
// Fast path: if the shape has a single child, we can check it without a lock | |||
struct rb_id_table *edges = shape->edges; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use atomics here and below where we write shape->edges
. Otherwise the compiler could rearrange the initialization of the shape
until after the pointer is written and on some platforms the hardware could do the same memory reordering.
There is definitely elsewhere in Ruby we're doing the same thing (ex. inline caches) and apparently not seeing issues, but maybe that is due to insufficient testing on platforms with weaker memory models (arm), and luck that the compiler doesn't choose to rearrange too much/the hardware isn't running "concurrently enough" to hit the issue 🤷♂️.
It looks like we have ATOMIC_PTR_LOAD, but not ATOMIC_PTR_SET. I implemented ATOMIC_VALUE_SET here, which really should be the same thing. TBH I'm not sure why we have so much duplication in ruby/atomic.h
where I think we know some of these types to be identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. I think I understand the concern.
If I understand it correctly, this is mostly setting the pointer that need to use atomic, so that we guaranteed the shape content as fully been written.
I'm not sure I understand why we'd need an atomic load though. Presumably if we can see the pointer in shape->edges
, the child shape content should be flushed, and we'd see it consistently too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lazy answer is that the spec says we have to, otherwise it's UB and TSan will complain if we don't. This read of shape->edges
could be combined with the one inside the lock (so we would see stale information after taking the lock). But I do think even aside from that it's still necessary to use an atomic.
I'm not 100% on this, but let me try for the less lazy answer 😅. My understanding of the memory model is that if we want writes from one thread to all be visible in another we need the writer to do an atomic operation with at least memory_order_release
and then have it read with at least memory_order_acquire
(all Ruby's atomics use seq_cst, which should just be stricter). The GCC docs link to a wiki page with some examples.
On x86 I think what you wrote about the content through the pointer being flushed is true (all stores are equivalent to "release" and all loads are equivalent to "acquire"), but on ARM's weaker memory model I don't think it is we end up with different instructions when using atomics and I think those are necessary for the "happens before" guarantee.
I tried making a program to show this happening but did not succeed 😅. So maybe there's something I don't understand about this or the cache lines didn't line up.
If the shape has only one child, we check it lock-free without compromising thread safety. I haven't computed hard data as to how often that it the case, but we can assume that it's not too rare for shapes to have a single child that is often requested, typically when freezing and object.
0b2ebf0
to
3a49e76
Compare
@jhawthorn alright, I did what you suggested, hopefully this is correct now. |
@@ -499,13 +499,27 @@ get_next_shape_internal(rb_shape_t * shape, ID id, enum shape_type shape_type, b | |||
|
|||
*variation_created = false; | |||
|
|||
// Fast path: if the shape has a single child, we can check it without a lock | |||
struct rb_id_table *edges = RUBY_ATOMIC_PTR_LOAD(shape->edges); | |||
if (edges && SINGLE_CHILD_P(edges)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the case where edges
is NULL
we could also allocate the child atomically, and set it with CAS.
Not sure if it's really worth it though, because supposedly code running in ractors produce stable shapes, so write would become rare quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we could do that, an issue I think that would create though is that if two Ractors are trying to insert the exact same shape, we would end up creating two shapes for it, for one Ractor the CAS would fail (as is normal with this pattern), but then there'd be no ability to free the shape it had created, since they are never freed (though I suppose we could create a freelist specifically for this case).
If the shape has only one child, we check it lock-free without compromising thread safety.
I haven't computed hard data as to how often that it the case, but we can assume that it's not too rare for shapes to have a single child that is often requested, typically when freezing and object.