-
Notifications
You must be signed in to change notification settings - Fork 797
[Stack Switching] Add basic support for resume/suspend in the interpreter #7771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Out of curiosity I measured the overhead of maintaining the value stack, and it is 25%-30% or so. So coroutines that do not suspend/resume will run at around that speed, slightly slower than normal code, but still pretty reasonable. (Actual suspend and resume operations will be where things are actually slow.) |
And the 10-15% overhead of this PR seems to come from
If we really wanted to reduce this, we could probably templatize the core interpreter on Flow and a new SuspendingFlow, and only modify |
src/wasm/wasm.cpp
Outdated
// This will never be executed and the instruction will not be emitted. | ||
// Model this with an uninhabitable cast type. | ||
type = cont->type.with(NonNullable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resume
does not return a continuation, so this is not correct. It should return the result types given by the resumed continuation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(ditto)
edit: oh, wait, this is something else, sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this is wrong, but perhaps I can leave it to a later followup that gets the upstream spec test passing? All this will be tested at that time fully, I think. (Changes in this PR are just to get some amount of spec tests passing, enough to get to interesting parts.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's difficult to tell whether things are intentionally or unintentionally wrong :) Maybe it would be worth getting validation working earlier rather than later? We should be able to validate the spec tests before we can execute them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the thing is that the spec test is not ordered in such a way... which would have been nice.
I am basically just going top-down in that test, to make sure I don't miss anything.
I promise we will pass it all after all followups land! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coming back to this TODO, I'm not sure how to fix it. Yes, this should return the result of the continuation, but there is no continuation type here - it is null - so what can we do aside from emit an uninhabitable type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can take inspiration from CallRef
, which has the same problem. When the call target is null, it makes any reference that appears in the existing type uninhabitable, but leaves any non-reference type as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, opened #7794 for that.
In general I'm not thrilled about the extra complexity here, although I understand that this direction is desirable because of the estimated performance benefits and smaller implementation effort compared to the proposed new interpreter. I still think the new interpreter is worth working on in the medium to long term because of how much simpler it has the potential to be and because it would scale better to use cases like fuzzing threads, where we really want to be able to execute just one instruction at a time. |
Co-authored-by: Thomas Lively <tlively123@gmail.com>
Co-authored-by: Thomas Lively <tlively123@gmail.com>
Interesting about fuzzing threads, yeah, for that we might want a very different interpreter. Otherwise, for stack switching I actually think this ended up surprisingly simple? It's possible the new interpreter would be simpler in some ways (like not needing What might be significantly simpler than either could be to use C++20 coroutines, which would save not just execution but values on the stack, entirely automatically. But the refactoring to get there would be significant, I think... And also
|
The complexity about being able to pick up execution in the middle of the IR should be entirely encapsulated inside a forward iterator over the IR, though. There are several ways of implementing such an iterator, including using C++20 coroutines as explored in #5447. The new interpreter would also look much, much closer to how the semantics are specified and would have much less boilerplate, so I think it would be much easier to maintain and contribute to. It would also save us from having to update arbitrary recursion limits when stack frames change size. |
Agreed. But if we do not use C++20 coroutines for such an iterator, then I think it might end up being more complicated than this PR (the relevant parts of both, I mean; it would be more encapsulated, though, fair point).
Hmm, while closer to the wasm spec's semantics, it would be less of a match for Binaryen IR semantics, which are more structured. So I'm not sure which is best in this codebase.
Agreed, avoiding the manual visiting logic would be nicer. (That is what I was hoping to achieve in #7762...)
Agreed. |
test/lit/exec/cont_simple.wast
Outdated
(i32.const 0) | ||
(suspend $more) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well write this completely unnested:
i32.const 0
suspend $more
i32.eqz
suspend $more
i32.eqz
...
test/lit/exec/cont_simple.wast
Outdated
(suspend $more) | ||
(i32.add | ||
(i32.const 4) | ||
(block (result i32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this block.
test/lit/exec/cont_simple.wast
Outdated
(block (result i32) | ||
(suspend $more) | ||
(i32.add | ||
(block (result i32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
test/lit/exec/cont_simple.wast
Outdated
;; CHECK-NEXT: [LoggingExternalInterface logging 42] | ||
;; CHECK-NEXT: [LoggingExternalInterface logging 3] | ||
;; CHECK-NEXT: [LoggingExternalInterface logging 300] | ||
(func $run-multi-locals (export "run-multi-locals") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe move this up to be located next to the original test for a single local?
Tests refactored. |
I realized that we can implement suspend/resume on the current
interpreter with only a small amount of changes, basically using
the same idea as Asyncify: pause and resume structured control
flow by rewinding the stack in a "resume" mode that skips normal
execution. That is, this does not use a program counter (no goto),
nor continuation-passing style, and instead adds interpreter
support to unwind and rewind the stack. This is not the fastest
way to do things, but it is the simplest by far.
The basic idea is the same as in this 2019 blogpost:
https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html
This is quite efficient in the things we care about: suspend/
resume is slow, but normal execution hardly regresses, which is
important to keep the Precompute pass from getting slower.
While the more intrusive experiment #7762 made that pass 2x
slower, this only adds 10-15% overhead. The main reason is that
this pass keeps us down to a single indirect call per instruction,
while e.g. separating decoding and execution - to maintain a
normal value stack - ends up doing two indirect calls. The "trick"
in this PR is that, yes, we do need a value stack (that is the only
practical way to stash the values on the stack when suspending),
but we can populate that stack only when inside a coroutine
(when we might suspend). So we still use the normal way of
getting child instruction values - just calling
visit(curr->child)
from the parent's
visitFoo()
method - and do not lose thatspeed, but still have a stack of values when we need it.
(10-15% is still significant, but it is just on a single pass, so it
seems acceptable, and there might be ways to optimize this further.)
Notes:
suspending, when we suspend.
callFunction
in the interpreter returns a Flow,so that we can propagate suspensions out of functions.
assert_suspension
support inwasm-shell
.cont.wast
from the test suite, and minorfixes for it in
wasm.cpp
.resume_throw
and various other partsof the stack switching proposal are TODOs, but I think this PR does the
hard parts.