Skip to content

[Stack Switching] Add basic support for resume/suspend in the interpreter #7771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 128 commits into from
Aug 1, 2025

Conversation

kripken
Copy link
Member

@kripken kripken commented Jul 30, 2025

I realized that we can implement suspend/resume on the current
interpreter with only a small amount of changes, basically using
the same idea as Asyncify: pause and resume structured control
flow by rewinding the stack in a "resume" mode that skips normal
execution. That is, this does not use a program counter (no goto),
nor continuation-passing style, and instead adds interpreter
support to unwind and rewind the stack. This is not the fastest
way to do things, but it is the simplest by far.

The basic idea is the same as in this 2019 blogpost:

https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html

This is quite efficient in the things we care about: suspend/
resume is slow, but normal execution hardly regresses, which is
important to keep the Precompute pass from getting slower.
While the more intrusive experiment #7762 made that pass 2x
slower, this only adds 10-15% overhead. The main reason is that
this pass keeps us down to a single indirect call per instruction,
while e.g. separating decoding and execution - to maintain a
normal value stack - ends up doing two indirect calls. The "trick"
in this PR is that, yes, we do need a value stack (that is the only
practical way to stash the values on the stack when suspending),
but we can populate that stack only when inside a coroutine
(when we might suspend). So we still use the normal way of
getting child instruction values - just calling visit(curr->child)
from the parent's visitFoo() method - and do not lose that
speed, but still have a stack of values when we need it.

(10-15% is still significant, but it is just on a single pass, so it
seems acceptable, and there might be ways to optimize this further.)

Notes:

  • Flow now has a "suspendTag" property, which denotes the tag we are
    suspending, when we suspend.
  • As part of this change, callFunction in the interpreter returns a Flow,
    so that we can propagate suspensions out of functions.
  • For testing, this adds assert_suspension support in wasm-shell.
  • For testing, add part of cont.wast from the test suite, and minor
    fixes for it in wasm.cpp.
  • This only adds basic support: resume_throw and various other parts
    of the stack switching proposal are TODOs, but I think this PR does the
    hard parts.

@kripken kripken requested a review from tlively July 30, 2025 20:32
@kripken
Copy link
Member Author

kripken commented Jul 30, 2025

Out of curiosity I measured the overhead of maintaining the value stack, and it is 25%-30% or so. So coroutines that do not suspend/resume will run at around that speed, slightly slower than normal code, but still pretty reasonable. (Actual suspend and resume operations will be where things are actually slow.)

@kripken
Copy link
Member Author

kripken commented Jul 30, 2025

And the 10-15% overhead of this PR seems to come from

  1. the larger size of Flow (now supporting a suspend tag)
  2. changes to the main visit() method (even though the branch there is well-predicted if we never run a coroutine, it is a branch I guess), and
  3. other sources (perhaps the changes to control flow structures).

If we really wanted to reduce this, we could probably templatize the core interpreter on Flow and a new SuspendingFlow, and only modify visit() in the latter, etc., but perhaps that would not be worth the code complexity.

Comment on lines 1531 to 1533
// This will never be executed and the instruction will not be emitted.
// Model this with an uninhabitable cast type.
type = cont->type.with(NonNullable);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resume does not return a continuation, so this is not correct. It should return the result types given by the resumed continuation.

Copy link
Member Author

@kripken kripken Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ditto)

edit: oh, wait, this is something else, sorry

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is wrong, but perhaps I can leave it to a later followup that gets the upstream spec test passing? All this will be tested at that time fully, I think. (Changes in this PR are just to get some amount of spec tests passing, enough to get to interesting parts.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to tell whether things are intentionally or unintentionally wrong :) Maybe it would be worth getting validation working earlier rather than later? We should be able to validate the spec tests before we can execute them.

Copy link
Member Author

@kripken kripken Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the thing is that the spec test is not ordered in such a way... which would have been nice.

I am basically just going top-down in that test, to make sure I don't miss anything.

I promise we will pass it all after all followups land! 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this TODO, I'm not sure how to fix it. Yes, this should return the result of the continuation, but there is no continuation type here - it is null - so what can we do aside from emit an uninhabitable type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can take inspiration from CallRef, which has the same problem. When the call target is null, it makes any reference that appears in the existing type uninhabitable, but leaves any non-reference type as-is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, opened #7794 for that.

@tlively
Copy link
Member

tlively commented Jul 31, 2025

In general I'm not thrilled about the extra complexity here, although I understand that this direction is desirable because of the estimated performance benefits and smaller implementation effort compared to the proposed new interpreter.

I still think the new interpreter is worth working on in the medium to long term because of how much simpler it has the potential to be and because it would scale better to use cases like fuzzing threads, where we really want to be able to execute just one instruction at a time.

kripken and others added 3 commits July 31, 2025 13:10
Co-authored-by: Thomas Lively <tlively123@gmail.com>
Co-authored-by: Thomas Lively <tlively123@gmail.com>
@kripken
Copy link
Member Author

kripken commented Jul 31, 2025

Interesting about fuzzing threads, yeah, for that we might want a very different interpreter.

Otherwise, for stack switching I actually think this ended up surprisingly simple? It's possible the new interpreter would be simpler in some ways (like not needing restoredValuesMap) but it will be more complex in others (need a way to continue a walk of the IR "from the middle").

What might be significantly simpler than either could be to use C++20 coroutines, which would save not just execution but values on the stack, entirely automatically. But the refactoring to get there would be significant, I think... And also

  1. I'm not sure how fast it would be
  2. there are some questions about compiler support.

@tlively
Copy link
Member

tlively commented Jul 31, 2025

The complexity about being able to pick up execution in the middle of the IR should be entirely encapsulated inside a forward iterator over the IR, though. There are several ways of implementing such an iterator, including using C++20 coroutines as explored in #5447. The new interpreter would also look much, much closer to how the semantics are specified and would have much less boilerplate, so I think it would be much easier to maintain and contribute to. It would also save us from having to update arbitrary recursion limits when stack frames change size.

@kripken
Copy link
Member Author

kripken commented Jul 31, 2025

The complexity about being able to pick up execution in the middle of the IR should be entirely encapsulated inside a forward iterator over the IR, though. There are several ways of implementing such an iterator, including using C++20 coroutines as explored in #5447.

Agreed. But if we do not use C++20 coroutines for such an iterator, then I think it might end up being more complicated than this PR (the relevant parts of both, I mean; it would be more encapsulated, though, fair point).

The new interpreter would also look much, much closer to how the semantics are specified

Hmm, while closer to the wasm spec's semantics, it would be less of a match for Binaryen IR semantics, which are more structured. So I'm not sure which is best in this codebase.

and would have much less boilerplate, so I think it would be much easier to maintain and contribute to.

Agreed, avoiding the manual visiting logic would be nicer. (That is what I was hoping to achieve in #7762...)

It would also save us from having to update arbitrary recursion limits when stack frames change size.

Agreed.

Comment on lines 370 to 371
(i32.const 0)
(suspend $more)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well write this completely unnested:

i32.const 0
suspend $more
i32.eqz
suspend $more
i32.eqz
...

(suspend $more)
(i32.add
(i32.const 4)
(block (result i32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this block.

(block (result i32)
(suspend $more)
(i32.add
(block (result i32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

;; CHECK-NEXT: [LoggingExternalInterface logging 42]
;; CHECK-NEXT: [LoggingExternalInterface logging 3]
;; CHECK-NEXT: [LoggingExternalInterface logging 300]
(func $run-multi-locals (export "run-multi-locals")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move this up to be located next to the original test for a single local?

@kripken
Copy link
Member Author

kripken commented Aug 1, 2025

Tests refactored.

@kripken kripken merged commit adabf63 into WebAssembly:main Aug 1, 2025
16 checks passed
@kripken kripken deleted the resumey branch August 1, 2025 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants