[DO NOT LAND] Experiment with using pops in the current interpreter #7762
Conversation
It's nice to see how much boilerplate can be removed by using an explicit stack, even in this minimal change to the existing interpreter. I think the 2x performance difference is more or less a worst-case bound, right? There's still some baggage here that a finished stack-based interpreter would not use, such as …
I'm not sure if it's a worst-case bound, but it is what I saw on a handful of large/realistic wasm files. Yes, maybe we can optimize a little more here, but as I said, the big slowdown is just the two switches, as best I can measure (I did things like measure the cost of adding a value stack without also handling it using another switch). It's double the indirect calls compared to the current interpreter. For an interpreter, which does little real work in each iteration, I think an indirect call is just expensive.
…eter (#7771)

It turns out we can implement suspend/resume on the current interpreter with only a small amount of changes, basically using the same idea as Asyncify: pause and resume structured control flow by rewinding the stack in a "resume" mode that skips normal execution. That is, this does not use a program counter (no goto), nor continuation-passing style, and instead adds interpreter support to unwind and rewind the stack. This is not the fastest way to do things, but it is the simplest by far. The basic idea is the same as in this 2019 blogpost: https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html

This is quite efficient in the things we care about: suspend/resume is slow, but normal execution hardly regresses, which is important to keep the Precompute pass from getting slower. While the more intrusive experiment #7762 made that pass 2x slower, this only adds 10-15% overhead. The main reason is that this approach keeps us down to a single indirect call per instruction, while e.g. separating decoding and execution - to maintain a normal value stack - ends up doing two indirect calls.

The "trick" in this PR is that, yes, we do need a value stack (that is the only practical way to stash the values on the stack when suspending), but we can populate that stack only when inside a coroutine (when we might suspend). So we still use the normal way of getting child instruction values - just calling `visit(curr->child)` from the parent's `visitFoo()` method - and do not lose that speed, but still have a stack of values when we need it. (10-15% is still significant, but it is just on a single pass, so it seems acceptable, and there might be ways to optimize this further.)

Notes:

* Flow now has a "suspendTag" property, which denotes the tag we are suspending with, when we suspend.
* As part of this change, `callFunction` in the interpreter returns a Flow, so that we can propagate suspensions out of functions.
* For testing, this adds `assert_suspension` support in `wasm-shell`.
* For testing, add part of `cont.wast` from the test suite, and minor fixes for it in `wasm.cpp`.
* This only adds basic support: `resume_throw` and various other parts of the stack switching proposal are TODOs, but I think this PR does the hard parts.
Uploading this just for future reference and discussion.
Our current interpreter is pretty efficient, it turns out: it interprets the IR in-place, without constructing any new IR, and while doing so it does this at each expression, e.g. on a Binary. The point of this simplified pseudocode is that each execution of an expression ends up doing three things:

* Call the right `visit*()` method, e.g., for a Binary we call visitBinary (this does a switch on the expression id).
* Find the children (`curr->left` etc.).
* Execute the operation (`+`).

What is efficient here is that we do a single switch on the expression id, and once we are in the proper `visit*()` method, we then have all the specific logic there: both finding the children, and executing after them.

This PR experiments with splitting these operations, using ChildVisitor to find the children - this does one switch - and then, once we have them, calling the right `visit*()` - this does another switch - which then pops those values (called `getChild()` in this code). That is, this uses a value stack, like the wip new interpreter.

The benefit of this separation is that we can handle control flow in a generic place, no more `if (breaking()) return`, and this organization of code is a step to adding stack switching support (we can handle switching in that single generic place; we can save and restore the value stack; etc.).

However, this makes the interpreter 2x slower. I tried various ways to optimize it, but it does seem that fundamentally the current interpreter is just very efficient, and doing two switches on the expression class is just worse, in addition to the stack itself etc. (but the stack seems less of an issue). And I believe this experiment is a good model for the performance of the wip new interpreter @tlively, unless I am missing something.
Perhaps we can still avoid two switches. The first switch could find not only the children but also put the proper `visit*()` method on the stack, to be fetched and executed at the proper time. But that will still be an unpredictable indirect call, so I am not optimistic.

Anyhow, this is not the end of the world - while this would make Precompute 2x slower, the total execution time may not be that bad, it is just one pass among many - but I am still looking into further options here.