Skip to content

[DO NOT LAND] Experiment with using pops in the current interpreter #7762

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

kripken
Copy link
Member

@kripken kripken commented Jul 28, 2025

Uploading this just for future reference and discussion.

Our current interpreter is pretty efficient, it turns out: It interprets the IR in-place, without constructing any new IR, and while doing so it does this at each expression, e.g. on Binary

visitBinary(curr) {
  var left = interpret(curr->left);
  var right = interpret(curr->right);
  return left + right; // For binary, we need all the others too, - * / etc.
}

The point of this simplified pseudocode is that each execution of an expression ends up doing three things:

  • Call the right visit*() method, e.g., for a Binary we call visitBinary. (this does a switch on the expression id)
  • Get the children. (here this is done by doing curr->left etc.)
  • Execute the instruction on the children. (here this is done with the +)

What is efficient here is that we do a single switch on the expression id, and once we are in the proper visit*() method, we then have all the specific logic there: both finding the children, and executing after them.

This PR experiments with splitting these operations, using ChildVisitor to find the children - this does one switch - and then, once we have them, call the right visit*() - this does another switch - which then pops those values (called getChild() in this code). That is, this uses a value stack, like the wip new interpreter.

The benefit of this separation is that we can handle control flow in a generic place, no more if (breaking()) return, and this organization of code is a step to adding stack switching support (we can handle switching in that single generic place; we can save and restore the value stack; etc.).

However, this makes the interpreter 2x slower. I tried various ways to optimize it, but it does seem that fundamentally the current interpreter is just very efficient, and doing two switches on the expression class is just worse, in addition to the stack itself etc. (but the stack seems less of an issue). And I believe this experiment is a good model for the performance of the wip new interpreter @tlively , unless I am missing something.

Perhaps we can still avoid two switches. The first switch could find not only the children but also put the proper visit*() method on the stack, to be fetched and executed at the proper time. But that will still be an unpredictable indirect call, so I am not optimistic.

Anyhow, this is not the end of the world - while this would make Precompute 2x slower, the total execution time may not be that bad, it is just one pass among many - but I am still looking into further options here.

@tlively
Copy link
Member

tlively commented Jul 28, 2025

It's nice to see how much boilerplate can be removed by using an explicit stack, even in this minimal change to the existing interpreter. I think the 2x performance difference is more or less a worst-case bound, right? There's still some baggage here that a finished stack-based interpreter would not use, such as Flow and Literals. We could also experiment with different iterator designs and with tail calls, etc. to try to eke out more performance. I'm not looking at the profiles you're looking at, but I hope there's room for improvement.

@kripken
Copy link
Member Author

kripken commented Jul 28, 2025

I'm not sure if it's a worst-case bound, but it is what I saw on a handful of large/realistic wasm files.

Yes, maybe we can optimize a little more here, but as I said, the big slowdown is just the two switches, as best I can measure (I did things like see the cost of adding a value stack without handling using another switch). It's double the indirect calls compared to the current interpreter. For an interpreter, which does little real work in each iteration, I think an indirect call is just expensive.

kripken added a commit that referenced this pull request Aug 1, 2025
…eter (#7771)

It turns out we can implement suspend/resume on the current
interpreter with only a small amount of changes, basically using
the same idea as Asyncify: pause and resume structured control
flow by rewinding the stack in a "resume" mode that skips normal
execution. That is, this does not use a program counter (no goto),
nor continuation-passing style, and instead adds interpreter
support to unwind and rewind the stack. This is not the fastest
way to do things, but it is the simplest by far.

The basic idea is the same as in this 2019 blogpost:

https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html

This is quite efficient in the things we care about: suspend/
resume is slow, but normal execution hardly regresses, which is
important to keep the Precompute pass from getting slower.
While the more intrusive experiment #7762 made that pass 2x
slower, this only adds 10-15% overhead. The main reason is that
this pass keeps us down to a single indirect call per instruction,
while e.g. separating decoding and execution - to maintain a
normal value stack - ends up doing two indirect calls. The "trick"
in this PR is that, yes, we do need a value stack (that is the only
practical way to stash the values on the stack when suspending),
but we can populate that stack only when inside a coroutine
(when we might suspend). So we still use the normal way of
getting child instruction values - just calling `visit(curr->child)`
from the parent's `visitFoo()` method - and do not lose that
speed, but still have a stack of values when we need it.

(10-15% is still significant, but it is just on a single pass, so it
seems acceptable, and there might be ways to optimize this further.)

Notes:

* Flow now has a "suspendTag" property, which denotes the tag we are
  suspending, when we suspend.
* As part of this change, `callFunction` in the interpreter returns a
  Flow, so that we can propagate suspensions out of functions.
* For testing, this adds `assert_suspension` support in `wasm-shell`.
* For testing, add part of `cont.wast` from the test suite, and minor
  fixes for it in `wasm.cpp`.
* This only adds basic support: `resume_throw` and various other parts
  of the stack switching proposal are TODOs, but I think this PR does the
  hard parts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants