[DO NOT LAND] Experiment with using pops in the current interpreter #7762
Conversation
It's nice to see how much boilerplate can be removed by using an explicit stack, even in this minimal change to the existing interpreter. I think the 2x performance difference is more or less a worst-case bound, right? There's still some baggage here that a finished stack-based interpreter would not use, such as …
I'm not sure if it's a worst-case bound, but it is what I saw on a handful of large/realistic wasm files. Yes, maybe we can optimize a little more here, but as I said, the big slowdown is just the two switches, as best I can measure (I did things like measure the cost of adding a value stack without also handling it using another switch). It's double the indirect calls compared to the current interpreter. For an interpreter, which does little real work in each iteration, I think an indirect call is just expensive.
…eter (#7771)

It turns out we can implement suspend/resume on the current interpreter with only a small amount of changes, basically using the same idea as Asyncify: pause and resume structured control flow by rewinding the stack in a "resume" mode that skips normal execution. That is, this does not use a program counter (no goto), nor continuation-passing style, and instead adds interpreter support to unwind and rewind the stack. This is not the fastest way to do things, but it is the simplest by far. The basic idea is the same as in this 2019 blogpost: https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html

This is quite efficient in the things we care about: suspend/resume is slow, but normal execution hardly regresses, which is important to keep the Precompute pass from getting slower. While the more intrusive experiment #7762 made that pass 2x slower, this only adds 10-15% overhead. The main reason is that this approach keeps us down to a single indirect call per instruction, while e.g. separating decoding and execution - to maintain a normal value stack - ends up doing two indirect calls.

The "trick" in this PR is that, yes, we do need a value stack (that is the only practical way to stash the values on the stack when suspending), but we can populate that stack only when inside a coroutine (when we might suspend). So we still use the normal way of getting child instruction values - just calling `visit(curr->child)` from the parent's `visitFoo()` method - and do not lose that speed, but still have a stack of values when we need it. (10-15% is still significant, but it is just on a single pass, so it seems acceptable, and there might be ways to optimize this further.)

Notes:

* Flow now has a "suspendTag" property, which denotes the tag we are suspending with, when we suspend.
* As part of this change, `callFunction` in the interpreter returns a Flow, so that we can propagate suspensions out of functions.
* For testing, this adds `assert_suspension` support in `wasm-shell`.
* For testing, add part of `cont.wast` from the test suite, and minor fixes for it in `wasm.cpp`.
* This only adds basic support: `resume_throw` and various other parts of the stack switching proposal are TODOs, but I think this PR does the hard parts.
Uploading this just for future reference and discussion.
Our current interpreter is pretty efficient, it turns out: it interprets the IR in-place, without constructing any new IR, and while doing so it does this at each expression, e.g. on a Binary. The point of this simplified pseudocode is that each execution of an expression ends up doing three things:

* Call the right `visit*()` method, e.g., for a Binary we call visitBinary (this does a switch on the expression id).
* Find the children (`curr->left` etc.).
* Execute the operation (`+`).

What is efficient here is that we do a single switch on the expression id, and once we are in the proper `visit*()` method, we then have all the specific logic there: both finding the children, and executing after them.

This PR experiments with splitting these operations, using ChildVisitor to find the children - this does one switch - and then, once we have them, calling the right `visit*()` - this does another switch - which then pops those values (called `getChild()` in this code). That is, this uses a value stack, like the wip new interpreter.

The benefit of this separation is that we can handle control flow in a generic place, no more `if (breaking()) return`, and this organization of code is a step to adding stack switching support (we can handle switching in that single generic place; we can save and restore the value stack; etc.).

However, this makes the interpreter 2x slower. I tried various ways to optimize it, but it does seem that fundamentally the current interpreter is just very efficient, and doing two switches on the expression class is just worse, in addition to the stack itself etc. (but the stack seems less of an issue). And I believe this experiment is a good model for the performance of the wip new interpreter @tlively, unless I am missing something.
Perhaps we can still avoid two switches. The first switch could find not only the children but also put the proper `visit*()` method on the stack, to be fetched and executed at the proper time. But that will still be an unpredictable indirect call, so I am not optimistic.

Anyhow, this is not the end of the world - while this would make Precompute 2x slower, the total execution time may not be that bad, it is just one pass among many - but I am still looking into further options here.