Tail call VM #17849

Draft: arnaud-lb wants to merge 1 commit into master

Conversation

@arnaud-lb commented Feb 18, 2025

This implements the technique described in https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, which addresses the issues described in http://lua-users.org/lists/lua-l/2011-02/msg00742.html. Python recently implemented this, which resulted in 9-15% performance improvements: https://blog.reverberate.org/2025/02/10/tail-call-updates.html.

It turns out that @dstogov already addressed these issues with a different technique, enabled when compiling with GCC, so this will not improve performance with that compiler, but it makes PHP on Clang as fast as on GCC.

Benchmarks

Zend/bench.php:

Benchmark 1: /tmp/gcc-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.006 s ±  0.002 s    [User: 0.984 s, System: 0.020 s]
  Range (min … max):    1.003 s …  1.008 s    10 runs
 
Benchmark 2: /tmp/clang-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.783 s ±  0.009 s    [User: 1.761 s, System: 0.019 s]
  Range (min … max):    1.771 s …  1.801 s    10 runs
 
Benchmark 3: /tmp/clang-tail/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
  Time (mean ± σ):      1.017 s ±  0.003 s    [User: 0.998 s, System: 0.018 s]
  Range (min … max):    1.014 s …  1.023 s    10 runs
 
Summary
  /tmp/gcc-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php ran
    1.01 ± 0.00 times faster than /tmp/clang-tail/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php
    1.77 ± 0.01 times faster than /tmp/clang-base/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 10 Zend/bench.php

PHP/Clang was 77% slower in this benchmark, now only 1% slower.

Symfony Demo:

gcc-base:    mean:  0.5064;  stddev:  0.0008;  diff:  +0.00%
clang-base:  mean:  0.5344;  stddev:  0.0006;  diff:  +5.53%
clang-tail:  mean:  0.5017;  stddev:  0.0008;  diff:  -0.94%

PHP/Clang was 5% slower in this benchmark; with this change it is now marginally faster (-0.94%) than the GCC baseline.

Current interpreter

The interpreter is generated by Zend/zend_vm_gen.php. The generator has multiple modes, but the default (and the only officially supported one) is the hybrid mode, which generates both a call-based interpreter and a GCC-specific interpreter. Which one is actually compiled depends on the compiler being used.

In the call-based interpreter, opcode handlers are separate functions, the next opline to execute is stored in execute_data, and execute_data is passed as an argument to the handlers:

void execute_ex(zend_execute_data *execute_data) {
    while (1) {
        int ret = execute_data->opline->handler(execute_data);
        if (ret != 0) {
            // leave interpreter
        }
    }
}

// example op handler
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data) {
    // load opline
    const zend_op *opline = execute_data->opline;

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    execute_data->opline++;
    return 0; // ZEND_VM_CONTINUE()
}

Handlers typically load execute_data->opline, execute the operation, update execute_data->opline, and return.

There is quite a lot of overhead: the call instruction pushes a return address onto the stack, the function saves/spills registers, etc. For example, the code of ZEND_INIT_FCALL_SPEC_CONST_HANDLER() starts with:

push   %rbp
push   %r15
push   %r14
push   %rbx
push   %rax

Also, opline needs to be loaded/stored from/to memory.

The GCC interpreter manages to eliminate the overhead. opline->handler is a computed-goto target, which calls the actual handler. Hot handlers are inlined, FP/IP (execute_data/opline) are register variables, handlers take no arguments and have no return value:

void execute_ex() {
    goto *opline->handler; // computed goto
    ZEND_INIT_FCALL_SPEC_CONST_LABEL:
        ZEND_INIT_FCALL_SPEC_CONST_HANDLER(); // inlined
        goto *opline->handler; // computed goto
    ... (other handlers)
    ZEND_RETURN:
        // leave interpreter
}

void always_inline ZEND_INIT_FCALL_SPEC_CONST_HANDLER(void) {
    // opline is already in a register
    
    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    opline++;
    return;
}

Changes

Here I add a variation of the call-based interpreter, enabled when using clang-19:

  • execute_data and opline are passed as op handler arguments, so they are always in registers unless they are spilled on the stack
  • handlers tail call the next opline handler: function call overhead is eliminated
  • handlers use the preserve_none calling convention: reduces register save/spills.

void execute_ex(zend_execute_data *execute_data) {
    execute_data->opline->handler(execute_data, execute_data->opline);
    // leave interpreter
}

__attribute__((preserve_none))
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline) {
    // opline is already loaded

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    opline++;
    __attribute__((musttail)) return opline->handler(execute_data, opline);
}

The musttail attribute is used to force tail calling.
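
As an aside, here is a minimal, self-contained sketch of the same dispatch pattern outside of PHP (hypothetical toy code, not part of this PR; it should compile with clang, which supports musttail since clang 13):

// Toy musttail-based dispatch: each handler tail calls the next
// instruction's handler, so dispatch is a plain jump and the C stack
// does not grow.
#include <stdio.h>

struct op;
typedef long (*handler_t)(const struct op *ip, long acc);

struct op {
    handler_t handler;
    long      arg;
};

static long op_add(const struct op *ip, long acc) {
    acc += ip->arg;
    ip++;
    __attribute__((musttail)) return ip->handler(ip, acc);
}

static long op_halt(const struct op *ip, long acc) {
    (void) ip;
    return acc;
}

int main(void) {
    const struct op program[] = {{op_add, 40}, {op_add, 2}, {op_halt, 0}};
    printf("%ld\n", program[0].handler(program, 0)); // prints 42
    return 0;
}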

Unfortunately, musttail rejects calls to functions whose signature is not compatible with the caller's, so it's not possible to tail call VM helpers that have extra parameters. Instead, we use a trampoline when calling these: the helper returns a struct{opline,handler} (in two registers), which the caller then tail calls. Since helpers always return (unless they call other helpers), the stack doesn't grow indefinitely:

    // ZEND_VM_DISPATCH_TO_HELPER(zend_cannot_pass_by_ref_helper, _arg_num, arg_num, _arg, arg)
    zend_vm_trampoline t = zend_cannot_pass_by_ref_helper(arg_num, arg, execute_data, opline);
    __attribute__((musttail)) return t.handler(execute_data, t.opline);
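
For illustration, the trampoline return value could be shaped roughly like this (a sketch; the type and field names are assumptions, not necessarily what the PR uses):

// Hypothetical sketch of the trampoline value. Two pointer-sized fields, so
// it is returned in two registers on x86-64 (rax/rdx) and aarch64 (x0/x1).
// The real handler type would additionally carry the preserve_none
// calling convention.
typedef int (*zend_vm_opcode_handler_t)(zend_execute_data *execute_data, const zend_op *opline);

typedef struct _zend_vm_trampoline {
    const zend_op            *opline;  // next opline to execute
    zend_vm_opcode_handler_t  handler; // its handler, tail called by the caller
} zend_vm_trampoline;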

I introduce a ZEND_VM_DISPATCH() macro that is used by ZEND_VM_NEXT_OPCODE() and related macros. This macro tail calls the next opline's handler by default. In VM helpers with extra parameters, ZEND_VM_DISPATCH() is redefined to return the trampoline value instead:

#undef  ZEND_VM_DISPATCH
#define ZEND_VM_DISPATCH ZEND_VM_DISPATCH_NOTAIL
zend_vm_trampoline zend_cannot_pass_by_ref_helper(arg_num, arg, execute_data, opline) {
   ...
}
#undef  ZEND_VM_DISPATCH
#define ZEND_VM_DISPATCH ZEND_VM_DISPATCH_DEFAULT
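
For illustration, the two dispatch flavors could expand to something like this (a sketch with simplified argument lists; the macros actually generated by zend_vm_gen.php differ):

// Hypothetical sketch only. Default flavor: tail call the next handler.
#define ZEND_VM_DISPATCH_DEFAULT(_opline) \
    __attribute__((musttail)) return (_opline)->handler(execute_data, (_opline))

// In helpers with extra parameters: hand the target back to the caller,
// which performs the musttail call itself (the trampoline shown earlier).
#define ZEND_VM_DISPATCH_NOTAIL(_opline) \
    return (zend_vm_trampoline){ (_opline), (_opline)->handler }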

Caveats

  • The ABI of __attribute__((preserve_none)) is not stable, so we may not be able to use it in exported functions. This has implications for the JIT and user opcode handlers. We might need to generate wrappers with a stable calling convention (see the sketch after this list).
  • There are now 3 interpreters to test. It may be possible to enable some of the changes by default (e.g. passing opline as an argument and __attribute__((preserve_none))) to reduce the differences between the call-based interpreter and the Clang one.
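
A rough sketch of such a wrapper (hypothetical; the names, and whether wrappers are generated at all, are assumptions):

// Internal handler: preserve_none, only reached via musttail dispatch.
__attribute__((preserve_none))
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline);

// Generated wrapper with the default C calling convention, safe to expose
// to the JIT and to user opcode handlers. The compiler emits a regular
// (non-tail) call here so that callee-saved registers are restored for the
// wrapper's caller.
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER_ABI(zend_execute_data *execute_data, const zend_op *opline) {
    return ZEND_INIT_FCALL_SPEC_CONST_HANDLER(execute_data, opline);
}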

TODO

  • JIT support
  • Measure the impact of passing opline as argument, without other changes. Maybe do that by default? (Pass opline as argument to opcode handlers in CALL VM #17952)
  • Measure the impact of __attribute__((preserve_none)), without other changes
  • Measure/test on aarch64, x86 (not sure it's supported)

Future scope:

  • Tweak preserve_none / preserve_most / slow paths

PRs

I'm splitting this into smaller PRs:

@@ -313,6 +313,18 @@ char *alloca();
# define ZEND_FASTCALL
#endif

#if __has_attribute(preserve_none) && !defined(__SANITIZE_ADDRESS__)
arnaud-lb (Member Author):
There is an incompatibility between preserve_none and ASAN, which crashes Clang. I will report the issue.

@@ -8212,9 +8212,9 @@ ZEND_VM_HANDLER(150, ZEND_USER_OPCODE, ANY, ANY)
case ZEND_USER_OPCODE_LEAVE:
ZEND_VM_LEAVE();
case ZEND_USER_OPCODE_DISPATCH:
- ZEND_VM_DISPATCH(opline->opcode, opline);
+ ZEND_VM_DISPATCH_OPCODE(opline->opcode, opline);
arnaud-lb (Member Author):
Renamed this rarely used macro so I could re-use its name

@@ -1,3 +1,5 @@
#include "Zend/zend_vm_opcodes.h"
arnaud-lb (Member Author):
This makes language servers / IDEs happy when viewing zend_vm_execute.h

Comment on lines +2426 to +2427
$str .= "#include <main/php_config.h>\n";
$str .= "#include \"Zend/zend_portability.h\"\n";
arnaud-lb (Member Author):
This makes language servers / IDEs happy when viewing zend_vm_opcodes.h

@dstogov commented Feb 18, 2025

Interesting work! I suppose this will require special support for JIT.

@arnaud-lb (Member Author) commented:
Yes, this does require some changes to the JIT to accommodate the new opcode handler signature and the way FP/IP are passed around. I plan to implement them unless there are major issues with the current approach.

The fact that preserve_none is an unstable ABI will complicate things a bit. Some possible solutions I have in mind are:

  • Generate wrappers with a stable ABI for each opcode handler, and use that in opcache/jit
  • Enforce that opcache must be compiled with the same clang version as the php binary
  • Move opcache to Zend/ and embed it in the php binary

The second one seems reasonable to me.

@@ -21,6 +21,9 @@
#ifndef ZEND_VM_OPCODES_H
#define ZEND_VM_OPCODES_H

#include <main/php_config.h>
Is it possible to avoid this dependency on main?

@cmb69 commented Feb 18, 2025

FWIW: feature request to support guaranteed tail calls for MSVC.

@dstogov commented Feb 19, 2025

> Generate wrappers with a stable ABI for each opcode handler, and use that in opcache/jit

The HYBRID VM generates two handlers for each opcode (a C function with the standard ABI + a non-standard GOTO one). The JIT uses one or the other as suitable. Technically, a tail call does the same thing as a GOTO, so the same approach might work.

Clang doesn't support global register variables. LLVM may achieve a similar thing using a custom calling convention that pins arguments to registers (this technique is used for Haskell, Erlang, HHVM ...). Unfortunately, I didn't find a way to introduce a new calling convention without patching LLVM (cool OOP style). Using such conventions from Clang was also problematic. That was a long time ago, and maybe something has changed.

@dstogov commented Feb 19, 2025

BTW, LLVM/Clang should support local register variables, so maybe the GOTO and HYBRID VMs could be adopted.

arnaud-lb added a commit that referenced this pull request Apr 15, 2025
This changes the signature of opcode handlers in the CALL VM so that the opline
is passed directly via arguments. This reduces the number of memory operations
on EX(opline), and makes the CALL VM considerably faster.

Additionally, this unifies the CALL and HYBRID VMs a bit, as EX(opline) is now
handled in the same way in both VMs.

This is a part of GH-17849.

Currently we have two VMs:

 * HYBRID: Used when compiling with GCC. execute_data and opline are global
   register variables
 * CALL: Used when compiling with something else. execute_data is passed as
   opcode handler arg, but opline is passed via execute_data->opline
   (EX(opline)).

The CALL VM looks like this:

    while (1) {
        ret = execute_data->opline->handler(execute_data);
        if (UNEXPECTED(ret != 0)) {
            if (ret > 0) { // returned by ZEND_VM_ENTER() / ZEND_VM_LEAVE()
                execute_data = EG(current_execute_data);
            } else {       // returned by ZEND_VM_RETURN()
                return;
            }
        }
    }

    // example op handler
    int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data) {
        // load opline
        const zend_op *opline = execute_data->opline;

        // instruction execution

        // dispatch
        // ZEND_VM_NEXT_OPCODE():
        execute_data->opline++;
        return 0; // ZEND_VM_CONTINUE()
    }

Opcode handlers return a positive value to signal that the loop must load a
new execute_data from EG(current_execute_data), typically when entering
or leaving a function.

Here I make the following changes:

 * Pass opline as opcode handler argument
 * Return next opline from opcode handlers
 * ZEND_VM_ENTER / ZEND_VM_LEAVE return opline|(1<<0) to signal that
   execute_data must be reloaded from EG(current_execute_data)

This gives us:

    while (1) {
        opline = opline->handler(execute_data, opline);
        if (UNEXPECTED((uintptr_t) opline & ZEND_VM_ENTER_BIT)) {
            opline = (const zend_op *) ((uintptr_t) opline & ~ZEND_VM_ENTER_BIT);
            if (opline != 0) { // ZEND_VM_ENTER() / ZEND_VM_LEAVE()
                execute_data = EG(current_execute_data);
            } else {           // ZEND_VM_RETURN()
                return;
            }
        }
    }

    // example op handler
    const zend_op * ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline) {
        // opline already loaded

        // instruction execution

        // dispatch
        // ZEND_VM_NEXT_OPCODE():
        return ++opline;
    }
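
For illustration, the tagging could be expressed with helpers like these (hypothetical names; not the macros from the actual commit):

    /* Hypothetical helpers illustrating the pointer tagging above. */
    #define ZEND_VM_ENTER_BIT ((uintptr_t) 1)

    /* Handler side: tag the opline so the loop reloads execute_data. */
    #define ZEND_VM_TAG_OPLINE(op) \
        ((const zend_op *) ((uintptr_t) (op) | ZEND_VM_ENTER_BIT))

    /* ZEND_VM_RETURN(): tag bit only, no opline -> leave execute_ex(). */
    #define ZEND_VM_RETURN_TAG ((const zend_op *) ZEND_VM_ENTER_BIT)

    /* Loop side: strip the tag before dispatching again. */
    #define ZEND_VM_UNTAG_OPLINE(op) \
        ((const zend_op *) ((uintptr_t) (op) & ~ZEND_VM_ENTER_BIT))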

bench.php is 23% faster on Linux / x86_64, 18% faster on MacOS / M1.

Symfony Demo is 2.8% faster.

When using the HYBRID VM, JIT'ed code stores execute_data/opline in two fixed
callee-saved registers and rarely touches EX(opline), just like the VM.

Since the registers are callee-saved, the JIT'ed code doesn't have to
save them before calling other functions, and can assume they always
contain execute_data/opline. The code also avoids saving/restoring them in
prologue/epilogue, as execute_ex takes care of that (JIT'ed code is called
exclusively from there).

The CALL VM can now use a fixed register for execute_data/opline as well, but
we can't rely on execute_ex to save the registers for us as it may use these
registers itself. So we have to save/restore the two registers in JIT'ed code
prologue/epilogue.

Closes GH-17952