Skip to content

Numerical Computing Performance Guide

Sergey Kaplun edited this page Sep 7, 2023 · 1 revision

WARNING: Tarantool's developers disclaimer: This wiki page is a slightly modified copy of the missing LuaJIT wiki page taken from web archive. It may contain outdated content, use it with caution.

This is a mailing list post by Mike Pall.

The following things are needed to get the best speed out of numerical computations with LuaJIT (in order of importance):

  • Reduce the number of unbiased/unpredictable branches.

    • Heavily biased branches (>95% in one direction) are fine.
    • Prefer branch-free algorithms.
      • Use math.min() and math.max().
      • Use bit.*, e.g., for conditional index computations.
  • Using FFI data structures:

    • Use int32_t, avoid uint32_t data types.
    • Use double, avoid float data types.
    • Metamethods are fine, but don't overuse them.
  • Calling C functions via the FFI:

    • Avoid calling trivial functions, better rewrite them in Lua.
    • Avoid callbacks -- use pull-style APIs (read/write) and iterators instead.
  • Use plain for i = start, stop, step do ... end loops.

    • Prefer plain array indexing, e.g., a[i+2].
    • Avoid pointer arithmetic.
  • Find the right balance for unrolling.

    • Avoid inner loops with a low iteration count (< 10).
    • Only unroll loops if the loop body doesn't have too many instructions.
    • Consider using templates instead of hand-unrolling (see GSL Shell).
    • You may have to experiment a bit.
  • Define and call only local (!) functions within a module.

  • Cache often-used functions from other modules in upvalues.

    • E.g. local sin = math.sin ... local function foo() return 2 * sin(x) end
    • Don't do this for FFI C functions, cache the namespace instead, e.g., local lib = ffi.load("lib").
  • Avoid inventing your own dispatch mechanisms.

    • Prefer to use built-in mechanisms, e.g., metamethods.
  • Do not try to second-guess the JIT compiler.

    • It's perfectly ok to write z = x[a+b] + y[a+b].
    • Do not try CSE (Common Subexpression Elimination) by hand, e.g., local c = a + b.
    • It may become detrimental if the lifetime of the temporary is longer than needed. If the compiler cannot deduce that it's dead, then the useless temporary will block a register or stack slot and/or it needs to be stored on the Lua stack.
    • Duplicate expressions involving basic arithmetic operators that are relatively close to each other (and likely in the same trace) should not be manually CSEd. Loads only need to be manually hoisted, if alias analysis is likely to fail.
    • It's perfectly ok to write a[i][j] = a[i][j] * a[i][j+1].
    • Do not try to cache partial FFI struct/array references (e.g., a[i]) unless they are long-lived (e.g., in a big loop).
    • There are quite a few "easy" optimizations where the compiler is in a better position to perform them. Better focus on the difficult things, like algorithmic improvements.
  • Be careful with aliasing, especially when using multiple arrays.

    • LuaJIT uses strict type-based disambiguation, but there are limits to this due to C99 conformance.
    • E.g. in x[i] = a[i] + c[i]; y[i] = a[i] + d[i] the load of a[i] needs to be done twice because x could alias a. It does make sense to use do local t = a[i] ... end here.
  • Reduce the number of live temporary variables.

    • Best to initialize on the definition, e.g., local y = f(x)
      • Yes, this means you should interleave this with other code.
      • Do not hoist variable definitions to the start of a function -- Lua is neither JavaScript nor K&R C.
    • Use do local y = f(x) ... end to bound variable lifetimes.
  • Do not intersperse expensive or uncompiled operations.

    • print() is not compiled, use io.write().
    • E.g. avoid assert(type(x) == "number", "x is a "..mytype(x)")
      • The problem is not the assert() or the condition (basically free). The problem is the string concatenation, which has to be executed every time, even if the assertion never fails!
    • Watch the output of -jv and -jdump.

You need to take all of these factors into account before deciding on a certain algorithm. An advanced algorithm that's fast in theory may be slower than a simpler algorithm if the simpler algorithm has much fewer unbiased branches.

Developer Guidelines ↗

Architecture

How To ...?

Recipes

Upgrade instructions

Useful links

Old discussions

Personal pages

Clone this wiki locally