
[x86_64] Isolating bits can be optimized to a pext instruction #72088

Open
@Validark

Description

Sometimes we have a bitstring from which we want to extract certain bits. E.g., to extract the X bits from XXXX....XXXX (bits 0-3 and 8-11), we can do:

(i & 0xF) | ((i >> 4) & 0xF0)

This takes 4 operations and ~3 cycles, assuming the (i & 0xF) can run in parallel with the i >> 4 or the & 0xF0 (and not counting the mov).

        mov     eax, edi
        shr     edi, 4
        and     eax, 15
        and     edi, 240
        or      eax, edi

Alternatively, we can condense this into a single pext instruction on x86_64 machines with bmi2.

        mov     eax, 3855
        pext    eax, edi, eax

Even fancier would be detecting more advanced bit-concentration patterns. Here is some real Zig code which concentrates the highest bit of each byte:

fn swarMovMask(v: anytype) @TypeOf(v) {
    comptime assert(@divExact(@bitSizeOf(@TypeOf(v)), 8) <= 8);
    const ones: @TypeOf(v) = @bitCast(@as(@Vector(@sizeOf(@TypeOf(v)), u8), @splat(1)));
    const msb_mask = 0x80 * ones;

    // We are exploiting a multiplication as a shifter and adder, and the derivation of this number is
    // shown here as a comptime loop.
    // This trick is often generalizable to other problems too: https://stackoverflow.com/a/14547307
    comptime var mult: @TypeOf(v) = 0;
    comptime for (0..@sizeOf(@TypeOf(v))) |i| {
        mult |= @as(@TypeOf(v), 1) << (7 * i);
    };

    // Example with 32 bit integers:
    // We want to concentrate the upper bits of each byte into a single nibble.
    // Doing the gradeschool multiplication algorithm, we can see that each 1 bit
    // in the bottom multiplicand shifts the upper multiplicand, and then we add all these
    // shifted bitstrings together. (Note `.` represents a 0)
    //   a.......b.......c.......d.......
    // * ..........1......1......1......1
    // -------------------------------------------------------------------------
    //   a.......b.......c.......d.......
    //   .b.......c.......d..............
    //   ..c.......d.....................
    // + ...d............................
    // -------------------------------------------------------------------------
    //   abcd....bcd.....cd......d.......

    // Then we simply shift to the right by `32 - 4` (bitstring size minus the number of relevant bits) to isolate the desired `abcd` bits in the least significant byte!

    // Inspired by: http://0x80.pl/articles/scalar-sse-movmask.html
    return (mult *% (v & msb_mask)) >> (@bitSizeOf(@TypeOf(v)) - @sizeOf(@TypeOf(v)));
}

For 32 bit integers, this becomes:

(0x204081 *% (v & 0x80808080)) >> 28

In assembly:

        and     edi, 0x80808080
        imul    eax, edi, 0x204081
        shr     eax, 28

Which we can reduce to:

        mov     eax, 0x80808080
        pext    eax, edi, eax

Note: for AMD CPUs, this optimization should only be applied to Zen 3 or newer. Before that, the pext instruction was implemented via microcode and was very slow.
