
[x86_64] Isolating bits can be optimized to a pext instruction #72088

Open
@Validark

Description

Sometimes we have a bitstring from which we want to extract certain bits. E.g., to extract the X bits from XXXX....XXXX (bits 0-3 and 8-11), we can do:

(i & 0xF) | ((i >> 4) & 0xF0)

This takes 4 operations and ~3 cycles, assuming the (i & 0xF) can run in parallel with the i >> 4 or the & 0xF0 (and not counting the mov).

        mov     eax, edi
        shr     edi, 4
        and     eax, 15
        and     edi, 240
        or      eax, edi

Alternatively, we can condense this into a single pext instruction on x86_64 machines with bmi2.

        mov     eax, 3855
        pext    eax, edi, eax

Even fancier would be detecting more advanced bit-concentration patterns. Here is some real Zig code which concentrates the highest bit of each byte:

fn swarMovMask(v: anytype) @TypeOf(v) {
    comptime assert(@divExact(@bitSizeOf(@TypeOf(v)), 8) <= 8);
    const ones: @TypeOf(v) = @bitCast(@as(@Vector(@sizeOf(@TypeOf(v)), u8), @splat(1)));
    const msb_mask = 0x80 * ones;

    // We are exploiting a multiplication as a shifter and adder, and the derivation of this number is
    // shown here as a comptime loop.
    // This trick is often generalizable to other problems too: https://stackoverflow.com/a/14547307
    comptime var mult: @TypeOf(v) = 0;
    comptime for (0..@sizeOf(@TypeOf(v))) |i| {
        mult |= @as(@TypeOf(v), 1) << (7 * i);
    };

    // Example with 32 bit integers:
    // We want to concentrate the upper bits of each byte into a single nibble.
    // Doing the gradeschool multiplication algorithm, we can see that each 1 bit
    // in the bottom multiplicand shifts the upper multiplicand, and then we add all these
    // shifted bitstrings together. (Note `.` represents a 0)
    //   a.......b.......c.......d.......
    // * ..........1......1......1......1
    // -------------------------------------------------------------------------
    //   a.......b.......c.......d.......
    //   .b.......c.......d..............
    //   ..c.......d.....................
    // + ...d............................
    // -------------------------------------------------------------------------
    //   abcd....bcd.....cd......d.......

    // Then we simply shift to the right by `32 - 4` (bitstring size minus the number of relevant bits) to isolate the desired `abcd` bits in the least significant byte!

    // Inspired by: http://0x80.pl/articles/scalar-sse-movmask.html
    return (mult *% (v & msb_mask)) >> (@bitSizeOf(@TypeOf(v)) - @sizeOf(@TypeOf(v)));
}

For 32 bit integers, this becomes:

(0x204081 *% (v & 0x80808080)) >> 28

In assembly:

        and     edi, 0x80808080
        imul    eax, edi, 0x204081
        shr     eax, 28

Which we can reduce to:

        mov     eax, 0x80808080
        pext    eax, edi, eax

Note: for AMD CPUs, this optimization should only be applied to Zen 3 or newer. Before that, the pext instruction was implemented via microcode and was very slow.
