Description
Sometimes, we have a bitstring from which we want to extract certain bits. E.g. to extract the X bits from XXXX....XXXX, we can do:
(i & 0xF) | ((i >> 4) & 0xF0)
This has 4 operations and will take ~3 cycles, assuming the (i & 0xF) can run in parallel with the i >> 4 or the & 0xF0 (and not counting the mov). In assembly:
mov eax, edi   ; copy i
shr edi, 4     ; i >> 4
and eax, 15    ; i & 0xF
and edi, 240   ; (i >> 4) & 0xF0
or eax, edi    ; combine the two nibbles
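As a sanity check, here is the same extraction as plain C with a worked value (a minimal sketch; extract_nibbles is a hypothetical name used for illustration):

#include <assert.h>
#include <stdint.h>

/* Gathers bits 0-3 and 8-11 of i into the low byte, i.e. extracts the
   X bits of the XXXX....XXXX pattern (mask 0xF0F). */
static inline uint32_t extract_nibbles(uint32_t i) {
    return (i & 0xF) | ((i >> 4) & 0xF0);
}

int main(void) {
    /* 0xABC: A sits in bits 8-11, B in bits 4-7, C in bits 0-3,
       so the extracted result is 0xAC. */
    assert(extract_nibbles(0xABC) == 0xAC);
    return 0;
}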
Alternatively, we can condense this into a single pext instruction on x86_64 machines with bmi2.
mov eax, 3855   ; mask = 0xF0F
pext eax, edi, eax
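From C or C++, the same instruction is reachable without inline assembly through the BMI2 intrinsic _pext_u32 from immintrin.h (a sketch mirroring the function above; compile with -mbmi2 on GCC/Clang):

#include <immintrin.h>
#include <stdint.h>

static inline uint32_t extract_nibbles_pext(uint32_t i) {
    /* Gather the bits of i selected by the 0xF0F mask into the low bits. */
    return _pext_u32(i, 0xF0F);
}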
Even fancier would be detecting more advanced bit-concentration patterns. Here is some real Zig code which concentrates the highest bit of each byte:
const std = @import("std");
const assert = std.debug.assert;

fn swarMovMask(v: anytype) @TypeOf(v) {
    comptime assert(@divExact(@bitSizeOf(@TypeOf(v)), 8) <= 8);
    const ones: @TypeOf(v) = @bitCast(@as(@Vector(@sizeOf(@TypeOf(v)), u8), @splat(1)));
    const msb_mask = 0x80 * ones;
    // We are exploiting a multiplication as a shifter and adder, and the derivation of
    // this number is shown here as a comptime loop.
    // This trick is often generalizable to other problems too: https://stackoverflow.com/a/14547307
    comptime var mult: @TypeOf(v) = 0;
    comptime for (0..@sizeOf(@TypeOf(v))) |i| {
        mult |= @as(@TypeOf(v), 1) << (7 * i);
    };
    // Example with 32 bit integers:
    // We want to concentrate the upper bits of each byte into a single nibble.
    // Doing the gradeschool multiplication algorithm, we can see that each 1 bit
    // in the bottom multiplicand shifts the upper multiplicand, and then we add all these
    // shifted bitstrings together. (Note `.` represents a 0)
    //   a.......b.......c.......d.......
    // * ..........1......1......1......1
    // -------------------------------------------------------------------------
    //   a.......b.......c.......d.......
    //   .b.......c.......d..............
    //   ..c.......d.....................
    // + ...d............................
    // -------------------------------------------------------------------------
    //   abcd....bcd.....cd......d.......
    // Then we simply shift to the right by `32 - 4` (bitstring size minus the number of
    // relevant bits) to isolate the desired `abcd` bits in the least significant byte!
    // Inspired by: http://0x80.pl/articles/scalar-sse-movmask.html
    return (mult *% (v & msb_mask)) >> (@bitSizeOf(@TypeOf(v)) - @sizeOf(@TypeOf(v)));
}
For 32-bit integers, this becomes:
(0x204081 *% (v & 0x80808080)) >> 28
In assembly:
and edi, 0x80808080
imul eax, edi, 0x204081
shr eax, 28
Which we can reduce to:
mov eax, 0x80808080
pext eax, edi, eax
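As an end-to-end check, here is the 32-bit recipe as plain C, alongside the 64-bit constants the comptime loop produces (a sketch; the function names are made up, and the 64-bit multiplier 0x0002040810204081 with shift 64 - 8 = 56 follows from the same derivation):

#include <assert.h>
#include <stdint.h>

/* Concentrates the highest bit of each byte of v into the low nibble. */
static inline uint32_t swar_movmask_u32(uint32_t v) {
    /* Unsigned multiplication wraps, matching Zig's *% operator. */
    return ((v & 0x80808080u) * 0x204081u) >> 28;
}

/* Same trick for 8 bytes: the multiplier has a 1 every 7 bits. */
static inline uint64_t swar_movmask_u64(uint64_t v) {
    return ((v & 0x8080808080808080ull) * 0x0002040810204081ull) >> 56;
}

int main(void) {
    /* Bytes 0x80, 0x00, 0xFF, 0x7F have MSBs 1,0,1,0 -> 0b1010. */
    assert(swar_movmask_u32(0x8000FF7Fu) == 0xA);
    /* Only the top and bottom bytes have their MSB set -> 0b10000001. */
    assert(swar_movmask_u64(0xFF00000000000080ull) == 0x81);
    return 0;
}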
Note: for AMD CPUs, this optimization should only be applied on Zen 3 or newer. Before Zen 3, the pext instruction was implemented in microcode and was very slow.
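When the target CPU isn't fixed at build time, the fast path can be selected at runtime. A minimal sketch using GCC/Clang's __builtin_cpu_supports (note this only tests that BMI2 exists, not that pext is fast, so a production dispatcher would additionally want to rule out pre-Zen 3 AMD parts):

#include <immintrin.h>
#include <stdint.h>

/* Portable SWAR fallback from above. */
static uint32_t movmask_mul(uint32_t v) {
    return ((v & 0x80808080u) * 0x204081u) >> 28;
}

/* The target attribute lets GCC/Clang emit BMI2 code for this one
   function without enabling -mbmi2 for the whole translation unit. */
__attribute__((target("bmi2")))
static uint32_t movmask_pext(uint32_t v) {
    return _pext_u32(v, 0x80808080u);
}

static uint32_t movmask(uint32_t v) {
    return __builtin_cpu_supports("bmi2") ? movmask_pext(v) : movmask_mul(v);
}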