add pdep/pext compress implementation #2242

Validark · 2024-08-29T04:18:51Z

Had this idea, but unfortunately I can't get it to compile so I'm a bit stuck.

I get a couple compile errors that look like this:

In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/x86gprintrin.h:43,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/x86intrin.h:27,
                 from /home/niles/Documents/github/simdjson/include/simdjson/haswell/intrinsics.h:12,
                 from /home/niles/Documents/github/simdjson/include/simdjson/haswell/begin.h:4,
                 from /home/niles/Documents/github/simdjson/include/simdjson/haswell.h:4,
                 from /home/niles/Documents/github/simdjson/src/haswell.cpp:8,
                 from /home/niles/Documents/github/simdjson/src/simdjson.cpp:27:
/usr/lib/gcc/x86_64-linux-gnu/11/include/bmi2intrin.h: In member function ‘void simdjson::haswell::{anonymous}::simd::base8_numeric<T>::compress(uint32_t, L*) const [with L = unsigned char; T = unsigned char]’:
/usr/lib/gcc/x86_64-linux-gnu/11/include/bmi2intrin.h:76:1: error: inlining failed in call to ‘always_inline’ ‘long long unsigned int _pext_u64(long long unsigned int, long long unsigned int)’: target specific option mismatch
   76 | _pext_u64 (unsigned long long __X, unsigned long long __Y)
      | ^~~~~~~~~
In file included from /home/niles/Documents/github/simdjson/include/simdjson/haswell/begin.h:13,
                 from /home/niles/Documents/github/simdjson/include/simdjson/haswell.h:4,
                 from /home/niles/Documents/github/simdjson/src/haswell.cpp:8,
                 from /home/niles/Documents/github/simdjson/src/simdjson.cpp:27:
/home/niles/Documents/github/simdjson/include/simdjson/haswell/simd.h:139:39: note: called from here
  139 |   uint64_t wanted_indices2 = _pext_u64(iota, expanded_mask2);
      |                              ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~

How do I get this to compile?
What is the right way to implement this feature? Should HAS_FAST_PDEP_AND_PEXT become runtime-dispatched code?

Validark · 2024-08-29T04:21:44Z

Maybe needed? https://stackoverflow.com/questions/466377/how-to-detect-what-cpu-is-being-used-during-runtime

lemire · 2024-08-29T21:18:17Z

We do not currently target BMI2, see this line...

simdjson/include/simdjson/haswell/begin.h

Line 7 in e341c8b

SIMDJSON_TARGET_REGION("avx2,bmi,pclmul,lzcnt,popcnt")

We do not do so because we don't use BMI2 in the haswell kernel.

Note that if we are going to use pdep/pext, we are going to have to omit pre-Zen3 AMD processors were these instructions are super slow. It may require an entirely new kernel which is a bit complicated.

It is entirely safe to use pdep/pext in the icelake kernel however.

Validark · 2024-08-30T07:42:48Z

Why not target BMI2 for Haswell? Even if you want to avoid PDEP/PEXT because of AMD, don't you still think the other instructions are nice? Does this mean that the compiler can't insert BMI2 instructions at all or does it just mean that you don't use BMI2 intrinsics?

lemire · 2024-08-30T12:21:33Z

Why not target BMI2 for Haswell? Even if you want to avoid PDEP/PEXT because of AMD, don't you still think the other instructions are nice? Does this mean that the compiler can't insert BMI2 instructions at all or does it just mean that you don't use BMI2 intrinsics?

Currently, GCC and LLVM won't use BMI2 in the haswell kernel. The reason we do not do it is because we do not expect benefits. Of course, that's an empirical issue. I will investigate later. :-)

lemire · 2024-08-30T12:25:05Z

Note that the initial versions of simdjson used BMI2 (pdep/pext). At least the way we were using it was not clearly beneficial on Intel processors, and, at the time, quite bad for AMD processors. So we progressively removed it.

lemire · 2024-08-30T12:25:19Z

(We do use BMI2 in the icelake kernel.)

Validark · 2024-08-30T16:58:17Z

Well, according to Agner Fog's instruction tables, BZHI has 1 cycle latency and a reciprocal throughput of 0.25 (i.e. you can start up to 4 per cycle if the ports are available) for Zen 1 and Zen 2. For every other machine, it's 1 cycle latency with 0.5 reciprocal throughput. uops.info confirms this, although it includes the relatively recent Alder Lake, which appears to have regressed its performance to 3 cycles of latency and a throughput of 1.

(I assume you know what reciprocal throughput is, but honestly I forget which is which sometimes)

BZHI can be called directly through an intrinsic due to its better handling of 64 as a second input (if the 2nd input has a bottom byte of 64-255, i.e. 0x40-0xff, it will return the first input unchanged), or it can be generated by the compiler when the idiom is used, although obviously that code has undefined behavior when index is 64 or greater, so for many non-BMI2-optimized builds it would likely truncate the second input to just the lower 6 bits (not that Clang would warn you about that by default, but don't worry, zig cc has better defaults).

The other instructions MULX, RORX, SARX, SHRX, SHLX are just special versions of normal operations we already had support for, but these don't affect flags and have a separate destination register from the two input operands, so these are potential wins that the compiler can give you for free. (although they can't use immediates)

So it seems to me the only drawback of BMI2 would be pre-Zen3 AMD processors being really slow at PDEP/PEXT, but I don't think compilers know how to generate those automatically [1, 2, 3], nor would they for pre-Zen3 AMD, if their cost model was correct. (Update: I just checked, PDEP/PEXT are not emitted at all in my build of simdjson without the intrinsics asking for them)

So I don't really see the point of disabling BMI2. Just because the custom PDEP/PEXT routines you had before didn't work out doesn't mean it's bad to enable BMI2 as an option for compilers.

Validark · 2024-08-30T18:09:26Z

Update: I added bmi2 to that haswell target like you said. This now compiles. Is it faster? Well, maybe. I can't really tell by just running it on the twitter.json benchmark and doing eyeball statistics (definition). However, according to LLVM's cost model, my version is faster [haswell, znver3] (I am running a Zen 3 machine). It also means we don't have to include 2576 bytes for the 3 tables, so that seems like a win. However, I am not sure how you would prefer to account for the AMD problem [znver2]. Let me know what you think.

lemire · 2024-08-30T21:14:54Z

So I don't really see the point of disabling BMI2

It is not enabled by default. So we are not disabling it. However, it is interesting to check whether enabling it would be beneficial, and I plan to investigate. As your analysis suggests, the important instruction is BZHI.

It also means we don't have to include 2576 bytes for the 3 tables,

We need the tables because we can't assume that BMI2 is present.

lemire · 2024-08-30T21:17:08Z

@Validark

This now compiles. Is it faster?

Thanks. I will review and assess.

We might introduce a kernel between icelake and haswell if the performance gains are sufficiently clear.

Of course, a Zen 4 or Zen 5 will use the icelake kernel (where we do use pdep and pext).

Validark · 2024-08-30T21:38:21Z

However, it is interesting to check whether enabling it would be beneficial, and I plan to investigate. As your analysis suggests, the important instruction is BZHI.

My message was an attempt to explain everything that is included in BMI2, not say which instructions are most important to simdjson. Based on a objdump, here are the number of occurrences for each instruction:

instruction	occurrences
`PDEP`	0
`PEXT`	0
`BZHI`	0
`MULX`	10
`RORX`	0
`SARX`	0
`SHRX`	12
`SHLX`	12

They aren't used much, but they do appear in simdjson. And they can eliminate unnecessary internal flag register allocation and possibly eliminate an unnecessary mov instruction as well.

So I don't really see the point of disabling BMI2

It is not enabled by default. So we are not disabling it.

It also means we don't have to include 2576 bytes for the 3 tables,

We need the tables because we can't assume that BMI2 is present.

Haswell has BMI2 though, right? So you can assume that BMI2 is present. LLVM gives you BMI2 for the Haswell cpu target.

lemire · 2024-08-31T03:17:31Z

They aren't used much, but they do appear in simdjson.

It is an empirical question. By how much does enabling BMI2 in LLVM or GCC improves performance. As I wrote, I plan to review this issue experimentally.

Haswell has BMI2 though, right? So you can assume that BMI2 is present.

We do not assume that the processor supports the equivalent of Haswell. This is determined at runtime by CPU detection.

Validark · 2024-08-31T08:27:03Z

They aren't used much, but they do appear in simdjson.

It is an empirical question. By how much does enabling BMI2 in LLVM or GCC improves performance. As I wrote, I plan to review this issue experimentally.

Feel free, but I think you're unlikely to notice a big difference in practice, these are sort of micro-optimizations. Yes, they save the CPU a little work, but it might be able to do that extra work without adding to the runtime.

Haswell has BMI2 though, right? So you can assume that BMI2 is present.

We do not assume that the processor supports the equivalent of Haswell. This is determined at runtime by CPU detection.

Right, but I'm only talking about the Haswell kernel? I added BMI2 to the list of features that Haswell supports, so if we dispatch to the Haswell kernel, that code can use BMI2 instructions, especially MULX, SHLX, and SHRX.

I am trying to get you to answer the question of why BMI2 is not enabled for the Haswell kernel. You said you would do it if it's beneficial, which you would test empirically. But my question isn't "Should it be enabled?", my question is, "Why not?" I'm asking what's the harm in saving the CPU a small amount of work or power consumption? I'm basically implying that I think the default position is to have BMI2 enabled for the Haswell kernel, whereas you're saying that the default position is to have BMI2 disabled for the Haswell kernel. I'm questioning why you think that way.

I think that these instructions were added to the CPU as an optimization, so if the compiler agrees, i.e., wants to insert it in your Haswell specific code, why not just let it? Again, I am not sure that the specific circumstances in which those instructions are inserted are sufficient to see a performance improvement, but the potential is there, so why not? I definitely don't think you'll see performance degradation. Why not just enable it for the Haswell kernel?

lemire · 2024-09-02T20:08:05Z

@Validark At least under some versions of GCC, and some hardware, enabling BMI2 appears to lower the performance. It does increase slightly the instruction count.

See #2243

For LLVM, it is beneficial so I am currently considering enabling it only under LLVM.

Validark · 2024-09-02T20:11:30Z

@Validark At least under some versions of GCC, and some hardware, enabling BMI2 appears to lower the performance. It does increase slightly the instruction count.

See #2243

For LLVM, it is beneficial so I am currently considering enabling it only under LLVM.

That's sad that GCC does worse with it on. Well, c'est la vie.

lemire · 2024-09-03T15:25:00Z

@Validark

Even here, in this PR, you can see that our CI reports a performance degradation:

However, I am not giving up. I think that if we are sufficiently careful, we could get more performance in some cases with BMI2.

add pdep/pext compress implementation

53256e6

Validark added 2 commits August 30, 2024 11:17

Add bmi2 to Haswell kernel

a560126

whitespace change

bd37730

add pdep/pext compress implementation #2242

Are you sure you want to change the base?

add pdep/pext compress implementation #2242

Uh oh!

Conversation

Validark commented Aug 29, 2024

Uh oh!

Validark commented Aug 29, 2024

Uh oh!

lemire commented Aug 29, 2024

Uh oh!

Validark commented Aug 30, 2024

Uh oh!

lemire commented Aug 30, 2024

Uh oh!

lemire commented Aug 30, 2024

Uh oh!

lemire commented Aug 30, 2024

Uh oh!

Validark commented Aug 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

(I assume you know what reciprocal throughput is, but honestly I forget which is which sometimes)

Uh oh!

Validark commented Aug 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lemire commented Aug 30, 2024

Uh oh!

lemire commented Aug 30, 2024

Uh oh!

Validark commented Aug 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lemire commented Aug 31, 2024

Uh oh!

Validark commented Aug 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lemire commented Sep 2, 2024

Uh oh!

Validark commented Sep 2, 2024

Uh oh!

lemire commented Sep 3, 2024

Uh oh!

Uh oh!

Validark commented Aug 30, 2024 •

edited

Loading

Validark commented Aug 30, 2024 •

edited

Loading

Validark commented Aug 30, 2024 •

edited

Loading

Validark commented Aug 31, 2024 •

edited

Loading