Genericize bitmask building to make algorithms clearer #296
Conversation
Love it.

SSE failed the perf check, though AVX succeeded. Rerunning to see if it succeeds, but that's a warning sign. Edit: I accidentally re-ran the entire workflow, and AVX failed perf this time. This makes me think we want to get rid of the perf threshold. I would rather we catch the 2% performance differences and deal with re-runs, since they seem to be pretty common.

I think there is a small perf degradation here. How are you guys checking the difference between old and new assembly? I haven't debugged this before.

Hmm. I found some stuff wasn't inlined that I thought would have been, and there were static modifiers on the methods, so I am going to try removing those.

Removing static did the trick, in both local and remote tests. Re-running the Circle workflow just to be sure, and marking this as non-draft.

I'm really glad to know that lambdas don't always kill inlining/perf! My earlier experiments with that didn't pull it off for some reason.

When I re-ran CircleCI, AVX failed perf again, with even smaller increments. I am pushing a patch for amd64 build_bitmask, since I realized I hadn't written that. It may well fail, I haven't compiled it :)
Force-pushed from bf3b068 to ddeeec6.
Yeah, there's something weird with perf still here. It's in the <2% range (again!). I'll figure out how to use disassembly tools :) arm64 is up and using this, too. If the inlining is working (I haven't directly verified), it might be due to the order of instructions? Before, we tended to interleave instructions from each register, which might let the processor run them more in parallel; now we process one register at a time. I'll see what happens if I add a map() to make that better.

@jkeiser I am not sure that there is one way to assess the performance. However, my "go to" way is to use Linux performance counters and look at instruction counts. Refactoring should not change the number of instructions being executed.
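As a concrete illustration of that workflow (the binary name and input file below are placeholders, not the actual simdjson benchmark invocation), one might compare hardware counters between the two builds like this:

```shell
# Compare counters on the same input for the master build and the PR build.
# "parse" and "twitter.json" stand in for the real benchmark binary and file.
perf stat -e instructions,cycles,branches,branch-misses,cache-references \
    ./parse twitter.json

# Repeat the run several times (-r) to average out run-to-run noise:
perf stat -r 10 -e instructions,branch-misses ./parse twitter.json
```

If a refactor is truly behavior-preserving, the `instructions` count should be essentially identical between the two builds; divergence in `branch-misses` or IPC with identical instruction counts points at microarchitectural effects rather than extra work.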
So at least for AVX, there is no change to the instruction count... the number of cache accesses is the same... What appears to have changed is the number of mispredicted branches. And our instructions per cycle are down. I have not done the math, but this could be explained by the increase in mispredicted branches. Whether it is the code's fault or just blind luck is not clear to me at this point. Could be that the code is fine and we just happen to be unlucky with the particular compiler I am using, or something.

Before

After
Here are the results on an ARM server...

Before:

After:

So basically no change.
Results on another PC:

Before:

After:
Overall, I conclude that there is no evidence of a performance degradation.

My view is that if a lambda, by itself, causes a performance degradation (when starting with conventional code), then it should be considered a bug (at the compiler level). I would hope that C++ compilers are smart enough to handle well-written lambdas as well as conventional functions. Otherwise, it is a depressing world.
We appear to save about 100 lines of code.

I am trying something real quick here: see what happens if I interleave the instructions the way they were, using a map() instruction. I also added MAP_BITMASK/MAP_CHUNKS macros that reduce the lambda boilerplate for single-expression calculations (which many of our input mappings are).
Doing that makes find_whitespace_and_structurals look like this:

```cpp
whitespace = in.MAP_BITMASK( _mm_cmpeq_epi8(chunk, _mm_shuffle_epi8(white_table, chunk)) );
auto r1 = in.MAP_CHUNKS( _mm_add_epi8(struct_offset, chunk) );
auto r2 = in.MAP_CHUNKS( _mm_or_si128(chunk, struct_mask) );
auto r3 = r1.MAP_CHUNKS( _mm_shuffle_epi8(structural_table, chunk) );
structurals = simd_input<ARCHITECTURE>(
  _mm_cmpeq_epi8(r2.v0, r3.v0),
  _mm_cmpeq_epi8(r2.v1, r3.v1),
  _mm_cmpeq_epi8(r2.v2, r3.v2),
  _mm_cmpeq_epi8(r2.v3, r3.v3)
).to_bitmask();
```

It's not quite as pretty, especially because I couldn't (quickly) figure out a way to make a mapper that combines two sets of simd registers (where we cmpeq r2 and r3). But if it restores perf and still reduces complexity / LOC / potential copy/paste errors, it's still a win IMO. For comparison, what it looks like without interleaving:
SSE still failed the perf check, and AVX succeeded but was still slightly negative. If this particular map() change to interleave instructions doesn't make things better, I think I'd rather go back to the previous version, since it reads a lot better.

Huh? It passed? I think I must have been seeing the wrong checks. Looking.

Looks like the version with interleaved instructions rescues perf! They do look negative in the 0.5% range still, but it's within the real margin for error. I think it's ready to go back in for review. @lemire if you wouldn't mind double-checking whether that change settles down the branch misses? The Windows Subsystem for Linux is still giving me a

And if you're still convinced there isn't an actual perf degradation, let me know and I'll revert the structural calculations back to a more readable state.
I will review this again early next week... (hopefully you will stabilize the PR by then... :-) )

Sounds good :) This last push was just a way to make the interleaved version a little easier on the eyes.

Pushed a rebase from master. I have a new branch with the necessary changes to make utf8 checking use this new method, but I'll do that separately.
to make it easier to run it yourself without having to recompile the world
(There are no changes in the rebase.)
clang-avx is failing, showing a 0.8% performance difference. I really don't know what the odds are of chance there, but I'm going to assume there's something real there that even interleaving didn't fix. I'm getting a pure Linux machine up tonight so I can see the other numbers you're looking at, and see what's what with instruction count and branch misses. I truly don't understand how branch misses could even be a factor, since I didn't touch any branches; but numbers don't lie.

Running the AVX version on my Linux machine, I've concluded that there is no difference in performance (or branch misses) between master and this patch. I think, however, that there might be a performance difference between master and v0.2.1 (with v0.2.1 being very, very slightly faster). Maybe it would be better to have PRs check against master rather than against v0.2.1? It seems like a more valid check anyway, if the intent is to find out whether the PR has regressed performance.

Checking Linux to see if the non-interleaved version is any different ...

It's a little hard to tell, but interleaved may be a very, very small amount faster--0.5% or so. The new interleaved code is a little less readable than the non-interleaved version, but not so much that I'd want to take any hit at all for it. I'm going to check in a version of this patch that runs perf against master, to see if that clears up the failures.

I see a failure on Drone now. Circle is succeeding but has a ridiculously high variance. The Drone failure, on the other hand, looks reasonably consistent at about 0.5%. I'm going to re-run Drone now, particularly since that's been succeeding up until this point, but it's a red flag ...

It succeeded this time, but there's super high variance again. Additionally, even though Drone has high variance, on both the PR and push builds, 6/7 of the Drone throughput numbers were lower than master both times.
At this point, I'm just not a good enough statistician to say what any of this means. The spread from the last 2 runs on Drone:

and

For comparison, when Drone is run against master in the PR where I'm doing baseline perf runs, four runs had these spreads:

If there's a difference, I think it's noise. I just wish I could solidly convince myself of that :)
OK. I think I need to invest some time building my own performance analysis tool... I don't know whether I will be able to hook it up with GitHub, but I should be able to generate reports.
I am 100% at ease with merging this PR as far as code performance is concerned. I cannot see any performance regression in my local tests. I realize that you argue for performance regressions... and maybe your concerns are warranted, but I am not worried about it... if you are, then sure, let us dig... but let us not get too paranoid.

I am not sure that the application of macros is entirely for the good of humanity. This is really a debatable change... But it is not like long sequences of intrinsics (with redundant operations) are super great software engineering... so it is a sane change. Also, we use macros elsewhere in simdjson, so it is consistent with our coding style generally.

The fact that we reduce the code size without affecting the complexity of the code (arguably) makes this a fine PR.
Yeah, if there is a perf difference, it's hard to distinguish from noise. I'll think of other ways to validate this in the future. The interleaved version at least ought to make the compiler yield identical assembly, at any rate.
This patch introduces `simd_input::build_bitmask`, making it a bit easier to build and read the (relatively frequent) algorithms that walk over 64 bytes of input and produce a bitmask. My ulterior motive is to get closer to actually genericizing the algorithms, but at the moment it still uses the underlying instruction set. It relies on compilers inlining genericized lambdas, and perf isn't taking a hit in my tests; making a PR to see what happens :)