
Genericize bitmask building to make algorithms clearer #296


Merged: jkeiser merged 9 commits into master from wide_mask on Aug 28, 2019

Conversation

@jkeiser (Member) commented Aug 23, 2019

This patch introduces simd_input::build_bitmask, making it a bit easier to build and read the (relatively frequent) algorithms that walk over 64 bytes of input and produce a bitmask. My ulterior motive is to get closer to actually genericizing the algorithms, but at the moment it still uses the underlying instruction set.

It relies on the compiler inlining genericized lambdas; performance isn't taking a hit in my tests, so I'm opening a PR to see what happens :)
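
For readers skimming the diff, here is a rough sketch of the shape such a helper takes on a four-register (SSE-style) code path. The struct and member layout below are illustrative only, not necessarily the exact code in this PR:

    #include <immintrin.h>
    #include <cstdint>

    // Illustrative sketch only: 64 bytes of input held as four 16-byte SSE
    // registers, plus a helper that applies a per-chunk comparison (the lambda)
    // and packs the four 16-bit movemask results into a single 64-bit bitmask.
    struct simd_input_sketch {
      __m128i v0, v1, v2, v3;

      template <typename F>
      uint64_t build_bitmask(F f) const {
        uint64_t r0 = static_cast<uint32_t>(_mm_movemask_epi8(f(v0)));
        uint64_t r1 = static_cast<uint32_t>(_mm_movemask_epi8(f(v1)));
        uint64_t r2 = static_cast<uint32_t>(_mm_movemask_epi8(f(v2)));
        uint64_t r3 = static_cast<uint32_t>(_mm_movemask_epi8(f(v3)));
        return r0 | (r1 << 16) | (r2 << 32) | (r3 << 48);
      }
    };

With a helper like this, "find every backslash in the 64 bytes" collapses to a single call such as in.build_bitmask([&](auto chunk) { return _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\\')); }).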

@lemire (Member) commented Aug 23, 2019

Love it.

@jkeiser (Member, Author) commented Aug 23, 2019

SSE failed the perf check, though AVX succeeded. Re-running to see if it passes, but that's a warning sign.

Edit: I accidentally re-ran the entire workflow, and AVX failed perf this time. This makes me think we want to get rid of the perf threshold: I would rather we catch the 2% performance differences and deal with re-runs, since those seem to be pretty common.

@jkeiser (Member, Author) commented Aug 23, 2019

I think there is a small perf degradation here. How are you guys checking the difference between old and new assembly? I haven't debugged this before.

@jkeiser (Member, Author) commented Aug 23, 2019

Hmm. I found that some things weren't being inlined that I thought would have been, and there were static modifiers on the methods, so I am going to try removing those.

@jkeiser jkeiser marked this pull request as ready for review August 23, 2019 17:35
@jkeiser (Member, Author) commented Aug 23, 2019

Removing static did the trick, in both local and remote tests. Re-running the Circle workflow just to be sure, and marking this as non-draft.

@jkeiser jkeiser requested a review from lemire August 23, 2019 17:35
@jkeiser (Member, Author) commented Aug 23, 2019

I'm really glad to know that lambdas don't always kill inlining/perf! My earlier experiments with that didn't pull it off for some reason.

@jkeiser (Member, Author) commented Aug 23, 2019

When I re-ran CircleCI, AVX failed perf again, by an even smaller margin. I am pushing a patch for the arm64 build_bitmask, since I realized I hadn't written that one yet. It may well fail; I haven't compiled it :)

@jkeiser jkeiser force-pushed the wide_mask branch 3 times, most recently from bf3b068 to ddeeec6 on August 23, 2019 18:11
@jkeiser (Member, Author) commented Aug 23, 2019

Yeah, there's something weird with perf still here. It's in the < 2% range (again!). I'll figure out how to use disassembly tools :) arm64 is up and using this, too.

If the inlining is working (I haven't directly verified), the difference might be due to the order of instructions? Before, we tended to interleave instructions from each register, which might let the processor run them more in parallel; now we process one register at a time. I'll see what happens if I add a map() to restore that interleaving.

@jkeiser jkeiser removed the request for review from lemire August 23, 2019 18:31
@lemire (Member) commented Aug 23, 2019

@jkeiser I am not sure that there is one way to assess the performance. However, my "go to" way is to use Linux performance counters and look at instruction counts. Refactoring should not change the number of instructions being executed.
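
The counters reported in the benchmark output later in this thread (instructions, cycles, mispredicted branches, cache accesses) are the kind of thing Linux hardware performance counters provide. A minimal, self-contained sketch of counting retired instructions around a region of code with perf_event_open, shown as the general mechanism rather than simdjson's actual benchmark harness, might look like this:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdint>
    #include <cstdio>

    int main() {
      perf_event_attr attr{};
      attr.type = PERF_TYPE_HARDWARE;
      attr.size = sizeof(attr);
      attr.config = PERF_COUNT_HW_INSTRUCTIONS;
      attr.disabled = 1;
      attr.exclude_kernel = 1;
      attr.exclude_hv = 1;

      // Open a counter for the current thread, on any CPU.
      int fd = static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
      if (fd < 0) { perror("perf_event_open"); return 1; }

      ioctl(fd, PERF_EVENT_IOC_RESET, 0);
      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
      // ... run the code under measurement here ...
      ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

      uint64_t instructions = 0;
      read(fd, &instructions, sizeof(instructions)); // read_format == 0: a single u64
      printf("instructions: %lu\n", static_cast<unsigned long>(instructions));
      close(fd);
      return 0;
    }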

@lemire (Member) commented Aug 23, 2019

So at least for AVX, there is no change to the instruction count... the number of cache accesses is the same... What appears to have changed is the number of mispredicted branches, and our instructions per cycle is down. I have not done the math, but this could be explained by the increase in mispredicted branches.

Whether it is the code's fault or it is just blind luck is not clear to me at this point. Could be that the code is fine and we just happen to be unlucky with the particular compiler I am using, or something.

Before

$ ./parsemaster jsonexamples/twitter.json
number of iterations 1000
number of bytes 631515 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1207 cycles:        755 (0.07 %) ins/cycles: 1.60 mis. branches:          2 (cycles/mis.branch 354.15) cache accesses:      28146 (failure          0)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    1743028 cycles:     527835 (51.19 %) ins/cycles: 3.30 mis. branches:       1099 (cycles/mis.branch 480.00) cache accesses:      28146 (failure          1)
 stage 1 runs at 0.84 cycles per input byte.
stage 2 instructions:    1490176 cycles:     502468 (48.73 %) ins/cycles: 2.97 mis. branches:       1618  (cycles/mis.branch 310.49)  cache accesses:      51419 (failure          7)
 stage 2 runs at 0.80 cycles per input byte and 9.09 cycles per structural character.
 all stages: 1.63 cycles per input byte.
Estimated average frequency: 3.715 GHz.
Min:  0.000277537 bytes read: 631515 Gigabytes/second: 2.27543


$ ./parsemaster jsonexamples/gsoc-2018.json
number of iterations 10
number of bytes 3327831 number of structural chars 75842 ratio 0.023
mem alloc instructions:       1207 cycles:       1030 (0.03 %) ins/cycles: 1.17 mis. branches:          4 (cycles/mis.branch 245.45) cache accesses:     119836 (failure         14)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    6357962 cycles:    2015852 (54.19 %) ins/cycles: 3.15 mis. branches:       5639 (cycles/mis.branch 357.48) cache accesses:     119836 (failure       4164)
 stage 1 runs at 0.61 cycles per input byte.
stage 2 instructions:    3540556 cycles:    1703196 (45.78 %) ins/cycles: 2.08 mis. branches:      14854  (cycles/mis.branch 114.66)  cache accesses:     227183 (failure      13657)
 stage 2 runs at 0.51 cycles per input byte and 22.46 cycles per structural character.
 all stages: 1.12 cycles per input byte.
Estimated average frequency: 3.726 GHz.
Min:  0.000998531 bytes read: 3327831 Gigabytes/second: 3.33273


$ ./parsemaster jsonexamples/marine_ik.json
number of iterations 10
number of bytes 2983466 number of structural chars 643013 ratio 0.216
mem alloc instructions:       1207 cycles:       1703 (0.01 %) ins/cycles: 0.71 mis. branches:          5 (cycles/mis.branch 340.68) cache accesses:     189688 (failure         27)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    9568888 cycles:    2886723 (25.14 %) ins/cycles: 3.31 mis. branches:       8970 (cycles/mis.branch 321.79) cache accesses:     189688 (failure      73635)
 stage 1 runs at 0.97 cycles per input byte.
stage 2 instructions:   28835002 cycles:    8595283 (74.85 %) ins/cycles: 3.35 mis. branches:      49653  (cycles/mis.branch 173.11)  cache accesses:     369868 (failure     148890)
 stage 2 runs at 2.88 cycles per input byte and 13.37 cycles per structural character.
 all stages: 3.85 cycles per input byte.
Estimated average frequency: 3.693 GHz.
Min:  0.00310945 bytes read: 2983466 Gigabytes/second: 0.959484

After

$ ./parse jsonexamples/twitter.json
number of iterations 1000
number of bytes 631515 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1207 cycles:        738 (0.07 %) ins/cycles: 1.63 mis. branches:          2 (cycles/mis.branch 289.91) cache accesses:      28139 (failure          0)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    1743029 cycles:     539839 (51.67 %) ins/cycles: 3.23 mis. branches:       1650 (cycles/mis.branch 327.06) cache accesses:      28139 (failure          4)
 stage 1 runs at 0.85 cycles per input byte.
stage 2 instructions:    1490176 cycles:     504112 (48.25 %) ins/cycles: 2.96 mis. branches:       1660  (cycles/mis.branch 303.63)  cache accesses:      51398 (failure         22)
 stage 2 runs at 0.80 cycles per input byte and 9.12 cycles per structural character.
 all stages: 1.65 cycles per input byte.
Estimated average frequency: 3.721 GHz.
Min:  0.000280785 bytes read: 631515 Gigabytes/second: 2.24911

$ ./parse jsonexamples/gsoc-2018.json
number of iterations 10
number of bytes 3327831 number of structural chars 75842 ratio 0.023
mem alloc instructions:       1207 cycles:        927 (0.02 %) ins/cycles: 1.30 mis. branches:          5 (cycles/mis.branch 181.88) cache accesses:     120830 (failure         14)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    6357963 cycles:    2038685 (53.76 %) ins/cycles: 3.12 mis. branches:       5836 (cycles/mis.branch 349.30) cache accesses:     120830 (failure      11689)
 stage 1 runs at 0.61 cycles per input byte.
stage 2 instructions:    3540556 cycles:    1752707 (46.22 %) ins/cycles: 2.02 mis. branches:      14708  (cycles/mis.branch 119.16)  cache accesses:     231732 (failure      41454)
 stage 2 runs at 0.53 cycles per input byte and 23.11 cycles per structural character.
 all stages: 1.14 cycles per input byte.
Estimated average frequency: 3.743 GHz.
Min:  0.0010133 bytes read: 3327831 Gigabytes/second: 3.28416

$ ./parse jsonexamples/marine_ik.json
number of iterations 10
number of bytes 2983466 number of structural chars 643013 ratio 0.216
mem alloc instructions:       1207 cycles:       1582 (0.01 %) ins/cycles: 0.76 mis. branches:          4 (cycles/mis.branch 359.61) cache accesses:     186949 (failure         37)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    9560909 cycles:    2863594 (24.80 %) ins/cycles: 3.34 mis. branches:       8324 (cycles/mis.branch 344.00) cache accesses:     186949 (failure      48864)
 stage 1 runs at 0.96 cycles per input byte.
stage 2 instructions:   28835003 cycles:    8682612 (75.19 %) ins/cycles: 3.32 mis. branches:      53589  (cycles/mis.branch 162.02)  cache accesses:     369285 (failure     158319)
 stage 2 runs at 2.91 cycles per input byte and 13.50 cycles per structural character.
 all stages: 3.87 cycles per input byte.
Estimated average frequency: 3.721 GHz.
Min:  0.00310354 bytes read: 2983466 Gigabytes/second: 0.96131

@lemire (Member) commented Aug 23, 2019

Here are the results on an ARM server...

Before

$ ./parsemaster jsonexamples/twitter.json
number of iterations 1000
number of bytes 631515 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1161 cycles:       5280 (0.09 %) ins/cycles: 0.22 mis. branches:         12 (cycles/mis.branch 436.46) cache accesses:     193154 (failure         23)
 mem alloc runs at 0.01 cycles per input byte.
stage 1 instructions:    2287934 cycles:    3367337 (58.54 %) ins/cycles: 0.68 mis. branches:       3443 (cycles/mis.branch 977.88) cache accesses:     193154 (failure       9894)
 stage 1 runs at 5.33 cycles per input byte.
stage 2 instructions:    1996855 cycles:    2379721 (41.37 %) ins/cycles: 0.84 mis. branches:      13759  (cycles/mis.branch 172.95)  cache accesses:     921725 (failure      14896)
 stage 2 runs at 3.77 cycles per input byte and 43.06 cycles per structural character.
 all stages: 9.11 cycles per input byte.
Estimated average frequency: 3.210 GHz.
Min:  0.00179185 bytes read: 631515 Gigabytes/second: 0.352438

$  ./parsemaster jsonexamples/gsoc-2018.json
number of iterations 10
number of bytes 3327831 number of structural chars 75842 ratio 0.023
mem alloc instructions:       1161 cycles:       5229 (0.03 %) ins/cycles: 0.22 mis. branches:         15 (cycles/mis.branch 330.96) cache accesses:     653493 (failure         24)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    8345165 cycles:   12793985 (62.13 %) ins/cycles: 0.65 mis. branches:      10246 (cycles/mis.branch 1248.61) cache accesses:     653493 (failure      52038)
 stage 1 runs at 3.84 cycles per input byte.
stage 2 instructions:    5797032 cycles:    7791799 (37.84 %) ins/cycles: 0.74 mis. branches:      28813  (cycles/mis.branch 270.42)  cache accesses:    2730302 (failure      61201)
 stage 2 runs at 2.34 cycles per input byte and 102.74 cycles per structural character.
 all stages: 6.19 cycles per input byte.
Estimated average frequency: 3.041 GHz.
Min:  0.00677054 bytes read: 3327831 Gigabytes/second: 0.491516

$ ./parsemaster jsonexamples/marine_ik.json
number of iterations 10
number of bytes 2983466 number of structural chars 643013 ratio 0.216
mem alloc instructions:       1161 cycles:       5075 (0.01 %) ins/cycles: 0.23 mis. branches:         19 (cycles/mis.branch 262.98) cache accesses:    1236980 (failure         22)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:   11794027 cycles:   13741577 (39.73 %) ins/cycles: 0.86 mis. branches:      13183 (cycles/mis.branch 1042.34) cache accesses:    1236980 (failure      46656)
 stage 1 runs at 4.61 cycles per input byte.
stage 2 instructions:   29961875 cycles:   20837704 (60.25 %) ins/cycles: 1.44 mis. branches:      65816  (cycles/mis.branch 316.60)  cache accesses:    7456248 (failure     120609)
 stage 2 runs at 6.98 cycles per input byte and 32.41 cycles per structural character.
 all stages: 11.59 cycles per input byte.
Estimated average frequency: 3.099 GHz.
Min:  0.01116 bytes read: 2983466 Gigabytes/second: 0.267336

After:

$ ./parse jsonexamples/twitter.json
number of iterations 1000
number of bytes 631515 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1161 cycles:       4528 (0.08 %) ins/cycles: 0.26 mis. branches:          7 (cycles/mis.branch 608.70) cache accesses:     193099 (failure         22)
 mem alloc runs at 0.01 cycles per input byte.
stage 1 instructions:    2287935 cycles:    3359128 (58.87 %) ins/cycles: 0.68 mis. branches:       3257 (cycles/mis.branch 1031.18) cache accesses:     193099 (failure       9890)
 stage 1 runs at 5.32 cycles per input byte.
stage 2 instructions:    1996855 cycles:    2342154 (41.05 %) ins/cycles: 0.85 mis. branches:      13485  (cycles/mis.branch 173.68)  cache accesses:     921167 (failure      14883)
 stage 2 runs at 3.71 cycles per input byte and 42.38 cycles per structural character.
 all stages: 9.04 cycles per input byte.
Estimated average frequency: 3.246 GHz.
Min:  0.00175771 bytes read: 631515 Gigabytes/second: 0.359282

$ ./parse jsonexamples/gsoc-2018.json
number of iterations 10
number of bytes 3327831 number of structural chars 75842 ratio 0.023
mem alloc instructions:       1161 cycles:       5342 (0.03 %) ins/cycles: 0.22 mis. branches:         11 (cycles/mis.branch 477.02) cache accesses:     653458 (failure         23)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    8345166 cycles:   12773825 (62.12 %) ins/cycles: 0.65 mis. branches:      11010 (cycles/mis.branch 1160.18) cache accesses:     653458 (failure      52029)
 stage 1 runs at 3.84 cycles per input byte.
stage 2 instructions:    5797032 cycles:    7783179 (37.85 %) ins/cycles: 0.74 mis. branches:      28544  (cycles/mis.branch 272.67)  cache accesses:    2731057 (failure      61362)
 stage 2 runs at 2.34 cycles per input byte and 102.62 cycles per structural character.
 all stages: 6.18 cycles per input byte.
Estimated average frequency: 3.143 GHz.
Min:  0.00654201 bytes read: 3327831 Gigabytes/second: 0.508687

$ ./parse jsonexamples/marine_ik.json
number of iterations 10
number of bytes 2983466 number of structural chars 643013 ratio 0.216
mem alloc instructions:       1161 cycles:       4712 (0.01 %) ins/cycles: 0.25 mis. branches:         12 (cycles/mis.branch 392.67) cache accesses:    1236983 (failure         21)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:   11786048 cycles:   13705726 (39.46 %) ins/cycles: 0.86 mis. branches:      13104 (cycles/mis.branch 1045.90) cache accesses:    1236983 (failure      46656)
 stage 1 runs at 4.59 cycles per input byte.
stage 2 instructions:   29961875 cycles:   21019152 (60.52 %) ins/cycles: 1.43 mis. branches:      66853  (cycles/mis.branch 314.41)  cache accesses:    7457438 (failure     120574)
 stage 2 runs at 7.05 cycles per input byte and 32.69 cycles per structural character.
 all stages: 11.64 cycles per input byte.
Estimated average frequency: 3.001 GHz.
Min:  0.0115743 bytes read: 2983466 Gigabytes/second: 0.257767

So basically no change.

@lemire (Member) commented Aug 23, 2019

Results on another PC:

Before:

$ ./parsemaster jsonexamples/twitter.json
number of iterations 1000
number of bytes 631515 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1162 cycles:        723 (0.07 %) ins/cycles: 1.61 mis. branches:          3 (cycles/mis.branch 204.33) cache accesses:      17761 (failure          0)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    1743010 cycles:     517714 (51.32 %) ins/cycles: 3.37 mis. branches:        896 (cycles/mis.branch 577.77) cache accesses:      17761 (failure          0)
 stage 1 runs at 0.82 cycles per input byte.
stage 2 instructions:    1490171 cycles:     490348 (48.61 %) ins/cycles: 3.04 mis. branches:       1841  (cycles/mis.branch 266.24)  cache accesses:      32602 (failure          1)
 stage 2 runs at 0.78 cycles per input byte and 8.87 cycles per structural character.
 all stages: 1.60 cycles per input byte.
Estimated average frequency: 3.218 GHz.
Min:  0.000313505 bytes read: 631515 Gigabytes/second: 2.01437

After:

$ ./parse jsonexamples/twitter.json
number of iterations 1000
number of bytes 631515 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1162 cycles:        691 (0.07 %) ins/cycles: 1.68 mis. branches:          2 (cycles/mis.branch 269.53) cache accesses:      17735 (failure          0)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    1743014 cycles:     517279 (51.53 %) ins/cycles: 3.37 mis. branches:        937 (cycles/mis.branch 551.55) cache accesses:      17735 (failure          1)
 stage 1 runs at 0.82 cycles per input byte.
stage 2 instructions:    1490171 cycles:     485871 (48.40 %) ins/cycles: 3.07 mis. branches:       1727  (cycles/mis.branch 281.24)  cache accesses:      32537 (failure          4)
 stage 2 runs at 0.77 cycles per input byte and 8.79 cycles per structural character.
 all stages: 1.59 cycles per input byte.
Estimated average frequency: 3.215 GHz.
Min:  0.000312274 bytes read: 631515 Gigabytes/second: 2.02231

@lemire (Member) commented Aug 23, 2019

Overall, I conclude that there is no evidence of a performance degradation.

@lemire (Member) commented Aug 23, 2019

My view is that if a lambda, by itself, causes a performance degradation (when starting with conventional code), then it should be considered a bug (at the compiler level). I would hope that C++ compilers are smart enough to handle well-written lambdas as well as conventional functions. Otherwise, it is a depressing world.

@lemire (Member) commented Aug 23, 2019

We appear to save about 100 lines of code.

@jkeiser (Member, Author) commented Aug 23, 2019

I am trying something real quick here: seeing what happens if I interleave the instructions the way they were, using a map() call. I also added MAP_BITMASK/MAP_CHUNKS macros that reduce the lambda boilerplate for single-expression calculations (which many of our input mappings are).
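
For concreteness, here is a guess at what such single-expression macros could expand to (the actual definitions in the PR may differ); each wraps the expression in a lambda over one chunk and hands it to the generic map():

    // Hypothetical expansions, assuming simd_input exposes a generic map()
    // that applies the lambda to each chunk, and to_bitmask() packs the
    // mapped chunks into a 64-bit mask.
    #define MAP_CHUNKS(EXPR)  map([&](auto chunk) { return (EXPR); })
    #define MAP_BITMASK(EXPR) map([&](auto chunk) { return (EXPR); }).to_bitmask()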

@jkeiser (Member, Author) commented Aug 23, 2019

Doing that makes find_whitespace_and_structurals look like this:

  whitespace = in.MAP_BITMASK( _mm_cmpeq_epi8(chunk, _mm_shuffle_epi8(white_table, chunk)) );

  auto r1 = in.MAP_CHUNKS( _mm_add_epi8(struct_offset, chunk) );
  auto r2 = in.MAP_CHUNKS( _mm_or_si128(chunk, struct_mask) );
  auto r3 = r1.MAP_CHUNKS( _mm_shuffle_epi8(structural_table, chunk) );
  structurals = simd_input<ARCHITECTURE>(
    _mm_cmpeq_epi8(r2.v0, r3.v0),
    _mm_cmpeq_epi8(r2.v1, r3.v1),
    _mm_cmpeq_epi8(r2.v2, r3.v2),
    _mm_cmpeq_epi8(r2.v3, r3.v3)
  ).to_bitmask();

It's not quite as pretty, especially because I couldn't (quickly) figure out a way to make a mapper that combines two sets of SIMD registers (where we cmpeq r2 and r3); one possible shape is sketched after the second snippet below. But if it restores perf and still reduces complexity / LOC / potential copy/paste errors, it's still a win IMO. For comparison, here is what it looks like without interleaving:

    structurals = in.map([&](auto chunk) {
      __m256i r1 = _mm256_add_epi8(struct_offset, chunk);
      __m256i r2 = _mm256_or_si256(chunk, struct_mask);
      __m256i r3 = _mm256_shuffle_epi8(structural_table, r1);
      return _mm256_cmpeq_epi8(r2, r3);
    }).to_bitmask();
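
As for the two-register mapper mentioned above, one possible shape (names, types, and signature assumed for illustration, not part of this PR) is a map over corresponding chunks of two inputs, so the final cmpeq over r2 and r3 could be written once:

    #include <immintrin.h>

    // Illustrative only: apply a two-argument lambda to corresponding
    // 16-byte chunks of two sets of SIMD registers.
    struct chunks4 { __m128i v0, v1, v2, v3; };

    template <typename F>
    chunks4 map2(const chunks4 &a, const chunks4 &b, F f) {
      return { f(a.v0, b.v0), f(a.v1, b.v1), f(a.v2, b.v2), f(a.v3, b.v3) };
    }

    // Hypothetical usage, assuming a to_bitmask() over chunks4:
    //   structurals = to_bitmask(map2(r2, r3,
    //     [](auto a, auto b) { return _mm_cmpeq_epi8(a, b); }));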

@jkeiser (Member, Author) commented Aug 23, 2019

SSE still failed the perf check, and AVX succeeded but was still slightly negative.

If this particular map() change to interleave instructions doesn't make things better, I think I'd rather go back to the previous version, since it reads a lot better.

@jkeiser (Member, Author) commented Aug 23, 2019

Huh? It passed? I think I must have been seeing the wrong checks. Looking.

@jkeiser jkeiser requested a review from lemire August 23, 2019 19:49
@jkeiser (Member, Author) commented Aug 23, 2019

Looks like the version with interleaved instructions rescues perf! The numbers still look negative, in the 0.5% range, but that's within the real margin of error. I think it's ready to go back in for review.

@lemire, would you mind double-checking whether that change settles down the branch misses? The Windows Subsystem for Linux is still giving me a perf_event_open: Invalid argument error when I run parse. I'll put fixing that on my shortlist :)

@jkeiser (Member, Author) commented Aug 23, 2019

And if you're still convinced there isn't an actual perf degradation, let me know and I'll revert the structural calculations back to a more readable state.

@lemire (Member) commented Aug 23, 2019

I will review this again early next week... (hopefully you will stabilize the PR by then... :-) )

@jkeiser (Member, Author) commented Aug 23, 2019

Sounds good :) This last push was just a way to make the interleaved version a little easier on the eyes.

@jkeiser (Member, Author) commented Aug 26, 2019

Pushed a rebase from master. I have a new branch with the necessary changes to make utf8 checking use this new method, but I'll do that separately.

@jkeiser (Member, Author) commented Aug 26, 2019

(There are no changes in the rebase.)

@jkeiser (Member, Author) commented Aug 26, 2019

clang-avx is failing, showing a 0.8% performance difference. I really don't know what the odds are that it's just chance, but I'm going to assume there's something real there that even interleaving didn't fix.

I'm getting a pure Linux machine up tonight so I can see the other numbers you're looking at, and see what's what with instruction counts and branch misses. I truly don't understand how branch misses could even be a factor, since I didn't touch any branches; but numbers don't lie.

@jkeiser (Member, Author) commented Aug 26, 2019

Running the AVX version on my Linux machine, I've concluded that there is no difference in performance (or branch misses) between master and this patch. I think, however, that there might be a performance difference between master and v0.2.1 (with v0.2.1 being very slightly faster).

Maybe it would be better to have PRs check against master rather than against v0.2.1? It seems like a more valid check anyway, if the intent is to find out whether the PR has regressed performance.

@jkeiser (Member, Author) commented Aug 26, 2019

Checking Linux to see if the non-interleaved version is any different ...

@jkeiser (Member, Author) commented Aug 26, 2019

It's a little hard to tell, but the interleaved version may be a very, very small amount faster, 0.5% or so. The new interleaved code is a little less readable than the non-interleaved version, but not so much that I'd want to take any hit at all for it. I'm going to check in a version of this patch that runs perf against master, to see if that clears up the failures.

@jkeiser (Member, Author) commented Aug 26, 2019

I see a failure on Drone now. Circle is succeeding but has a ridiculously high variance. The Drone failure, on the other hand, looks reasonably consistent at about 0.5%. I'm going to re-run Drone now, particularly since it's been succeeding up until this point, but it's a red flag ...

@jkeiser (Member, Author) commented Aug 26, 2019

It succeeded this time, but with super high variance again. Additionally, on both the PR and push builds, even though Drone has high variance, 6 of the 7 Drone throughput numbers were lower than master both times.

@jkeiser (Member, Author) commented Aug 26, 2019

At this point, I'm just not a good enough statistician to say what any of this means. The spread from the last 2 runs on Drone:

  • Branch: 1.03548 -- 1.16986
  • Master: 1.08713 -- 1.18017

and

  • Branch: 0.982987 -- 1.16856
  • Master: 1.08713 -- 1.18017

@jkeiser (Member, Author) commented Aug 26, 2019

For comparison, when Drone is run against master in the PR where I'm doing baseline perf runs, four runs had these spreads:

  • 0.982987 -- 1.16856
  • 1.08713 -- 1.18017
  • 1.16033 -- 1.18552
  • 1.13474 -- 1.18885

If there's a difference, I think it's noise. I just wish I could solidly convince myself of that :)

@lemire (Member) commented Aug 26, 2019

OK. I think I need to invest some time building my own performance analysis tool... I don't know whether I will be able to hook it up with GitHub, but I should be able to generate reports.

@lemire (Member) left a review comment

I am 100% at ease with merging this PR as far as code performance is concerned. I cannot see any performance regression in my local tests. I realize that you argue there may be a performance regression... and maybe your concerns are warranted, but I am not worried about it... if you are, then sure, let us dig... but let us not get too paranoid.

I am not sure that the application of macros is entirely for the good of humanity. This is really a debatable change... But it is not like long sequences of intrinsics (with redundant operations) are super great software engineering... so it is a sane change. Also, we use macros elsewhere in simdjson, so it is consistent with our coding style generally.

The fact that we reduce the code size without affecting the complexity of the code (arguably) makes this a fine PR.

@jkeiser jkeiser merged commit aef3f4b into master Aug 28, 2019
@jkeiser (Member, Author) commented Aug 28, 2019

Yeah, if there is a perf difference, it's hard to distinguish from noise. I'll think of other ways to validate this in the future. The interleaved version ought to make the compiler yield identical assembly, at any rate.

@jkeiser jkeiser deleted the wide_mask branch August 28, 2019 15:54