Genericize bitmask building to make algorithms clearer #296
Conversation
Love it.

SSE failed the perf check, though AVX succeeded. Rerunning to see if it succeeds, but that's a warning sign. Edit: I accidentally re-ran the entire workflow, and AVX failed perf this time. This makes me think we want to get rid of the perf threshold. I would rather we catch the 2% performance differences and deal with re-runs, since they seem to be pretty common.

I think there is a small perf degradation here. How are you guys checking the difference between old and new assembly? I haven't debugged this before.

Hmm. I found some stuff wasn't inlined that I thought would have been, and there were static modifiers on the methods, so I am going to try removing those.

Removing static did the trick, in both local and remote tests. Re-running the Circle workflow just to be sure, and marking this as non-draft.

I'm really glad to know that lambdas don't always kill inlining/perf! My earlier experiments with that didn't pull it off for some reason.

When I re-ran CircleCI, AVX failed perf again, with even smaller increments. I am pushing a patch for amd64 build_bitmask, since I realized I hadn't written that. It may well fail, I haven't compiled it :)
Force-pushed from bf3b068 to ddeeec6.
Yeah, there's something weird with perf still here. It's in the <2% range (again!). I'll figure out how to use disassembly tools :) arm64 is up and using this, too. If the inlining is working (I haven't directly verified), it might be due to the order of instructions? Before, we tended to interleave instructions from each register, which might let the processor run them more in parallel; now we process one register at a time. I'll see what happens if I add a map() to make that better.

@jkeiser I am not sure that there is one way to assess the performance. However, my "go to" way is to use Linux performance counters and look at instruction counts. Refactoring should not change the number of instructions being executed.
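As a concrete illustration of that workflow (the binary name and input file below are placeholders, not the actual simdjson benchmark invocation), one might compare hardware counters between the two builds like this:

```shell
# Compare counters on the same input for the master build and the PR build.
# "parse" and "twitter.json" stand in for the real benchmark binary and file.
perf stat -e instructions,cycles,branches,branch-misses,cache-references \
    ./parse twitter.json

# Repeat the run several times (-r) to average out run-to-run noise:
perf stat -r 10 -e instructions,branch-misses ./parse twitter.json
```

If a refactor is truly behavior-preserving, the `instructions` count should be essentially identical between the two builds; divergence in `branch-misses` or IPC with identical instruction counts points at microarchitectural effects rather than extra work.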
So at least for AVX, there is no change to the instruction count... the number of cache accesses is the same... What appears to have changed is the number of mispredicted branches. And our instructions per cycle are down. I have not done the math, but this could be explained by the increase in mispredicted branches. Whether it is the code's fault or just blind luck is not clear to me at this point. Could be that the code is fine and we just happen to be unlucky with the particular compiler I am using, or something.

Before

After
Here are the results on an ARM server...

Before:

After:

So basically no change.
Results on another PC:

Before:

After:
Overall, I conclude that there is no evidence of a performance degradation.

My view is that if a lambda, by itself, causes a performance degradation (when starting with conventional code), then it should be considered a bug (at the compiler level). I would hope that C++ compilers are smart enough to handle well-written lambdas as well as conventional functions. Otherwise, it is a depressing world.
We appear to save about 100 lines of code.

I am trying something real quick here: see what happens if I interleave the instructions the way they were, using a map() instruction. I also added MAP_BITMASK/MAP_CHUNKS macros that reduce the lambda boilerplate for single-expression calculations (which many of our input mappings are).
Doing that makes find_whitespace_and_structurals look like this:

```cpp
whitespace = in.MAP_BITMASK( _mm_cmpeq_epi8(chunk, _mm_shuffle_epi8(white_table, chunk)) );
auto r1 = in.MAP_CHUNKS( _mm_add_epi8(struct_offset, chunk) );
auto r2 = in.MAP_CHUNKS( _mm_or_si128(chunk, struct_mask) );
auto r3 = r1.MAP_CHUNKS( _mm_shuffle_epi8(structural_table, chunk) );
structurals = simd_input<ARCHITECTURE>(
  _mm_cmpeq_epi8(r2.v0, r3.v0),
  _mm_cmpeq_epi8(r2.v1, r3.v1),
  _mm_cmpeq_epi8(r2.v2, r3.v2),
  _mm_cmpeq_epi8(r2.v3, r3.v3)
).to_bitmask();
```

It's not quite as pretty, especially because I couldn't (quickly) figure out a way to make a mapper that combines two sets of simd registers (where we cmpeq r2 and r3). But if it restores perf and still reduces complexity / LOC / potential copy/paste errors, it's still a win IMO. For comparison, what it looks like without interleaving:
SSE still failed the perf check, and AVX succeeded but was still slightly negative. If this particular map() change to interleave instructions doesn't make things better, I think I'd rather go back to the previous version, since it reads a lot better.

Huh? It passed? I think I must have been seeing the wrong checks. Looking.

Looks like the version with interleaved instructions rescues perf! They do look negative in the 0.5% range still, but it's within the real margin for error. I think it's ready to go back in for review. @lemire if you wouldn't mind double-checking whether that change settles down the branch misses? The Windows Subsystem for Linux is still giving me a

And if you're still convinced there isn't an actual perf degradation, let me know and I'll revert the structural calculations back to a more readable state.
I will review this again early next week... (hopefully you will stabilize the PR by then... :-) )

Sounds good :) This last push was just a way to make the interleaved version a little easier on the eyes.

Pushed a rebase from master. I have a new branch with the necessary changes to make utf8 checking use this new method, but I'll do that separately.
to make it easier to run it yourself without having to recompile the world
(There are no changes in the rebase.)
clang-avx is failing, showing a 0.8% performance difference. I really don't know what the odds are of chance there, but I'm going to assume there's something real there that even interleaving didn't fix. I'm getting a pure Linux machine up tonight so I can see the other numbers you're looking at, and see what's what with instruction count and branch misses. I truly don't understand how branch misses could even be a factor, since I didn't touch any branches; but numbers don't lie.

Running the AVX version on my Linux machine, I've concluded that there is no difference in performance (or branch misses) between master and this patch. I think, however, that there might be a performance difference between master and v0.2.1 (with v0.2.1 being very, very slightly faster). Maybe it would be better to have PRs check against master rather than against v0.2.1? It seems like a more valid check anyway, if the intent is to find out whether the PR has regressed performance.

Checking Linux to see if the non-interleaved version is any different ...

It's a little hard to tell, but interleaved may be a very, very small amount faster--0.5% or so. The new interleaved code is a little less readable than the non-interleaved version, but not so much that I'd want to take any hit at all for it. I'm going to check in a version of this patch that runs perf against master, to see if that clears up the failures.

I see a failure on Drone now. Circle is succeeding but has a ridiculously high variance. The Drone failure, on the other hand, looks reasonably consistent at about 0.5%. I'm going to re-run Drone now, particularly since that's been succeeding up until this point, but it's a red flag ...

It succeeded this time, but there's super high variance again. Additionally, even though Drone has high variance, on both the PR and push builds, 6/7 of the Drone throughput numbers were lower than master both times.
At this point, I'm just not a good enough statistician to say what any of this means. The spread from the last 2 runs on Drone:

and

For comparison, when Drone is run against master in the PR where I'm doing baseline perf runs, four runs had these spreads:

If there's a difference, I think it's noise. I just wish I could solidly convince myself of that :)
OK. I think I need to invest some time building my own performance analysis tool... I don't know whether I will be able to hook it up with GitHub, but I should be able to generate reports.
I am 100% at ease with merging this PR as far as code performance is concerned. I cannot see any performance regression in my local tests. I realize that you argue for performance regressions... and maybe your concerns are warranted, but I am not worried about it... if you are, then sure, let us dig... but let us not get too paranoid.

I am not sure that the application of macros is entirely for the good of humanity. This is really a debatable change... But it is not like long sequences of intrinsics (with redundant operations) are super great software engineering... so it is a sane change. Also, we use macros elsewhere in simdjson, so it is consistent with our coding style generally.

The fact that we reduce the code size without affecting the complexity of the code (arguably) makes this a fine PR.
Yeah, if there is a perf difference, it's hard to distinguish from noise. I'll think of other ways to validate this in the future. The interleaved version at least ought to make the compiler yield identical assembly, at any rate.
This patch introduces `simd_input::build_bitmask`, making it a bit easier to build and read the (relatively frequent) algorithms that walk over 64 bytes of input and produce a bitmask. My ulterior motive is to get closer to actually genericizing the algorithms, but at the moment it still uses the underlying instruction set. It relies on compilers inlining genericized lambdas, and perf isn't taking a hit in my tests; making a PR to see what happens :)