-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance: faster switch over string objects #56374
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Some benchmark using 4 chars as a discriminator. [Benchmark(Baseline = true)]
public bool SwitchOverString()
{
bool result;
switch (_s)
{
// aabaaa 0x61006100610062 27303489359118434
case "aabaaa":
result = true;
break;
// aaabaa 0x61006100620061 27303489359183969
case "aaabaa":
result = true;
break;
// aaaaba 0x61006200610061 27303493654085729
case "aaaaba":
result = true;
break;
// aaaaab 0x62006100610061 27584964335829089
case "aaaaab":
result = true;
break;
default:
result = false;
break;
}
return result;
}
[Benchmark]
public bool LongDiscriminator_MemoryMarshal()
{
string candidate = null;
if (_s.Length != 6)
{
return false;
}
// Selects a contiguous region of chars as a long. The range [2..] is chosen because it contains significant
// characters in all possible strings
var discriminator = MemoryMarshal.Read<long>(MemoryMarshal.Cast<char, byte>(_s.AsSpan(2, 4)));
switch (discriminator)
{
// aabaaa 0x61006100610062 27303489359118434
case 0x61006100610062:
candidate = "aabaaa";
break;
// aaabaa 0x61006100620061 27303489359183969
case 0x61006100620061:
candidate = "aaabaa";
break;
// aaaaba 0x61006200610061 27303493654085729
case 0x61006200610061:
candidate = "aaaaba";
break;
// aaaaab 0x62006100610061 27584964335829089
case 0x62006100610061:
candidate = "aaaaab";
break;
}
return candidate == _s;
}
|
I think it would be worthwhile implementing this for the pattern return varName is "aaaaab" or "aaaaba" or "aaabaa" or "aabaaa"; as well |
Ok, so, I don't know if it'll be of any help (since I'm not really sure where this would go, I think it's in the lowering stage but I'm not sure) but I wrote a rewriter poc based on EgorBo's snippet. This is the result of running it on bool IsKnownContentType(string contentTypeValue)
{
{
var __switchExpr__0 = contentTypeValue;
string? __candidate__1 = null;
int __destination__2 = -1;
switch (__switchExpr__0.Length)
{
case 8:
switch (__switchExpr__0[7])
{
case 'l':
__candidate__1 = "text/xml";
__destination__2 = 0;
break;
case 's':
__candidate__1 = "text/css";
__destination__2 = 0;
break;
case 'v':
__candidate__1 = "text/csv";
__destination__2 = 0;
break;
}
break;
case 9:
switch (__switchExpr__0[6])
{
case 'g':
__candidate__1 = "image/gif";
__destination__2 = 0;
break;
case 'p':
__candidate__1 = "image/png";
__destination__2 = 0;
break;
case 't':
__candidate__1 = "text/html";
__destination__2 = 0;
break;
}
break;
case 10:
switch (__switchExpr__0[0])
{
case 't':
__candidate__1 = "text/plain";
__destination__2 = 0;
break;
case 'i':
__candidate__1 = "image/jpeg";
__destination__2 = 0;
break;
}
break;
case 15:
switch (__switchExpr__0[12])
{
case 'p':
__candidate__1 = "application/pdf";
__destination__2 = 0;
break;
case 'x':
__candidate__1 = "application/xml";
__destination__2 = 0;
break;
case 'z':
__candidate__1 = "application/zip";
__destination__2 = 0;
break;
}
break;
case 16:
switch (__switchExpr__0[12])
{
case 'g':
__candidate__1 = "application/grpc";
__destination__2 = 0;
break;
case 'j':
__candidate__1 = "application/json";
__destination__2 = 0;
break;
}
break;
case 19:
__candidate__1 = "multipart/form-data";
__destination__2 = 0;
break;
case 22:
__candidate__1 = "application/javascript";
__destination__2 = 0;
break;
case 24:
switch (__switchExpr__0[0])
{
case 'a':
__candidate__1 = "application/octet-stream";
__destination__2 = 0;
break;
case 't':
__candidate__1 = "text/html; charset=utf-8";
__destination__2 = 0;
break;
}
break;
case 25:
__candidate__1 = "text/plain; charset=utf-8";
__destination__2 = 0;
break;
case 31:
__candidate__1 = "application/json; charset=utf-8";
__destination__2 = 0;
break;
case 33:
__candidate__1 = "application/x-www-form-urlencoded";
__destination__2 = 0;
break;
}
if (__destination__2 != -1 && (object.ReferenceEquals(__switchExpr__0, __candidate__1) || __switchExpr__0?.Equals(__candidate__1) is true))
{
switch (__destination__2)
{
case 0:
return true;
}
}
}
return false;
} |
Hi, I was just pointed to this issue by @akoeplinger as we were looking at the code (both IL and native asm) generated for the switch on string. I was dealing with a similar issue in Xamarin.Android recently (in native code), and I think the runtime Firstly, I use xxHash to hash the strings I need to match (64-bit and 32-bit for the respective word width) as it's both fast and has a very, very low collision ratio. Secondly, I generate two tables - first contains the sorted hashes and the other contains the data I need to access for the given hash. To find a match, I do a binary search over the first table to get the index of that hash, and then I use that index to access the other table. This has the potential of being quite friendly to CPU caches, which is a good win for mobile platforms. The 2nd table might be problematic for Roslyn (but perhaps not?), however I think even an I tried several different approaches to binary search optimization, including loop unrolling and branchless versions, with mixed results. I eventually settled on the "naive" binary search because it delivered the best results across |
i think this is what roslyn does as well once the number of cases is high enough. I see us emitting https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.opcodes.bgt_un_s?view=net-6.0 and whatnot in our IL. You can see the logic we have for this here:
// c) After bucketing, generate code to perform a binary search on these buckets array,
// emitting conditional jumps if current bucket sub-array has more than one bucket and
// emitting the switch instruction when we are down to a single bucket in the sub-array. |
@CyrusNajmabadi we were looking at the generated code and it looks like unrolled binary search, but what's puzzling is this part: if (num == 822911587 && s == "4")
{
return true;
} and also else if (s == "5")
{
return true;
} Why always call the Equality operator if the hash matches? Is it because of the high collision ratio of the hash for similar strings? |
Because hash collisions are possible.
No. It's not about a high ratio. It's about a collision always being possible, so you have to validate the string is actually a match. A matching hash is never sufficient (same with hashtables, or any other system). THe only times hashes are ok to use is only when you know all inputs (not all values those inputs are being checked against) and you can then design a perfect hash mapping those fixed input values to your hashing domain. That's not what we have here, so the equality check is always necessary. |
I understand the need for validation, even though it does make the code slower than it needs to be, but how about using xxHash? It's got a really good ratio for small strings. |
@grendello btw, CoreCLR optimizes (unrolls) |
it shouldn't be difficult to do the same with Mono-LLVM, it's just |
I'm not sure what you mean. Validation must be done. There is no way to avoid that. So the validation doesn't make the code slower than necessary. It's mandatory to ensure that the results are correct. |
This is already done. |
This is not necessary. String equality already does this check. |
For reference: bool Test2(string s) => s == "Access-Control-Allow-Credentials"; emits: ; Method Tests:Test2(System.String):bool
vzeroupper
test rcx, rcx ;; <-- is null?
je SHORT G_M48330_IG06
cmp dword ptr [rcx+8], 32 ;; <-- Length == 32
jne SHORT G_M48330_IG05
vmovdqu ymm0, ymmword ptr[rcx+12]
vpxor ymm0, ymm0, ymmword ptr[reloc @RWD00]
vmovdqu ymm1, ymmword ptr[rcx+44]
vpxor ymm1, ymm1, ymmword ptr[reloc @RWD32]
vpor ymm0, ymm0, ymm1
vptest ymm0, ymm0
sete al
movzx rax, al
jmp SHORT G_M48330_IG07
G_M48330_IG05:
xor eax, eax
jmp SHORT G_M48330_IG07
G_M48330_IG06:
xor eax, eax
G_M48330_IG07:
movzx rax, al
vzeroupper
ret For smaller string the codegen is much smaller (e.g. for 4 chars string we can use a single 64bit long constant) dotnet/runtime#65288, dotnet/runtime#66095, dotnet/runtime#65288 |
If the chosen hash function has collision ratio of 0 collisions for a given domain (length), then I think it could be done without risking? |
@EgorBo it would be nice if Mono generated similar code :) |
I'm highly skeptical this will matter in practice. It's only for 8 bytes or less, or 4 characters. How often are you hashing those such that this matters. If the strings are that small, then the equality check is going to be blazingly fast anyways. |
Maybe I'm missing something, but surely that's fundamentally impossible for any scenario where the string has more bits than the hash? You necessarily have collisions in that case, as there are only 2^64 possible hashes, but there are 2^64 possible strings of length 4 (give or take, not all bytes sequences are valid UTF-16, etc). Therefore you must get collisions when you're dealing with any strings > 4 chars. |
@canton7 That's why tehy're specifying for "a given domain (length)" . xx is bijective on <=8 bytes of data: |
Indeed, I thought that was clear to OP, but it seems perhaps not. |
@canton7 in this case, it's producing a 64bit (System.Int64 in .net) hash. So it's saying that it perfectly maps any 64bit input to a distinct 64bit output. |
And I'll repeat myself -- you can only guarantee that there aren't collisions where the number of bits in the input is <= the number of bits in the hash (so 8 chars for a 64 bit hash, 16 chars for a 128 bit hash), and at that point the string equality check is so trivially cheap that it's not really worth eliding anyway. |
At htis point, i'm basically backing out the discussion. I think if this can be done safely and sanely, with a reasonable dependency on some battle tested piece of code, and it itself is tested extensively, it's fine. THis is all an internal impl detail, so as long as it's faster and safe, i'm ok with it. |
Monstrosities like this could take advantage of the small string optimizations. Also, I think you'd be surprised how much "trivial" optimizations like this can improve performance on mobile platforms (especially the underpowered ones). However, I am too backing out of the discussion, I think all points have been made. Thanks! |
If it interests anyone I have also made a source generator helper for doing what was suggested in the original issue (although obviously not production-ready as it wasn't really meant for anyone other than me to use/modify it, but it works): https://github.com/LorettaDevs/Loretta/blob/2f213c2afc49bcba1de6cba677d8c372bd188079/src/Tools/Loretta.Generators.Shared/OptimizedSwitch.cs I'd like to help to implement this into Roslyn but no idea where to start. |
Out of sheer curiosity, I checked what the C# compiler will output if the switch is rewritten as cases of list pattern matching: public bool IsKnownContentType1(string contentTypeValue)
{
switch (contentTypeValue)
{
case "text/xml":
case "text/css":
case "text/csv":
case "image/gif":
case "image/png":
case "text/html":
case "text/plain":
case "image/jpeg":
case "application/pdf":
case "application/xml":
case "application/zip":
case "application/grpc":
case "application/json":
case "multipart/form-data":
case "application/javascript":
case "application/octet-stream":
case "text/html; charset=utf-8":
case "text/plain; charset=utf-8":
case "application/json; charset=utf-8":
case "application/x-www-form-urlencoded":
return true;
}
return false;
} doing: public bool IsKnownContentType2(string contentTypeValue)
{
switch (contentTypeValue)
{
case ['t', 'e', 'x', 't', '/', 'x', 'm', 'l']:
case ['t', 'e', 'x', 't', '/', 'c', 's', 's']:
case ['t', 'e', 'x', 't', '/', 'c', 's', 'v']:
case ['i', 'm', 'a', 'g', 'e', '/', 'g', 'i', 'f']:
case ['i', 'm', 'a', 'g', 'e', '/', 'p', 'n', 'g']:
case ['t', 'e', 'x', 't', '/', 'h', 't', 'm', 'l']:
case ['t', 'e', 'x', 't', '/', 'p', 'l', 'a', 'i', 'n']:
case ['i', 'm', 'a', 'g', 'e', '/', 'j', 'p', 'e', 'g']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'p', 'd', 'f']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'x', 'm', 'l']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'z', 'i', 'p']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'g', 'r', 'p', 'c']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'j', 's', 'o', 'n']:
case ['m', 'u', 'l', 't', 'i', 'p', 'a', 'r', 't', '/', 'f', 'o', 'r', 'm', '-', 'd', 'a', 't', 'a']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'j', 'a', 'v', 'a', 's', 'c', 'r', 'i', 'p', 't']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'o', 'c', 't', 'e', 't', '-', 's', 't', 'r', 'e', 'a', 'm']:
case ['t', 'e', 'x', 't', '/', 'h', 't', 'm', 'l', ';', ' ', 'c', 'h', 'a', 'r', 's', 'e', 't', '=', 'u', 't', 'f', '-', '8']:
case ['t', 'e', 'x', 't', '/', 'p', 'l', 'a', 'i', 'n', ';', ' ', 'c', 'h', 'a', 'r', 's', 'e', 't', '=', 'u', 't', 'f', '-', '8']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'j', 's', 'o', 'n', ';', ' ', 'c', 'h', 'a', 'r', 's', 'e', 't', '=', 'u', 't', 'f', '-', '8']:
case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'x', '-', 'w', 'w', 'w', '-', 'f', 'o', 'r', 'm', '-', 'u', 'r', 'l', 'e', 'n', 'c', 'o', 'd', 'e', 'd']:
return true;
}
return false;
} The result isn't ideal, with multiple opportunities for improvement (e.g. emitting code in a way that the JIT's optimizations around unrolling will kick in, selecting better elements for comparison, etc.), but it's interesting that it shares at least some features in common with the proposed approach. |
I think that'll emit better code having #29033 in place.
What does that refer to? |
Just that the JIT today will generate better code for e.g. |
I have a PR out for this. |
I don't think this has been discussed. We are now doing optimization based on known static information from the switch. The compiler knows nothing about the peculiarities of the incoming data itself. If we could tell the compiler more information about the input data in some way, it could make the optimization better. |
That's what PGO in JIT is supposed to handle |
Why not using another character dispatch if the final switch is going to fallback to hashcode? Alternatively, since the length is known could read N bytes as an IntM and switch on that? though I suppose endianness is going to be an issue. |
Experiments with trie design yielded larger IL and worse performance. The performance could probably be improved (avoid checking length multiple times for example), but the IL size probably not.
Yes, this was mentioned in the bug, but I have not experimented with that. |
Background and Motivation
Currently, when we use
switch
over string objects, Roslyn emits a hash-based chain of if-else comparisons, e.g.:Here we'll run an O(N) hash algorithm (it's not SIMDified for large strings) and do a chain of comparisons with the precomputed ones (with binary search):
For sure it's better than doing
string.Equals
case by case, but it could be way faster.Proposed approach
Roslyn should emit a jump-table based on input's Length and then compare chars at some precomputed positions (sort of Trie), e.g. for our
IsKnownContentType
it could be:So, ideally we can find a matching candidate in just 1-2 jump instructions here.
I wrote a simple benchmark, here are the results:
Optional optimizations
Compare 2/4 chars at once
Inunaligned readsIsKnownContentType_opt
I only compare single chars, but can check for free up to 4 chars at once (on 64bit) or 2 chars for AnyCPU.Extend Roslyn's limitations on Jump-tables
The benchmark results ^ could be even more impressive if that switch over input's Length was emitted as single/solid jump-table by Roslyn and covered all the "missing" cases. It can be explained via this sharplab link, but it's a trade-off between IL size and perf.
Risks
The new implementation should always be faster but might affect IL size in some cases. However, in the case of my sample
IsKnownContentType_opt
is actually 25% smaller than the default one.Related discussion
There were attempts to do it in the past, e.g. dotnet/csharplang#1881 (comment) and #43869
but I think if we won't do a full Trie and only try to do Length + single char (or 4 at once) and fallback to the default implementation if there are cases where we can't do single-char checks - it will be simple and always faster.
/cc @333fred @CyrusNajmabadi
The text was updated successfully, but these errors were encountered: