Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: faster switch over string objects #56374

Closed
EgorBo opened this issue Sep 14, 2021 · 42 comments · Fixed by #66081
Closed

Performance: faster switch over string objects #56374

EgorBo opened this issue Sep 14, 2021 · 42 comments · Fixed by #66081
Assignees
Labels
4 - In Review A fix for the issue is submitted for review. Area-Compilers Concept-API This issue involves adding, removing, clarification, or modification of an API. Feature Request
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Sep 14, 2021

Background and Motivation

Currently, when we use switch over string objects, Roslyn emits a hash-based chain of if-else comparisons, e.g.:

public bool IsKnownContentType(string contentTypeValue)
{
    switch (contentTypeValue)
    {
        case "text/xml":
        case "text/css":
        case "text/csv":
        case "image/gif":
        case "image/png":
        case "text/html":
        case "text/plain":
        case "image/jpeg":
        case "application/pdf":
        case "application/xml":
        case "application/zip":
        case "application/grpc":
        case "application/json":
        case "multipart/form-data":
        case "application/javascript":
        case "application/octet-stream":
        case "text/html; charset=utf-8":
        case "text/plain; charset=utf-8":
        case "application/json; charset=utf-8":
        case "application/x-www-form-urlencoded":
            return true;
    }
    return false;
}

Here we'll run an O(N) hash algorithm (it's not SIMDified for large strings) and do a chain of comparisons with the precomputed ones (with binary search):

int hash = ComputeStringHash(contentTypeValue);
if (hash == 0x34343443) // 0x34343443 is a hash of "text/xml"
   return hash == "text/xml";
else if (hash == 0x54545452) // 0x54545452 is a hash of "text/css"
   return hash == "text/css";
// etc

For sure it's better than doing string.Equals case by case, but it could be way faster.

Proposed approach

Roslyn should emit a jump-table based on input's Length and then compare chars at some precomputed positions (sort of Trie), e.g. for our IsKnownContentType it could be:

public bool IsKnownContentType_opt(string contentTypeValue)
{
    string? candidate = null;
    switch (contentTypeValue.Length)
    {
        case 8:
            switch (contentTypeValue[7]) // at 7th position all chars are unique:
            {
                case 'l': candidate = "text/xml"; break;
                case 's': candidate = "text/css"; break;
                case 'v': candidate = "text/csv"; break;
            }
            break;
        case 9:
            switch (contentTypeValue[6]) // at 6th position all chars are unique:
            {
                case 'g': candidate = "image/gif"; break; 
                case 'p': candidate = "image/png"; break;
                case 't': candidate = "text/html"; break;
            }
            break;
        case 10:
            switch (contentTypeValue[0])
            {
                case 't': candidate = "text/plain"; break; 
                case 'i': candidate = "image/jpeg"; break;
            }
            break;
        case 15:
            switch (contentTypeValue[12])
            {
                case 'p': candidate = "application/pdf"; break;
                case 'x': candidate = "application/xml"; break;
                case 'z': candidate = "application/zip"; break;
            }
            break;
        case 16:
            switch (contentTypeValue[12])
            {
                case 'g': candidate = "application/grpc"; break; 
                case 'j': candidate = "application/json"; break;
            }
            break;
        case 19: candidate = "multipart/form-data"; break;
        case 22: candidate = "application/javascript"; break;
        case 24:
            switch (contentTypeValue[0])
            {
                case 'a': candidate = "application/octet-stream"; break;
                case 't': candidate = "text/html; charset=utf-8"; break;
            }
            break;
        case 25: candidate = "text/plain; charset=utf-8"; break;
        case 31: candidate = "application/json; charset=utf-8"; break;
        case 33: candidate = "application/x-www-form-urlencoded"; break;
    }
    return candidate == contentTypeValue;
}

So, ideally we can find a matching candidate in just 1-2 jump instructions here.
I wrote a simple benchmark, here are the results:

|                 Method |     contentTypeValue |      Mean |     Error |    StdDev |
|----------------------- |--------------------- |----------:|----------:|----------:|
|     IsKnownContentType |     application/grpc |  8.607 ns | 0.0384 ns | 0.0359 ns |
| IsKnownContentType_opt |     application/grpc |  1.475 ns | 0.0029 ns | 0.0026 ns |

|     IsKnownContentType | appli(...)cript [22] | 12.829 ns | 0.0362 ns | 0.0302 ns |
| IsKnownContentType_opt | appli(...)cript [22] |  1.289 ns | 0.0016 ns | 0.0015 ns |

|     IsKnownContentType |     application/json |  8.152 ns | 0.0321 ns | 0.0285 ns |
| IsKnownContentType_opt |     application/json |  1.484 ns | 0.0035 ns | 0.0033 ns |

|     IsKnownContentType | appli(...)utf-8 [31] | 18.141 ns | 0.0161 ns | 0.0126 ns |
| IsKnownContentType_opt | appli(...)utf-8 [31] |  1.269 ns | 0.0020 ns | 0.0018 ns |

|     IsKnownContentType | appli(...)tream [24] | 14.009 ns | 0.0933 ns | 0.0779 ns |
| IsKnownContentType_opt | appli(...)tream [24] |  1.715 ns | 0.0031 ns | 0.0026 ns |

|     IsKnownContentType |      application/pdf |  7.774 ns | 0.0044 ns | 0.0035 ns |
| IsKnownContentType_opt |      application/pdf |  1.701 ns | 0.0174 ns | 0.0155 ns |

|     IsKnownContentType | appli(...)coded [33] | 19.900 ns | 0.0227 ns | 0.0212 ns |
| IsKnownContentType_opt | appli(...)coded [33] |  1.092 ns | 0.0028 ns | 0.0023 ns |

|     IsKnownContentType |      application/xml |  7.756 ns | 0.0235 ns | 0.0208 ns |
| IsKnownContentType_opt |      application/xml |  1.536 ns | 0.0028 ns | 0.0024 ns |

|     IsKnownContentType |      application/zip |  7.593 ns | 0.0107 ns | 0.0100 ns |
| IsKnownContentType_opt |      application/zip |  1.317 ns | 0.0077 ns | 0.0072 ns |

|     IsKnownContentType |            image/gif |  5.118 ns | 0.0048 ns | 0.0040 ns |
| IsKnownContentType_opt |            image/gif |  1.694 ns | 0.0036 ns | 0.0032 ns |

|     IsKnownContentType |           image/jpeg |  5.068 ns | 0.0086 ns | 0.0072 ns |
| IsKnownContentType_opt |           image/jpeg |  1.695 ns | 0.0047 ns | 0.0041 ns |

|     IsKnownContentType |            image/png |  4.478 ns | 0.0531 ns | 0.0497 ns |
| IsKnownContentType_opt |            image/png |  1.944 ns | 0.0087 ns | 0.0068 ns |

|     IsKnownContentType |  multipart/form-data | 10.519 ns | 0.0211 ns | 0.0187 ns |
| IsKnownContentType_opt |  multipart/form-data |  1.305 ns | 0.0114 ns | 0.0107 ns |

|     IsKnownContentType |             text/css |  4.295 ns | 0.0062 ns | 0.0052 ns |
| IsKnownContentType_opt |             text/css |  1.537 ns | 0.0039 ns | 0.0030 ns |

|     IsKnownContentType |             text/csv |  4.409 ns | 0.0434 ns | 0.0362 ns |
| IsKnownContentType_opt |             text/csv |  1.484 ns | 0.0035 ns | 0.0031 ns |

|     IsKnownContentType |            text/html |  4.402 ns | 0.0041 ns | 0.0038 ns |
| IsKnownContentType_opt |            text/html |  1.526 ns | 0.0030 ns | 0.0026 ns |

|     IsKnownContentType | text/(...)utf-8 [24] | 13.811 ns | 0.0159 ns | 0.0132 ns |
| IsKnownContentType_opt | text/(...)utf-8 [24] |  1.537 ns | 0.0036 ns | 0.0032 ns |

|     IsKnownContentType |           text/plain |  5.161 ns | 0.0207 ns | 0.0184 ns |
| IsKnownContentType_opt |           text/plain |  1.320 ns | 0.0068 ns | 0.0057 ns |

|     IsKnownContentType | text/(...)utf-8 [25] | 14.368 ns | 0.0210 ns | 0.0186 ns |
| IsKnownContentType_opt | text/(...)utf-8 [25] |  1.289 ns | 0.0050 ns | 0.0042 ns |

|     IsKnownContentType |             text/xml |  4.015 ns | 0.0069 ns | 0.0057 ns |
| IsKnownContentType_opt |             text/xml |  1.473 ns | 0.0022 ns | 0.0018 ns |

Optional optimizations

Compare 2/4 chars at once

In IsKnownContentType_opt I only compare single chars, but can check for free up to 4 chars at once (on 64bit) or 2 chars for AnyCPU. unaligned reads

Extend Roslyn's limitations on Jump-tables

The benchmark results ^ could be even more impressive if that switch over input's Length was emitted as single/solid jump-table by Roslyn and covered all the "missing" cases. It can be explained via this sharplab link, but it's a trade-off between IL size and perf.

Risks

The new implementation should always be faster but might affect IL size in some cases. However, in the case of my sample IsKnownContentType_opt is actually 25% smaller than the default one.

Related discussion

There were attempts to do it in the past, e.g. dotnet/csharplang#1881 (comment) and #43869
but I think if we won't do a full Trie and only try to do Length + single char (or 4 at once) and fallback to the default implementation if there are cases where we can't do single-char checks - it will be simple and always faster.

/cc @333fred @CyrusNajmabadi

@EgorBo EgorBo added Concept-API This issue involves adding, removing, clarification, or modification of an API. Feature Request labels Sep 14, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged Issues and PRs which have not yet been triaged by a lead label Sep 14, 2021
@jaredpar jaredpar added Area-Compilers and removed untriaged Issues and PRs which have not yet been triaged by a lead labels Sep 14, 2021
@jaredpar jaredpar added this to the Compiler.Next milestone Sep 14, 2021
@george-polevoy
Copy link

george-polevoy commented Sep 15, 2021

Some benchmark using 4 chars as a discriminator.
Strings "aaaaab, aaaaba, aaabaa, aabaaa" are chosen because they need a long discriminator among a relatively short string list.
This set of strings makes a good worst case scenario.

        [Benchmark(Baseline = true)]
        public bool SwitchOverString()
        {
            bool result;
            switch (_s)
            {
                // aabaaa 0x61006100610062 27303489359118434
                case "aabaaa":
                    result = true;
                    break;
                // aaabaa 0x61006100620061 27303489359183969
                case "aaabaa":
                    result = true;
                    break;
                // aaaaba 0x61006200610061 27303493654085729
                case "aaaaba":
                    result = true;
                    break;
                // aaaaab 0x62006100610061 27584964335829089
                case "aaaaab":
                    result = true;
                    break;
                default:
                    result = false;
                    break;
            }

            return result;
        }

        [Benchmark]
        public bool LongDiscriminator_MemoryMarshal()
        {
            string candidate = null;

            if (_s.Length != 6)
            {
                return false;
            }

            // Selects a contiguous region of chars as a long. The range [2..] is chosen because it contains significant
            // characters in all possible strings
            var discriminator = MemoryMarshal.Read<long>(MemoryMarshal.Cast<char, byte>(_s.AsSpan(2, 4)));

            switch (discriminator)
            {
                // aabaaa 0x61006100610062 27303489359118434
                case 0x61006100610062:
                    candidate = "aabaaa";
                    break;
                // aaabaa 0x61006100620061 27303489359183969
                case 0x61006100620061:
                    candidate = "aaabaa";
                    break;
                // aaaaba 0x61006200610061 27303493654085729
                case 0x61006200610061:
                    candidate = "aaaaba";
                    break;
                // aaaaab 0x62006100610061 27584964335829089
                case 0x62006100610061:
                    candidate = "aaaaab";
                    break;
            }

            return candidate == _s;
        }
|                          Method |      Mean |     Error |    StdDev | Ratio |
|-------------------------------- |----------:|----------:|----------:|------:|
|                SwitchOverString | 16.307 ns | 1.6814 ns | 0.0922 ns |  1.00 |
| LongDiscriminator_MemoryMarshal |  2.681 ns | 0.2040 ns | 0.0112 ns |  0.16 |

@GGG-KILLER
Copy link

I think it would be worthwhile implementing this for the pattern

return varName is "aaaaab" or "aaaaba" or "aaabaa" or "aabaaa";

as well

@GGG-KILLER
Copy link

GGG-KILLER commented Oct 9, 2021

Ok, so, I don't know if it'll be of any help (since I'm not really sure where this would go, I think it's in the lowering stage but I'm not sure) but I wrote a rewriter poc based on EgorBo's snippet.

This is the result of running it on IsKnownContentType:

bool IsKnownContentType(string contentTypeValue)
{
    {
        var __switchExpr__0 = contentTypeValue;
        string? __candidate__1 = null;
        int __destination__2 = -1;
        switch (__switchExpr__0.Length)
        {
            case 8:
                switch (__switchExpr__0[7])
                {
                    case 'l':
                        __candidate__1 = "text/xml";
                        __destination__2 = 0;
                        break;
                    case 's':
                        __candidate__1 = "text/css";
                        __destination__2 = 0;
                        break;
                    case 'v':
                        __candidate__1 = "text/csv";
                        __destination__2 = 0;
                        break;
                }

                break;
            case 9:
                switch (__switchExpr__0[6])
                {
                    case 'g':
                        __candidate__1 = "image/gif";
                        __destination__2 = 0;
                        break;
                    case 'p':
                        __candidate__1 = "image/png";
                        __destination__2 = 0;
                        break;
                    case 't':
                        __candidate__1 = "text/html";
                        __destination__2 = 0;
                        break;
                }

                break;
            case 10:
                switch (__switchExpr__0[0])
                {
                    case 't':
                        __candidate__1 = "text/plain";
                        __destination__2 = 0;
                        break;
                    case 'i':
                        __candidate__1 = "image/jpeg";
                        __destination__2 = 0;
                        break;
                }

                break;
            case 15:
                switch (__switchExpr__0[12])
                {
                    case 'p':
                        __candidate__1 = "application/pdf";
                        __destination__2 = 0;
                        break;
                    case 'x':
                        __candidate__1 = "application/xml";
                        __destination__2 = 0;
                        break;
                    case 'z':
                        __candidate__1 = "application/zip";
                        __destination__2 = 0;
                        break;
                }

                break;
            case 16:
                switch (__switchExpr__0[12])
                {
                    case 'g':
                        __candidate__1 = "application/grpc";
                        __destination__2 = 0;
                        break;
                    case 'j':
                        __candidate__1 = "application/json";
                        __destination__2 = 0;
                        break;
                }

                break;
            case 19:
                __candidate__1 = "multipart/form-data";
                __destination__2 = 0;
                break;
            case 22:
                __candidate__1 = "application/javascript";
                __destination__2 = 0;
                break;
            case 24:
                switch (__switchExpr__0[0])
                {
                    case 'a':
                        __candidate__1 = "application/octet-stream";
                        __destination__2 = 0;
                        break;
                    case 't':
                        __candidate__1 = "text/html; charset=utf-8";
                        __destination__2 = 0;
                        break;
                }

                break;
            case 25:
                __candidate__1 = "text/plain; charset=utf-8";
                __destination__2 = 0;
                break;
            case 31:
                __candidate__1 = "application/json; charset=utf-8";
                __destination__2 = 0;
                break;
            case 33:
                __candidate__1 = "application/x-www-form-urlencoded";
                __destination__2 = 0;
                break;
        }

        if (__destination__2 != -1 && (object.ReferenceEquals(__switchExpr__0, __candidate__1) || __switchExpr__0?.Equals(__candidate__1) is true))
        {
            switch (__destination__2)
            {
                case 0:
                    return true;
            }
        }
    }

    return false;
}

@grendello
Copy link

Hi,

I was just pointed to this issue by @akoeplinger as we were looking at the code (both IL and native asm) generated for the switch on string. I was dealing with a similar issue in Xamarin.Android recently (in native code), and I think the runtime
could benefit from the approach I took there.

Firstly, I use xxHash to hash the strings I need to match (64-bit and 32-bit for the respective word width) as it's both fast and has a very, very low collision ratio.

Secondly, I generate two tables - first contains the sorted hashes and the other contains the data I need to access for the given hash. To find a match, I do a binary search over the first table to get the index of that hash, and then I use that index to access the other table. This has the potential of being quite friendly to CPU caches, which is a good win for mobile platforms. The 2nd table might be problematic for Roslyn (but perhaps not?), however I think even an if-then-else performed on integers should be fast enough.

I tried several different approaches to binary search optimization, including loop unrolling and branchless versions, with mixed results. I eventually settled on the "naive" binary search because it delivered the best results across
the different architectures we support. It will (IIRC) perform log2(n)-1 comparisons and jumps, which is not too bad.

@CyrusNajmabadi
Copy link
Member

It will (IIRC) perform log2(n)-1 comparisons and jumps, which is not too bad.

i think this is what roslyn does as well once the number of cases is high enough. I see us emitting https://docs.microsoft.com/en-us/dotnet/api/system.reflection.emit.opcodes.bgt_un_s?view=net-6.0 and whatnot in our IL.

You can see the logic we have for this here:

            //  c)	After bucketing, generate code to perform a binary search on these buckets array, 
            //      emitting conditional jumps if current bucket sub-array has more than one bucket and
            //      emitting the switch instruction when we are down to a single bucket in the sub-array.

@grendello
Copy link

@CyrusNajmabadi we were looking at the generated code and it looks like unrolled binary search, but what's puzzling is this part:

if (num == 822911587 && s == "4")
{
    return true;
}

and also

else if (s == "5")
{
   return true;
}

Why always call the Equality operator if the hash matches? Is it because of the high collision ratio of the hash for similar strings?

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented May 19, 2022

Why always call the Equality operator if the hash matches?

Because hash collisions are possible.

Is it because of the high collision ratio of the hash for similar strings?

No. It's not about a high ratio. It's about a collision always being possible, so you have to validate the string is actually a match. A matching hash is never sufficient (same with hashtables, or any other system).

THe only times hashes are ok to use is only when you know all inputs (not all values those inputs are being checked against) and you can then design a perfect hash mapping those fixed input values to your hashing domain. That's not what we have here, so the equality check is always necessary.

@grendello
Copy link

grendello commented May 19, 2022

I understand the need for validation, even though it does make the code slower than it needs to be, but how about using xxHash? It's got a really good ratio for small strings.
Given that we know the size of input strings, could the code be modified (assuming we use xxHash) to verify the string when collisions are possible? As you can see from the tables above, xxHash has 0 collisions for strings with lengths between 8 and 1024 bytes. String validation could also be performed only if strings are equal in length.

@EgorBo
Copy link
Member Author

EgorBo commented May 19, 2022

@grendello btw, CoreCLR optimizes (unrolls) x == "str" for small strings (up to 32 chars afair) so it's almost free - e.g.
using SWAR or SIMD (on both x86 and arm)

@EgorBo
Copy link
Member Author

EgorBo commented May 19, 2022

it shouldn't be difficult to do the same with Mono-LLVM, it's just x == "str" should be lowered down to memcmp LLVM intrinsic and then LLVM will unroll it too

@CyrusNajmabadi
Copy link
Member

I understand the need for validation, even though it does make the code slower than it needs to be,

I'm not sure what you mean. Validation must be done. There is no way to avoid that. So the validation doesn't make the code slower than necessary. It's mandatory to ensure that the results are correct.

@CyrusNajmabadi
Copy link
Member

String validation could also be performed only if strings are equal in length.

This is already done. == already will validate string length as the very first thing it does to avoid any O(n) work.

@CyrusNajmabadi
Copy link
Member

object.ReferenceEquals(__switchExpr__0, __candidate__1)

This is not necessary. String equality already does this check.

@EgorBo
Copy link
Member Author

EgorBo commented May 19, 2022

For reference:

bool Test2(string s) => s == "Access-Control-Allow-Credentials";

emits:

; Method Tests:Test2(System.String):bool
       vzeroupper 
       test     rcx, rcx  ;;  <-- is null?
       je       SHORT G_M48330_IG06
       cmp      dword ptr [rcx+8], 32  ;;  <-- Length == 32
       jne      SHORT G_M48330_IG05
       vmovdqu  ymm0, ymmword ptr[rcx+12]
       vpxor    ymm0, ymm0, ymmword ptr[reloc @RWD00]
       vmovdqu  ymm1, ymmword ptr[rcx+44]
       vpxor    ymm1, ymm1, ymmword ptr[reloc @RWD32]
       vpor     ymm0, ymm0, ymm1
       vptest   ymm0, ymm0
       sete     al
       movzx    rax, al
       jmp      SHORT G_M48330_IG07
G_M48330_IG05:
       xor      eax, eax
       jmp      SHORT G_M48330_IG07
G_M48330_IG06:
       xor      eax, eax
G_M48330_IG07:
       movzx    rax, al
       vzeroupper 
       ret

For smaller string the codegen is much smaller (e.g. for 4 chars string we can use a single 64bit long constant)

dotnet/runtime#65288, dotnet/runtime#66095, dotnet/runtime#65288

@grendello
Copy link

I understand the need for validation, even though it does make the code slower than it needs to be,

I'm not sure what you mean. Validation must be done. There is no way to avoid that. So the validation doesn't make the code slower than necessary. It's mandatory to ensure that the results are correct.

If the chosen hash function has collision ratio of 0 collisions for a given domain (length), then I think it could be done without risking?

@grendello
Copy link

@EgorBo it would be nice if Mono generated similar code :)

@CyrusNajmabadi
Copy link
Member

If the chosen hash function has collision ratio of 0 collisions for a given domain (length), then I think it could be done without risking?

I'm highly skeptical this will matter in practice. It's only for 8 bytes or less, or 4 characters. How often are you hashing those such that this matters. If the strings are that small, then the equality check is going to be blazingly fast anyways.

@canton7
Copy link
Contributor

canton7 commented May 19, 2022

If the chosen hash function has collision ratio of 0 collisions for a given domain (length), then I think it could be done without risking?

Maybe I'm missing something, but surely that's fundamentally impossible for any scenario where the string has more bits than the hash? You necessarily have collisions in that case, as there are only 2^64 possible hashes, but there are 2^64 possible strings of length 4 (give or take, not all bytes sequences are valid UTF-16, etc). Therefore you must get collisions when you're dealing with any strings > 4 chars.

@CyrusNajmabadi
Copy link
Member

Maybe I'm missing something, but surely that's fundamentally impossible for any scenario where the string has more bits than the hash?

@canton7 That's why tehy're specifying for "a given domain (length)" . xx is bijective on <=8 bytes of data:

image

@canton7
Copy link
Contributor

canton7 commented May 19, 2022

Indeed, I thought that was clear to OP, but it seems perhaps not.

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented May 19, 2022

@canton7 in this case, it's producing a 64bit (System.Int64 in .net) hash. So it's saying that it perfectly maps any 64bit input to a distinct 64bit output.

@canton7
Copy link
Contributor

canton7 commented May 19, 2022

And I'll repeat myself -- you can only guarantee that there aren't collisions where the number of bits in the input is <= the number of bits in the hash (so 8 chars for a 64 bit hash, 16 chars for a 128 bit hash), and at that point the string equality check is so trivially cheap that it's not really worth eliding anyway.

@CyrusNajmabadi
Copy link
Member

At htis point, i'm basically backing out the discussion. I think if this can be done safely and sanely, with a reasonable dependency on some battle tested piece of code, and it itself is tested extensively, it's fine. THis is all an internal impl detail, so as long as it's faster and safe, i'm ok with it.

@grendello
Copy link

Monstrosities like this could take advantage of the small string optimizations. Also, I think you'd be surprised how much "trivial" optimizations like this can improve performance on mobile platforms (especially the underpowered ones). However, I am too backing out of the discussion, I think all points have been made. Thanks!

@GGG-KILLER
Copy link

GGG-KILLER commented May 19, 2022

@stephentoub might also have some interest here.

I do :) We should experiment with an approach like @EgorBo proposes... we manually do such things in a variety of hot paths, and it'd be nice to a) not have to do so and b) extend the wins to other places as well. e.g. https://github.com/dotnet/runtime/blob/main/src/libraries/System.Net.Http/src/System/Net/Http/Headers/KnownHeaders.cs#L161-L414 Another approach that can be considered is to first switch over the length and then proceed with Roslyn's current scheme within each length bucket. ASP.NET has a code generator that essentially does that: https://github.com/dotnet/aspnetcore/blob/8b30d862de6c9146f466061d51aa3f1414ee2337/src/Shared/HttpSys/RequestProcessing/RequestHeaders.Generated.cs#L1170-L1704

If it interests anyone I have also made a source generator helper for doing what was suggested in the original issue (although obviously not production-ready as it wasn't really meant for anyone other than me to use/modify it, but it works): https://github.com/LorettaDevs/Loretta/blob/2f213c2afc49bcba1de6cba677d8c372bd188079/src/Tools/Loretta.Generators.Shared/OptimizedSwitch.cs

I'd like to help to implement this into Roslyn but no idea where to start.

@stephentoub
Copy link
Member

Out of sheer curiosity, I checked what the C# compiler will output if the switch is rewritten as cases of list pattern matching:
SharpLab
e.g. instead of the original example:

    public bool IsKnownContentType1(string contentTypeValue)
    {
        switch (contentTypeValue)
        {
            case "text/xml":
            case "text/css":
            case "text/csv":
            case "image/gif":
            case "image/png":
            case "text/html":
            case "text/plain":
            case "image/jpeg":
            case "application/pdf":
            case "application/xml":
            case "application/zip":
            case "application/grpc":
            case "application/json":
            case "multipart/form-data":
            case "application/javascript":
            case "application/octet-stream":
            case "text/html; charset=utf-8":
            case "text/plain; charset=utf-8":
            case "application/json; charset=utf-8":
            case "application/x-www-form-urlencoded":
                return true;
        }
        return false;
    }

doing:

    public bool IsKnownContentType2(string contentTypeValue)
    {
        switch (contentTypeValue)
        {
            case ['t', 'e', 'x', 't', '/', 'x', 'm', 'l']:
            case ['t', 'e', 'x', 't', '/', 'c', 's', 's']:
            case ['t', 'e', 'x', 't', '/', 'c', 's', 'v']:
            case ['i', 'm', 'a', 'g', 'e', '/', 'g', 'i', 'f']:
            case ['i', 'm', 'a', 'g', 'e', '/', 'p', 'n', 'g']:
            case ['t', 'e', 'x', 't', '/', 'h', 't', 'm', 'l']:
            case ['t', 'e', 'x', 't', '/', 'p', 'l', 'a', 'i', 'n']:
            case ['i', 'm', 'a', 'g', 'e', '/', 'j', 'p', 'e', 'g']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'p', 'd', 'f']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'x', 'm', 'l']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'z', 'i', 'p']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'g', 'r', 'p', 'c']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'j', 's', 'o', 'n']:
            case ['m', 'u', 'l', 't', 'i', 'p', 'a', 'r', 't', '/', 'f', 'o', 'r', 'm', '-', 'd', 'a', 't', 'a']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'j', 'a', 'v', 'a', 's', 'c', 'r', 'i', 'p', 't']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'o', 'c', 't', 'e', 't', '-', 's', 't', 'r', 'e', 'a', 'm']:
            case ['t', 'e', 'x', 't', '/', 'h', 't', 'm', 'l', ';', ' ', 'c', 'h', 'a', 'r', 's', 'e', 't', '=', 'u', 't', 'f', '-', '8']:
            case ['t', 'e', 'x', 't', '/', 'p', 'l', 'a', 'i', 'n', ';', ' ', 'c', 'h', 'a', 'r', 's', 'e', 't', '=', 'u', 't', 'f', '-', '8']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'j', 's', 'o', 'n', ';', ' ', 'c', 'h', 'a', 'r', 's', 'e', 't', '=', 'u', 't', 'f', '-', '8']:
            case ['a', 'p', 'p', 'l', 'i', 'c', 'a', 't', 'i', 'o', 'n', '/', 'x', '-', 'w', 'w', 'w', '-', 'f', 'o', 'r', 'm', '-', 'u', 'r', 'l', 'e', 'n', 'c', 'o', 'd', 'e', 'd']:
                return true;
        }
        return false;
    }

The result isn't ideal, with multiple opportunities for improvement (e.g. emitting code in a way that the JIT's optimizations around unrolling will kick in, selecting better elements for comparison, etc.), but it's interesting that it shares at least some features in common with the proposed approach.

@alrz
Copy link
Member

alrz commented Sep 16, 2022

I think that'll emit better code having #29033 in place.

emitting code in a way that the JIT's optimizations around unrolling will kick in

What does that refer to?

@stephentoub
Copy link
Member

What does that refer to?

Just that the JIT today will generate better code for e.g. value.StartsWith("abcdefgh", StringComparison.Ordinal) than it would for value.Length >= 8 && value[0] == 'a' && value[1] == 'b' && value[2] == 'c' && value[3] == 'd' && value[4] == 'e' && value[5] == 'f' && value[6] == 'g' && value[7] == 'h'.

@jcouv
Copy link
Member

jcouv commented Dec 21, 2022

I have a PR out for this.
From my experimentation so far, the optimization seems beneficial (ie. significant speed improvement to normal case, little to no degradation in worst case, comparable IL size) when there are more than 7 cases (otherwise linear sequence of string comparisons is still faster) and the buckets that result from Length+char checks have 5 or fewer cases. We'll likely refine this criteria after testing on more platforms/environments.
From the examples collected so far (thanks @stephentoub) many switches in the BCL will benefit. But the ones that do case-insensitive comparison will still need to be hand-optimized.

@iSazonov
Copy link

I don't think this has been discussed. We are now doing optimization based on known static information from the switch. The compiler knows nothing about the peculiarities of the incoming data itself. If we could tell the compiler more information about the input data in some way, it could make the optimization better.
For example, we could specify a hot path. (If we know GET is 80% of HTTP requests.)

@EgorBo
Copy link
Member Author

EgorBo commented Dec 22, 2022

I don't think this has been discussed. We are now doing optimization based on known static information from the switch. The compiler knows nothing about the peculiarities of the incoming data itself. If we could tell the compiler more information about the input data in some way, it could make the optimization better. For example, we could specify a hot path. (If we know GET is 80% of HTTP requests.)

That's what PGO in JIT is supposed to handle

@alrz
Copy link
Member

alrz commented Dec 22, 2022

Why not using another character dispatch if the final switch is going to fallback to hashcode?

Alternatively, since the length is known could read N bytes as an IntM and switch on that? though I suppose endianness is going to be an issue.

@jcouv
Copy link
Member

jcouv commented Dec 23, 2022

Why not using another character dispatch if the final switch is going to fallback to hashcode?

Experiments with trie design yielded larger IL and worse performance. The performance could probably be improved (avoid checking length multiple times for example), but the IL size probably not.

could read N bytes as an IntM

Yes, this was mentioned in the bug, but I have not experimented with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - In Review A fix for the issue is submitted for review. Area-Compilers Concept-API This issue involves adding, removing, clarification, or modification of an API. Feature Request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.