Add fast mb_strcut implementation for UTF-8 #12337
Conversation
Maybe @iluuu1994 might have some comment?
I found different behavior. Is this expected? Here is the behavior on currently available versions: https://3v4l.org/edcf7
Seems reasonable? One thing which might make this easier to check would be to put the new tests in a prior commit exercising the current behaviour, and then add the feature as a follow-up commit.
Might want the opinions of @youkidearitai though.
Hi, @youkidearitai. Yes, this is expected: if you look again at the commit message, it mentions that the new code has identical results on valid UTF-8 strings, but that input is an invalid UTF-8 string. The explanation is exactly as written above in the description of this PR.
@youkidearitai BTW, I find it very impressive how you are (usually) able to discover interesting test cases for new code.
Good idea, I can do that.
@alexdowad Thank you for your reply. 3v4l: https://3v4l.org/22jKX — should this be returned as
@youkidearitai Thanks... I have just run your test cases in a debugger to see what is happening. The existing implementation of

UTF-8 is one of the text encodings with

This means that the existing implementation can handle bytes like

It is not possible to perfectly emulate the existing behavior (on invalid strings) without running through the entire string from the start. I could follow it a bit more closely, while still maintaining O(1) runtime, but there will always be corner cases where the behavior on invalid strings would differ. Please mention if you have more thoughts about this.
The old implementation runs through the entire string to pick out the part which should be returned by mb_strcut. This creates significant performance overhead. The new specialized implementation of mb_strcut for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.)

At the same time, I also added many more unit tests for mb_strcut. This will help to avoid unintended behavior changes as the function undergoes further performance work.

The new implementation behaves identically to the old one on VALID UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: in some cases, the old implementation will pass invalid byte sequences through unchanged, while in others it will remove them. The new implementation has behavior which is perhaps slightly more predictable: it simply backs up the starting and ending cut points to the preceding "starter byte" (one which is not a UTF-8 continuation byte).
Just updated the commit message and UPGRADING message to make them more accurate, based on what I learned from analyzing the test cases provided by @youkidearitai.
I understand about UPGRADING. By the way, personally, I think it would be ideal if, when an invalid UTF-8 byte sequence appears, it were converted to another byte (ex:
Looks good from my side!
Regarding this matter, when I investigated
I have split the new test cases into a separate commit as advised by @Girgias. |
Landed on
The old implementation decodes the entire string to pick out the part which should be returned by `mb_strcut`. This creates significant performance overhead. The new specialized implementation of `mb_strcut` for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.)

At the same time, I also added many more unit tests for mb_strcut. This will help to avoid unintended behavior changes as the function undergoes further performance work.

The new implementation behaves identically to the old one on valid UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: the old implementation would convert invalid UTF-8 byte sequences to error markers (`'?'`), but the new implementation just cuts a subsequence of bytes out of the source string without performing any conversion on it.

Any comments will be much appreciated! @Girgias @cmb69 @youkidearitai @kamil-tekiela